Confusion Matrix Visualizer

Paste predictions (actual, predicted) or enter a matrix directly. The tool renders a color-coded heatmap and calculates accuracy, per-class precision, recall, and F1 score, plus macro and weighted averages.

Matrix (comma-separated rows)

Class Names (comma-separated)

One name per class, matching the number of rows/columns in the matrix. If omitted, classes are auto-labeled Class 0, Class 1…

Confusion Matrix Heatmap

(Heatmap rendered in the tool; rows are actual classes, columns are predicted classes.)

                Predicted: spam   Predicted: ham
Actual: spam    55                5
Actual: ham     8                 32

Legend: Correct (diagonal), Misclassified (off-diagonal)

Overall Accuracy

87.00% (87 correct out of 100 total)

Per-Class Metrics

Class         Precision  Recall  F1 Score  Support
spam          0.873      0.917   0.894     60
ham           0.865      0.800   0.831     40
Macro avg     0.869      0.858   0.863
Weighted avg  0.870      0.870   0.869

F1 > 0.8: Good
F1 > 0.6: Moderate
F1 ≤ 0.6: Needs improvement

How a Confusion Matrix Works

A confusion matrix places actual classes on the rows and predicted classes on the columns. Each cell (i, j) counts how many samples of true class i were predicted as class j. Diagonal cells represent correct predictions. Off-diagonal cells reveal the specific pattern of errors — which classes the model is confusing for which.

For a binary classifier with classes Positive and Negative, the four cells have classic names: True Positive (top-left), False Negative (top-right), False Positive (bottom-left), True Negative (bottom-right). All classification metrics — accuracy, precision, recall, F1, specificity — are derived from these four counts.

Metric     Formula              Interpretation
Accuracy   (TP + TN) / Total    Overall fraction of correct predictions
Precision  TP / (TP + FP)       Of all predicted positives, how many are truly positive
Recall     TP / (TP + FN)       Of all actual positives, how many did the model find
F1 Score   2 × P × R / (P + R)  Harmonic mean of precision and recall
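These formulas can be checked with a few lines of Python. The counts below come from the binary spam/ham example above, treating spam as the positive class (TP = 55, FN = 5, FP = 8, TN = 32):

```python
# Verify the metric formulas against the spam/ham example matrix,
# with spam as the positive class.
TP, FN, FP, TN = 55, 5, 8, 32

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# → accuracy=0.87 precision=0.873 recall=0.917 f1=0.894
```

The results match the Per-Class Metrics table for the spam row and the 87.00% overall accuracy.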

Generating a Confusion Matrix with scikit-learn

If you have model predictions as Python arrays, use scikit-learn to generate the matrix and export it as CSV for this tool.

From sklearn predictions

from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd

y_true = ["cat", "dog", "cat", "bird", "dog"]
y_pred = ["cat", "cat", "cat", "bird", "dog"]

classes = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=classes)

# Print matrix CSV (paste into Matrix Input tab)
for row in cm:
    print(",".join(map(str, row)))

# Print class names (paste into Class Names field)
print(",".join(classes))

# Or get predictions CSV (paste into Predictions Input tab)
df = pd.DataFrame({"actual": y_true, "predicted": y_pred})
print(df.to_csv(index=False))

When to Use Precision vs Recall vs F1

The right metric depends on the cost structure of your problem. In most real-world ML deployments, false positives and false negatives have different consequences, so choose the metric whose failure mode you can least afford.

Optimize Precision

False positives are expensive

Examples: Spam filter, fraud detection, ad targeting

A false alarm wastes resources or harms user trust.

Optimize Recall

False negatives are dangerous

Examples: Cancer screening, defect detection, safety alerts

Missing a true positive can have severe consequences.

Optimize F1

Need a single balanced metric

Examples: NLP tasks, leaderboard comparisons, benchmarking

Use macro F1 for class-imbalanced datasets.

Frequently Asked Questions

What is a confusion matrix?

A confusion matrix is a table that summarizes the performance of a classification model. Each row represents the actual (true) class and each column represents the predicted class. The diagonal cells contain correct predictions (true positives for each class); off-diagonal cells represent errors (false positives and false negatives). By reading the matrix you can immediately see not just how many predictions were wrong, but which classes are being confused with which.

What is the difference between precision, recall, and F1 score?

Precision answers 'of all items the model labeled as class X, how many actually were X?' It drops when the model produces false positives. Recall answers 'of all actual class X items, how many did the model correctly find?' It drops when the model produces false negatives. F1 score is the harmonic mean of precision and recall, balancing both. Use precision when false positives are costly (e.g., spam detection). Use recall when false negatives are costly (e.g., cancer screening). Use F1 when you need a single balanced metric.

What is the difference between macro and weighted average?

Macro average computes the metric independently for each class and then takes an unweighted mean — every class contributes equally regardless of how many samples it has. This highlights underperformance on minority classes. Weighted average weights each class's metric by the number of true instances of that class (its support), so majority classes dominate the average. Use macro average to diagnose class imbalance issues; use weighted average when you care more about overall prediction volume.
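To see the difference concretely, here is a small sketch using scikit-learn's f1_score; the labels are made up for illustration. A degenerate model that always predicts the majority class looks acceptable under the weighted average but is exposed by the macro average:

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced dataset: 8 samples of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
# A degenerate model that always predicts the majority class.
y_pred = ["a"] * 10

# Class "a": P = 0.8, R = 1.0, F1 ≈ 0.889. Class "b" is never predicted, so F1 = 0.
print(f1_score(y_true, y_pred, average="macro"))     # (0.889 + 0) / 2 ≈ 0.444
print(f1_score(y_true, y_pred, average="weighted"))  # 0.8 × 0.889 + 0.2 × 0 ≈ 0.711
```

(scikit-learn may warn that precision is undefined for the never-predicted class; it counts that class's F1 as 0.)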

How do I use my model's predictions with this tool?

Switch to the Predictions Input tab and paste two columns: actual,predicted — one pair per line. You can include an optional header row (actual,predicted) which will be skipped automatically. Class names are discovered from the data and the confusion matrix is built programmatically. Alternatively, if you already have a computed matrix (e.g., from sklearn.metrics.confusion_matrix), switch to Matrix Input and paste the N×N comma-separated values along with class names.
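The tally the Predictions Input tab performs can be sketched in a few lines of Python. This is an assumed reimplementation for illustration, not the tool's actual source:

```python
# Parse "actual,predicted" lines, discover class names, and tally the matrix.
lines = """actual,predicted
cat,cat
dog,cat
cat,cat
bird,bird
dog,dog""".splitlines()

pairs = [line.split(",") for line in lines[1:]]  # skip the header row
classes = sorted({label for pair in pairs for label in pair})
index = {name: i for i, name in enumerate(classes)}

matrix = [[0] * len(classes) for _ in classes]
for actual, predicted in pairs:
    matrix[index[actual]][index[predicted]] += 1  # rows = actual, cols = predicted

for row in matrix:
    print(",".join(map(str, row)))
```

The printed rows are exactly the format the Matrix Input tab expects: one comma-separated row per line.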

What is a good F1 score for a classification model?

This depends heavily on the task and class balance. For balanced binary classification, F1 above 0.85 is generally considered strong. For multi-class problems with imbalanced classes, F1 above 0.75 (macro average) is often acceptable. Medical or safety-critical applications demand F1 near or above 0.95. For context: random guessing on a balanced binary problem gives F1 around 0.5. This tool color-codes F1: green for above 0.8, yellow for 0.6–0.8, and red for below 0.6.