Confusion Matrix Visualizer
Paste predictions (actual, predicted) or enter a matrix directly. The tool renders a color-coded heatmap and calculates overall accuracy plus precision, recall, and F1 score per class, with macro and weighted averages.
Matrix (comma-separated rows)
Class Names (comma-separated)
One name per class, matching the number of rows/columns in the matrix. If omitted, classes are auto-labeled Class 0, Class 1…
Confusion Matrix Heatmap
Overall Accuracy
87.00%
87 correct
100 total
Per-Class Metrics
| Class | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| spam | 0.873 | 0.917 | 0.894 | 60 |
| ham | 0.865 | 0.800 | 0.831 | 40 |
| Macro avg | 0.869 | 0.858 | 0.863 | — |
| Weighted avg | 0.870 | 0.870 | 0.869 | — |
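The numbers in this table can be recomputed directly from the raw counts. A minimal NumPy sketch, assuming the hypothetical 2×2 matrix (55, 5 / 8, 32) that reproduces the example figures above:

```python
import numpy as np

# Hypothetical matrix matching the table: rows = actual, cols = predicted,
# class order: spam, ham
cm = np.array([[55, 5],    # 55 spam caught correctly, 5 missed as ham
               [8, 32]])   # 8 ham wrongly flagged as spam, 32 correct

tp = np.diag(cm).astype(float)      # correct predictions per class
support = cm.sum(axis=1)            # actual samples per class (row sums)
precision = tp / cm.sum(axis=0)     # correct / times predicted (column sums)
recall = tp / support               # correct / times actually present
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()                        # 87 / 100 = 0.87
macro_f1 = f1.mean()                                  # unweighted mean
weighted_f1 = (f1 * support).sum() / support.sum()    # support-weighted
```

Rounding to three decimals reproduces the table: precision 0.873/0.865, recall 0.917/0.800, F1 0.894/0.831, macro F1 0.863, weighted F1 0.869.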
How a Confusion Matrix Works
A confusion matrix places actual classes on the rows and predicted classes on the columns. Each cell (i, j) counts how many samples of true class i were predicted as class j. Diagonal cells represent correct predictions. Off-diagonal cells reveal the specific pattern of errors — which classes the model is confusing for which.
For a binary classifier with classes Positive and Negative, the four cells have classic names: True Positive (top-left), False Negative (top-right), False Positive (bottom-left), True Negative (bottom-right). All classification metrics — accuracy, precision, recall, F1, specificity — are derived from these four counts.
| Metric | Formula | Interpretation |
|---|---|---|
| Accuracy | (TP + TN) / Total | Overall fraction of correct predictions |
| Precision | TP / (TP + FP) | Of all predicted positives, how many are truly positive |
| Recall | TP / (TP + FN) | Of all actual positives, how many did the model find |
| F1 Score | 2 × P × R / (P + R) | Harmonic mean of precision and recall |
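As a concrete instance of these four formulas, here is a small sketch using made-up counts in the binary layout described above (Positive class first):

```python
# Made-up counts: rows = actual, cols = predicted
cm = [[40, 10],   # [TP, FN]  actual Positive
      [5, 45]]    # [FP, TN]  actual Negative

tp, fn = cm[0]
fp, tn = cm[1]
total = tp + fn + fp + tn

accuracy = (tp + tn) / total                # (40 + 45) / 100 = 0.85
precision = tp / (tp + fp)                  # 40 / 45
recall = tp / (tp + fn)                     # 40 / 50 = 0.8
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                # 45 / 50 = 0.9
```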
Generating a Confusion Matrix with scikit-learn
If you have model predictions as Python arrays, use scikit-learn to generate the matrix and export it as CSV for this tool.
From sklearn predictions
```python
from sklearn.metrics import confusion_matrix
import pandas as pd

y_true = ["cat", "dog", "cat", "bird", "dog"]
y_pred = ["cat", "cat", "cat", "bird", "dog"]
classes = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=classes)

# Print matrix CSV (paste into Matrix Input tab)
for row in cm:
    print(",".join(map(str, row)))

# Print class names (paste into Class Names field)
print(",".join(classes))

# Or get predictions CSV (paste into Predictions Input tab)
df = pd.DataFrame({"actual": y_true, "predicted": y_pred})
print(df.to_csv(index=False))
```

When to Use Precision vs Recall vs F1
The right metric depends on the cost structure of your problem: in most real-world ML deployments, false positives and false negatives carry different costs, so choose the metric that penalizes the more expensive error.
Optimize Precision
False positives are expensive
Examples: Spam filter, fraud detection, ad targeting
A false alarm wastes resources or harms user trust.
Optimize Recall
False negatives are dangerous
Examples: Cancer screening, defect detection, safety alerts
Missing a true positive can have severe consequences.
Optimize F1
Need a single balanced metric
Examples: NLP tasks, leaderboard comparisons, benchmarking
Use macro F1 for class-imbalanced datasets.
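The imbalance point is easy to demonstrate with sklearn's f1_score. In this hypothetical toy case the model never predicts the minority class, yet the weighted score still looks respectable while the macro score exposes the failure:

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: 8 of class "a", 2 of class "b"
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10            # the model never predicts "b"

weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
# weighted ≈ 0.71 looks decent; macro ≈ 0.44 reveals the missed class
```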
Frequently Asked Questions
What is a confusion matrix?
A confusion matrix is a table that summarizes the performance of a classification model. Each row represents the actual (true) class and each column represents the predicted class. The diagonal cells contain correct predictions (true positives for each class); off-diagonal cells represent errors (false positives and false negatives). By reading the matrix you can immediately see not just how many predictions were wrong, but which classes are being confused with which.
What is the difference between precision, recall, and F1 score?
Precision answers 'of all items the model labeled as class X, how many actually were X?' It drops when the model produces false positives. Recall answers 'of all actual class X items, how many did the model correctly find?' It drops when the model produces false negatives. F1 score is the harmonic mean of precision and recall — it balances both. Use precision when false positives are costly (e.g., spam detection). Use recall when false negatives are costly (e.g., cancer screening). Use F1 when you need a single balanced metric.
What is the difference between macro and weighted average?
Macro average computes the metric independently for each class and then takes an unweighted mean — every class contributes equally regardless of how many samples it has. This highlights underperformance on minority classes. Weighted average weights each class's metric by the number of true instances of that class (its support), so majority classes dominate the average. Use macro average to diagnose class imbalance issues; use weighted average when you care more about overall prediction volume.
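The gap between the two averages shows up clearly with made-up per-class scores. A quick sketch, assuming three hypothetical classes where the small minority class underperforms:

```python
# Hypothetical per-class F1 and support for a 3-class problem
f1      = [0.95, 0.90, 0.40]   # minority class does poorly
support = [500, 450, 50]

macro_f1 = sum(f1) / len(f1)   # every class counts equally -> 0.75
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)
# majority classes dominate -> 0.90, masking the weak minority class
```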
How do I use my model's predictions with this tool?
Switch to the Predictions Input tab and paste two columns: actual,predicted — one pair per line. You can include an optional header row (actual,predicted) which will be skipped automatically. Class names are discovered from the data and the confusion matrix is built programmatically. Alternatively, if you already have a computed matrix (e.g., from sklearn.metrics.confusion_matrix), switch to Matrix Input and paste the N×N comma-separated values along with class names.
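A rough sketch of this counting step (an assumed reconstruction of the tab's behavior, not the tool's actual source), using the same toy data as the sklearn example above:

```python
# Parse "actual,predicted" pairs and tally them into an N x N matrix
text = """actual,predicted
cat,cat
dog,cat
cat,cat
bird,bird
dog,dog"""

rows = [line.split(",") for line in text.splitlines()]
if rows[0] == ["actual", "predicted"]:   # skip the optional header row
    rows = rows[1:]

classes = sorted({c for pair in rows for c in pair})
index = {c: i for i, c in enumerate(classes)}
cm = [[0] * len(classes) for _ in classes]
for actual, predicted in rows:
    cm[index[actual]][index[predicted]] += 1   # row = actual, col = predicted

print(classes)  # ['bird', 'cat', 'dog']
print(cm)       # [[1, 0, 0], [0, 2, 0], [0, 1, 1]]
```

This matches what `sklearn.metrics.confusion_matrix` returns for the same labels with `labels=classes`.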
What is a good F1 score for a classification model?
This depends heavily on the task and class balance. For balanced binary classification, F1 above 0.85 is generally considered strong. For multi-class problems with imbalanced classes, F1 above 0.75 (macro average) is often acceptable. Medical or safety-critical applications demand F1 near or above 0.95. For context: random guessing on a balanced binary problem gives F1 around 0.5. This tool color-codes F1: green for above 0.8, yellow for 0.6–0.8, and red for below 0.6.