Confusion Matrix Calculator

Compute accuracy, precision, recall, F1, MCC, kappa, and 15+ classification metrics from TP/FP/FN/TN values with visual matrix and performance bars.

Confusion Matrix

            Predicted +   Predicted −   Total
Actual +    90            5             95
Actual −    10            895           905
Total       100           900           1000

Metric                 Value     Formula/Description
Accuracy               98.50%    (TP + TN) / Total
Precision (PPV)        90.00%    TP / (TP + FP)
Recall (Sensitivity)   94.74%    TP / (TP + FN)
Specificity (TNR)      98.90%    TN / (TN + FP)
F1 Score               0.9231    2 × Precision × Recall / (P + R)
MCC                    0.9151    Matthews Correlation Coefficient

Extended Metrics

Metric                 Value     Formula/Description
Balanced Accuracy      96.82%    (Sensitivity + Specificity) / 2
Cohen's Kappa (κ)      0.9148    Agreement beyond chance
Youden's J             0.9363    Sensitivity + Specificity − 1
False Positive Rate    1.10%     FP / (FP + TN)
False Negative Rate    5.26%     FN / (TP + FN)
NPV                    99.44%    TN / (TN + FN)
Prevalence             9.50%     (TP + FN) / Total
LR+                    85.74     Sensitivity / FPR
LR−                    0.0532    FNR / Specificity
F0.5 Score             0.9091    Weights precision higher
F2 Score               0.9375    Weights recall higher
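
The extended metrics can be reproduced with a short Python sketch. The page gives only descriptions for Cohen's kappa and the F-beta scores, so the formulas used for those here are the standard textbook definitions, not necessarily the calculator's own code. The function name `extended_metrics` and the dictionary keys are illustrative, and the sketch assumes every denominator is nonzero.

```python
def extended_metrics(tp, fp, fn, tn):
    """Extended metrics from the four counts (assumes no zero denominators)."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)
    fnr = fn / (tp + fn)

    # Cohen's kappa: observed agreement compared with chance agreement,
    # where chance agreement comes from the row and column totals.
    observed = (tp + tn) / total
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)

    def f_beta(beta):
        # beta < 1 weights precision higher, beta > 1 weights recall higher.
        b2 = beta ** 2
        return (1 + b2) * precision * sensitivity / (b2 * precision + sensitivity)

    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
        "youden_j": sensitivity + specificity - 1,
        "fpr": fpr,
        "fnr": fnr,
        "npv": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "lr_plus": sensitivity / fpr,
        "lr_minus": fnr / specificity,
        "f0_5": f_beta(0.5),
        "f2": f_beta(2.0),
    }


extended = extended_metrics(tp=90, fp=10, fn=5, tn=895)
for name, value in extended.items():
    print(f"{name:18s} {value:.4f}")
```

With TP = 90, FP = 10, FN = 5, TN = 895, the printed values should agree with the table above after rounding.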

Performance Bars

Accuracy      98.5%
Precision     90.0%
Recall        94.7%
Specificity   98.9%
F1 Score      92.3%

Planning notes, formulas, and examples

About the Confusion Matrix Calculator

A confusion matrix is the foundational tool for evaluating binary classification models. It breaks down predictions into four categories: True Positives (correctly identified positives), False Positives (incorrectly flagged as positive), False Negatives (missed positives), and True Negatives (correctly identified negatives). From these four numbers, a wealth of performance metrics can be derived.

This calculator computes over 15 classification metrics including accuracy, precision, recall (sensitivity), specificity, F1 score, Matthews Correlation Coefficient (MCC), Cohen's kappa, balanced accuracy, Youden's J index, likelihood ratios, F0.5, F2, and more. Each metric captures a different aspect of classifier performance, and no single metric tells the whole story.

Understanding when to prioritize which metric is crucial. In medical diagnosis, you want high recall (don't miss sick patients) even at the cost of some false positives. In spam filtering, high precision matters (don't send real email to spam). The presets demonstrate common scenarios across medical testing, spam filtering, fraud detection, and image recognition.

When This Page Helps

Evaluating classification models requires more than just accuracy. This calculator gives a comprehensive dashboard of 15+ metrics from four simple inputs, saving time and reducing calculation errors. The visual confusion matrix and performance bars make it easy to spot strengths and weaknesses at a glance.

It's invaluable for machine learning practitioners, medical researchers evaluating diagnostic tests, students learning classification evaluation, and anyone who needs to communicate model performance clearly. The presets covering different domains help build intuition about which metrics matter in which context.

How to Use the Inputs

  1. Enter the four confusion matrix values: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN).
  2. Use presets for common scenarios like medical testing, spam filtering, or fraud detection.
  3. Review the color-coded confusion matrix visualization showing the four quadrants.
  4. Examine the primary output cards: accuracy, precision, recall, specificity, F1, and MCC.
  5. Check the extended metrics table for comprehensive evaluation including kappa and likelihood ratios.
  6. Compare metrics visually using the performance bars at the bottom.

Formula used

Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 = 2 × Precision × Recall / (Precision + Recall)
MCC = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
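
As a minimal sketch, the six formulas translate directly into Python. The names `ratio` and `core_metrics` are illustrative. Returning NaN when a denominator is zero is an assumption made here; the page does not say how the calculator handles that case.

```python
import math


def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def core_metrics(tp, fp, fn, tn):
    """Compute the six headline metrics from the four confusion matrix counts."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


metrics = core_metrics(tp=90, fp=10, fn=5, tn=895)
for name, value in metrics.items():
    print(f"{name:12s} {value:.4f}")
```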

Example Calculation

Inputs: TP = 90, FP = 10, FN = 5, TN = 895

Result: Accuracy: 98.5%, Precision: 90%, Recall: 94.7%, F1: 0.923

A medical test correctly identifies 90 of 95 positive cases (94.7% recall) with only 10 false alarms among 905 negatives (98.9% specificity). The F1 score of 0.923 reflects strong overall positive-class performance.

Tips & Best Practices

  • Never rely on accuracy alone — especially with imbalanced classes.
  • MCC is often considered the best single-number summary of overall binary classification performance.
  • High recall is critical in medical screening; high precision matters when false positives are costly, as in legal search.
  • F1 score only considers the positive class — use balanced accuracy or MCC for a fuller picture.
  • Prevalence affects PPV and NPV dramatically — the same test performs differently in high- vs. low-prevalence populations.
  • Try the "Perfect Classifier" preset to see what ideal metrics look like, then compare with your model.
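
The tip about prevalence can be made concrete with a small sketch. It holds sensitivity and specificity fixed at the example's values (90/95 and 895/905) and recomputes PPV and NPV for several prevalence levels. The function name `predictive_values` and the prevalence levels other than 9.5% are illustrative choices, not values from the calculator.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at the given prevalence."""
    # Expected fraction of the whole population in each confusion matrix cell.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


sensitivity = 90 / 95
specificity = 895 / 905
for prevalence in (0.01, 0.095, 0.50):
    ppv, npv = predictive_values(sensitivity, specificity, prevalence)
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.4f}  NPV={npv:.4f}")
```

At 9.5% prevalence this reproduces the PPV (90.00%) and NPV (99.44%) shown earlier. Lowering the prevalence lowers PPV even though the test itself is unchanged.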

Understanding the Confusion Matrix

The confusion matrix is organized with actual classes as rows and predicted classes as columns (though conventions vary). True Positives and True Negatives sit on the main diagonal — these are correct predictions. False Positives (type I errors) and False Negatives (type II errors) are the off-diagonal cells representing mistakes.

Every classification metric derives from these four counts. Accuracy uses all four; precision and recall focus on the positive class; specificity focuses on the negative class. The choice of which metric to optimize depends on the costs of different types of errors.
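
When a model's output is a list of labels and not four counts, the counts can be tallied first. The following is a minimal sketch; the function name `confusion_counts`, the `positive` parameter, and the sample labels are all made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN from paired actual and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (type I error)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (type II error)
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```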

The Precision-Recall Trade-off

In practice, precision and recall tend to pull in opposite directions as the decision threshold moves. Making a classifier more conservative (requiring stronger evidence for a positive prediction) usually increases precision but decreases recall: fewer false positives, but more missed positives. Conversely, a more liberal threshold catches more positives (higher recall) at the cost of more false alarms (lower precision). The F1 score is the harmonic mean of the two, so it penalizes large imbalances between them.
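
A toy threshold sweep illustrates the trade-off. The scores and labels below are invented for this sketch, and `precision_recall_at` is a hypothetical helper; real score distributions will not always move this cleanly.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Made-up classifier scores with their true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

results = {t: precision_recall_at(scores, labels, t) for t in (0.25, 0.50, 0.85)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

On this data, raising the threshold from 0.25 to 0.85 moves precision from about 0.714 to 1.000 while recall falls from 1.000 to 0.400.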

Matthews Correlation Coefficient

MCC, introduced by biochemist Brian Matthews in 1975, is increasingly regarded as one of the most informative single metrics for binary classification. Like accuracy, it uses all four quadrants, but it is high only when the classifier does well on both the positive and the negative class, so a dominant majority class cannot inflate it. Unlike F1, it doesn't ignore true negatives. An MCC of 0 indicates performance no better than random prediction; +1 is perfect; −1 is total disagreement. Several studies have found MCC more reliable than accuracy or F1 when classes are imbalanced.
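
The range of MCC and its sensitivity to true negatives can be checked with a short sketch. The counts below are illustrative, apart from the example's own TP = 90, FP = 10, FN = 5, TN = 895. The helper names `mcc` and `f1` are assumptions for this sketch, which also assumes nonzero denominators.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient (assumes a nonzero denominator)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator


def f1(tp, fp, fn):
    """F1 score written in terms of counts; true negatives do not appear."""
    return 2 * tp / (2 * tp + fp + fn)


print(mcc(50, 0, 0, 50))    # perfect classifier: 1.0
print(mcc(0, 50, 50, 0))    # every prediction wrong: -1.0
print(mcc(25, 25, 25, 25))  # no relationship to the labels: 0.0

# Same TP, FP, and FN, but far fewer true negatives in the second case.
many_tn = mcc(90, 10, 5, 895)
few_tn = mcc(90, 10, 5, 5)
print(f1(90, 10, 5), many_tn, few_tn)
```

F1 is identical in the last two cases because it never looks at TN, while MCC drops sharply when most negatives are misclassified.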

Frequently Asked Questions

  • Why accuracy alone can mislead: if 99% of cases are negative, a model that always predicts negative achieves 99% accuracy while catching zero positives. Precision, recall, F1, and MCC are more informative for imbalanced data because they reflect how well the minority class is handled.
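
The always-negative scenario can be reproduced with a hypothetical sample of 1,000 cases, 10 of them positive. In this sketch, `summarize` is an illustrative helper, and reporting `None` for an undefined ratio is a choice made here; some tools report 0 in that situation by convention.

```python
import math


def summarize(tp, fp, fn, tn):
    """Accuracy, precision, recall, and MCC; None marks an undefined ratio."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# A model that predicts "negative" for every case: no TP and no FP.
always_negative = summarize(tp=0, fp=0, fn=10, tn=990)
print(always_negative)
```

Accuracy comes out at 0.99, but recall is 0.0, and precision and MCC are undefined because the model never predicts the positive class.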