Accuracy, Precision, Recall, and F1 Score with Scikit-learn

Accuracy, precision, recall, and F1 score are the core classification metrics used to evaluate machine learning models, and Scikit-learn provides a function for each in sklearn.metrics. Each metric offers a different perspective on model performance, which matters most in imbalanced classification problems.

Key Characteristics

  • Accuracy: Measures overall correctness
  • Precision: Focuses on positive predictive value
  • Recall: Focuses on sensitivity or true positive rate
  • F1 Score: Harmonic mean of precision and recall

Basic Rules

  • Use accuracy_score for balanced datasets.
  • Use precision_score when false positives are costly.
  • Use recall_score when false negatives are costly.
  • Use f1_score to balance precision and recall (a comparison sketch of all four metrics follows this list).
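
To see why these rules matter, here is a minimal sketch (with hypothetical, hand-picked labels) comparing all four metrics on an imbalanced toy example, where a model that always predicts the majority class looks deceptively good on accuracy alone:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical imbalanced labels: 9 negatives, 1 positive
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A lazy model that always predicts the majority class
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))                    # 0.9 -- looks strong
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 -- no positives predicted
print(recall_score(y_true, y_pred))                      # 0.0 -- misses the only positive
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0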

Syntax Table

SL NO | Metric    | Function Name   | Syntax Example                  | Description
------|-----------|-----------------|---------------------------------|---------------------------------------
1     | Accuracy  | accuracy_score  | accuracy_score(y_true, y_pred)  | Proportion of correct predictions
2     | Precision | precision_score | precision_score(y_true, y_pred) | TP / (TP + FP)
3     | Recall    | recall_score    | recall_score(y_true, y_pred)    | TP / (TP + FN)
4     | F1 Score  | f1_score        | f1_score(y_true, y_pred)        | Harmonic mean of precision and recall

Syntax Explanation

1. Accuracy

What is it? Overall proportion of correct predictions made by the model.

from sklearn.metrics import accuracy_score
y_true = [0, 1, 1, 0, 1]  # ground-truth labels
y_pred = [0, 1, 0, 0, 1]  # model predictions
print(accuracy_score(y_true, y_pred))  # 0.8

Explanation:

  • Compares each prediction with the true label and counts the matches.
  • Formula: (TP + TN) / (TP + TN + FP + FN)
  • Best used when the dataset is balanced and classes occur with similar frequencies.
  • Example output: 0.8 means 80% of predictions were correct. (The sketch below verifies the formula against a confusion matrix.)
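
As a quick check, the same formula can be reproduced from the confusion matrix. A minimal sketch, reusing the labels from the example above:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print((tp + tn) / (tp + tn + fp + fn))  # 0.8 -- matches the formula
print(accuracy_score(y_true, y_pred))   # 0.8 -- same result from Scikit-learn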

2. Precision

What is it? Measures the accuracy of positive predictions.

from sklearn.metrics import precision_score
y_true = [0, 1, 1, 0, 1]  # ground-truth labels
y_pred = [0, 1, 0, 0, 1]  # model predictions
print(precision_score(y_true, y_pred))  # 1.0

Explanation:

  • Formula: TP / (TP + FP)
  • Answers the question: “Of all items labeled positive, how many were truly positive?”
  • High precision is critical in applications like spam detection, where false positives are costly.
  • Example output: 1.0 means every predicted positive was actually positive. (An edge-case sketch follows below.)
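
One practical edge case: if the model predicts no positives at all, TP + FP is zero and precision is undefined. A minimal sketch of Scikit-learn's zero_division parameter, which controls what is returned in that case:

from sklearn.metrics import precision_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 0, 0, 0]  # the model never predicts the positive class

# TP + FP == 0, so precision is undefined; zero_division sets the fallback value
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0, without a warning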

3. Recall

What is it? Measures the completeness of positive predictions.

from sklearn.metrics import recall_score
y_true = [0, 1, 1, 0, 1]  # ground-truth labels
y_pred = [0, 1, 0, 0, 1]  # model predictions
print(recall_score(y_true, y_pred))  # 0.666...

Explanation:

  • Formula: TP / (TP + FN)
  • Tells us how many of the actual positives were correctly identified.
  • Important in medical testing or fraud detection, where missing positives is costly.
  • Example output: ≈0.667 means two of the three actual positives were correctly predicted. (A sketch of the pos_label parameter follows below.)
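
By default, recall_score treats the label 1 as the positive class. When labels are encoded differently (for example, as strings), the pos_label parameter selects which class counts as positive. A minimal sketch with hypothetical string labels:

from sklearn.metrics import recall_score

y_true = ['ham', 'spam', 'spam', 'ham', 'spam']
y_pred = ['ham', 'spam', 'ham', 'ham', 'spam']

# pos_label tells Scikit-learn which label is the positive class
print(recall_score(y_true, y_pred, pos_label='spam'))  # 0.666... -- 2 of 3 spams found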

4. F1 Score

What is it? Combines precision and recall into a single score.

from sklearn.metrics import f1_score
y_true = [0, 1, 1, 0, 1]  # ground-truth labels
y_pred = [0, 1, 0, 0, 1]  # model predictions
print(f1_score(y_true, y_pred))  # 0.8

Explanation:

  • Formula: 2 * (precision * recall) / (precision + recall)
  • Provides a balanced metric in cases where you care equally about precision and recall.
  • Especially useful for datasets with class imbalance.
  • Example output: 0.8 means the model has a good balance of precision and recall.
  • Can be macro, micro, or weighted averaged in multiclass settings using the average parameter, as sketched below.
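
A minimal sketch of those averaging options on a hypothetical three-class problem:

from sklearn.metrics import f1_score

# Hypothetical multiclass labels (three classes: 0, 1, 2)
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average='micro'))     # computed from global TP/FP/FN counts
print(f1_score(y_true, y_pred, average='weighted'))  # per-class F1 weighted by class support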

Real-Life Project: Evaluate a Classifier on Imbalanced Dataset

Objective

Use F1 and recall to assess model performance on skewed data.

Code Example

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load data
data = pd.read_csv('imbalanced_classification.csv')
X = data.drop('target', axis=1)
y = data['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)  # stratify keeps class proportions in both splits

# Train model
model = LogisticRegression(max_iter=1000)  # raise max_iter to help convergence
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

Expected Output

  • Printed scores for all four metrics.
  • Clear understanding of model behavior with imbalanced data.

Common Mistakes

  • ❌ Using accuracy on imbalanced datasets.
  • ❌ Not considering the business context when selecting metrics.
  • ❌ Ignoring precision-recall trade-offs.

Further Reading

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
