Model Evaluation Metrics Overview in Scikit-learn

Evaluating machine learning models is crucial for understanding their performance and guiding model improvement. Scikit-learn provides a wide range of metrics for classification, regression, and clustering tasks.

Key Characteristics

  • Task-Specific Metrics: Classification, regression, clustering
  • Supports Binary, Multiclass, Multilabel Problems
  • Customizable Scoring Options
  • Easy Integration with GridSearchCV and cross_val_score
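
As a minimal sketch of that integration, the snippet below passes a metric name through the scoring parameter of cross_val_score and GridSearchCV. The synthetic dataset from make_classification and the small grid over C are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic binary dataset (an assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cross_val_score accepts a metric name through `scoring`
cv_f1 = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="f1")
print("F1 per fold:", cv_f1)

# GridSearchCV uses the same `scoring` argument to rank candidates
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative grid, not a recommendation
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print("Best C:", grid.best_params_["C"])
print("Best mean F1:", grid.best_score_)
```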

Basic Rules

  • Choose metrics aligned with business or scientific goals.
  • For imbalanced classes, use metrics beyond accuracy.
  • For regression, evaluate both error magnitude (e.g., mean squared error) and goodness of fit (e.g., R^2).
  • Use make_scorer to create custom scoring functions.
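
One possible way to apply the last rule is sketched below: make_scorer wraps fbeta_score with beta=2, which weights recall more heavily than precision, and the resulting scorer is handed to cross_val_score. The dataset, the class weights, and the choice of beta are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic, mildly imbalanced dataset (assumption for illustration)
X, y = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0
)

# Wrap fbeta_score so that beta=2 is fixed for every call
f2_scorer = make_scorer(fbeta_score, beta=2)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring=f2_scorer
)
print("F2 per fold:", scores)
print("Mean F2:", scores.mean())
```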

Syntax Table

SL NO | Metric Type    | Function Name      | Syntax Example                     | Description
1     | Classification | accuracy_score     | accuracy_score(y_true, y_pred)     | Proportion of correct predictions
2     | Classification | precision_score    | precision_score(y_true, y_pred)    | True positives / predicted positives
3     | Classification | recall_score       | recall_score(y_true, y_pred)       | True positives / actual positives
4     | Classification | f1_score           | f1_score(y_true, y_pred)           | Harmonic mean of precision and recall
5     | Regression     | mean_squared_error | mean_squared_error(y_true, y_pred) | Average of squared errors
6     | Regression     | r2_score           | r2_score(y_true, y_pred)           | Coefficient of determination
7     | Clustering     | silhouette_score   | silhouette_score(X, labels)        | How well-separated the clusters are

Syntax Explanation

1. Accuracy Score

What is it? Basic metric to evaluate classification.

from sklearn.metrics import accuracy_score
y_true = [0, 1, 1, 0]
y_pred = [0, 0, 1, 1]
print(accuracy_score(y_true, y_pred))

Explanation:

  • Measures fraction of correctly classified instances.
  • Misleading on imbalanced datasets: a model that always predicts the majority class can still score high.
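
A small sketch of why accuracy can mislead on imbalanced data. The labels below are made up for illustration: nine negatives, one positive, and a "model" that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, recall_score

# 9 negatives and 1 positive (made-up labels for illustration)
y_true_imb = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A "model" that always predicts the majority class
y_pred_imb = [0] * 10

acc_imb = accuracy_score(y_true_imb, y_pred_imb)
rec_imb = recall_score(y_true_imb, y_pred_imb)
print(acc_imb)  # 0.9
print(rec_imb)  # 0.0
```

Accuracy looks strong even though the single positive case is never found, which is what recall exposes.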

2. Precision Score

What is it? Measures exactness in classification.

from sklearn.metrics import precision_score
print(precision_score(y_true, y_pred))

Explanation:

  • High precision means few false positives.
  • Useful when false positives are costly.
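
To make the ratio concrete, the following sketch recomputes precision by hand from the confusion matrix, using the same labels as the accuracy example. The name manual_precision is introduced here for illustration only.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Same labels as in the accuracy example
y_true = [0, 1, 1, 0]
y_pred = [0, 0, 1, 1]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual_precision = tp / (tp + fp)

print(tn, fp, fn, tp)                   # 1 1 1 1
print(manual_precision)                 # 0.5
print(precision_score(y_true, y_pred))  # 0.5
```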

3. Recall Score

What is it? Measures completeness of classification.

from sklearn.metrics import recall_score
print(recall_score(y_true, y_pred))

Explanation:

  • High recall means few false negatives.
  • Important when missing positives is costly.
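
Recall usually trades off against precision. The sketch below illustrates this with made-up probability scores and a hypothetical helper, predict_at, that applies a decision threshold; lowering the threshold catches more positives at the cost of more false alarms.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities for the positive class
y_true_thr = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

def predict_at(threshold):
    """Turn scores into 0/1 predictions at the given cutoff (hypothetical helper)."""
    return [1 if s >= threshold else 0 for s in y_score]

for threshold in (0.5, 0.3):
    y_pred_thr = predict_at(threshold)
    print(
        threshold,
        "recall:", round(recall_score(y_true_thr, y_pred_thr), 3),
        "precision:", round(precision_score(y_true_thr, y_pred_thr), 3),
    )
# threshold 0.5: recall 0.667, precision 0.667
# threshold 0.3: recall 1.0,   precision 0.6
```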

4. F1 Score

What is it? Combines precision and recall.

from sklearn.metrics import f1_score
print(f1_score(y_true, y_pred))

Explanation:

  • Useful when precision and recall are equally important.
  • Balances false positives and false negatives.
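
For multiclass targets, f1_score needs an explicit average argument; with the default average='binary' it raises an error on more than two classes. The sketch below uses made-up three-class labels to compare per-class, macro, and micro averaging.

```python
from sklearn.metrics import f1_score

# Made-up three-class labels for illustration
y_true_mc = [0, 1, 2, 0, 1, 2]
y_pred_mc = [0, 2, 1, 0, 0, 1]

per_class = f1_score(y_true_mc, y_pred_mc, average=None)   # one F1 per class
macro_f1 = f1_score(y_true_mc, y_pred_mc, average="macro")  # unweighted mean of per-class F1
micro_f1 = f1_score(y_true_mc, y_pred_mc, average="micro")  # computed from global TP/FP/FN counts

print(per_class)           # class 0 scores 0.8; classes 1 and 2 score 0
print(round(macro_f1, 3))  # 0.267
print(round(micro_f1, 3))  # 0.333
```

Macro averaging treats every class equally, so it drops sharply when minority classes are predicted poorly.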

5. Mean Squared Error (MSE)

What is it? Common metric for regression.

from sklearn.metrics import mean_squared_error
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.1, 7.8]
print(mean_squared_error(y_true, y_pred))

Explanation:

  • Penalizes larger errors more severely.
  • Sensitive to outliers.
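
The sketch below recomputes MSE by hand on the same values and derives two related quantities: RMSE (the square root of MSE, in the units of the target) and mean absolute error, which does not square the errors. NumPy is an added dependency assumed here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.1, 7.8]

# Manual MSE: mean of squared differences
errors = np.array(y_true) - np.array(y_pred)
manual_mse = np.mean(errors ** 2)

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                         # same units as the target
mae = mean_absolute_error(y_true, y_pred)   # less sensitive to large errors

print(manual_mse, mse)  # both approximately 0.2875
print(rmse)             # approximately 0.536
print(mae)              # approximately 0.475
```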

6. R^2 Score

What is it? Measures goodness of fit.

from sklearn.metrics import r2_score
print(r2_score(y_true, y_pred))

Explanation:

  • At most 1; a value of 0 matches always predicting the mean of y_true, and the score can be negative when the model does worse than that.
  • Closer to 1 means better prediction.
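
A short sketch of how the score behaves, reusing the regression targets from the MSE example. The two alternative prediction lists (a constant mean and a reversed ordering) are made up for illustration.

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]

y_pred_good = [2.5, 0.0, 2.1, 7.8]   # predictions from the MSE example
y_pred_mean = [2.875] * 4            # always predict the mean of y_true
y_pred_bad = [7.0, 2.0, -0.5, 3.0]   # made-up poor predictions

r2_good = r2_score(y_true, y_pred_good)
r2_mean = r2_score(y_true, y_pred_mean)
r2_bad = r2_score(y_true, y_pred_bad)

print(round(r2_good, 3))  # 0.961
print(round(r2_mean, 3))  # 0.0
print(round(r2_bad, 3))   # -0.525
```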

7. Silhouette Score (Clustering)

What is it? Evaluates cohesion and separation.

from sklearn.metrics import silhouette_score
# X: feature matrix; labels: one cluster assignment per sample (e.g., from KMeans)
print(silhouette_score(X, labels))

Explanation:

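  • Measures how well each point fits its own cluster compared with the nearest other cluster.
  • Ranges from -1 to 1: values near 1 indicate well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.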
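
Because the snippet above needs a feature matrix and cluster labels, here is a self-contained sketch. It assumes synthetic data from make_blobs and KMeans as the clustering algorithm, and compares a few candidate cluster counts; on well-separated data like this the score tends to be highest near the true number of groups.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three blobs (assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# n_init is set explicitly because its default differs across scikit-learn versions
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```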

Real-Life Project: Evaluate a Classifier with F1 and Accuracy

Objective

Train and evaluate a logistic regression model using classification metrics.

Code Example

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Load dataset
data = pd.read_csv('binary_classification.csv')
X = data.drop('target', axis=1)
y = data['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

Expected Output

  • Two printed lines, "Accuracy:" and "F1 Score:", each followed by a value between 0 and 1.
  • The exact values depend on the dataset and on the train-test split.
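
For a fuller report than two numbers, one option is to add confusion_matrix and classification_report. The sketch below is not the project's original code: it swaps the CSV file for a synthetic dataset from make_classification so that it runs on its own, and it adds stratify=y and max_iter=1000 as assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for binary_classification.csv (assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=42
)

# stratify=y keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, F1, and support in one table
report = classification_report(y_test, y_pred)
print(report)
```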

Common Mistakes

  • ❌ Using accuracy alone on imbalanced data.
  • ❌ Confusing precision and recall.
  • ❌ Applying classification metrics to regression outputs, or regression metrics to class labels.

Further Reading

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
