Model Overfitting and Underfitting with Scikit-learn

Overfitting and underfitting are two common problems in machine learning that affect model performance and generalization. Scikit-learn provides tools and techniques to detect and address these issues effectively.

Key Characteristics

  • Overfitting: Model learns noise; performs well on training but poorly on unseen data.
  • Underfitting: Model is too simple; fails to capture data patterns.
  • A good fit balances the two, typically via hyperparameter tuning and validation.
  • Can be visualized using learning curves.

Basic Rules

  • Monitor both training and validation performance.
  • Use cross-validation to detect generalization issues (see the sketch after this list).
  • Apply regularization or model simplification to reduce overfitting.
  • Increase model complexity or add features to reduce underfitting.
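
As a minimal sketch of the cross-validation rule above (the dataset below is a synthetic stand-in, not from the original example), comparing the training score against cross-validated scores exposes a generalization gap:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=15, random_state=0)

model = LinearRegression()
cv_scores = cross_val_score(model, X, y, cv=5)   # R^2 by default for regressors
train_score = model.fit(X, y).score(X, y)

print(f"Training R^2:        {train_score:.3f}")
print(f"Cross-validated R^2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

A training score far above the cross-validated score suggests overfitting; both scores low suggests underfitting.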

Syntax Table

SL NO  Technique/Tool          Syntax Example                           Description
1      Learning Curve          learning_curve(estimator, X, y)          Train/validation performance vs. training set size
2      Validation Curve        validation_curve(estimator, X, y, ...)   Performance across hyperparameter values
3      Regularization (Ridge)  Ridge(alpha=1.0)                         Penalizes large coefficients to reduce overfitting
4      Polynomial Features     PolynomialFeatures(degree=3)             Adds complexity to combat underfitting

Syntax Explanation

1. Learning Curve

from sklearn.model_selection import learning_curve

# model, X, y: the estimator and data under study; cv=5 -> 5-fold cross-validation
train_sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5)

Explanation:

  • Plots training and validation scores as the training set grows.
  • A persistent gap between the two curves indicates overfitting (quantified in the sketch below).
  • Both curves converging at a low score indicate underfitting.
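
Reading the curves numerically rather than visually can also help; a short sketch, assuming train_scores and test_scores come from the learning_curve call above:

import numpy as np

train_mean = train_scores.mean(axis=1)   # average over the 5 CV folds
test_mean = test_scores.mean(axis=1)

# Large gap at the largest training size -> overfitting;
# both means low -> underfitting
gap = train_mean[-1] - test_mean[-1]
print(f"Final training score:   {train_mean[-1]:.3f}")
print(f"Final validation score: {test_mean[-1]:.3f}")
print(f"Gap:                    {gap:.3f}")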

2. Validation Curve

from sklearn.model_selection import validation_curve
from sklearn.linear_model import Ridge

# The swept parameter must belong to the estimator; alpha here implies Ridge
param_range = [0.001, 0.01, 0.1, 1, 10]
train_scores, test_scores = validation_curve(Ridge(), X, y, param_name='alpha', param_range=param_range, cv=5)

Explanation:

  • Evaluates model performance across a range of values for a single hyperparameter (here, Ridge's alpha).
  • Low scores everywhere suggest underfitting; training scores far above validation scores suggest overfitting (see the sketch below).
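
A natural follow-up (a sketch, reusing param_range and test_scores from the call above) is to select the alpha with the best mean validation score:

import numpy as np

test_mean = test_scores.mean(axis=1)      # mean validation score per alpha
best_idx = int(np.argmax(test_mean))
print(f"Best alpha: {param_range[best_idx]} (validation score {test_mean[best_idx]:.3f})")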

3. Ridge Regularization

from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)

Explanation:

  • Adds an L2 penalty on coefficient size, controlled by alpha.
  • Helps simplify overly complex models and reduce overfitting (illustrated below).
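
To see the penalty in action, a minimal sketch on synthetic data (the dataset is an assumption for illustration): as alpha grows, the coefficient norm shrinks.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5, random_state=0)

for alpha in [0.01, 1.0, 100.0]:
    coef_norm = np.linalg.norm(Ridge(alpha=alpha).fit(X, y).coef_)
    print(f"alpha={alpha:>6}: ||coef|| = {coef_norm:.2f}")   # shrinks as alpha grows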

4. Polynomial Features

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

Explanation:

  • Adds higher-order and interaction terms to the input features.
  • Lets simple models (like linear regression) capture nonlinear relationships; see the pipeline sketch below.
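
In practice the transform is usually chained with a linear model via a pipeline; a sketch on synthetic nonlinear data (the dataset is an assumption for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Cubic toy data that a plain linear fit would underfit
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X.ravel() ** 3 + rng.normal(scale=2, size=200)

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
print(f"Training R^2 with degree-3 features: {model.score(X, y):.3f}")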

Real-Life Project: Detecting Overfitting with Learning Curves

Objective

Compare training vs validation scores to detect overfitting in a regression model.

Code Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.datasets import make_regression

# Generate synthetic data
X, y = make_regression(n_samples=500, n_features=1, noise=10, random_state=42)
model = LinearRegression()

# Compute learning curves
train_sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5)
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)

# Plot
plt.plot(train_sizes, train_mean, label='Training score')
plt.plot(train_sizes, test_mean, label='Cross-validation score')
plt.xlabel('Training Set Size')
plt.ylabel('R² Score')  # learning_curve uses the estimator's default scorer (R² for regressors)
plt.title('Learning Curve Example')
plt.legend()
plt.grid(True)
plt.show()

Expected Output

  • Two curves showing model performance.
  • Wide gap = overfitting; both low = underfitting; both high = good fit.

Common Mistakes

  • ❌ Not using validation data to detect overfitting.
  • ❌ Confusing poor training performance (a sign of underfitting) with overfitting.
  • ❌ Using overly complex models on small datasets (made concrete in the sketch below).
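
To make the last mistake concrete, a hedged sketch on a deliberately tiny synthetic dataset: a degree-15 polynomial scores almost perfectly on its training data, while cross-validation exposes the overfit.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(20, 1))            # deliberately small dataset
y = X.ravel() + rng.normal(scale=0.5, size=20)  # truly linear relationship

model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
train_score = model.fit(X, y).score(X, y)
cv_score = cross_val_score(model, X, y, cv=5).mean()

print(f"Training R^2:        {train_score:.3f}")   # near 1.0
print(f"Cross-validated R^2: {cv_score:.3f}")      # far lower -> overfitting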

Further Reading

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon