Overfitting and underfitting are two common problems in machine learning that affect model performance and generalization. Scikit-learn provides tools and techniques to detect and address these issues effectively.
Key Characteristics
- Overfitting: Model learns noise; performs well on training but poorly on unseen data.
- Underfitting: Model is too simple; fails to capture data patterns.
- A good fit requires balancing the two through hyperparameter tuning and validation.
- Both problems can be diagnosed visually with learning curves.
Basic Rules
- Monitor both training and validation performance.
- Use cross-validation to detect generalization issues (see the sketch after this list).
- Apply regularization or model simplification to reduce overfitting.
- Increase model complexity or add features to reduce underfitting.
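As a minimal sketch of the first two rules, cross_validate with return_train_score=True reports training and validation scores side by side. The Ridge model and synthetic data below are illustrative assumptions, not part of a specific project:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
# Illustrative synthetic regression data and model
X, y = make_regression(n_samples=300, n_features=20, noise=15, random_state=0)
cv_results = cross_validate(Ridge(alpha=1.0), X, y, cv=5, return_train_score=True)
print("Train R^2:", round(np.mean(cv_results['train_score']), 3))
print("CV R^2:", round(np.mean(cv_results['test_score']), 3))
# A large gap between the two suggests overfitting; two low scores suggest underfitting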
Syntax Table
SL NO | Technique/Tool | Syntax Example | Description
---|---|---|---
1 | Learning Curve | learning_curve(estimator, X, y) | Measures train/validation performance vs. training set size
2 | Validation Curve | validation_curve(estimator, X, y, ...) | Shows performance vs. hyperparameter values
3 | Regularization (Ridge) | Ridge(alpha=1.0) | Penalizes large coefficients to reduce model complexity
4 | Polynomial Features | PolynomialFeatures(degree=3) | Adds complexity to combat underfitting
Syntax Explanation
1. Learning Curve
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5)
Explanation:
- Plots training and validation scores as the training set size increases.
- A persistent gap between the training and validation curves indicates overfitting.
- Both curves plateauing at a low score indicates underfitting.
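The Real-Life Project below plots these curves; as a quick numeric check, a sketch like the following prints the train/validation gap directly (the Ridge model and synthetic data are illustrative assumptions):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve
# Illustrative synthetic data and model
X, y = make_regression(n_samples=400, n_features=5, noise=20, random_state=0)
train_sizes, train_scores, test_scores = learning_curve(Ridge(alpha=1.0), X, y, cv=5)
gap = np.mean(train_scores, axis=1) - np.mean(test_scores, axis=1)
print("train sizes:", train_sizes)
print("train-validation gap:", np.round(gap, 3))
# A gap that stays large as the training set grows points to overfitting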
2. Validation Curve
from sklearn.model_selection import validation_curve
param_range = [0.001, 0.01, 0.1, 1, 10]
train_scores, test_scores = validation_curve(model, X, y, param_name='alpha', param_range=param_range, cv=5)
Explanation:
- Evaluates model performance for different values of a hyperparameter.
- Detects under- or overfitting trends based on score patterns.
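A minimal sketch of this pattern, assuming a Ridge estimator and a small, noisy synthetic dataset purely for illustration, might look like:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve
# Illustrative data: few samples relative to features, so alpha matters
X, y = make_regression(n_samples=100, n_features=60, noise=30, random_state=0)
param_range = [0.001, 0.01, 0.1, 1, 10]
train_scores, test_scores = validation_curve(Ridge(), X, y, param_name='alpha', param_range=param_range, cv=5)
print("train:", np.round(np.mean(train_scores, axis=1), 3))
print("valid:", np.round(np.mean(test_scores, axis=1), 3))
# Training far above validation at small alpha suggests overfitting;
# both scores dropping at large alpha suggests the penalty is too strong (underfitting)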
3. Ridge Regularization
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
Explanation:
- Adds a penalty on large coefficients (L2 regularization).
- Helps simplify overly complex models and reduce overfitting.
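As a rough illustration (the high-dimensional synthetic data and the alpha value are assumptions), comparing an unpenalized linear model with Ridge under cross-validation might look like:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
# Noisy data with many features relative to samples (illustrative assumption)
X, y = make_regression(n_samples=100, n_features=80, noise=30, random_state=0)
print("LinearRegression CV R^2:", round(np.mean(cross_val_score(LinearRegression(), X, y, cv=5)), 3))
print("Ridge(alpha=10) CV R^2:", round(np.mean(cross_val_score(Ridge(alpha=10.0), X, y, cv=5)), 3))
# The penalty shrinks noisy coefficients, so the regularized model tends to generalize better here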
4. Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
Explanation:
- Adds higher-order terms to input features.
- Allows simple models (like linear regression) to capture nonlinear relationships.
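A minimal sketch, assuming a quadratic data-generating function chosen only for illustration, shows how the expanded features help a linear model:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# Quadratic relationship with noise (illustrative assumption)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(scale=0.5, size=200)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
print("Plain linear CV R^2:", round(np.mean(cross_val_score(LinearRegression(), X, y, cv=5)), 3))
print("Degree-2 poly CV R^2:", round(np.mean(cross_val_score(poly_model, X, y, cv=5)), 3))
# The expanded features let the linear model capture the curvature it otherwise misses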
Real-Life Project: Detecting Overfitting with Learning Curves
Objective
Compare training vs validation scores to detect overfitting in a regression model.
Code Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.datasets import make_regression
# Generate synthetic data
X, y = make_regression(n_samples=500, n_features=1, noise=10, random_state=42)
model = LinearRegression()
# Compute learning curves
train_sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5)
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
# Plot
plt.plot(train_sizes, train_mean, label='Training score')
plt.plot(train_sizes, test_mean, label='Cross-validation score')
plt.xlabel('Training Set Size')
plt.ylabel('Score')
plt.title('Learning Curve Example')
plt.legend()
plt.grid(True)
plt.show()
Expected Output
- Two curves showing model performance.
- Wide gap = overfitting; both low = underfitting; both high = good fit.
Common Mistakes
- ❌ Not using validation data to detect overfitting.
- ❌ Confusing poor training performance with overfitting.
- ❌ Using overly complex models on small datasets.