Overfitting and underfitting are two common problems in machine learning that affect model performance and generalization. Scikit-learn provides tools and techniques to detect and address these issues effectively.
Key Characteristics
- Overfitting: Model learns noise; performs well on training but poorly on unseen data.
- Underfitting: Model is too simple; fails to capture data patterns.
- A good fit requires balancing the two through hyperparameter tuning and validation.
- Both problems can be diagnosed visually with learning curves.
Basic Rules
- Monitor both training and validation performance.
- Use cross-validation to detect generalization issues (see the sketch after this list).
- Apply regularization or model simplification to reduce overfitting.
- Increase model complexity or add features to reduce underfitting.
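As a minimal sketch of the first two rules, cross_validate with return_train_score=True reports training and validation scores side by side. The Ridge model and synthetic data below are illustrative assumptions, not part of a specific project:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
# Illustrative synthetic regression data and model
X, y = make_regression(n_samples=300, n_features=20, noise=15, random_state=0)
cv_results = cross_validate(Ridge(alpha=1.0), X, y, cv=5, return_train_score=True)
print("Train R^2:", round(np.mean(cv_results['train_score']), 3))
print("CV R^2:", round(np.mean(cv_results['test_score']), 3))
# A large gap between the two suggests overfitting; two low scores suggest underfitting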
Syntax Table
SL NO | Technique/Tool | Syntax Example | Description
---|---|---|---
1 | Learning Curve | learning_curve(estimator, X, y) | Measures train/validation performance vs. training set size
2 | Validation Curve | validation_curve(estimator, X, y, ...) | Shows performance vs. hyperparameter values
3 | Regularization (Ridge) | Ridge(alpha=1.0) | Penalizes large coefficients to reduce model complexity
4 | Polynomial Features | PolynomialFeatures(degree=3) | Adds complexity to combat underfitting
Syntax Explanation
1. Learning Curve
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5)
Explanation:
- Plots training and validation scores as the training set size increases.
- A persistent gap between the training and validation curves indicates overfitting.
- Both curves plateauing at a low score indicates underfitting.
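The Real-Life Project below plots these curves; as a quick numeric check, a sketch like the following prints the train/validation gap directly (the Ridge model and synthetic data are illustrative assumptions):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve
# Illustrative synthetic data and model
X, y = make_regression(n_samples=400, n_features=5, noise=20, random_state=0)
train_sizes, train_scores, test_scores = learning_curve(Ridge(alpha=1.0), X, y, cv=5)
gap = np.mean(train_scores, axis=1) - np.mean(test_scores, axis=1)
print("train sizes:", train_sizes)
print("train-validation gap:", np.round(gap, 3))
# A gap that stays large as the training set grows points to overfitting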
2. Validation Curve
from sklearn.model_selection import validation_curve
param_range = [0.001, 0.01, 0.1, 1, 10]
train_scores, test_scores = validation_curve(model, X, y, param_name='alpha', param_range=param_range, cv=5)
Explanation:
- Evaluates model performance for different values of a hyperparameter.
- Detects under- or overfitting trends based on score patterns.
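A minimal sketch of this pattern, assuming a Ridge estimator and a small, noisy synthetic dataset purely for illustration, might look like:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve
# Illustrative data: few samples relative to features, so alpha matters
X, y = make_regression(n_samples=100, n_features=60, noise=30, random_state=0)
param_range = [0.001, 0.01, 0.1, 1, 10]
train_scores, test_scores = validation_curve(Ridge(), X, y, param_name='alpha', param_range=param_range, cv=5)
print("train:", np.round(np.mean(train_scores, axis=1), 3))
print("valid:", np.round(np.mean(test_scores, axis=1), 3))
# Training far above validation at small alpha suggests overfitting;
# both scores dropping at large alpha suggests the penalty is too strong (underfitting)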
3. Ridge Regularization
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
Explanation:
- Adds a penalty on large coefficients (L2 regularization).
- Helps simplify overly complex models and reduce overfitting.
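As a rough illustration (the high-dimensional synthetic data and the alpha value are assumptions), comparing an unpenalized linear model with Ridge under cross-validation might look like:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
# Noisy data with many features relative to samples (illustrative assumption)
X, y = make_regression(n_samples=100, n_features=80, noise=30, random_state=0)
print("LinearRegression CV R^2:", round(np.mean(cross_val_score(LinearRegression(), X, y, cv=5)), 3))
print("Ridge(alpha=10) CV R^2:", round(np.mean(cross_val_score(Ridge(alpha=10.0), X, y, cv=5)), 3))
# The penalty shrinks noisy coefficients, so the regularized model tends to generalize better here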
4. Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
Explanation:
- Adds higher-order terms to input features.
- Allows simple models (like linear regression) to capture nonlinear relationships.
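A minimal sketch, assuming a quadratic data-generating function chosen only for illustration, shows how the expanded features help a linear model:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# Quadratic relationship with noise (illustrative assumption)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(scale=0.5, size=200)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
print("Plain linear CV R^2:", round(np.mean(cross_val_score(LinearRegression(), X, y, cv=5)), 3))
print("Degree-2 poly CV R^2:", round(np.mean(cross_val_score(poly_model, X, y, cv=5)), 3))
# The expanded features let the linear model capture the curvature it otherwise misses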
Real-Life Project: Detecting Overfitting with Learning Curves
Objective
Compare training vs validation scores to detect overfitting in a regression model.
Code Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.datasets import make_regression
# Generate synthetic data
X, y = make_regression(n_samples=500, n_features=1, noise=10, random_state=42)
model = LinearRegression()
# Compute learning curves
train_sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5)
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
# Plot
plt.plot(train_sizes, train_mean, label='Training score')
plt.plot(train_sizes, test_mean, label='Cross-validation score')
plt.xlabel('Training Set Size')
plt.ylabel('Score')
plt.title('Learning Curve Example')
plt.legend()
plt.grid(True)
plt.show()
Expected Output
- Two curves showing model performance.
- Wide gap = overfitting; both low = underfitting; both high = good fit.
Common Mistakes
- ❌ Not using validation data to detect overfitting.
- ❌ Confusing poor training performance with overfitting.
- ❌ Using overly complex models on small datasets.