Cross-validation is a statistical technique for evaluating how well a machine learning model will generalize to an independent dataset. Rather than relying on a single train/test split, it repeatedly partitions the data into training and validation subsets and aggregates the results. Scikit-learn provides multiple cross-validation strategies for different tasks and dataset types.
Key Characteristics
- Estimates Model Generalization
- Reduces Overfitting
- Supports Hyperparameter Tuning
- Provides Robust Evaluation Metrics
Basic Rules
- Always shuffle the data before applying cross-validation unless the order is meaningful.
- Use stratified sampling for imbalanced classification datasets; this and the shuffling rule are shown in the sketch after this list.
- Choose the method that fits the dataset size and learning objective.
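As a quick illustration of the first two rules, here is a minimal sketch; the synthetic dataset from make_classification and all parameter values are assumptions chosen for demonstration:
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold
# Toy imbalanced dataset (90/10 class split) -- illustrative assumption
X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=0)
# shuffle=True randomizes row order before folding; random_state keeps it reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# StratifiedKFold keeps the 90/10 class ratio in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)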
Syntax Table
| SL NO | Method | Syntax Example | Description |
|---|---|---|---|
| 1 | K-Fold | KFold(n_splits=5) | Splits data into k equal folds |
| 2 | Stratified K-Fold | StratifiedKFold(n_splits=5) | Maintains class proportions across folds |
| 3 | Leave-One-Out | LeaveOneOut() | Each sample is used once as a test set |
| 4 | ShuffleSplit | ShuffleSplit(n_splits=10, test_size=0.25) | Randomly shuffles and splits data multiple times |
| 5 | TimeSeriesSplit | TimeSeriesSplit(n_splits=5) | For ordered time series data |
Syntax Explanation
1. K-Fold
What is it? Divides the dataset into k folds and rotates training/testing across them.
Syntax:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
Explanation:
- Splits the data into 5 folds.
- Each fold is used once as a test set.
- Provides more reliable evaluation than a single train/test split.
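A minimal usage sketch follows; the tiny NumPy array is an assumption used only to make the fold indices visible:
import numpy as np
from sklearn.model_selection import KFold
X = np.arange(10).reshape(5, 2)  # 5 toy samples, 2 features
kf = KFold(n_splits=5)
# split() yields index arrays; each sample appears in the test set exactly once
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)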
2. Stratified K-Fold
What is it? Ensures each fold preserves approximately the same class distribution as the full dataset.
Syntax:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
Explanation:
- Best suited for classification tasks with imbalanced classes.
- Prevents bias due to class distribution shifts.
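The sketch below shows the preserved class ratio; the hand-built labels are an assumption for illustration:
import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.zeros((12, 2))             # features do not affect the split itself
y = np.array([0] * 9 + [1] * 3)   # 75/25 class imbalance (toy labels)
skf = StratifiedKFold(n_splits=3)
# Each test fold holds 3 class-0 and 1 class-1 samples, matching the 75/25 ratio
for train_idx, test_idx in skf.split(X, y):
    print("test labels:", y[test_idx])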
3. Leave-One-Out (LOO)
What is it? Uses a single data point as the test set, and the rest as training.
Syntax:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
Explanation:
- Useful for small datasets.
- Very computationally expensive, since the model is trained once per sample.
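A short sketch, assuming a toy array, that makes the cost visible: the number of splits (and model fits) equals the number of samples:
import numpy as np
from sklearn.model_selection import LeaveOneOut
X = np.arange(8).reshape(4, 2)   # 4 toy samples -> 4 splits
loo = LeaveOneOut()
print(loo.get_n_splits(X))       # 4: one split per sample
for train_idx, test_idx in loo.split(X):
    print("test:", test_idx)     # a single held-out index each time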
4. ShuffleSplit
What is it? Randomly shuffles the dataset and splits into train/test sets repeatedly.
Syntax:
from sklearn.model_selection import ShuffleSplit
ss = ShuffleSplit(n_splits=10, test_size=0.25)
Explanation:
- Ensures randomness in train/test partitions.
- Each split is independent.
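A minimal sketch with a toy array (an assumption for illustration); note that, unlike K-Fold, a sample may land in several test sets or in none:
import numpy as np
from sklearn.model_selection import ShuffleSplit
X = np.arange(20).reshape(10, 2)  # 10 toy samples
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
# Each iteration draws an independent random 75/25 partition
for train_idx, test_idx in ss.split(X):
    print("test:", test_idx)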
5. TimeSeriesSplit
What is it? Preserves temporal order: each training set contains only observations that come before the corresponding test set.
Syntax:
from sklearn.model_selection import TimeSeriesSplit
tss = TimeSeriesSplit(n_splits=5)
Explanation:
- Prevents data leakage for time series data.
- Suitable for forecasting and temporal validation.
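The sketch below, using a toy ordered array as an assumption, shows the growing training window:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X = np.arange(12).reshape(6, 2)  # 6 observations in time order
tss = TimeSeriesSplit(n_splits=5)
# Training indices always precede test indices, so no future data leaks into training
for train_idx, test_idx in tss.split(X):
    print("train:", train_idx, "test:", test_idx)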
Real-Life Project: Cross-Validating a Classification Model
Objective
Evaluate the accuracy of a decision tree classifier using stratified k-fold cross-validation.
Code Example
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
# Load dataset
data = pd.read_csv('classification_data.csv')
X = data.drop('target', axis=1)
y = data['target']
# Model and CV strategy: the scaler lives inside a pipeline so it is
# refitted on each training fold, preventing leakage into the test fold
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Cross-validation
scores = cross_val_score(model, X, y, cv=skf)
print("Cross-validated scores:", scores)
print("Mean Accuracy:", scores.mean())
Expected Output
- Individual accuracy scores for each fold.
- Mean accuracy as an estimate of model generalization.
Common Mistakes
- ❌ Using simple K-Fold on imbalanced datasets.
- ❌ Fitting preprocessing such as a scaler on the full dataset before cross-validating, which leaks test-fold information into training (contrasted in the sketch after this list).
- ❌ Applying time series split to shuffled data.
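To make the scaling mistake concrete, here is a minimal contrast of the leaky and leak-free patterns; the synthetic data and LogisticRegression model are assumptions for illustration:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = make_classification(n_samples=200, random_state=0)
# Leaky: the scaler is fitted on all rows, including future test folds
# X_scaled = StandardScaler().fit_transform(X)
# scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)
# Leak-free: the pipeline refits the scaler inside each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5).mean())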
Further Reading
- Scikit-learn Cross-Validation Guide
- How to Use Cross-Validation Correctly (Blog)
- Cross-Validation Examples on Kaggle
