Cross-Validation Methods in Scikit-learn

Cross-validation is a statistical technique for evaluating how well a machine learning model will generalize to an independent dataset. Instead of relying on a single train/test split, it trains and tests the model on several different partitions of the data and aggregates the results. Scikit-learn provides multiple cross-validation strategies for different tasks and dataset types.

Key Characteristics

  • Estimates Model Generalization
  • Reduces Overfitting
  • Supports Hyperparameter Tuning
  • Provides Robust Evaluation Metrics

Basic Rules

  • Always shuffle the data before applying cross-validation unless the order is meaningful.
  • Use stratified sampling for imbalanced classification datasets.
  • Choose the method that fits the dataset size and learning objective.
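The first two rules can be sketched directly in code (the parameter values below are illustrative, not prescriptive):

```python
from sklearn.model_selection import KFold, StratifiedKFold

# Shuffle when row order carries no meaning; fixing random_state makes
# the shuffle reproducible across runs
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Stratified sampling keeps class proportions stable across folds,
# which matters for imbalanced classification targets
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print(kf)
print(skf)
```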

Syntax Table

SL NO | Method            | Syntax Example                            | Description
1     | K-Fold            | KFold(n_splits=5)                         | Splits data into k equal folds
2     | Stratified K-Fold | StratifiedKFold(n_splits=5)               | Maintains class proportions across folds
3     | Leave-One-Out     | LeaveOneOut()                             | Each sample is used once as a test set
4     | ShuffleSplit      | ShuffleSplit(n_splits=10, test_size=0.25) | Randomly shuffles and splits data multiple times
5     | TimeSeriesSplit   | TimeSeriesSplit(n_splits=5)               | For ordered time series data

Syntax Explanation

1. K-Fold

What is it? Divides the dataset into k folds and rotates training/testing across them.

Syntax:

from sklearn.model_selection import KFold
kf = KFold(n_splits=5)

Explanation:

  • Splits the data into 5 folds.
  • Each fold is used once as a test set.
  • Provides more reliable evaluation than a single train/test split.
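A small illustration of how the folds rotate; the toy array is just for demonstration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # 6 samples, 2 features
kf = KFold(n_splits=3)

# Each sample lands in exactly one test fold; the remaining
# folds form the training set for that round
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```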

2. Stratified K-Fold

What is it? Ensures each fold has the same class distribution as the full dataset.

Syntax:

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)

Explanation:

  • Best suited for classification tasks with imbalanced classes.
  • Prevents bias due to class distribution shifts.
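The class-proportion guarantee can be seen on a small imbalanced toy example:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 2))            # features are irrelevant here
y = np.array([0] * 8 + [1] * 4)  # imbalanced: two-thirds class 0
skf = StratifiedKFold(n_splits=4)

# Every test fold keeps the 2:1 class ratio of the full dataset
for train_idx, test_idx in skf.split(X, y):
    print("test labels:", y[test_idx])
```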

3. Leave-One-Out (LOO)

What is it? Uses a single data point as the test set, and the rest as training.

Syntax:

from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()

Explanation:

  • Useful for small datasets.
  • Very computationally expensive.
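The cost is easy to see on a toy array: with n samples, the model is trained n times.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(8).reshape(4, 2)  # 4 samples
loo = LeaveOneOut()

# One split per sample, so the number of splits equals the
# number of samples — this is what makes LOO expensive at scale
print(loo.get_n_splits(X))  # 4
for train_idx, test_idx in loo.split(X):
    print(f"train={train_idx} test={test_idx}")
```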

4. ShuffleSplit

What is it? Randomly shuffles the dataset and splits into train/test sets repeatedly.

Syntax:

from sklearn.model_selection import ShuffleSplit
ss = ShuffleSplit(n_splits=10, test_size=0.25)

Explanation:

  • Ensures randomness in train/test partitions.
  • Each split is independent.
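A short sketch showing the independence of the splits (sizes and indices are for this toy data):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)  # 10 samples
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

# Unlike K-Fold, test sets are drawn independently each time, so a
# sample may appear in several test sets — or in none of them
for train_idx, test_idx in ss.split(X):
    print("test:", sorted(test_idx))
```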

5. TimeSeriesSplit

What is it? Preserves temporal order: each training set contains only samples that come before the corresponding test set.
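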

Syntax:

from sklearn.model_selection import TimeSeriesSplit
tss = TimeSeriesSplit(n_splits=5)

Explanation:

  • Prevents data leakage for time series data.
  • Suitable for forecasting and temporal validation.
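The expanding training window can be seen on a toy time-ordered array:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)  # 6 time-ordered samples
tss = TimeSeriesSplit(n_splits=5)

# Training windows grow over time, and every test index comes strictly
# after every training index, so the model never sees the future
for train_idx, test_idx in tss.split(X):
    print(f"train={train_idx} test={test_idx}")
```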

Real-Life Project: Cross-Validating a Classification Model

Objective

Evaluate the accuracy of a decision tree classifier using stratified k-fold cross-validation.

Code Example

import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load dataset
data = pd.read_csv('classification_data.csv')
X = data.drop('target', axis=1)
y = data['target']

# Wrap scaling and the model in a pipeline so the scaler is fit on each
# training fold only — scaling the full dataset up front would leak test
# statistics into preprocessing
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))

# Stratified CV keeps class proportions stable in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validation
scores = cross_val_score(model, X, y, cv=skf)
print("Cross-validated scores:", scores)
print("Mean Accuracy:", scores.mean())

Expected Output

  • Individual accuracy scores for each fold.
  • Mean accuracy as an estimate of model generalization.

Common Mistakes

  • ❌ Using simple K-Fold on imbalanced datasets.
  • ❌ Not scaling data consistently across folds.
  • ❌ Applying time series split to shuffled data.

Further Reading

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon