Time Series Cross-Validation in Scikit-learn

Traditional k-fold cross-validation is not suitable for time series data due to the temporal dependency between observations. Instead, Scikit-learn provides TimeSeriesSplit, a strategy that preserves order and prevents leakage by ensuring that the training set always precedes the test set chronologically.

Key Characteristics

Maintains chronological order in splits
Avoids training on future data
Useful for evaluating time-based model stability
Supports expanding-window validation by default, and rolling-window validation when max_train_size is set

Basic Rules

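Always split sequentially, not randomly
Training set must precede test set
Sort rows chronologically and keep observations equally spaced in time, because TimeSeriesSplit splits by row position, not by timestamp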
Ideal for univariate and multivariate time series tasks
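
The ordering rule above can be checked directly. The following is a minimal sketch on toy data; the helper name respects_time_order is made up for this illustration and is not part of Scikit-learn. It tests whether every training index comes before every test index in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered samples (toy data)

def respects_time_order(splitter, X):
    """Return True if, in every fold, all training indices precede all test indices."""
    return all(train_idx.max() < test_idx.min()
               for train_idx, test_idx in splitter.split(X))

ts_ok = respects_time_order(TimeSeriesSplit(n_splits=4), X)
kf_ok = respects_time_order(KFold(n_splits=4, shuffle=True, random_state=0), X)

print("TimeSeriesSplit keeps order:", ts_ok)   # True
print("Shuffled KFold keeps order:", kf_ok)    # False
```

KFold fails this check even without shuffling, because the fold that tests on the earliest samples has to train on later ones.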

Syntax Table

No.	Technique	Syntax Example	Description
1	Initialize split	`TimeSeriesSplit(n_splits=5)`	Creates time-ordered cross-validation sets
2	Access splits	`for train_idx, test_idx in tscv.split(X): ...`	Iterates through each CV fold
3	Train model	`model.fit(X[train_idx], y[train_idx])`	Trains on training portion of each split
4	Evaluate model	`model.predict(X[test_idx])`	Evaluates on time-valid test data
5	Visualization	`plt.plot(train_idx), plt.plot(test_idx)`	Useful to understand fold composition

Syntax Explanation

1. Initialize TimeSeriesSplit

What is it?
Creates a cross-validator that provides train/test indices in time-ordered folds.

Syntax:

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

Explanation:

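n_splits=5 produces 5 train/test splits by cutting the data into 6 consecutive blocks; each test set is one block, and any leftover samples go to the first training set.
Each successive split adds the previous test block to the training set (an expanding window).
This mimics a real-time setting, where the model is refit as new data arrives and is always evaluated on later observations.
Set the max_train_size parameter to cap the training set, which turns the expanding window into a rolling window of fixed maximum length.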
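
To make the fold sizes concrete, here is a small sketch on 12 toy samples. The helper fold_sizes is a name invented for this example. It compares the default expanding window with a window capped by max_train_size=4.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples (toy data)

def fold_sizes(splitter, X):
    """Return a list of (train size, test size) pairs, one per fold."""
    return [(len(train_idx), len(test_idx))
            for train_idx, test_idx in splitter.split(X)]

expanding = fold_sizes(TimeSeriesSplit(n_splits=5), X)
rolling = fold_sizes(TimeSeriesSplit(n_splits=5, max_train_size=4), X)

print("Expanding window:", expanding)  # [(2, 2), (4, 2), (6, 2), (8, 2), (10, 2)]
print("Rolling window:  ", rolling)    # [(2, 2), (4, 2), (4, 2), (4, 2), (4, 2)]
```

With the cap in place, the oldest samples drop out of later training sets, which can help when older data no longer reflects current behaviour.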

2. Access Splits

What is it?
Extracts the train and test indices from each fold using a for loop.

Syntax:

for train_idx, test_idx in tscv.split(X):
    print("Train indices:", train_idx, "Test indices:", test_idx)

Explanation:

Iterates over each fold and provides integer indices for slicing.
Ensures test data follows training data in time, provided the rows of X are already sorted chronologically.
Very helpful for debugging, logging, and visualizing the sequence of training and testing.
Each successive fold yields a larger training set, so later models are fit on more history.
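
When the data lives in a pandas DataFrame with a date index, the integer indices can be mapped back to dates for logging. The sketch below assumes a hypothetical daily series of 30 rows; the column name and dates are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical daily series: 30 rows indexed by calendar date
dates = pd.date_range("2024-01-01", periods=30, freq="D")
df = pd.DataFrame({"value": np.arange(30, dtype=float)}, index=dates)

tscv = TimeSeriesSplit(n_splits=4)
fold_ranges = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(df), start=1):
    # split() yields positions, so use .iloc rather than label-based indexing
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    fold_ranges.append((train.index.max(), test.index.min()))
    print(f"Fold {fold}: train {train.index.min().date()} to {train.index.max().date()}, "
          f"test {test.index.min().date()} to {test.index.max().date()}")
```

Printing the date ranges is a quick way to confirm that each test period starts after its training period ends.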

3. Train Model

What is it?
Fits the model to the current training set for the current fold.

Syntax:

model.fit(X[train_idx], y[train_idx])

Explanation:

Ensures training is only done using past observations.
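Refitting on every fold shows how performance changes as more history becomes available.
Works with any Scikit-learn estimator (LinearRegression, SVR, Ridge, etc.)
For pipelines, use: pipeline.fit(X[train_idx], y[train_idx])
If X and y are pandas objects rather than NumPy arrays, index them by position with X.iloc[train_idx] and y.iloc[train_idx].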
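
As an illustration of the pipeline case, the sketch below pairs a StandardScaler with Ridge on synthetic data. The data, coefficients, and alpha value are assumptions chosen for the example. Because the whole pipeline is fit inside the loop, the scaler's statistics come from the training rows of each fold only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))  # toy features, rows assumed to be in time order
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.standard_normal(60)

tscv = TimeSeriesSplit(n_splits=5)
fold_r2 = []
for train_idx, test_idx in tscv.split(X):
    # A fresh pipeline per fold: scaling parameters are learned from training rows only
    pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    pipeline.fit(X[train_idx], y[train_idx])
    fold_r2.append(pipeline.score(X[test_idx], y[test_idx]))

print("R^2 per fold:", np.round(fold_r2, 3))
```

Wrapping preprocessing and model together keeps test-fold information out of the scaling step.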

4. Evaluate Model

What is it?
Generates predictions on the future (test) portion of the time series.

Syntax:

y_pred = model.predict(X[test_idx])

Explanation:

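Returns one prediction per row in the test fold. Whether these are one-step-ahead or multi-step forecasts depends on how the features and target were constructed (for example, lagged features paired with a next-step target).
Compare the predictions with y[test_idx] using evaluation metrics such as RMSE, MAE, or MAPE.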
Important for simulating how a deployed model would perform on unseen data.
You can log each fold’s score or average them at the end.
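
One way to log and average the fold scores is sketched below on synthetic data (the data and coefficients are assumptions). It records RMSE and MAE for each fold, then shows that cross_val_score with cv=tscv reproduces the same MAE values in one call.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 2))  # toy features, rows assumed to be in time order
y = 3.0 * X[:, 0] - X[:, 1] + 0.2 * rng.standard_normal(80)

tscv = TimeSeriesSplit(n_splits=4)
rmse_scores, mae_scores = [], []
for train_idx, test_idx in tscv.split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    rmse_scores.append(np.sqrt(mean_squared_error(y[test_idx], y_pred)))
    mae_scores.append(mean_absolute_error(y[test_idx], y_pred))

print("RMSE per fold:", np.round(rmse_scores, 3), "mean:", round(float(np.mean(rmse_scores)), 3))
print("MAE per fold: ", np.round(mae_scores, 3), "mean:", round(float(np.mean(mae_scores)), 3))

# Shortcut: cross_val_score runs the same loop; error metrics are negated by convention
cv_mae = -cross_val_score(Ridge(), X, y, cv=tscv, scoring="neg_mean_absolute_error")
print("MAE via cross_val_score:", np.round(cv_mae, 3))
```

MAPE is left out here because it becomes unstable when true values are close to zero, which is common for return series.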

5. Visualization of Splits

What is it?
Optional step to plot how splits are formed over time.

Syntax:

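import matplotlib.pyplot as plt
# One row per fold: x-axis is the sample index, y-axis is the fold number
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    plt.scatter(train_idx, [fold] * len(train_idx), marker='_', color='tab:blue', label='Train' if fold == 0 else None)
    plt.scatter(test_idx, [fold] * len(test_idx), marker='_', color='tab:orange', label='Test' if fold == 0 else None)
plt.xlabel('Sample index')
plt.ylabel('Fold')
plt.legend()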

Explanation:

Great for checking how data is partitioned visually.
Confirms model never sees future data during training.
Helps ensure correct fold structure for reproducibility.
Can reveal issues like short test folds or improper sequences.
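
When a plotting backend is not available, a text rendering gives the same check. This is a minimal sketch on 12 toy samples; the letters T, V, and "." are an arbitrary convention chosen for this example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples (toy data)
tscv = TimeSeriesSplit(n_splits=3)

rows = []
for train_idx, test_idx in tscv.split(X):
    row = ["."] * len(X)   # "." = sample not used in this fold
    for i in train_idx:
        row[i] = "T"       # "T" = training sample
    for i in test_idx:
        row[i] = "V"       # "V" = validation (test) sample
    rows.append("".join(row))

for fold, row in enumerate(rows, start=1):
    print(f"Fold {fold}: {row}")
# Fold 1: TTTVVV......
# Fold 2: TTTTTTVVV...
# Fold 3: TTTTTTTTTVVV
```

Each row should show a block of T followed directly by a block of V, with no T to the right of any V.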

Real-Life Project: Time Series CV for Stock Forecasting

Project Name

Sequential Cross-Validation for Stock Returns

Project Overview

Use time series split to validate a regression model predicting next-day stock returns.

Project Goal

Implement walk-forward validation using TimeSeriesSplit.

Code for This Project

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Simulated data (seeded so that repeated runs print the same scores)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'feature1': rng.standard_normal(100),
    'feature2': rng.standard_normal(100),
    'target': rng.standard_normal(100)
})

X = df[['feature1', 'feature2']].values
y = df['target'].values

# Initialize split
tscv = TimeSeriesSplit(n_splits=5)

# Run CV
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = Ridge()
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    print(f"Fold {fold + 1} MSE:", mean_squared_error(y[test_idx], y_pred))

Expected Output

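One MSE value is printed per fold (five lines in total).
Because the simulated features and target are independent random noise, the model has nothing to learn, so each fold's MSE should sit near the target's variance of about 1. On real data, compare the fold scores to judge how stable performance is over time.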

Common Mistakes to Avoid

❌ Using KFold instead of TimeSeriesSplit
❌ Training on future data (data leakage)
❌ Fitting scalers or other preprocessing on the full dataset before splitting (fit them on the training portion of each fold only)

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
