Time Series Cross-Validation in Scikit-learn

Standard k-fold cross-validation, especially with shuffling, is poorly suited to time series data because each observation depends on earlier ones, so a training fold can end up containing data from after the test period. Scikit-learn provides TimeSeriesSplit instead: in every split the training indices come before the test indices, so the model is never fit on observations that lie in the test period's future.

Key Characteristics

  • Maintains chronological order in splits
  • Avoids training on future data
  • Useful for evaluating time-based model stability
  • Supports consistent model validation in rolling or expanding windows

Basic Rules

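  • Always split sequentially, never randomly
  • Every training set must precede its test set in time
  • Sort the rows chronologically and use evenly spaced observations; TimeSeriesSplit splits by row position, not by timestamp
  • Applies to both univariate and multivariate time series tasks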
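
The ordering rule above can be checked directly. The following is a minimal sketch, assuming a toy array of 20 time-ordered samples; X_demo and the helper train_precedes_test are illustrative names, not part of Scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Toy data: 20 samples, assumed to be sorted in time order
X_demo = np.arange(20).reshape(-1, 1)

def train_precedes_test(cv, X):
    """Return True if, in every fold, all training indices come before all test indices."""
    return all(train.max() < test.min() for train, test in cv.split(X))

ts_ok = train_precedes_test(TimeSeriesSplit(n_splits=4), X_demo)
kf_ok = train_precedes_test(KFold(n_splits=4, shuffle=True, random_state=0), X_demo)

print("TimeSeriesSplit keeps train before test:", ts_ok)
print("Shuffled KFold keeps train before test:", kf_ok)
```

KFold fails the check because every sample, including the earliest one, must appear in some test fold, which forces later samples into that fold's training set.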

Syntax Table

No.  Technique         Syntax Example                                  Description
1    Initialize split  TimeSeriesSplit(n_splits=5)                     Creates time-ordered cross-validation sets
2    Access splits     for train_idx, test_idx in tscv.split(X): ...   Iterates through each CV fold
3    Train model       model.fit(X[train_idx], y[train_idx])           Trains on training portion of each split
4    Evaluate model    model.predict(X[test_idx])                      Evaluates on time-valid test data
5    Visualization     plt.plot(train_idx), plt.plot(test_idx)         Useful to understand fold composition

Syntax Explanation

1. Initialize TimeSeriesSplit

What is it?
Creates a cross-validator that provides train/test indices in time-ordered folds.

Syntax:

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

Explanation:

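  • n_splits=5 produces 5 train/test splits. By default the data is cut into roughly n_splits + 1 consecutive blocks; the first split trains on the first block and tests on the second.
  • Each later split adds the previous test block to the training set, so the training window expands over time.
  • This mimics a real-time setting in which the model is periodically refit on all data seen so far.
  • Set max_train_size to cap the training set, which turns the expanding window into a rolling window. The test_size and gap parameters control the test block length and the number of samples skipped between train and test.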
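
TimeSeriesSplit also accepts test_size and gap alongside max_train_size. As a hedged illustration, the sketch below builds a rolling-window splitter on a toy array of 30 samples. The specific values (max_train_size=10, test_size=5, gap=2) are arbitrary choices for demonstration, and the test_size and gap arguments assume Scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 30 samples, assumed to be in chronological order
X_demo = np.arange(30).reshape(-1, 1)

# Rolling window: at most 10 training samples, 5 test samples per fold,
# and 2 samples skipped between the end of train and the start of test
rolling = TimeSeriesSplit(n_splits=3, max_train_size=10, test_size=5, gap=2)

windows = []
for train_idx, test_idx in rolling.split(X_demo):
    # Record (first train, last train, first test, last test) for each fold
    windows.append((int(train_idx[0]), int(train_idx[-1]),
                    int(test_idx[0]), int(test_idx[-1])))
    print(f"train {train_idx[0]}..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Each training window slides forward with the test block instead of growing, and the gap leaves unused samples between the two, which can help when features are built from lagged values.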

2. Access Splits

What is it?
Extracts the train and test indices from each fold using a for loop.

Syntax:

for train_idx, test_idx in tscv.split(X):
    print("Train indices:", train_idx, "Test indices:", test_idx)

Explanation:

  • Iterates over each fold and provides integer indices for slicing.
  • Ensures test data follows training data in time.
  • Very helpful for debugging, logging, and visualizing the sequence of training and testing.
  • Each iteration yields a larger training set, so a model refit inside the loop sees increasingly more history.
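
To make the fold structure concrete, here is a small sketch on a toy array of 12 samples with n_splits=3. The names X_demo, tscv_demo, and fold_sizes are illustrative only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 samples in time order
X_demo = np.arange(12).reshape(-1, 1)
tscv_demo = TimeSeriesSplit(n_splits=3)

fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(tscv_demo.split(X_demo), start=1):
    # Track how the training set grows while the test block stays the same length
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

With these settings the training set grows by one test block per fold while each test block keeps the same length.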

3. Train Model

What is it?
Fits the model to the current training set for the current fold.

Syntax:

model.fit(X[train_idx], y[train_idx])

Explanation:

  • Fits only on observations that precede the test block.
  • Refitting a fresh model in every fold shows how performance varies across time periods.
  • Works with any Scikit-learn estimator (LinearRegression, SVR, Ridge, etc.).
  • X[train_idx] assumes NumPy arrays; for pandas objects use X.iloc[train_idx] and y.iloc[train_idx].
  • For pipelines, use: pipeline.fit(X[train_idx], y[train_idx])
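
The indexing syntax above works on NumPy arrays. When the data lives in a pandas DataFrame with a date index, positional indexing with .iloc is needed instead. The sketch below assumes simulated daily data; X_df, y_ser, and boundaries are hypothetical names used only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

# Simulated daily data with a DatetimeIndex (values are random noise)
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=40, freq="D")
X_df = pd.DataFrame({"feature1": rng.normal(size=40),
                     "feature2": rng.normal(size=40)}, index=dates)
y_ser = pd.Series(rng.normal(size=40), index=dates, name="target")

boundaries = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_df):
    model = Ridge()
    # .iloc selects rows by position, which is what the splitter returns
    model.fit(X_df.iloc[train_idx], y_ser.iloc[train_idx])
    # Record the last training date and the first test date for inspection
    boundaries.append((X_df.index[train_idx[-1]], X_df.index[test_idx[0]]))

for last_train, first_test in boundaries:
    print("train ends", last_train.date(), "| test starts", first_test.date())
```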

4. Evaluate Model

What is it?
Generates predictions on the future (test) portion of the time series.

Syntax:

y_pred = model.predict(X[test_idx])

Explanation:

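  • Predicts every row in the test block. Whether these are one-step-ahead or multi-step forecasts depends on how the features and target were built (for example, lagged features with a shifted target).
  • Compare the predictions with y[test_idx] using evaluation metrics such as RMSE, MAE, or MAPE.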
  • Important for simulating how a deployed model would perform on unseen data.
  • You can log each fold’s score or average them at the end.
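
The per-fold scoring described above might look like the following sketch. It assumes simulated data with a weak linear signal, and it computes RMSE as the square root of mean_squared_error so that it does not depend on any version-specific RMSE helper. All variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Simulated data: linear signal plus small noise (an assumption for this sketch)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 2))
y_demo = X_demo @ np.array([0.5, -0.3]) + rng.normal(scale=0.1, size=120)

rmse_scores, mae_scores = [], []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X_demo):
    model = Ridge().fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = model.predict(X_demo[test_idx])
    # Score each fold against the held-out later block
    rmse_scores.append(float(np.sqrt(mean_squared_error(y_demo[test_idx], y_pred))))
    mae_scores.append(float(mean_absolute_error(y_demo[test_idx], y_pred)))

print("RMSE per fold:", np.round(rmse_scores, 4))
print("MAE per fold: ", np.round(mae_scores, 4))
print("Mean RMSE:", round(float(np.mean(rmse_scores)), 4))
```

Looking at the individual fold scores as well as the mean helps show whether the error drifts over time.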

5. Visualization of Splits

What is it?
Optional step to plot how splits are formed over time.

Syntax:

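import matplotlib.pyplot as plt
# One row per fold: x = sample index, y = fold number
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    plt.plot(train_idx, [fold] * len(train_idx), 'b.', label='Train' if fold == 0 else None)
    plt.plot(test_idx, [fold] * len(test_idx), 'r.', label='Test' if fold == 0 else None)
plt.xlabel('Sample index')
plt.legend()
plt.show()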

Explanation:

  • Great for checking how data is partitioned visually.
  • Confirms model never sees future data during training.
  • Helps ensure correct fold structure for reproducibility.
  • Can reveal issues like short test folds or improper sequences.
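
When a plotting backend is not available, the same check can be done in plain text. This is only a rough sketch, assuming 24 samples and 3 splits; the characters "T" (train), "V" (test), and "." (unused) are arbitrary choices for this example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 24
rows = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(np.zeros(n_samples)):
    row = ["."] * n_samples          # "." marks samples not used in this fold
    for i in train_idx:
        row[i] = "T"                 # training samples
    for i in test_idx:
        row[i] = "V"                 # test samples
    rows.append("".join(row))

print("\n".join(rows))
```

Each printed row is one fold, read left to right in time, so any "T" appearing to the right of a "V" would signal a broken split.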

Real-Life Project: Time Series CV for Stock Forecasting

Project Name

Sequential Cross-Validation for Stock Returns

Project Overview

Use time series split to validate a regression model predicting next-day stock returns.

Project Goal

Implement walk-forward validation using TimeSeriesSplit.

Code for This Project

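import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Simulated data (pure noise, seeded for reproducibility); rows are assumed
# to be in chronological order, one row per trading day
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'feature1': rng.standard_normal(100),
    'feature2': rng.standard_normal(100),
    'target': rng.standard_normal(100)
})

X = df[['feature1', 'feature2']].values
y = df['target'].values

# Initialize split: 5 expanding-window folds
tscv = TimeSeriesSplit(n_splits=5)

# Run CV: refit a fresh model on each training window, score on the next block
fold_mse = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = Ridge()
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    mse = mean_squared_error(y[test_idx], y_pred)
    fold_mse.append(mse)
    print(f"Fold {fold + 1} MSE: {mse:.4f}")

# Summarize across folds
print(f"Mean MSE: {np.mean(fold_mse):.4f}")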

Expected Output

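  • One MSE value printed per fold (five lines for n_splits=5)
  • Because the simulated target is random noise, the MSE values stay near the variance of that noise (roughly 1) and differ from fold to fold; with real data, compare the fold scores to judge how stable performance is over time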
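
The same walk-forward evaluation can usually be written more compactly by passing the splitter to cross_val_score. The sketch below is one possible equivalent, again on simulated noise; X_demo, y_demo, and mse_per_fold are illustrative names.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Simulated data (random noise), assumed to be in chronological order
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = rng.normal(size=100)

# cross_val_score clones and refits the estimator on each training window
scores = cross_val_score(
    Ridge(), X_demo, y_demo,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
)
mse_per_fold = -scores  # the scorer returns negated MSE, so flip the sign

print("MSE per fold:", np.round(mse_per_fold, 4))
```

The explicit loop remains useful when per-fold predictions, fitted models, or custom logging are needed.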

Common Mistakes to Avoid

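  • ❌ Using KFold (especially with shuffle=True) instead of TimeSeriesSplit
  • ❌ Training on future data (data leakage), including features computed from values that lie after the prediction time
  • ❌ Fitting scalers or other preprocessing on the full dataset before splitting; fit them on each training fold only, then apply them to the test fold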
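
One common way to avoid the scaling mistake is to wrap preprocessing and model in a Pipeline, so the scaler is refit on each training fold. The following is a minimal sketch under that assumption, using simulated data; the check on scaler.mean_ is included only to show which rows the scaler was fit on.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data with a non-zero mean so that scaling matters
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(60, 2))
y_demo = rng.normal(size=60)

pipe = make_pipeline(StandardScaler(), Ridge())

scaler_uses_train_only = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X_demo):
    # Fitting the pipeline fits the scaler on the training rows only
    pipe.fit(X_demo[train_idx], y_demo[train_idx])
    scaler = pipe.named_steps["standardscaler"]
    scaler_uses_train_only.append(
        bool(np.allclose(scaler.mean_, X_demo[train_idx].mean(axis=0)))
    )
    # predict() applies the training-fold scaling to the test rows
    y_pred = pipe.predict(X_demo[test_idx])

print("Scaler fit on training fold only:", all(scaler_uses_train_only))
```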
