Traditional k-fold cross-validation is not suitable for time series data because observations are temporally dependent: folds chosen without regard to order let the model train on data recorded after the test period. Instead, Scikit-learn provides TimeSeriesSplit, a strategy that preserves order and prevents leakage by ensuring that the training set always precedes the test set chronologically.
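To make the leakage argument concrete, here is a minimal sketch that compares KFold and TimeSeriesSplit on a toy series. The 12-sample array and the helper name trains_on_future are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Toy series: 12 observations assumed to be sorted by time already.
X = np.arange(12).reshape(-1, 1)

def trains_on_future(cv, X):
    """Per fold: does any training index come after the first test index?"""
    return [bool(train_idx.max() > test_idx.min())
            for train_idx, test_idx in cv.split(X)]

kfold_leaks = trains_on_future(KFold(n_splits=3), X)
tscv_leaks = trains_on_future(TimeSeriesSplit(n_splits=3), X)

print("KFold trains on future data:          ", kfold_leaks)
print("TimeSeriesSplit trains on future data:", tscv_leaks)
```

Even without shuffling, two of the three KFold folds train on observations that come after the test block, while TimeSeriesSplit never does.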
Key Characteristics
- Maintains chronological order in splits
- Avoids training on future data
- Useful for evaluating time-based model stability
- Supports consistent model validation in rolling or expanding windows
Basic Rules
- Always split sequentially, not randomly
- Training set must precede test set
- Use consistent time intervals
- Ideal for univariate and multivariate time series tasks
Syntax Table
SL NO | Technique | Syntax Example | Description |
---|---|---|---|
1 | Initialize split | TimeSeriesSplit(n_splits=5) | Creates time-ordered cross-validation sets |
2 | Access splits | for train_idx, test_idx in tscv.split(X): ... | Iterates through each CV fold |
3 | Train model | model.fit(X[train_idx], y[train_idx]) | Trains on training portion of each split |
4 | Evaluate model | model.predict(X[test_idx]) | Evaluates on time-valid test data |
5 | Visualization | plt.plot(train_idx), plt.plot(test_idx) | Useful to understand fold composition |
Syntax Explanation
1. Initialize TimeSeriesSplit
What is it?
Creates a cross-validator that provides train/test indices in time-ordered folds.
Syntax:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
Explanation:
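- n_splits=5 produces 5 sequential train/test splits. The data is cut into n_splits + 1 consecutive blocks, and the first block is used only for training.
- Each new fold adds more data to the training set, so by default the training window expands.
- This mimics real-time prediction, where a model is refit as more history becomes available.
- You can also set the max_train_size parameter to cap how large the training set grows, which turns the expanding window into a rolling one.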
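The difference between the default expanding window and a capped window can be checked by counting training samples per fold. This is a small sketch on toy data; the array size and the cap of 4 are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered samples (toy data, chosen so fold sizes are easy to read).
X = np.arange(12).reshape(-1, 1)

def train_sizes(cv, X):
    """Number of training samples in each fold."""
    return [len(train_idx) for train_idx, _ in cv.split(X)]

# Default: the training window keeps growing.
expanding = train_sizes(TimeSeriesSplit(n_splits=3), X)
# With max_train_size, only the most recent samples are kept for training.
rolling = train_sizes(TimeSeriesSplit(n_splits=3, max_train_size=4), X)

print("Expanding window train sizes:", expanding)  # [3, 6, 9]
print("Capped window train sizes:   ", rolling)    # [3, 4, 4]
```

A cap is worth considering when old observations are no longer representative, though whether it helps depends on the data.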
2. Access Splits
What is it?
Extracts the train and test indices from each fold using a for loop.
Syntax:
for train_idx, test_idx in tscv.split(X):
    print("Train indices:", train_idx, "Test indices:", test_idx)
Explanation:
- Iterates over each fold and provides integer indices for slicing.
- Ensures test data follows training data in time.
- Very helpful for debugging, logging, and visualizing the sequence of training and testing.
- Each successive fold's training set contains more history than the previous one.
3. Train Model
What is it?
Fits the model to the current training set for the current fold.
Syntax:
model.fit(X[train_idx], y[train_idx])
Explanation:
- Ensures training is only done using past observations.
- This loop enables robust evaluation of model performance across various time splits.
- Supports all Scikit-learn estimators (LinearRegression, SVR, Ridge, etc.)
- For pipelines, use:
pipeline.fit(X[train_idx], y[train_idx])
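The X[train_idx] syntax above assumes NumPy arrays. If X is a pandas DataFrame, X[train_idx] would be read as a column selection rather than a row selection, so positional indexing with .iloc is the usual alternative. The sketch below assumes hypothetical names X_df and y_s and random toy data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

# Toy DataFrame; rows are assumed to be in chronological order.
rng = np.random.default_rng(0)
X_df = pd.DataFrame({"feature1": rng.normal(size=30),
                     "feature2": rng.normal(size=30)})
y_s = pd.Series(rng.normal(size=30), name="target")

tscv = TimeSeriesSplit(n_splits=4)
fitted_on = []
for train_idx, test_idx in tscv.split(X_df):
    # split() yields integer positions, so select rows positionally with .iloc
    X_train, y_train = X_df.iloc[train_idx], y_s.iloc[train_idx]
    model = Ridge().fit(X_train, y_train)
    fitted_on.append(len(X_train))

print("Training rows per fold:", fitted_on)  # [6, 12, 18, 24]
```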
4. Evaluate Model
What is it?
Generates predictions on the future (test) portion of the time series.
Syntax:
y_pred = model.predict(X[test_idx])
Explanation:
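- The model is fit once per fold and predicts every observation in the test block. Whether that amounts to one-step-ahead or multi-step forecasting depends on how the features and target are constructed.
- Compare predictions with y[test_idx] using evaluation metrics like RMSE, MAE, or MAPE.
- Important for simulating how a deployed model would perform on unseen data.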
- You can log each fold’s score or average them at the end.
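As an illustration of logging per-fold scores and averaging them, the following sketch computes RMSE and MAE for each fold. The synthetic data (a linear signal plus noise) and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, time-ordered data: linear signal plus a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 2))
y = X @ np.array([0.5, -0.2]) + rng.normal(scale=0.1, size=60)

tscv = TimeSeriesSplit(n_splits=5)
rmse_scores, mae_scores = [], []
for train_idx, test_idx in tscv.split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    # RMSE is the square root of the mean squared error.
    rmse_scores.append(float(np.sqrt(mean_squared_error(y[test_idx], y_pred))))
    mae_scores.append(float(mean_absolute_error(y[test_idx], y_pred)))

print("RMSE per fold:", np.round(rmse_scores, 3))
print("Mean RMSE:", round(float(np.mean(rmse_scores)), 3))
print("Mean MAE: ", round(float(np.mean(mae_scores)), 3))
```

Looking at the per-fold values as well as the mean is useful, since a single poor fold can point to a period where the model breaks down.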
5. Visualization of Splits
What is it?
Optional step to plot how splits are formed over time.
Syntax:
import matplotlib.pyplot as plt
# train_idx and test_idx are the index arrays of a single fold from tscv.split(X)
plt.plot(train_idx, label='Train')
plt.plot(test_idx, label='Test')
plt.legend()
plt.show()
Explanation:
- Great for checking how data is partitioned visually.
- Confirms model never sees future data during training.
- Helps ensure correct fold structure for reproducibility.
- Can reveal issues like short test folds or improper sequences.
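When a plotting backend is not available, a text map of the folds gives the same sanity check. This is a sketch only; the helper fold_map and its T/V/. symbols are invented for this example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def fold_map(n_samples, n_splits):
    """One string per fold: 'T' = train, 'V' = test, '.' = not used in that fold."""
    rows = []
    cv = TimeSeriesSplit(n_splits=n_splits)
    # Only the number of samples matters for splitting, so a dummy array is enough.
    for train_idx, test_idx in cv.split(np.zeros((n_samples, 1))):
        row = np.full(n_samples, ".")
        row[train_idx] = "T"
        row[test_idx] = "V"
        rows.append("".join(row))
    return rows

rows = fold_map(n_samples=12, n_splits=3)
for i, row in enumerate(rows, start=1):
    print(f"Fold {i}: {row}")
# Fold 1: TTTVVV......
# Fold 2: TTTTTTVVV...
# Fold 3: TTTTTTTTTVVV
```

Reading left to right as time, every V block sits strictly to the right of the T block in the same row.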
Real-Life Project: Time Series CV for Stock Forecasting
Project Name
Sequential Cross-Validation for Stock Returns
Project Overview
Use time series split to validate a regression model predicting next-day stock returns.
Project Goal
Implement walk-forward validation using TimeSeriesSplit.
Code for This Project
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
# Simulated data: rows are assumed to be in chronological order
df = pd.DataFrame({
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
    'target': np.random.randn(100)
})
X = df[['feature1', 'feature2']].values
y = df['target'].values
# Initialize split
tscv = TimeSeriesSplit(n_splits=5)
# Run CV: fit a fresh model on each training window, score it on the block that follows
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = Ridge()
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    print(f"Fold {fold + 1} MSE:", mean_squared_error(y[test_idx], y_pred))
Expected Output
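- One MSE value printed per fold (five lines for n_splits=5)
- Because the simulated features and target are independent random noise, the values change from run to run and carry no signal. On real data, similar scores across folds suggest the model is stable over time.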
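The manual loop in the project can also be expressed with cross_val_score by passing the splitter as cv. The sketch below uses its own simulated data with a fixed seed, which is an assumption added here for reproducibility.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Simulated, time-ordered data (pure noise, as in the project above).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.normal(size=100)

tscv = TimeSeriesSplit(n_splits=5)
# Scikit-learn scorers follow "higher is better", so MSE is returned negated.
neg_mse = cross_val_score(Ridge(), X, y, cv=tscv,
                          scoring="neg_mean_squared_error")
mse_per_fold = -neg_mse

for fold, mse in enumerate(mse_per_fold, start=1):
    print(f"Fold {fold} MSE: {mse:.3f}")
print(f"Mean MSE: {mse_per_fold.mean():.3f}")
```

The explicit loop remains handy when you need the predictions themselves or several metrics per fold.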
Common Mistakes to Avoid
- ❌ Using KFold instead of TimeSeriesSplit
- ❌ Training on future data (data leakage)
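- ❌ Fitting scalers on the full dataset before splitting. If scaling is needed, fit the scaler on each training fold only and then apply it to the test fold.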
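One way to avoid the scaling mistake is to put the scaler inside a pipeline, so it is refit on each training fold. This is a minimal sketch under assumed toy data with a drifting mean; the check at the end confirms the scaler only saw training rows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data whose mean drifts over time, so full-data scaling would leak.
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(40, 2)) + np.linspace(0, 3, 40)[:, None]
y = rng.normal(size=40)

scaler_uses_train_only = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    # A fresh pipeline per fold: the scaler is fit on the training rows only.
    pipe = make_pipeline(StandardScaler(), Ridge())
    pipe.fit(X[train_idx], y[train_idx])
    learned_mean = pipe.named_steps["standardscaler"].mean_
    scaler_uses_train_only.append(
        bool(np.allclose(learned_mean, X[train_idx].mean(axis=0)))
    )
    # predict() reuses the training-fold statistics to scale the test rows.
    y_pred = pipe.predict(X[test_idx])

print("Scaler fit on training rows only:", scaler_uses_train_only)
```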