Working with Time Series Data in Scikit-learn

Scikit-learn is primarily designed for tabular data and doesn’t natively support time series analysis. However, it can be adapted for time series forecasting and classification by carefully managing data splits and engineering lag-based features. For advanced time series tasks, integration with pandas, statsmodels, or sktime is common.

Key Characteristics

  • Supports time series prediction once the data is reframed as a supervised learning problem
  • Requires lag feature creation
  • Must avoid data leakage with proper temporal splits
  • Compatible with scikit-learn pipelines (a minimal pipeline sketch follows this list)
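
Pipeline compatibility means the usual composition tools apply once lag features exist. Below is a minimal sketch, assuming hypothetical lag-feature sets X_train, y_train, and X_test (built as described later in this section):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# The scaler is fitted inside the pipeline, so it only ever learns from training data
pipe = Pipeline([('scale', StandardScaler()), ('model', Ridge())])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)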

Basic Rules

  • Never randomly split time series data (use chronological split)
  • Create lagged features to convert time series to supervised format
  • Always scale after splitting, fitting the scaler on the training data only, to avoid leakage (see the sketch after this list)
  • Consider time-aware cross-validation (e.g., TimeSeriesSplit)
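
As a minimal sketch of the scaling rule above (assuming X_train and X_test already come from a chronological split), fit the scaler on the training portion only and reuse it on the test portion:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from past data only
X_test_scaled = scaler.transform(X_test)        # later data is only transformed, never fitted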

Syntax Table

SL NO | Technique            | Syntax Example                           | Description
1     | Chronological split  | train_test_split(data, shuffle=False)    | Maintains time order
2     | Lag feature creation | df['lag1'] = df['value'].shift(1)        | Adds a lagged version of a feature
3     | TimeSeriesSplit      | TimeSeriesSplit(n_splits=5)              | Splits time series for cross-validation
4     | Model training       | model.fit(X_train, y_train)              | Trains a model on lagged features
5     | Forecasting          | model.predict(X_test)                    | Predicts future values from past features

Syntax Explanation

1. Chronological Split

What is it?
Splits data in a way that respects temporal order. This is critical to prevent data leakage and ensure the model doesn’t learn from future information.

Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, shuffle=False)

Explanation:

  • Ensures training happens only on past data
  • Maintains sequence integrity for forecasting
  • Avoids shuffling, which would break temporal relationships (an index-based version of the same split is sketched below)
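
An equivalent index-based split, sketched here under the assumption that X and y are already sorted by time:

# Keep the first 80% of observations for training and the rest for testing
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]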

2. Lag Feature Creation

What is it?
Creates columns with shifted values of the original time series to simulate previous time steps.

Syntax:

df['lag1'] = df['value'].shift(1)

Explanation:

  • Converts time series into a supervised learning dataset
  • Lagged values act as predictors
  • Can add multiple lags (lag2, lag3) for richer context (see the sketch after this list)
  • Be sure to drop NaN rows created by shifting
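
A minimal sketch of adding several lags at once, assuming a DataFrame df with a 'value' column as in the syntax above:

# Add three lagged copies of the series, then drop rows made incomplete by shifting
for k in (1, 2, 3):
    df[f'lag{k}'] = df['value'].shift(k)
df = df.dropna()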

3. Time Series Cross-Validation

What is it?
Implements cross-validation with folds that respect temporal order: each successive fold trains on progressively more past data and tests on the block that follows. Useful when testing model consistency over time.

Syntax:

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Positional indexing assumes NumPy arrays; use X.iloc[train_idx] for DataFrames
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])

Explanation:

  • Each split trains on older data and tests on newer data
  • No future data leaks into the past
  • Useful for evaluating stability across time periods (see the cross-validation sketch after this list)
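
The same splitter can also be passed straight to scikit-learn’s scoring helpers. The sketch below assumes X, y, and model are defined as in the loop above:

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=5)
# One score per fold; each fold tests on data newer than its training block
scores = cross_val_score(model, X, y, cv=tscv, scoring='neg_mean_squared_error')
print(scores.mean())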

4. Model Training

What is it?
Fits a model on the lagged feature training set to learn time-based patterns.

Syntax:

model.fit(X_train, y_train)

Explanation:

  • Trains on features representing past values
  • Supports any supervised model: LinearRegression, RandomForest, etc. (see the sketch after this list)
  • Model learns how current outcomes relate to previous inputs
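
For example, a tree-based regressor can be dropped in with no other changes; this sketch assumes the lagged training set X_train, y_train from earlier:

from sklearn.ensemble import RandomForestRegressor

# Any estimator with fit/predict works on the lagged features
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)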

5. Forecasting Future Values

What is it?
Uses the trained model to generate predictions for future steps.

Syntax:

y_pred = model.predict(X_test)

Explanation:

  • Produces future values based on previously observed patterns
  • Often evaluated using metrics like MSE, MAE, RMSE
  • Can be extended to multi-step forecasting with recursive methods (sketched below)
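
A rough sketch of recursive multi-step forecasting with a single-lag model (assuming model was trained on one feature column named 'lag1', as in the project below, and y_train is a pandas Series): each prediction is fed back in as the next step’s lag.

import pandas as pd

last_value = float(y_train.iloc[-1])          # most recent observed target value
forecasts = []
for _ in range(7):                            # forecast 7 steps ahead
    X_next = pd.DataFrame({'lag1': [last_value]})
    next_value = float(model.predict(X_next)[0])
    forecasts.append(next_value)
    last_value = next_value                   # feed the prediction back in as the new lag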

Real-Life Project: Time Series Forecasting with Lag Features

Project Name

Daily Temperature Prediction

Project Overview

Forecast the next day’s temperature from the previous day’s temperature.

Project Goal

Train a linear regression model using lag features.

Code for This Project

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Simulate 100 days of temperature readings
dates = pd.date_range(start='2023-01-01', periods=100)
temps = np.random.normal(loc=25, scale=3, size=100)
df = pd.DataFrame({'date': dates, 'temp': temps})

# The previous day's temperature becomes the predictor; drop the first row (no lag available)
df['lag1'] = df['temp'].shift(1)
df = df.dropna()

X = df[['lag1']]
y = df['temp']

# shuffle=False keeps the split chronological: train on earlier days, test on later ones
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

Expected Output

  • Mean Squared Error for one-step-ahead forecasting
  • Demonstrates lag-based supervised learning pipeline

Common Mistakes to Avoid

  • ❌ Random shuffling of time series data
  • ❌ Using future information in lag features
  • ❌ Ignoring stationarity assumptions

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon