Working with Time Series Data in Scikit-learn

Scikit-learn is primarily designed for tabular data and doesn’t natively support time series analysis. However, it can be adapted for time series forecasting and classification by carefully managing data splits and engineering lag-based features. For advanced time series tasks, integration with pandas, statsmodels, or sktime is common.

Key Characteristics

  • Supports time series prediction once the data is reframed as a supervised learning problem
  • Requires lag feature creation
  • Must avoid data leakage with proper temporal splits
  • Compatible with scikit-learn pipelines (a minimal pipeline sketch follows this list)
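
Pipeline compatibility means the usual composition tools apply once lag features exist. Below is a minimal sketch, assuming hypothetical lag-feature sets X_train, y_train, and X_test (built as described later in this section):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# The scaler is fitted inside the pipeline, so it only ever learns from training data
pipe = Pipeline([('scale', StandardScaler()), ('model', Ridge())])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)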

Basic Rules

  • Never randomly split time series data (use chronological split)
  • Create lagged features to convert time series to supervised format
  • Always scale after splitting, fitting the scaler on the training data only, to avoid leakage (see the sketch after this list)
  • Consider time-aware cross-validation (e.g., TimeSeriesSplit)
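
As a minimal sketch of the scaling rule above (assuming X_train and X_test already come from a chronological split), fit the scaler on the training portion only and reuse it on the test portion:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from past data only
X_test_scaled = scaler.transform(X_test)        # later data is only transformed, never fitted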

Syntax Table

SL NO | Technique            | Syntax Example                           | Description
1     | Chronological split  | train_test_split(data, shuffle=False)    | Maintains time order
2     | Lag feature creation | df['lag1'] = df['value'].shift(1)        | Adds a lagged version of a feature
3     | TimeSeriesSplit      | TimeSeriesSplit(n_splits=5)              | Splits time series for cross-validation
4     | Model training       | model.fit(X_train, y_train)              | Trains a model on lagged features
5     | Forecasting          | model.predict(X_test)                    | Predicts future values from past features

Syntax Explanation

1. Chronological Split

What is it?
Splits data in a way that respects temporal order. This is critical to prevent data leakage and ensure the model doesn’t learn from future information.

Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, shuffle=False)

Explanation:

  • Ensures training happens only on past data
  • Maintains sequence integrity for forecasting
  • Avoids shuffling, which would break temporal relationships (an index-based version of the same split is sketched below)
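
An equivalent index-based split, sketched here under the assumption that X and y are already sorted by time:

# Keep the first 80% of observations for training and the rest for testing
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]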

2. Lag Feature Creation

What is it?
Creates columns with shifted values of the original time series to simulate previous time steps.

Syntax:

df['lag1'] = df['value'].shift(1)

Explanation:

  • Converts time series into a supervised learning dataset
  • Lagged values act as predictors
  • Can add multiple lags (lag2, lag3) for richer context (see the sketch after this list)
  • Be sure to drop NaN rows created by shifting
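
A minimal sketch of adding several lags at once, assuming a DataFrame df with a 'value' column as in the syntax above:

# Add three lagged copies of the series, then drop rows made incomplete by shifting
for k in (1, 2, 3):
    df[f'lag{k}'] = df['value'].shift(k)
df = df.dropna()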

3. Time Series Cross-Validation

What is it?
Implements cross-validation with folds that respect temporal order: each successive fold trains on progressively more past data and tests on the block that follows. Useful when testing model consistency over time.

Syntax:

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Positional indexing assumes NumPy arrays; use X.iloc[train_idx] for DataFrames
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])

Explanation:

  • Each split trains on older data and tests on newer data
  • No future data leaks into the past
  • Useful for evaluating stability across time periods (see the cross-validation sketch after this list)
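
The same splitter can also be passed straight to scikit-learn’s scoring helpers. The sketch below assumes X, y, and model are defined as in the loop above:

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=5)
# One score per fold; each fold tests on data newer than its training block
scores = cross_val_score(model, X, y, cv=tscv, scoring='neg_mean_squared_error')
print(scores.mean())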

4. Model Training

What is it?
Fits a model on the lagged feature training set to learn time-based patterns.

Syntax:

model.fit(X_train, y_train)

Explanation:

  • Trains on features representing past values
  • Supports any supervised model: LinearRegression, RandomForest, etc. (see the sketch after this list)
  • Model learns how current outcomes relate to previous inputs
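
For example, a tree-based regressor can be dropped in with no other changes; this sketch assumes the lagged training set X_train, y_train from earlier:

from sklearn.ensemble import RandomForestRegressor

# Any estimator with fit/predict works on the lagged features
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)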

5. Forecasting Future Values

What is it?
Uses the trained model to generate predictions for future steps.

Syntax:

y_pred = model.predict(X_test)

Explanation:

  • Produces future values based on previously observed patterns
  • Often evaluated using metrics like MSE, MAE, RMSE
  • Can be extended to multi-step forecasting with recursive methods (sketched below)
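
A rough sketch of recursive multi-step forecasting with a single-lag model (assuming model was trained on one feature column named 'lag1', as in the project below, and y_train is a pandas Series): each prediction is fed back in as the next step’s lag.

import pandas as pd

last_value = float(y_train.iloc[-1])          # most recent observed target value
forecasts = []
for _ in range(7):                            # forecast 7 steps ahead
    X_next = pd.DataFrame({'lag1': [last_value]})
    next_value = float(model.predict(X_next)[0])
    forecasts.append(next_value)
    last_value = next_value                   # feed the prediction back in as the new lag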

Real-Life Project: Time Series Forecasting with Lag Features

Project Name

Daily Temperature Prediction

Project Overview

Forecast the next day’s temperature from the previous day’s temperature.

Project Goal

Train a linear regression model using lag features.

Code for This Project

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Simulate 100 days of temperature readings
dates = pd.date_range(start='2023-01-01', periods=100)
temps = np.random.normal(loc=25, scale=3, size=100)
df = pd.DataFrame({'date': dates, 'temp': temps})

# The previous day's temperature becomes the predictor; drop the first row (no lag available)
df['lag1'] = df['temp'].shift(1)
df = df.dropna()

X = df[['lag1']]
y = df['temp']

# shuffle=False keeps the split chronological: train on earlier days, test on later ones
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

Expected Output

  • Mean Squared Error for one-step-ahead forecasting
  • Demonstrates lag-based supervised learning pipeline

Common Mistakes to Avoid

  • ❌ Random shuffling of time series data
  • ❌ Using future information in lag features
  • ❌ Ignoring stationarity assumptions

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon