Scikit-learn is primarily designed for tabular data and doesn't natively support time series analysis. However, it can be adapted for time series forecasting and classification by carefully managing data splits and feature engineering. For advanced time series tasks, integration with pandas, statsmodels, or sktime is common.
Key Characteristics
- Supports time series prediction using supervised learning format
- Requires lag feature creation
- Must avoid data leakage with proper temporal splits
- Compatible with scikit-learn pipelines
Basic Rules
- Never randomly split time series data (use chronological split)
- Create lagged features to convert time series to supervised format
- Always scale after splitting to avoid leakage
- Consider time-aware cross-validation (e.g., TimeSeriesSplit)
Syntax Table
| SL NO | Technique | Syntax Example | Description |
|---|---|---|---|
| 1 | Chronological split | `train_test_split(data, shuffle=False)` | Maintains time order |
| 2 | Lag feature creation | `df['lag1'] = df['value'].shift(1)` | Adds lagged version of a feature |
| 3 | TimeSeriesSplit | `TimeSeriesSplit(n_splits=5)` | Splits time series for cross-validation |
| 4 | Model training | `model.fit(X_train, y_train)` | Trains model on lagged features |
| 5 | Forecasting | `model.predict(X_test)` | Predicts future values from past features |
Syntax Explanation
1. Chronological Split
What is it?
Splits data in a way that respects temporal order. This is critical to prevent data leakage and ensure the model doesn’t learn from future information.
Syntax:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)
Explanation:
- Ensures training happens only on past data
- Maintains sequence integrity for forecasting
- Avoids shuffling which would break time relationships
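The split above can be sketched on a tiny synthetic series (the data here is illustrative, not from the project later in this section):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten observations in chronological order
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# shuffle=False preserves the original order, so the test set
# is strictly later in time than the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)

print(X_train.ravel())  # first 7 observations: [0 1 2 3 4 5 6]
print(X_test.ravel())   # last 3 observations:  [7 8 9]
```

With `shuffle=True` (the default), future observations would leak into the training set, which inflates evaluation scores.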
2. Lag Feature Creation
What is it?
Creates columns with shifted values of the original time series to simulate previous time steps.
Syntax:
df['lag1'] = df['value'].shift(1)
Explanation:
- Converts time series into a supervised learning dataset
- Lagged values act as predictors
- Can add multiple lags (lag2, lag3) for better context
- Be sure to drop NaN rows created by shifting
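A minimal sketch of adding several lags and dropping the NaN rows that shifting creates (the column names and sample values here are illustrative):

```python
import pandas as pd

# Toy series of six observations
df = pd.DataFrame({'value': [10, 12, 11, 13, 14, 15]})

# Add lag1..lag3 for more temporal context
for k in (1, 2, 3):
    df[f'lag{k}'] = df['value'].shift(k)

# The first three rows now contain NaN (no earlier history); drop them
df = df.dropna()
print(df)
```

Each remaining row pairs the current `value` with the three preceding observations, which is exactly the supervised format scikit-learn models expect.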
3. Time Series Cross-Validation
What is it?
Implements k-fold validation where the folds maintain time sequence. Useful when testing model consistency over time.
Syntax:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
Explanation:
- Each split trains on older data and tests on newer data
- No future data leaks into the past
- Useful for evaluating stability across time periods
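Printing the fold indices makes the expanding-window behavior concrete; this sketch uses a small array purely for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Training indices always precede test indices in time
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

For 12 samples and 3 splits, the training window grows (`[0..2]`, then `[0..5]`, then `[0..8]`) while each test window is the next block of 3 observations, so no fold ever tests on the past.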
4. Model Training
What is it?
Fits a model on the lagged feature training set to learn time-based patterns.
Syntax:
model.fit(X_train, y_train)
Explanation:
- Trains on features representing past values
- Supports any supervised model: LinearRegression, RandomForest, etc.
- Model learns how current outcomes relate to previous inputs
5. Forecasting Future Values
What is it?
Uses the trained model to generate predictions for future steps.
Syntax:
y_pred = model.predict(X_test)
Explanation:
- Produces future values based on previously observed patterns
- Often evaluated using metrics like MSE, MAE, RMSE
- Can be extended for multi-step forecasting using recursive methods
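One way to do the recursive multi-step forecasting mentioned above is to feed each prediction back in as the next lag feature. This is a minimal sketch on an artificially linear series, not a production forecasting loop:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A simple trending series; lag-1 values predict the next step
series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = series[:-1].reshape(-1, 1)  # lag-1 predictors
y = series[1:]                  # next-step targets

model = LinearRegression().fit(X, y)

# Recursive forecast: each prediction becomes the next input
last = series[-1]
forecasts = []
for _ in range(3):
    last = model.predict([[last]])[0]
    forecasts.append(last)
print(forecasts)  # approximately [7.0, 8.0, 9.0] for this linear trend
```

Note that errors compound at each step, so recursive forecasts degrade as the horizon grows; direct multi-step models are a common alternative.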
Real-Life Project: Time Series Forecasting with Lag Features
Project Name
Daily Temperature Prediction
Project Overview
Forecast the next day’s temperature using previous day temperatures.
Project Goal
Train a linear regression model using lag features.
Code for This Project
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Simulate time series data (seeded for reproducibility)
np.random.seed(42)
dates = pd.date_range(start='2023-01-01', periods=100)
temps = np.random.normal(loc=25, scale=3, size=100)
df = pd.DataFrame({'date': dates, 'temp': temps})
df['lag1'] = df['temp'].shift(1)
df = df.dropna()
X = df[['lag1']]
y = df['temp']
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
Expected Output
- Mean squared error for one-step-ahead forecasting
- Demonstrates lag-based supervised learning pipeline
Common Mistakes to Avoid
- ❌ Random shuffling of time series data
- ❌ Using future information in lag features
- ❌ Ignoring stationarity assumptions