Lag features and rolling statistics are powerful tools in time series forecasting. While scikit-learn doesn't provide these natively, they can be engineered with pandas before the data is fed into a model. These features help capture temporal dependencies, seasonality, and trends.
Key Characteristics
- Lag features represent past observations
- Rolling means smooth short-term fluctuations
- Used in feature engineering for regression and classification
- Improves model context over time
Basic Rules
- Build every feature from past values only to prevent leakage; when the current value is the target, apply shift(1) before rolling
- Drop NaNs after applying lag/rolling
- Combine multiple lags and windows for better performance
- Can be used in a pipeline with FunctionTransformer
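The last rule can be sketched as follows. This is a minimal illustration, not a definitive pattern: the helper `add_rolling_features`, the column names, and the simulated series are all assumptions made for the example. The lag column is built before the pipeline, and the transformer uses `min_periods=1` so that it never drops rows, since dropping rows inside a pipeline would misalign X and y.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def add_rolling_features(frame):
    """Hypothetical helper: add a rolling mean of the already-lagged column."""
    out = frame[["lag1"]].copy()
    # min_periods=1 avoids NaN in the first rows, so no rows need dropping
    # inside the pipeline.
    out["roll_mean_3"] = frame["lag1"].rolling(3, min_periods=1).mean()
    return out


# Simulated series (assumption); lag1 is built and its NaN row dropped up front.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(size=50).cumsum()})
df["lag1"] = df["value"].shift(1)
df = df.dropna()

X, y = df[["lag1"]], df["value"]
model = make_pipeline(FunctionTransformer(add_rolling_features), LinearRegression())
model.fit(X, y)
preds = model.predict(X)
print(preds[:3])
```

Because FunctionTransformer is stateless, the rolling window restarts at the first row of whatever frame is passed to `predict`, so a test frame should either include a few preceding rows of history or have its features built on the full series first.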
Syntax Table
SL No | Technique | Syntax Example | Description |
---|---|---|---|
1 | Lag feature | df['lag1'] = df['value'].shift(1) | Adds previous time step as feature |
2 | Multiple lags | df['lag3'] = df['value'].shift(3) | Adds value from 3 steps back |
3 | Rolling mean | df['roll_mean_3'] = df['value'].rolling(3).mean() | Computes 3-step moving average |
4 | Rolling std | df['roll_std_5'] = df['value'].rolling(5).std() | Rolling standard deviation |
5 | Drop NaNs | df = df.dropna() | Removes rows with missing values |
Syntax Explanation
1. Lag Feature
What is it?
Adds a new column that contains the value from one time step ago. This helps the model learn temporal dependencies between observations.
Syntax:
df['lag1'] = df['value'].shift(1)
Explanation:
- Shifts the original series by 1 row to align each observation with its prior value.
- Essential for converting time series to supervised learning.
- Can stack several lags to build memory into the model.
- Watch out for NaN at the start, which must be removed before training.
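As a small worked example of the shift described above, the sketch below uses a made-up five-value series (the numbers are illustrative only) and shows where the NaN appears.

```python
import pandas as pd

# Toy data, chosen only for illustration
df = pd.DataFrame({"value": [10, 12, 11, 15, 14]})

# Each row now carries the previous row's value
df["lag1"] = df["value"].shift(1)
print(df)
# lag1 is [NaN, 10.0, 12.0, 11.0, 15.0]; the column becomes float because of NaN
```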
2. Multiple Lags
What is it?
Creates additional lagged features with greater gaps to capture longer temporal effects.
Syntax:
df['lag3'] = df['value'].shift(3)
Explanation:
- Offers deeper historical context.
- Improves learning of cyclic or weekly patterns.
- Combine lag1, lag3, lag7, etc., to capture short-term and seasonal behavior.
- Enables models to use multi-step historical dependencies as features.
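Several lags are usually generated in a loop rather than one line at a time. The sketch below assumes a toy series of the values 0 to 9 and the lag set (1, 3, 7); both are illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy series 0.0 .. 9.0 so the lagged values are easy to read
df = pd.DataFrame({"value": np.arange(10.0)})

for k in (1, 3, 7):
    df[f"lag{k}"] = df["value"].shift(k)

# The largest lag decides how many leading rows are incomplete
supervised = df.dropna()
print(supervised)
```

With a maximum lag of 7, the first 7 rows are lost, so only 3 of the 10 rows remain. Larger lags trade away training rows for longer memory.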
3. Rolling Mean
What is it?
Smooths out time series by averaging values over a sliding window.
Syntax:
df['roll_mean_3'] = df['value'].rolling(3).mean()
Explanation:
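- Calculates the average of the current and previous 2 values (window=3); the first 2 rows are NaN.
- When the current value is also the prediction target, apply shift(1) before rolling so the window contains past values only.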
- Useful for trend extraction and smoothing noise.
- Reduces impact of short-term fluctuations and sharp jumps.
- Can be used directly or in combination with other features.
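The difference between a window that includes the current row and one that uses only past rows is easiest to see on a tiny series. The values below are illustrative, and the column name `roll_mean_3_past` is an assumption introduced for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Window covers the current row and the 2 rows before it
df["roll_mean_3"] = df["value"].rolling(3).mean()

# Window covers the 3 rows strictly before the current one
df["roll_mean_3_past"] = df["value"].shift(1).rolling(3).mean()

print(df)
# roll_mean_3      -> NaN, NaN, 2.0, 3.0, 4.0
# roll_mean_3_past -> NaN, NaN, NaN, 2.0, 3.0
```

The shifted version costs one extra warm-up row, but it is the safe choice whenever the model predicts the value in the same row.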
4. Rolling Standard Deviation
What is it?
Quantifies the variation or volatility within a sliding window of observations.
Syntax:
df['roll_std_5'] = df['value'].rolling(5).std()
Explanation:
- Measures the degree of deviation from the mean within a window.
- Helpful in modeling uncertainty and market volatility.
- Can highlight periods of instability or abnormal behavior.
- Use different window sizes to capture short- or long-term volatility.
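To illustrate the effect of window size, the sketch below simulates a series whose second half is noisier than its first (an assumption made purely for the demonstration) and computes rolling standard deviations over 5 and 20 steps.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Simulated data: calm first half (scale 1), volatile second half (scale 5)
values = np.concatenate([rng.normal(0, 1, 50), rng.normal(0, 5, 50)])
df = pd.DataFrame({"value": values})

for w in (5, 20):
    # pandas uses the sample standard deviation (ddof=1) by default
    df[f"roll_std_{w}"] = df["value"].rolling(w).std()

print(df[["roll_std_5", "roll_std_20"]].iloc[[49, 99]])
```

The short window reacts quickly but is jumpy; the long window is steadier but needs 19 warm-up rows and responds more slowly to the change in volatility.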
5. Drop NaNs
What is it?
Removes all rows that contain NaN values, typically introduced by lag or rolling computations.
Syntax:
df = df.dropna()
Explanation:
- Necessary cleanup step before model training.
- Avoids errors when passing data to Scikit-learn models.
- Drop only after all lag and rolling features have been added.
- Alternatively, impute missing values if losing rows is unacceptable.
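One practical detail: a bare `dropna()` removes a row if any column is missing, including columns unrelated to the features. The sketch below uses a hypothetical frame with an empty `note` column to show how `dropna(subset=...)` limits the check to the engineered features.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'note' is an unrelated column that happens to be empty
df = pd.DataFrame({"value": np.arange(1.0, 9.0), "note": [None] * 8})

# Add all lag/rolling features first ...
past = df["value"].shift(1)
df["lag1"] = past
df["roll_mean_3"] = past.rolling(3).mean()

# ... then drop rows once, looking only at the feature columns
feature_cols = ["lag1", "roll_mean_3"]
clean = df.dropna(subset=feature_cols)

print(len(df), len(clean), len(df.dropna()))
```

Here only the 3 warm-up rows are removed by the subset version, while the bare call would discard every row because of the empty column.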
Real-Life Project: Temperature Forecasting with Lag + Rolling
Project Name
Enhanced Daily Temperature Forecast
Project Overview
Add lag and rolling mean features to improve prediction of daily temperatures.
Project Goal
Use lag and rolling statistics in a linear regression model.
Code for This Project
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
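# Simulate data (seeded so the run is reproducible)
rng = np.random.default_rng(42)
dates = pd.date_range(start='2023-01-01', periods=100)
temps = rng.normal(loc=25, scale=3, size=100)
df = pd.DataFrame({'date': dates, 'temp': temps})
# Create features from past values only: shift(1) first, so the window for
# day t covers days t-3..t-1 and never includes the target temp[t]
past = df['temp'].shift(1)
df['lag1'] = past
df['roll_mean_3'] = past.rolling(3).mean()
df['roll_std_3'] = past.rolling(3).std()
# Drop the warm-up rows that have no complete history
df = df.dropna()
X = df[['lag1', 'roll_mean_3', 'roll_std_3']]
y = df['temp']
# Chronological split: shuffle=False keeps the test set after the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)
# Fit and evaluate
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))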
Expected Output
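- Prints a single test-set MSE value; no lag-only baseline is fit in this script, so one must be fit separately to measure any improvement
- Because the simulated temperatures are independent random draws, expect little gain on this data; the benefit of combining lag and rolling features shows on series with real autocorrelation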
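A lag-only baseline can be compared against the combined feature set with a sketch like the following. It is not part of the original project: the AR(1)-style stand-in data, the coefficient 0.8, and the helper `holdout_mse` are all assumptions, and no particular outcome is guaranteed, so the script prints both scores rather than asserting which is better.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Assumed stand-in data: each value depends on the previous one plus noise
noise = rng.normal(scale=1.0, size=300)
temp = np.empty(300)
temp[0] = 25.0
for t in range(1, 300):
    temp[t] = 25.0 + 0.8 * (temp[t - 1] - 25.0) + noise[t]

df = pd.DataFrame({"temp": temp})
past = df["temp"].shift(1)  # past values only, so the target never leaks
df["lag1"] = past
df["roll_mean_3"] = past.rolling(3).mean()
df["roll_std_3"] = past.rolling(3).std()
df = df.dropna()


def holdout_mse(columns):
    """Hypothetical helper: chronological split, fit, return test MSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[columns], df["temp"], shuffle=False
    )
    model = LinearRegression().fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))


mse_lag_only = holdout_mse(["lag1"])
mse_combined = holdout_mse(["lag1", "roll_mean_3", "roll_std_3"])
print("lag only:      ", mse_lag_only)
print("lag + rolling: ", mse_combined)
```

Both models see the same rows and the same split, so any difference in MSE comes from the extra features alone.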
Common Mistakes to Avoid
- ❌ Using rolling without handling NaNs
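- ❌ Leaking future information by shifting incorrectly, for example with shift(-1) or a rolling window that includes the target row
- ❌ Computing lag or rolling features separately on each split, which leaves the first test rows without history; build the features on the full, time-ordered series first, then split chronologically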
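A quick way to catch the leakage mistake is to perturb the series from some row onward and check whether that row's features change. The sketch below is one possible check, with hypothetical helper names (`build_features`, `leaky_features`) and simulated data; features built from past values only should be unaffected.

```python
import numpy as np
import pandas as pd


def build_features(series):
    """Past-only features: shift(1) before any rolling."""
    past = series.shift(1)
    return pd.DataFrame({"lag1": past, "roll_mean_3": past.rolling(3).mean()})


def leaky_features(series):
    """Rolling window that includes the current (target) row."""
    return pd.DataFrame({"roll_mean_3": series.rolling(3).mean()})


rng = np.random.default_rng(1)
s = pd.Series(rng.normal(size=30))

t = 20
changed = s.copy()
changed.iloc[t:] += 100.0  # alter row t and everything after it

safe_same = np.allclose(build_features(s).iloc[t], build_features(changed).iloc[t])
leaky_same = np.allclose(leaky_features(s).iloc[t], leaky_features(changed).iloc[t])
print(safe_same, leaky_same)  # True False
```

If a feature at row t moves when only row t and later rows were changed, that feature depends on information the model would not have at prediction time.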