Lag Features and Rolling Means with Scikit-learn

Lag features and rolling statistics are powerful tools in time series forecasting. While Scikit-learn doesn’t provide these natively, they can be engineered with pandas before the data is passed to a model. These features help capture temporal dependencies, seasonality, and trends.

Key Characteristics

Lag features represent past observations
Rolling means smooth short-term fluctuations
Used as engineered inputs for both regression and classification models
Give the model context about recent history at each time step

Basic Rules

Build every feature only from values that precede the target's time step (for example, shift before rolling) to prevent leakage
Drop NaNs after applying lag/rolling
Combine multiple lags and windows for better performance
Can be wrapped in a Scikit-learn pipeline with FunctionTransformer
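The FunctionTransformer rule can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name add_lag_features and the column name value are assumptions made for the example. Note that the transformer only adds columns; the NaN rows are left in place, because dropping rows inside a pipeline would misalign X with y.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_lag_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add a lag and a past-only rolling mean."""
    out = frame.copy()
    past = out["value"].shift(1)          # values strictly before each row
    out["lag1"] = past
    out["roll_mean_3"] = past.rolling(3).mean()
    return out


# Wrap the plain function so it can be used as a pipeline step
lag_transformer = FunctionTransformer(add_lag_features)

df = pd.DataFrame({"value": [10.0, 12.0, 11.0, 13.0, 15.0]})
features = lag_transformer.fit_transform(df)
print(features)
```

The early rows still contain NaN, so they need to be dropped (together with the matching targets) or imputed before a model is fitted.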

Syntax Table

No.	Technique	Syntax Example	Description
1	Lag feature	`df['lag1'] = df['value'].shift(1)`	Adds previous time step as feature
2	Multiple lags	`df['lag3'] = df['value'].shift(3)`	Adds value from 3 steps back
3	Rolling mean	`df['roll_mean_3'] = df['value'].rolling(3).mean()`	Computes 3-step moving average
4	Rolling std	`df['roll_std_5'] = df['value'].rolling(5).std()`	Rolling standard deviation
5	Drop NaNs	`df = df.dropna()`	Removes rows with missing values

Syntax Explanation

1. Lag Feature

What is it?
Adds a new column that contains the value from one time step ago. This helps the model learn temporal dependencies between observations.

Syntax:

df['lag1'] = df['value'].shift(1)

Explanation:

Shifts the original series by 1 row to align each observation with its prior value.
Essential for reframing a time series as a supervised learning problem, where past values are the inputs and the current value is the target.
Can stack several lags to build memory into the model.
Watch out for NaN at the start, which must be removed before training.
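The points above can be seen on a tiny series. This is a small sketch with made-up numbers; the column name value follows the syntax example.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 15]})

# Each row now carries the previous row's value as a feature
df["lag1"] = df["value"].shift(1)
print(df)

# Supervised framing: predict value from lag1, skipping the first row,
# which has no previous observation
X = df[["lag1"]].iloc[1:]
y = df["value"].iloc[1:]
```

The first row of lag1 is NaN because nothing precedes it, which is why that row is excluded from X and y.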

2. Multiple Lags

What is it?
Creates additional lagged features with greater gaps to capture longer temporal effects.

Syntax:

df['lag3'] = df['value'].shift(3)

Explanation:

Offers deeper historical context.
Improves learning of cyclic or weekly patterns.
Combine lag1, lag3, lag7, etc., to capture short-term and seasonal behavior.
Enables models to use multi-step historical dependencies as features.
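Several lags are usually created in a loop. The sketch below uses the lags 1, 3, and 7 mentioned above on a made-up series of ten values; it also shows the cost of long lags, since the largest lag determines how many early rows are lost.

```python
import pandas as pd

df = pd.DataFrame({"value": range(1, 11)})  # values 1..10

# One column per lag; lag k holds the value from k rows earlier
for lag in (1, 3, 7):
    df[f"lag{lag}"] = df["value"].shift(lag)

print(df)

# Rows with complete history: the first 7 rows are lost to lag7
n_usable = len(df.dropna())
print("usable rows:", n_usable)
```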

3. Rolling Mean

What is it?
Smooths out time series by averaging values over a sliding window.

Syntax:

df['roll_mean_3'] = df['value'].rolling(3).mean()

Explanation:

Calculates the average of the current and previous 2 values (window=3).
Useful for trend extraction and smoothing noise.
Reduces impact of short-term fluctuations and sharp jumps.
Because the window includes the current value, apply shift(1) before rolling when the current value is the prediction target; otherwise the feature leaks the target.
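The window alignment is easiest to check on a short series. In this sketch (made-up numbers, and the column name roll_mean_3_prev is an assumption for illustration), the first column includes the current row in its window and the second uses only earlier rows.

```python
import pandas as pd

df = pd.DataFrame({"value": [10.0, 12.0, 11.0, 13.0, 15.0]})

# Window covers the current row and the 2 rows before it
df["roll_mean_3"] = df["value"].rolling(3).mean()

# Window covers the 3 rows before the current row (no current value)
df["roll_mean_3_prev"] = df["value"].shift(1).rolling(3).mean()

print(df)
```

The second form is the safer choice when the model is asked to predict value itself, at the cost of one extra NaN row.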

4. Rolling Standard Deviation

What is it?
Quantifies the variation or volatility within a sliding window of observations.

Syntax:

df['roll_std_5'] = df['value'].rolling(5).std()

Explanation:

Measures the degree of deviation from the mean within a window.
Helpful in modeling uncertainty and market volatility.
Can highlight periods of instability or abnormal behavior.
Use different window sizes to capture short- or long-term volatility.
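A short sketch of how window size changes the picture, using a made-up series that is flat at first and then starts to oscillate. pandas computes the sample standard deviation here (ddof=1 by default).

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [10.0, 10.0, 10.0, 10.0, 10.0, 20.0, 10.0, 20.0, 10.0, 20.0]}
)

# Short window reacts quickly; longer window gives a steadier estimate
df["roll_std_3"] = df["value"].rolling(3).std()
df["roll_std_5"] = df["value"].rolling(5).std()

print(df)
```

Both columns are zero over the flat stretch and rise once the oscillation enters the window; the 5-step column needs 4 rows of warm-up instead of 2.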

5. Drop NaNs

What is it?
Removes all rows that contain NaN values, typically introduced by lag or rolling computations.

Syntax:

df = df.dropna()

Explanation:

Necessary cleanup step before model training.
Avoids errors when passing data to Scikit-learn models.
Drop only after all lag and rolling features have been added.
Alternatively, impute missing values if losing rows is unacceptable.
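Both options can be sketched on a small frame. Restricting dropna to the engineered columns (the subset argument) avoids discarding rows because of unrelated gaps. As one possible alternative to dropping rows, min_periods=1 lets the rolling window use whatever history exists; this is only one option among several and the numbers are made up.

```python
import pandas as pd

df = pd.DataFrame({"value": [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]})
df["lag1"] = df["value"].shift(1)
df["roll_mean_3"] = df["value"].rolling(3).mean()

# Option 1: drop rows where the engineered features are missing
clean = df.dropna(subset=["lag1", "roll_mean_3"]).reset_index(drop=True)
print(len(df), "->", len(clean), "rows")

# Option 2: keep early rows by averaging over a partial window
roll_partial = df["value"].rolling(3, min_periods=1).mean()
print(roll_partial)
```

Partial windows keep every row but make the earliest feature values noisier, since they average fewer observations.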

Real-Life Project: Temperature Forecasting with Lag + Rolling

Project Name

Enhanced Daily Temperature Forecast

Project Overview

Add lag and rolling mean features to improve prediction of daily temperatures.

Project Goal

Use lag and rolling statistics in a linear regression model.

Code for This Project

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

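# Simulate data: independent random draws around 25 °C, seeded for reproducibility.
# Note that such data has no real temporal structure for the features to exploit.
np.random.seed(42)
dates = pd.date_range(start='2023-01-01', periods=100)
temps = np.random.normal(loc=25, scale=3, size=100)
df = pd.DataFrame({'date': dates, 'temp': temps})

# Create features from past values only. Shifting by one day first keeps
# temp[t], the prediction target, out of its own rolling window.
past = df['temp'].shift(1)
df['lag1'] = past
df['roll_mean_3'] = past.rolling(3).mean()
df['roll_std_3'] = past.rolling(3).std()
df = df.dropna()  # the first 3 rows have incomplete history

X = df[['lag1', 'roll_mean_3', 'roll_std_3']]
y = df['temp']
# shuffle=False keeps the split chronological: train on earlier days, test on later ones
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))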

Expected Output

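A single line of the form MSE: <value>; the number depends on the simulated data
On real temperature data with day-to-day structure, adding rolling features can improve on a lag-only model; compare both on the same chronological split to confirm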
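Whether rolling features help is best checked rather than assumed. The sketch below scores a lag-only baseline and the combined feature set on the same chronological split. The seasonal series (a 30-day sine wave plus noise) and the helper name evaluate are assumptions made for this illustration; they are not part of the project above, and no particular outcome is implied.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumed synthetic data: a 30-day cycle around 25 degrees plus noise
rng = np.random.default_rng(0)
t = np.arange(200)
temp = 25 + 5 * np.sin(2 * np.pi * t / 30) + rng.normal(scale=1.0, size=200)
df = pd.DataFrame({"temp": temp})

# Features built from past values only
past = df["temp"].shift(1)
df["lag1"] = past
df["roll_mean_3"] = past.rolling(3).mean()
df["roll_std_3"] = past.rolling(3).std()
df = df.dropna()


def evaluate(columns):
    """Hypothetical helper: fit on the early part, score on the later part."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[columns], df["temp"], shuffle=False
    )
    model = LinearRegression().fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))


mse_lag_only = evaluate(["lag1"])
mse_combined = evaluate(["lag1", "roll_mean_3", "roll_std_3"])
print("lag only:", mse_lag_only)
print("lag + rolling:", mse_combined)
```

Running the same comparison on the real series of interest shows whether the extra features earn their place.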

Common Mistakes to Avoid

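❌ Using rolling without handling the NaNs it creates at the start of the series
❌ Leaking future information by shifting incorrectly (for example, shift(-1), or a rolling window that includes the target value)
❌ Computing lag or rolling features separately on each split, which loses the history at the start of the test set (build them on the full ordered series, then split chronologically)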

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

