Multi-Class Classification Strategies in Scikit-learn

Multi-class classification is a supervised learning task where the goal is to assign each input sample to one of three or more classes. Scikit-learn provides several strategies to handle multi-class problems, including One-vs-Rest (OvR), One-vs-One (OvO), and native multiclass classifiers like RandomForestClassifier or LogisticRegression.

Key Characteristics

  • Handles more than two class labels
  • Supports One-vs-Rest (OvR) and One-vs-One (OvO) strategies
  • Can use native classifiers or meta-estimators
  • Works with both linear and nonlinear models

Basic Rules

  • Choose OvR for high efficiency on large datasets
  • OvO may work better with models sensitive to class boundaries
  • Evaluate confusion matrix to understand per-class performance
  • Use stratified train-test split to ensure balanced class distribution

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | One-vs-Rest | OneVsRestClassifier(LogisticRegression()) | Trains one classifier per class vs all others
2 | One-vs-One | OneVsOneClassifier(SVC()) | Trains one classifier per class pair
3 | Native Support | RandomForestClassifier() | Native support for multi-class
4 | Fit Model | model.fit(X_train, y_train) | Trains the chosen classifier
5 | Predict Classes | model.predict(X_test) | Returns predicted class labels

Syntax Explanation

1. One-vs-Rest (OvR)

What is it?
A strategy that fits one classifier per class, where each classifier distinguishes a class from all others.

Syntax:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
model = OneVsRestClassifier(LogisticRegression())

Explanation:

  • Suitable for linear models
  • Efficient on large datasets
  • Each classifier outputs a confidence score; highest score wins
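
As a quick end-to-end illustration of the points above, the sketch below wraps LogisticRegression in OneVsRestClassifier and evaluates it on the built-in digits dataset; the dataset and split settings are illustrative, not prescriptive.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small multi-class dataset (10 digit classes)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# One binary classifier is fitted per class; prediction picks the class with the highest score
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
print("Fitted classifiers:", len(ovr.estimators_))   # one per class
print("Accuracy:", accuracy_score(y_test, ovr.predict(X_test)))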

2. One-vs-One (OvO)

What is it?
A strategy that fits one classifier per class pair.

Syntax:

from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
model = OneVsOneClassifier(SVC())

Explanation:

  • Builds N(N-1)/2 classifiers for N classes
  • Each classifier votes, and majority class wins
  • Effective when class boundaries are complex
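
To see the pairwise structure concretely, here is a minimal sketch using the built-in wine dataset purely as an example.

from sklearn.datasets import load_wine
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Wine has 3 classes, so OvO trains 3*(3-1)/2 = 3 pairwise classifiers
X, y = load_wine(return_X_y=True)
ovo = OneVsOneClassifier(SVC())
ovo.fit(X, y)
print("Pairwise classifiers:", len(ovo.estimators_))   # prints 3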

3. Native Multi-Class Classifier

What is it?
Classifiers like Random Forest and Logistic Regression inherently support multi-class classification.

Syntax:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

Explanation:

  • No need to wrap with OvR or OvO
  • Handles class imbalance and non-linearity well
  • Straightforward integration

4. Fit Model

What is it?
Trains the selected model on the labeled dataset.

Syntax:

model.fit(X_train, y_train)

Explanation:

  • Accepts feature matrix and label vector
  • Learns the decision boundaries between classes
  • Can be combined with grid search or pipelines

5. Predict Classes

What is it?
Predicts the class labels for unseen test data.

Syntax:

predictions = model.predict(X_test)

Explanation:

  • Produces an array of predicted class labels
  • Useful for accuracy, confusion matrix, or F1 score evaluations
  • Can be used in real-time prediction systems
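
The predicted labels can be passed straight into Scikit-learn's metric functions; the short sketch below assumes y_test and predictions exist from the earlier fit/predict steps.

from sklearn.metrics import confusion_matrix, f1_score

# Per-class breakdown of correct and incorrect predictions
print(confusion_matrix(y_test, predictions))
# Macro-averaged F1 treats every class equally, regardless of size
print("Macro F1:", f1_score(y_test, predictions, average='macro'))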

Real-Life Project: Digit Recognition (MNIST)

Project Overview

Use multi-class classification strategies to identify handwritten digits (0–9) from images.

Code Example

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3, random_state=42)

# Train model with native multi-class support
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Per-class precision, recall, and F1-scores
  • Overall accuracy of multi-class classifier

Common Mistakes to Avoid

  • ❌ Ignoring label distribution in train/test splits
  • ❌ Using binary classifiers without wrapping them in OvR or OvO
  • ❌ Not evaluating the model using class-specific metrics

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

One-Class SVM in Scikit-learn

One-Class Support Vector Machine (One-Class SVM) is an unsupervised anomaly detection algorithm that learns a decision function to separate normal data points from outliers. It is particularly effective for problems where only normal data is available for training.

Key Characteristics

  • Learns a boundary around normal data in feature space
  • Based on SVM principles with a special loss formulation
  • Separates the training data from the origin in the kernel-induced feature space (the origin acts as the prototype outlier)
  • Works well with smaller and moderately sized datasets

Basic Rules

  • Always scale data before fitting the model
  • Suitable when only normal examples are present in training
  • Use nu to control the proportion of outliers
  • kernel selection is crucial for performance (e.g., 'rbf')

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Initialize Model | OneClassSVM(kernel='rbf', nu=0.05) | Creates SVM with RBF kernel and anomaly ratio
2 | Fit Model | model.fit(X_train) | Learns the boundary from normal samples
3 | Predict | model.predict(X_test) | Returns -1 for anomalies, 1 for inliers
4 | Score Samples | model.decision_function(X_test) | Computes distance from decision boundary
5 | Use in Pipeline | Pipeline([...]) | Wraps One-Class SVM with preprocessing

Syntax Explanation

1. Initialize Model

What is it?
Creates a One-Class SVM model to detect anomalies using a kernel function.

Syntax:

from sklearn.svm import OneClassSVM
model = OneClassSVM(kernel='rbf', nu=0.05)

Explanation:

  • OneClassSVM() initializes the model using support vector machine formulation adapted for unsupervised outlier detection.
  • kernel='rbf' means the model uses a radial basis function kernel for nonlinear separation.
  • nu is a regularization parameter: it sets an upper bound on the fraction of training points treated as outliers (margin errors) and a lower bound on the fraction of support vectors.
  • Higher nu values allow more points to be considered outliers.
  • Kernel options like 'linear', 'sigmoid', and 'poly' can also be explored depending on the data shape.
  • Together, nu and the kernel choice determine the model's sensitivity to outliers and its ability to generalize.

2. Fit Model

What is it?
Trains the One-Class SVM on a dataset containing only inliers.

Syntax:

model.fit(X_train)

Explanation:

  • fit() builds the SVM model that defines the decision function separating normal from anomalous instances.
  • The method assumes that X_train consists only of normal observations.
  • Internally, the model tries to find a hypersphere or hyperplane that contains most of the data.
  • Proper scaling is crucial as SVMs are sensitive to feature magnitudes.
  • Always preprocess using StandardScaler or MinMaxScaler before fitting.
  • Training only on normal data makes the model learn the support boundary of normal class distribution.

3. Predict

What is it?
Uses the trained model to classify new data points as either normal or anomalous.

Syntax:

predictions = model.predict(X_test)

Explanation:

  • Predicts each sample in X_test as either 1 (normal) or -1 (anomaly).
  • Output can be used to trigger alerts, logs, or further inspection.
  • You can convert results to a binary fraud format using a simple list comprehension.
  • Helps in integrating with anomaly filtering pipelines or dashboards.
  • Effective in real-time monitoring systems.
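
As the last bullet suggests, the -1/1 output can be mapped to a 0/1 anomaly flag with a one-line list comprehension; this sketch reuses the model defined earlier and assumes X_test holds new observations.

# Map One-Class SVM output (1 = inlier, -1 = outlier) to a 0/1 flag
predictions = model.predict(X_test)
anomaly_flags = [1 if p == -1 else 0 for p in predictions]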

4. Score Samples

What is it?
Measures how far each test instance is from the learned boundary.

Syntax:

scores = model.decision_function(X_test)

Explanation:

  • This function returns real-valued scores: the more negative, the more likely a point is anomalous.
  • Use scores to set a threshold rather than relying on predict() for strict binary labels.
  • This allows fine-tuning for specific recall or precision targets in deployment.
  • Ideal for visualization (e.g., histograms of anomaly scores).
  • Helps build custom logic for business-critical anomaly definitions.
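
Building on the thresholding point above, the sketch below flags the lowest-scoring 1% of samples instead of relying on predict(); the 1% budget is an assumption you would tune for your own precision/recall targets.

import numpy as np

scores = model.decision_function(X_test)
threshold = np.quantile(scores, 0.01)              # assumed anomaly budget of 1%
custom_flags = (scores < threshold).astype(int)    # 1 = treated as an anomaly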

5. Use in Pipeline

What is it?
Embeds One-Class SVM into a pipeline along with preprocessing steps like scaling.

Syntax:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('svm', OneClassSVM(kernel='rbf', nu=0.05))
])
pipeline.fit(X_train)

Explanation:

  • Helps standardize the modeling workflow and reduce error risk.
  • Ensures the same scaler is applied during both training and inference.
  • Use in cross-validation or GridSearchCV to optimize parameters.
  • Simplifies deployment and automation for production environments.
  • Can integrate with more steps like PCA, feature selection, etc.
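
Once fitted, the pipeline is used exactly like a standalone model; this brief sketch assumes X_test holds new observations and that the scaler was fitted as part of pipeline.fit().

labels = pipeline.predict(X_test)             # 1 = inlier, -1 = outlier
scores = pipeline.decision_function(X_test)   # signed distance, computed after scaling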

Real-Life Project: Credit Card Fraud Detection

Project Overview

Use One-Class SVM to identify fraudulent transactions in a credit card dataset.

Code Example

import pandas as pd
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Load dataset
data = pd.read_csv('credit_card.csv')
X = data.drop(columns=['Class'])
y = data['Class']  # 0 = normal, 1 = fraud

# Use only normal transactions for training
X_train = X[y == 0]
X_test = X

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = OneClassSVM(kernel='rbf', nu=0.05)
model.fit(X_train_scaled)

# Predict
y_pred = model.predict(X_test_scaled)
y_pred = [1 if i == -1 else 0 for i in y_pred]  # Convert -1 (anomaly) to 1 (fraud)

print(classification_report(y, y_pred))

Expected Output

  • Improved detection of fraud class (1) using unsupervised method
  • Precision/recall varies based on nu and dataset size

Common Mistakes to Avoid

  • ❌ Forgetting to scale input data
  • ❌ Misunderstanding nu as contamination rate (it's an upper bound)
  • ❌ Using in supervised settings with labeled data (better to use classifiers there)

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Anomaly Detection Techniques using Scikit-learn

Anomaly detection is the process of identifying rare items, events, or observations that differ significantly from the majority of the data. Scikit-learn provides several models to perform unsupervised anomaly detection, including One-Class SVM, Isolation Forest, and Elliptic Envelope.

Key Characteristics

  • Detects outliers or rare events in datasets
  • Often used in fraud detection, network security, and monitoring systems
  • Works in unsupervised settings (without labeled data)
  • Sensitive to feature scaling and data distribution

Basic Rules

  • Normalize or standardize features before applying models
  • Use domain knowledge to validate detected anomalies
  • Evaluate using precision-recall or domain-specific metrics
  • Suitable for high-dimensional data when models are properly tuned

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | One-Class SVM | OneClassSVM(kernel='rbf', nu=0.1) | Learns a decision function for outliers
2 | Isolation Forest | IsolationForest(contamination=0.1) | Isolates anomalies based on tree splits
3 | Elliptic Envelope | EllipticEnvelope(contamination=0.1) | Assumes Gaussian distribution
4 | Fit Model | model.fit(X) | Trains on normal (inlier) data
5 | Predict Outliers | model.predict(X)  # -1 = outlier, 1 = inlier | Identifies anomalies in the dataset

Syntax Explanation

1. One-Class SVM

What is it? Learns a boundary that surrounds the inliers in feature space.

from sklearn.svm import OneClassSVM
model = OneClassSVM(kernel='rbf', nu=0.05)
  • nu controls the fraction of outliers
  • Sensitive to kernel choice and scaling

2. Isolation Forest

What is it? Randomly partitions data and isolates anomalies with fewer splits.

from sklearn.ensemble import IsolationForest
model = IsolationForest(contamination=0.1)
  • Works well with high-dimensional data
  • Fast and efficient for large datasets

3. Elliptic Envelope

What is it? Fits a Gaussian distribution to the dataset and detects outliers as points far from the center.

from sklearn.covariance import EllipticEnvelope
model = EllipticEnvelope(contamination=0.1)
  • Best for data with normal distribution
  • Requires features to be normally distributed

4. Fit the Model

model.fit(X_train)
  • Trains the model assuming X_train contains only inliers (normal cases)

5. Predict Outliers

pred = model.predict(X_test)  # -1 = anomaly, 1 = normal
  • Use output for further investigation or alert systems
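
To make the three detectors concrete, here is a small side-by-side sketch on synthetic data; the toy data and the 5% contamination setting are illustrative only.

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

# 200 roughly Gaussian points plus 5 obvious outliers far from the cluster
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.uniform(low=6, high=8, size=(5, 2))])

detectors = [
    ("One-Class SVM", OneClassSVM(kernel='rbf', nu=0.05)),
    ("Isolation Forest", IsolationForest(contamination=0.05, random_state=42)),
    ("Elliptic Envelope", EllipticEnvelope(contamination=0.05)),
]
for name, detector in detectors:
    labels = detector.fit(X).predict(X)   # -1 = outlier, 1 = inlier
    print(name, "flagged", int((labels == -1).sum()), "points")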

Real-Life Project: Detecting Fraudulent Transactions

Project Overview

Identify potentially fraudulent financial transactions using Isolation Forest.

Code Example

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Load dataset
df = pd.read_csv('transactions.csv')
X = df.drop(columns=['is_fraud'])
y = df['is_fraud']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit model
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X_scaled)

# Predict
y_pred = model.predict(X_scaled)
y_pred = [1 if p == -1 else 0 for p in y_pred]  # convert -1 to 1 (fraud)

print(classification_report(y, y_pred))

Expected Output

  • High recall for fraud class (1)
  • Balanced precision depending on contamination setting

Common Mistakes to Avoid

  • ❌ Using raw unscaled data
  • ❌ Ignoring contamination rate tuning
  • ❌ Applying to labeled supervised data (better handled with classifiers)

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

SMOTE for Data Balancing in Scikit-learn

SMOTE (Synthetic Minority Oversampling Technique) is a popular method to address class imbalance by generating synthetic examples of the minority class. Unlike random oversampling, which duplicates data, SMOTE synthesizes new, plausible samples by interpolating between existing minority class samples.

Key Characteristics

  • Generates synthetic minority class samples
  • Reduces risk of overfitting compared to random oversampling
  • Enhances model performance on imbalanced datasets
  • Works only on numeric features (requires preprocessing for categorical data)

Basic Rules

  • Use after train/test split to avoid data leakage
  • Scale features if base model is sensitive to distance metrics
  • Combine with under-sampling or ensemble models for better results
  • Evaluate performance using recall and F1-score

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Initialize SMOTE | SMOTE() | Prepares the SMOTE instance
2 | Fit and Resample | X_res, y_res = SMOTE().fit_resample(X, y) | Generates new synthetic samples
3 | Custom Strategy | SMOTE(sampling_strategy=0.5) | Balances classes with specific ratio
4 | K Neighbors | SMOTE(k_neighbors=3) | Changes the number of neighbors for SMOTE
5 | Use in Pipeline | Pipeline([('smote', SMOTE()), ('clf', model)]) | Applies SMOTE inside training pipeline

Syntax Explanation

1. Initialize SMOTE

What is it?
Creates a SMOTE instance for resampling.

Syntax:

from imblearn.over_sampling import SMOTE
smote = SMOTE()

Explanation:

  • This line initializes the SMOTE object using default settings.
  • It's essential before applying fit_resample().
  • You can later modify hyperparameters like sampling_strategy, k_neighbors, etc.
  • Default settings create a fully balanced dataset by oversampling the minority class to match the majority.
  • Ensure the data is numeric; otherwise, you may need SMOTENC or preprocessing.

2. Fit and Resample

What is it?
Generates synthetic minority class samples and returns balanced features and labels.

Syntax:

X_res, y_res = smote.fit_resample(X_train, y_train)

Explanation:

  • Trains the SMOTE model on your training dataset.
  • Returns the new feature set X_res and labels y_res with increased minority instances.
  • The newly generated samples are not simply duplicated; they are interpolated using nearest neighbors.
  • This step is crucial and should only be applied to training data after splitting to avoid data leakage.
  • Can be combined with under-sampling methods in pipelines for better balance.
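
A quick sanity check, assuming X_train and y_train come from an imbalanced split, is to compare the class counts before and after resampling.

from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before:", Counter(y_train))
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("After:", Counter(y_res))   # minority class oversampled to match the majority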

3. Custom Sampling Strategy

What is it?
Specifies the desired class distribution ratio.

Syntax:

smote = SMOTE(sampling_strategy=0.5)

Explanation:

  • This will resample the minority class until it's 50% the size of the majority class.
  • Instead of a fixed number, this is a float-based ratio between 0 and 1.
  • Alternatively, sampling_strategy can be a dictionary {class_label: target_count}.
  • Useful for partial balancing when full parity is not ideal.
  • Helps tailor oversampling to specific business or risk constraints.
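
For reference, the dictionary form might look like the sketch below; the class label and target count are hypothetical.

from imblearn.over_sampling import SMOTE

# Request an explicit target count for class 1 (label and count are illustrative)
smote = SMOTE(sampling_strategy={1: 500})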

4. Adjusting K Neighbors

What is it?
Modifies how SMOTE interpolates new samples using nearest neighbors.

Syntax:

smote = SMOTE(k_neighbors=3)

Explanation:

  • k_neighbors determines how many neighboring minority samples are used for interpolation.
  • Lower values create tighter clusters (less diversity), higher values increase synthetic variability.
  • Default is 5; reducing it may help with extremely small datasets.
  • Should be tuned based on dataset size and distribution.
  • k_neighbors can also accept a pre-configured nearest-neighbors estimator instead of an integer, as shown in the sketch below.
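
A minimal sketch of that option, with all other SMOTE settings left at their defaults:

from sklearn.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE

# Pass a pre-configured neighbor search object instead of an integer
smote = SMOTE(k_neighbors=NearestNeighbors(n_neighbors=3))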

5. Using SMOTE in a Pipeline

What is it?
Combines SMOTE and model into a single reproducible training workflow.

Syntax:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('smote', SMOTE()),
    ('clf', LogisticRegression())
])

Explanation:

  • This ensures SMOTE is only applied to the training set during cross-validation.
  • Prevents information leakage across folds.
  • Makes it easy to use GridSearchCV or cross_val_score safely.
  • Can include preprocessing steps like StandardScaler() before modeling.
  • Highly recommended when doing repeated evaluations or deploying models.
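
For example, assuming the pipeline above plus a feature matrix X and a binary label vector y, cross-validation applies SMOTE only inside each training fold.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print("Mean F1 across folds:", scores.mean())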

Real-Life Project: SMOTE for Loan Default Prediction

Project Name

Loan Default Classification

Project Overview

Predict whether a customer will default using SMOTE to balance the dataset.

Project Goal

Improve recall on the default (minority) class.

Code for This Project

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
import pandas as pd

# Load data
X = pd.read_csv('loan_features.csv')
y = pd.read_csv('loan_labels.csv').values.ravel()

# Split
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=test_size, random_state=42)

# Apply SMOTE
sm = SMOTE(k_neighbors=4, sampling_strategy='auto')
X_res, y_res = sm.fit_resample(X_train, y_train)

# Train model
model = LogisticRegression(class_weight='balanced')
model.fit(X_res, y_res)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Improved recall and F1-score for the minority class (loan defaults)
  • Classification report showing balanced performance across classes

Common Mistakes to Avoid

  • ❌ Applying SMOTE before train-test split (leads to data leakage)
  • ❌ Using with categorical features without encoding (SMOTE requires numeric input)
  • ❌ Ignoring feature scaling when using distance-based classifiers

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Handling Imbalanced Data with Scikit-learn

Imbalanced datasets occur when one class significantly outweighs others, often leading to biased models. Scikit-learn offers tools and strategies to address class imbalance through resampling, algorithmic adjustments, and evaluation metrics.

Key Characteristics

  • Target variable has skewed class distribution
  • Causes poor recall for minority classes
  • Needs special preprocessing or model adjustments
  • Affects classification more than regression

Basic Rules

  • Never evaluate solely with accuracy
  • Use stratified splits during training
  • Always monitor precision, recall, and F1-score
  • Apply techniques like resampling or class weighting

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Class Weighting | LogisticRegression(class_weight='balanced') | Penalizes majority class
2 | SMOTE Oversampling | SMOTE().fit_resample(X, y) | Synthesizes new minority samples
3 | Random Under-sampling | RandomUnderSampler().fit_resample(X, y) | Removes samples from majority class
4 | Stratified Split | StratifiedKFold(n_splits=5) | Ensures class proportions in folds
5 | Classification Report | classification_report(y_true, y_pred) | Evaluates recall, precision, and F1-score

Syntax Explanation

1. Class Weighting

What is it?
Adjusts the loss function to penalize misclassification of minority classes.

Syntax:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')

Explanation:

  • class_weight='balanced' adjusts class weights inversely proportional to class frequencies in the data.
  • This setting tells the model to pay more attention to minority class samples.
  • Can also pass a dictionary with custom weights, e.g., class_weight={0: 1, 1: 5}.
  • Available in various models like RandomForestClassifier, SVC, and DecisionTreeClassifier.
  • Helps reduce bias toward majority class without modifying the dataset.
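
A short sketch of the custom-dictionary form, using RandomForestClassifier and illustrative weights:

from sklearn.ensemble import RandomForestClassifier

# Errors on class 1 are penalized five times more heavily than errors on class 0
model = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=42)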

2. SMOTE Oversampling

What is it?
Synthetic Minority Oversampling Technique creates synthetic samples from the minority class.

Syntax:

from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X, y)

Explanation:

  • SMOTE creates new synthetic instances by interpolating between existing minority class instances.
  • It helps balance the dataset and prevent overfitting from simple duplication.
  • fit_resample returns a new feature matrix and target vector.
  • Can be customized using k_neighbors, sampling_strategy, and other parameters.
  • Part of the imbalanced-learn (imblearn) package, which must be installed separately.

3. Random Under-sampling

What is it?
Reduces class imbalance by randomly removing samples from the majority class.

Syntax:

from imblearn.under_sampling import RandomUnderSampler
X_res, y_res = RandomUnderSampler().fit_resample(X, y)

Explanation:

  • This method drops samples from the majority class to match the size of the minority class.
  • Helps simplify the dataset and reduce training time.
  • Can lead to information loss if not used carefully.
  • Works well when you have abundant data and want faster training.
  • Combine with SMOTE in a pipeline using Pipeline() from imblearn.pipeline for optimal performance.

4. Stratified Split

What is it?
Ensures each fold has the same class distribution as the original dataset.

Syntax:

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)

Explanation:

  • Creates train/test splits such that each fold maintains the original class distribution.
  • Prevents imbalance during cross-validation which can skew results.
  • Use .split(X, y) to generate train/test indices.
  • Commonly used with cross_val_score or custom CV loops.
  • Also available as StratifiedShuffleSplit for randomized splitting.
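
A typical pattern, assuming X and y are already defined and the target is binary, combines StratifiedKFold with cross_val_score.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(class_weight='balanced', max_iter=1000)
print(cross_val_score(model, X, y, cv=skf, scoring='f1'))   # one F1 score per fold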

5. Classification Report

What is it?
Displays precision, recall, F1-score, and support for each class.

Syntax:

from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

Explanation:

  • precision = TP / (TP + FP): focus on positive prediction correctness.
  • recall = TP / (TP + FN): focus on identifying all relevant samples.
  • f1-score = harmonic mean of precision and recall.
  • support = number of true samples for each label.
  • Particularly helpful to monitor minority class performance, which may have poor recall in imbalanced settings.
  • Should be used in conjunction with a confusion matrix for full clarity.

Real-Life Project: Fraud Detection with Imbalanced Data

Project Name

Credit Card Fraud Classification

Project Overview

Detect fraudulent transactions from highly imbalanced financial dataset.

Project Goal

Use oversampling and evaluation metrics to identify fraud effectively.

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Simulated example dataset
X = pd.read_csv('features.csv')
y = pd.read_csv('labels.csv').values.ravel()

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Apply SMOTE
smote = SMOTE()
X_res, y_res = smote.fit_resample(X_train, y_train)

# Train model
model = LogisticRegression(class_weight='balanced')
model.fit(X_res, y_res)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Higher recall on minority class (fraud)
  • Balanced F1-scores across both classes

Common Mistakes to Avoid

  • ❌ Relying only on accuracy
  • ❌ Not stratifying during split or validation
  • ❌ Oversampling before splitting (leads to leakage)
  • ❌ Ignoring class imbalance in metrics

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Time Series Cross-Validation in Scikit-learn

Traditional k-fold cross-validation is not suitable for time series data due to the temporal dependency between observations. Instead, Scikit-learn provides TimeSeriesSplit, a strategy that preserves order and prevents leakage by ensuring that the training set always precedes the test set chronologically.

Key Characteristics

  • Maintains chronological order in splits
  • Avoids training on future data
  • Useful for evaluating time-based model stability
  • Supports consistent model validation in rolling or expanding windows

Basic Rules

  • Always split sequentially, not randomly
  • Training set must precede test set
  • Use consistent time intervals
  • Ideal for univariate and multivariate time series tasks

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Initialize split | TimeSeriesSplit(n_splits=5) | Creates time-ordered cross-validation sets
2 | Access splits | for train_idx, test_idx in tscv.split(X): ... | Iterates through each CV fold
3 | Train model | model.fit(X[train_idx], y[train_idx]) | Trains on training portion of each split
4 | Evaluate model | model.predict(X[test_idx]) | Evaluates on time-valid test data
5 | Visualization | plt.plot(train_idx), plt.plot(test_idx) | Useful to understand fold composition

Syntax Explanation

1. Initialize TimeSeriesSplit

What is it?
Creates a cross-validator that provides train/test indices in time-ordered folds.

Syntax:

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

Explanation:

  • n_splits=5 produces five sequential train/test splits.
  • The training window expands by default, so each later fold trains on more historical data.
  • Ideal for rolling window validation to mimic real-time prediction environments.
  • You can also customize the max_train_size parameter to limit how large the training set grows.
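
A tiny sketch makes the expanding/rolling behaviour visible; the 20-point series and the max_train_size value are arbitrary.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)                       # toy series with 20 time steps
tscv = TimeSeriesSplit(n_splits=4, max_train_size=10)  # rolling window of at most 10 points
for train_idx, test_idx in tscv.split(X):
    print("train", train_idx.min(), "-", train_idx.max(),
          "| test", test_idx.min(), "-", test_idx.max())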

2. Access Splits

What is it?
Extracts the train and test indices from each fold using a for loop.

Syntax:

for train_idx, test_idx in tscv.split(X):
    print("Train indices:", train_idx, "Test indices:", test_idx)

Explanation:

  • Iterates over each fold and provides integer indices for slicing.
  • Ensures test data follows training data in time.
  • Very helpful for debugging, logging, and visualizing the sequence of training and testing.
  • Each iteration updates the model using increasingly more historical data.

3. Train Model

What is it?
Fits the model to the current training set for the current fold.

Syntax:

model.fit(X[train_idx], y[train_idx])

Explanation:

  • Ensures training is only done using past observations.
  • This loop enables robust evaluation of model performance across various time splits.
  • Supports all Scikit-learn estimators (LinearRegression, SVR, Ridge, etc.)
  • For pipelines, use: pipeline.fit(X[train_idx], y[train_idx])

4. Evaluate Model

What is it?
Generates predictions on the future (test) portion of the time series.

Syntax:

y_pred = model.predict(X[test_idx])

Explanation:

  • Makes one-step ahead (or multi-step if structured) predictions.
  • Should compare predictions with y[test_idx] using evaluation metrics like RMSE, MAE, or MAPE.
  • Important for simulating how a deployed model would perform on unseen data.
  • You can log each fold's score or average them at the end.

5. Visualization of Splits

What is it?
Optional step to plot how splits are formed over time.

Syntax:

import matplotlib.pyplot as plt
plt.plot(train_idx, label='Train')
plt.plot(test_idx, label='Test')
plt.legend()

Explanation:

  • Great for checking how data is partitioned visually.
  • Confirms model never sees future data during training.
  • Helps ensure correct fold structure for reproducibility.
  • Can reveal issues like short test folds or improper sequences.

Real-Life Project: Time Series CV for Stock Forecasting

Project Name

Sequential Cross-Validation for Stock Returns

Project Overview

Use time series split to validate a regression model predicting next-day stock returns.

Project Goal

Implement walk-forward validation using TimeSeriesSplit.

Code for This Project

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Simulated data
df = pd.DataFrame({
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
    'target': np.random.randn(100)
})

X = df[['feature1', 'feature2']].values
y = df['target'].values

# Initialize split
tscv = TimeSeriesSplit(n_splits=5)

# Run CV
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = Ridge()
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    print(f"Fold {fold + 1} MSE:", mean_squared_error(y[test_idx], y_pred))

Expected Output

  • Fold-wise MSE printed
  • Performance consistency across time folds

Common Mistakes to Avoid

  • ❌ Using KFold instead of TimeSeriesSplit
  • ❌ Training on future data (data leakage)
  • ❌ Not scaling after train/test split (if needed)

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Lag Features and Rolling Means with Scikit-learn

Lag features and rolling statistics are powerful tools in time series forecasting. While Scikit-learn doesn't provide these natively, they can be engineered using pandas before feeding into models. These features help capture temporal dependencies, seasonality, and trends.

Key Characteristics

  • Lag features represent past observations
  • Rolling means smooth short-term fluctuations
  • Used in feature engineering for regression and classification
  • Improves model context over time

Basic Rules

  • Always shift or roll before training to prevent leakage
  • Drop NaNs after applying lag/rolling
  • Combine multiple lags and windows for better performance
  • Can be used in pipeline with FunctionTransformer

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Lag feature | df['lag1'] = df['value'].shift(1) | Adds previous time step as feature
2 | Multiple lags | df['lag3'] = df['value'].shift(3) | Adds value from 3 steps back
3 | Rolling mean | df['roll_mean_3'] = df['value'].rolling(3).mean() | Computes 3-step moving average
4 | Rolling std | df['roll_std_5'] = df['value'].rolling(5).std() | Rolling standard deviation
5 | Drop NaNs | df = df.dropna() | Removes rows with missing values

Syntax Explanation

1. Lag Feature

What is it?
Adds a new column that contains the value from one time step ago. This helps the model learn temporal dependencies between observations.

Syntax:

df['lag1'] = df['value'].shift(1)

Explanation:

  • Shifts the original series by 1 row to align each observation with its prior value.
  • Essential for converting time series to supervised learning.
  • Can stack several lags to build memory into the model.
  • Watch out for NaN at the start, which must be removed before training.

2. Multiple Lags

What is it?
Creates additional lagged features with greater gaps to capture longer temporal effects.

Syntax:

df['lag3'] = df['value'].shift(3)

Explanation:

  • Offers deeper historical context.
  • Improves learning of cyclic or weekly patterns.
  • Combine lag1, lag3, lag7, etc., to capture short-term and seasonal behavior.
  • Enables models to use multi-step historical dependencies as features.
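
One convenient pattern, assuming the series column is named 'value' as in the table above, is to build several lags in a loop.

# Create lag 1, 3, and 7 features in one pass (column names are illustrative)
for k in [1, 3, 7]:
    df[f'lag{k}'] = df['value'].shift(k)

df = df.dropna()   # the first 7 rows now contain NaNs and are dropped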

3. Rolling Mean

What is it?
Smooths out time series by averaging values over a sliding window.

Syntax:

df['roll_mean_3'] = df['value'].rolling(3).mean()

Explanation:

  • Calculates the average of current and previous 2 values (window=3).
  • Useful for trend extraction and smoothing noise.
  • Reduces impact of short-term fluctuations and sharp jumps.
  • Can be used directly or in combination with other features.

4. Rolling Standard Deviation

What is it?
Quantifies the variation or volatility within a sliding window of observations.

Syntax:

df['roll_std_5'] = df['value'].rolling(5).std()

Explanation:

  • Measures the degree of deviation from the mean within a window.
  • Helpful in modeling uncertainty and market volatility.
  • Can highlight periods of instability or abnormal behavior.
  • Use different window sizes to capture short- or long-term volatility.

5. Drop NaNs

What is it?
Removes all rows that contain NaN values, typically introduced by lag or rolling computations.

Syntax:

df = df.dropna()

Explanation:

  • Necessary cleanup step before model training.
  • Avoids errors when passing data to Scikit-learn models.
  • Drop only after all lag and rolling features have been added.
  • Alternatively, impute missing values if losing rows is unacceptable.

Real-Life Project: Temperature Forecasting with Lag + Rolling

Project Name

Enhanced Daily Temperature Forecast

Project Overview

Add lag and rolling mean features to improve prediction of daily temperatures.

Project Goal

Use lag and rolling statistics in a linear regression model.

Code for This Project

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Simulate data
dates = pd.date_range(start='2023-01-01', periods=100)
temps = np.random.normal(loc=25, scale=3, size=100)
df = pd.DataFrame({'date': dates, 'temp': temps})

# Create features
df['lag1'] = df['temp'].shift(1)
df['roll_mean_3'] = df['temp'].rolling(3).mean()
df['roll_std_3'] = df['temp'].rolling(3).std()
df = df.dropna()

X = df[['lag1', 'roll_mean_3', 'roll_std_3']]
y = df['temp']
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

Expected Output

  • Improved accuracy over lag-only model
  • Highlights benefit of combining lag and rolling features

Common Mistakes to Avoid

  • ❌ Using rolling without handling NaNs
  • ❌ Leaking future information by shifting incorrectly
  • ❌ Applying rolling after splitting data (must be before!)

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Working with Time Series Data in Scikit-learn

Scikit-learn is primarily designed for tabular data and doesn't natively support time series analysis. However, it can be adapted for time series forecasting and classification by carefully managing data splits and feature engineering. For advanced time series tasks, integration with pandas, statsmodels, or sktime is common.

Key Characteristics

  • Supports time series prediction using supervised learning format
  • Requires lag feature creation
  • Must avoid data leakage with proper temporal splits
  • Compatible with scikit-learn pipelines

Basic Rules

  • Never randomly split time series data (use chronological split)
  • Create lagged features to convert time series to supervised format
  • Always scale after splitting to avoid leakage
  • Consider time-aware cross-validation (e.g., TimeSeriesSplit)

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Chronological split | train_test_split(data, shuffle=False) | Maintains time order
2 | Lag feature creation | df['lag1'] = df['value'].shift(1) | Adds lagged version of a feature
3 | TimeSeriesSplit | TimeSeriesSplit(n_splits=5) | Splits time series for cross-validation
4 | Model training | model.fit(X_train, y_train) | Trains model on lagged features
5 | Forecasting | model.predict(X_test) | Predicts future values from past features

Syntax Explanation

1. Chronological Split

What is it?
Splits data in a way that respects temporal order. This is critical to prevent data leakage and ensure the model doesn't learn from future information.

Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, shuffle=False)

Explanation:

  • Ensures training happens only on past data
  • Maintains sequence integrity for forecasting
  • Avoids shuffling which would break time relationships

2. Lag Feature Creation

What is it?
Creates columns with shifted values of the original time series to simulate previous time steps.

Syntax:

df['lag1'] = df['value'].shift(1)

Explanation:

  • Converts time series into a supervised learning dataset
  • Lagged values act as predictors
  • Can add multiple lags (lag2, lag3) for better context
  • Be sure to drop NaN rows created by shifting

3. Time Series Cross-Validation

What is it?
Implements k-fold validation where the folds maintain time sequence. Useful when testing model consistency over time.

Syntax:

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])

Explanation:

  • Each split trains on older data and tests on newer data
  • No future data leaks into the past
  • Useful for evaluating stability across time periods

4. Model Training

What is it?
Fits a model on the lagged feature training set to learn time-based patterns.

Syntax:

model.fit(X_train, y_train)

Explanation:

  • Trains on features representing past values
  • Supports any supervised model: LinearRegression, RandomForest, etc.
  • Model learns how current outcomes relate to previous inputs

5. Forecasting Future Values

What is it?
Uses the trained model to generate predictions for future steps.

Syntax:

y_pred = model.predict(X_test)

Explanation:

  • Produces future values based on previously observed patterns
  • Often evaluated using metrics like MSE, MAE, RMSE
  • Can be extended for multi-step forecasting using recursive methods
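
The recursive idea can be sketched as follows; it assumes a model trained on a single lag1 feature, as in the project below, and is illustrative rather than production-ready.

import numpy as np

# Start from the most recent observed value and feed each prediction back in as the next lag
last_value = float(y_train.iloc[-1])        # assumes y_train is a pandas Series
forecasts = []
for _ in range(7):                          # forecast 7 steps ahead
    next_value = model.predict(np.array([[last_value]]))[0]
    forecasts.append(next_value)
    last_value = next_value
print(forecasts)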

Real-Life Project: Time Series Forecasting with Lag Features

Project Name

Daily Temperature Prediction

Project Overview

Forecast the next day's temperature using the previous day's temperature.

Project Goal

Train a linear regression model using lag features.

Code for This Project

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Simulate time series data
dates = pd.date_range(start='2023-01-01', periods=100)
temps = np.random.normal(loc=25, scale=3, size=100)
df = pd.DataFrame({'date': dates, 'temp': temps})
df['lag1'] = df['temp'].shift(1)
df = df.dropna()

X = df[['lag1']]
y = df['temp']
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

Expected Output

  • Mean Squared Error for one-step ahead forecasting
  • Demonstrates lag-based supervised learning pipeline

Common Mistakes to Avoid

  • ❌ Random shuffling of time series data
  • ❌ Using future information in lag features
  • ❌ Ignoring stationarity assumptions

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Real-World Dataset: Wine Classification in Scikit-learn

The Wine dataset is a classic multiclass classification dataset available in Scikit-learn. It contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The goal is to classify the wine based on 13 features such as alcohol content, ash, flavanoids, and more.

Key Characteristics

  • Multiclass classification problem (3 classes)
  • Target: Wine class labels (0, 1, 2)
  • Features: Alcohol, Malic acid, Ash, Flavanoids, etc.
  • Clean and well-structured dataset

Basic Rules

  • Standardize features before training
  • Use accuracy and confusion matrix for evaluation
  • Try different classifiers (Logistic Regression, KNN, SVM)
  • Use stratify=y to maintain class proportions

Syntax Table

SL NO | Step | Syntax Example | Description
1 | Load dataset | load_wine(return_X_y=True) | Loads wine features and class labels
2 | Train/test split | train_test_split(X, y, stratify=y, test_size=0.3) | Ensures balanced class split
3 | Standard scaling | StandardScaler().fit_transform(X_train) | Scales features
4 | Train classifier | LogisticRegression().fit(X_train, y_train) | Trains a classification model
5 | Evaluate model | confusion_matrix(y_test, y_pred) | Shows prediction correctness per class

Syntax Explanation

1. Load Dataset

What is it?
Loads the Wine dataset from Scikit-learn.

Syntax:

from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)

Explanation:

  • X contains 13 chemical features of wine samples
  • y contains the class labels (0, 1, 2)

2. Train/Test Split

What is it?
Divides the dataset into training and testing subsets.

Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

Explanation:

  • Maintains class proportions in train and test sets

3. Standard Scaling

What is it?
Applies normalization to the input features.

Syntax:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Explanation:

  • Prevents features with larger scales from dominating the model

4. Train Classifier

What is it?
Fits a logistic regression classifier on the wine data.

Syntax:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

Explanation:

  • Learns decision boundaries for each wine class
  • Logistic Regression supports multiclass classification

5. Evaluate Model

What is it?
Assesses the model performance with a confusion matrix.

Syntax:

from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

Explanation:

  • Shows how many instances were correctly or incorrectly classified

Real-Life Project: Wine Type Prediction

Project Name

Wine Quality Classifier

Project Overview

Classify wines into one of three types using their chemical properties.

Project Goal

Develop a model that accurately identifies the wine class based on input features.

Code for This Project

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Load data
X, y = load_wine(return_X_y=True)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict & Evaluate
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

Expected Output

  • Confusion matrix and accuracy score
  • High classification accuracy (typically >95%)

Common Mistakes to Avoid

  • ❌ Not scaling features before model training
  • ❌ Ignoring class imbalance in split
  • ❌ Using binary classifiers for multiclass problems

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon