Multi-Class Classification Strategies in Scikit-learn

Multi-class classification is a supervised learning task where the goal is to assign each input sample to one of three or more classes. Scikit-learn provides several strategies to handle multi-class problems, including One-vs-Rest (OvR), One-vs-One (OvO), and native multiclass classifiers like RandomForestClassifier or LogisticRegression.

Key Characteristics

  • Handles more than two class labels
  • Supports One-vs-Rest (OvR) and One-vs-One (OvO) strategies
  • Can use native classifiers or meta-estimators
  • Works with both linear and nonlinear models

Basic Rules

  • Choose OvR for high efficiency on large datasets
  • OvO may work better with models sensitive to class boundaries
  • Evaluate confusion matrix to understand per-class performance
  • Use stratified train-test split to ensure balanced class distribution

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | One-vs-Rest | OneVsRestClassifier(LogisticRegression()) | Trains one classifier per class vs all others
2 | One-vs-One | OneVsOneClassifier(SVC()) | Trains one classifier per class pair
3 | Native Support | RandomForestClassifier() | Native support for multi-class
4 | Fit Model | model.fit(X_train, y_train) | Trains the chosen classifier
5 | Predict Classes | model.predict(X_test) | Returns predicted class labels

Syntax Explanation

1. One-vs-Rest (OvR)

What is it?
A strategy that fits one classifier per class, where each classifier distinguishes a class from all others.

Syntax:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
model = OneVsRestClassifier(LogisticRegression())

Explanation:

  • Suitable for linear models
  • Efficient on large datasets
  • Each classifier outputs a confidence score; highest score wins
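
As a quick end-to-end illustration of the points above, the sketch below wraps LogisticRegression in OneVsRestClassifier and evaluates it on the built-in digits dataset; the dataset and split settings are illustrative, not prescriptive.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small multi-class dataset (10 digit classes)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# One binary classifier is fitted per class; prediction picks the class with the highest score
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
print("Fitted classifiers:", len(ovr.estimators_))   # one per class
print("Accuracy:", accuracy_score(y_test, ovr.predict(X_test)))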

2. One-vs-One (OvO)

What is it?
A strategy that fits one classifier per class pair.

Syntax:

from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
model = OneVsOneClassifier(SVC())

Explanation:

  • Builds N(N-1)/2 classifiers for N classes
  • Each classifier votes, and majority class wins
  • Effective when class boundaries are complex
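
To see the pairwise structure concretely, here is a minimal sketch using the built-in wine dataset purely as an example.

from sklearn.datasets import load_wine
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Wine has 3 classes, so OvO trains 3*(3-1)/2 = 3 pairwise classifiers
X, y = load_wine(return_X_y=True)
ovo = OneVsOneClassifier(SVC())
ovo.fit(X, y)
print("Pairwise classifiers:", len(ovo.estimators_))   # prints 3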

3. Native Multi-Class Classifier

What is it?
Classifiers like Random Forest and Logistic Regression inherently support multi-class classification.

Syntax:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

Explanation:

  • No need to wrap with OvR or OvO
  • Handles class imbalance and non-linearity well
  • Straightforward integration

4. Fit Model

What is it?
Trains the selected model on the labeled dataset.

Syntax:

model.fit(X_train, y_train)

Explanation:

  • Accepts feature matrix and label vector
  • Learns the decision boundaries between classes
  • Can be combined with grid search or pipelines

5. Predict Classes

What is it?
Predicts the class labels for unseen test data.

Syntax:

predictions = model.predict(X_test)

Explanation:

  • Produces an array of predicted class labels
  • Useful for accuracy, confusion matrix, or F1 score evaluations
  • Can be used in real-time prediction systems
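
The predicted labels can be passed straight into Scikit-learn's metric functions; the short sketch below assumes y_test and predictions exist from the earlier fit/predict steps.

from sklearn.metrics import confusion_matrix, f1_score

# Per-class breakdown of correct and incorrect predictions
print(confusion_matrix(y_test, predictions))
# Macro-averaged F1 treats every class equally, regardless of size
print("Macro F1:", f1_score(y_test, predictions, average='macro'))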

Real-Life Project: Digit Recognition (MNIST)

Project Overview

Use multi-class classification strategies to identify handwritten digits (0–9) from images.

Code Example

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3, random_state=42)

# Train model with native multi-class support
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Per-class precision, recall, and F1-scores
  • Overall accuracy of multi-class classifier

Common Mistakes to Avoid

  • ❌ Ignoring label distribution in train/test splits
  • ❌ Using binary classifiers without wrapping them in OvR or OvO
  • ❌ Not evaluating the model using class-specific metrics

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

One-Class SVM in Scikit-learn

One-Class Support Vector Machine (One-Class SVM) is an unsupervised anomaly detection algorithm that learns a decision function to separate normal data points from outliers. It is particularly effective for problems where only normal data is available for training.

Key Characteristics

  • Learns a boundary around normal data in feature space
  • Based on SVM principles with a special loss formulation
  • Separates the training data from the origin in the kernel-induced feature space (the origin acts as the prototype outlier)
  • Works well with smaller and moderately sized datasets

Basic Rules

  • Always scale data before fitting the model
  • Suitable when only normal examples are present in training
  • Use nu to control the proportion of outliers
  • kernel selection is crucial for performance (e.g., 'rbf')

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Initialize Model | OneClassSVM(kernel='rbf', nu=0.05) | Creates SVM with RBF kernel and anomaly ratio
2 | Fit Model | model.fit(X_train) | Learns the boundary from normal samples
3 | Predict | model.predict(X_test) | Returns -1 for anomalies, 1 for inliers
4 | Score Samples | model.decision_function(X_test) | Computes distance from decision boundary
5 | Use in Pipeline | Pipeline([...]) | Wraps One-Class SVM with preprocessing

Syntax Explanation

1. Initialize Model

What is it?
Creates a One-Class SVM model to detect anomalies using a kernel function.

Syntax:

from sklearn.svm import OneClassSVM
model = OneClassSVM(kernel='rbf', nu=0.05)

Explanation:

  • OneClassSVM() initializes the model using support vector machine formulation adapted for unsupervised outlier detection.
  • kernel='rbf' means the model uses a radial basis function kernel for nonlinear separation.
  • nu is a regularization parameter: it sets an upper bound on the fraction of training points treated as outliers (margin errors) and a lower bound on the fraction of support vectors.
  • Higher nu values allow more points to be considered outliers.
  • Kernel options like 'linear', 'sigmoid', and 'poly' can also be explored depending on the data shape.
  • Together, nu and the kernel choice determine the model's sensitivity to outliers and its ability to generalize.

2. Fit Model

What is it?
Trains the One-Class SVM on a dataset containing only inliers.

Syntax:

model.fit(X_train)

Explanation:

  • fit() builds the SVM model that defines the decision function separating normal from anomalous instances.
  • The method assumes that X_train consists only of normal observations.
  • Internally, the model tries to find a hypersphere or hyperplane that contains most of the data.
  • Proper scaling is crucial as SVMs are sensitive to feature magnitudes.
  • Always preprocess using StandardScaler or MinMaxScaler before fitting.
  • Training only on normal data makes the model learn the support boundary of normal class distribution.

3. Predict

What is it?
Uses the trained model to classify new data points as either normal or anomalous.

Syntax:

predictions = model.predict(X_test)

Explanation:

  • Predicts each sample in X_test as either 1 (normal) or -1 (anomaly).
  • Output can be used to trigger alerts, logs, or further inspection.
  • You can convert results to a binary fraud format using a simple list comprehension.
  • Helps in integrating with anomaly filtering pipelines or dashboards.
  • Effective in real-time monitoring systems.
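
As the last bullet suggests, the -1/1 output can be mapped to a 0/1 anomaly flag with a one-line list comprehension; this sketch reuses the model defined earlier and assumes X_test holds new observations.

# Map One-Class SVM output (1 = inlier, -1 = outlier) to a 0/1 flag
predictions = model.predict(X_test)
anomaly_flags = [1 if p == -1 else 0 for p in predictions]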

4. Score Samples

What is it?
Measures how far each test instance is from the learned boundary.

Syntax:

scores = model.decision_function(X_test)

Explanation:

  • This function returns real-valued scores: the more negative, the more likely a point is anomalous.
  • Use scores to set a threshold rather than relying on predict() for strict binary labels.
  • This allows fine-tuning for specific recall or precision targets in deployment.
  • Ideal for visualization (e.g., histograms of anomaly scores).
  • Helps build custom logic for business-critical anomaly definitions.
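
Building on the thresholding point above, the sketch below flags the lowest-scoring 1% of samples instead of relying on predict(); the 1% budget is an assumption you would tune for your own precision/recall targets.

import numpy as np

scores = model.decision_function(X_test)
threshold = np.quantile(scores, 0.01)              # assumed anomaly budget of 1%
custom_flags = (scores < threshold).astype(int)    # 1 = treated as an anomaly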

5. Use in Pipeline

What is it?
Embeds One-Class SVM into a pipeline along with preprocessing steps like scaling.

Syntax:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('svm', OneClassSVM(kernel='rbf', nu=0.05))
])
pipeline.fit(X_train)

Explanation:

  • Helps standardize the modeling workflow and reduce error risk.
  • Ensures the same scaler is applied during both training and inference.
  • Use in cross-validation or GridSearchCV to optimize parameters.
  • Simplifies deployment and automation for production environments.
  • Can integrate with more steps like PCA, feature selection, etc.
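
Once fitted, the pipeline is used exactly like a standalone model; this brief sketch assumes X_test holds new observations and that the scaler was fitted as part of pipeline.fit().

labels = pipeline.predict(X_test)             # 1 = inlier, -1 = outlier
scores = pipeline.decision_function(X_test)   # signed distance, computed after scaling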

Real-Life Project: Credit Card Fraud Detection

Project Overview

Use One-Class SVM to identify fraudulent transactions in a credit card dataset.

Code Example

import pandas as pd
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Load dataset
data = pd.read_csv('credit_card.csv')
X = data.drop(columns=['Class'])
y = data['Class']  # 0 = normal, 1 = fraud

# Use only normal transactions for training
X_train = X[y == 0]
X_test = X

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = OneClassSVM(kernel='rbf', nu=0.05)
model.fit(X_train_scaled)

# Predict
y_pred = model.predict(X_test_scaled)
y_pred = [1 if i == -1 else 0 for i in y_pred]  # Convert -1 (anomaly) to 1 (fraud)

print(classification_report(y, y_pred))

Expected Output

  • Improved detection of fraud class (1) using unsupervised method
  • Precision/recall varies based on nu and dataset size

Common Mistakes to Avoid

  • ❌ Forgetting to scale input data
  • ❌ Misunderstanding nu as contamination rate (it's an upper bound)
  • ❌ Using in supervised settings with labeled data (better to use classifiers there)

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Anomaly Detection Techniques using Scikit-learn

Anomaly detection is the process of identifying rare items, events, or observations that differ significantly from the majority of the data. Scikit-learn provides several models to perform unsupervised anomaly detection, including One-Class SVM, Isolation Forest, and Elliptic Envelope.

Key Characteristics

  • Detects outliers or rare events in datasets
  • Often used in fraud detection, network security, and monitoring systems
  • Works in unsupervised settings (without labeled data)
  • Sensitive to feature scaling and data distribution

Basic Rules

  • Normalize or standardize features before applying models
  • Use domain knowledge to validate detected anomalies
  • Evaluate using precision-recall or domain-specific metrics
  • Suitable for high-dimensional data when models are properly tuned

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | One-Class SVM | OneClassSVM(kernel='rbf', nu=0.1) | Learns a decision function for outliers
2 | Isolation Forest | IsolationForest(contamination=0.1) | Isolates anomalies based on tree splits
3 | Elliptic Envelope | EllipticEnvelope(contamination=0.1) | Assumes Gaussian distribution
4 | Fit Model | model.fit(X) | Trains on normal (inlier) data
5 | Predict Outliers | model.predict(X)  # -1 = outlier, 1 = inlier | Identifies anomalies in the dataset

Syntax Explanation

1. One-Class SVM

What is it? Learns a boundary that surrounds the inliers in feature space.

from sklearn.svm import OneClassSVM
model = OneClassSVM(kernel='rbf', nu=0.05)
  • nu controls the fraction of outliers
  • Sensitive to kernel choice and scaling

2. Isolation Forest

What is it? Randomly partitions data and isolates anomalies with fewer splits.

from sklearn.ensemble import IsolationForest
model = IsolationForest(contamination=0.1)
  • Works well with high-dimensional data
  • Fast and efficient for large datasets

3. Elliptic Envelope

What is it? Fits a Gaussian distribution to the dataset and detects outliers as points far from the center.

from sklearn.covariance import EllipticEnvelope
model = EllipticEnvelope(contamination=0.1)
  • Best for data with normal distribution
  • Requires features to be normally distributed

4. Fit the Model

model.fit(X_train)
  • Trains the model assuming X_train contains only inliers (normal cases)

5. Predict Outliers

pred = model.predict(X_test)  # -1 = anomaly, 1 = normal
  • Use output for further investigation or alert systems
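
To make the three detectors concrete, here is a small side-by-side sketch on synthetic data; the toy data and the 5% contamination setting are illustrative only.

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

# 200 roughly Gaussian points plus 5 obvious outliers far from the cluster
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.uniform(low=6, high=8, size=(5, 2))])

detectors = [
    ("One-Class SVM", OneClassSVM(kernel='rbf', nu=0.05)),
    ("Isolation Forest", IsolationForest(contamination=0.05, random_state=42)),
    ("Elliptic Envelope", EllipticEnvelope(contamination=0.05)),
]
for name, detector in detectors:
    labels = detector.fit(X).predict(X)   # -1 = outlier, 1 = inlier
    print(name, "flagged", int((labels == -1).sum()), "points")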

Real-Life Project: Detecting Fraudulent Transactions

Project Overview

Identify potentially fraudulent financial transactions using Isolation Forest.

Code Example

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Load dataset
df = pd.read_csv('transactions.csv')
X = df.drop(columns=['is_fraud'])
y = df['is_fraud']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit model
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X_scaled)

# Predict
y_pred = model.predict(X_scaled)
y_pred = [1 if p == -1 else 0 for p in y_pred]  # convert -1 to 1 (fraud)

print(classification_report(y, y_pred))

Expected Output

  • High recall for fraud class (1)
  • Balanced precision depending on contamination setting

Common Mistakes to Avoid

  • ❌ Using raw unscaled data
  • ❌ Ignoring contamination rate tuning
  • ❌ Applying to labeled supervised data (better handled with classifiers)

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

SMOTE for Data Balancing in Scikit-learn

SMOTE (Synthetic Minority Oversampling Technique) is a popular method to address class imbalance by generating synthetic examples of the minority class. Unlike random oversampling, which duplicates data, SMOTE synthesizes new, plausible samples by interpolating between existing minority class samples.

Key Characteristics

  • Generates synthetic minority class samples
  • Reduces risk of overfitting compared to random oversampling
  • Enhances model performance on imbalanced datasets
  • Works only on numeric features (requires preprocessing for categorical data)

Basic Rules

  • Use after train/test split to avoid data leakage
  • Scale features if base model is sensitive to distance metrics
  • Combine with under-sampling or ensemble models for better results
  • Evaluate performance using recall and F1-score

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Initialize SMOTE | SMOTE() | Prepares the SMOTE instance
2 | Fit and Resample | X_res, y_res = SMOTE().fit_resample(X, y) | Generates new synthetic samples
3 | Custom Strategy | SMOTE(sampling_strategy=0.5) | Balances classes with specific ratio
4 | K Neighbors | SMOTE(k_neighbors=3) | Changes the number of neighbors for SMOTE
5 | Use in Pipeline | Pipeline([('smote', SMOTE()), ('clf', model)]) | Applies SMOTE inside training pipeline

Syntax Explanation

1. Initialize SMOTE

What is it?
Creates a SMOTE instance for resampling.

Syntax:

from imblearn.over_sampling import SMOTE
smote = SMOTE()

Explanation:

  • This line initializes the SMOTE object using default settings.
  • It's essential before applying fit_resample().
  • You can later modify hyperparameters like sampling_strategy, k_neighbors, etc.
  • Default settings create a fully balanced dataset by oversampling the minority class to match the majority.
  • Ensure the data is numeric; otherwise, you may need SMOTENC or preprocessing.

2. Fit and Resample

What is it?
Generates synthetic minority class samples and returns balanced features and labels.

Syntax:

X_res, y_res = smote.fit_resample(X_train, y_train)

Explanation:

  • Trains the SMOTE model on your training dataset.
  • Returns the new feature set X_res and labels y_res with increased minority instances.
  • The newly generated samples are not simply duplicated; they are interpolated using nearest neighbors.
  • This step is crucial and should only be applied to training data after splitting to avoid data leakage.
  • Can be combined with under-sampling methods in pipelines for better balance.
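
A quick sanity check, assuming X_train and y_train come from an imbalanced split, is to compare the class counts before and after resampling.

from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before:", Counter(y_train))
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("After:", Counter(y_res))   # minority class oversampled to match the majority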

3. Custom Sampling Strategy

What is it?
Specifies the desired class distribution ratio.

Syntax:

smote = SMOTE(sampling_strategy=0.5)

Explanation:

  • This will resample the minority class until it's 50% the size of the majority class.
  • Instead of a fixed number, this is a float-based ratio between 0 and 1.
  • Alternatively, sampling_strategy can be a dictionary {class_label: target_count}.
  • Useful for partial balancing when full parity is not ideal.
  • Helps tailor oversampling to specific business or risk constraints.
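
For reference, the dictionary form might look like the sketch below; the class label and target count are hypothetical.

from imblearn.over_sampling import SMOTE

# Request an explicit target count for class 1 (label and count are illustrative)
smote = SMOTE(sampling_strategy={1: 500})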

4. Adjusting K Neighbors

What is it?
Modifies how SMOTE interpolates new samples using nearest neighbors.

Syntax:

smote = SMOTE(k_neighbors=3)

Explanation:

  • k_neighbors determines how many neighboring minority samples are used for interpolation.
  • Lower values create tighter clusters (less diversity), higher values increase synthetic variability.
  • Default is 5; reducing it may help with extremely small datasets.
  • Should be tuned based on dataset size and distribution.
  • k_neighbors can also accept a pre-configured nearest-neighbors estimator instead of an integer, as shown in the sketch below.
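
A minimal sketch of that option, with all other SMOTE settings left at their defaults:

from sklearn.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE

# Pass a pre-configured neighbor search object instead of an integer
smote = SMOTE(k_neighbors=NearestNeighbors(n_neighbors=3))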

5. Using SMOTE in a Pipeline

What is it?
Combines SMOTE and model into a single reproducible training workflow.

Syntax:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('smote', SMOTE()),
    ('clf', LogisticRegression())
])

Explanation:

  • This ensures SMOTE is only applied to the training set during cross-validation.
  • Prevents information leakage across folds.
  • Makes it easy to use GridSearchCV or cross_val_score safely.
  • Can include preprocessing steps like StandardScaler() before modeling.
  • Highly recommended when doing repeated evaluations or deploying models.
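
For example, assuming the pipeline above plus a feature matrix X and a binary label vector y, cross-validation applies SMOTE only inside each training fold.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print("Mean F1 across folds:", scores.mean())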

Real-Life Project: SMOTE for Loan Default Prediction

Project Name

Loan Default Classification

Project Overview

Predict whether a customer will default using SMOTE to balance the dataset.

Project Goal

Improve recall on the default (minority) class.

Code for This Project

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
import pandas as pd

# Load data
X = pd.read_csv('loan_features.csv')
y = pd.read_csv('loan_labels.csv').values.ravel()

# Split
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=test_size, random_state=42)

# Apply SMOTE
sm = SMOTE(k_neighbors=4, sampling_strategy='auto')
X_res, y_res = sm.fit_resample(X_train, y_train)

# Train model
model = LogisticRegression(class_weight='balanced')
model.fit(X_res, y_res)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Improved recall and F1-score for the minority class (loan defaults)
  • Classification report showing balanced performance across classes

Common Mistakes to Avoid

  • ❌ Applying SMOTE before train-test split (leads to data leakage)
  • ❌ Using with categorical features without encoding (SMOTE requires numeric input)
  • ❌ Ignoring feature scaling when using distance-based classifiers

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Handling Imbalanced Data with Scikit-learn

Imbalanced datasets occur when one class significantly outweighs others, often leading to biased models. Scikit-learn offers tools and strategies to address class imbalance through resampling, algorithmic adjustments, and evaluation metrics.

Key Characteristics

  • Target variable has skewed class distribution
  • Causes poor recall for minority classes
  • Needs special preprocessing or model adjustments
  • Affects classification more than regression

Basic Rules

  • Never evaluate solely with accuracy
  • Use stratified splits during training
  • Always monitor precision, recall, and F1-score
  • Apply techniques like resampling or class weighting

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Class Weighting | LogisticRegression(class_weight='balanced') | Penalizes majority class
2 | SMOTE Oversampling | SMOTE().fit_resample(X, y) | Synthesizes new minority samples
3 | Random Under-sampling | RandomUnderSampler().fit_resample(X, y) | Removes samples from majority class
4 | Stratified Split | StratifiedKFold(n_splits=5) | Ensures class proportions in folds
5 | Classification Report | classification_report(y_true, y_pred) | Evaluates recall, precision, and F1-score

Syntax Explanation

1. Class Weighting

What is it?
Adjusts the loss function to penalize misclassification of minority classes.

Syntax:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')

Explanation:

  • class_weight='balanced' adjusts class weights inversely proportional to class frequencies in the data.
  • This setting tells the model to pay more attention to minority class samples.
  • Can also pass a dictionary with custom weights, e.g., class_weight={0: 1, 1: 5}.
  • Available in various models like RandomForestClassifier, SVC, and DecisionTreeClassifier.
  • Helps reduce bias toward majority class without modifying the dataset.
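
A short sketch of the custom-dictionary form, using RandomForestClassifier and illustrative weights:

from sklearn.ensemble import RandomForestClassifier

# Errors on class 1 are penalized five times more heavily than errors on class 0
model = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=42)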

2. SMOTE Oversampling

What is it?
Synthetic Minority Oversampling Technique creates synthetic samples from the minority class.

Syntax:

from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X, y)

Explanation:

  • SMOTE creates new synthetic instances by interpolating between existing minority class instances.
  • It helps balance the dataset and prevent overfitting from simple duplication.
  • fit_resample returns a new feature matrix and target vector.
  • Can be customized using k_neighbors, sampling_strategy, and other parameters.
  • Part of the imbalanced-learn (imblearn) package, which must be installed separately.

3. Random Under-sampling

What is it?
Reduces class imbalance by randomly removing samples from the majority class.

Syntax:

from imblearn.under_sampling import RandomUnderSampler
X_res, y_res = RandomUnderSampler().fit_resample(X, y)

Explanation:

  • This method drops samples from the majority class to match the size of the minority class.
  • Helps simplify the dataset and reduce training time.
  • Can lead to information loss if not used carefully.
  • Works well when you have abundant data and want faster training.
  • Combine with SMOTE in a pipeline using Pipeline() from imblearn.pipeline for optimal performance.

4. Stratified Split

What is it?
Ensures each fold has the same class distribution as the original dataset.

Syntax:

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)

Explanation:

  • Creates train/test splits such that each fold maintains the original class distribution.
  • Prevents imbalance during cross-validation which can skew results.
  • Use .split(X, y) to generate train/test indices.
  • Commonly used with cross_val_score or custom CV loops.
  • Also available as StratifiedShuffleSplit for randomized splitting.
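
A typical pattern, assuming X and y are already defined and the target is binary, combines StratifiedKFold with cross_val_score.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(class_weight='balanced', max_iter=1000)
print(cross_val_score(model, X, y, cv=skf, scoring='f1'))   # one F1 score per fold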

5. Classification Report

What is it?
Displays precision, recall, F1-score, and support for each class.

Syntax:

from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

Explanation:

  • precision = TP / (TP + FP): focus on positive prediction correctness.
  • recall = TP / (TP + FN): focus on identifying all relevant samples.
  • f1-score = harmonic mean of precision and recall.
  • support = number of true samples for each label.
  • Particularly helpful to monitor minority class performance, which may have poor recall in imbalanced settings.
  • Should be used in conjunction with a confusion matrix for full clarity.

Real-Life Project: Fraud Detection with Imbalanced Data

Project Name

Credit Card Fraud Classification

Project Overview

Detect fraudulent transactions from highly imbalanced financial dataset.

Project Goal

Use oversampling and evaluation metrics to identify fraud effectively.

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Simulated example dataset
X = pd.read_csv('features.csv')
y = pd.read_csv('labels.csv').values.ravel()

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Apply SMOTE
smote = SMOTE()
X_res, y_res = smote.fit_resample(X_train, y_train)

# Train model
model = LogisticRegression(class_weight='balanced')
model.fit(X_res, y_res)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Higher recall on minority class (fraud)
  • Balanced F1-scores across both classes

Common Mistakes to Avoid

  • ❌ Relying only on accuracy
  • ❌ Not stratifying during split or validation
  • ❌ Oversampling before splitting (leads to leakage)
  • ❌ Ignoring class imbalance in metrics

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Time Series Cross-Validation in Scikit-learn

Traditional k-fold cross-validation is not suitable for time series data due to the temporal dependency between observations. Instead, Scikit-learn provides TimeSeriesSplit, a strategy that preserves order and prevents leakage by ensuring that the training set always precedes the test set chronologically.

Key Characteristics

  • Maintains chronological order in splits
  • Avoids training on future data
  • Useful for evaluating time-based model stability
  • Supports consistent model validation in rolling or expanding windows

Basic Rules

  • Always split sequentially, not randomly
  • Training set must precede test set
  • Use consistent time intervals
  • Ideal for univariate and multivariate time series tasks

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Initialize split | TimeSeriesSplit(n_splits=5) | Creates time-ordered cross-validation sets
2 | Access splits | for train_idx, test_idx in tscv.split(X): ... | Iterates through each CV fold
3 | Train model | model.fit(X[train_idx], y[train_idx]) | Trains on training portion of each split
4 | Evaluate model | model.predict(X[test_idx]) | Evaluates on time-valid test data
5 | Visualization | plt.plot(train_idx), plt.plot(test_idx) | Useful to understand fold composition

Syntax Explanation

1. Initialize TimeSeriesSplit

What is it?
Creates a cross-validator that provides train/test indices in time-ordered folds.

Syntax:

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

Explanation:

  • n_splits=5 produces five sequential train/test splits.
  • The training window expands by default, so each later fold trains on more historical data.
  • Ideal for rolling window validation to mimic real-time prediction environments.
  • You can also customize the max_train_size parameter to limit how large the training set grows.
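
A tiny sketch makes the expanding/rolling behaviour visible; the 20-point series and the max_train_size value are arbitrary.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)                       # toy series with 20 time steps
tscv = TimeSeriesSplit(n_splits=4, max_train_size=10)  # rolling window of at most 10 points
for train_idx, test_idx in tscv.split(X):
    print("train", train_idx.min(), "-", train_idx.max(),
          "| test", test_idx.min(), "-", test_idx.max())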

2. Access Splits

What is it?
Extracts the train and test indices from each fold using a for loop.

Syntax:

for train_idx, test_idx in tscv.split(X):
    print("Train indices:", train_idx, "Test indices:", test_idx)

Explanation:

  • Iterates over each fold and provides integer indices for slicing.
  • Ensures test data follows training data in time.
  • Very helpful for debugging, logging, and visualizing the sequence of training and testing.
  • Each iteration updates the model using increasingly more historical data.

3. Train Model

What is it?
Fits the model to the current training set for the current fold.

Syntax:

model.fit(X[train_idx], y[train_idx])

Explanation:

  • Ensures training is only done using past observations.
  • This loop enables robust evaluation of model performance across various time splits.
  • Supports all Scikit-learn estimators (LinearRegression, SVR, Ridge, etc.)
  • For pipelines, use: pipeline.fit(X[train_idx], y[train_idx])

4. Evaluate Model

What is it?
Generates predictions on the future (test) portion of the time series.

Syntax:

y_pred = model.predict(X[test_idx])

Explanation:

  • Makes one-step ahead (or multi-step if structured) predictions.
  • Should compare predictions with y[test_idx] using evaluation metrics like RMSE, MAE, or MAPE.
  • Important for simulating how a deployed model would perform on unseen data.
  • You can log each fold's score or average them at the end.

5. Visualization of Splits

What is it?
Optional step to plot how splits are formed over time.

Syntax:

import matplotlib.pyplot as plt
plt.plot(train_idx, label='Train')
plt.plot(test_idx, label='Test')
plt.legend()

Explanation:

  • Great for checking how data is partitioned visually.
  • Confirms model never sees future data during training.
  • Helps ensure correct fold structure for reproducibility.
  • Can reveal issues like short test folds or improper sequences.

Real-Life Project: Time Series CV for Stock Forecasting

Project Name

Sequential Cross-Validation for Stock Returns

Project Overview

Use time series split to validate a regression model predicting next-day stock returns.

Project Goal

Implement walk-forward validation using TimeSeriesSplit.

Code for This Project

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Simulated data
df = pd.DataFrame({
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
    'target': np.random.randn(100)
})

X = df[['feature1', 'feature2']].values
y = df['target'].values

# Initialize split
tscv = TimeSeriesSplit(n_splits=5)

# Run CV
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = Ridge()
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    print(f"Fold {fold + 1} MSE:", mean_squared_error(y[test_idx], y_pred))

Expected Output

  • Fold-wise MSE printed
  • Performance consistency across time folds

Common Mistakes to Avoid

  • ❌ Using KFold instead of TimeSeriesSplit
  • ❌ Training on future data (data leakage)
  • ❌ Not scaling after train/test split (if needed)

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Lag Features and Rolling Means with Scikit-learn

Lag features and rolling statistics are powerful tools in time series forecasting. While Scikit-learn doesn't provide these natively, they can be engineered using pandas before feeding into models. These features help capture temporal dependencies, seasonality, and trends.

Key Characteristics

  • Lag features represent past observations
  • Rolling means smooth short-term fluctuations
  • Used in feature engineering for regression and classification
  • Improves model context over time

Basic Rules

  • Always shift or roll before training to prevent leakage
  • Drop NaNs after applying lag/rolling
  • Combine multiple lags and windows for better performance
  • Can be used in pipeline with FunctionTransformer

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Lag feature | df['lag1'] = df['value'].shift(1) | Adds previous time step as feature
2 | Multiple lags | df['lag3'] = df['value'].shift(3) | Adds value from 3 steps back
3 | Rolling mean | df['roll_mean_3'] = df['value'].rolling(3).mean() | Computes 3-step moving average
4 | Rolling std | df['roll_std_5'] = df['value'].rolling(5).std() | Rolling standard deviation
5 | Drop NaNs | df = df.dropna() | Removes rows with missing values

Syntax Explanation

1. Lag Feature

What is it?
Adds a new column that contains the value from one time step ago. This helps the model learn temporal dependencies between observations.

Syntax:

df['lag1'] = df['value'].shift(1)

Explanation:

  • Shifts the original series by 1 row to align each observation with its prior value.
  • Essential for converting time series to supervised learning.
  • Can stack several lags to build memory into the model.
  • Watch out for NaN at the start, which must be removed before training.

2. Multiple Lags

What is it?
Creates additional lagged features with greater gaps to capture longer temporal effects.

Syntax:

df['lag3'] = df['value'].shift(3)

Explanation:

  • Offers deeper historical context.
  • Improves learning of cyclic or weekly patterns.
  • Combine lag1, lag3, lag7, etc., to capture short-term and seasonal behavior.
  • Enables models to use multi-step historical dependencies as features.
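
One convenient pattern, assuming the series column is named 'value' as in the table above, is to build several lags in a loop.

# Create lag 1, 3, and 7 features in one pass (column names are illustrative)
for k in [1, 3, 7]:
    df[f'lag{k}'] = df['value'].shift(k)

df = df.dropna()   # the first 7 rows now contain NaNs and are dropped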

3. Rolling Mean

What is it?
Smooths out time series by averaging values over a sliding window.

Syntax:

df['roll_mean_3'] = df['value'].rolling(3).mean()

Explanation:

  • Calculates the average of current and previous 2 values (window=3).
  • Useful for trend extraction and smoothing noise.
  • Reduces impact of short-term fluctuations and sharp jumps.
  • Can be used directly or in combination with other features.

4. Rolling Standard Deviation

What is it?
Quantifies the variation or volatility within a sliding window of observations.

Syntax:

df['roll_std_5'] = df['value'].rolling(5).std()

Explanation:

  • Measures the degree of deviation from the mean within a window.
  • Helpful in modeling uncertainty and market volatility.
  • Can highlight periods of instability or abnormal behavior.
  • Use different window sizes to capture short- or long-term volatility.

5. Drop NaNs

What is it?
Removes all rows that contain NaN values, typically introduced by lag or rolling computations.

Syntax:

df = df.dropna()

Explanation:

  • Necessary cleanup step before model training.
  • Avoids errors when passing data to Scikit-learn models.
  • Drop only after all lag and rolling features have been added.
  • Alternatively, impute missing values if losing rows is unacceptable.

Real-Life Project: Temperature Forecasting with Lag + Rolling

Project Name

Enhanced Daily Temperature Forecast

Project Overview

Add lag and rolling mean features to improve prediction of daily temperatures.

Project Goal

Use lag and rolling statistics in a linear regression model.

Code for This Project

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Simulate data
dates = pd.date_range(start='2023-01-01', periods=100)
temps = np.random.normal(loc=25, scale=3, size=100)
df = pd.DataFrame({'date': dates, 'temp': temps})

# Create features
df['lag1'] = df['temp'].shift(1)
df['roll_mean_3'] = df['temp'].rolling(3).mean()
df['roll_std_3'] = df['temp'].rolling(3).std()
df = df.dropna()

X = df[['lag1', 'roll_mean_3', 'roll_std_3']]
y = df['temp']
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

Expected Output

  • Improved accuracy over lag-only model
  • Highlights benefit of combining lag and rolling features

Common Mistakes to Avoid

  • ❌ Using rolling without handling NaNs
  • ❌ Leaking future information by shifting incorrectly
  • ❌ Applying rolling after splitting data (must be before!)

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Working with Time Series Data in Scikit-learn

Scikit-learn is primarily designed for tabular data and doesn't natively support time series analysis. However, it can be adapted for time series forecasting and classification by carefully managing data splits and feature engineering. For advanced time series tasks, integration with pandas, statsmodels, or sktime is common.

Key Characteristics

  • Supports time series prediction using supervised learning format
  • Requires lag feature creation
  • Must avoid data leakage with proper temporal splits
  • Compatible with scikit-learn pipelines

Basic Rules

  • Never randomly split time series data (use chronological split)
  • Create lagged features to convert time series to supervised format
  • Always scale after splitting to avoid leakage
  • Consider time-aware cross-validation (e.g., TimeSeriesSplit)

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Chronological split | train_test_split(data, shuffle=False) | Maintains time order
2 | Lag feature creation | df['lag1'] = df['value'].shift(1) | Adds lagged version of a feature
3 | TimeSeriesSplit | TimeSeriesSplit(n_splits=5) | Splits time series for cross-validation
4 | Model training | model.fit(X_train, y_train) | Trains model on lagged features
5 | Forecasting | model.predict(X_test) | Predicts future values from past features

Syntax Explanation

1. Chronological Split

What is it?
Splits data in a way that respects temporal order. This is critical to prevent data leakage and ensure the model doesn't learn from future information.

Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, shuffle=False)

Explanation:

  • Ensures training happens only on past data
  • Maintains sequence integrity for forecasting
  • Avoids shuffling which would break time relationships

2. Lag Feature Creation

What is it?
Creates columns with shifted values of the original time series to simulate previous time steps.

Syntax:

df['lag1'] = df['value'].shift(1)

Explanation:

  • Converts time series into a supervised learning dataset
  • Lagged values act as predictors
  • Can add multiple lags (lag2, lag3) for better context
  • Be sure to drop NaN rows created by shifting

3. Time Series Cross-Validation

What is it?
Implements k-fold validation where the folds maintain time sequence. Useful when testing model consistency over time.

Syntax:

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])

Explanation:

  • Each split trains on older data and tests on newer data
  • No future data leaks into the past
  • Useful for evaluating stability across time periods

4. Model Training

What is it?
Fits a model on the lagged feature training set to learn time-based patterns.

Syntax:

model.fit(X_train, y_train)

Explanation:

  • Trains on features representing past values
  • Supports any supervised model: LinearRegression, RandomForest, etc.
  • Model learns how current outcomes relate to previous inputs

5. Forecasting Future Values

What is it?
Uses the trained model to generate predictions for future steps.

Syntax:

y_pred = model.predict(X_test)

Explanation:

  • Produces future values based on previously observed patterns
  • Often evaluated using metrics like MSE, MAE, RMSE
  • Can be extended for multi-step forecasting using recursive methods
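
The recursive idea can be sketched as follows; it assumes a model trained on a single lag1 feature, as in the project below, and is illustrative rather than production-ready.

import numpy as np

# Start from the most recent observed value and feed each prediction back in as the next lag
last_value = float(y_train.iloc[-1])        # assumes y_train is a pandas Series
forecasts = []
for _ in range(7):                          # forecast 7 steps ahead
    next_value = model.predict(np.array([[last_value]]))[0]
    forecasts.append(next_value)
    last_value = next_value
print(forecasts)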

Real-Life Project: Time Series Forecasting with Lag Features

Project Name

Daily Temperature Prediction

Project Overview

Forecast the next day's temperature using the previous day's temperature.

Project Goal

Train a linear regression model using lag features.

Code for This Project

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Simulate time series data
dates = pd.date_range(start='2023-01-01', periods=100)
temps = np.random.normal(loc=25, scale=3, size=100)
df = pd.DataFrame({'date': dates, 'temp': temps})
df['lag1'] = df['temp'].shift(1)
df = df.dropna()

X = df[['lag1']]
y = df['temp']
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

Expected Output

  • Mean Squared Error for one-step ahead forecasting
  • Demonstrates lag-based supervised learning pipeline

Common Mistakes to Avoid

  • ❌ Random shuffling of time series data
  • ❌ Using future information in lag features
  • ❌ Ignoring stationarity assumptions

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon

Real-World Dataset: Wine Classification in Scikit-learn

The Wine dataset is a classic multiclass classification dataset available in Scikit-learn. It contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The goal is to classify the wine based on 13 features such as alcohol content, ash, flavanoids, and more.

Key Characteristics

  • Multiclass classification problem (3 classes)
  • Target: Wine class labels (0, 1, 2)
  • Features: Alcohol, Malic acid, Ash, Flavanoids, etc.
  • Clean and well-structured dataset

Basic Rules

  • Standardize features before training
  • Use accuracy and confusion matrix for evaluation
  • Try different classifiers (Logistic Regression, KNN, SVM)
  • Use stratify=y to maintain class proportions

Syntax Table

SL NO | Step | Syntax Example | Description
1 | Load dataset | load_wine(return_X_y=True) | Loads wine features and class labels
2 | Train/test split | train_test_split(X, y, stratify=y, test_size=0.3) | Ensures balanced class split
3 | Standard scaling | StandardScaler().fit_transform(X_train) | Scales features
4 | Train classifier | LogisticRegression().fit(X_train, y_train) | Trains a classification model
5 | Evaluate model | confusion_matrix(y_test, y_pred) | Shows prediction correctness per class

Syntax Explanation

1. Load Dataset

What is it?
Loads the Wine dataset from Scikit-learn.

Syntax:

from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)

Explanation:

  • X contains 13 chemical features of wine samples
  • y contains the class labels (0, 1, 2)

2. Train/Test Split

What is it?
Divides the dataset into training and testing subsets.

Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

Explanation:

  • Maintains class proportions in train and test sets

3. Standard Scaling

What is it?
Applies normalization to the input features.

Syntax:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Explanation:

  • Prevents features with larger scales from dominating the model

4. Train Classifier

What is it?
Fits a logistic regression classifier on the wine data.

Syntax:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

Explanation:

  • Learns decision boundaries for each wine class
  • Logistic Regression supports multiclass classification

5. Evaluate Model

What is it?
Assesses the model performance with a confusion matrix.

Syntax:

from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

Explanation:

  • Shows how many instances were correctly or incorrectly classified

Real-Life Project: Wine Type Prediction

Project Name

Wine Quality Classifier

Project Overview

Classify wines into one of three types using their chemical properties.

Project Goal

Develop a model that accurately identifies the wine class based on input features.

Code for This Project

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Load data
X, y = load_wine(return_X_y=True)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict & Evaluate
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

Expected Output

  • Confusion matrix and accuracy score
  • High classification accuracy (typically >95%)

Common Mistakes to Avoid

  • ❌ Not scaling features before model training
  • ❌ Ignoring class imbalance in split
  • ❌ Using binary classifiers for multiclass problems

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon