Handling Imbalanced Data with Scikit-learn

Imbalanced datasets occur when one class significantly outnumbers the others, often leading to models biased toward the majority class. Scikit-learn, together with the companion imbalanced-learn (imblearn) package, offers tools and strategies to address class imbalance through resampling, algorithmic adjustments, and appropriate evaluation metrics.

Key Characteristics

  • Target variable has skewed class distribution
  • Causes poor recall for minority classes
  • Needs special preprocessing or model adjustments
  • Affects classification more than regression

Basic Rules

  • Never evaluate solely with accuracy
  • Use stratified splits during training
  • Always monitor precision, recall, and F1-score
  • Apply techniques like resampling or class weighting

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Class Weighting | LogisticRegression(class_weight='balanced') | Weights classes inversely to their frequencies
2 | SMOTE Oversampling | SMOTE().fit_resample(X, y) | Synthesizes new minority-class samples
3 | Random Under-sampling | RandomUnderSampler().fit_resample(X, y) | Removes samples from the majority class
4 | Stratified Split | StratifiedKFold(n_splits=5) | Preserves class proportions in each fold
5 | Classification Report | classification_report(y_true, y_pred) | Reports precision, recall, F1-score, and support per class

Syntax Explanation

1. Class Weighting

What is it?
Adjusts the loss function so that misclassifying minority-class samples is penalized more heavily.

Syntax:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')

Explanation:

  • class_weight='balanced' adjusts class weights inversely proportional to class frequencies in the data.
  • This setting tells the model to pay more attention to minority class samples.
  • Can also pass a dictionary with custom weights, e.g., class_weight={0: 1, 1: 5} (see the sketch after this list).
  • Available in various models like RandomForestClassifier, SVC, and DecisionTreeClassifier.
  • Helps reduce bias toward majority class without modifying the dataset.
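
As a quick sketch, the snippet below shows both the 'balanced' setting and a custom weight dictionary on a synthetic dataset from make_classification; the dataset and the 5:1 weighting are purely illustrative assumptions, not values taken from this chapter's project.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Custom dictionary: errors on the minority class (label 1) count five times as much
clf = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000)
clf.fit(X, y)

# The same keyword is available in tree-based models
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
rf.fit(X, y)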

2. SMOTE Oversampling

What is it?
Synthetic Minority Oversampling Technique creates synthetic samples from the minority class.

Syntax:

from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X, y)

Explanation:

  • SMOTE creates new synthetic instances by interpolating between existing minority class instances.
  • It helps balance the dataset and prevent overfitting from simple duplication.
  • fit_resample returns a new feature matrix and target vector.
  • Can be customized using k_neighbors, sampling_strategy, and other parameters (see the sketch after this list).
  • Part of the imbalanced-learn (imblearn) package, which must be installed separately.
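
A minimal sketch of a customized SMOTE call, again on an illustrative make_classification dataset; the sampling_strategy=0.5 and k_neighbors=5 values are arbitrary choices for demonstration.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Illustrative imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))  # original class counts

# Oversample the minority class up to 50% of the majority class size,
# interpolating between each minority sample and its 5 nearest neighbours
smote = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))  # counts after resampling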

3. Random Under-sampling

What is it?
Reduces class imbalance by randomly removing samples from the majority class.

Syntax:

from imblearn.under_sampling import RandomUnderSampler
X_res, y_res = RandomUnderSampler().fit_resample(X, y)

Explanation:

  • This method drops samples from the majority class to match the size of the minority class.
  • Helps simplify the dataset and reduce training time.
  • Can lead to information loss if not used carefully.
  • Works well when you have abundant data and want faster training.
  • Combining under-sampling with SMOTE in a Pipeline() from imblearn.pipeline can give better results than either technique alone (see the sketch after this list).
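
One possible way to chain over-sampling, under-sampling, and a classifier with imblearn's Pipeline is sketched below; the ratios (0.5 and 0.8) and the synthetic dataset are illustrative assumptions, and the resampling steps are applied only when the pipeline is fitted, not when it predicts.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# First grow the minority class to half the majority size,
# then shrink the majority class until the ratio is 0.8, then fit the model
pipeline = Pipeline(steps=[
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),
    ('under', RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
    ('model', LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)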

4. Stratified Split

What is it?
Ensures each fold has the same class distribution as the original dataset.

Syntax:

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)

Explanation:

  • Creates train/test splits such that each fold maintains the original class distribution.
  • Prevents folds with skewed class proportions, which can distort cross-validation results.
  • Use .split(X, y) to generate train/test indices.
  • Commonly used with cross_val_score or custom CV loops (see the sketch after this list).
  • Also available as StratifiedShuffleSplit for randomized splitting.
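
A short sketch of stratified cross-validation with cross_val_score is shown below; the synthetic dataset and the choice of F1 as the scoring metric are assumptions made for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Each of the 5 folds keeps roughly the same class proportions as y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(class_weight='balanced', max_iter=1000)

# Score with F1 on the positive (minority) class instead of plain accuracy
scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
print(scores.mean())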

5. Classification Report

What is it?
Displays precision, recall, F1-score, and support for each class.

Syntax:

from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

Explanation:

  • precision = TP / (TP + FP): focus on positive prediction correctness.
  • recall = TP / (TP + FN): focus on identifying all relevant samples.
  • f1-score = harmonic mean of precision and recall.
  • support = number of true samples for each label.
  • Particularly helpful for monitoring minority-class performance, which often suffers from poor recall in imbalanced settings.
  • Should be used alongside a confusion matrix for a complete picture (see the sketch after this list).
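
A small sketch combining the two reports on a synthetic dataset is given below; the model and data here are illustrative assumptions, not part of the fraud project that follows.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, F1-score, and support
print(classification_report(y_test, y_pred))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))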

Real-Life Project: Fraud Detection with Imbalanced Data

Project Name

Credit Card Fraud Classification

Project Overview

Detect fraudulent transactions in a highly imbalanced financial dataset.

Project Goal

Use oversampling and evaluation metrics to identify fraud effectively.

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Load features and labels (assumes features.csv and labels.csv are available)
X = pd.read_csv('features.csv')
y = pd.read_csv('labels.csv').values.ravel()

# Stratified split keeps the fraud ratio consistent between train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Apply SMOTE to the training set only, so synthetic samples never leak into the test set
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

# Train a class-weighted logistic regression on the resampled data
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_res, y_res)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Higher recall on the minority class (fraud)
  • More balanced F1-scores across the two classes

Common Mistakes to Avoid

  • ❌ Relying only on accuracy
  • ❌ Not stratifying during split or validation
  • ❌ Oversampling before splitting (leads to leakage)
  • ❌ Ignoring class imbalance in metrics

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon