Imbalanced datasets occur when one class significantly outweighs others, often leading to biased models. Scikit-learn offers tools and strategies to address class imbalance through resampling, algorithmic adjustments, and evaluation metrics.
Key Characteristics
- Target variable has skewed class distribution
- Causes poor recall for minority classes
- Needs special preprocessing or model adjustments
- Affects classification more than regression
Basic Rules
- Never evaluate solely with accuracy
- Use stratified splits during training
- Always monitor precision, recall, and F1-score
- Apply techniques like resampling or class weighting
Syntax Table
SL NO | Technique | Syntax Example | Description |
---|---|---|---|
1 | Class Weighting | LogisticRegression(class_weight='balanced') | Weights classes inversely to their frequency |
2 | SMOTE Oversampling | SMOTE().fit_resample(X, y) | Synthesizes new minority samples |
3 | Random Under-sampling | RandomUnderSampler().fit_resample(X, y) | Removes samples from the majority class |
4 | Stratified Split | StratifiedKFold(n_splits=5) | Ensures class proportions in folds |
5 | Classification Report | classification_report(y_true, y_pred) | Evaluates precision, recall, and F1-score |
Syntax Explanation
1. Class Weighting
What is it?
Adjusts the loss function so that misclassifying minority-class samples is penalized more heavily.
Syntax:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
Explanation:
- class_weight='balanced' adjusts class weights inversely proportional to class frequencies in the data, so the model pays more attention to minority-class samples.
- You can also pass a dictionary with custom weights, e.g., class_weight={0: 1, 1: 5}.
- Available in many estimators, including RandomForestClassifier, SVC, and DecisionTreeClassifier.
- Helps reduce bias toward the majority class without modifying the dataset.
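To see the effect in isolation, a minimal sketch along the following lines compares no weighting, 'balanced', and a custom dictionary on a synthetic 90/10 dataset (the make_classification settings and the 1:5 weight are illustrative choices, not part of the original example):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
# Synthetic 90/10 dataset (illustrative assumption)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Compare default, 'balanced', and a custom weight dictionary
for weight in (None, 'balanced', {0: 1, 1: 5}):
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    model.fit(X_train, y_train)
    print(weight, 'minority recall:', recall_score(y_test, model.predict(X_test)))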
2. SMOTE Oversampling
What is it?
Synthetic Minority Oversampling Technique creates synthetic samples from the minority class.
Syntax:
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X, y)
Explanation:
- SMOTE creates new synthetic instances by interpolating between existing minority class instances.
- It helps balance the dataset and prevent overfitting from simple duplication.
- fit_resample returns a new feature matrix and target vector.
- Behaviour can be customized using k_neighbors, sampling_strategy, and other parameters.
- Part of the imbalanced-learn (imblearn) package, which must be installed separately.
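As a rough sketch (the sampling_strategy and k_neighbors values below are illustrative assumptions, not required settings), you can check the class counts before and after resampling:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# Synthetic 95/5 dataset (illustrative assumption)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print('Before:', Counter(y))
# sampling_strategy=0.5 requests roughly 1 minority sample per 2 majority samples
smote = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print('After:', Counter(y_res))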
3. Random Under-sampling
What is it?
Reduces class imbalance by randomly removing samples from the majority class.
Syntax:
from imblearn.under_sampling import RandomUnderSampler
X_res, y_res = RandomUnderSampler().fit_resample(X, y)
Explanation:
- This method drops samples from the majority class to match the size of the minority class.
- Helps simplify the dataset and reduce training time.
- Can lead to information loss if not used carefully.
- Works well when you have abundant data and want faster training.
- Can be combined with SMOTE in a pipeline using Pipeline() from imblearn.pipeline to balance in both directions at once, as shown in the sketch below.
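A minimal sketch of such a pipeline, assuming illustrative sampling ratios and a synthetic dataset, might look like this:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Synthetic 90/10 dataset (illustrative assumption)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
# Oversample the minority to a 1:2 ratio, then trim the majority to roughly 1:1.25
pipe = Pipeline(steps=[
    ('over', SMOTE(sampling_strategy=0.5, random_state=0)),
    ('under', RandomUnderSampler(sampling_strategy=0.8, random_state=0)),
    ('clf', LogisticRegression(max_iter=1000)),
])
# Resampling is applied only to the training folds inside cross-validation
print(cross_val_score(pipe, X, y, cv=5, scoring='f1').mean())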
4. Stratified Split
What is it?
Ensures each fold has the same class distribution as the original dataset.
Syntax:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
Explanation:
- Creates train/test splits such that each fold maintains the original class distribution.
- Prevents imbalance during cross-validation which can skew results.
- Use .split(X, y) to generate train/test indices.
- Commonly used with cross_val_score or custom CV loops.
- Also available as StratifiedShuffleSplit for randomized splitting.
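A small sketch (with an illustrative synthetic dataset) makes the guarantee visible by printing the minority share in each test fold:
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
# Synthetic 90/10 dataset (illustrative assumption)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each fold's test set keeps roughly the same minority share (~0.10)
    print(f'Fold {fold}: minority share = {y[test_idx].mean():.3f}')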
5. Classification Report
What is it?
Displays precision, recall, F1-score, and support for each class.
Syntax:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))
Explanation:
- precision = TP / (TP + FP): focuses on the correctness of positive predictions.
- recall = TP / (TP + FN): focuses on identifying all relevant samples.
- f1-score = harmonic mean of precision and recall.
- support = number of true samples for each label.
- Particularly helpful for monitoring minority-class performance, which often has poor recall in imbalanced settings.
- Should be used in conjunction with a confusion matrix for full clarity.
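A small, self-contained sketch (the hand-written labels below are purely illustrative) shows the report alongside a confusion matrix:
from sklearn.metrics import classification_report, confusion_matrix
# Hand-written labels for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
# Rows of the confusion matrix are true classes, columns are predictions
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))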
Real-Life Project: Fraud Detection with Imbalanced Data
Project Name
Credit Card Fraud Classification
Project Overview
Detect fraudulent transactions in a highly imbalanced financial dataset.
Project Goal
Use oversampling and evaluation metrics to identify fraud effectively.
Code for This Project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
# Example dataset loaded from CSV files (features.csv / labels.csv)
X = pd.read_csv('features.csv')
y = pd.read_csv('labels.csv').values.ravel()
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
# Apply SMOTE
smote = SMOTE()
X_res, y_res = smote.fit_resample(X_train, y_train)
# Train model
model = LogisticRegression(class_weight='balanced')
model.fit(X_res, y_res)
# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Expected Output
- Higher recall on minority class (fraud)
- Balanced F1-scores across both classes
Common Mistakes to Avoid
- ❌ Relying only on accuracy
- ❌ Not stratifying during split or validation
- ❌ Oversampling before splitting (leads to leakage; see the sketch below)
- ❌ Ignoring class imbalance in metrics
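The leakage pitfall in particular is easy to demonstrate: resample only the training split, never the full dataset (the synthetic data below is an illustrative assumption).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
# Wrong: resampling before the split lets synthetic minority samples leak into the test set
# X_res, y_res = SMOTE().fit_resample(X, y)
# X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, stratify=y_res)
# Right: split first, then resample only the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)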