SMOTE for Data Balancing in Scikit-learn Workflows

SMOTE (Synthetic Minority Oversampling Technique) is a popular method for addressing class imbalance by generating synthetic examples of the minority class. Unlike random oversampling, which merely duplicates existing data, SMOTE synthesizes new, plausible samples by interpolating between existing minority-class samples. In Python it is provided by the imbalanced-learn (imblearn) library, which integrates directly with scikit-learn.

Key Characteristics

  • Generates synthetic minority class samples
  • Reduces risk of overfitting compared to random oversampling
  • Enhances model performance on imbalanced datasets
  • Works only on numeric features (requires preprocessing for categorical data)

Basic Rules

  • Apply SMOTE after the train/test split, and only to the training data, to avoid data leakage
  • Scale features first: SMOTE's nearest-neighbor search is distance-based, as are many downstream models
  • Combine with under-sampling or ensemble models for better results (see the sketch after this list)
  • Evaluate performance with recall and F1-score rather than accuracy
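The under-sampling combination mentioned above can be done by chaining two samplers. Below is a minimal runnable sketch on a synthetic dataset; the 0.5 and 0.8 ratios are illustrative assumptions, not recommendations:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Oversample the minority to 50% of the majority, then shrink the majority
# until the minority/majority ratio is 0.8 (both ratios are illustrative)
over = SMOTE(sampling_strategy=0.5, random_state=42)
under = RandomUnderSampler(sampling_strategy=0.8, random_state=42)

X_mid, y_mid = over.fit_resample(X, y)
X_bal, y_bal = under.fit_resample(X_mid, y_mid)
print(Counter(y), Counter(y_mid), Counter(y_bal))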

Syntax Table

SL NO | Technique        | Syntax Example                                 | Description
1     | Initialize SMOTE | SMOTE()                                        | Prepares the SMOTE instance
2     | Fit and Resample | X_res, y_res = SMOTE().fit_resample(X, y)      | Generates new synthetic samples
3     | Custom Strategy  | SMOTE(sampling_strategy=0.5)                   | Balances classes to a specific ratio
4     | K Neighbors      | SMOTE(k_neighbors=3)                           | Changes the number of neighbors used for interpolation
5     | Use in Pipeline  | Pipeline([('smote', SMOTE()), ('clf', model)]) | Applies SMOTE inside a training pipeline

Syntax Explanation

1. Initialize SMOTE

What is it?
Creates a SMOTE instance for resampling.

Syntax:

from imblearn.over_sampling import SMOTE
smote = SMOTE()

Explanation:

  • This line initializes the SMOTE object using default settings.
  • It’s essential before applying fit_resample().
  • You can later modify hyperparameters like sampling_strategy, k_neighbors, etc.
  • Default settings create a fully balanced dataset by oversampling the minority class to match the majority.
  • Ensure the data is numeric; otherwise, you may need SMOTENC or preprocessing.
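A minimal runnable sketch of default initialization, using a synthetic dataset purely for illustration:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Roughly 90%/10% two-class dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))            # imbalanced counts (roughly 900 vs 100)

smote = SMOTE(random_state=42)            # all defaults
X_res, y_res = smote.fit_resample(X, y)   # (on real data, resample only the training split)
print(Counter(y_res))        # both classes now have the same count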

2. Fit and Resample

What is it?
Generates synthetic minority class samples and returns balanced features and labels.

Syntax:

X_res, y_res = smote.fit_resample(X_train, y_train)

Explanation:

  • Fits SMOTE's nearest-neighbor model on the training data and resamples it in a single step.
  • Returns the new feature set X_res and labels y_res with increased minority instances.
  • The newly generated samples are not simply duplicated—they are interpolated using nearest neighbors.
  • This step is crucial and should only be applied to training data after splitting to avoid data leakage.
  • Can be combined with under-sampling methods in pipelines for better balance.
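A minimal sketch of the split-then-resample order, again on a synthetic dataset for illustration:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Split first; SMOTE only ever sees the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

print(Counter(y_train))  # imbalanced training counts
print(Counter(y_res))    # balanced; X_test / y_test remain untouched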

3. Custom Sampling Strategy

What is it?
Specifies the desired class distribution ratio.

Syntax:

smote = SMOTE(sampling_strategy=0.5)

Explanation:

  • This will resample the minority class until it’s 50% the size of the majority class.
  • The float form expresses the desired minority-to-majority ratio (between 0 and 1) and is valid only for binary classification.
  • Alternatively, sampling_strategy can be a dictionary {class_label: target_count}.
  • Useful for partial balancing when full parity is not ideal.
  • Helps tailor oversampling to specific business or risk constraints.
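A short sketch of both forms; the dictionary count below is hypothetical and must be at least as large as the current count of the class it targets:

from imblearn.over_sampling import SMOTE

# Float form (binary only): minority grown to 50% of the majority size
smote_ratio = SMOTE(sampling_strategy=0.5, random_state=42)

# Dictionary form: explicit target count per class label
# (here class 1 is grown to 450 samples, a hypothetical figure)
smote_dict = SMOTE(sampling_strategy={1: 450}, random_state=42)

# Both are applied the same way, e.g.:
# X_res, y_res = smote_ratio.fit_resample(X_train, y_train)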

4. Adjusting K Neighbors

What is it?
Modifies how SMOTE interpolates new samples using nearest neighbors.

Syntax:

smote = SMOTE(k_neighbors=3)

Explanation:

  • k_neighbors determines how many neighboring minority samples are used for interpolation.
  • Lower values create tighter clusters (less diversity), higher values increase synthetic variability.
  • Default is 5; reducing it may help with extremely small datasets.
  • Should be tuned based on dataset size and distribution.
  • k_neighbors also accepts a pre-configured nearest-neighbors estimator (e.g. sklearn.neighbors.NearestNeighbors) instead of an integer; see the sketch below.
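A short sketch of both ways to control the neighbor search; the NearestNeighbors settings are illustrative:

from sklearn.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE

# Integer form: 3 nearest minority neighbors per sample
smote_int = SMOTE(k_neighbors=3, random_state=42)

# Estimator form: a pre-configured nearest-neighbors object
# (the sample itself counts as one neighbor, so n_neighbors=4
# roughly corresponds to k_neighbors=3)
nn = NearestNeighbors(n_neighbors=4)
smote_nn = SMOTE(k_neighbors=nn, random_state=42)

# X_res, y_res = smote_int.fit_resample(X_train, y_train)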

5. Using SMOTE in a Pipeline

What is it?
Combines SMOTE and model into a single reproducible training workflow.

Syntax:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline supports sampler steps
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('smote', SMOTE()),
    ('clf', LogisticRegression())
])

Explanation:

  • This ensures SMOTE is only applied to the training set during cross-validation.
  • Prevents information leakage across folds.
  • Makes it easy to use GridSearchCV or cross_val_score safely.
  • Can include preprocessing steps like StandardScaler() before modeling.
  • Highly recommended when doing repeated evaluations or deploying models.
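A runnable sketch of cross-validating such a pipeline on a synthetic dataset; the scoring choice and fold count are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Inside each CV fold, SMOTE resamples only that fold's training split
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(scores.mean())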


Real-Life Project: SMOTE for Loan Default Prediction

Project Name

Loan Default Classification

Project Overview

Predict whether a customer will default using SMOTE to balance the dataset.

Project Goal

Improve recall on the default (minority) class.

Code for This Project

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
import pandas as pd

# Load data
X = pd.read_csv('loan_features.csv')
y = pd.read_csv('loan_labels.csv').values.ravel()

# Split
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=test_size, random_state=42)

# Apply SMOTE to the training data only
sm = SMOTE(k_neighbors=4, sampling_strategy='auto', random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

# Train model (SMOTE has already balanced the classes, so class_weight is unnecessary)
model = LogisticRegression(max_iter=1000)
model.fit(X_res, y_res)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Improved recall and F1-score for the minority class (loan defaults)
  • Classification report showing balanced performance across classes

Common Mistakes to Avoid

  • ❌ Applying SMOTE before train-test split (leads to data leakage)
  • ❌ Using with categorical features without encoding (SMOTE requires numeric input)
  • ❌ Ignoring feature scaling when using distance-based classifiers

Further Reading Recommendation