SMOTE (Synthetic Minority Oversampling Technique) is a popular method to address class imbalance by generating synthetic examples of the minority class. Unlike random oversampling, which duplicates data, SMOTE synthesizes new, plausible samples by interpolating between existing minority class samples.
Key Characteristics
- Generates synthetic minority class samples
- Reduces risk of overfitting compared to random oversampling
- Enhances model performance on imbalanced datasets
- Works only on numeric features (requires preprocessing for categorical data)
Basic Rules
- Use after train/test split to avoid data leakage
- Scale features if base model is sensitive to distance metrics
- Combine with under-sampling or ensemble models for better results
- Evaluate performance using recall and F1-score
Syntax Table
SL NO | Technique | Syntax Example | Description |
---|---|---|---|
1 | Initialize SMOTE | SMOTE() | Prepares the SMOTE instance |
2 | Fit and Resample | X_res, y_res = SMOTE().fit_resample(X, y) | Generates new synthetic samples |
3 | Custom Strategy | SMOTE(sampling_strategy=0.5) | Balances classes with a specific ratio |
4 | K Neighbors | SMOTE(k_neighbors=3) | Changes the number of neighbors used by SMOTE |
5 | Use in Pipeline | Pipeline([('smote', SMOTE()), ('clf', model)]) | Applies SMOTE inside the training pipeline |
Syntax Explanation
1. Initialize SMOTE
What is it?
Creates a SMOTE instance for resampling.
Syntax:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
Explanation:
- This line initializes the SMOTE object using default settings.
- It is essential to initialize this object before applying fit_resample().
- You can later modify hyperparameters such as sampling_strategy and k_neighbors.
- Default settings create a fully balanced dataset by oversampling the minority class to match the majority.
- Ensure the data is numeric; otherwise, you may need SMOTENC or preprocessing.
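As a quick illustration, here is a minimal sketch (assuming a synthetic dataset built with scikit-learn's make_classification; all variable names are illustrative) showing that the default settings bring the minority class up to the majority count:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))       # majority class far outnumbers the minority

smote = SMOTE()
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))   # both classes now have the same count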
2. Fit and Resample
What is it?
Generates synthetic minority class samples and returns balanced features and labels.
Syntax:
X_res, y_res = smote.fit_resample(X_train, y_train)
Explanation:
- Fits SMOTE on your training data and resamples it in a single step.
- Returns the new feature set X_res and labels y_res with additional minority instances.
- The newly generated samples are not simply duplicated; they are interpolated between existing minority samples and their nearest neighbors.
- This step is crucial and should only be applied to training data after splitting to avoid data leakage.
- Can be combined with under-sampling methods in pipelines for better balance.
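For the last point, here is a minimal sketch (assuming imblearn's RandomUnderSampler and a training split named X_train/y_train; the 0.5 and 0.8 ratios are illustrative) of combining SMOTE with under-sampling:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# First oversample the minority class to half the size of the majority ...
X_over, y_over = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X_train, y_train)
# ... then trim the majority class until the minority reaches 80% of its size
X_res, y_res = RandomUnderSampler(sampling_strategy=0.8, random_state=42).fit_resample(X_over, y_over)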
3. Custom Sampling Strategy
What is it?
Specifies the desired class distribution ratio.
Syntax:
smote = SMOTE(sampling_strategy=0.5)
Explanation:
- This will resample the minority class until it’s 50% the size of the majority class.
- Instead of a fixed number, this is a float-based ratio between 0 and 1.
- Alternatively, sampling_strategy can be a dictionary of the form {class_label: target_count}.
- Useful for partial balancing when full parity is not ideal.
- Helps tailor oversampling to specific business or risk constraints.
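Both forms in one hedged sketch (the class label 1 and the target count 500 are illustrative placeholders):
from imblearn.over_sampling import SMOTE

# Float form: minority class is oversampled to 50% of the majority class size
smote_ratio = SMOTE(sampling_strategy=0.5, random_state=42)

# Dictionary form: oversample class 1 until it has exactly 500 samples
# (the target count must be at least the current number of class-1 samples)
smote_dict = SMOTE(sampling_strategy={1: 500}, random_state=42)

X_res, y_res = smote_dict.fit_resample(X_train, y_train)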
4. Adjusting K Neighbors
What is it?
Modifies how SMOTE interpolates new samples using nearest neighbors.
Syntax:
smote = SMOTE(k_neighbors=3)
Explanation:
- k_neighbors determines how many neighboring minority samples are used for interpolation.
- Lower values create tighter clusters (less diversity); higher values increase synthetic variability.
- Default is 5; reducing it may help with extremely small datasets.
- Should be tuned based on dataset size and distribution.
- A custom nearest-neighbors estimator object can also be passed as k_neighbors instead of an integer.
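A short sketch of both options (a minimal example; the neighbor counts are illustrative, and k_neighbors must stay below the number of minority samples):
from sklearn.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE

# Small minority class: reduce the neighbor count from the default of 5
smote_small = SMOTE(k_neighbors=3, random_state=42)

# Or pass a pre-configured nearest-neighbors estimator instead of an integer
smote_custom = SMOTE(k_neighbors=NearestNeighbors(n_neighbors=4), random_state=42)

X_res, y_res = smote_small.fit_resample(X_train, y_train)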
5. Using SMOTE in a Pipeline
What is it?
Combines SMOTE and model into a single reproducible training workflow.
Syntax:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('smote', SMOTE()),
    ('clf', LogisticRegression())
])
Explanation:
- This ensures SMOTE is only applied to the training set during cross-validation.
- Prevents information leakage across folds.
- Makes it easy to use GridSearchCV or cross_val_score safely.
- Can include preprocessing steps like StandardScaler() before the model.
- Highly recommended when doing repeated evaluations or deploying models.
Real-Life Project: SMOTE for Loan Default Prediction
Project Name
Loan Default Classification
Project Overview
Predict whether a customer will default using SMOTE to balance the dataset.
Project Goal
Improve recall on the default (minority) class.
Code for This Project
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
import pandas as pd
# Load data
X = pd.read_csv('loan_features.csv')
y = pd.read_csv('loan_labels.csv').values.ravel()
# Split
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=test_size, random_state=42)
# Apply SMOTE
sm = SMOTE(k_neighbors=4, sampling_strategy='auto')
X_res, y_res = sm.fit_resample(X_train, y_train)
# Train model
model = LogisticRegression(class_weight='balanced')
model.fit(X_res, y_res)
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Expected Output
- Improved recall and F1-score for the minority class (loan defaults)
- Classification report showing balanced performance across classes
Common Mistakes to Avoid
- ❌ Applying SMOTE before train-test split (leads to data leakage)
- ❌ Using with categorical features without encoding (plain SMOTE requires numeric input; see the SMOTENC sketch after this list)
- ❌ Ignoring feature scaling when using distance-based classifiers
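For mixed data types, a hedged sketch using SMOTENC (the column indices are purely illustrative and must match the positions of your categorical columns):
from imblearn.over_sampling import SMOTENC

# Columns 0 and 3 are assumed to be categorical; the remaining columns are numeric
smote_nc = SMOTENC(categorical_features=[0, 3], random_state=42)
X_res, y_res = smote_nc.fit_resample(X_train, y_train)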
Further Reading Recommendation
- SMOTE in imbalanced-learn Docs
- Combining SMOTE and Undersampling
- Handling Categorical Data in SMOTE-NC