Custom transformers in Scikit-learn are user-defined preprocessing steps that can be integrated into a pipeline. They let you tailor transformation logic to a specific dataset or domain, keeping feature engineering consistent and reusable.
Key Characteristics
- Inherit from BaseEstimator and TransformerMixin
- Implement fit() and transform() methods
- Seamlessly integrate with Pipeline and ColumnTransformer
- Useful for feature engineering, preprocessing, or filtering
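As a rough illustration of the ColumnTransformer integration mentioned above, the sketch below applies a custom transformer to a single column and passes the rest through. The `LogScaler` class and the `income`/`age` columns are hypothetical names chosen for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class LogScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: applies log(1 + x) to every column it receives."""

    def fit(self, X, y=None):
        return self  # Nothing to learn

    def transform(self, X):
        return np.log1p(X)


# Illustrative data (made up for this sketch)
df = pd.DataFrame({'income': [0.0, 9.0, 99.0], 'age': [20, 30, 40]})

# Apply LogScaler to 'income' only; leave the remaining columns untouched
column_transformer = ColumnTransformer(
    [('log_income', LogScaler(), ['income'])],
    remainder='passthrough',
)
result = np.asarray(column_transformer.fit_transform(df))
print(result.shape)  # Transformed column first, passthrough columns after it
```

The transformed columns come first in the output, followed by the passthrough columns, so column order can differ from the input DataFrame.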
Basic Rules
- Always define fit() even if it does nothing
- transform() must return transformed data (array, DataFrame, etc.)
- Use __init__() for parameter handling
- Maintain compatibility with Scikit-learn APIs (no side-effects)
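One way to check that a transformer follows these rules is to pass it through `sklearn.base.clone`, which rebuilds an estimator from its constructor parameters. The sketch below assumes a hypothetical `Multiplier` transformer; it is a quick sanity check, not a full API compliance test.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class Multiplier(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that multiplies every value by `factor`."""

    def __init__(self, factor=2.0):
        self.factor = factor  # Stored unchanged, as the rules above require

    def fit(self, X, y=None):
        # Learned state uses a trailing underscore by Scikit-learn convention
        self.n_columns_ = np.asarray(X).shape[1]
        return self

    def transform(self, X):
        return np.asarray(X) * self.factor


original = Multiplier(factor=3.0).fit([[1.0, 2.0]])
copy = clone(original)  # New, unfitted object with the same parameters

print(copy.get_params())
print(hasattr(copy, 'n_columns_'))  # Fitted state is not carried over
```

If `__init__` altered a parameter before storing it, `clone` would typically raise an error, which is why parameters are stored as given.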
Syntax Table
SL NO | Technique | Syntax Example | Description |
---|---|---|---|
1 | Import Base Classes | from sklearn.base import BaseEstimator, TransformerMixin | Required for building custom transformers |
2 | Create Transformer Class | class MyTransformer(BaseEstimator, TransformerMixin): ... | Define the custom transformation logic |
3 | Implement fit() | def fit(self, X, y=None): return self | Learns and stores necessary state if needed |
4 | Implement transform() | def transform(self, X): return X_transformed | Applies the transformation to input data |
5 | Use in Pipeline | Pipeline([('custom', MyTransformer()), ...]) | Integrates the transformer into model pipeline |
Syntax Explanation
1. Import Base Classes
What is it?
Imports required base classes for creating Scikit-learn compatible transformers.
Syntax:
from sklearn.base import BaseEstimator, TransformerMixin
Explanation:
- BaseEstimator provides parameter handling (get_params() and set_params()) and a readable representation.
- TransformerMixin adds fit_transform(), which calls fit() followed by transform().
- Both are essential for building components compatible with Scikit-learn's tools.
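To make the role of the two base classes concrete, here is a small sketch showing the methods they contribute. The `AddConstant` class is a hypothetical example; `get_params()`, `set_params()`, and `fit_transform()` are inherited rather than written by hand.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that adds a fixed constant to the input."""

    def __init__(self, constant=1.0):
        self.constant = constant

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) + self.constant


t = AddConstant(constant=2.0)

# From BaseEstimator: parameter inspection and update
params = t.get_params()
t.set_params(constant=5.0)

# From TransformerMixin: fit() and transform() in one call
out = t.fit_transform([[1.0], [2.0]])
print(params, out.tolist())
```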
2. Create Transformer Class
What is it?
Defines a new class for custom transformation logic.
Syntax:
class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param1=True):
        self.param1 = param1
Explanation:
- __init__ initializes parameters.
- By convention, every parameter is a keyword argument of __init__ and is stored unchanged on an attribute of the same name.
- This enables hyperparameter tuning using GridSearchCV or RandomizedSearchCV.
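The tuning point can be sketched as follows: because the parameter is declared in `__init__`, GridSearchCV can address it with the `step__parameter` naming scheme. The `SquareAppender` class, the synthetic dataset, and the grid values are assumptions made for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class SquareAppender(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: optionally appends squared features."""

    def __init__(self, add_squares=True):
        self.add_squares = add_squares

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X)
        if self.add_squares:
            return np.hstack([X, X ** 2])
        return X


# Synthetic data, used only to have something to fit
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

pipe = Pipeline([
    ('custom', SquareAppender()),
    ('model', LogisticRegression(max_iter=1000)),
])

# 'custom__add_squares' targets the add_squares parameter of the 'custom' step
search = GridSearchCV(pipe, {'custom__add_squares': [True, False]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```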
3. Implement fit()
What is it?
Trains or initializes any internal parameters needed for transformation.
Syntax:
def fit(self, X, y=None):
    return self
Explanation:
- Typically just returns self unless learning is required.
- Required even if no training is needed.
- Keeps the class compatible with pipeline mechanics.
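For the case where learning is required, a minimal sketch might look like this. The `MeanCenterer` class is hypothetical; the trailing underscore on `mean_` follows the Scikit-learn convention for attributes learned during fit().

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means seen during fit()."""

    def fit(self, X, y=None):
        # Learn state from the training data only
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Reuse the stored training means, even for unseen data
        return np.asarray(X, dtype=float) - self.mean_


train = np.array([[1.0, 10.0], [3.0, 30.0]])
new = np.array([[2.0, 20.0]])

centerer = MeanCenterer().fit(train)
centered_new = centerer.transform(new)
print(centerer.mean_, centered_new)
```

Keeping the learning in fit() and the application in transform() is what prevents statistics from test data leaking into the transformation.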
4. Implement transform()
What is it?
Applies the actual transformation logic to the data.
Syntax:
def transform(self, X):
    # Example transformation
    return X + 1
Explanation:
- Performs the data modification.
- Must return transformed data (same shape or modified as needed).
- Should raise exceptions for invalid input types or formats.
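The point about raising exceptions could be handled along these lines. This is only one possible approach, using plain NumPy checks in a hypothetical `AddOne` transformer; the specific checks and messages are assumptions for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddOne(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds 1 to numeric 2D input."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X)
        if X.ndim != 2:
            raise ValueError(f"Expected 2D input, got {X.ndim}D")
        if not np.issubdtype(X.dtype, np.number):
            raise TypeError(f"Expected numeric input, got dtype {X.dtype}")
        return X + 1


valid = AddOne().fit_transform([[1, 2]])

# Invalid inputs fail early with a clear message
errors = []
for bad_input in ([1, 2], [['a', 'b']]):
    try:
        AddOne().transform(bad_input)
    except (ValueError, TypeError) as exc:
        errors.append(type(exc).__name__)
print(valid.tolist(), errors)
```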
5. Use in Pipeline
What is it?
Integrates the custom transformer into a modeling workflow.
Syntax:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('custom', MyTransformer()),
    ('model', LogisticRegression())
])
Explanation:
- Enables chaining of multiple preprocessing and modeling steps.
- Useful for standardizing the ML workflow.
- Allows consistent transformation in training and inference.
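To illustrate the consistency claim, the sketch below checks that calling predict() on the pipeline gives the same result as transforming new data by hand and then calling the model. The synthetic dataset and the train/new split are assumptions made for the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) + 1  # Same toy transformation as above


# Synthetic data: first 50 rows for training, last 10 as "new" data
X, y = make_classification(n_samples=60, n_features=4, random_state=0)
X_train, y_train, X_new = X[:50], y[:50], X[50:]

pipeline = Pipeline([
    ('custom', MyTransformer()),
    ('model', LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# The pipeline applies transform() automatically at inference time
pipeline_pred = pipeline.predict(X_new)

# Equivalent manual route: transform first, then predict with the fitted model
manual_pred = pipeline.named_steps['model'].predict(
    pipeline.named_steps['custom'].transform(X_new)
)
print(np.array_equal(pipeline_pred, manual_pred))
```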
Real-Life Project: Feature Engineering with Custom Transformers
Project Overview
Create a transformer that adds a new feature based on domain knowledge.
Code Example
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Custom Transformer: Adds BMI feature
class BMICalculator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn; return self so the step chains in a pipeline
        return self

    def transform(self, X):
        X = X.copy()  # Avoid mutating the caller's DataFrame
        # BMI = weight / height^2, with Height converted from cm to m
        X['BMI'] = X['Weight'] / ((X['Height'] / 100) ** 2)
        return X

# Sample Data (illustrative values; enough rows for both classes to appear in training)
data = pd.DataFrame({
    'Height': [170, 160, 180, 175, 165, 185, 158, 172],
    'Weight': [70, 60, 90, 68, 75, 95, 50, 80],
    'Target': [1, 0, 1, 0, 1, 1, 0, 0]
})
X = data.drop('Target', axis=1)
y = data['Target']

# Pipeline
pipeline = Pipeline([
    ('bmi_calc', BMICalculator()),
    ('model', LogisticRegression())
])

# Stratify so the training split contains both classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
pipeline.fit(X_train, y_train)
Expected Output
- Model trained with BMI as an engineered feature
- Clean and modular ML workflow
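To see the engineered feature itself, the transformer can be run on its own. The sketch below repeats the BMICalculator definition so it is self-contained and uses three illustrative height/weight rows.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class BMICalculator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['BMI'] = X['Weight'] / ((X['Height'] / 100) ** 2)
        return X


# Illustrative rows: Height in cm, Weight in kg
X = pd.DataFrame({'Height': [170, 160, 180], 'Weight': [70, 60, 90]})

out = BMICalculator().fit_transform(X)
print(out.round(2))
# BMI values are approximately 24.22, 23.44, and 27.78
```

The input DataFrame is left unchanged because transform() works on a copy.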
Common Mistakes to Avoid
- Forgetting to inherit both BaseEstimator and TransformerMixin
- Missing return statement in fit()
- Changing column order or names unexpectedly in transform()
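The missing return statement is easy to reproduce. In this sketch, a hypothetical `BrokenTransformer` forgets `return self`, so fit() returns None and the chained call inside fit_transform() fails.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class BrokenTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.n_columns_ = np.asarray(X).shape[1]
        # Bug: no `return self`, so fit() implicitly returns None

    def transform(self, X):
        return np.asarray(X)


try:
    BrokenTransformer().fit_transform([[1, 2]])
    error_message = None
except AttributeError as exc:
    # fit_transform() calls fit(X).transform(X), and None has no transform()
    error_message = str(exc)

print(error_message)
```

The same failure would surface inside a Pipeline, since pipelines rely on fit_transform() for intermediate steps.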
Further Reading Recommendation
- Scikit-learn Custom Transformers Guide
- Custom Pipelines Blog
- ColumnTransformer for Feature Engineering
Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan