Creating Custom Transformers in Scikit-learn

Custom transformers in Scikit-learn are user-defined preprocessing steps that can be integrated into a pipeline. They let you tailor transformation logic to a specific dataset or domain, so feature engineering stays consistent and reusable across training and inference.

Key Characteristics

Inherit from BaseEstimator and TransformerMixin
Implement fit() and transform() methods
Seamlessly integrate with Pipeline and ColumnTransformer
Useful for feature engineering, preprocessing, or filtering
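
As a minimal sketch of the ColumnTransformer integration mentioned above, the example below applies a custom transformer to one column and passes the other through unchanged. The class name Log1pTransformer, the column names, and the values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: applies log(1 + x) to every value."""

    def fit(self, X, y=None):
        # Nothing to learn
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Illustrative data only
df = pd.DataFrame({"income": [0.0, 9.0, 99.0], "age": [25, 40, 58]})

# Apply the custom transformer to "income"; keep "age" as it is
column_transformer = ColumnTransformer(
    [("log_income", Log1pTransformer(), ["income"])],
    remainder="passthrough",
)
out = column_transformer.fit_transform(df)
print(out.shape)  # (3, 2)
```

The transformed column comes first in the output, followed by the passed-through column.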

Basic Rules

Always define fit() even if it does nothing
transform() must return transformed data (array, DataFrame, etc.)
Use __init__() only to store parameters
Maintain compatibility with Scikit-learn APIs: avoid side effects such as modifying the input X in place
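
The rules above can be sketched in one small class. ClipTransformer and its lower/upper parameters are hypothetical names used for illustration, not part of Scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ClipTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip every value into [lower, upper]."""

    def __init__(self, lower=0.0, upper=1.0):
        # Store constructor arguments as-is; no validation or computation here
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Nothing to learn, but fit() must exist and return self
        return self

    def transform(self, X):
        # np.clip returns a new array, so the caller's X is left untouched
        return np.clip(np.asarray(X, dtype=float), self.lower, self.upper)


X = np.array([[-0.5, 0.2], [0.7, 1.8]])
X_before = X.copy()

X_clipped = ClipTransformer(lower=0.0, upper=1.0).fit_transform(X)
print(X_clipped)                     # values now lie in [0, 1]
print(np.array_equal(X, X_before))   # True: the input was not modified
```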

Syntax Table

No.	Technique	Syntax Example	Description
1	Import Base Classes	`from sklearn.base import BaseEstimator, TransformerMixin`	Required for building custom transformers
2	Create Transformer Class	`class MyTransformer(BaseEstimator, TransformerMixin): ...`	Define the custom transformation logic
3	Implement fit()	`def fit(self, X, y=None): return self`	Learns and stores necessary state if needed
4	Implement transform()	`def transform(self, X): return X_transformed`	Applies the transformation to input data
5	Use in Pipeline	`Pipeline([('custom', MyTransformer()), ...])`	Integrates the transformer into model pipeline

Syntax Explanation

1. Import Base Classes

What is it?
Imports required base classes for creating Scikit-learn compatible transformers.

Syntax:

from sklearn.base import BaseEstimator, TransformerMixin

Explanation:

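BaseEstimator provides get_params() and set_params(), plus a readable representation, all derived from the __init__ signature.
TransformerMixin adds fit_transform(), which calls fit() and then transform().
Together they make the class usable with Scikit-learn tools such as Pipeline and GridSearchCV.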
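
A short sketch of what the two base classes contribute. AddAmount and its amount parameter are hypothetical, chosen only to keep the example small.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddAmount(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds a constant to every value."""

    def __init__(self, amount=1):
        self.amount = amount

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) + self.amount


t = AddAmount(amount=2)

# From BaseEstimator: parameter introspection and updates
params = t.get_params()
print(params)            # {'amount': 2}
t.set_params(amount=3)

# From TransformerMixin: fit_transform() without writing it yourself
result = t.fit_transform(np.array([[1, 2]]))
print(result)            # [[4 5]]
```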

2. Create Transformer Class

What is it?
Defines a new class for custom transformation logic.

Syntax:

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param1=True):
        self.param1 = param1

Explanation:

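__init__ only stores parameters; keep validation and computation out of it.
By convention, every constructor argument is a keyword argument with a default and is saved unchanged on an attribute of the same name, because get_params() and set_params() rely on that.
Following the convention lets GridSearchCV or RandomizedSearchCV tune the parameters.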
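
As a hedged illustration of tuning a transformer parameter, the sketch below searches over param1 inside a pipeline. The behaviour given to param1 (adding squared features) and the toy data are assumptions made for this example only. Inside a pipeline the parameter is addressed as step name, two underscores, then the parameter name.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param1=True):
        self.param1 = param1

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Assumed behaviour for illustration: param1 toggles squared features
        return np.hstack([X, X ** 2]) if self.param1 else X


# Toy data: class 1 when |x| > 1 (10 samples per class)
X = np.linspace(-2, 2, 20).reshape(-1, 1)
y = (np.abs(X[:, 0]) > 1).astype(int)

pipeline = Pipeline([
    ("custom", MyTransformer()),
    ("model", LogisticRegression()),
])

# "<step name>__<parameter name>" reaches into the pipeline step
param_grid = {"custom__param1": [True, False]}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```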

3. Implement fit()

What is it?
Trains or initializes any internal parameters needed for transformation.

Syntax:

def fit(self, X, y=None):
    return self

Explanation:

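Returns self unchanged when there is nothing to learn.
When the transformation depends on the training data, fit() computes that state and stores it on attributes whose names end in an underscore (for example, mean_).
Required even if no training is needed, because Pipeline fits every step.
Always return self so that calls can be chained and fit_transform() works.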
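
When fit() does need to learn something, a minimal sketch might look like this. MeanCenterer is a hypothetical class for illustration; check_is_fitted is Scikit-learn's helper for detecting an unfitted estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means seen in fit()."""

    def fit(self, X, y=None):
        # Learned state goes on an attribute ending in an underscore
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Raises NotFittedError if fit() has not been called yet
        check_is_fitted(self, "mean_")
        return np.asarray(X, dtype=float) - self.mean_


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 25.0]])

centerer = MeanCenterer().fit(X_train)
centered = centerer.transform(X_new)
print(centerer.mean_)   # [ 2. 20.]
print(centered)         # [[0. 5.]]

# Using the transformer before fitting fails loudly
try:
    MeanCenterer().transform(X_new)
    unfitted_error = None
except NotFittedError as exc:
    unfitted_error = exc
```

The means come from the training data only, so new data is centred with the same values at inference time.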

4. Implement transform()

What is it?
Applies the actual transformation logic to the data.

Syntax:

def transform(self, X):
    # Example transformation
    return X + 1

Explanation:

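Performs the data modification.
Must return the transformed data; the number of rows should stay the same, while columns may be added, removed, or changed.
Should work on a copy rather than modify the input X in place.
Should raise exceptions for invalid input types or formats.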
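
One way to reject invalid input is sketched below, assuming the transformer expects a pandas DataFrame containing a particular column. AddOneToColumn and its column parameter are hypothetical names for this example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AddOneToColumn(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds 1 to a single named column."""

    def __init__(self, column="value"):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Fail early with a clear message instead of an obscure error later
        if not isinstance(X, pd.DataFrame):
            raise TypeError(f"expected a pandas DataFrame, got {type(X).__name__}")
        if self.column not in X.columns:
            raise ValueError(f"missing required column: {self.column!r}")
        X = X.copy()  # do not modify the caller's DataFrame
        X[self.column] = X[self.column] + 1
        return X


df = pd.DataFrame({"value": [1, 2, 3]})
result = AddOneToColumn(column="value").fit_transform(df)
print(result["value"].tolist())   # [2, 3, 4]

# Invalid inputs: a plain list, then a DataFrame without the column
errors = []
for bad_input in ([[1, 2, 3]], pd.DataFrame({"other": [1]})):
    try:
        AddOneToColumn(column="value").transform(bad_input)
    except (TypeError, ValueError) as exc:
        errors.append(type(exc).__name__)
print(errors)   # ['TypeError', 'ValueError']
```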

5. Use in Pipeline

What is it?
Integrates the custom transformer into a modeling workflow.

Syntax:

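from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# MyTransformer is the class defined in the previous steps
pipeline = Pipeline([
    ('custom', MyTransformer()),
    ('model', LogisticRegression())
])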

Explanation:

Enables chaining of multiple preprocessing and modeling steps.
Useful for standardizing the ML workflow.
Allows consistent transformation in training and inference.
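
The point about consistent transformation can be checked with a small sketch: predicting through the pipeline should match transforming by hand and then calling the model. The toy data is made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Same example transformation as in step 4
        return np.asarray(X, dtype=float) + 1


# Illustrative toy data
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])

pipeline = Pipeline([
    ("custom", MyTransformer()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

X_new = np.array([[0.5], [2.5]])

# The pipeline applies the transformer before the model predicts...
pipeline_pred = pipeline.predict(X_new)

# ...which is equivalent to doing both steps by hand
manual_pred = pipeline.named_steps["model"].predict(
    pipeline.named_steps["custom"].transform(X_new)
)
print(np.array_equal(pipeline_pred, manual_pred))   # True
```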

Real-Life Project: Feature Engineering with Custom Transformers

Project Overview

Create a transformer that adds a new feature based on domain knowledge.

Code Example

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

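# Custom Transformer: adds a BMI feature
# Assumes Height is in centimetres and Weight in kilograms
class BMICalculator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn; BMI is computed row by row
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame unchanged
        # BMI = weight (kg) / height (m) squared; Height / 100 converts cm to m
        X['BMI'] = X['Weight'] / ((X['Height'] / 100) ** 2)
        return X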

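# Sample data (illustrative values): Height in cm, Weight in kg
data = pd.DataFrame({
    'Height': [170, 160, 180, 165, 175, 155],
    'Weight': [70, 60, 90, 80, 65, 50],
    'Target': [1, 0, 1, 1, 0, 0]
})

X = data.drop('Target', axis=1)
y = data['Target']

# Pipeline
pipeline = Pipeline([
    ('bmi_calc', BMICalculator()),
    ('model', LogisticRegression())
])

# Stratify so both classes appear in the training split; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, stratify=y, random_state=42
)
pipeline.fit(X_train, y_train)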

Expected Output

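The script prints nothing. After it runs:
The pipeline holds a model trained on Height, Weight, and the engineered BMI feature
The same BMI step runs automatically whenever pipeline.predict() is called
The ML workflow stays clean and modular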
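
To inspect the engineered feature itself, the transformer can also be run on its own. This sketch repeats the BMICalculator class so that it runs standalone and uses three height and weight pairs as illustrative input.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class BMICalculator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['BMI'] = X['Weight'] / ((X['Height'] / 100) ** 2)
        return X


X = pd.DataFrame({'Height': [170, 160, 180], 'Weight': [70, 60, 90]})

X_with_bmi = BMICalculator().fit_transform(X)
print(X_with_bmi.round(2))
```

The new BMI column holds roughly 24.22, 23.44, and 27.78 for the three rows, and the original DataFrame X is left without a BMI column.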

Common Mistakes to Avoid

❌ Forgetting to inherit both BaseEstimator and TransformerMixin (you lose get_params()/set_params() or fit_transform())
❌ Omitting return self in fit(), which makes fit_transform() fail
❌ Changing column order or names unexpectedly in transform()
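
The second mistake is easy to reproduce. In this sketch, BrokenTransformer is a deliberately faulty, hypothetical class whose fit() returns None, so the chained call inside fit_transform() fails.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class BrokenTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        pass  # BUG: no "return self", so fit() returns None

    def transform(self, X):
        return X


X = np.array([[1.0], [2.0]])

error = None
try:
    # fit_transform() calls transform() on the result of fit(), which is None
    BrokenTransformer().fit_transform(X)
except AttributeError as exc:
    error = exc

print(type(error).__name__)   # AttributeError
```

Adding return self to fit() is the whole fix.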

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
