Creating Custom Transformers in Scikit-learn

Custom transformers in Scikit-learn are user-defined preprocessing steps that can be integrated into a pipeline. They help tailor the transformation logic specific to the dataset or domain, allowing for consistent and reusable feature engineering.

Key Characteristics

  • Inherit from BaseEstimator and TransformerMixin
  • Implement fit() and transform() methods
  • Seamlessly integrate with Pipeline and ColumnTransformer
  • Useful for feature engineering, preprocessing, or filtering
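
As a rough illustration of the ColumnTransformer integration mentioned above, the sketch below applies a custom transformer to a single column and passes the other column through unchanged. The LogScaler class, the column names, and the values are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class LogScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: applies log(1 + x) to every value."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Made-up data for the sketch
df = pd.DataFrame({"income": [0.0, 9.0, 99.0], "age": [25, 40, 60]})

# Apply the custom transformer to "income" only; pass "age" through unchanged.
preprocess = ColumnTransformer(
    transformers=[("log_income", LogScaler(), ["income"])],
    remainder="passthrough",
)
result = preprocess.fit_transform(df)
print(result)
```

The transformed column comes first in the output, followed by the passed-through column.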

Basic Rules

  • Always define fit() even if it does nothing
  • transform() must return transformed data (array, DataFrame, etc.)
  • Use __init__() for parameter handling
  • Maintain compatibility with Scikit-learn APIs: transform() should return new data rather than modify its input or the transformer's parameters

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Import Base Classes | from sklearn.base import BaseEstimator, TransformerMixin | Required for building custom transformers
2 | Create Transformer Class | class MyTransformer(BaseEstimator, TransformerMixin): ... | Define the custom transformation logic
3 | Implement fit() | def fit(self, X, y=None): return self | Learns and stores necessary state if needed
4 | Implement transform() | def transform(self, X): return X_transformed | Applies the transformation to input data
5 | Use in Pipeline | Pipeline([('custom', MyTransformer()), ...]) | Integrates the transformer into model pipeline

Syntax Explanation

1. Import Base Classes

What is it?
Imports required base classes for creating Scikit-learn compatible transformers.

Syntax:

from sklearn.base import BaseEstimator, TransformerMixin

Explanation:

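  • BaseEstimator provides get_params() and set_params() for parameter handling, plus a readable representation.
  • TransformerMixin adds fit_transform(), built from the class's own fit() and transform().
  • Together they make the class usable with Scikit-learn tools such as Pipeline and GridSearchCV.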
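
To make the division of labour between the two base classes concrete, here is a minimal sketch. AddConstant and its value parameter are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that adds a fixed value to the input."""

    def __init__(self, value=1.0):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) + self.value


t = AddConstant(value=2.0)

# From BaseEstimator: parameter introspection and updating.
params = t.get_params()
t.set_params(value=3.0)

# From TransformerMixin: fit_transform() runs fit() and then transform().
out = t.fit_transform(np.array([[1.0], [2.0]]))
print(params, out.tolist())
```

Neither get_params(), set_params(), nor fit_transform() is written in the class body; all three are inherited.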

2. Create Transformer Class

What is it?
Defines a new class for custom transformation logic.

Syntax:

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param1=True):
        self.param1 = param1

Explanation:

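  • __init__ only stores parameters; by convention it performs no validation or computation.
  • Every parameter should be an explicit keyword argument of __init__, stored unchanged on an attribute of the same name, because get_params() and clone() rely on this.
  • Following this convention enables hyperparameter tuning using GridSearchCV or RandomizedSearchCV.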
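
The following sketch shows how a constructor parameter becomes tunable once the class sits in a pipeline. PowerFeatures, its power parameter, and the toy data are all made up for this example; the step name 'custom' matches the naming used elsewhere in this article.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class PowerFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: raises every feature to a fixed power."""

    def __init__(self, power=1):
        # Store the argument unchanged under the same name; no validation here.
        self.power = power

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) ** self.power


# Toy data (made up): the label depends on the magnitude of x, not its sign.
X = np.array([[-3.0], [-2.5], [-2.0], [-0.5], [-0.2],
              [0.1], [0.4], [2.0], [2.5], [3.0]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1])

pipe = Pipeline([("custom", PowerFeatures()), ("model", LogisticRegression())])

# "<step name>__<parameter name>" addresses the transformer's parameter.
search = GridSearchCV(pipe, param_grid={"custom__power": [1, 2]}, cv=2)
search.fit(X, y)
print(search.best_params_)
```

The search can only vary power because it is an __init__ argument stored under the same attribute name.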

3. Implement fit()

What is it?
Trains or initializes any internal parameters needed for transformation.

Syntax:

def fit(self, X, y=None):
    return self

Explanation:

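  • For a stateless transformer, fit() just returns self; otherwise it stores what it learns (by convention in attributes ending with an underscore) and then returns self.
  • Required even if no training is needed.
  • Returning self lets Pipeline and fit_transform() chain fit() with transform().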
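
When learning is required, fit() is where the state gets stored. A minimal sketch, using a hypothetical MeanCenterer that remembers the per-column mean of the training data:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the per-column training mean."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state gets a trailing underscore by Scikit-learn convention.
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 25.0]])

centerer = MeanCenterer().fit(X_train)
# New data is centered with the means learned from X_train, not its own means.
centered_new = centerer.transform(X_new)
print(centerer.mean_, centered_new)
```

Reusing the training means on new data is what keeps preprocessing identical between training and inference.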

4. Implement transform()

What is it?
Applies the actual transformation logic to the data.

Syntax:

def transform(self, X):
    # Example transformation
    return X + 1

Explanation:

  • Performs the data modification.
  • Must return transformed data (same shape or modified as needed).
  • Should raise exceptions for invalid input types or formats.
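
One possible way to handle invalid input is to lean on Scikit-learn's own validation helpers, check_array and check_is_fitted. The sketch below is an assumption about how this could be structured, not a required pattern; ClipToTrainingRange is a hypothetical class.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted


class ClipToTrainingRange(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clips values to the range seen during fit()."""

    def fit(self, X, y=None):
        # By default check_array rejects 1-D input and NaN or infinite values.
        X = check_array(X)
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self

    def transform(self, X):
        # Raises NotFittedError if fit() has not been called yet.
        check_is_fitted(self, ["min_", "max_"])
        X = check_array(X)
        if X.shape[1] != self.min_.shape[0]:
            raise ValueError(
                f"Expected {self.min_.shape[0]} features, got {X.shape[1]}."
            )
        return np.clip(X, self.min_, self.max_)  # returns a new array


clipper = ClipToTrainingRange().fit(np.array([[0.0, 10.0], [5.0, 20.0]]))
clipped = clipper.transform(np.array([[-1.0, 25.0]]))

# A wrong number of columns triggers the explicit ValueError above.
try:
    clipper.transform(np.array([[1.0, 2.0, 3.0]]))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(clipped, error_message)
```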

5. Use in Pipeline

What is it?
Integrates the custom transformer into a modeling workflow.

Syntax:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('custom', MyTransformer()),
    ('model', LogisticRegression())
])

Explanation:

  • Enables chaining of multiple preprocessing and modeling steps.
  • Useful for standardizing the ML workflow.
  • Allows consistent transformation in training and inference.
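
To see the "consistent transformation" point in action, this sketch assembles the pieces from steps 2 to 4 into one runnable script and compares a pipeline prediction with the same steps applied by hand. The toy data is made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MyTransformer(BaseEstimator, TransformerMixin):
    """The X + 1 example transformer from the steps above."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) + 1


# Toy data (made up for this sketch)
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[0.5], [2.5]])

pipeline = Pipeline([("custom", MyTransformer()), ("model", LogisticRegression())])
pipeline.fit(X_train, y_train)

# predict() sends X_new through MyTransformer.transform() before the model sees it.
via_pipeline = pipeline.predict(X_new)

# The same two steps applied manually give the same predictions.
manual = pipeline.named_steps["model"].predict(
    pipeline.named_steps["custom"].transform(X_new)
)
print(via_pipeline, manual)
```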

Real-Life Project: Feature Engineering with Custom Transformers

Project Overview

Create a transformer that adds a new feature based on domain knowledge.

Code Example

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Custom Transformer: Adds BMI feature
class BMICalculator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['BMI'] = X['Weight'] / ((X['Height']/100) ** 2)
        return X

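# Sample data (illustrative values)
data = pd.DataFrame({
    'Height': [170, 160, 180, 175, 165, 185, 158, 172],
    'Weight': [70, 60, 90, 85, 55, 100, 50, 68],
    'Target': [1, 0, 1, 1, 0, 1, 0, 0]
})

X = data.drop('Target', axis=1)
y = data['Target']

# Pipeline
pipeline = Pipeline([
    ('bmi_calc', BMICalculator()),
    ('model', LogisticRegression())
])

# Stratify so both classes appear in the training split; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
pipeline.fit(X_train, y_train)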

Expected Output

  • Model trained with BMI as an engineered feature
  • Clean and modular ML workflow
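
To check the engineered feature on its own, the transformer can be run outside the pipeline. This sketch repeats the BMICalculator definition so it runs standalone, and assumes Height is in centimetres (implied by the division by 100) and Weight is in kilograms (an assumption, since the article does not state units).

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class BMICalculator(BaseEstimator, TransformerMixin):
    """Same logic as the project code: BMI = weight / (height in metres) ** 2."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        X["BMI"] = X["Weight"] / ((X["Height"] / 100) ** 2)
        return X


sample = pd.DataFrame({"Height": [170, 160, 180], "Weight": [70, 60, 90]})
with_bmi = BMICalculator().fit_transform(sample)
print(with_bmi.round(2))
```

Because transform() works on a copy, the original sample DataFrame keeps its two columns.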

Common Mistakes to Avoid

  • ❌ Forgetting to inherit both BaseEstimator and TransformerMixin
  • ❌ Forgetting to return self from fit()
  • ❌ Changing column order or names unexpectedly in transform()
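
The fit() return-value mistake is easy to reproduce. In this sketch, BrokenFit is a deliberately faulty, hypothetical class whose fit() returns None, so chaining fit() with transform() fails.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class BrokenFit(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        pass  # bug: implicitly returns None instead of self

    def transform(self, X):
        return np.asarray(X)


X = np.array([[1.0], [2.0]])

# Chaining relies on fit() returning the estimator; here it returns None,
# so the .transform lookup raises AttributeError.
try:
    BrokenFit().fit(X).transform(X)
    failure = None
except AttributeError as exc:
    failure = str(exc)

print(failure)
```

Adding return self at the end of fit() resolves the error.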

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon