Data cleaning and preprocessing are foundational steps in any machine learning project. Without clean and structured data, even the best algorithms cannot perform well. Scikit-learn, a leading machine learning library in Python, offers simple yet powerful tools to clean, impute, scale, encode, and prepare your data efficiently. This guide will walk you through these essential techniques.
Key Characteristics of Data Cleaning and Preprocessing with Scikit-learn
- Handles Missing Data Gracefully: Use imputers to fill missing values using statistical strategies.
- Feature Scaling: Normalize or standardize features to improve model performance.
- Categorical Encoding: Use `OneHotEncoder` and `OrdinalEncoder` to convert text data.
- Column-wise Processing: Apply distinct transformations to specific column types using `ColumnTransformer`.
- Reusable Pipelines: Combine steps into a streamlined workflow with `Pipeline`.
Basic Rules for Cleaning and Preprocessing
- Always split data before fitting preprocessing steps to avoid data leakage.
- Use `fit_transform()` on training data and `transform()` on test data (see the sketch after this list).
- Impute missing values before scaling or encoding.
- Scale only numeric data and encode only categorical data.
- Wrap your steps in `Pipeline` or `ColumnTransformer` to keep it modular.
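A minimal sketch of the split-then-fit pattern, assuming a feature matrix `X` and target `y` are already loaded; the scaler learns its statistics from the training split only and merely applies them to the test split.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first so the test set never influences the fitted statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on test data
```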
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Missing Value Imputation | `SimpleImputer(strategy='mean')` | Replaces missing values with the column mean |
| 2 | Standard Scaling | `StandardScaler()` | Standardizes numeric features |
| 3 | Min-Max Scaling | `MinMaxScaler()` | Scales features to a 0–1 range |
| 4 | Categorical Encoding | `OneHotEncoder()` | Converts text categories into binary columns |
| 5 | Column-wise Transformation | `ColumnTransformer([...])` | Applies different transforms to numeric/categorical columns |
| 6 | Processing Pipeline | `Pipeline([...])` | Chains preprocessing steps together |
Syntax Explanation
1. Missing Value Imputation
- What is it? Automatically fills in missing data in your dataset.
- Syntax:

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_clean = imputer.fit_transform(X)
```
- Explanation:
- Replace `NaN` values with the mean, median, most frequent value, or a constant.
- Prevents dropping rows or columns unnecessarily.
- Use `strategy='constant'` for categorical fields (see the sketch below).
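As a small illustration of the categorical case, this hedged sketch imputes a made-up object-dtype column with `strategy='constant'`; the column values and the `'missing'` fill label are assumptions for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry
X_cat = np.array([['basic'], [np.nan], ['premium'], ['basic']], dtype=object)

# Fill missing categories with an explicit placeholder label
imputer = SimpleImputer(strategy='constant', fill_value='missing')
print(imputer.fit_transform(X_cat))
# expected: [['basic'] ['missing'] ['premium'] ['basic']]
```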
2. Feature Scaling
- What is it? Adjusts numerical features to have comparable scales.
- Syntax:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
- Explanation:
- Makes data suitable for distance-based models (e.g., SVM, KNN).
- Mean becomes 0 and variance becomes 1.
- Use `MinMaxScaler()` if data needs to be in the [0, 1] range (see the sketch below).
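A quick sketch contrasting the two scalers on a tiny made-up column (the numbers are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])  # illustrative values

print(StandardScaler().fit_transform(X).ravel())  # roughly [-1.18, -0.09, 1.27]
print(MinMaxScaler().fit_transform(X).ravel())    # [0.0, 0.444..., 1.0]
```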
3. Categorical Encoding
- What is it? Converts categories into numbers that models can use.
- Syntax:

```python
from sklearn.preprocessing import OneHotEncoder

# In scikit-learn >= 1.2 the dense-output argument is `sparse_output`
# (the older `sparse` argument was deprecated and later removed)
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X_cat)
```
- Explanation:
- Converts each category into a binary column.
- Avoids assigning misleading ordinal relationships.
- `handle_unknown='ignore'` helps during the prediction phase, when categories unseen at fit time can appear (see the sketch below).
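A minimal sketch (with made-up category values) of why `handle_unknown='ignore'` matters: a category first seen at prediction time is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train_cat = np.array([['red'], ['blue'], ['red']])  # categories seen during fit
X_new_cat = np.array([['green']])                     # unseen category at prediction time

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(X_train_cat)
print(encoder.transform(X_new_cat))  # [[0. 0.]] -- no error, just an all-zero row
```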
4. Column-wise Transformation
- What is it? Applies different transformations to different column groups.
- Syntax:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(), categorical_features)
])
X_transformed = preprocessor.fit_transform(X)
```
- Explanation:
- Keeps transformations organized.
- Supports pipelines inside transformers (see the sketch below).
- Essential for structured datasets with mixed types.
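A small sketch of nesting a `Pipeline` inside a `ColumnTransformer`; the column names here are hypothetical, and the optional `remainder='passthrough'` argument keeps any unlisted columns untouched.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numeric columns: impute, then scale; categorical columns: one-hot encode.
# Column names are hypothetical placeholders for illustration.
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), ['tenure', 'monthly_charges']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender', 'plan_type'])
], remainder='passthrough')
```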
5. Preprocessing Pipeline
- What is it? Combines preprocessing steps into one reusable unit.
- Syntax:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
X_prepared = pipe.fit_transform(X)
```
- Explanation:
- Ensures reproducibility and fewer bugs.
- Can end with an estimator so preprocessing and the model live in one object, e.g. `Pipeline([('preprocess', preprocessor), ('model', LogisticRegression())])` (see the sketch below).
- Ideal for cross-validation and deployment workflows.
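A hedged sketch of that pattern combined with cross-validation, assuming `preprocessor`, `X`, and `y` are defined as in the earlier snippets; `LogisticRegression` is just an illustrative choice of model.

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model_pipe = Pipeline([
    ('preprocess', preprocessor),                  # imputation, scaling, encoding
    ('model', LogisticRegression(max_iter=1000))   # illustrative estimator
])

# Preprocessing is re-fitted inside each fold, so no information leaks between folds
scores = cross_val_score(model_pipe, X, y, cv=5)
print(scores.mean())
```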
Real-Life Project: Churn Data Preprocessing
Project Name
Preprocessing Telco Customer Data for Churn Prediction
Project Overview
This project demonstrates cleaning and transforming a real-world customer churn dataset using Scikit-learn. It handles missing values, encodes categorical fields, and scales numerical features to prepare the dataset for machine learning.
Project Goal
- Impute missing values in customer records
- Encode categorical columns like gender and plan type
- Normalize charges and tenure columns
- Output a clean dataset for modeling
Code for This Project
```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load dataset
data = pd.read_csv('telco_churn.csv')
y = data['Churn']
X = data.drop('Churn', axis=1)

# Define column groups
num_cols = X.select_dtypes(include=['float64', 'int64']).columns
cat_cols = X.select_dtypes(include=['object']).columns

# Define transformers
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine into preprocessor
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, num_cols),
    ('cat', categorical_pipeline, cat_cols)
])

X_preprocessed = preprocessor.fit_transform(X)
```
Expected Output
- No missing values
- All text fields encoded
- All numeric fields scaled
- A clean feature matrix (a NumPy array or, if the one-hot output is sparse enough, a SciPy sparse matrix) ready for classification (see the sketch below)
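To show how this output feeds a model, here is a minimal follow-up sketch that wraps the same `preprocessor` and a classifier into a single pipeline and evaluates it on a held-out split; the choice of `LogisticRegression` and the split parameters are assumptions for illustration, not part of the original project.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

churn_model = Pipeline([
    ('preprocess', preprocessor),                      # re-fitted on the training split only
    ('classifier', LogisticRegression(max_iter=1000))  # illustrative model choice
])

churn_model.fit(X_train, y_train)
print(accuracy_score(y_test, churn_model.predict(X_test)))
```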
Common Mistakes to Avoid
- ❌ Applying transformations before splitting data → causes data leakage
- ❌ Using `fit_transform()` on test data instead of `transform()`
- ❌ Forgetting to handle unknown categories in OneHotEncoder
- ❌ Ignoring the pipeline structure → results in inconsistent preprocessing
Further Reading Recommendation
To go beyond basics and master real-world Scikit-learn workflows:
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
Also explore:
- 🔗 Scikit-learn Docs: https://scikit-learn.org/stable/user_guide.html
- 🔗 Kaggle Notebooks for practice: https://www.kaggle.com/code
- 🔗 Scikit-learn Pipelines Tutorial on YouTube