Missing data is a common issue in real-world datasets. Whether due to user omission, system error, or data corruption, missing values can degrade model performance and bias predictions. Scikit-learn provides robust strategies to detect and handle missing values efficiently.
Key Characteristics of Missing Data Handling
- Flexible Imputation Strategies: Mean, median, mode, or custom value.
- Column and Row-wise Detection: Identify missing values per column or row.
- Pipeline Integration: Handle missing values as part of preprocessing.
- Support for Numeric and Categorical Data: Choose appropriate imputation per data type.
- Constant Value Fill: Useful for flags, categories, or default fill-in.
Basic Rules for Handling Missing Data
- Always check for missing values before preprocessing or modeling.
- Use visualization (heatmaps, missingno) for exploration.
- Fit imputation on training data, then apply it to test/validation sets (see the sketch after this list).
- Choose imputation strategies based on column types and distributions.
- Combine imputation with scaling and encoding in a pipeline.
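To illustrate the fit-on-train rule, here is a minimal sketch; the toy array and split are made up for demonstration:

```python
# Minimal sketch: learn imputation statistics on the training split only,
# then reuse them on the test split to avoid leakage.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [np.nan], [3.0], [4.0], [np.nan], [6.0]])  # toy data
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

imp = SimpleImputer(strategy='mean')
X_train_imp = imp.fit_transform(X_train)  # fit: learns the training mean
X_test_imp = imp.transform(X_test)        # transform only: no test statistics used
print(imp.statistics_)                    # the learned per-column fill value
```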
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Detect Missing Values | df.isnull().sum() | Returns missing count per column |
| 2 | Drop Rows with NaNs | df.dropna() | Removes rows that contain NaN |
| 3 | Simple Imputer (mean) | SimpleImputer(strategy='mean') | Imputes numeric features with the column mean |
| 4 | Simple Imputer (most_frequent) | SimpleImputer(strategy='most_frequent') | Fills categorical features with the mode |
| 5 | Constant Imputer | SimpleImputer(strategy='constant', fill_value=0) | Fills with a custom value |
| 6 | Pipeline Integration | Pipeline([...]) | Automates imputation within workflows |
Syntax Explanation
1. Detect Missing Values
- What is it? Identifies how many values are missing per column.
- Syntax:
df.isnull().sum()
- Explanation:
- df.isnull() returns a Boolean mask of missing cells.
- .sum() counts True (i.e., NaN) values column-wise.
- This is the first step in any missing data strategy.
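A minimal sketch, using a made-up two-column DataFrame, of both column-wise and row-wise detection:

```python
import numpy as np
import pandas as pd

# Toy DataFrame with deliberate gaps (illustrative values only)
df = pd.DataFrame({'Age': [25, np.nan, 40], 'City': ['Paris', 'Lyon', np.nan]})

print(df.isnull().sum())        # missing count per column
print(df.isnull().any(axis=1))  # row-wise flag: does the row contain any NaN?
```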
2. Drop Rows with NaNs
- What is it? Removes any rows that contain missing values.
- Syntax:
df_cleaned = df.dropna()
- Explanation:
- Useful when missing data is minimal.
- May reduce dataset size significantly.
- Use with caution to avoid data loss.
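A short sketch, reusing the same toy DataFrame idea, showing dropna and two of its common parameters:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, np.nan, 40], 'City': ['Paris', 'Lyon', np.nan]})

df_cleaned = df.dropna()               # drop rows containing any NaN
df_age_ok = df.dropna(subset=['Age'])  # drop rows only where 'Age' is missing
df_thresh = df.dropna(thresh=2)        # keep rows with at least 2 non-NaN values

print(len(df), len(df_cleaned))        # compare sizes to gauge the data loss
```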
3. SimpleImputer (Mean)
- What is it? Replaces missing values with the mean of the column.
- Syntax:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X)
- Explanation:
- Suitable for continuous numeric data.
- fit() learns column means from the training data.
- transform() applies those means to fill missing values.
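A runnable sketch on a made-up two-column matrix; statistics_ exposes the learned per-column means:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])  # toy matrix with one gap per column

imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X)
print(imp.statistics_)  # learned fill values: [ 2. 15.]
print(X_imputed)        # gaps replaced by the column means
```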
4. SimpleImputer (Most Frequent)
- What is it? Fills missing values with the most frequent value in a column.
- Syntax:
SimpleImputer(strategy='most_frequent')
- Explanation:
- Ideal for categorical or ordinal features.
- Fills with the mode, so rare or invalid categories are never introduced as fill values.
- Safer than constant fill in unknown domains.
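A minimal sketch on a made-up color column (the values are purely illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy categorical column; 'red' is the mode
X_cat = np.array([['red'], ['blue'], ['red'], [np.nan]], dtype=object)

imp = SimpleImputer(strategy='most_frequent')
print(imp.fit_transform(X_cat))  # the NaN becomes 'red'
```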
5. SimpleImputer (Constant Value)
- What is it? Fills missing values with a fixed specified value.
- Syntax:
SimpleImputer(strategy='constant', fill_value='Unknown')
- Explanation:
- Use for categorical placeholders or zero-fill.
- Makes missingness explicit for some models.
- The fill value must be type-compatible with the column.
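A minimal sketch with a made-up city column, using a string fill value to match the column's type:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_cat = np.array([['Paris'], [np.nan], ['Lyon']], dtype=object)

# fill_value must match the column's type: a string here, e.g. 0 for numerics
imp = SimpleImputer(strategy='constant', fill_value='Unknown')
print(imp.fit_transform(X_cat))  # the NaN becomes the explicit 'Unknown' label
```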
6. Pipeline Integration
- What is it? Wraps imputation logic into a reproducible pipeline.
- Syntax:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))
])
X_clean = pipe.fit_transform(X)
- Explanation:
- Ensures the same imputation statistics are applied consistently across training and inference.
- Can be combined with scalers, encoders, and estimators.
- Ideal for production and evaluation workflows.
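A sketch extending the syntax above with a scaler step (StandardScaler is one common choice, assumed here for illustration), showing how imputation slots in before other preprocessing:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [np.nan], [5.0], [7.0]])  # toy column

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill gaps first...
    ('scaler', StandardScaler())                    # ...then scale the result
])
X_clean = pipe.fit_transform(X)
print(X_clean)
```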
Real-Life Project: Imputing Customer Demographics
Project Name
Cleaning and Imputing Missing Values in Customer Dataset
Project Overview
We will clean a dataset containing customer profiles, where Age, Income, and City columns contain missing values. Using different strategies per column type, we prepare the dataset for segmentation and modeling.
Project Goal
- Impute numerical values (Age, Income) using the mean or median.
- Impute categorical fields (City) using the most frequent value or a placeholder.
- Wrap the transformation into a single pipeline.
Code for This Project
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# Load the customer dataset (assumes a local customer_data.csv file)
customer_data = pd.read_csv('customer_data.csv')

num_cols = ['Age', 'Income']
cat_cols = ['City']

# Numeric features: fill gaps with the column mean
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# Categorical features: fill gaps with the most frequent value
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent'))
])

# Route each column group to its own imputation pipeline
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

X_cleaned = preprocessor.fit_transform(customer_data)
Expected Output
- Clean matrix with no missing values.
- Numeric fields filled with their column means.
- Categorical fields filled with the most frequent value.
- Ready for modeling or export.
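If you want the result back as a labeled DataFrame, one option (assuming scikit-learn 1.0 or later, where ColumnTransformer exposes get_feature_names_out) is:

```python
import pandas as pd

# Continues the project code above; column names carry the 'num'/'cat' prefixes
X_cleaned_df = pd.DataFrame(
    X_cleaned,
    columns=preprocessor.get_feature_names_out()
)
print(X_cleaned_df.isnull().sum())  # sanity check: zero missing values expected
```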
Common Mistakes to Avoid
- ❌ Applying imputation after scaling/encoding
- ❌ Using test data during fit (data leakage)
- ❌ Dropping rows with high info value
- ❌ Using mean imputation for categorical columns (see the contrast sketch below)
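To see why the last mistake matters, here is a small contrast sketch on a made-up city column: strategy='mean' cannot average strings and raises an error, while mode-based fill succeeds:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

city = pd.DataFrame({'City': ['Paris', np.nan, 'Paris', 'Lyon']})

# ❌ mean imputation on a string column fails
try:
    SimpleImputer(strategy='mean').fit_transform(city)
except ValueError as err:
    print('mean failed:', type(err).__name__)

# ✅ most_frequent fills the gap with the modal value 'Paris'
print(SimpleImputer(strategy='most_frequent').fit_transform(city))
```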
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
Also explore:
- 🔗 Scikit-learn Imputation Docs: https://scikit-learn.org/stable/modules/impute.html
- 🔗 missingno for Missing Data Visualization
- 🔗 Real-World Datasets with Missingness (Kaggle)
