Mastering Missing Data Handling with Scikit-learn

Missing data is a common issue in real-world datasets. Whether due to user omission, system error, or data corruption, missing values can affect model performance and bias predictions. Scikit-learn provides robust strategies to detect and handle missing values efficiently.

Key Characteristics of Missing Data Handling

  • Flexible Imputation Strategies: Mean, median, mode, or custom value.
  • Column and Row-wise Detection: Identify missing values per column or row.
  • Pipeline Integration: Handle missing values as part of preprocessing.
  • Support for Numeric and Categorical Data: Choose appropriate imputation per data type.
  • Constant Value Fill: Useful for flags, categories, or default fill-in.

Basic Rules for Handling Missing Data

  • Always check for missing values before preprocessing or modeling.
  • Use visualization (heatmaps, missingno) for exploration.
  • Fit imputation on training data only, then apply it to test/validation sets (see the sketch after this list).
  • Choose imputation strategies based on column types and distributions.
  • Combine imputation with scaling and encoding in a pipeline.
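
As an example, here is a minimal sketch of the fit-on-train rule above, using a small made-up numeric feature matrix:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix with missing entries
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 6.0]])
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imp = SimpleImputer(strategy='mean')
imp.fit(X_train)                        # statistics learned from training data only
X_train_clean = imp.transform(X_train)
X_test_clean = imp.transform(X_test)    # reuse training statistics; no leakage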

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Detect Missing Values | df.isnull().sum() | Returns missing count per column
2 | Drop Rows with NaNs | df.dropna() | Removes rows that contain NaN
3 | Simple Imputer (mean) | SimpleImputer(strategy='mean') | Imputes numeric features with mean
4 | Simple Imputer (most_frequent) | SimpleImputer(strategy='most_frequent') | Categorical mode fill
5 | Constant Imputer | SimpleImputer(strategy='constant', fill_value=0) | Fill with custom value
6 | Pipeline Integration | Pipeline([...]) | Automates imputation within workflows

Syntax Explanation

1. Detect Missing Values

  • What is it? Identifies how many values are missing per column.
  • Syntax:
df.isnull().sum()
  • Explanation:
    • Use df.isnull() to get a Boolean mask of missing cells.
    • .sum() counts True (i.e., NaN) values column-wise.
    • First step in any missing data strategy.
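
A quick, self-contained example (the column names and values are illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({
  'Age': [25, np.nan, 31],
  'City': ['Pune', 'Delhi', None]
})
print(df.isnull().sum())   # Age: 1, City: 1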

2. Drop Rows with NaNs

  • What is it? Removes any rows that contain missing values.
  • Syntax:
df_cleaned = df.dropna()
  • Explanation:
    • Useful when missing data is minimal.
    • May reduce dataset size significantly.
    • Use with caution to avoid data loss.
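
For example, row dropping can be restricted to specific columns to limit data loss (column names hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 40], 'Income': [50000, 60000, np.nan]})
df_cleaned = df.dropna()                    # drops every row containing any NaN
df_partial = df.dropna(subset=['Income'])   # drops only rows missing 'Income'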

3. SimpleImputer (Mean)

  • What is it? Replaces missing values with the mean of the column.
  • Syntax:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X)
  • Explanation:
    • Suitable for continuous numeric data.
    • fit() learns column means from training data.
    • transform() applies imputation to missing values.
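
A short sketch of the learned statistics on made-up numeric data:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X)
print(imp.statistics_)   # learned column means: [2.0, 15.0]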

4. SimpleImputer (Most Frequent)

  • What is it? Fills missing values with the most frequent value in a column.
  • Syntax:
SimpleImputer(strategy='most_frequent')
  • Explanation:
    • Ideal for categorical or ordinal features.
    • Avoids introducing rare or unseen categories into the data.
    • Safer than constant fill in unknown domains.
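
A minimal example with a categorical column (categories are illustrative). SimpleImputer expects 2-D input, so the column is kept as a DataFrame:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cities = pd.DataFrame({'City': ['Delhi', 'Delhi', np.nan, 'Pune']})
imp = SimpleImputer(strategy='most_frequent')
print(imp.fit_transform(cities))   # the NaN is replaced with 'Delhi'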

5. SimpleImputer (Constant Value)

  • What is it? Fills missing values with a fixed specified value.
  • Syntax:
SimpleImputer(strategy='constant', fill_value='Unknown')
  • Explanation:
    • Use for categorical placeholders or zero-fill.
    • Makes missingness explicit for some models.
    • Fill value must be type-compatible with column.
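
A brief sketch of constant filling for a categorical column (the 'Unknown' placeholder is arbitrary):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cities = pd.DataFrame({'City': ['Delhi', np.nan, 'Pune']})
imp = SimpleImputer(strategy='constant', fill_value='Unknown')
print(imp.fit_transform(cities))   # NaN becomes the explicit 'Unknown' category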

6. Pipeline Integration

  • What is it? Wraps imputation logic into a reproducible pipeline.
  • Syntax:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
  ('imputer', SimpleImputer(strategy='median'))
])
X_clean = pipe.fit_transform(X)
  • Explanation:
    • Ensures the same imputation is applied consistently to training and new data.
    • Can be combined with scalers, encoders, and estimators.
    • Ideal for production and evaluation workflows.
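
As an illustration, the imputer can be chained with a scaler and an estimator so every step sees consistently cleaned data (the estimator choice here is arbitrary):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
  ('imputer', SimpleImputer(strategy='median')),
  ('scaler', StandardScaler()),
  ('model', LogisticRegression())
])
pipe.fit(X, y)          # imputation, scaling, and model fitting run in order
print(pipe.predict(X))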

Real-Life Project: Imputing Customer Demographics

Project Name

Cleaning and Imputing Missing Values in Customer Dataset

Project Overview

We will clean a dataset of customer profiles in which the Age, Income, and City columns contain missing values. Using different strategies per column type, we prepare the dataset for segmentation and modeling.

Project Goal

  • Impute numerical values (Age, Income) using mean/median.
  • Impute categorical fields (City) using most frequent or a placeholder.
  • Wrap transformation into a single pipeline.

Code for This Project

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Load the customer dataset (customer_data.csv must be available locally)
customer_data = pd.read_csv('customer_data.csv')

num_cols = ['Age', 'Income']
cat_cols = ['City']

num_pipeline = Pipeline([
  ('imputer', SimpleImputer(strategy='mean'))
])

cat_pipeline = Pipeline([
  ('imputer', SimpleImputer(strategy='most_frequent'))
])

preprocessor = ColumnTransformer([
  ('num', num_pipeline, num_cols),
  ('cat', cat_pipeline, cat_cols)
])

X_cleaned = preprocessor.fit_transform(customer_data)
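
The ColumnTransformer returns a NumPy array. As an optional follow-up (continuing from the code above; the output column order matches the transformer definition), it can be wrapped back into a labeled DataFrame:

X_cleaned_df = pd.DataFrame(X_cleaned, columns=num_cols + cat_cols)
print(X_cleaned_df.isnull().sum())   # should report zero missing values per column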

Expected Output

  • Clean matrix with no missing values.
  • Numeric fields filled with statistical values.
  • Categorical fields filled with top occurring value.
  • Ready for modeling or export.

Common Mistakes to Avoid

  • ❌ Applying imputation after scaling/encoding
  • ❌ Using test data during fit (data leakage)
  • ❌ Dropping rows with high info value
  • ❌ Using mean imputation for categorical columns

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon

Also explore: