Mastering Missing Data Handling with Scikit-learn

Missing data is a common issue in real-world datasets. Whether due to user omission, system error, or data corruption, missing values can affect model performance and bias predictions. Scikit-learn provides robust strategies to detect and handle missing values efficiently.

Key Characteristics of Missing Data Handling

  • Flexible Imputation Strategies: Mean, median, mode, or custom value.
  • Column and Row-wise Detection: Identify missing values per column or row.
  • Pipeline Integration: Handle missing values as part of preprocessing.
  • Support for Numeric and Categorical Data: Choose appropriate imputation per data type.
  • Constant Value Fill: Useful for flags, categories, or default fill-in.

Basic Rules for Handling Missing Data

  • Always check for missing values before preprocessing or modeling.
  • Use visualization (heatmaps, missingno) for exploration.
  • Fit imputation on training data only, then apply it to test/validation sets (see the sketch after this list).
  • Choose imputation strategies based on column types and distributions.
  • Combine imputation with scaling and encoding in a pipeline.
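
As an example, here is a minimal sketch of the fit-on-train rule above, using a small made-up numeric feature matrix:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix with missing entries
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 6.0]])
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imp = SimpleImputer(strategy='mean')
imp.fit(X_train)                        # statistics learned from training data only
X_train_clean = imp.transform(X_train)
X_test_clean = imp.transform(X_test)    # reuse training statistics; no leakage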

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Detect Missing Values | df.isnull().sum() | Returns missing count per column
2 | Drop Rows with NaNs | df.dropna() | Removes rows that contain NaN
3 | Simple Imputer (mean) | SimpleImputer(strategy='mean') | Imputes numeric features with mean
4 | Simple Imputer (most_frequent) | SimpleImputer(strategy='most_frequent') | Categorical mode fill
5 | Constant Imputer | SimpleImputer(strategy='constant', fill_value=0) | Fill with custom value
6 | Pipeline Integration | Pipeline([...]) | Automates imputation within workflows

Syntax Explanation

1. Detect Missing Values

  • What is it? Identifies how many values are missing per column.
  • Syntax:
df.isnull().sum()
  • Explanation:
    • Use df.isnull() to get a Boolean mask of missing cells.
    • .sum() counts True (i.e., NaN) values column-wise.
    • First step in any missing data strategy.
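
A quick, self-contained example (the column names and values are illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({
  'Age': [25, np.nan, 31],
  'City': ['Pune', 'Delhi', None]
})
print(df.isnull().sum())   # Age: 1, City: 1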

2. Drop Rows with NaNs

  • What is it? Removes any rows that contain missing values.
  • Syntax:
df_cleaned = df.dropna()
  • Explanation:
    • Useful when missing data is minimal.
    • May reduce dataset size significantly.
    • Use with caution to avoid data loss.
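
For example, row dropping can be restricted to specific columns to limit data loss (column names hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 40], 'Income': [50000, 60000, np.nan]})
df_cleaned = df.dropna()                    # drops every row containing any NaN
df_partial = df.dropna(subset=['Income'])   # drops only rows missing 'Income'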

3. SimpleImputer (Mean)

  • What is it? Replaces missing values with the mean of the column.
  • Syntax:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X)
  • Explanation:
    • Suitable for continuous numeric data.
    • fit() learns column means from training data.
    • transform() applies imputation to missing values.
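
A short sketch of the learned statistics on made-up numeric data:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X)
print(imp.statistics_)   # learned column means: [2.0, 15.0]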

4. SimpleImputer (Most Frequent)

  • What is it? Fills missing values with the most frequent value in a column.
  • Syntax:
SimpleImputer(strategy='most_frequent')
  • Explanation:
    • Ideal for categorical or ordinal features.
    • Avoids introducing rare or unseen categories into the data.
    • Safer than constant fill in unknown domains.
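
A minimal example with a categorical column (categories are illustrative). SimpleImputer expects 2-D input, so the column is kept as a DataFrame:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cities = pd.DataFrame({'City': ['Delhi', 'Delhi', np.nan, 'Pune']})
imp = SimpleImputer(strategy='most_frequent')
print(imp.fit_transform(cities))   # the NaN is replaced with 'Delhi'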

5. SimpleImputer (Constant Value)

  • What is it? Fills missing values with a fixed specified value.
  • Syntax:
SimpleImputer(strategy='constant', fill_value='Unknown')
  • Explanation:
    • Use for categorical placeholders or zero-fill.
    • Makes missingness explicit for some models.
    • Fill value must be type-compatible with column.
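
A brief sketch of constant filling for a categorical column (the 'Unknown' placeholder is arbitrary):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cities = pd.DataFrame({'City': ['Delhi', np.nan, 'Pune']})
imp = SimpleImputer(strategy='constant', fill_value='Unknown')
print(imp.fit_transform(cities))   # NaN becomes the explicit 'Unknown' category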

6. Pipeline Integration

  • What is it? Wraps imputation logic into a reproducible pipeline.
  • Syntax:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
  ('imputer', SimpleImputer(strategy='median'))
])
X_clean = pipe.fit_transform(X)
  • Explanation:
    • Ensures the same imputation is applied consistently to training and new data.
    • Can be combined with scalers, encoders, and estimators.
    • Ideal for production and evaluation workflows.
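
As an illustration, the imputer can be chained with a scaler and an estimator so every step sees consistently cleaned data (the estimator choice here is arbitrary):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
  ('imputer', SimpleImputer(strategy='median')),
  ('scaler', StandardScaler()),
  ('model', LogisticRegression())
])
pipe.fit(X, y)          # imputation, scaling, and model fitting run in order
print(pipe.predict(X))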

Real-Life Project: Imputing Customer Demographics

Project Name

Cleaning and Imputing Missing Values in Customer Dataset

Project Overview

We will clean a dataset of customer profiles in which the Age, Income, and City columns contain missing values. Using different strategies per column type, we prepare the dataset for segmentation and modeling.

Project Goal

  • Impute numerical values (Age, Income) using mean/median.
  • Impute categorical fields (City) using most frequent or a placeholder.
  • Wrap transformation into a single pipeline.

Code for This Project

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Load the customer dataset (customer_data.csv must be available locally)
customer_data = pd.read_csv('customer_data.csv')

num_cols = ['Age', 'Income']
cat_cols = ['City']

num_pipeline = Pipeline([
  ('imputer', SimpleImputer(strategy='mean'))
])

cat_pipeline = Pipeline([
  ('imputer', SimpleImputer(strategy='most_frequent'))
])

preprocessor = ColumnTransformer([
  ('num', num_pipeline, num_cols),
  ('cat', cat_pipeline, cat_cols)
])

X_cleaned = preprocessor.fit_transform(customer_data)
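
The ColumnTransformer returns a NumPy array. As an optional follow-up (continuing from the code above; the output column order matches the transformer definition), it can be wrapped back into a labeled DataFrame:

X_cleaned_df = pd.DataFrame(X_cleaned, columns=num_cols + cat_cols)
print(X_cleaned_df.isnull().sum())   # should report zero missing values per column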

Expected Output

  • Clean matrix with no missing values.
  • Numeric fields filled with statistical values.
  • Categorical fields filled with top occurring value.
  • Ready for modeling or export.

Common Mistakes to Avoid

  • ❌ Applying imputation after scaling/encoding
  • ❌ Using test data during fit (data leakage)
  • ❌ Dropping rows with high info value
  • ❌ Using mean imputation for categorical columns

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon

Also explore: