Integration of Scikit-learn with Pandas and NumPy

Scikit-learn works directly with Pandas and NumPy, two of the most widely used Python libraries for data manipulation and numerical computing. Its estimators and transformers accept both NumPy arrays and Pandas DataFrames as input, so preprocessing, modeling, and analysis can all use familiar data structures.

Key Characteristics

  • Accepts NumPy arrays and Pandas DataFrames as input
  • Selects columns by name in tools such as ColumnTransformer when the input is a DataFrame
  • Returns predictions and transformed data as NumPy arrays by default (these can be wrapped in a DataFrame)
  • Works with subsets selected through .iloc, .loc, boolean indexing, and slicing

Basic Rules

  • Always check data types and shapes before passing data to Scikit-learn
  • Use .to_numpy() (or the older .values attribute) if an explicit NumPy array is needed
  • Convert NumPy predictions back to a Pandas Series/DataFrame with the proper index
  • Avoid passing DataFrames that mix numeric and categorical columns straight to an estimator; route them through ColumnTransformer first
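
The first two rules can be sketched as follows. This is a minimal illustration on hypothetical toy data; the column names are assumptions and not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 40, 31],
    "monthly_fee": [29.9, 59.0, 42.5],
    "churn": [0, 1, 0],
})

X = df[["age", "monthly_fee"]]   # 2-D feature table
y = df["churn"]                  # 1-D target

# Inspect types and shapes before calling fit()
print(X.dtypes)
print(X.shape, y.shape)

# Explicit NumPy conversion when an array is required
X_array = X.to_numpy()
y_array = y.to_numpy()
```

Features should be two-dimensional (rows by columns) and the target one-dimensional, with the same number of rows in both.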

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Fit Model with DataFrame | model.fit(df[['feature']], df['target']) | Fits a model using Pandas DataFrame inputs
2 | Transform DataFrame Columns | scaler.fit_transform(df[['feature']]) | Applies scaling to selected columns
3 | Predict and Convert to Series | pd.Series(model.predict(df[['feature']]), index=df.index) | Converts NumPy output to a Series with the original index
4 | Use with NumPy Array | model.fit(X_array, y_array) | Standard NumPy array input
5 | ColumnTransformer with Names | ColumnTransformer([...], remainder='passthrough') | Processes selected columns with transformers

Syntax Explanation

1. Fit Model with DataFrame

What is it?
Trains a model using a Pandas DataFrame for the features and a Series for the target.

Syntax:

model.fit(df[['feature']], df['target'])

Explanation:

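  • Uses DataFrame column(s) directly. The double brackets in df[['feature']] keep the features two-dimensional, and the fitted estimator remembers the column names.
  • Helpful when selecting features by name or feeding pipeline-based transformations.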

2. Transform DataFrame Columns

What is it?
Scales or modifies specific columns in a DataFrame.

Syntax:

scaler.fit_transform(df[['feature']])

Explanation:

  • Fits the transformer on the selected DataFrame columns and transforms them in one step.
  • Output is a NumPy array by default but can be converted back to a DataFrame.
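
The conversion back to a DataFrame might look like the sketch below. The data and the string index are hypothetical, chosen only to show that labels survive the round trip.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column data with a non-default index
df = pd.DataFrame({"feature": [10.0, 20.0, 30.0]}, index=["a", "b", "c"])

scaler = StandardScaler()
scaled = scaler.fit_transform(df[["feature"]])   # NumPy array, shape (3, 1)

# Re-attach the column name and the original index
scaled_df = pd.DataFrame(scaled, columns=["feature"], index=df.index)
print(scaled_df)
```

In scikit-learn 1.2 or later, calling scaler.set_output(transform="pandas") before fitting is another way to get a DataFrame back without the manual wrapping.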

3. Predict and Convert to Series

What is it?
Runs model prediction and wraps the result in a Pandas Series that carries the original index.

Syntax:

pd.Series(model.predict(df[['feature']]), index=df.index)

Explanation:

  • Ensures output aligns with original data indices.
  • Useful for joining predictions back to the original dataset.
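
As a small illustration of joining predictions back, the sketch below uses a hypothetical DataFrame with a custom integer index and a DecisionTreeClassifier as a stand-in model.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data with a non-default index (e.g. customer IDs)
df = pd.DataFrame(
    {"feature": [1, 2, 8, 9], "target": [0, 0, 1, 1]},
    index=[101, 102, 103, 104],
)

model = DecisionTreeClassifier(random_state=0)
model.fit(df[["feature"]], df["target"])

# Wrap the NumPy output so each prediction keeps its row label
pred = pd.Series(
    model.predict(df[["feature"]]), index=df.index, name="prediction"
)

# Index-aligned join: every row gets its own prediction
result = df.join(pred)
print(result)
```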

4. Use with NumPy Array

What is it?
Trains or predicts using NumPy arrays instead of DataFrames.

Syntax:

model.fit(X_array, y_array)

Explanation:

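  • The native input format: Scikit-learn converts DataFrames to arrays internally for most estimators.
  • Avoids that conversion step and keeps code simple when column labels are not needed, which can help with large datasets.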
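
A minimal sketch with plain arrays, assuming hypothetical one-feature data and LogisticRegression as the example estimator:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features must have shape (n_samples, n_features)
X_array = np.array([[0.0], [1.0], [2.0], [3.0]])
y_array = np.array([0, 0, 1, 1])

model = LogisticRegression()
model.fit(X_array, y_array)

# A single feature stored as a 1-D array must be reshaped into a column
x_new = np.array([0.5, 2.5]).reshape(-1, 1)
pred = model.predict(x_new)
print(pred)
```

The reshape(-1, 1) call is the array counterpart of the double brackets used with DataFrames: both produce two-dimensional feature input.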

5. ColumnTransformer with Names

What is it?
Applies transformations to specified columns in a DataFrame using names.

Syntax:

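from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer([
    ('scale', StandardScaler(), ['col1', 'col2']),   # numeric columns
    ('encode', OneHotEncoder(), ['category'])        # categorical column
], remainder='passthrough')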

Explanation:

  • Allows selective column-wise transformations.
  • Keeps columns not assigned to any transformer when remainder='passthrough' is set; the default, remainder='drop', discards them.
  • Very effective for mixed data types (numeric + categorical).

Real-Life Project: Customer Churn Prediction

Project Overview

Use a Pandas DataFrame with Scikit-learn preprocessing and a random forest classifier to train a model that predicts customer churn.

Code Example

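import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load data
df = pd.read_csv("churn_data.csv")
X = df[['age', 'monthly_fee', 'contract_type']]
y = df['churn']

# Train/test split first, so preprocessing is fitted on training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Preprocess: scale numeric features and one-hot encode contract_type
# (contract_type is assumed to be a categorical column)
preprocess = ColumnTransformer([
    ('scale', StandardScaler(), ['age', 'monthly_fee']),
    ('encode', OneHotEncoder(handle_unknown='ignore'), ['contract_type'])
])

# Train model (preprocessing and classifier chained in one pipeline)
model = Pipeline([
    ('preprocess', preprocess),
    ('forest', RandomForestClassifier(random_state=42))
])
model.fit(X_train, y_train)

# Predict, keeping the index of the test rows
y_pred = pd.Series(model.predict(X_test), index=X_test.index, name='churn_pred')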

Expected Output

  • A Series of predicted churn labels (y_pred) whose index matches the test rows of the original DataFrame, built from scaled numeric features.

Common Mistakes to Avoid

  • ❌ Passing a DataFrame with object-dtype columns to an estimator (make sure every column is numeric or properly encoded first)
  • ❌ Mismatched shape or index when merging predictions with the original data
  • ❌ Leaving array output unlabeled instead of converting it back to a Series/DataFrame with the right index
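
The index mismatch is easy to reproduce. In the sketch below, the hard-coded array is a stand-in for the output of model.predict(...), and the data is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": [1, 2, 3]}, index=[10, 20, 30])
raw_pred = np.array([0, 1, 1])   # stand-in for model.predict(...)

# Mistake: the default index (0, 1, 2) does not match df's index
bad = pd.Series(raw_pred, name="pred")
misaligned = df.join(bad)        # pred column is entirely NaN

# Fix: reuse the index of the rows that were predicted
good = pd.Series(raw_pred, index=df.index, name="pred")
aligned = df.join(good)

print(misaligned)
print(aligned)
```

Pandas aligns on index labels rather than on position, so the misaligned join raises no error and simply fills the column with missing values.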
