Integration of Scikit-learn with Pandas and NumPy

Scikit-learn integrates seamlessly with Pandas and NumPy, the two most commonly used Python libraries for data manipulation and numerical computing. This integration allows smooth preprocessing, modeling, and analysis workflows using familiar data structures.

Key Characteristics

Accepts NumPy arrays and Pandas DataFrames as input
Maintains compatibility with Pandas for column-based operations
Outputs predictions and transformations as NumPy arrays by default (convert to a Series or DataFrame when labels are needed)
Works with DataFrame selections made through iloc, loc, indexing, and slicing

Basic Rules

Always check data types and shapes before feeding to Scikit-learn
Use .values or .to_numpy() when an explicit NumPy array is needed (.to_numpy() is the form the Pandas documentation recommends)
Convert NumPy predictions back to Pandas Series/DataFrame with proper indexing
Avoid passing mixed-type DataFrames unless using ColumnTransformer
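
The first two rules can be sketched as follows. This is a minimal illustration on a made-up three-row DataFrame; the column names and values are assumptions chosen for the example, not data from the project below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "monthly_fee": [29.9, 49.9, 19.9],
    "contract_type": ["monthly", "yearly", "monthly"],
})

# Inspect dtypes and shape before handing data to an estimator.
print(df.dtypes)
print(df.shape)

# Keep only the numeric columns, then convert explicitly to a NumPy array.
numeric_df = df.select_dtypes(include="number")
X_array = numeric_df.to_numpy()
print(X_array.shape, X_array.dtype)
```

The string column contract_type is left out here; it would need encoding (for example through ColumnTransformer) before a model could use it.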

Syntax Table

No.	Technique	Syntax Example	Description
1	Fit Model with DataFrame	`model.fit(df[['feature']], df['target'])`	Fits model using Pandas DataFrame inputs
2	Transform DataFrame Columns	`scaler.fit_transform(df[['feature']])`	Applies scaling on selected columns
3	Predict and Convert to Series	`pd.Series(model.predict(df[['feature']]), index=df.index)`	Converts NumPy output to Series with index
4	Use with NumPy Array	`model.fit(X_array, y_array)`	Standard NumPy array input
5	ColumnTransformer with Names	`ColumnTransformer([...], remainder='passthrough')`	Processes selected columns with transformers

Syntax Explanation

1. Fit Model with DataFrame

What is it?
Trains a model using a Pandas DataFrame for the features and a Series for the target.

Syntax:

model.fit(df[['feature']], df['target'])

Explanation:

Uses DataFrame columns directly; in scikit-learn 1.0 and later the fitted estimator records the column names in feature_names_in_.
Helpful in feature selection or pipeline-based transformations.

2. Transform DataFrame Columns

What is it?
Scales or modifies specific columns in a DataFrame.

Syntax:

scaler.fit_transform(df[['feature']])

Explanation:

Fits a transformer on selected DataFrame columns.
Output is a NumPy array but can be converted back to DataFrame.
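
One way to rebuild a DataFrame from the transformed array is sketched below. The data and the index labels are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a non-default index to show that labels survive.
df = pd.DataFrame({"feature": [10.0, 20.0, 30.0]}, index=["a", "b", "c"])

scaler = StandardScaler()
scaled = scaler.fit_transform(df[["feature"]])  # NumPy array of shape (3, 1)

# Re-attach the column name and the original index.
scaled_df = pd.DataFrame(scaled, columns=["feature"], index=df.index)
print(scaled_df)
```

Passing index=df.index matters: without it the new DataFrame gets a fresh 0..n-1 index and no longer lines up with the source rows.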

3. Predict and Convert to Series

What is it?
Runs model prediction and wraps result in a Pandas Series with original index.

Syntax:

pd.Series(model.predict(df[['feature']]), index=df.index)

Explanation:

Ensures output aligns with original data indices.
Useful for joining predictions back to the original dataset.
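
The join described above might look like the following sketch. The data, the index values, and the choice of DecisionTreeClassifier are assumptions for the example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data with a custom index (for example, customer IDs).
df = pd.DataFrame(
    {"feature": [0.0, 1.0, 2.0, 3.0], "target": [0, 0, 1, 1]},
    index=[101, 102, 103, 104],
)
X = df[["feature"]]
model = DecisionTreeClassifier(random_state=0).fit(X, df["target"])

# Wrap the NumPy predictions in a Series that carries X's index.
preds = pd.Series(model.predict(X), index=X.index, name="prediction")

# Join on the index to attach predictions to the original rows.
result = df.join(preds)
print(result)
```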

4. Use with NumPy Array

What is it?
Trains or predicts using NumPy arrays instead of DataFrames.

Syntax:

model.fit(X_array, y_array)

Explanation:

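The native input format in Scikit-learn: X has shape (n_samples, n_features) and y has shape (n_samples,).
Skips the DataFrame-to-array conversion, which can help with large datasets, but no feature names are recorded.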
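
A small sketch with plain arrays, assuming made-up values and LogisticRegression as the model. It also shows the reshape that a single one-dimensional feature needs before fitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 4 samples, 2 features.
X_array = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.0]])
y_array = np.array([0, 0, 1, 1])

model = LogisticRegression()
model.fit(X_array, y_array)
pred = model.predict(X_array)  # one-dimensional array, one label per sample

# A single feature stored as a 1-D array must be reshaped to a column first.
single_feature = np.array([1.0, 2.0, 3.0])
column = single_feature.reshape(-1, 1)
print(column.shape)
```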

5. ColumnTransformer with Names

What is it?
Applies transformations to specified columns in a DataFrame using names.

Syntax:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
ColumnTransformer([
    ('scale', StandardScaler(), ['col1', 'col2']),
    ('encode', OneHotEncoder(), ['category'])
], remainder='passthrough')

Explanation:

Allows selective column-wise transformations.
Keeps unprocessed columns using remainder='passthrough'.
Very effective for mixed data types (numeric + categorical).

Real-Life Project: Customer Churn Prediction

Project Overview

Use a Pandas DataFrame with a Scikit-learn pipeline to train a model that predicts customer churn.

Code Example

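import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load data
df = pd.read_csv("churn_data.csv")
X = df[['age', 'monthly_fee', 'contract_type']]
y = df['churn']

# Train/test split (split first so the scaler never sees the test rows)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale numeric features and one-hot encode the categorical one
# (assumes contract_type holds string categories)
preprocess = ColumnTransformer([
    ('scale', StandardScaler(), ['age', 'monthly_fee']),
    ('encode', OneHotEncoder(handle_unknown='ignore'), ['contract_type']),
])

# Train model: preprocessing and classifier combined in one pipeline
model = Pipeline([
    ('preprocess', preprocess),
    ('classifier', RandomForestClassifier(random_state=42)),
])
model.fit(X_train, y_train)

# Predict, keeping the test rows' original index
y_pred = pd.Series(model.predict(X_test), index=X_test.index)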

Expected Output

Predicted churn labels as a Pandas Series whose index matches the test rows of the original DataFrame.

Common Mistakes to Avoid

❌ Passing object-dtype (string) columns straight to an estimator (ensure all columns are numeric or properly encoded first)
❌ Merging predictions with the original data when shapes or indexes do not match
❌ Leaving NumPy output unlabeled when later steps need the original index or column names
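
The index mismatch is easy to miss because Pandas aligns on labels silently. The sketch below uses made-up values, with a plain array standing in for the output of model.predict.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-default index.
df = pd.DataFrame({"value": [1, 2, 3]}, index=[10, 20, 30])
raw_pred = np.array([0, 1, 0])  # stand-in for model.predict(...) output

# Mistake: the Series gets a default 0..2 index, which matches no row of df,
# so alignment fills the new column with NaN.
misaligned = df.assign(pred=pd.Series(raw_pred))

# Fix: give the Series the same index as the data it was predicted from.
aligned = df.assign(pred=pd.Series(raw_pred, index=df.index))

print(misaligned)
print(aligned)
```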

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

