Scikit-learn integrates seamlessly with Pandas and NumPy, the two most commonly used Python libraries for data manipulation and numerical computing. This integration allows smooth preprocessing, modeling, and analysis workflows using familiar data structures.
Key Characteristics
- Accepts NumPy arrays and Pandas DataFrames as input
- Maintains compatibility with Pandas for column-based operations
- Returns predictions and transformed data as NumPy arrays by default (these can be converted back to a DataFrame)
- Works naturally with .iloc, .loc, indexing, and slicing
Basic Rules
- Always check data types and shapes before feeding to Scikit-learn
- Use .values or .to_numpy() if explicit NumPy format is needed
- Convert NumPy predictions back to Pandas Series/DataFrame with proper indexing
- Avoid passing mixed-type DataFrames unless using ColumnTransformer
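The first two rules can be sketched as follows. This is a minimal illustration on a hypothetical toy DataFrame; the column names and values are assumptions made up for the example, not data from a real project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "monthly_fee": [29.9, 49.9, 19.9, 39.9],
    "churn": [0, 1, 0, 1],
})

X = df[["age", "monthly_fee"]]  # double brackets keep a 2-D DataFrame
y = df["churn"]                 # single brackets give a 1-D Series

# Inspect dtypes and shapes before fitting anything.
print(X.dtypes)
print(X.shape, y.shape)

# Explicit NumPy conversion for code that requires plain arrays.
X_array = X.to_numpy()
y_array = y.to_numpy()
```

Scikit-learn estimators expect a 2-D feature matrix and, for most tasks, a 1-D target, which is why the shape check is worth doing before calling fit.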
Syntax Table
No. | Technique | Syntax Example | Description |
---|---|---|---|
1 | Fit Model with DataFrame | model.fit(df[['feature']], df['target']) | Fits model using Pandas DataFrame inputs |
2 | Transform DataFrame Columns | scaler.fit_transform(df[['feature']]) | Applies scaling on selected columns |
3 | Predict and Convert to Series | pd.Series(model.predict(df), index=df.index) | Converts NumPy output to Series with index |
4 | Use with NumPy Array | model.fit(X_array, y_array) | Standard NumPy array input |
5 | ColumnTransformer with Names | ColumnTransformer([...], remainder='passthrough') | Processes selected columns with transformers |
Syntax Explanation
1. Fit Model with DataFrame
What is it?
Trains a model using a Pandas DataFrame for the features and a Series for the target.
Syntax:
model.fit(df[['feature']], df['target'])
Explanation:
- Uses DataFrame column(s) directly; in Scikit-learn 1.0 and later the fitted estimator records the column names in feature_names_in_.
- Helpful in feature selection or pipeline-based transformations.
2. Transform DataFrame Columns
What is it?
Scales or modifies specific columns in a DataFrame.
Syntax:
scaler.fit_transform(df[['feature']])
Explanation:
- Fits a transformer on selected DataFrame columns.
- Output is a NumPy array but can be converted back to DataFrame.
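Converting the array back to a DataFrame might look like the sketch below. The data and index labels are hypothetical. In Scikit-learn 1.2 and later, calling scaler.set_output(transform="pandas") is an alternative that returns a DataFrame directly; it is not used here so the example also runs on older versions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a non-default index.
df = pd.DataFrame({"feature": [10.0, 20.0, 30.0]}, index=["a", "b", "c"])

scaler = StandardScaler()
scaled = scaler.fit_transform(df[["feature"]])  # NumPy array, shape (3, 1)

# Wrap the array again, reusing the original column name and index.
scaled_df = pd.DataFrame(scaled, columns=["feature"], index=df.index)
print(scaled_df)
```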
3. Predict and Convert to Series
What is it?
Runs model prediction and wraps result in a Pandas Series with original index.
Syntax:
pd.Series(model.predict(df), index=df.index)
Explanation:
- Ensures output aligns with original data indices.
- Useful for joining predictions back to the original dataset.
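The join step can be sketched as follows, assuming hypothetical data with non-default index labels and LogisticRegression as a stand-in classifier:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data; the index mimics customer IDs.
df = pd.DataFrame(
    {"feature": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2],
     "target": [0, 0, 0, 1, 1, 0]},
    index=[101, 102, 103, 104, 105, 106],
)

X = df[["feature"]]
model = LogisticRegression().fit(X, df["target"])

# predict() returns a plain array; reuse X's index so rows stay aligned.
pred = pd.Series(model.predict(X), index=X.index, name="prediction")

# Join on the index to attach predictions to the original rows.
result = df.join(pred)
print(result)
```

Only the feature columns are passed to predict here; passing the full DataFrame including the target column would not match what the model was fitted on.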
4. Use with NumPy Array
What is it?
Trains or predicts using NumPy arrays instead of DataFrames.
Syntax:
model.fit(X_array, y_array)
Explanation:
- Default input format in Scikit-learn.
- Offers speed and simplicity, especially for large datasets.
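A small sketch with plain arrays, again assuming LinearRegression as the stand-in model and made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical 1-D measurements following y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_array = np.array([3.0, 5.0, 7.0, 9.0])

# Features must be 2-D: (n_samples, n_features).
X_array = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X_array, y_array)

# New samples need the same 2-D layout.
pred = model.predict(np.array([[5.0]]))
print(pred)
```

The reshape(-1, 1) call is the array counterpart of the double-bracket selection used with DataFrames.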
5. ColumnTransformer with Names
What is it?
Applies transformations to specified columns in a DataFrame using names.
Syntax:
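from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Scale the numeric columns, one-hot encode the categorical one,
# and pass every other column through unchanged.
ColumnTransformer([
    ('scale', StandardScaler(), ['col1', 'col2']),
    ('encode', OneHotEncoder(), ['category'])
], remainder='passthrough')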
Explanation:
- Allows selective column-wise transformations.
- Keeps unprocessed columns using remainder='passthrough'.
- Very effective for mixed data types (numeric + categorical).
Real-Life Project: Customer Churn Prediction
Project Overview
Use a Pandas DataFrame with a Scikit-learn pipeline to train a model that predicts customer churn.
Code Example
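import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Load data
df = pd.read_csv("churn_data.csv")
X = df[['age', 'monthly_fee', 'contract_type']]
y = df['churn']
# Train/test split (done first so preprocessing is fit on training rows only)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale numeric features; one-hot encode contract_type (assumed to hold category labels)
preprocessor = ColumnTransformer([
    ('scale', StandardScaler(), ['age', 'monthly_fee']),
    ('encode', OneHotEncoder(handle_unknown='ignore'), ['contract_type'])
])
# Train model: the pipeline applies preprocessing, then fits the classifier
model = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', RandomForestClassifier())
])
model.fit(X_train, y_train)
# Predict, keeping the test rows' original index
y_pred = pd.Series(model.predict(X_test), index=X_test.index, name='churn_pred')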
Expected Output
- Predicted labels in a Pandas Series (y_pred) whose index matches the test rows of the original DataFrame; the numeric features are standardized before training.
Common Mistakes to Avoid
- ❌ Passing a DataFrame with object-dtype columns to an estimator (encode them or make every column numeric first)
- ❌ Mismatched shape or index when merging predictions with the original data
- ❌ Leaving predictions or transformed output as a bare NumPy array when it needs to be joined back to a Series/DataFrame
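The index mismatch is easy to reproduce. In this sketch the data is hypothetical and preds is a hand-written stand-in for the output of model.predict, so no model is needed to show the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-default index.
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=[10, 11, 12, 13])

# A shuffled subset, as train_test_split would produce.
test_rows = df.loc[[12, 10]]
preds = np.array([1, 0])  # stand-in for model.predict(test_rows)

# Without an index the Series gets labels 0 and 1, which match no row.
wrong = pd.Series(preds, name="pred")
misaligned = test_rows.join(wrong)

# Reusing the test index keeps each prediction on its own row.
right = pd.Series(preds, index=test_rows.index, name="pred")
aligned = test_rows.join(right)

print(misaligned)
print(aligned)
```

In the misaligned result every prediction is missing (NaN), because join matches on index labels rather than on position.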