When building machine learning workflows, combining preprocessing steps and model training in a Pipeline ensures consistency and reproducibility. GridSearchCV can tune hyperparameters across the entire pipeline, optimizing both preprocessing and estimator stages.
Key Characteristics
- Unified Preprocessing and Modeling
- Hyperparameter Tuning Across Steps
- Avoids Data Leakage
- Clean and Modular Workflow
Basic Rules
- Use `Pipeline` from `sklearn.pipeline`.
- Assign names to each pipeline step.
- Use double underscore `__` to specify step parameters.
- Always fit and evaluate on the full pipeline.
Syntax Table
| SL NO | Function/Tool | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Pipeline | `from sklearn.pipeline import Pipeline` | Combines multiple processing steps |
| 2 | Create Pipeline | `pipe = Pipeline([...])` | Constructs processing and model stages |
| 3 | Define Param Grid | `param_grid = {'model__n_neighbors': [3, 5, 7]}` | Grid of parameters including pipeline prefix |
| 4 | Setup GridSearchCV | `grid = GridSearchCV(pipe, param_grid, cv=5)` | Applies grid search across pipeline |
| 5 | Fit and Access Results | `grid.fit(X, y)`, `grid.best_params_` | Fits and retrieves best configuration |
Syntax Explanation
1. Import Pipeline
What is it? Tool to encapsulate all processing and model steps.
Syntax:
from sklearn.pipeline import Pipeline
Explanation:
- Ensures preprocessing steps are applied consistently.
- Prevents leakage of test data into preprocessing.
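A minimal sketch of the leakage point, assuming a synthetic dataset from `make_classification` (not part of the original example). When the pipeline is fit on the training split only, the scaler's statistics come from that split alone, so the test rows never influence preprocessing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration (an assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier()),
])
pipe.fit(X_train, y_train)

# The fitted scaler has only seen the training rows
scaler_mean = pipe.named_steps['scaler'].mean_
uses_train_only = np.allclose(scaler_mean, X_train.mean(axis=0))
matches_full_data = np.allclose(scaler_mean, X.mean(axis=0))
print("Scaler fit on training rows only:", uses_train_only)
print("Scaler mean equals full-data mean:", matches_full_data)

# score() applies the training-set scaling to the test rows, then predicts
test_accuracy = pipe.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Scaling `X` before splitting would instead compute the mean and variance from all rows, including the ones later used for testing.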
2. Create Pipeline
What is it? Build a sequence of named steps: preprocessing, modeling, etc.
Syntax:
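from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier())
])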
Explanation:
- Each tuple contains (name, transformer/estimator).
- Can include scalers, PCA, feature selection, classifiers, etc.
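As a sketch of a longer pipeline, the example below adds a PCA step between the scaler and the classifier. The iris dataset and the step name `'pca'` are assumptions chosen for illustration. It also shows two ways to inspect a pipeline: `named_steps` for lookup by name, and slicing to get the transformer-only part.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Built-in dataset used as a stand-in (an assumption)
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('model', KNeighborsClassifier()),
])
pipe.fit(X, y)

# Step names, in execution order
step_names = list(pipe.named_steps)
print(step_names)

# Slicing off the final estimator leaves the fitted transformers,
# so transform() returns the data as the classifier sees it
reduced = pipe[:-1].transform(X)
print(reduced.shape)
```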
3. Define Param Grid
What is it? Dictionary of parameter options to be tuned.
Syntax:
param_grid = {'model__n_neighbors': [3, 5, 7]}
Explanation:
- Prefix parameters with the step name followed by `__`.
- Enables tuning of parameters in specific pipeline steps.
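One way to check the prefixed names before running a search is to list the keys of `get_params()`. The sketch below assumes a three-step pipeline with a hypothetical `'pca'` step added for illustration. Any key containing `__` is a valid entry for `param_grid`.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),          # extra step, added only for this illustration
    ('model', KNeighborsClassifier()),
])

# Nested parameter names follow the pattern <step name>__<parameter>
tunable = sorted(name for name in pipe.get_params() if '__' in name)
print(tunable)

# A grid may mix preprocessing and model parameters
param_grid = {
    'pca__n_components': [2, 3],
    'model__n_neighbors': [3, 5, 7],
}
all_valid = all(name in pipe.get_params() for name in param_grid)
print("All grid keys are valid:", all_valid)

# The same names work with set_params
pipe.set_params(model__n_neighbors=7)
```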
4. Setup GridSearchCV
What is it? Apply cross-validated grid search to the pipeline.
Syntax:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
Explanation:
- `cv`: Number of folds for cross-validation.
- `scoring`: Metric to optimize.
- Automatically trains and scores every parameter combination on each fold.
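The cost of a search grows with the product of the grid sizes and the number of folds. A small sketch using `ParameterGrid`, with the grid values taken from the project example in this article:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'model__n_neighbors': [3, 5, 7],
    'model__weights': ['uniform', 'distance'],
}
cv = 5

# ParameterGrid enumerates the same combinations GridSearchCV will try
n_candidates = len(ParameterGrid(param_grid))  # 3 * 2 = 6
n_fits = n_candidates * cv                     # 6 * 5 = 30
print(n_candidates, n_fits)
```

With the default `refit=True`, one more fit follows: the best combination is retrained on the full data passed to `fit`.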
5. Fit and Access Results
What is it? Run the search, then read off the best configuration and its cross-validated score.
Syntax:
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)
Explanation:
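- `best_params_` holds the best-performing settings; `best_score_` is their mean cross-validated score.
- With the default `refit=True`, `best_estimator_` is the whole pipeline (transformer and estimator stages) refit on all the data passed to `fit`.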
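A sketch of what is available after fitting, assuming scikit-learn's built-in breast cancer dataset as a stand-in and a held-out test split (neither is part of the original example). It tabulates `cv_results_` and scores the refit pipeline on rows the search never saw.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Built-in dataset and split chosen for illustration (assumptions)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier()),
])
grid = GridSearchCV(pipe, {'model__n_neighbors': [3, 5, 7]}, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# One row per parameter combination, with mean and spread across folds
results = pd.DataFrame(grid.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))

# best_estimator_ is the full pipeline refit on X_train with the best settings
best_pipe = grid.best_estimator_

# grid.score() delegates to best_estimator_ using the chosen scoring metric
test_accuracy = grid.score(X_test, y_test)
print("Held-out accuracy:", test_accuracy)
```

`best_score_` comes from the same folds that were used to pick the parameters, so the held-out score is the less optimistic estimate of the two.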
Real-Life Project: Optimize Preprocessing + KNN with GridSearchCV
Objective
Build a full pipeline including scaling and KNN classification, and tune hyperparameters.
Code Example
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# Load data
data = pd.read_csv('classification_data.csv')
X = data.drop('target', axis=1)
y = data['target']
# Define pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier())
])
# Define parameter grid
param_grid = {
    'model__n_neighbors': [3, 5, 7],
    'model__weights': ['uniform', 'distance']
}
# Grid search
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
# Output
print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)
Expected Output
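- The best parameter combination and its mean cross-validated accuracy, printed to the console.
- With the default `refit=True`, `grid.best_estimator_` is the full pipeline (scaler plus KNN) refit on all of `X` and `y` and usable for prediction.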
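The project reads `classification_data.csv`, which may not be at hand. The variant below is a self-contained sketch that assumes scikit-learn's built-in wine dataset as a stand-in. It also searches over the preprocessing step itself by listing alternatives under the step name `'scaler'`, including `'passthrough'` to skip scaling.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Stand-in dataset (an assumption; the project uses its own CSV)
X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier()),
])

param_grid = {
    # A bare step name swaps out the whole step
    'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough'],
    'model__n_neighbors': [3, 5, 7],
    'model__weights': ['uniform', 'distance'],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)
```

Because the scaler choice is part of the grid, each candidate scaler is fit inside every fold, just like the classifier.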
Common Mistakes
- ❌ Omitting pipeline step names in `param_grid`.
- ❌ Scaling data outside the pipeline.
- ❌ Not using `Pipeline` when tuning both preprocessing and model.
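The first mistake is easy to reproduce. In this small sketch, setting `n_neighbors` directly on the pipeline raises a `ValueError`, because the pipeline itself has no such parameter. The prefixed name works.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier()),
])

# Without the step prefix, the pipeline does not recognize the parameter
try:
    pipe.set_params(n_neighbors=5)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With the prefix, the value reaches the KNN step
pipe.set_params(model__n_neighbors=5)
print(pipe.named_steps['model'].n_neighbors)
```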
