Logistic Regression for Classification in Scikit-learn

Logistic regression is a fundamental classification algorithm that models the probability of class membership using a logistic (sigmoid) function. Despite its name, logistic regression is used for binary and multi-class classification tasks. Scikit-learn offers a robust implementation through the LogisticRegression class.

Key Characteristics of Logistic Regression

  • Classification, Not Regression: Used for binary or multi-class classification.
  • Outputs Probabilities: Estimates the likelihood of each class.
  • Sigmoid Function: Converts linear combination of inputs to probability.
  • Interpretable Coefficients: Feature weights show the direction and strength of each feature's effect on the log-odds.
  • Supports Regularization: Includes L1 and L2 penalties for generalization.

Basic Rules for Logistic Regression

  • Target variable should be categorical (e.g., 0/1, or class labels).
  • Scale features for better convergence.
  • For multi-class targets, solvers such as 'lbfgs' and 'saga' fit a multinomial model by default (older Scikit-learn versions expose this via multi_class='multinomial').
  • Use solver='liblinear', 'saga', or 'lbfgs' depending on dataset size and penalty (see the sketch after this list).
  • Evaluate using metrics like accuracy, precision, recall, and ROC-AUC.
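
Putting these rules together, here is a minimal sketch of a scaled logistic regression pipeline. The synthetic data from make_classification is only illustrative; substitute your own X and y.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data; replace with your own feature matrix and labels
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features, then fit an L2-regularized logistic model with the lbfgs solver
clf = Pipeline([
    ('scale', StandardScaler()),
    ('logreg', LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000))
])
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))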

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Import Model | from sklearn.linear_model import LogisticRegression | Loads logistic regression class
2 | Create Model | LogisticRegression() | Initializes logistic classifier
3 | Train Model | model.fit(X_train, y_train) | Fits model to training data
4 | Predict Labels | model.predict(X_test) | Predicts class labels
5 | Predict Probabilities | model.predict_proba(X_test) | Gives class probabilities
6 | Evaluate Accuracy | accuracy_score(y_test, y_pred) | Measures classification performance

Syntax Explanation

1. Import and Initialize Model

  • What is it? Loads the logistic regression model for classification.
  • Syntax:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
  • Explanation:
    • Supports binary and multi-class classification.
    • You can set regularization type using penalty and solver method.

2. Fit the Model

  • What is it? Trains the model on labeled data.
  • Syntax:
model.fit(X_train, y_train)
  • Explanation:
    • Learns the coefficients of the logistic model.
    • Uses the sigmoid/logit function internally.

3. Predict Class Labels

  • What is it? Predicts the most likely class for new data.
  • Syntax:
y_pred = model.predict(X_test)
  • Explanation:
    • Returns the predicted class label (0 or 1 for binary problems; one of several labels for multi-class).
    • Useful for final decisions.

4. Predict Probabilities

  • What is it? Outputs the probability of each class.
  • Syntax:
probs = model.predict_proba(X_test)
  • Explanation:
    • Each row contains probabilities for each class.
    • Used in ROC curves and for threshold tuning (see the sketch below).
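
Building on the last point, a brief sketch of threshold tuning. It assumes the model and X_test objects from the steps above and a binary target; the 0.3 threshold is only an example.

# Probability of the positive class (second column for a binary problem)
probs = model.predict_proba(X_test)[:, 1]

# Apply a custom decision threshold instead of the default 0.5
threshold = 0.3  # illustrative value; tune it on a validation set
y_pred_custom = (probs >= threshold).astype(int)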

5. Evaluate Accuracy

  • What is it? Measures how often the model predicts correctly.
  • Syntax:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
  • Explanation:
    • Compares predicted and actual labels.
    • Good for balanced datasets; for imbalanced ones, prefer F1 or ROC-AUC (sketched below).
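
For imbalanced data, a quick sketch of the alternative metrics mentioned above (assumes the fitted model, y_test, and y_pred from this section):

from sklearn.metrics import f1_score, roc_auc_score

# F1 balances precision and recall; ROC-AUC is computed from predicted probabilities
print("F1:", f1_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))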

Real-Life Project: Spam Email Classification

Project Name

Spam Detector with Logistic Regression

Project Overview

This project classifies email messages as spam or not spam based on word frequencies and text features. Logistic regression offers a fast, interpretable, and effective solution.

Project Goal

  • Transform email text into numeric features
  • Train logistic model on labeled dataset
  • Evaluate prediction quality on unseen messages

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load data
data = pd.read_csv('emails.csv')
X = data['message']
y = data['label']  # 0 = not spam, 1 = spam

# Text to numeric
vectorizer = CountVectorizer()
X_vec = vectorizer.fit_transform(X)

# Split
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42)

# Train
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Expected Output

  • Accuracy score
  • Precision, recall, and F1-score report
  • Working binary spam classifier

Common Mistakes to Avoid

  • ❌ Not scaling numeric features if present
  • ❌ Using wrong solver for large datasets
  • ❌ Ignoring precision/recall on imbalanced data
  • ❌ Overfitting by including too many irrelevant features

Further Reading Recommendation

πŸ“˜ Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
πŸ”— Available on Amazon

Also explore:

Polynomial Regression in Scikit-learn

Polynomial regression allows linear models to fit nonlinear relationships by adding polynomial terms to the feature set. This technique enhances model flexibility while retaining the interpretability of linear regression. Scikit-learn offers an easy-to-use interface via PolynomialFeatures combined with LinearRegression.

Key Characteristics of Polynomial Regression

  • Extends Linear Regression: Captures curved trends by adding polynomial powers.
  • Works with Pipelines: Seamlessly integrate with Pipeline for preprocessing.
  • Degree Parameter: Controls model complexity and fit.
  • Requires Feature Scaling: Higher-degree terms may cause numeric instability.
  • Used for Curve Fitting: Ideal for modeling nonlinear patterns.

Basic Rules for Polynomial Regression

  • Scale your features if using high-degree polynomials.
  • Avoid too high a degree to prevent overfitting.
  • Combine with regularization (e.g., Ridge) for robust models, as sketched after this list.
  • Use train/test split or cross-validation to validate performance.
  • Always visualize predictions vs. actual values.
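
A minimal sketch combining several of these rules: scaling, a degree-3 polynomial expansion, Ridge regularization, and cross-validation. The synthetic data is illustrative only.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative nonlinear data; replace with your own X and y
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.5, size=200)

# Scale, expand to polynomial terms, then fit a regularized linear model
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=3)),
    ('ridge', Ridge(alpha=1.0))
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print("Mean cross-validated R²:", scores.mean())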

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Polynomial Generator | PolynomialFeatures(degree=2) | Adds squared and interaction terms
2 | Regression Model | LinearRegression() | Fits linear model on transformed features
3 | Pipeline Integration | Pipeline([...]) | Chains polynomial and regression together
4 | Feature Scaling | StandardScaler() | Normalizes features for stability
5 | Plotting Predictions | plt.plot(X, model.predict(X)) | Visualizes the polynomial fit

Syntax Explanation

1. PolynomialFeatures

  • What is it? Transforms input features into polynomial combinations (e.g., x, xΒ², xΒ³).
  • Syntax:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
  • Explanation:
    • Adds polynomial terms to the dataset.
    • Degree controls the highest power.
    • Interaction terms between features are included.

2. LinearRegression

  • What is it? Performs ordinary least squares on the expanded polynomial feature set.
  • Syntax:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_poly, y)
  • Explanation:
    • Treats transformed input as standard linear regression.
    • Requires separate prediction using transformed data.

3. Pipeline Integration

  • What is it? Combines transformation and modeling into a single object.
  • Syntax:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
  ('poly', PolynomialFeatures(degree=3)),
  ('model', LinearRegression())
])
pipe.fit(X, y)
  • Explanation:
    • Cleaner code for workflow.
    • Easy to evaluate and reuse.

4. Scaling (Optional)

  • What is it? Standardizes features to avoid dominance by larger magnitude terms.
  • Syntax:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
  • Explanation:
    • Important for high-degree models.
    • Reduces numerical instability.

5. Plotting

  • What is it? Visual representation of model’s curve fitting.
  • Syntax:
import matplotlib.pyplot as plt
plt.scatter(X, y)
plt.plot(X, pipe.predict(X))
plt.show()
  • Explanation:
    • Helps assess under- or overfitting visually (a sorting tip is sketched below).
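
One practical tip: if X is not sorted, plt.plot draws a jagged zig-zag rather than a smooth curve. A small sketch, assuming X is a single-column NumPy array and pipe is the fitted pipeline from step 3:

import numpy as np
import matplotlib.pyplot as plt

# Sort by the feature value so the fitted curve is drawn as one smooth line
order = np.argsort(X[:, 0])
plt.scatter(X, y, alpha=0.5)
plt.plot(X[order], pipe.predict(X[order]), color='red')
plt.show()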

Real-Life Project: Housing Price Curve Fitting

Project Name

Polynomial Regression on House Size vs. Price

Project Overview

This project predicts house prices using a nonlinear relationship between square footage and price. It shows how a polynomial regression model can fit curves better than a standard linear model.

Project Goal

  • Build a pipeline with polynomial features
  • Fit and evaluate nonlinear model
  • Visualize predicted vs. actual prices

Code for This Project

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Load dataset
data = pd.read_csv('house_data.csv')
X = data[['SquareFeet']].values
y = data['Price'].values

# Pipeline
model = Pipeline([
  ('poly', PolynomialFeatures(degree=2)),
  ('linreg', LinearRegression())
])

model.fit(X, y)

# Plot
plt.scatter(X, y)
plt.plot(X, model.predict(X), color='red')
plt.title('Polynomial Regression Fit')
plt.xlabel('Square Feet')
plt.ylabel('Price')
plt.show()

Expected Output

  • Curved regression line fitting the scatter plot
  • Better fit than standard linear model

Common Mistakes to Avoid

  • ❌ Using too high degree β†’ overfitting
  • ❌ Not scaling features β†’ poor performance on higher degrees
  • ❌ Forgetting to use fit_transform() β†’ pipeline breaks
  • ❌ Comparing results without visualization

Further Reading Recommendation

πŸ“˜ Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
πŸ”— Available on Amazon

Also explore:

Ridge and Lasso Regression using Scikit-learn

Regularized regression techniques like Ridge and Lasso are powerful tools for handling multicollinearity and preventing overfitting in linear models. They add penalty terms to the loss function, shrinking coefficients and improving generalization. Scikit-learn provides both via Ridge and Lasso classes.

Key Characteristics of Ridge and Lasso Regression

  • Regularization: Penalizes large coefficients to reduce overfitting.
  • Ridge (L2): Shrinks coefficients but keeps all variables.
  • Lasso (L1): Shrinks some coefficients to zero, enabling feature selection.
  • Works Like Linear Regression: Similar API with added regularization strength.
  • Useful with Multicollinearity: Helps when predictors are correlated.

Basic Rules for Ridge and Lasso Regression

  • Normalize features before applying (use StandardScaler).
  • Use alpha to control the strength of the penalty.
  • Ridge is better for multicollinearity, Lasso for sparse feature selection.
  • Tune alpha using cross-validation (e.g., RidgeCV, LassoCV).
  • Evaluate performance using RMSE and RΒ² metrics.

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Ridge Model | Ridge(alpha=1.0) | Adds L2 regularization
2 | Lasso Model | Lasso(alpha=0.1) | Adds L1 regularization
3 | Scaling | StandardScaler().fit_transform(X) | Standardizes features
4 | Cross-Validation | RidgeCV(alphas=[0.1, 1.0, 10.0]) | Finds optimal alpha
5 | Coefficients View | model.coef_ | Displays model coefficients

Syntax Explanation

1. Ridge Regression

  • What is it? A linear regression model with L2 regularization that penalizes large coefficients to prevent overfitting.
  • Syntax:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
  • Explanation:
    • Adds a penalty equal to the square of the magnitude of coefficients.
    • Helps in cases of multicollinearity.
    • Does not eliminate any features.

2. Lasso Regression

  • What is it? A linear regression model with L1 regularization that can shrink some coefficients to zero for feature selection.
  • Syntax:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
  • Explanation:
    • Adds a penalty equal to the absolute value of coefficients.
    • Encourages sparse models (zero coefficients for less important features).
    • Good for datasets with many features.

3. Feature Scaling

  • What is it? Standardizing features so they contribute equally to the model.
  • Syntax:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
  • Explanation:
    • Required before applying Ridge or Lasso.
    • Ensures penalty is applied fairly across features.

4. Cross-Validation for Hyperparameter Tuning

  • What is it? A method to find the best alpha (regularization strength) using multiple train-test splits.
  • Syntax:
from sklearn.linear_model import RidgeCV
model_cv = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)
model_cv.fit(X_train, y_train)
  • Explanation:
    • alphas is a list of candidate values.
    • Automatically selects the best-performing value (a LassoCV sketch follows).
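
LassoCV works the same way for the L1 penalty; a brief sketch, assuming scaled training data as in the earlier steps:

from sklearn.linear_model import LassoCV

# Searches the candidate alphas with 5-fold cross-validation
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1.0], cv=5)
lasso_cv.fit(X_train, y_train)
print("Best alpha:", lasso_cv.alpha_)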

5. Evaluating the Model

  • What is it? Assessing model performance using prediction metrics.
  • Syntax:
from sklearn.metrics import mean_squared_error, r2_score
pred = model.predict(X_test)
rmse = mean_squared_error(y_test, pred, squared=False)
r2 = r2_score(y_test, pred)
  • Explanation:
    • RMSE shows the average prediction error (newer Scikit-learn versions also provide this directly as root_mean_squared_error).
    • RΒ² reveals how well the features explain target variance.

Real-Life Project: Predicting Car Prices

Project Name

Car Price Prediction with Ridge and Lasso

Project Overview

This project uses Ridge and Lasso regression to predict used car prices based on engine size, age, mileage, and other numeric features. It compares the effect of regularization on model performance.

Project Goal

  • Compare Ridge and Lasso for price prediction
  • Visualize which features get eliminated by Lasso
  • Evaluate models using RMSE and RΒ²

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load data
data = pd.read_csv('used_cars.csv')
X = data.drop('Price', axis=1)
y = data['Price']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
pred_ridge = ridge.predict(X_test_scaled)

# Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
pred_lasso = lasso.predict(X_test_scaled)

# Evaluate
print("Ridge RMSE:", mean_squared_error(y_test, pred_ridge, squared=False))
print("Lasso RMSE:", mean_squared_error(y_test, pred_lasso, squared=False))

Expected Output

  • RMSE and RΒ² values for both models
  • Lasso may drop features → zero coefficients (checked in the sketch below)
  • Ridge keeps all features but reduces overfitting
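
To see which features Lasso actually zeroed out, a short follow-up sketch using the objects defined in the project code above:

import numpy as np

# Pair each feature name with its Lasso coefficient
coef_table = dict(zip(X.columns, lasso.coef_))
dropped = [name for name, c in coef_table.items() if np.isclose(c, 0)]
print("Coefficients:", coef_table)
print("Features dropped by Lasso:", dropped)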

Common Mistakes to Avoid

  • ❌ Using unscaled data β†’ skews regularization
  • ❌ Too high alpha β†’ underfitting
  • ❌ Ignoring feature selection in Lasso
  • ❌ Comparing Lasso to OLS without scaling

Further Reading Recommendation

πŸ“˜ Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
πŸ”— Available on Amazon

Also explore:

Linear Regression with Scikit-learn

Linear regression is one of the simplest and most interpretable algorithms in machine learning. It models the relationship between one or more input variables and a continuous output variable by fitting a straight line (in simple regression) or hyperplane (in multiple regression). Scikit-learn offers a straightforward implementation of linear regression through the LinearRegression class.

Key Characteristics of Linear Regression

  • Continuous Target Variable: Predicts real-valued outputs.
  • Assumes Linearity: Relationship between features and target is linear.
  • Interpretability: Coefficients explain feature impact.
  • No Need for Scaling: Works without feature scaling (unlike regularized versions).
  • Fast and Efficient: Suitable for large datasets with linear patterns.

Basic Rules for Using Linear Regression

  • Ensure features are numerically encoded.
  • Check for linear relationship between inputs and output.
  • Remove multicollinearity among features if possible.
  • Split dataset into training and testing sets.
  • Evaluate model with RMSE or RΒ² score.

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Import Model | from sklearn.linear_model import LinearRegression | Loads regression model class
2 | Create Model | model = LinearRegression() | Initializes model
3 | Train Model | model.fit(X_train, y_train) | Trains model on training data
4 | Make Predictions | y_pred = model.predict(X_test) | Predicts target values
5 | Evaluate RMSE | mean_squared_error(y_test, y_pred, squared=False) | Root Mean Squared Error
6 | Evaluate R² Score | r2_score(y_test, y_pred) | Measures goodness of fit

Syntax Explanation

1. Import and Initialize Model

  • What is it? Loads and prepares the regression model.
  • Syntax:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
  • Explanation:
    • Prepares a fresh instance of linear regression.
    • Default fits intercept and does not normalize features.

2. Train the Model

  • What is it? Fits the linear regression model to training data.
  • Syntax:
model.fit(X_train, y_train)
  • Explanation:
    • Learns the weights (coefficients) of input features.
    • Fits a line or hyperplane that minimizes squared error.

3. Make Predictions

  • What is it? Predicts target values using the trained model.
  • Syntax:
y_pred = model.predict(X_test)
  • Explanation:
    • Applies learned coefficients to unseen data.
    • Produces continuous-valued outputs.

4. Evaluate with RMSE

  • What is it? Measures average prediction error in the same unit as the target.
  • Syntax:
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, y_pred, squared=False)
  • Explanation:
    • Common metric for regression tasks.
    • Lower RMSE = better model.

5. Evaluate with RΒ² Score

  • What is it? Represents how much variance in the target is explained by features.
  • Syntax:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
  • Explanation:
    • A value of 1 means a perfect fit; values near 0 (or even negative) indicate a poor fit.
    • Indicates the strength of the linear relationship.

Real-Life Project: Predicting House Prices

Project Name

House Price Prediction Using Linear Regression

Project Overview

This project demonstrates the use of linear regression to predict house prices based on features such as square footage, number of bedrooms, and location index.

Project Goal

  • Build and evaluate a linear regression model
  • Predict continuous house prices
  • Interpret coefficients to understand feature impact

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
data = pd.read_csv('house_prices.csv')
X = data[['SqFt', 'Bedrooms', 'LocationIndex']]
y = data['Price']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
rmse = mean_squared_error(y_test, y_pred, squared=False)
r2 = r2_score(y_test, y_pred)
print("RMSE:", rmse)
print("RΒ² Score:", r2)

Expected Output

  • RMSE value indicating prediction error
  • RΒ² score showing how well features explain price
  • Trained model ready for deployment or analysis (coefficient inspection is sketched below)
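
To interpret the model, a small sketch that prints each coefficient next to its feature name (uses the model and columns from the project code above):

# Inspect learned coefficients alongside their feature names
for name, coef in zip(['SqFt', 'Bedrooms', 'LocationIndex'], model.coef_):
    print(f"{name}: {coef:.2f}")
print("Intercept:", model.intercept_)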

Common Mistakes to Avoid

  • ❌ Using categorical variables without encoding
  • ❌ Failing to check for multicollinearity
  • ❌ Ignoring assumptions of linearity and homoscedasticity
  • ❌ Using RMSE aloneβ€”consider visualizing residuals

Further Reading Recommendation

πŸ“˜ Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
πŸ”— Available on Amazon

Also explore:

Introduction to Supervised Learning in Scikit-learn

Supervised learning is one of the most common machine learning paradigms, where the algorithm learns a mapping between input features and known output labels. Scikit-learn provides a rich set of tools for building and evaluating supervised learning models for both classification and regression tasks.

Key Characteristics of Supervised Learning

  • Labeled Training Data: Requires input-output pairs for training.
  • Two Main Types: Classification (categorical target) and Regression (continuous target).
  • Model Evaluation: Uses metrics like accuracy, precision, RMSE, etc.
  • Generalization: Learns patterns to make predictions on unseen data.
  • Scikit-learn Friendly: Offers estimators, pipelines, and evaluation tools.

Basic Rules for Supervised Learning in Scikit-learn

  • Split data into train and test sets using train_test_split().
  • Select appropriate model type (LogisticRegression, RandomForestClassifier, etc.).
  • Fit the model using model.fit(X_train, y_train).
  • Predict using model.predict(X_test).
  • Evaluate with relevant metrics using sklearn.metrics.

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Train-Test Split | train_test_split(X, y) | Splits data for training/testing
2 | Model Training | model.fit(X_train, y_train) | Trains the supervised model
3 | Make Predictions | model.predict(X_test) | Predicts outputs from test input
4 | Accuracy Score | accuracy_score(y_test, y_pred) | Measures performance (classification)
5 | RMSE Score | mean_squared_error(y_test, y_pred, squared=False) | Measures regression error

Syntax Explanation

1. Train-Test Split

  • What is it? Separates your dataset into training and testing sets.
  • Syntax:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  • Explanation:
    • Prevents overfitting by evaluating on unseen data.
    • test_size=0.2 means 20% used for testing.

2. Model Training

  • What is it? Fits the model on training data.
  • Syntax:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
  • Explanation:
    • The model learns patterns from X_train to predict y_train.
    • Applies optimization based on selected algorithm.

3. Make Predictions

  • What is it? Uses the trained model to make predictions.
  • Syntax:
y_pred = model.predict(X_test)
  • Explanation:
    • Applies learned rules to test inputs.
    • Used to evaluate accuracy, error, or other performance metrics.

4. Accuracy Score (for Classification)

  • What is it? Measures the percentage of correct predictions.
  • Syntax:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
  • Explanation:
    • Works for classification problems.
    • 1.0 = perfect score, 0.0 = no correct predictions.

5. RMSE Score (for Regression)

  • What is it? Measures the average error in predictions.
  • Syntax:
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, y_pred, squared=False)
  • Explanation:
    • Evaluates how far predictions are from true values.
    • Lower RMSE indicates better performance.

Real-Life Project: Predicting Student Exam Pass/Fail

Project Name

Binary Classification for Exam Outcome Prediction

Project Overview

This project aims to predict whether a student will pass or fail an exam based on study hours and past performance using supervised learning.

Project Goal

  • Train a logistic regression classifier
  • Predict outcomes on new student records
  • Evaluate model accuracy

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
data = pd.read_csv('student_scores.csv')
X = data[['StudyHours', 'PastScore']]
y = data['Pass']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))

Expected Output

  • Trained model using student study features
  • Predictions for pass/fail labels
  • Accuracy score between 0 and 1

Common Mistakes to Avoid

  • ❌ Not splitting data properly
  • ❌ Using regression for categorical outputs
  • ❌ Failing to evaluate model on test data
  • ❌ Skipping feature scaling (if needed by model type)

Further Reading Recommendation

πŸ“˜ Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
πŸ”— Available on Amazon

Also explore:

Splitting Data into Train and Test Sets using Scikit-learn

Train-test splitting is a fundamental concept in machine learning. It ensures that models are trained on one portion of the data and evaluated on another, promoting generalization and preventing overfitting. Scikit-learn provides a simple and reliable utility for splitting datasets.

Key Characteristics of Train-Test Splitting

  • Ensures Generalization: Evaluates model performance on unseen data.
  • Randomization Support: Randomizes the dataset before splitting.
  • Custom Split Ratios: Allows flexible train/test proportions.
  • Stratification: Maintains class balance during classification splits.
  • Reproducibility: Controlled with random seed (random_state).

Basic Rules for Train-Test Splits

  • Always split before preprocessing or model training.
  • Use train_test_split() from sklearn.model_selection.
  • Stratify on target variable when dealing with classification problems.
  • Avoid data leakage by ensuring test data is untouched during training.
  • Use a fixed random_state to ensure reproducibility.

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Import Function | from sklearn.model_selection import train_test_split | Imports splitter from Scikit-learn
2 | Basic Split | X_train, X_test, y_train, y_test = train_test_split(X, y) | Splits data into train/test
3 | Custom Ratio | train_test_split(X, y, test_size=0.3) | 70/30 split example
4 | Set Seed | train_test_split(X, y, random_state=42) | Ensures reproducible results
5 | Stratified Split | train_test_split(X, y, stratify=y) | Maintains label proportions

Syntax Explanation

1. Basic Train-Test Split

  • What is it? Separates features and target into training and testing groups.
  • Syntax:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
  • Explanation:
    • Default split is 75% training and 25% testing.
    • Random shuffling is performed before splitting.
    • Keeps feature (X) and target (y) aligned.

2. Custom Split Ratio

  • What is it? Allows control over the percentage allocated to the test set.
  • Syntax:
train_test_split(X, y, test_size=0.2)
  • Explanation:
    • 80% of data for training and 20% for testing.
    • Accepts float (0.2 = 20%) or int (e.g., 100 samples).
    • Ensure test size is not too small for model evaluation.

3. Stratified Splitting

  • What is it? Maintains label balance between train and test sets.
  • Syntax:
train_test_split(X, y, stratify=y)
  • Explanation:
    • Especially useful for imbalanced datasets.
    • Ensures proportion of each class is consistent.
    • Crucial for fair performance evaluation.

4. Reproducibility with Random Seed

  • What is it? Ensures same random split every run.
  • Syntax:
train_test_split(X, y, random_state=42)
  • Explanation:
    • Random shuffling can change results.
    • Setting random_state makes results reproducible.
    • Use same seed across experiments for consistency.

Real-Life Project: Splitting Heart Disease Dataset

Project Name

Train-Test Split for Predicting Heart Disease

Project Overview

The dataset includes various health metrics and a binary label indicating presence of heart disease. Proper train-test splitting will allow unbiased model evaluation.

Project Goal

  • Split data into train/test sets
  • Maintain label balance using stratification
  • Prepare data for preprocessing and modeling

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split

# Load dataset
data = pd.read_csv('heart_disease.csv')
X = data.drop('target', axis=1)
y = data['target']

# Split with stratification and seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Expected Output

  • 80% of data in X_train, y_train
  • 20% in X_test, y_test
  • Class distribution preserved (verified in the sketch below)
  • Reproducible split for modeling workflows
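
A quick sanity check on the stratified split, using the y_train and y_test objects created above:

# Confirm that stratification preserved the label proportions
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))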

Common Mistakes to Avoid

  • ❌ Fitting preprocessing before splitting β†’ causes data leakage
  • ❌ Ignoring class imbalance β†’ skews evaluation metrics
  • ❌ Forgetting random_state β†’ inconsistent results
  • ❌ Confusing X and y order β†’ misaligned splits

Further Reading Recommendation

πŸ“˜ Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
πŸ”— Available on Amazon

Also explore:

Mastering Encoding Categorical Variables in Scikit-learn

Many real-world datasets include categorical featuresβ€”such as city names, gender, or product typesβ€”that machine learning models cannot process directly. Encoding transforms these textual or symbolic values into numerical format suitable for model training. Scikit-learn offers multiple encoding strategies tailored for different use cases.

Key Characteristics of Categorical Encoding

  • Numeric Representations: Convert categories into integers or binary vectors.
  • Model-Friendly: Makes categorical data usable for statistical or ML models.
  • Multiple Strategies: Supports one-hot, ordinal, and frequency encoding.
  • Robustness: Can handle unknown categories and missing values.
  • Pipeline Compatible: Easily integrated into automated ML workflows.

Basic Rules for Encoding Categorical Variables

  • Use OneHotEncoder for nominal (unordered) categories.
  • Use OrdinalEncoder for ordinal (ordered) categories.
  • Handle unknown categories with handle_unknown='ignore'.
  • Always encode after imputing missing values.
  • Use ColumnTransformer to encode only relevant columns.

Syntax Table

SL NO | Function | Syntax Example | Description
1 | One-Hot Encoding | OneHotEncoder(sparse_output=False) | Creates binary columns for each category
2 | Ordinal Encoding | OrdinalEncoder() | Assigns ordered integers to categories
3 | Column Encoding | ColumnTransformer([...]) | Encodes selected columns in a pipeline
4 | Handling Unknowns | OneHotEncoder(handle_unknown='ignore') | Avoids errors on unseen categories
5 | Fit-Transform Logic | fit_transform() on train, transform() on test | Prevents data leakage
6 | Full Pipeline | Pipeline([...]) | Automates encoding with other preprocessing

Syntax Explanation

1. OneHotEncoder

  • What is it? Converts categories into multiple binary columns.
  • Syntax:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X)
  • Explanation:
    • Suitable for nominal data like Color, City, Brand.
    • Each category gets its own binary column.
    • sparse_output=False returns a dense NumPy array instead of a sparse matrix (the parameter was named sparse in older Scikit-learn releases).

2. OrdinalEncoder

  • What is it? Assigns integer labels to each category based on order.
  • Syntax:
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()
X_ord = ord_enc.fit_transform(X)
  • Explanation:
    • Best for ordinal features like Education Level, Size, Rank.
    • Preserves order but not the distance between values; the order can be set explicitly, as sketched below.
    • Use caution when the categories are not truly ordered: models treat the integer codes as ordered magnitudes, which can mislead linear models in particular.
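
When the category order matters, it can be passed explicitly through the categories parameter. A sketch with hypothetical education levels:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Explicit order: lower levels map to smaller integers
levels = pd.DataFrame({'Education_Level': ['Bachelor', 'High School', 'PhD']})
ord_enc = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
print(ord_enc.fit_transform(levels))  # [[1.], [0.], [3.]]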

3. ColumnTransformer

  • What is it? Applies encoders to specific column sets.
  • Syntax:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([
  ('city_ohe', OneHotEncoder(), ['City']),
  ('rank_ord', OrdinalEncoder(), ['Rank'])
])
X_transformed = ct.fit_transform(X)
  • Explanation:
    • Separates encoding logic by column type.
    • Maintains clean, modular pipeline structure.
    • Useful when combining with numeric transformations.

4. Handling Unknown Categories

  • What is it? Avoids errors during prediction on unseen categories.
  • Syntax:
OneHotEncoder(handle_unknown='ignore')
  • Explanation:
    • Ensures robustness across training and inference.
    • Skips encoding for new labels instead of raising errors.
    • Particularly useful in production pipelines.

5. Full Encoding Pipeline

  • What is it? Combines encoders with transformers and models.
  • Syntax:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
  ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
X_encoded = pipe.fit_transform(X)
  • Explanation:
    • Streamlines entire preprocessing.
    • Avoids duplication of logic across train/test.
    • Works well with model training or cross-validation.

Real-Life Project: Encoding Loan Application Data

Project Name

Encoding Categorical Features for Loan Approval Prediction

Project Overview

Loan application datasets include several categorical features such as marital status, employment type, and loan purpose. This project demonstrates proper encoding using OneHotEncoder and OrdinalEncoder to prepare such data for modeling.

Project Goal

  • Encode categorical fields using appropriate methods
  • Maintain clean format for downstream ML models
  • Prevent model errors due to unseen categories

Code for This Project

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Sample dataset
df = pd.read_csv('loan_applications.csv')

# Define categorical features
nominal = ['Gender', 'Married', 'Loan_Purpose']
ordinal = ['Education_Level']

# Define transformers
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # 'sparse' in older Scikit-learn releases
ord_enc = OrdinalEncoder()  # named to avoid shadowing Python's built-in ord()

# Column transformer
preprocessor = ColumnTransformer([
  ('nominal_enc', ohe, nominal),
  ('ordinal_enc', ord_enc, ordinal)
])

X_encoded = preprocessor.fit_transform(df)

Expected Output

  • All categorical variables numerically encoded.
  • Robust handling of unknown labels.
  • Ready for use in classification or regression models.
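
To keep the encoded matrix readable, recent Scikit-learn versions let you recover the generated column names from the ColumnTransformer. A short sketch using the objects from the project code above:

# Recover readable column names for the encoded matrix
feature_names = preprocessor.get_feature_names_out()
encoded_df = pd.DataFrame(X_encoded, columns=feature_names)
print(encoded_df.head())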

Common Mistakes to Avoid

  • ❌ Using label encoding for nominal features
  • ❌ Not setting handle_unknown='ignore'
  • ❌ Forgetting to exclude target variable from encoding
  • ❌ Mixing fit/transform logic between train/test data

Further Reading Recommendation

πŸ“˜ Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
πŸ”— Available on Amazon

Also explore:

Mastering Handling Missing Data with Scikit-learn

Missing data is a common issue in real-world datasets. Whether due to user omission, system error, or data corruption, missing values can affect model performance and bias predictions. Scikit-learn provides robust strategies to detect and handle missing values efficiently.

Key Characteristics of Missing Data Handling

  • Flexible Imputation Strategies: Mean, median, mode, or custom value.
  • Column and Row-wise Detection: Identify missing values per column or row.
  • Pipeline Integration: Handle missing values as part of preprocessing.
  • Support for Numeric and Categorical Data: Choose appropriate imputation per data type.
  • Constant Value Fill: Useful for flags, categories, or default fill-in.

Basic Rules for Handling Missing Data

  • Always check for missing values before preprocessing or modeling.
  • Use visualization (heatmaps, missingno) for exploration.
  • Fit imputation on training data, then apply it to test/validation sets (see the sketch after this list).
  • Choose imputation strategies based on column types and distributions.
  • Combine imputation with scaling and encoding in a pipeline.
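
The fit-on-train, transform-on-test rule above looks like this in practice; a minimal sketch assuming a numeric feature matrix X:

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Learn the column means on the training split only, then reuse them on the test split
imputer = SimpleImputer(strategy='mean')
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)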

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Detect Missing Values | df.isnull().sum() | Returns missing count per column
2 | Drop Rows with NaNs | df.dropna() | Removes rows that contain NaN
3 | Simple Imputer (mean) | SimpleImputer(strategy='mean') | Imputes numeric features with mean
4 | Simple Imputer (most_frequent) | SimpleImputer(strategy='most_frequent') | Categorical mode fill
5 | Constant Imputer | SimpleImputer(strategy='constant', fill_value=0) | Fill with custom value
6 | Pipeline Integration | Pipeline([...]) | Automates imputation within workflows

Syntax Explanation

1. Detect Missing Values

  • What is it? Identifies how many values are missing per column.
  • Syntax:
df.isnull().sum()
  • Explanation:
    • Use df.isnull() to get a Boolean mask of missing cells.
    • .sum() counts True (i.e., NaN) values column-wise.
    • First step in any missing data strategy.

2. Drop Rows with NaNs

  • What is it? Removes any rows that contain missing values.
  • Syntax:
df_cleaned = df.dropna()
  • Explanation:
    • Useful when missing data is minimal.
    • May reduce dataset size significantly.
    • Use with caution to avoid data loss.

3. SimpleImputer (Mean)

  • What is it? Replaces missing values with the mean of the column.
  • Syntax:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X)
  • Explanation:
    • Suitable for continuous numeric data.
    • fit() learns column means from training data.
    • transform() applies imputation to missing values.

4. SimpleImputer (Most Frequent)

  • What is it? Fills missing values with the most frequent value in a column.
  • Syntax:
SimpleImputer(strategy='most_frequent')
  • Explanation:
    • Ideal for categorical or ordinal features.
    • Fills with the column's mode, so imputed values are always existing categories.
    • Safer than constant fill in unknown domains.

5. SimpleImputer (Constant Value)

  • What is it? Fills missing values with a fixed specified value.
  • Syntax:
SimpleImputer(strategy='constant', fill_value='Unknown')
  • Explanation:
    • Use for categorical placeholders or zero-fill.
    • Makes missingness explicit for some models.
    • Fill value must be type-compatible with column.

6. Pipeline Integration

  • What is it? Wraps imputation logic into a reproducible pipeline.
  • Syntax:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
  ('imputer', SimpleImputer(strategy='median'))
])
X_clean = pipe.fit_transform(X)
  • Explanation:
    • Ensures same imputation is applied consistently.
    • Can be combined with scalers, encoders, and estimators.
    • Ideal for production and evaluation workflows.

Real-Life Project: Imputing Customer Demographics

Project Name

Cleaning and Imputing Missing Values in Customer Dataset

Project Overview

We will clean a dataset containing customer profiles, where Age, Income, and City columns contain missing values. Using different strategies per column type, we prepare the dataset for segmentation and modeling.

Project Goal

  • Impute numerical values (Age, Income) using mean/median.
  • Impute categorical fields (City) using most frequent or a placeholder.
  • Wrap transformation into a single pipeline.

Code for This Project

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Sample dataset
customer_data = pd.read_csv('customer_data.csv')

num_cols = ['Age', 'Income']
cat_cols = ['City']

num_pipeline = Pipeline([
  ('imputer', SimpleImputer(strategy='mean'))
])

cat_pipeline = Pipeline([
  ('imputer', SimpleImputer(strategy='most_frequent'))
])

preprocessor = ColumnTransformer([
  ('num', num_pipeline, num_cols),
  ('cat', cat_pipeline, cat_cols)
])

X_cleaned = preprocessor.fit_transform(customer_data)

Expected Output

  • Clean matrix with no missing values.
  • Numeric fields filled with statistical values.
  • Categorical fields filled with top occurring value.
  • Ready for modeling or export.

Common Mistakes to Avoid

  • ❌ Applying imputation after scaling/encoding
  • ❌ Using test data during fit (data leakage)
  • ❌ Dropping rows with high info value
  • ❌ Using mean imputation for categorical columns

Further Reading Recommendation

πŸ“˜ Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
πŸ”— Available on Amazon

Also explore:

Mastering Feature Engineering Techniques in Scikit-learn

Feature engineering is the process of transforming raw data into meaningful inputs that enhance model performance. It is one of the most critical steps in the machine learning workflow. Scikit-learn offers a variety of built-in tools and transformers that simplify and automate common feature engineering tasks.

Key Characteristics of Feature Engineering in Scikit-learn

  • Automation Ready: Easily integrate with pipelines for consistent transformation.
  • Custom Transformation: Create your own logic using FunctionTransformer or TransformerMixin.
  • Rich Toolkit: Includes polynomial features, interaction terms, binning, and more.
  • Compatibility: Works seamlessly with numeric, categorical, and datetime features.
  • Composable: Supports chaining and parallel processing through Pipeline and ColumnTransformer.

Basic Rules for Feature Engineering

  • Always explore your data visually before feature engineering.
  • Use domain knowledge to guide feature creation.
  • Avoid leakage by using only training data for fitting transformers.
  • Scale or encode features after feature creation.
  • Evaluate feature importance and drop irrelevant ones.

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Polynomial Features | PolynomialFeatures(degree=2) | Adds polynomial and interaction terms
2 | Binning (Discretization) | KBinsDiscretizer(n_bins=3) | Converts continuous data into discrete bins
3 | Custom Transformation | FunctionTransformer(func) | Applies user-defined logic to data
4 | Feature Selection | SelectKBest(score_func, k=5) | Selects top-k features based on scoring
5 | Feature Union | FeatureUnion([...]) | Combines multiple transformers into one
6 | Column Transformer Integration | ColumnTransformer([...]) | Applies different engineering steps by column

Syntax Explanation

1. PolynomialFeatures

  • What is it? Generates new features by taking polynomial combinations of existing features.
  • Syntax:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
  • Explanation:
    • Adds interaction terms and powers of features.
    • Useful for linear models capturing non-linear patterns.
    • Rapidly increases dimensionalityβ€”use with care.

2. KBinsDiscretizer

  • What is it? Discretizes continuous data into specified number of bins.
  • Syntax:
from sklearn.preprocessing import KBinsDiscretizer
binning = KBinsDiscretizer(n_bins=4, encode='ordinal')
X_binned = binning.fit_transform(X)
  • Explanation:
    • Converts numeric values into intervals.
    • Helps with models sensitive to non-linearity or ordinal relationships.
    • strategy options include ‘uniform’, ‘quantile’, and ‘kmeans’.

3. FunctionTransformer

  • What is it? Applies any custom function to transform your data.
  • Syntax:
from sklearn.preprocessing import FunctionTransformer
import numpy as np
log_transform = FunctionTransformer(np.log1p)
X_transformed = log_transform.fit_transform(X)
  • Explanation:
    • Simple wrapper around any callable function.
    • Keeps compatibility with pipelines.
    • Great for log transforms, scaling, or unit conversions.

4. SelectKBest

  • What is it? Selects top k features based on statistical test.
  • Syntax:
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
  • Explanation:
    • Filters out weakly related features.
    • Common score functions: f_classif, chi2, mutual_info_classif.
    • Improves model performance and reduces overfitting.
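
To see which columns survived the selection, a brief follow-up sketch using the fitted selector from above:

# Boolean mask of the columns kept by SelectKBest
mask = selector.get_support()
print("Selected feature indices:", [i for i, keep in enumerate(mask) if keep])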

5. FeatureUnion

  • What is it? Combines outputs from multiple transformers.
  • Syntax:
from sklearn.pipeline import FeatureUnion
combined = FeatureUnion([
  ('poly', PolynomialFeatures(degree=2)),
  ('binned', KBinsDiscretizer(n_bins=3))
])
X_combined = combined.fit_transform(X)
  • Explanation:
    • Useful for parallel feature engineering.
    • All outputs are concatenated.
    • Each transformer runs independently.

6. ColumnTransformer

  • What is it? Applies specific transformers to selected columns.
  • Syntax:
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer([
  ('bin_age', KBinsDiscretizer(n_bins=3), ['Age']),
  ('poly_income', PolynomialFeatures(degree=2), ['Income'])
])
X_processed = transformer.fit_transform(data)
  • Explanation:
    • Great for structured datasets.
    • Allows fine-grained control over feature creation.
    • Keeps numeric and categorical workflows separate.

Real-Life Project: Feature Engineering on Titanic Dataset

Project Name

Creating Predictive Features for Titanic Survival Prediction

Project Overview

This project uses the Titanic dataset to demonstrate feature engineering techniques. We create new features like ‘FamilySize’ and ‘IsAlone’, bin age and fare, and apply polynomial features to improve model accuracy.

Project Goal

  • Derive new features from existing columns
  • Bin continuous variables into discrete categories
  • Apply transformations to prepare dataset for modeling

Code for This Project

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, FunctionTransformer, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load dataset
data = pd.read_csv('titanic.csv')
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1

data['IsAlone'] = (data['FamilySize'] == 1).astype(int)

num_cols = ['Age', 'Fare']
poly = PolynomialFeatures(degree=2, include_bias=False)
binner = KBinsDiscretizer(n_bins=3, encode='ordinal')

column_transform = ColumnTransformer([
  ('poly', poly, num_cols),
  ('bin', binner, num_cols)
])

X = column_transform.fit_transform(data)

Expected Output

  • New features: FamilySize, IsAlone
  • Polynomial features for Age, Fare
  • Binned versions of continuous columns

Common Mistakes to Avoid

  • ❌ Applying polynomial features on already scaled data
  • ❌ Using fit_transform() on test set
  • ❌ Creating features that leak future info (like survival-related info)
  • ❌ Not validating new features’ contribution to model accuracy

Further Reading Recommendation

πŸ“˜ Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
πŸ”— Available on Amazon

Also explore:

Mastering Feature Scaling and Normalization with Scikit-learn

Feature scaling and normalization are essential preprocessing steps in machine learning, especially when models rely on distance-based calculations. Without proper scaling, features with larger ranges can dominate others, skewing model performance. Scikit-learn offers powerful tools for performing both normalization and standardization effectively.

Key Characteristics of Feature Scaling and Normalization

  • Standardization: Transforms features to have zero mean and unit variance.
  • Normalization: Scales feature values to a fixed range, typically [0, 1].
  • Model Compatibility: Improves performance for SVM, KNN, logistic regression, etc.
  • Column-wise Transformation: Applies scaling only to numeric columns.
  • Integration with Pipelines: Easily incorporated into machine learning pipelines.

Basic Rules for Scaling and Normalization

  • Always scale numeric features only.
  • Use StandardScaler for standardization and MinMaxScaler for normalization.
  • Fit scalers on training data, then apply (transform) to test data.
  • Combine with imputation if there are missing values.
  • Avoid scaling categorical variables unless encoded numerically.

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Standard Scaling | StandardScaler() | Zero mean, unit variance
2 | Min-Max Scaling | MinMaxScaler() | Rescales to a 0–1 range
3 | Robust Scaling | RobustScaler() | Scales using median and IQR
4 | MaxAbs Scaling | MaxAbsScaler() | Scales by maximum absolute value
5 | Column Transformer | ColumnTransformer([...]) | Applies scaling only to selected columns
6 | Integration with Pipeline | Pipeline([...]) | Combines scaler with other preprocessing steps

Syntax Explanation

1. StandardScaler

  • What is it? Scales features by removing the mean and scaling to unit variance.
  • Syntax:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
  • Explanation:
    • Useful for SVM, logistic regression, PCA.
    • Transforms each column to have mean = 0, std = 1.
    • Sensitive to outliersβ€”may not be ideal for skewed data.

2. MinMaxScaler

  • What is it? Scales features to a defined range (default [0, 1]).
  • Syntax:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
  • Explanation:
    • Maintains the shape of the original distribution.
    • Suitable for neural networks, KNN.
    • Affected by outliersβ€”can squash non-outlier values.

3. RobustScaler

  • What is it? Scales features using median and interquartile range.
  • Syntax:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
  • Explanation:
    • Robust to outliers.
    • Ideal when data has extreme values.
    • Does not normalize distribution; just rescales.

4. MaxAbsScaler

  • What is it? Scales each feature by its maximum absolute value.
  • Syntax:
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
  • Explanation:
    • Retains sparsity in sparse data.
    • Scaled values always lie in [-1, 1].
    • Fast and simpleβ€”ideal for sparse input matrices.

5. ColumnTransformer

  • What is it? Applies scalers only to numeric columns in a structured dataset.
  • Syntax:
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer([
  ('scale', StandardScaler(), numeric_cols)
])
X_transformed = transformer.fit_transform(X)
  • Explanation:
    • Keeps other columns unchanged.
    • Supports integration with pipelines and encoders.
    • Cleaner code for mixed-type datasets.

6. Pipeline Integration

  • What is it? Wraps the scaler with other steps for reuse and automation.
  • Syntax:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
  ('scale', StandardScaler())
])
X_ready = pipe.fit_transform(X)
  • Explanation:
    • Chain together multiple preprocessing steps.
    • Ensures consistent transformation in training/testing.
    • Simplifies deployment and reproducibility.

Real-Life Project: Scaling Features in a Regression Dataset

Project Name

Preprocessing and Scaling the Diabetes Dataset

Project Overview

This project demonstrates how to apply different scaling methods on a real-world regression dataset. We will scale the numeric features of Scikit-learn's built-in diabetes dataset (the Boston housing dataset has been removed from Scikit-learn) and prepare it for linear regression modeling.

Project Goal

  • Load the diabetes dataset
  • Apply standard scaling and min-max normalization
  • Compare effect on regression model performance

Code for This Project

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
X, y = load_diabetes(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standard Scaling
scaler_std = StandardScaler()
X_train_std = scaler_std.fit_transform(X_train)
X_test_std = scaler_std.transform(X_test)

# Train and evaluate model
model_std = LinearRegression().fit(X_train_std, y_train)
y_pred_std = model_std.predict(X_test_std)
print("MSE with StandardScaler:", mean_squared_error(y_test, y_pred_std))

# Min-Max Scaling
scaler_minmax = MinMaxScaler()
X_train_minmax = scaler_minmax.fit_transform(X_train)
X_test_minmax = scaler_minmax.transform(X_test)

# Train and evaluate model
model_minmax = LinearRegression().fit(X_train_minmax, y_train)
y_pred_minmax = model_minmax.predict(X_test_minmax)
print("MSE with MinMaxScaler:", mean_squared_error(y_test, y_pred_minmax))

Expected Output

  • Two mean squared error (MSE) values showing the impact of each scaling method.
  • Scaled datasets ready for regression or other modeling.

Common Mistakes to Avoid

  • ❌ Scaling test data before fitting the scaler on training data
  • ❌ Forgetting to apply the same transformation to test data
  • ❌ Applying scaling to categorical features without encoding
  • ❌ Mixing scaling methods within a single dataset

Further Reading Recommendation

To gain deeper understanding and best practices for scaling and feature preparation:

πŸ“˜ Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
πŸ”— Available on Amazon

Also explore: