Real-World Dataset: Boston Housing Regression using Scikit-learn

The Boston Housing dataset is a classic dataset for regression tasks. It contains housing data for suburbs of Boston and is often used to predict median house prices. Scikit-learn shipped it as load_boston, but the loader was deprecated in version 1.0 and removed in version 1.2 because of ethical concerns about the data (one feature encodes the racial composition of each area). The California Housing dataset is the usual alternative.

Key Characteristics

  • Regression problem
  • Target: Median value of owner-occupied homes (in $1000s)
  • Features: Crime rate, NOX levels, number of rooms, etc.
  • Small (506 samples, 13 features) and quick to model

Basic Rules

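  • Standardize features before fitting regularized or gradient-based linear models (ordinary least squares predictions do not change with scaling, but scaled coefficients are easier to compare)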
  • Visualize data for feature-target relationships
  • Use cross-validation for reliable performance estimates
  • Use fetch_california_housing in Scikit-learn 1.2 and later, where load_boston no longer exists
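
The cross-validation rule can be sketched as follows. This is a minimal illustration, assuming synthetic data from make_regression as a stand-in for the housing features; the sample count, noise level, and fold count are arbitrary choices, not values from the text.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the Boston data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Keeping the scaler inside the pipeline means it is re-fit on each training fold
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# Scikit-learn scorers follow "higher is better", so MSE is returned negated
neg_mse = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_squared_error")
cv_mse = -neg_mse

print("MSE per fold:", cv_mse.round(2))
print("Mean MSE:", cv_mse.mean().round(2))
```

Averaging over five folds gives a steadier estimate than a single train/test split, at the cost of fitting the model five times.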

Syntax Table

No. | Step             | Syntax Example                            | Description
1   | Load dataset     | load_boston(return_X_y=True)              | Loads feature matrix and target (deprecated)
2   | Train/test split | train_test_split(X, y, test_size=0.3)     | Prepares data for training and testing
3   | Standard scaling | StandardScaler().fit_transform(X_train)   | Scales features
4   | Train regressor  | LinearRegression().fit(X_train, y_train)  | Trains a regression model
5   | Evaluate model   | mean_squared_error(y_test, y_pred)        | Measures prediction error

Syntax Explanation

1. Load Dataset

What is it?
Loads the Boston housing dataset. The loader was deprecated in Scikit-learn 1.0 and removed in 1.2, so this call only works on older versions.

Syntax:

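# Works only on scikit-learn < 1.2; load_boston was removed in 1.2
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)  # X: feature matrix, y: target vector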

Explanation:

  • X holds features (e.g., crime rate, number of rooms)
  • y holds target median house prices
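
On Scikit-learn 1.2 and later, one possible replacement is the California Housing loader. The sketch below is an assumption about how a reader might swap datasets, not part of the original walkthrough; load_housing_data is a hypothetical helper, and the synthetic fallback exists only so the snippet still runs without network access.

```python
from sklearn.datasets import fetch_california_housing, make_regression


def load_housing_data():
    """Hypothetical helper: California Housing if available, else synthetic data."""
    try:
        # Downloads the data on first use and caches it locally
        return fetch_california_housing(return_X_y=True)
    except Exception:
        # Assumption: fall back to synthetic data when the download is unavailable
        return make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)


X, y = load_housing_data()
print(X.shape, y.shape)
```

The California target is expressed in units of $100,000 rather than $1000s, so error values are not directly comparable with those from the Boston data.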

2. Train/Test Split

What is it?
Divides data into training and test sets.

Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Explanation:

  • Ensures the model is evaluated on unseen data
  • test_size=0.3 holds out 30% of the samples; random_state=42 makes the split reproducible
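
To make the split sizes concrete, here is a small sketch on toy arrays (assumed for illustration, not the housing data). It shows the 70/30 partition and that a fixed random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features; row i is [2i, 2i + 1] and y[i] = i
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (70, 2) (30, 2)

# Repeating the call with the same random_state yields the same partition
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
same_split = np.array_equal(X_test, X_test2)
print("Reproducible:", same_split)
```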

3. Standard Scaling

What is it?
Scales features to have zero mean and unit variance.

Syntax:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Explanation:

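  • Puts all features on a comparable scale, so coefficient magnitudes can be compared
  • Fits the scaler on the training set only and reuses those statistics on the test set, which avoids leaking test information
  • Helps gradient-based and regularized models converge; LinearRegression solves least squares directly, so its predictions do not depend on scaling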
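
The following sketch checks what the scaler does, using two synthetic features on very different scales (an assumption for illustration). The training columns end up with mean 0 and standard deviation 1, while the test columns are transformed with the training statistics and so land only near those values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
X_train = rng.normal(loc=[5.0, 300.0], scale=[1.0, 50.0], size=(100, 2))
X_test = rng.normal(loc=[5.0, 300.0], scale=[1.0, 50.0], size=(40, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

print("train mean:", X_train_scaled.mean(axis=0).round(6))
print("train std: ", X_train_scaled.std(axis=0).round(6))
print("test mean: ", X_test_scaled.mean(axis=0).round(3))

# The same transform written out by hand: (x - train_mean) / train_std
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print("matches manual formula:", np.allclose(manual, X_test_scaled))
```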

4. Train Regressor

What is it?
Fits a linear regression model to the data.

Syntax:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

Explanation:

  • Learns the relationship between features and target
  • Exposes the fitted coefficients in model.coef_ and the intercept in model.intercept_ for interpretation
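
As a sketch of how the coefficients can be read, the example below fits a model to synthetic data generated from known weights. The feature names and true weights are hypothetical, chosen only so the recovered values can be compared against something.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Assumed ground truth: y = 3*x0 - 2*x1 + 0*x2 + 5 + small noise
true_coef = np.array([3.0, -2.0, 0.0])
y = X @ true_coef + 5.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

feature_names = ["rooms", "crime", "unrelated"]  # hypothetical labels
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:+.2f}")
print(f"intercept: {model.intercept_:.2f}")
```

A positive coefficient means the prediction rises with that feature while the others are held fixed; a coefficient near zero suggests the feature contributes little in this linear fit.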

5. Evaluate Model

What is it?
Quantifies how well the model predicts target values.

Syntax:

from sklearn.metrics import mean_squared_error
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

Explanation:

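  • Measures the average squared difference between actual and predicted values
  • Is expressed in squared target units, here ($1000s)²; taking the square root (RMSE) returns to the target's own units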
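
The metric can be verified by hand on a few numbers. The values below are made up for illustration; the sketch compares the manual formula with mean_squared_error and also derives RMSE and R².

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Squared errors are 0.25, 0.0, 2.25, 1.0, so the mean is 3.5 / 4
mse_manual = np.mean((y_true - y_pred) ** 2)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # back in the units of the target
r2 = r2_score(y_true, y_pred)  # fraction of variance explained

print("MSE:", mse)  # 0.875
print("RMSE:", round(float(rmse), 3))
print("R^2:", round(float(r2), 3))
```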

Real-Life Project: House Price Prediction

Project Name

Boston Housing Price Estimator

Project Overview

Predict median home values in Boston suburbs using regression techniques.

Project Goal

Train and evaluate a regression model to understand housing price influences.

Code for This Project

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load data (load_boston exists only in scikit-learn < 1.2)
X, y = load_boston(return_X_y=True)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Predict & Evaluate
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

Expected Output

  • Mean Squared Error (lower is better)
  • Insight into features influencing home prices

Common Mistakes to Avoid

  • ❌ Not scaling features before training scale-sensitive models (for example Ridge, Lasso, or SGD-based regressors)
  • ❌ Ignoring feature correlation and multicollinearity
  • ❌ Using the deprecated dataset without acknowledging its ethical problems
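
One simple way to look for correlated features is a pairwise correlation matrix. The sketch below uses synthetic columns in which one feature is nearly a copy of another; the column names and the 0.9 threshold are arbitrary assumptions, not recommendations from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + rng.normal(scale=0.1, size=300)  # nearly a copy of a
X = np.column_stack([a, b, c])
names = ["a", "b", "c"]  # hypothetical feature names

# Pearson correlation between columns (rowvar=False treats columns as variables)
corr = np.corrcoef(X, rowvar=False)

threshold = 0.9  # arbitrary cut-off for "highly correlated"
flagged = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
for left, right, r in flagged:
    print(f"{left} ~ {right}: r = {r:.2f}")
```

Highly correlated pairs make individual coefficients unstable, so dropping one of the pair or using a regularized model is a common response.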

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
