Real-World Dataset: Boston Housing Regression using Scikit-learn

The Boston Housing dataset is a classic dataset for regression tasks. It contains housing data for suburbs of Boston and is often used to predict median house prices. Scikit-learn used to ship it as `load_boston`, but the function was deprecated in version 1.0 and removed in version 1.2 because of ethical concerns about the data. The California Housing dataset (`fetch_california_housing`) is the usual replacement.

Key Characteristics

Regression problem
Target: Median value of owner-occupied homes (in $1000s)
Features: Crime rate, NOX levels, number of rooms, etc.
Small (506 samples, 13 features) and easy to model

Basic Rules

Standardize features before applying regularized or gradient-based linear models (plain least-squares predictions do not depend on feature scale)
Visualize data for feature-target relationships
Use cross-validation for reliable performance estimates
Use the California Housing dataset in scikit-learn 1.2 and later, where `load_boston` is no longer available
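The cross-validation rule can be sketched as follows. This is a minimal illustration, not the article's own code: it assumes synthetic data from `make_regression` as a stand-in for a housing table, and wraps the scaler and the model in a pipeline so that scaling is refit inside every fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a housing table: 506 rows, 13 features (assumption).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# The pipeline refits the scaler on the training part of each fold,
# so the validation rows never influence the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# 5-fold cross-validation; scikit-learn reports negated MSE so that
# higher is always better, hence the sign flip below.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_squared_error")
mse_per_fold = -scores

print("MSE per fold:", mse_per_fold)
print("Mean MSE:", mse_per_fold.mean())
```

Averaging over folds gives a steadier estimate than a single train/test split, which can look better or worse depending on which rows land in the test set.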

Syntax Table

No.	Step	Syntax Example	Description
1	Load dataset	`load_boston(return_X_y=True)`	Loads feature matrix and target (deprecated in 1.0, removed in 1.2)
2	Train/test split	`train_test_split(X, y, test_size=0.3)`	Prepares data for training and testing
3	Standard scaling	`StandardScaler().fit_transform(X_train)`	Scales features
4	Train regressor	`LinearRegression().fit(X_train, y_train)`	Trains a regression model
5	Evaluate model	`mean_squared_error(y_test, y_pred)`	Measures prediction error

Syntax Explanation

1. Load Dataset

What is it?
Loads the Boston housing dataset bundled with scikit-learn versions before 1.2 (deprecated in 1.0, removed in 1.2).

Syntax:

# Requires scikit-learn < 1.2; load_boston was removed in 1.2
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)

Explanation:

X holds features (e.g., crime rate, number of rooms)
y holds target median house prices
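On scikit-learn 1.2 and later, the same loading step can be written against the California Housing dataset. The sketch below is one possible way to do it; `load_housing_data` is a hypothetical helper name introduced here for illustration, not part of scikit-learn.

```python
from sklearn.datasets import fetch_california_housing


def load_housing_data():
    """Return (X, y) for the California Housing dataset.

    The first call downloads the data and caches it locally,
    so it needs network access.
    """
    # return_X_y=True gives NumPy arrays, mirroring the load_boston call.
    X, y = fetch_california_housing(return_X_y=True)
    return X, y
```

Calling `X, y = load_housing_data()` returns 20,640 rows with 8 features each. The target is the median house value in units of $100,000, not $1000s as in the Boston data, so error values are not directly comparable between the two datasets.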

2. Train/Test Split

What is it?
Divides data into training and test sets.

Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Explanation:

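Holds out 30% of the samples so the model is evaluated on data it did not see during training
Setting random_state=42 makes the split reproducible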
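To see what the split produces, the sketch below uses placeholder arrays with the Boston dataset's dimensions (506 rows, 13 columns); the values themselves are arbitrary and only the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the Boston dataset's dimensions (values are arbitrary).
X = np.zeros((506, 13))
y = np.arange(506, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (354, 13) (152, 13)

# The same random_state reproduces the same split.
_, _, _, y_test_again = train_test_split(X, y, test_size=0.3, random_state=42)
print(np.array_equal(y_test, y_test_again))  # True
```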

3. Standard Scaling

What is it?
Scales features to have zero mean and unit variance.

Syntax:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Explanation:

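Puts all features on a comparable scale, so coefficient magnitudes can be compared
The scaler is fit on the training set only and reused on the test set, which keeps test statistics out of training
Matters most for regularized or gradient-based models; plain LinearRegression predictions are unaffected by scaling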
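A small check of what standardization does, assuming three synthetic features on very different scales. It also compares ordinary least-squares predictions with and without scaling, to show that scaling changes the coefficients but not what a plain linear model predicts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic features on very different scales (assumption).
X_train = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0] + [0.0, 5.0, 50.0]
X_test = rng.normal(size=(50, 3)) * [1.0, 10.0, 100.0] + [0.0, 5.0, 50.0]
y_train = X_train @ [2.0, 0.5, 0.01] + rng.normal(size=200)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_s = scaler.transform(X_test)        # reuse them on the test rows

print(X_train_s.mean(axis=0).round(6))  # approximately 0 for every column
print(X_train_s.std(axis=0).round(6))   # approximately 1 for every column

# Ordinary least squares gives the same predictions either way.
pred_raw = LinearRegression().fit(X_train, y_train).predict(X_test)
pred_scaled = LinearRegression().fit(X_train_s, y_train).predict(X_test_s)
print(np.allclose(pred_raw, pred_scaled))
```

The test set is not exactly zero-mean after the transform, because it is scaled with the training set's statistics; that is expected.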

4. Train Regressor

What is it?
Fits a linear regression model to the data.

Syntax:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

Explanation:

Learns the relationship between features and target
Exposes the fitted coefficients (`model.coef_`) and intercept (`model.intercept_`) for interpretation
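To illustrate how the fitted coefficients can be read, the sketch below fits a model to noise-free synthetic data generated from a known linear rule. The rule is an assumption made for the demo, and the model should recover it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Known relationship (assumption for the demo): y = 3*x0 - 2*x1 + 0*x2 + 5
true_coef = np.array([3.0, -2.0, 0.0])
y = X @ true_coef + 5.0

model = LinearRegression().fit(X, y)

# coef_ has one entry per feature; intercept_ is the constant term.
print("coefficients:", model.coef_.round(3))
print("intercept:", round(float(model.intercept_), 3))
```

On real housing data the coefficients are noisier. They are only comparable across features when the features share a scale, which is one reason to standardize before interpreting them.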

5. Evaluate Model

What is it?
Quantifies how well the model predicts target values.

Syntax:

from sklearn.metrics import mean_squared_error
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

Explanation:

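Measures the average squared difference between actual and predicted values
The result is in squared target units; taking its square root (RMSE) gives an error in the target's own units ($1000s here)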
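A worked example of the metric on four hand-picked values, small enough to check by hand. The numbers are illustrative and not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Squared errors are 0.25, 0.25, 0 and 1, so the mean is 1.5 / 4 = 0.375.
mse = mean_squared_error(y_true, y_pred)

# The square root brings the error back to the target's units.
rmse = np.sqrt(mse)

print("MSE:", mse)
print("RMSE:", rmse)
```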

Real-Life Project: House Price Prediction

Project Name

Boston Housing Price Estimator

Project Overview

Predict median home values in Boston suburbs using regression techniques.

Project Goal

Train and evaluate a regression model to understand housing price influences.

Code for This Project

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load data (requires scikit-learn < 1.2; load_boston was removed in 1.2)
X, y = load_boston(return_X_y=True)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Predict & Evaluate
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

Expected Output

A single printed line with the test-set Mean Squared Error (lower is better)
Insight into features influencing home prices, available from the fitted coefficients in model.coef_

Common Mistakes to Avoid

❌ Not scaling features before training regularized or gradient-based models
❌ Ignoring feature correlation and multicollinearity
❌ Using the deprecated Boston dataset without being aware of the ethical concerns behind its removal
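For the multicollinearity point, one simple first check is to look for highly correlated feature pairs. The sketch below assumes synthetic data in which one column is nearly a copy of another; the 0.9 cutoff is an illustrative choice, not a standard value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
# x2 is almost a copy of x0, so the two are strongly correlated (assumption).
x2 = x0 + 0.05 * rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

# Pearson correlation between columns (rowvar=False treats columns as variables).
corr = np.corrcoef(X, rowvar=False)

threshold = 0.9  # illustrative cutoff, not a standard value
pairs = [
    (i, j)
    for i in range(X.shape[1])
    for j in range(i + 1, X.shape[1])
    if abs(corr[i, j]) > threshold
]
print(pairs)  # [(0, 2)]
```

When two features are this strongly correlated, a linear model can split their shared effect between them almost arbitrarily, so the individual coefficients become unreliable even if predictions stay accurate.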

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
