The Boston Housing dataset is a classic dataset for regression tasks. It contains housing data for suburbs of Boston and is often used to predict median house prices. Scikit-learn shipped this dataset as load_boston, but the function was deprecated in version 1.0 and removed in version 1.2 because of ethical concerns about the data. The California Housing dataset (fetch_california_housing) is a common alternative.
Key Characteristics
- Regression problem
- Target: Median value of owner-occupied homes (in $1000s)
- Features: Crime rate, NOX levels, number of rooms, etc.
- Small (506 samples, 13 features) and easy to model
Basic Rules
- Standardize features before applying linear models, especially regularized or gradient-based ones
- Visualize data for feature-target relationships
- Use cross-validation for reliable performance estimates
- Use the California Housing dataset in scikit-learn 1.2 and later, where load_boston has been removed
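The scaling and cross-validation rules combine naturally in a pipeline, so that the scaler is refit on the training portion of every fold. The following is a minimal sketch, assuming a synthetic stand-in generated by make_regression with Boston's shape (506 samples, 13 features) rather than the real data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): same shape as Boston, not the real data
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=42)

# The pipeline refits the scaler inside every fold, which avoids leakage
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# scikit-learn maximizes scores, so MSE is exposed as its negative
scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_squared_error")
mse_per_fold = -scores

print("MSE per fold:", mse_per_fold)
print("Mean MSE:", mse_per_fold.mean())
```

Averaging over five folds gives a steadier estimate than a single train/test split, at the cost of fitting the model five times.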
Syntax Table
SL NO | Step | Syntax Example | Description |
---|---|---|---|
1 | Load dataset | load_boston(return_X_y=True) | Loads feature matrix and target (deprecated) |
2 | Train/test split | train_test_split(X, y, test_size=0.3) | Prepares data for training and testing |
3 | Standard scaling | StandardScaler().fit_transform(X_train) | Scales features |
4 | Train regressor | LinearRegression().fit(X_train, y_train) | Trains a regression model |
5 | Evaluate model | mean_squared_error(y_test, y_pred) | Measures prediction error |
Syntax Explanation
1. Load Dataset
What is it?
Fetches the Boston housing dataset. load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this call works only on older versions.
Syntax:
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)
Explanation:
- X holds features (e.g., crime rate, number of rooms)
- y holds target median house prices
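On scikit-learn 1.2 and later, the same two-array pattern is available through fetch_california_housing. A minimal sketch follows; load_housing_data is a hypothetical helper name introduced here for illustration, and the first call needs network access because the data is downloaded and cached locally:

```python
from sklearn.datasets import fetch_california_housing


def load_housing_data():
    """Return (X, y) for the California Housing dataset.

    Hypothetical helper for this sketch. The first call downloads the
    data and caches it, so it needs network access once.
    """
    return fetch_california_housing(return_X_y=True)
```

Calling X, y = load_housing_data() gives 20,640 samples with 8 features. The target is the median house value in units of $100,000, not $1000s as in the Boston data, so error values are not comparable between the two datasets.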
2. Train/Test Split
What is it?
Divides data into training and test sets.
Syntax:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Explanation:
- Holds out 30% of the samples so the model is evaluated on unseen data
- random_state=42 makes the split reproducible
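To see what the split produces, the sketch below runs it on placeholder arrays with the Boston shape. The zero-filled arrays are an assumption standing in for the real data; only the resulting sizes matter here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays (assumption): 506 samples, 13 features, like Boston
X = np.zeros((506, 13))
y = np.zeros(506)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (354, 13) (152, 13)
```

The test set size is rounded up (0.3 * 506 = 151.8 becomes 152), and the remaining 354 samples go to training.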
3. Standard Scaling
What is it?
Scales features to have zero mean and unit variance.
Syntax:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Explanation:
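- Puts all features on a comparable scale, so coefficient magnitudes can be compared
- Improves convergence for gradient-based and regularized models; plain LinearRegression predictions do not change with scaling
- The scaler is fit on the training set only, so no test-set statistics leak into training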
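The effect of fitting on the training set and reusing those statistics on the test set can be checked directly. This is a small sketch on made-up data; the two features and their scales are assumptions chosen only to make the difference visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up features on very different scales (assumption, not real data)
X_train = rng.normal(loc=[5.0, 300.0], scale=[1.0, 50.0], size=(100, 2))
X_test = rng.normal(loc=[5.0, 300.0], scale=[1.0, 50.0], size=(30, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

print("Train means:", X_train_scaled.mean(axis=0))  # approximately 0
print("Train stds: ", X_train_scaled.std(axis=0))   # approximately 1
print("Test means: ", X_test_scaled.mean(axis=0))   # close to 0, but not exact
```

The test-set means are not exactly zero because the test data is shifted by the training mean, not its own. That is the intended behavior.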
4. Train Regressor
What is it?
Fits a linear regression model to the data.
Syntax:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
Explanation:
- Learns the relationship between features and target
- Exposes the fitted coefficients (model.coef_) and intercept (model.intercept_) for interpretation
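As an illustration of reading the fitted parameters, the sketch below fits a model to noise-free data generated from known coefficients. The ground-truth values are made up for this example, and the fit should recover them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Made-up ground truth (assumption): y = 2*x0 - 1*x1 + 0*x2 + 5, no noise
true_coef = np.array([2.0, -1.0, 0.0])
y = X @ true_coef + 5.0

model = LinearRegression().fit(X, y)

print("Coefficients:", model.coef_)   # one value per feature column
print("Intercept:", model.intercept_)
```

On real data the coefficients are only directly comparable across features when the features share a scale, which is one reason to standardize first.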
5. Evaluate Model
What is it?
Quantifies how well the model predicts target values.
Syntax:
from sklearn.metrics import mean_squared_error
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
Explanation:
- Measures average squared difference between actual and predicted values
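The definition can be verified by computing the metric by hand and comparing it with mean_squared_error. The values below are made up for illustration; taking the square root (RMSE) puts the error back in the units of the target:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values in $1000s (assumption, for illustration only)
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_hat = np.array([25.0, 20.6, 32.7, 35.4])

# Mean of the squared residuals, written out explicitly
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse_sklearn)

print("MSE (manual): ", mse_manual)
print("MSE (sklearn):", mse_sklearn)
print("RMSE:", rmse)
```

Here the residuals are -1, 1, 2 and -2, so the MSE is about 2.5 and the RMSE about 1.58, meaning a typical error of roughly $1,580.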
Real-Life Project: House Price Prediction
Project Name
Boston Housing Price Estimator
Project Overview
Predict median home values in Boston suburbs using regression techniques.
Project Goal
Train and evaluate a regression model to understand housing price influences.
Code for This Project
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load data (load_boston requires scikit-learn < 1.2; it was removed in 1.2)
X, y = load_boston(return_X_y=True)
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
# Predict & Evaluate
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
Expected Output
- Mean Squared Error (lower is better)
- Insight into features influencing home prices, available by inspecting model.coef_
Common Mistakes to Avoid
- ❌ Not scaling features before training
- ❌ Ignoring feature correlation and multicollinearity
- ❌ Using the deprecated dataset without awareness of its ethical problems (its B feature is derived from the proportion of Black residents in each town)
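One simple way to catch the correlation mistake listed above is to scan the feature correlation matrix for strongly related pairs. The sketch below uses made-up features, and the 0.9 cutoff is an arbitrary assumption, not a standard threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = 0.95 * x0 + 0.05 * rng.normal(size=500)  # nearly a copy of x0
x2 = rng.normal(size=500)                     # independent feature
X = np.column_stack([x0, x1, x2])

# rowvar=False treats each column as a variable
corr = np.corrcoef(X, rowvar=False)

threshold = 0.9  # arbitrary cutoff chosen for this sketch
n_features = corr.shape[0]
high_pairs = [
    (i, j)
    for i in range(n_features)
    for j in range(i + 1, n_features)
    if abs(corr[i, j]) > threshold
]

print("Highly correlated feature pairs:", high_pairs)
```

Only the pair (0, 1) should be flagged here. Pairwise correlation misses dependencies that involve three or more features, so treat it as a first check, not a complete diagnosis.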