Linear regression is one of the simplest and most interpretable algorithms in machine learning. It models the relationship between one or more input variables and a continuous output variable by fitting a straight line (in simple regression) or hyperplane (in multiple regression). Scikit-learn offers a straightforward implementation of linear regression through the LinearRegression class.
Key Characteristics of Linear Regression
- Continuous Target Variable: Predicts real-valued outputs.
- Assumes Linearity: Relationship between features and target is linear.
- Interpretability: Coefficients explain feature impact.
- No Need for Scaling: Works without feature scaling (unlike regularized versions).
- Fast and Efficient: Suitable for large datasets with linear patterns.
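To make these characteristics concrete, here is a minimal sketch that fits a line to synthetic data; the feature, coefficients, and noise level are invented purely for illustration.
# Fit a line to synthetic data (numbers are made up for illustration)
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
X = rng.uniform(500, 3500, size=(100, 1))                  # e.g., square footage
y = 50 * X[:, 0] + 20000 + rng.normal(0, 5000, size=100)   # roughly linear target with noise
model = LinearRegression().fit(X, y)
print("Slope:", model.coef_[0], "Intercept:", model.intercept_)  # interpretable parameters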
Basic Rules for Using Linear Regression
- Ensure features are numerically encoded.
- Check for linear relationship between inputs and output.
- Remove multicollinearity among features if possible (a quick correlation check is sketched after this list).
- Split dataset into training and testing sets.
- Evaluate model with RMSE or R² score.
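The sketch below applies these rules to a tiny hypothetical DataFrame; the column names and values are assumptions made for illustration, not a real dataset.
# Hypothetical preparation steps; column names and values are assumed
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
    'SqFt': [1200, 1500, 1800, 2100],
    'Neighborhood': ['A', 'B', 'A', 'C'],     # categorical, must be encoded
    'Price': [200000, 250000, 280000, 330000],
})
# Numerically encode the categorical feature
X = pd.get_dummies(df[['SqFt', 'Neighborhood']], drop_first=True).astype(float)
y = df['Price']
# Quick multicollinearity check via pairwise correlations
print(X.corr())
# Hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)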
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Model | from sklearn.linear_model import LinearRegression | Loads the regression model class |
| 2 | Create Model | model = LinearRegression() | Initializes the model |
| 3 | Train Model | model.fit(X_train, y_train) | Trains the model on training data |
| 4 | Make Predictions | y_pred = model.predict(X_test) | Predicts target values |
| 5 | Evaluate RMSE | mean_squared_error(y_test, y_pred, squared=False) | Computes root mean squared error |
| 6 | Evaluate R² Score | r2_score(y_test, y_pred) | Measures goodness of fit |
Syntax Explanation
1. Import and Initialize Model
- What is it? Loads and prepares the regression model.
- Syntax:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
- Explanation:
- Prepares a fresh instance of linear regression.
- By default, the model fits an intercept (fit_intercept=True) and leaves the features unscaled.
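A short sketch of the defaults (get_params lists the constructor arguments and their current values):
from sklearn.linear_model import LinearRegression
model = LinearRegression()       # fit_intercept=True by default
print(model.get_params())        # inspect the constructor settings
# If the data is already centered, the intercept can be disabled explicitly
model_no_intercept = LinearRegression(fit_intercept=False)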
2. Train the Model
- What is it? Fits the linear regression model to training data.
- Syntax:
model.fit(X_train, y_train)
- Explanation:
- Learns the weights (coefficients) of input features.
- Fits a line or hyperplane that minimizes the sum of squared errors (ordinary least squares).
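Once fitted, the learned weights are exposed as attributes. A brief sketch, assuming X_train and y_train come from an earlier train/test split:
model.fit(X_train, y_train)
# coef_ holds one weight per feature; intercept_ is the bias term
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)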
3. Make Predictions
- What is it? Predicts target values using the trained model.
- Syntax:
y_pred = model.predict(X_test)
- Explanation:
- Applies learned coefficients to unseen data.
- Produces continuous-valued outputs.
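Prediction works the same way for a single new observation. The values below are made up, and the column names assume the SqFt/Bedrooms/LocationIndex features used in the project later in this chapter:
import pandas as pd
# One hypothetical house; column names must match the training features
new_house = pd.DataFrame([[1500, 3, 7]], columns=['SqFt', 'Bedrooms', 'LocationIndex'])
print(model.predict(new_house))   # a single continuous price estimate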
4. Evaluate with RMSE
- What is it? Measures average prediction error in the same unit as the target.
- Syntax:
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, y_pred, squared=False)
- Explanation:
- Common metric for regression tasks.
- Lower RMSE = better model.
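Note that newer scikit-learn releases deprecate and eventually remove the squared=False shortcut in favor of a dedicated helper. Either of the sketches below yields the same number:
import numpy as np
from sklearn.metrics import mean_squared_error
# Version-independent alternative: take the square root manually
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# In newer scikit-learn releases, a dedicated function is available instead:
# from sklearn.metrics import root_mean_squared_error
# rmse = root_mean_squared_error(y_test, y_pred)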
5. Evaluate with R² Score
- What is it? Represents how much variance in the target is explained by features.
- Syntax:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
- Explanation:
- Equals 1 for a perfect fit and 0 when the model does no better than predicting the mean; it can even go negative for very poor fits.
- Indicates the strength of the linear relationship.
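For intuition, R² compares the model's squared error with that of always predicting the mean of the target. A manual sketch, assuming y_test and y_pred as above:
import numpy as np
# R² = 1 - SS_res / SS_tot: improvement over always predicting the mean
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print(1 - ss_res / ss_tot)   # matches r2_score(y_test, y_pred)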
Real-Life Project: Predicting House Prices
Project Name
House Price Prediction Using Linear Regression
Project Overview
This project demonstrates the use of linear regression to predict house prices based on features such as square footage, number of bedrooms, and location index.
Project Goal
- Build and evaluate a linear regression model
- Predict continuous house prices
- Interpret coefficients to understand feature impact
Code for This Project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
data = pd.read_csv('house_prices.csv')
X = data[['SqFt', 'Bedrooms', 'LocationIndex']]
y = data['Price']
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
# Note: newer scikit-learn releases drop squared=False; use root_mean_squared_error there
rmse = mean_squared_error(y_test, y_pred, squared=False)
r2 = r2_score(y_test, y_pred)
print("RMSE:", rmse)
print("R² Score:", r2)
Expected Output
- RMSE value indicating prediction error
- R² score showing how well features explain price
- Trained model ready for deployment or analysis
Common Mistakes to Avoid
- ❌ Using categorical variables without encoding
- ❌ Failing to check for multicollinearity
- ❌ Ignoring assumptions of linearity and homoscedasticity
- ❌ Relying on RMSE alone: visualize the residuals as well (see the sketch after this list)
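As a complement to RMSE, a quick residual plot helps check the linearity and homoscedasticity assumptions. The sketch below uses matplotlib and assumes y_test and y_pred from the project above:
import matplotlib.pyplot as plt
# Residuals should scatter randomly around zero if the assumptions hold
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted price')
plt.ylabel('Residual')
plt.title('Residual plot')
plt.show()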
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
Also explore:
- 🔗 Scikit-learn Linear Models Docs: https://scikit-learn.org/stable/modules/linear_model.html
- 🔗 Kaggle Regression Challenges
- 🔗 Visualizing Linear Regression with Matplotlib or Seaborn
