Linear regression is one of the simplest and most interpretable algorithms in machine learning. It models the relationship between one or more input variables and a continuous output variable by fitting a straight line (in simple regression) or hyperplane (in multiple regression). Scikit-learn offers a straightforward implementation of linear regression through the LinearRegression class.
Key Characteristics of Linear Regression
- Continuous Target Variable: Predicts real-valued outputs.
- Assumes Linearity: Relationship between features and target is linear.
- Interpretability: Coefficients explain feature impact.
- No Need for Scaling: Works without feature scaling (unlike regularized versions).
- Fast and Efficient: Suitable for large datasets with linear patterns.
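To make these characteristics concrete, here is a minimal sketch that fits a line to synthetic data; the feature, coefficients, and noise level are invented purely for illustration.
# Fit a line to synthetic data (numbers are made up for illustration)
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
X = rng.uniform(500, 3500, size=(100, 1))                  # e.g., square footage
y = 50 * X[:, 0] + 20000 + rng.normal(0, 5000, size=100)   # roughly linear target with noise
model = LinearRegression().fit(X, y)
print("Slope:", model.coef_[0], "Intercept:", model.intercept_)  # interpretable parameters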
Basic Rules for Using Linear Regression
- Ensure features are numerically encoded.
- Check for linear relationship between inputs and output.
- Remove multicollinearity among features if possible (a quick correlation check is sketched after this list).
- Split dataset into training and testing sets.
- Evaluate model with RMSE or R² score.
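The sketch below applies these rules to a tiny hypothetical DataFrame; the column names and values are assumptions made for illustration, not a real dataset.
# Hypothetical preparation steps; column names and values are assumed
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
    'SqFt': [1200, 1500, 1800, 2100],
    'Neighborhood': ['A', 'B', 'A', 'C'],     # categorical, must be encoded
    'Price': [200000, 250000, 280000, 330000],
})
# Numerically encode the categorical feature
X = pd.get_dummies(df[['SqFt', 'Neighborhood']], drop_first=True).astype(float)
y = df['Price']
# Quick multicollinearity check via pairwise correlations
print(X.corr())
# Hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)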
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Model | from sklearn.linear_model import LinearRegression | Loads the regression model class |
| 2 | Create Model | model = LinearRegression() | Initializes the model |
| 3 | Train Model | model.fit(X_train, y_train) | Trains the model on training data |
| 4 | Make Predictions | y_pred = model.predict(X_test) | Predicts target values |
| 5 | Evaluate RMSE | mean_squared_error(y_test, y_pred, squared=False) | Computes root mean squared error |
| 6 | Evaluate R² Score | r2_score(y_test, y_pred) | Measures goodness of fit |
Syntax Explanation
1. Import and Initialize Model
- What is it? Loads and prepares the regression model.
- Syntax:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
- Explanation:
- Prepares a fresh instance of linear regression.
- By default, the model fits an intercept (fit_intercept=True) and leaves the features unscaled.
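A short sketch of the defaults (get_params lists the constructor arguments and their current values):
from sklearn.linear_model import LinearRegression
model = LinearRegression()       # fit_intercept=True by default
print(model.get_params())        # inspect the constructor settings
# If the data is already centered, the intercept can be disabled explicitly
model_no_intercept = LinearRegression(fit_intercept=False)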
2. Train the Model
- What is it? Fits the linear regression model to training data.
- Syntax:
model.fit(X_train, y_train)
- Explanation:
- Learns the weights (coefficients) of input features.
- Fits a line or hyperplane that minimizes the sum of squared errors (ordinary least squares).
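Once fitted, the learned weights are exposed as attributes. A brief sketch, assuming X_train and y_train come from an earlier train/test split:
model.fit(X_train, y_train)
# coef_ holds one weight per feature; intercept_ is the bias term
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)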
3. Make Predictions
- What is it? Predicts target values using the trained model.
- Syntax:
y_pred = model.predict(X_test)
- Explanation:
- Applies learned coefficients to unseen data.
- Produces continuous-valued outputs.
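Prediction works the same way for a single new observation. The values below are made up, and the column names assume the SqFt/Bedrooms/LocationIndex features used in the project later in this chapter:
import pandas as pd
# One hypothetical house; column names must match the training features
new_house = pd.DataFrame([[1500, 3, 7]], columns=['SqFt', 'Bedrooms', 'LocationIndex'])
print(model.predict(new_house))   # a single continuous price estimate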
4. Evaluate with RMSE
- What is it? Measures average prediction error in the same unit as the target.
- Syntax:
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, y_pred, squared=False)
- Explanation:
- Common metric for regression tasks.
- Lower RMSE = better model.
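Note that newer scikit-learn releases deprecate and eventually remove the squared=False shortcut in favor of a dedicated helper. Either of the sketches below yields the same number:
import numpy as np
from sklearn.metrics import mean_squared_error
# Version-independent alternative: take the square root manually
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# In newer scikit-learn releases, a dedicated function is available instead:
# from sklearn.metrics import root_mean_squared_error
# rmse = root_mean_squared_error(y_test, y_pred)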
5. Evaluate with R² Score
- What is it? Represents how much variance in the target is explained by features.
- Syntax:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
- Explanation:
- Equals 1 for a perfect fit and 0 when the model does no better than predicting the mean; it can even go negative for very poor fits.
- Indicates the strength of the linear relationship.
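For intuition, R² compares the model's squared error with that of always predicting the mean of the target. A manual sketch, assuming y_test and y_pred as above:
import numpy as np
# R² = 1 - SS_res / SS_tot: improvement over always predicting the mean
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print(1 - ss_res / ss_tot)   # matches r2_score(y_test, y_pred)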
Real-Life Project: Predicting House Prices
Project Name
House Price Prediction Using Linear Regression
Project Overview
This project demonstrates the use of linear regression to predict house prices based on features such as square footage, number of bedrooms, and location index.
Project Goal
- Build and evaluate a linear regression model
- Predict continuous house prices
- Interpret coefficients to understand feature impact
Code for This Project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
data = pd.read_csv('house_prices.csv')
X = data[['SqFt', 'Bedrooms', 'LocationIndex']]
y = data['Price']
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
# Note: newer scikit-learn releases drop squared=False; use root_mean_squared_error there
rmse = mean_squared_error(y_test, y_pred, squared=False)
r2 = r2_score(y_test, y_pred)
print("RMSE:", rmse)
print("R² Score:", r2)
Expected Output
- RMSE value indicating prediction error
- R² score showing how well features explain price
- Trained model ready for deployment or analysis
Common Mistakes to Avoid
- ❌ Using categorical variables without encoding
- ❌ Failing to check for multicollinearity
- ❌ Ignoring assumptions of linearity and homoscedasticity
- ❌ Relying on RMSE alone: visualize the residuals as well (see the sketch after this list)
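As a complement to RMSE, a quick residual plot helps check the linearity and homoscedasticity assumptions. The sketch below uses matplotlib and assumes y_test and y_pred from the project above:
import matplotlib.pyplot as plt
# Residuals should scatter randomly around zero if the assumptions hold
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted price')
plt.ylabel('Residual')
plt.title('Residual plot')
plt.show()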
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
Also explore:
- 🔗 Scikit-learn Linear Models Docs: https://scikit-learn.org/stable/modules/linear_model.html
- 🔗 Kaggle Regression Challenges
- 🔗 Visualizing Linear Regression with Matplotlib or Seaborn
