Introduction to Supervised Learning in Scikit-learn

Supervised learning is one of the most common machine learning paradigms: the algorithm learns a mapping from input features to known output labels. Scikit-learn provides a rich set of tools for building and evaluating supervised models for both classification and regression tasks.

Key Characteristics of Supervised Learning

Labeled Training Data: Requires input-output pairs for training.
Two Main Types: Classification (categorical target) and Regression (continuous target).
Model Evaluation: Uses metrics such as accuracy and precision for classification, and RMSE for regression.
Generalization: Learns patterns to make predictions on unseen data.
Scikit-learn Support: Offers estimators, pipelines, and evaluation tools.
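
To make the classification/regression distinction concrete, here is a minimal sketch on made-up numbers. The arrays and variable names are illustrative assumptions, not taken from a real dataset. Both estimators expose the same fit/predict interface; only the type of target differs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy inputs: one feature, six samples (made up for illustration).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Classification target: a category (0 or 1) for each sample.
y_class = np.array([0, 0, 0, 1, 1, 1])

# Regression target: a continuous value, here exactly 2*x + 1.
y_reg = 2.0 * X.ravel() + 1.0

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_reg)

label = clf.predict([[5.5]])[0]   # a class label
value = reg.predict([[7.0]])[0]   # a real number, close to 2*7 + 1 = 15
print(label, value)
```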

Basic Rules for Supervised Learning in Scikit-learn

Split data into train and test sets using train_test_split().
Select a model type suited to the task (LogisticRegression or RandomForestClassifier for classification, LinearRegression for regression, etc.).
Fit the model using model.fit(X_train, y_train).
Predict using model.predict(X_test).
Evaluate with relevant metrics using sklearn.metrics.
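
The five steps above can be sketched end to end as follows. This is only an illustration: the data comes from make_classification, a scikit-learn helper that generates a synthetic dataset, so the resulting accuracy says nothing about any real problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 4 features, binary target.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 1. Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Select a model type, 3. fit it on the training data only.
model = LogisticRegression()
model.fit(X_train, y_train)

# 4. Predict on the held-out inputs.
y_pred = model.predict(X_test)

# 5. Evaluate with a metric from sklearn.metrics.
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
```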

Syntax Table

No.	Function	Syntax Example	Description
1	Train-Test Split	`train_test_split(X, y)`	Splits data for training/testing
2	Model Training	`model.fit(X_train, y_train)`	Trains the supervised model
3	Make Predictions	`model.predict(X_test)`	Predicts outputs from test input
4	Accuracy Score	`accuracy_score(y_test, y_pred)`	Measures performance (classification)
5	RMSE Score	`root_mean_squared_error(y_test, y_pred)`	Measures regression error (scikit-learn 1.4+)

Syntax Explanation

1. Train-Test Split

What is it? Separates your dataset into training and testing sets.
Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Explanation:
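- Lets you detect overfitting by evaluating on data the model never saw during training. The split itself does not prevent overfitting.
- test_size=0.2 reserves 20% of the samples for testing; the remaining 80% are used for training.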
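
As a hedged illustration of two optional arguments that often matter in practice, the sketch below passes random_state (to make the shuffle reproducible) and stratify (to keep class proportions similar in both parts). The toy arrays are assumptions made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, imbalanced labels (15 zeros, 5 ones).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of 20 samples = 4 test samples
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # preserve the 75/25 class ratio in both parts
)

print(X_train.shape, X_test.shape)   # 16 training rows, 4 test rows
train_counts = np.bincount(y_train)  # class counts in train: 12 and 4
test_counts = np.bincount(y_test)    # class counts in test: 3 and 1
print(train_counts, test_counts)
```

Without stratify, a small test set drawn from imbalanced data can end up with very few (or no) samples of the minority class.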

2. Model Training

What is it? Fits the model on training data.
Syntax:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

Explanation:
- The model learns the relationship between the features in X_train and the labels in y_train.
- fit() runs the optimization procedure of the chosen algorithm; for LogisticRegression the default solver is lbfgs.
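
What fit() learned can be inspected afterwards: scikit-learn stores fitted values in attributes whose names end with an underscore. The sketch below uses a tiny made-up dataset (an assumption for illustration) and looks at three such attributes of LogisticRegression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: one feature, labels switch from 0 to 1 as x grows.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

# Attributes ending in "_" exist only after fitting.
print(model.classes_)     # the distinct labels seen during training
print(model.coef_)        # one weight per feature, shape (1, 1) here
print(model.intercept_)   # the bias term, shape (1,)
```

A positive coefficient here means that larger feature values push the prediction towards class 1.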

3. Make Predictions

What is it? Uses the trained model to make predictions.
Syntax:

y_pred = model.predict(X_test)

Explanation:
- Applies the fitted model to the test inputs and returns one prediction per row.
- The predictions are then compared with y_test to compute accuracy, error, or other performance metrics.
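
For classifiers, predict() returns hard labels, while many estimators (LogisticRegression among them) also offer predict_proba() for class probabilities. A minimal sketch on made-up data, where X_new is a hypothetical pair of unseen inputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: one feature, labels switch from 0 to 1 as x grows.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X_train, y_train)

# Hypothetical unseen inputs, one clearly low and one clearly high.
X_new = np.array([[1.5], [5.5]])

labels = model.predict(X_new)        # hard class labels, one per row
proba = model.predict_proba(X_new)   # one column per class, rows sum to 1

print(labels)
print(proba)
```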

4. Accuracy Score (for Classification)

What is it? Measures the fraction of predictions that match the true labels (a value between 0 and 1, not a percentage).
Syntax:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

Explanation:
- Works for classification problems.
- 1.0 = perfect score, 0.0 = no correct predictions.
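
A small worked example, with label arrays made up for illustration, shows that accuracy_score is simply the share of positions where prediction and truth agree:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 8 samples, 2 of which are predicted incorrectly.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)                       # 6 correct out of 8
manual = np.mean(y_true == y_pred)                         # same value by hand
n_correct = accuracy_score(y_true, y_pred, normalize=False)  # count of correct predictions

print(acc, manual, n_correct)
```

Here 6 of 8 predictions are correct, so the score is 0.75. On heavily imbalanced data, accuracy alone can be misleading, which is why precision and related metrics are often reported alongside it.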

5. RMSE Score (for Regression)

What is it? Root mean squared error: the square root of the average squared difference between predictions and true values.
Syntax:

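# Requires scikit-learn 1.4 or newer; the older squared=False argument of
# mean_squared_error was deprecated in 1.4 and removed in 1.6.
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_test, y_pred)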

Explanation:
- Evaluates how far predictions are from true values, in the same units as the target.
- Lower RMSE indicates better performance; squaring makes large errors count more than small ones.
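
The following sketch computes RMSE on four made-up values so the arithmetic can be checked by hand. It takes the square root of mean_squared_error with NumPy, an approach chosen here because it does not depend on which RMSE helper a given scikit-learn version provides.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up regression targets and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Errors are 1, 0, -2, -1; their squares are 1, 0, 4, 1; the mean is 1.5.
mse = mean_squared_error(y_true, y_pred)

# RMSE is the square root of the MSE, about 1.2247 for these values.
rmse = np.sqrt(mse)

print(mse, rmse)
```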

Real-Life Project: Predicting Student Exam Pass/Fail

Project Name

Binary Classification for Exam Outcome Prediction

Project Overview

This project aims to predict whether a student will pass or fail an exam based on study hours and past performance using supervised learning.

Project Goal

Train a logistic regression classifier
Predict outcomes on new student records
Evaluate model accuracy

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
data = pd.read_csv('student_scores.csv')
X = data[['StudyHours', 'PastScore']]
y = data['Pass']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))

Expected Output

Trained model using student study features
Predictions for pass/fail labels
Accuracy score between 0 and 1
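
Because student_scores.csv is not included here, the sketch below generates a synthetic stand-in with the same column names so the workflow can be run as-is. Everything about the data is an assumption: the value ranges, the labelling rule, and the noise are invented for illustration. It also covers the second project goal by predicting on two hypothetical new student records. max_iter is raised only to reduce the chance of convergence warnings on unscaled features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for student_scores.csv (all values are invented).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "StudyHours": rng.uniform(0, 10, n),
    "PastScore": rng.uniform(30, 100, n),
})
# Hypothetical labelling rule plus noise: more study and higher past
# scores make passing more likely.
signal = 0.6 * data["StudyHours"] + 0.05 * data["PastScore"] + rng.normal(0, 0.5, n)
data["Pass"] = (signal > 6.0).astype(int)

X = data[["StudyHours", "PastScore"]]
y = data["Pass"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print("Accuracy:", acc)

# Predict for new records; column names must match the training data.
new_students = pd.DataFrame({"StudyHours": [1.0, 9.0], "PastScore": [40.0, 90.0]})
new_pred = model.predict(new_students)
print(new_pred)  # one 0/1 label per new student
```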

Common Mistakes to Avoid

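❌ Not splitting data properly (for example, fitting preprocessing on the full dataset before splitting, which leaks test information)
❌ Using a regression model for categorical outputs (note that LogisticRegression, despite its name, is a classifier)
❌ Evaluating only on the training data instead of the held-out test set
❌ Skipping feature scaling when the model type is sensitive to feature scale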
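
One way to avoid both the scaling and the leakage mistakes is to wrap preprocessing and model in a single pipeline, so the scaler is fitted on the training data only. The sketch below is a minimal illustration on synthetic data; the choice of StandardScaler and the exaggerated feature scale are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is blown up to mimic mismatched units.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The pipeline fits the scaler on X_train only, then trains the classifier
# on the scaled features. At prediction time the same scaling is reused.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# Evaluate on the held-out test set, never on the training data alone.
test_acc = pipe.score(X_test, y_test)
print("Test accuracy:", test_acc)
```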