Supervised learning is one of the most common machine learning paradigms, where the algorithm learns a mapping between input features and known output labels. Scikit-learn provides a rich set of tools for building and evaluating supervised learning models for both classification and regression tasks.
Key Characteristics of Supervised Learning
- Labeled Training Data: Requires input-output pairs for training.
- Two Main Types: Classification (categorical target) and Regression (continuous target).
- Model Evaluation: Uses metrics such as accuracy and precision (classification) or RMSE (regression).
- Generalization: Learns patterns to make predictions on unseen data.
- Scikit-learn Support: Scikit-learn offers estimators, pipelines, and evaluation tools for each step.
Basic Rules for Supervised Learning in Scikit-learn
- Split data into train and test sets using train_test_split().
- Select an appropriate model type (LogisticRegression, RandomForestClassifier, etc.).
- Fit the model using model.fit(X_train, y_train).
- Predict using model.predict(X_test).
- Evaluate with relevant metrics from sklearn.metrics.
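The five steps above can be sketched end to end as follows. This is a minimal illustration, assuming the built-in iris dataset and a RandomForestClassifier; neither choice comes from the steps themselves, and any classifier with fit() and predict() would slot in the same way.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Built-in iris dataset: 150 samples, 4 features, 3 classes (assumed for illustration).
X, y = load_iris(return_X_y=True)

# 1. Split data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Select a model type.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 3. Fit the model on the training data only.
model.fit(X_train, y_train)

# 4. Predict labels for the held-out test inputs.
y_pred = model.predict(X_test)

# 5. Evaluate with a metric from sklearn.metrics.
acc = accuracy_score(y_test, y_pred)
print("Test accuracy:", acc)
```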
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Train-Test Split | train_test_split(X, y) | Splits data for training/testing |
| 2 | Model Training | model.fit(X_train, y_train) | Trains the supervised model |
| 3 | Make Predictions | model.predict(X_test) | Predicts outputs from test input |
| 4 | Accuracy Score | accuracy_score(y_test, y_pred) | Measures performance (classification) |
| 5 | RMSE Score | root_mean_squared_error(y_test, y_pred) | Measures regression error (scikit-learn 1.4+) |
Syntax Explanation
1. Train-Test Split
- What is it? Separates your dataset into training and testing sets.
- Syntax:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
- Explanation:
- Evaluating on held-out data reveals overfitting that training-set scores would hide.
- test_size=0.2 means 20% of the data is used for testing.
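Two optional arguments are often useful alongside test_size. The sketch below uses made-up toy data (an assumption for illustration) to show random_state, which makes the split reproducible, and stratify, which keeps the class ratio the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 2 features, imbalanced labels (80 zeros, 20 ones).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of rows go to the test set
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # keep the 80/20 class ratio in both subsets
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

Without stratify, a small test set drawn from imbalanced data can end up with very few examples of the rare class, which makes the test score unreliable.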
2. Model Training
- What is it? Fits the model on training data.
- Syntax:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
- Explanation:
- The model learns patterns from X_train to predict y_train.
- Applies optimization based on the selected algorithm.
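To make "learns patterns" concrete, the sketch below fits LogisticRegression on a tiny made-up dataset (an assumption for illustration) and inspects what fit() stored on the estimator. Scikit-learn marks learned attributes with a trailing underscore.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, label is 1 when the feature is large.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

# Attributes ending in "_" exist only after fitting.
print("Classes seen:", model.classes_)
print("Weight per feature:", model.coef_)
print("Intercept:", model.intercept_)
```

A positive weight here means larger feature values push the prediction toward class 1.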
3. Make Predictions
- What is it? Uses the trained model to make predictions.
- Syntax:
y_pred = model.predict(X_test)
- Explanation:
- Applies learned rules to test inputs.
- Used to evaluate accuracy, error, or other performance metrics.
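For classifiers such as LogisticRegression, predict() returns hard labels, while predict_proba() returns the estimated probability of each class. The sketch below uses the same kind of made-up toy data (an assumption, not part of the original example) to show both side by side.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, label is 1 when the feature is large.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = LogisticRegression().fit(X_train, y_train)

# Two unseen inputs, one on each side of the decision boundary.
X_new = np.array([[1.5], [8.5]])

labels = model.predict(X_new)        # one class label per row
probs = model.predict_proba(X_new)   # one column per class, rows sum to 1

print("Labels:", labels)
print("Probabilities:", probs)
```

Not every estimator offers predict_proba(), so it is worth checking the documentation of the model in use.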
4. Accuracy Score (for Classification)
- What is it? Measures the fraction of predictions that match the true labels.
- Syntax:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
- Explanation:
- Works for classification problems.
- 1.0 = perfect score, 0.0 = no correct predictions.
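Accuracy is simply the number of matching labels divided by the total. A small sketch with hand-written labels (made up for illustration) shows that the manual calculation agrees with accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for 8 samples.
y_test = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Manual accuracy: share of positions where prediction equals truth.
manual = np.mean(y_test == y_pred)
library = accuracy_score(y_test, y_pred)

print(manual, library)  # 6 of 8 labels match, so both are 0.75
```

Accuracy can be misleading on imbalanced data, where always predicting the majority class already scores high.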
5. RMSE Score (for Regression)
- What is it? Root mean squared error: the square root of the average squared difference between predicted and true values.
- Syntax:
from sklearn.metrics import root_mean_squared_error  # scikit-learn 1.4 or newer
rmse = root_mean_squared_error(y_test, y_pred)
# On older versions: mean_squared_error(y_test, y_pred, squared=False)
- Explanation:
- Evaluates how far predictions are from true values, in the same units as the target.
- Lower RMSE indicates better performance.
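Since RMSE is the square root of the mean squared error, it can also be computed by hand, which works on any scikit-learn version. The values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values for a regression task.
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Errors are 1, 0, -1, -2; their squares are 1, 0, 1, 4; the mean is 1.5.
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("MSE:", mse)
print("RMSE:", rmse)
```

Because squaring happens before averaging, a single large error raises RMSE more than several small ones.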
Real-Life Project: Predicting Student Exam Pass/Fail
Project Name
Binary Classification for Exam Outcome Prediction
Project Overview
This project aims to predict whether a student will pass or fail an exam based on study hours and past performance using supervised learning.
Project Goal
- Train a logistic regression classifier
- Predict outcomes on new student records
- Evaluate model accuracy
Code for This Project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load data
data = pd.read_csv('student_scores.csv')
X = data[['StudyHours', 'PastScore']]
y = data['Pass']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
Expected Output
- Trained model using student study features
- Predictions for pass/fail labels
- Accuracy score between 0 and 1
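The project code needs a student_scores.csv file. To try the workflow without one, here is a sketch that generates a synthetic stand-in with the same column names. The data, the pass rule, and the max_iter setting are all assumptions for illustration, not real student records.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for student_scores.csv (hypothetical values).
rng = np.random.default_rng(0)
n = 200
study_hours = rng.uniform(0, 10, size=n)
past_score = rng.uniform(30, 100, size=n)
# Assumed rule: more study and higher past scores raise the chance of passing.
signal = 0.6 * study_hours + 0.05 * past_score - 6.0 + rng.normal(0, 1, size=n)
data = pd.DataFrame({
    "StudyHours": study_hours,
    "PastScore": past_score,
    "Pass": (signal > 0).astype(int),
})

X = data[["StudyHours", "PastScore"]]
y = data["Pass"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# max_iter is raised because the features are unscaled.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

# Predict for one new (hypothetical) student record.
new_student = pd.DataFrame({"StudyHours": [8.0], "PastScore": [85.0]})
prediction = model.predict(new_student)
print("Predicted Pass label:", prediction[0])
```

Swapping the synthetic DataFrame for pd.read_csv('student_scores.csv') gives back the original project code.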
Common Mistakes to Avoid
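- ❌ Not splitting data properly (for example, fitting preprocessing on the full dataset before the split, which leaks test information)
- ❌ Using a regression model for categorical outputs (LogisticRegression, despite its name, is a classifier)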
- ❌ Failing to evaluate model on test data
- ❌ Skipping feature scaling (if needed by model type)
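One way to avoid both the scaling and the data-leakage mistakes is to wrap preprocessing and the model in a pipeline, so the scaler is fitted on training data only. The sketch below assumes the built-in breast cancer dataset and StandardScaler; both are illustrative choices, not part of the project above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in binary classification dataset (assumed for illustration).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The pipeline fits the scaler on X_train only, then trains the classifier
# on the scaled features. At predict time the same scaling is reapplied.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

acc = accuracy_score(y_test, pipe.predict(X_test))
print("Test accuracy with scaling:", acc)
```

Tree-based models such as RandomForestClassifier are generally insensitive to feature scale, so this step matters most for linear and distance-based models.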
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
Also explore:
- 🔗 Scikit-learn Supervised Learning Docs: https://scikit-learn.org/stable/supervised_learning.html
