Introduction to Supervised Learning in Scikit-learn

Supervised learning is one of the most common machine learning paradigms: the algorithm learns a mapping from input features to known output labels. Scikit-learn provides a rich set of tools for building and evaluating supervised learning models for both classification and regression tasks.

Key Characteristics of Supervised Learning

  • Labeled Training Data: Requires input-output pairs for training.
  • Two Main Types: Classification (categorical target) and Regression (continuous target).
  • Model Evaluation: Uses metrics like accuracy, precision, RMSE, etc.
  • Generalization: Learns patterns to make predictions on unseen data.
  • Scikit-learn Support: Scikit-learn offers estimators, pipelines, and evaluation tools for each of these steps.
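
To make the two main types concrete, here is a minimal sketch that fits one classifier and one regressor through the same fit/predict interface. The toy arrays are invented for illustration and are not taken from any real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Six samples with a single feature (toy values, assumed for illustration).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Classification: the target is a category (0 or 1).
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_class)
class_pred = clf.predict(np.array([[5.5]]))
print(class_pred)  # a class label

# Regression: the target is a continuous number (here y = 2x + 1).
y_reg = 2.0 * X.ravel() + 1.0
reg = LinearRegression().fit(X, y_reg)
reg_pred = reg.predict(np.array([[7.0]]))
print(reg_pred)  # approximately [15.]
```

Both estimators expose the same methods; only the kind of target and the kind of prediction differ.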

Basic Rules for Supervised Learning in Scikit-learn

  • Split data into train and test sets using train_test_split().
  • Select an appropriate model type (LogisticRegression, RandomForestClassifier, etc.).
  • Fit the model using model.fit(X_train, y_train).
  • Predict using model.predict(X_test).
  • Evaluate with relevant metrics using sklearn.metrics.
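
The five rules above can be sketched end to end as follows. The data comes from make_classification, which is an assumption made so the example runs without an external file; any labeled dataset would follow the same steps.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumed here purely for illustration).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 1. Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Select a model type.
model = LogisticRegression()

# 3. Fit on the training data only.
model.fit(X_train, y_train)

# 4. Predict labels for the held-out test inputs.
y_pred = model.predict(X_test)

# 5. Evaluate with a metric suited to classification.
acc = accuracy_score(y_test, y_pred)
print("Test accuracy:", acc)
```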

Syntax Table

No. | Function         | Syntax Example                            | Description
1   | Train-Test Split | train_test_split(X, y)                    | Splits data into training and test sets
2   | Model Training   | model.fit(X_train, y_train)               | Trains the supervised model
3   | Make Predictions | model.predict(X_test)                     | Predicts outputs from test inputs
4   | Accuracy Score   | accuracy_score(y_test, y_pred)            | Measures performance (classification)
5   | RMSE Score       | mean_squared_error(y_test, y_pred) ** 0.5 | Measures regression error

Syntax Explanation

1. Train-Test Split

  • What is it? Separates your dataset into training and testing sets.
  • Syntax:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  • Explanation:
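    • Holds out data the model never sees during training, so you can detect overfitting by measuring performance on unseen samples. The split itself does not prevent overfitting.
    • test_size=0.2 reserves 20% of the samples for testing; pass random_state to make the split reproducible.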
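
As a small sketch of how the split behaves, the example below uses a ten-sample toy array (invented for illustration) and adds two optional arguments, random_state and stratify, that are not used in the snippet above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, 5 samples per class (illustrative only).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,    # 20% of 10 samples -> 2 test samples
    random_state=42,  # fixed seed so the split is reproducible
    stratify=y,       # keep the class ratio the same in both sets
)

print(X_train.shape, X_test.shape)                # (8, 2) (2, 2)
print(np.bincount(y_train), np.bincount(y_test))  # [4 4] [1 1]
```

Stratifying is optional, but it can matter for small or imbalanced datasets, where a purely random split might leave a class underrepresented in the test set.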

2. Model Training

  • What is it? Fits the model on training data.
  • Syntax:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
  • Explanation:
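    • fit() estimates the model's parameters from X_train so that its predictions match y_train as closely as possible.
    • Each estimator runs its own optimization procedure; LogisticRegression, for example, minimizes a regularized log loss.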
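
To show what fitting leaves behind, the sketch below trains on a tiny one-feature toy set (values assumed for illustration) and inspects the learned attributes, which scikit-learn marks with a trailing underscore.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny toy set: low values are class 0, high values are class 1.
X_train = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
fitted = model.fit(X_train, y_train)  # fit() returns the estimator itself

print(fitted is model)                            # True
print(model.classes_)                             # [0 1]
print(model.coef_.shape, model.intercept_.shape)  # (1, 1) (1,)
```

Because fit() returns the estimator, calls can be chained, as in LogisticRegression().fit(X_train, y_train).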

3. Make Predictions

  • What is it? Uses the trained model to make predictions.
  • Syntax:
y_pred = model.predict(X_test)
  • Explanation:
    • Applies the fitted parameters to the test inputs and returns one prediction per row.
    • The predictions are then compared with y_test to compute accuracy, error, or other performance metrics.
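
For classifiers, predict() returns hard labels, while many estimators (including LogisticRegression) also offer predict_proba() for class probabilities. The following sketch uses the same kind of toy data as an assumption; X_new is a hypothetical pair of unseen inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data (illustrative values only).
X_train = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X_train, y_train)

# Hypothetical unseen inputs: one clearly low, one clearly high.
X_new = np.array([[0.0], [9.0]])

labels = model.predict(X_new)       # one class label per row
probs = model.predict_proba(X_new)  # one probability per class, per row

print(labels)       # [0 1]
print(probs.shape)  # (2, 2)
```

Each row of probs sums to 1, and its columns follow the order of model.classes_.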

4. Accuracy Score (for Classification)

  • What is it? Measures the fraction of predictions that are correct, returned as a value between 0 and 1.
  • Syntax:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
  • Explanation:
    • Works for classification problems.
    • 1.0 = perfect score, 0.0 = no correct predictions.
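
As a worked example, the sketch below computes accuracy by hand and with accuracy_score on five made-up labels; four of the five predictions match, so both give 0.8.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels (illustrative only).
y_test = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Manual calculation: correct predictions / total predictions = 4 / 5.
manual = np.mean(y_test == y_pred)

acc = accuracy_score(y_test, y_pred)
print(manual, acc)  # 0.8 0.8
```

Accuracy alone can be misleading when one class is much more common than the other, which is why metrics such as precision are often reported alongside it.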

5. RMSE Score (for Regression)

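  • What is it? The root mean squared error: the square root of the average squared difference between predicted and true values.
  • Syntax:
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, y_pred) ** 0.5
  • Explanation:
    • Evaluates how far predictions are from true values, in the same units as the target.
    • Lower RMSE indicates better performance.
    • The older squared=False argument was deprecated in scikit-learn 1.4 and removed in 1.6; versions 1.4 and later also provide root_mean_squared_error(y_test, y_pred).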
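
The arithmetic behind RMSE can be checked on four made-up values. This sketch takes the square root of mean_squared_error, an approach chosen here because it does not depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values (illustrative only).
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Errors: 1, 0, -1, -2 -> squares: 1, 0, 1, 4 -> mean: 1.5
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# The same value computed directly from the definition.
manual = np.sqrt(np.mean((y_test - y_pred) ** 2))

print(mse)           # 1.5
print(rmse, manual)  # both approximately 1.2247
```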

Real-Life Project: Predicting Student Exam Pass/Fail

Project Name

Binary Classification for Exam Outcome Prediction

Project Overview

This project aims to predict whether a student will pass or fail an exam based on study hours and past performance using supervised learning.

Project Goal

  • Train a logistic regression classifier
  • Predict outcomes on new student records
  • Evaluate model accuracy

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
data = pd.read_csv('student_scores.csv')
X = data[['StudyHours', 'PastScore']]
y = data['Pass']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))

Expected Output

  • Trained model using student study features
  • Predictions for pass/fail labels
  • Accuracy score between 0 and 1
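
The project goals include predicting outcomes for new student records, which the code above does not show. Here is a minimal sketch of that step. Since student_scores.csv is not included, the DataFrame below is a hypothetical stand-in with invented values; only the column names are taken from the project code. The max_iter=1000 setting is also an assumption, added to give the solver room to converge on unscaled features.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for student_scores.csv (values invented for illustration).
data = pd.DataFrame({
    "StudyHours": [1, 2, 2, 3, 4, 5, 6, 7, 8, 9],
    "PastScore": [35, 40, 45, 50, 55, 60, 65, 70, 80, 85],
    "Pass": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
X = data[["StudyHours", "PastScore"]]
y = data["Pass"]

# Train on the full toy table; a real project would still hold out a test set.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# New records need the same column names, in the same order, as the training data.
new_students = pd.DataFrame({"StudyHours": [1.5, 8.5], "PastScore": [38, 82]})
predictions = model.predict(new_students)
print(predictions)  # [0 1]
```

In this toy table, 1 stands for pass and 0 for fail; the real file may encode the Pass column differently.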

Common Mistakes to Avoid

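  • ❌ Not splitting data properly (for example, evaluating on rows the model was trained on)
  • ❌ Using a regression model such as LinearRegression for categorical outputs (LogisticRegression, despite its name, is a classifier)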
  • ❌ Failing to evaluate model on test data
  • ❌ Skipping feature scaling (if needed by model type)
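
One way to avoid the scaling mistake is to bundle the scaler and the model in a pipeline, so the scaler is fitted on the training data only and then reused on the test data. The sketch below assumes synthetic data from make_classification and deliberately inflates one feature to mimic mismatched units.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration); one feature gets a much larger scale.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The pipeline standardizes the features, then fits the classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

# score() returns accuracy on the held-out test set for classifiers.
test_acc = clf.score(X_test, y_test)
print("Test accuracy:", test_acc)
```

Tree-based models such as RandomForestClassifier are generally insensitive to feature scale, so this step matters mainly for linear, distance-based, and gradient-based models.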

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan