Logistic Regression for Classification in Scikit-learn

Logistic regression is a fundamental classification algorithm that models the probability of class membership using a logistic (sigmoid) function. Despite its name, it performs classification rather than regression and handles both binary and multi-class tasks. Scikit-learn offers a robust implementation through the LogisticRegression class.

Key Characteristics of Logistic Regression

  • Classification, Not Regression: Used for binary or multi-class classification.
  • Outputs Probabilities: Estimates the likelihood of each class.
  • Sigmoid Function: Converts a linear combination of inputs into a probability (see the short sketch after this list).
  • Interpretable Coefficients: Feature weights show the direction and strength of each feature's effect on the log-odds.
  • Supports Regularization: Includes L1 and L2 penalties for generalization.
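
As a quick illustration of the sigmoid mapping, here is a minimal sketch (the values are arbitrary examples) that converts linear scores into probabilities the same way logistic regression does internally:

import numpy as np

def sigmoid(z):
    # Maps any real-valued score z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# z stands in for the linear combination of inputs (weights · features + intercept)
z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))  # approximately [0.119, 0.5, 0.953]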

Basic Rules for Logistic Regression

  • Target variable should be categorical (e.g., 0/1, or class labels).
  • Scale features for better convergence (see the pipeline sketch after this list).
  • For multi-class problems, use a solver that supports the multinomial formulation (e.g., 'lbfgs' or 'saga'); recent scikit-learn versions apply it automatically, while older versions expose multi_class='multinomial'.
  • Choose solver='liblinear', 'saga', or 'lbfgs' depending on dataset size and the penalty you need.
  • Evaluate using metrics like accuracy, precision, recall, and ROC-AUC.
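
As an example of the scaling rule above, a minimal sketch that chains a scaler and the classifier in one pipeline (the built-in breast cancer dataset is used purely as a stand-in for your own data):

from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Scaling the features first helps the solver converge quickly
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))  # mean accuracy on the training data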

Syntax Table

SL NO | Function              | Syntax Example                                       | Description
1     | Import Model          | from sklearn.linear_model import LogisticRegression | Loads the logistic regression class
2     | Create Model          | LogisticRegression()                                 | Initializes the logistic classifier
3     | Train Model           | model.fit(X_train, y_train)                          | Fits the model to training data
4     | Predict Labels        | model.predict(X_test)                                | Predicts class labels
5     | Predict Probabilities | model.predict_proba(X_test)                          | Gives class probabilities
6     | Evaluate Accuracy     | accuracy_score(y_test, y_pred)                       | Measures classification performance

Syntax Explanation

1. Import and Initialize Model

  • What is it? Loads the logistic regression model for classification.
  • Syntax:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
  • Explanation:
    • Supports binary and multi-class classification.
    • You can set the regularization type with the penalty parameter and the optimizer with solver, as in the sketch below.
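
For example, a hedged sketch of two common configurations (the parameter values here are illustrative, not recommendations):

from sklearn.linear_model import LogisticRegression

# L2 (ridge-style) regularization; C is the inverse regularization strength
model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000)

# L1 (sparsity-inducing) regularization needs a compatible solver such as 'liblinear' or 'saga'
sparse_model = LogisticRegression(penalty='l1', C=0.5, solver='liblinear')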

2. Fit the Model

  • What is it? Trains the model on labeled data.
  • Syntax:
model.fit(X_train, y_train)
  • Explanation:
    • Learns the coefficients of the logistic model.
    • Uses the sigmoid/logit function internally (a minimal fit is sketched below).
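
A minimal sketch of a fit, using the built-in Iris dataset purely as a stand-in for your own labeled data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)   # learns one weight per feature per class, plus intercepts

print(model.coef_.shape)      # (3, 4): three classes, four features
print(model.intercept_)       # one intercept per class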

3. Predict Class Labels

  • What is it? Predicts the most likely class for new data.
  • Syntax:
y_pred = model.predict(X_test)
  • Explanation:
    • Returns 0 or 1 for binary problems, or one of several class labels for multi-class (see the sketch below).
    • Useful for making final decisions.
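
A minimal sketch (continuing the Iris setup from the previous step, rebuilt here so the snippet runs on its own):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)   # hard class labels (0, 1, or 2 for Iris)
print(y_pred[:10])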

4. Predict Probabilities

  • What is it? Outputs the probability of each class.
  • Syntax:
probs = model.predict_proba(X_test)
  • Explanation:
    • Each row contains one probability per class, in the order of model.classes_.
    • Used for ROC curves and threshold tuning (see the sketch below).
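
A minimal sketch of probability output and a custom decision threshold (the 0.3 cutoff is an arbitrary illustrative value, not a recommendation):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

probs = model.predict_proba(X_test)   # shape (n_samples, 2); column order follows model.classes_
print(probs[:3])

# Threshold tuning: predict the positive class whenever its probability exceeds 0.3
custom_pred = (probs[:, 1] >= 0.3).astype(int)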

5. Evaluate Accuracy

  • What is it? Measures how often the model predicts correctly.
  • Syntax:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
  • Explanation:
    • Compares predicted and actual labels.
    • Good for balanced datasets; prefer F1 or ROC-AUC for imbalanced ones (see the sketch below).
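
A sketch of why accuracy alone can mislead on imbalanced data, using a synthetic dataset with roughly a 9:1 class ratio (generated here purely for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))   # can look high just by favoring the majority class
print("F1      :", f1_score(y_test, y_pred))         # focuses on the minority (positive) class
print("ROC-AUC :", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))  # threshold-independent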

Real-Life Project: Spam Email Classification

Project Name

Spam Detector with Logistic Regression

Project Overview

This project classifies email messages as spam or not spam based on word frequencies and text features. Logistic regression offers a fast, interpretable, and effective solution.

Project Goal

  • Transform email text into numeric features
  • Train logistic model on labeled dataset
  • Evaluate prediction quality on unseen messages

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load data
data = pd.read_csv('emails.csv')
X = data['message']
y = data['label']  # 0 = not spam, 1 = spam

# Text to numeric
vectorizer = CountVectorizer()
X_vec = vectorizer.fit_transform(X)

# Split
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42)

# Train
model = LogisticRegression(max_iter=1000)  # extra iterations help the solver converge on high-dimensional text features
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Expected Output

  • Accuracy score
  • Precision, recall, and F1-score report
  • Working binary spam classifier

Common Mistakes to Avoid

  • ❌ Not scaling numeric features if present
  • ❌ Using wrong solver for large datasets
  • ❌ Ignoring precision/recall on imbalanced data
  • ❌ Overfitting by including too many irrelevant features

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon

Also explore: