Logistic regression is a fundamental classification algorithm that models the probability of class membership using a logistic (sigmoid) function. Despite its name, logistic regression is used for binary and multi-class classification tasks. Scikit-learn offers a robust implementation through the LogisticRegression class.
Key Characteristics of Logistic Regression
- Classification, Not Regression: Used for binary or multi-class classification.
- Outputs Probabilities: Estimates the likelihood of each class.
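- Sigmoid Function: Converts a linear combination of the inputs into a probability between 0 and 1.
- Interpretable Coefficients: Each weight is the change in the log-odds of the positive class per unit increase in that feature; magnitudes are comparable across features only when the features share a scale.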
- Supports Regularization: Includes L1 and L2 penalties for generalization.
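To make the sigmoid step concrete, here is a minimal sketch in plain NumPy. The weights, bias, and input row are hypothetical values chosen only for illustration; they are not learned from any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one input row (illustration only)
weights = np.array([0.8, -0.5])
bias = 0.1
x = np.array([2.0, 1.0])

# Linear combination of the inputs: w . x + b
z = np.dot(weights, x) + bias   # 0.8*2 - 0.5*1 + 0.1 = 1.2
p = sigmoid(z)                  # probability of the positive class

print(f"score z = {z:.2f}, P(y=1 | x) = {p:.3f}")
# prints: score z = 1.20, P(y=1 | x) = 0.769
```

A score of 0 maps to a probability of exactly 0.5, which is why the default decision boundary sits where the linear combination changes sign.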
Basic Rules for Logistic Regression
- Target variable should be categorical (e.g., 0/1, or class labels).
- Scale features for better convergence.
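- For multi-class, use multi_class='multinomial' (recent scikit-learn releases deprecate this parameter and apply the multinomial formulation by default).
- Use solver='liblinear', 'saga', or 'lbfgs' depending on dataset size and penalty.
- Evaluate using metrics like accuracy, precision, recall, and ROC-AUC.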
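The rules above can be sketched together in one short script. This example assumes the built-in Iris dataset (three classes) as a stand-in for your own data, and it combines scaling and the classifier with make_pipeline, which is one reasonable choice rather than the only one. The multi_class argument is left out on purpose, because it is deprecated in newer scikit-learn releases.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Iris has three classes, so this is a multi-class problem
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fitted on the training split only, then reused on the test split
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("Probability columns per sample:", clf.predict_proba(X_test).shape[1])
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, which matters as soon as you add cross-validation.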
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Model | from sklearn.linear_model import LogisticRegression | Loads logistic regression class |
| 2 | Create Model | LogisticRegression() | Initializes logistic classifier |
| 3 | Train Model | model.fit(X_train, y_train) | Fits model to training data |
| 4 | Predict Labels | model.predict(X_test) | Predicts class labels |
| 5 | Predict Probabilities | model.predict_proba(X_test) | Gives class probabilities |
| 6 | Evaluate Accuracy | accuracy_score(y_test, y_pred) | Measures classification performance |
Syntax Explanation
1. Import and Initialize Model
- What is it? Loads the logistic regression model for classification.
- Syntax:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
- Explanation:
- Supports binary and multi-class classification.
- You can set the regularization type with penalty and the optimization method with solver.
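As a rough illustration of how penalty and solver interact, the sketch below fits an L2 and an L1 model on synthetic data. The dataset and the value C=0.1 are assumptions made for the demo. Parameter handling can differ between scikit-learn versions, so check the documentation for your installed release.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 4 of which carry signal (illustrative values)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

# L2 penalty with lbfgs (both are the library defaults, written out for clarity)
l2_model = LogisticRegression(penalty="l2", C=0.1, solver="lbfgs", max_iter=1000)
# L1 penalty requires a solver that supports it, such as liblinear or saga
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")

l2_model.fit(X, y)
l1_model.fit(X, y)

# L1 tends to drive some coefficients to exactly zero; L2 only shrinks them
print("Zero coefficients with L2:", int((l2_model.coef_ == 0).sum()))
print("Zero coefficients with L1:", int((l1_model.coef_ == 0).sum()))
```

Smaller values of C mean stronger regularization, so lowering C with the L1 model usually zeroes out more coefficients.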
2. Fit the Model
- What is it? Trains the model on labeled data.
- Syntax:
model.fit(X_train, y_train)
- Explanation:
- Learns the coefficients of the logistic model.
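- Maps the linear score to a probability with the sigmoid function (the inverse of the logit) and fits the coefficients by minimizing the regularized log loss.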
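To see what fitting produces, this sketch trains on a small synthetic binary dataset (an assumption for the demo) and then rebuilds the positive-class probability by hand from coef_ and intercept_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary dataset (illustrative only)
X_train, y_train = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=1
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Learned parameters: one weight per feature plus an intercept
print("Coefficients:", model.coef_[0])
print("Intercept:", model.intercept_[0])

# Recompute the positive-class probability by hand: sigmoid(X @ w + b)
scores = X_train @ model.coef_[0] + model.intercept_[0]
manual_probs = 1.0 / (1.0 + np.exp(-scores))

# The manual result should match predict_proba for the positive class
print(np.allclose(manual_probs, model.predict_proba(X_train)[:, 1]))
```

The check holds for the binary case; with three or more classes the model uses a softmax over one score per class instead of a single sigmoid.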
3. Predict Class Labels
- What is it? Predicts the most likely class for new data.
- Syntax:
y_pred = model.predict(X_test)
- Explanation:
- Returns one class label per sample: 0 or 1 for binary problems, or one of the known class labels for multi-class.
- Useful for final decisions.
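The following sketch, again on synthetic data chosen for illustration, shows that predict returns labels from model.classes_ and that those labels correspond to the most probable class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# predict returns one label per row, drawn from model.classes_
print("Known classes:", model.classes_)
print("First five predictions:", y_pred[:5])

# predict picks the class with the highest predicted probability
probs = model.predict_proba(X_test)
from_probs = model.classes_[np.argmax(probs, axis=1)]
print("Matches argmax of predict_proba:", np.array_equal(y_pred, from_probs))
```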
4. Predict Probabilities
- What is it? Outputs the probability of each class.
- Syntax:
probs = model.predict_proba(X_test)
- Explanation:
- Each row contains probabilities for each class.
- Used in ROC curves and threshold tuning.
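A short sketch of both uses follows. The synthetic dataset and the 0.7 threshold are hypothetical; in practice the threshold would be tuned on validation data for your own cost trade-off.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)

# One row per sample, one column per class; each row sums to 1
print("Shape:", probs.shape)

# Column 1 holds the probability of the positive class (label 1)
pos_probs = probs[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, pos_probs))

# Custom threshold: flag a sample as positive only when P(y=1) >= 0.7
threshold = 0.7  # hypothetical value; tune it on validation data
strict_pred = (pos_probs >= threshold).astype(int)
print("Positives at 0.5:", int((pos_probs >= 0.5).sum()))
print("Positives at 0.7:", int(strict_pred.sum()))
```

Raising the threshold can only reduce the number of predicted positives, which typically trades recall for precision.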
5. Evaluate Accuracy
- What is it? Measures how often the model predicts correctly.
- Syntax:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
- Explanation:
- Compares predicted and actual labels.
- Good for balanced datasets; use F1 or ROC-AUC for imbalanced ones.
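The point about imbalanced data is easy to show with made-up labels. In this sketch a "model" that always predicts the majority class still reaches 95% accuracy, while its F1 score is zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_test = [0] * 95 + [1] * 5

# A useless "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_test, y_pred)
# zero_division=0 returns 0.0 instead of warning when no positives are predicted
f1 = f1_score(y_test, y_pred, zero_division=0)

print("Accuracy:", accuracy)  # 0.95, although no positive is ever found
print("F1 score:", f1)        # 0.0
```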
Real-Life Project: Spam Email Classification
Project Name
Spam Detector with Logistic Regression
Project Overview
This project classifies email messages as spam or not spam based on word frequencies and text features. Logistic regression offers a fast, interpretable, and effective solution.
Project Goal
- Transform email text into numeric features
- Train logistic model on labeled dataset
- Evaluate prediction quality on unseen messages
Code for This Project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Load data
data = pd.read_csv('emails.csv')
X = data['message']
y = data['label'] # 0 = not spam, 1 = spam
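# Split the raw text first so the vectorizer never sees the test messages
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Text to numeric: learn the vocabulary on the training split only
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)
# Train (a higher max_iter helps the solver converge on sparse word counts)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)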
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Expected Output
- Accuracy score
- Precision, recall, and F1-score report
- Working binary spam classifier
Common Mistakes to Avoid
- ❌ Not scaling numeric features if present
- ❌ Using wrong solver for large datasets
- ❌ Ignoring precision/recall on imbalanced data
- ❌ Overfitting by including too many irrelevant features
Further Reading Recommendation
- 🔗 Scikit-learn LogisticRegression Docs: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- 🔗 ROC and AUC metrics explained (YouTube/Kaggle)
- 🔗 Comparing classifiers in Scikit-learn
