Naive Bayes is a family of simple yet powerful probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between features. Despite its simplicity, it performs remarkably well on text classification and spam filtering tasks.
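To make the idea concrete, here is a minimal toy computation (made-up numbers, no scikit-learn involved) showing that the posterior for a class is proportional to the class prior times the product of per-feature likelihoods:

# Toy Bayes computation: P(spam | words) ∝ P(spam) * Π P(word | spam)
p_spam, p_ham = 0.4, 0.6                          # class priors
p_word_given_spam = {'free': 0.30, 'win': 0.20}   # assumed likelihoods
p_word_given_ham  = {'free': 0.05, 'win': 0.02}

words = ['free', 'win']
score_spam, score_ham = p_spam, p_ham
for w in words:                       # naive assumption: multiply per-word likelihoods
    score_spam *= p_word_given_spam[w]
    score_ham  *= p_word_given_ham[w]

total = score_spam + score_ham        # normalize to get posteriors
print('P(spam | words) =', score_spam / total)   # ≈ 0.976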
Key Characteristics of Naive Bayes
- Probabilistic Model: Computes class probabilities based on feature likelihoods.
- Fast and Scalable: Suitable for large datasets.
- Works Well for Text: Ideal for word count or TF-IDF features.
- Assumes Feature Independence: May underperform when features are correlated.
- Variants Available: Includes GaussianNB, MultinomialNB, and BernoulliNB (see the sketch after this list).
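A quick sketch, using made-up toy arrays, of which variant fits which kind of feature:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# GaussianNB: continuous measurements
X_cont = np.array([[5.1, 3.5], [6.2, 2.9], [4.8, 3.1], [6.9, 3.2]])
# MultinomialNB: non-negative counts (e.g., word counts)
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 2], [0, 0, 4]])
# BernoulliNB: binary presence/absence features
X_bin = (X_counts > 0).astype(int)
y = np.array([0, 1, 0, 1])

print(GaussianNB().fit(X_cont, y).predict(X_cont))
print(MultinomialNB().fit(X_counts, y).predict(X_counts))
print(BernoulliNB().fit(X_bin, y).predict(X_bin))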
Basic Rules for Using Naive Bayes
- Use MultinomialNB for text data with count or TF-IDF features.
- Use GaussianNB for continuous features with normal distribution.
- Preprocess categorical features into numeric form (see the encoding sketch after this list).
- Features should be approximately conditionally independent given the class.
- Avoid GaussianNB on very sparse, zero-heavy data; its normality assumption fits such features poorly.
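As a sketch of the encoding rule above (the column names are hypothetical, and the sparse_output argument assumes scikit-learn ≥ 1.2), categorical features can be one-hot encoded before fitting:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import BernoulliNB

# Hypothetical categorical data
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                   'size':  ['S', 'M', 'L', 'M']})
y = [0, 1, 0, 1]

# One-hot encode into binary indicator columns
enc = OneHotEncoder(sparse_output=False)
X = enc.fit_transform(df)

model = BernoulliNB()    # binary indicators suit the Bernoulli variant
model.fit(X, y)
print(model.predict(X))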
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Classifier | from sklearn.naive_bayes import MultinomialNB | Imports Naive Bayes for discrete data |
| 2 | Instantiate Model | model = MultinomialNB() | Initializes the model |
| 3 | Fit Model | model.fit(X_train, y_train) | Trains the classifier |
| 4 | Predict Labels | model.predict(X_test) | Predicts class labels |
| 5 | Predict Probabilities | model.predict_proba(X_test) | Returns class probabilities |
Syntax Explanation
1. Import and Instantiate
- What is it? Loads and creates a Naive Bayes classifier.
- Syntax:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
- Explanation:
- Suitable for features like word counts.
- The alpha parameter controls Laplace smoothing (alpha=1.0 by default).
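For example, the smoothing strength can be set at construction time (the value 0.5 here is arbitrary):

from sklearn.naive_bayes import MultinomialNB

# alpha adds pseudo-counts so unseen words don't get zero probability
model = MultinomialNB(alpha=0.5)   # default is alpha=1.0 (Laplace smoothing)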
2. Fit the Model
- What is it? Trains the classifier with feature-label pairs.
- Syntax:
model.fit(X_train, y_train)
- Explanation:
- Learns prior and likelihood from training data.
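After fitting, the learned log-priors and per-feature log-likelihoods are exposed as attributes; a small sketch with toy count data:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X_train = np.array([[2, 1, 0], [0, 1, 3], [1, 0, 0], [0, 2, 2]])
y_train = np.array([0, 1, 0, 1])

model = MultinomialNB()
model.fit(X_train, y_train)
print(model.class_log_prior_)    # log P(class)
print(model.feature_log_prob_)   # log P(feature | class)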
3. Predict Labels
- What is it? Predicts the class labels of new data.
- Syntax:
y_pred = model.predict(X_test)
- Explanation:
- Chooses the class with the highest posterior probability.
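Equivalently, the predicted label is the argmax over the columns of predict_proba, as this self-contained toy check shows:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[2, 1, 0], [0, 1, 3], [1, 0, 0], [0, 2, 2]])
y = np.array([0, 1, 0, 1])
model = MultinomialNB().fit(X, y)

probs = model.predict_proba(X)                  # shape (n_samples, n_classes)
manual = model.classes_[probs.argmax(axis=1)]   # pick highest-posterior class
assert (manual == model.predict(X)).all()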
4. Predict Probabilities
- What is it? Returns predicted class probabilities.
- Syntax:
probs = model.predict_proba(X_test)
- Explanation:
- Outputs the posterior probability of each class.
- Useful for probabilistic thresholding.
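For instance, in a binary setting one might flag a sample only when its positive-class probability clears a custom threshold (the 0.8 cutoff and toy data here are arbitrary):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[3, 0, 1], [0, 2, 4], [2, 1, 0], [0, 1, 3]])
y = np.array([0, 1, 0, 1])
model = MultinomialNB().fit(X, y)

probs = model.predict_proba(X)[:, 1]   # probability of class 1
flagged = (probs >= 0.8).astype(int)   # stricter than the default 0.5 argmax
print(flagged)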
Real-Life Project: News Article Categorization
Project Name
Text Classification Using Naive Bayes
Project Overview
Use a Naive Bayes model to classify news articles into topics using TF-IDF features from the text.
Project Goal
- Transform text into feature vectors
- Train a classifier to predict categories
- Evaluate accuracy and precision
Code for This Project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
data = pd.read_csv('news.csv')
X = data['text']
y = data['category']
# Split raw text first so the vectorizer is fitted only on training documents (avoids leakage)
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Feature extraction
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)
# Train model
model = MultinomialNB()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Expected Output
- High accuracy for topic classification
- Detailed classification report with precision and recall
- Efficient model suitable for text pipelines
Common Mistakes to Avoid
- ❌ Using MultinomialNB for continuous data
- ❌ Ignoring Laplace smoothing (alpha=1 by default)
- ❌ Forgetting to vectorize text before fitting
- ❌ Applying it to highly correlated numeric features
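As a quick illustration of the first mistake, MultinomialNB rejects negative feature values, while GaussianNB handles continuous data (toy arrays):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

X = np.array([[-1.2, 0.5], [0.3, -0.7], [1.1, 0.9], [-0.4, 1.3]])
y = np.array([0, 1, 0, 1])

try:
    MultinomialNB().fit(X, y)              # raises: negative values not allowed
except ValueError as e:
    print('MultinomialNB failed:', e)

print(GaussianNB().fit(X, y).predict(X))   # GaussianNB handles continuous data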
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
Also explore:
- 🔗 Scikit-learn Naive Bayes Docs: https://scikit-learn.org/stable/modules/naive_bayes.html
- 🔗 Text Classification Projects on Kaggle
- 🔗 NLP Pipelines using Scikit-learn
