Naive Bayes is a family of simple yet powerful probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between features. Despite its simplicity, it performs remarkably well on text classification and spam filtering tasks.
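To make the idea concrete, here is a minimal toy computation (made-up numbers, no scikit-learn involved) showing that the posterior for a class is proportional to the class prior times the product of per-feature likelihoods:

# Toy Bayes computation: P(spam | words) ∝ P(spam) * Π P(word | spam)
p_spam, p_ham = 0.4, 0.6                          # class priors
p_word_given_spam = {'free': 0.30, 'win': 0.20}   # assumed likelihoods
p_word_given_ham  = {'free': 0.05, 'win': 0.02}

words = ['free', 'win']
score_spam, score_ham = p_spam, p_ham
for w in words:                       # naive assumption: multiply per-word likelihoods
    score_spam *= p_word_given_spam[w]
    score_ham  *= p_word_given_ham[w]

total = score_spam + score_ham        # normalize to get posteriors
print('P(spam | words) =', score_spam / total)   # ≈ 0.976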
Key Characteristics of Naive Bayes
- Probabilistic Model: Computes class probabilities based on feature likelihoods.
- Fast and Scalable: Suitable for large datasets.
- Works Well for Text: Ideal for word count or TF-IDF features.
- Assumes Feature Independence: May underperform when features are correlated.
- Variants Available: Includes GaussianNB, MultinomialNB, and BernoulliNB (see the sketch after this list).
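A quick sketch, using made-up toy arrays, of which variant fits which kind of feature:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# GaussianNB: continuous measurements
X_cont = np.array([[5.1, 3.5], [6.2, 2.9], [4.8, 3.1], [6.9, 3.2]])
# MultinomialNB: non-negative counts (e.g., word counts)
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 2], [0, 0, 4]])
# BernoulliNB: binary presence/absence features
X_bin = (X_counts > 0).astype(int)
y = np.array([0, 1, 0, 1])

print(GaussianNB().fit(X_cont, y).predict(X_cont))
print(MultinomialNB().fit(X_counts, y).predict(X_counts))
print(BernoulliNB().fit(X_bin, y).predict(X_bin))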
Basic Rules for Using Naive Bayes
- Use MultinomialNB for text data with count or TF-IDF features.
- Use GaussianNB for continuous features with normal distribution.
- Preprocess categorical features into numeric form (see the encoding sketch after this list).
- Features should be approximately conditionally independent given the class.
- Avoid GaussianNB on very sparse, zero-heavy data; its normality assumption fits such features poorly.
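As a sketch of the encoding rule above (the column names are hypothetical, and the sparse_output argument assumes scikit-learn ≥ 1.2), categorical features can be one-hot encoded before fitting:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import BernoulliNB

# Hypothetical categorical data
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                   'size':  ['S', 'M', 'L', 'M']})
y = [0, 1, 0, 1]

# One-hot encode into binary indicator columns
enc = OneHotEncoder(sparse_output=False)
X = enc.fit_transform(df)

model = BernoulliNB()    # binary indicators suit the Bernoulli variant
model.fit(X, y)
print(model.predict(X))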
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Classifier | from sklearn.naive_bayes import MultinomialNB | Imports Naive Bayes for discrete data |
| 2 | Instantiate Model | model = MultinomialNB() | Initializes the model |
| 3 | Fit Model | model.fit(X_train, y_train) | Trains the classifier |
| 4 | Predict Labels | model.predict(X_test) | Predicts class labels |
| 5 | Predict Probabilities | model.predict_proba(X_test) | Returns class probabilities |
Syntax Explanation
1. Import and Instantiate
- What is it? Loads and creates a Naive Bayes classifier.
- Syntax:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
- Explanation:
- Suitable for features like word counts.
- The alpha parameter controls Laplace smoothing (alpha=1.0 by default).
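For example, the smoothing strength can be set at construction time (the value 0.5 here is arbitrary):

from sklearn.naive_bayes import MultinomialNB

# alpha adds pseudo-counts so unseen words don't get zero probability
model = MultinomialNB(alpha=0.5)   # default is alpha=1.0 (Laplace smoothing)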
2. Fit the Model
- What is it? Trains the classifier with feature-label pairs.
- Syntax:
model.fit(X_train, y_train)
- Explanation:
- Learns prior and likelihood from training data.
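After fitting, the learned log-priors and per-feature log-likelihoods are exposed as attributes; a small sketch with toy count data:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X_train = np.array([[2, 1, 0], [0, 1, 3], [1, 0, 0], [0, 2, 2]])
y_train = np.array([0, 1, 0, 1])

model = MultinomialNB()
model.fit(X_train, y_train)
print(model.class_log_prior_)    # log P(class)
print(model.feature_log_prob_)   # log P(feature | class)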
3. Predict Labels
- What is it? Predicts the class labels of new data.
- Syntax:
y_pred = model.predict(X_test)
- Explanation:
- Chooses the class with the highest posterior probability.
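Equivalently, the predicted label is the argmax over the columns of predict_proba, as this self-contained toy check shows:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[2, 1, 0], [0, 1, 3], [1, 0, 0], [0, 2, 2]])
y = np.array([0, 1, 0, 1])
model = MultinomialNB().fit(X, y)

probs = model.predict_proba(X)                  # shape (n_samples, n_classes)
manual = model.classes_[probs.argmax(axis=1)]   # pick highest-posterior class
assert (manual == model.predict(X)).all()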
4. Predict Probabilities
- What is it? Returns predicted class probabilities.
- Syntax:
probs = model.predict_proba(X_test)
- Explanation:
- Outputs the posterior probability of each class.
- Useful for probabilistic thresholding.
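For instance, in a binary setting one might flag a sample only when its positive-class probability clears a custom threshold (the 0.8 cutoff and toy data here are arbitrary):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[3, 0, 1], [0, 2, 4], [2, 1, 0], [0, 1, 3]])
y = np.array([0, 1, 0, 1])
model = MultinomialNB().fit(X, y)

probs = model.predict_proba(X)[:, 1]   # probability of class 1
flagged = (probs >= 0.8).astype(int)   # stricter than the default 0.5 argmax
print(flagged)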
Real-Life Project: News Article Categorization
Project Name
Text Classification Using Naive Bayes
Project Overview
Use a Naive Bayes model to classify news articles into topics using TF-IDF features from the text.
Project Goal
- Transform text into feature vectors
- Train a classifier to predict categories
- Evaluate accuracy and precision
Code for This Project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
data = pd.read_csv('news.csv')
X = data['text']
y = data['category']
# Split raw text first so the vectorizer is fitted only on training documents (avoids leakage)
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Feature extraction
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)
# Train model
model = MultinomialNB()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Expected Output
- High accuracy for topic classification
- Detailed classification report with precision and recall
- Efficient model suitable for text pipelines
Common Mistakes to Avoid
- ❌ Using MultinomialNB for continuous data
- ❌ Ignoring Laplace smoothing (alpha=1 by default)
- ❌ Forgetting to vectorize text before fitting
- ❌ Applying it to highly correlated numeric features
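As a quick illustration of the first mistake, MultinomialNB rejects negative feature values, while GaussianNB handles continuous data (toy arrays):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

X = np.array([[-1.2, 0.5], [0.3, -0.7], [1.1, 0.9], [-0.4, 1.3]])
y = np.array([0, 1, 0, 1])

try:
    MultinomialNB().fit(X, y)              # raises: negative values not allowed
except ValueError as e:
    print('MultinomialNB failed:', e)

print(GaussianNB().fit(X, y).predict(X))   # GaussianNB handles continuous data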
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
Also explore:
- 🔗 Scikit-learn Naive Bayes Docs: https://scikit-learn.org/stable/modules/naive_bayes.html
- 🔗 Text Classification Projects on Kaggle
- 🔗 NLP Pipelines using Scikit-learn
