Naive Bayes Classifier in Scikit-learn

Naive Bayes is a family of simple yet powerful probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between features. Despite its simplicity, it performs remarkably well on text classification and spam filtering tasks.
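
Formally, for features x₁, …, xₙ, the classifier picks the class y that maximizes the posterior probability, which under the independence assumption factorizes as P(y | x₁, …, xₙ) ∝ P(y) · P(x₁ | y) · … · P(xₙ | y).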

Key Characteristics of Naive Bayes

  • Probabilistic Model: Computes class probabilities based on feature likelihoods.
  • Fast and Scalable: Suitable for large datasets.
  • Works Well for Text: Ideal for word count or TF-IDF features.
  • Assumes Feature Independence: May underperform when features are correlated.
  • Variants Available: Includes GaussianNB, MultinomialNB, and BernoulliNB (see the import sketch after this list).
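
The three variants can all be imported from sklearn.naive_bayes. The minimal sketch below (assuming only that scikit-learn is installed) notes the feature type each one expects:

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gaussian = GaussianNB()        # continuous features, modeled as per-class normal distributions
multinomial = MultinomialNB()  # discrete counts or TF-IDF values
bernoulli = BernoulliNB()      # binary (present/absent) features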

Basic Rules for Using Naive Bayes

  • Use MultinomialNB for text data with count or TF-IDF features.
  • Use GaussianNB for continuous features that are roughly normally distributed (see the sketch after this list).
  • Preprocess categorical features into numeric form (e.g., with OneHotEncoder).
  • Features should be approximately conditionally independent given the class.
  • Avoid GaussianNB on very sparse data with mostly zeros; it requires dense arrays, and sparse text features suit MultinomialNB better.
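
A minimal sketch of the GaussianNB rule on continuous data, using scikit-learn's built-in iris dataset (the split settings here are illustrative, not prescriptive):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris has four continuous measurements per flower, a natural fit for GaussianNB
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out split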

Syntax Table

SL NO | Function              | Syntax Example                                | Description
1     | Import Classifier     | from sklearn.naive_bayes import MultinomialNB | Imports Naive Bayes for discrete data
2     | Instantiate Model     | model = MultinomialNB()                       | Initializes the model
3     | Fit Model             | model.fit(X_train, y_train)                   | Trains the classifier
4     | Predict Labels        | model.predict(X_test)                         | Predicts class labels
5     | Predict Probabilities | model.predict_proba(X_test)                   | Returns class probabilities

Syntax Explanation

1. Import and Instantiate

  • What is it? Loads and creates a Naive Bayes classifier.
  • Syntax:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
  • Explanation:
    • Suitable for features like word counts.
    • The alpha parameter controls additive (Laplace) smoothing; the default is alpha=1.0.
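
For instance, smoothing strength is set at instantiation; the 0.5 below is an arbitrary illustration, not a recommendation:

from sklearn.naive_bayes import MultinomialNB

# Smaller alpha means weaker smoothing; the default alpha=1.0 is classic Laplace smoothing
model = MultinomialNB(alpha=0.5)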

2. Fit the Model

  • What is it? Trains the classifier with feature-label pairs.
  • Syntax:
model.fit(X_train, y_train)
  • Explanation:
    • Learns the class priors and per-feature likelihoods from the training data.
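
As an illustration of what fit learns, here is a minimal sketch on a made-up word-count matrix; class_log_prior_ and feature_log_prob_ expose the fitted log-priors and log-likelihoods:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 4 documents x 3 vocabulary terms
X_train = np.array([[2, 1, 0],
                    [3, 0, 0],
                    [0, 1, 4],
                    [0, 2, 3]])
y_train = np.array(['sports', 'sports', 'tech', 'tech'])

model = MultinomialNB()
model.fit(X_train, y_train)
print(model.class_log_prior_)   # learned log P(class)
print(model.feature_log_prob_)  # learned log P(term | class)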

3. Predict Labels

  • What is it? Predicts the class labels of new data.
  • Syntax:
y_pred = model.predict(X_test)
  • Explanation:
    • Chooses the class with the highest posterior probability.
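
Continuing the toy model from the previous step, predict is equivalent to taking the most probable class from predict_proba:

X_new = np.array([[1, 0, 3]])          # hypothetical unseen document
print(model.predict(X_new))            # e.g. ['tech']
probs = model.predict_proba(X_new)
print(model.classes_[probs.argmax()])  # same class, via an explicit argmax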

4. Predict Probabilities

  • What is it? Returns predicted class probabilities.
  • Syntax:
probs = model.predict_proba(X_test)
  • Explanation:
    • Outputs the posterior probability of each class, in the column order given by model.classes_.
    • Useful for probabilistic thresholding.
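
As an example of probabilistic thresholding (continuing the toy model; the 0.8 cutoff is arbitrary), predictions below the cutoff could be routed to manual review:

probs = model.predict_proba(X_new)
labels = model.classes_[probs.argmax(axis=1)]  # most probable class per sample
confident = probs.max(axis=1) >= 0.8           # arbitrary confidence cutoff
print(labels[confident])                       # keep only the confident predictions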

Real-Life Project: News Article Categorization

Project Name

Text Classification Using Naive Bayes

Project Overview

Use a Naive Bayes model to classify news articles into topics using TF-IDF features from the text.

Project Goal

  • Transform text into feature vectors
  • Train a classifier to predict categories
  • Evaluate accuracy and precision

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
data = pd.read_csv('news.csv')
X = data['text']
y = data['category']

# Split data (split the raw text first so TF-IDF statistics come from training data only)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature extraction: fit on training text, then apply the same vocabulary to test text
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train model
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Predict
y_pred = model.predict(X_test_vec)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Expected Output

  • High accuracy for topic classification
  • Detailed classification report with precision and recall
  • Efficient model suitable for text pipelines

Common Mistakes to Avoid

  • ❌ Using MultinomialNB for continuous data
  • ❌ Disabling Laplace smoothing (alpha=0), which lets unseen words zero out class probabilities; the default is alpha=1
  • ❌ Forgetting to vectorize text before fitting
  • ❌ Applying on highly correlated numeric features

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon

Also explore: