News Classification Project with Scikit-learn

News classification is a supervised machine learning task in which the goal is to assign news articles or headlines to predefined categories such as sports, politics, business, and technology. Scikit-learn provides a convenient framework for preprocessing, vectorizing, and modeling text data using pipelines.
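As a minimal sketch of the pipeline idea mentioned above, the vectorizer and classifier can be chained with make_pipeline so that raw strings go in and labels come out. The headlines and labels here are invented for illustration and are not a real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for illustration only.
headlines = [
    "Apple unveils new iPhone",
    "Google releases new Android phone",
    "Government passes tax reform",
    "Parliament debates new budget",
    "Championship ends in a tie",
    "Team wins league title",
]
labels = ["tech", "tech", "politics", "politics", "sports", "sports"]

# The pipeline vectorizes and classifies in one object,
# so fit() and predict() both accept raw text.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(headlines, labels)

prediction = clf.predict(["Parliament passes new budget"])
print(prediction[0])
```

Because the vectorizer lives inside the pipeline, it is fitted only on whatever data fit() receives, which helps keep test data out of the vocabulary.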

Key Characteristics

  • Works with both short text (e.g., headlines) and full-length articles
  • Suitable for real-time applications like content filtering or topic tagging
  • Involves data preprocessing, text vectorization, and model evaluation
  • Scikit-learn supports multiple models including Naive Bayes, SVM, and Logistic Regression

Basic Rules

  • Always clean and preprocess text (e.g., lowercasing, stopword removal)
  • Use TfidfVectorizer or CountVectorizer for numerical representation
  • Choose the right model based on your dataset size and complexity
  • Split data into training and test sets to evaluate generalization
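One way to act on the model-choice rule above is to compare a few candidates with cross-validation. The sketch below assumes a tiny invented corpus and three candidate estimators; the scores it prints say nothing about real data and only illustrate the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: three headlines per class so 3-fold
# stratified cross-validation has one example per class in each fold.
headlines = [
    "Apple unveils new iPhone", "Google releases new Android phone", "Startup launches AI chip",
    "Government passes tax reform", "Parliament debates new budget", "Minister resigns after vote",
    "Championship ends in a tie", "Striker scores in cup final", "Team wins league title",
]
labels = ["tech"] * 3 + ["politics"] * 3 + ["sports"] * 3

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

mean_scores = {}
for name, estimator in candidates.items():
    # Each fold refits the vectorizer on its own training part.
    pipeline = make_pipeline(TfidfVectorizer(), estimator)
    scores = cross_val_score(pipeline, headlines, labels, cv=3)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.2f}")
```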

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Import Modules | from sklearn.feature_extraction.text import TfidfVectorizer | Load TF-IDF vectorizer
2 | Vectorize Text | X = TfidfVectorizer().fit_transform(corpus) | Convert text into numeric features
3 | Train/Test Split | train_test_split(X, y, test_size=0.3) | Prepare train and test sets
4 | Train Model | model = MultinomialNB().fit(X_train, y_train) | Train classifier on vectorized text
5 | Evaluate Accuracy | accuracy_score(y_test, model.predict(X_test)) | Evaluate model performance

Syntax Explanation

1. Import Modules

What is it?
Imports necessary preprocessing and classification tools from Scikit-learn.

Syntax:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

Explanation:

  • Brings in all necessary functions to vectorize text, split data, build the model, and evaluate it.
  • Keeps the workflow modular and efficient.

2. Vectorize Text

What is it?
Converts the text documents into TF-IDF weighted feature vectors.

Syntax:

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

Explanation:

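  • TfidfVectorizer tokenizes each document and weights each term by how often it occurs in that document, scaled down for terms that appear in many documents.
  • .fit_transform() learns the vocabulary and document frequencies, then returns the transformed matrix.
  • By default the vectorizer lowercases text and ignores punctuation; further cleaning (e.g., stopword removal) is optional.
  • When evaluating on held-out data, fit the vectorizer on the training text only and call .transform() on the test text.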

3. Train/Test Split

What is it?
Divides the dataset into training and testing sets.

Syntax:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Explanation:

  • Ensures model evaluation on unseen data.
  • random_state=42 makes the split reproducible.
  • test_size=0.3 means 30% of data is used for testing.
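The split can also be applied to raw text rather than to an already vectorized matrix. The sketch below uses placeholder documents and adds stratify, an optional train_test_split argument not used in the original call, to keep class proportions equal in both parts.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder documents and labels, invented for illustration only.
docs = [f"headline {i}" for i in range(20)]
labels = ["tech"] * 10 + ["sports"] * 10

docs_train, docs_test, y_train, y_test = train_test_split(
    docs,
    labels,
    test_size=0.3,      # 30% of the documents go to the test set
    stratify=labels,    # preserve the tech/sports ratio in both parts
    random_state=42,    # make the shuffle reproducible
)

print(len(docs_train), len(docs_test))
print(Counter(y_test))
```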

4. Train Model

What is it?
Fits a Naive Bayes classifier to the training data.

Syntax:

model = MultinomialNB()
model.fit(X_train, y_train)

Explanation:

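  • MultinomialNB is a common baseline for text classification. It is designed for count features, but it usually also works with TF-IDF weights in practice.
  • Estimates a prior for each class and a smoothed per-class likelihood for each term.
  • Fast to train and often performs well on text datasets.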
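A fitted MultinomialNB exposes the class order in classes_ and per-class probabilities through predict_proba. The following sketch, using invented headlines, shows how to read them; the exact probability values depend on the data and are not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, invented for illustration only.
train_docs = [
    "Apple unveils new iPhone",
    "Google releases new Android phone",
    "Government passes tax reform",
    "Parliament debates new budget",
    "Championship ends in a tie",
    "Team wins league title",
]
train_labels = ["tech", "tech", "politics", "politics", "sports", "sports"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

model = MultinomialNB()
model.fit(X_train, train_labels)

# Probabilities are returned in the order given by model.classes_.
X_new = vectorizer.transform(["Team wins championship"])
probabilities = model.predict_proba(X_new)[0]
for label, p in zip(model.classes_, probabilities):
    print(f"{label}: {p:.2f}")
```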

5. Evaluate Accuracy

What is it?
Measures how well the model performs on unseen data.

Syntax:

accuracy = accuracy_score(y_test, model.predict(X_test))

Explanation:

  • Compares predicted and actual labels and returns the fraction that match.
  • Higher accuracy generally indicates better performance, but it can be misleading when classes are imbalanced.
  • A confusion matrix or classification report gives per-class insight.
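The confusion matrix and classification report mentioned above can be produced as follows. The true and predicted labels are hand-written for illustration, so the numbers describe this toy example only.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written labels, invented for illustration only.
y_true = ["tech", "tech", "politics", "politics", "sports", "sports"]
y_pred = ["tech", "politics", "politics", "politics", "sports", "tech"]
labels = ["politics", "sports", "tech"]  # fixes the row/column order

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=labels)

print("Accuracy:", round(accuracy, 3))
print(matrix)
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

Here four of six predictions are correct, and the matrix shows which classes absorb the errors: one tech headline is predicted as politics and one sports headline as tech.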

Real-Life Project: Classifying BBC News Headlines

Project Overview

Classify BBC headlines into business, politics, tech, etc., using TF-IDF and Naive Bayes.

Code Example

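from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative headlines: several per class, so every class can appear
# in both the training set and the test set.
corpus = [
    "Apple unveils new iPhone", "Google releases new Android phone", "Startup launches AI chip",
    "Government passes tax reform", "Parliament debates new budget", "Minister resigns after vote",
    "Championship ends in a tie", "Striker scores in cup final", "Team wins league title",
]
y = ["tech"] * 3 + ["politics"] * 3 + ["sports"] * 3

# Split the raw text first so the vectorizer never sees the test set.
docs_train, docs_test, y_train, y_test = train_test_split(
    corpus, y, test_size=0.3, stratify=y, random_state=42
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs_train)  # learn vocabulary on training text only
X_test = vectorizer.transform(docs_test)        # reuse that vocabulary on test text

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))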

Expected Output

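  • Accuracy score between 0.0 and 1.0 printed in the terminal; on a toy corpus this small, the value depends heavily on which headlines land in the test set and is not a meaningful estimate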
  • Optional: Confusion matrix or classification report

Common Mistakes to Avoid

  • ❌ Calling fit_transform() on test data; fit the vectorizer on the training text and call transform() on the test text
  • ❌ Skipping preprocessing of the input corpus
  • ❌ Ignoring imbalanced class distributions
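One possible way to address imbalance, offered here as an assumption rather than the article's prescribed method, is class weighting. The sketch below computes the weights that scikit-learn's "balanced" mode assigns to an invented 8-to-2 label distribution; estimators such as LogisticRegression accept the same mode through class_weight="balanced".

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 tech headlines, 2 sports headlines.
labels = np.array(["tech"] * 8 + ["sports"] * 2)
classes = np.unique(labels)  # sorted: ['sports', 'tech']

# "balanced" weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = {str(c): float(w) for c, w in zip(classes, weights)}

print(class_weights)
```

The rarer class receives the larger weight, so its errors count more during training.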
