News classification is a supervised machine learning task where the goal is to assign news articles or headlines to predefined categories such as sports, politics, business, and technology. Scikit-learn provides a convenient framework to preprocess, vectorize, and model text data using pipelines.
Key Characteristics
- Works with both short text (e.g., headlines) and full-length articles
- Suitable for real-time applications like content filtering or topic tagging
- Involves data preprocessing, text vectorization, and model evaluation
- Scikit-learn supports multiple models including Naive Bayes, SVM, and Logistic Regression
Basic Rules
- Always clean and preprocess text (e.g., lowercasing, stopword removal)
- Use TfidfVectorizer or CountVectorizer for numerical representation
- Choose the right model based on your dataset size and complexity
- Split data into training and testing to evaluate generalization
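The first two rules can be sketched with the vectorizers' own options. This is a minimal example, assuming English text and scikit-learn's built-in English stop-word list; the two headlines are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy headlines, used only to show the effect of the options.
headlines = ["The Government passes a tax reform", "Apple unveils the new iPhone"]

# lowercase=True is the default; stop_words="english" drops common words such as "the".
count_vec = CountVectorizer(lowercase=True, stop_words="english")
tfidf_vec = TfidfVectorizer(lowercase=True, stop_words="english")

counts = count_vec.fit_transform(headlines)   # raw term counts
weights = tfidf_vec.fit_transform(headlines)  # TF-IDF weights, L2-normalized per row

print(sorted(count_vec.vocabulary_))          # the terms that survived cleaning
print(counts.shape, weights.shape)            # same shape: documents x terms
```

Both vectorizers produce a sparse matrix of the same shape here; they differ only in how each cell is weighted.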
Syntax Table
No. | Technique | Syntax Example | Description |
---|---|---|---|
1 | Import Modules | from sklearn.feature_extraction.text import TfidfVectorizer | Load TF-IDF vectorizer |
2 | Vectorize Text | X = TfidfVectorizer().fit_transform(corpus) | Convert text into numeric features |
3 | Train/Test Split | train_test_split(X, y, test_size=0.3) | Prepare train and test sets |
4 | Train Model | model = MultinomialNB().fit(X_train, y_train) | Train classifier on vectorized text |
5 | Evaluate Accuracy | accuracy_score(y_test, model.predict(X_test)) | Evaluate model performance |
Syntax Explanation
1. Import Modules
What is it?
Imports necessary preprocessing and classification tools from Scikit-learn.
Syntax:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
Explanation:
- Brings in the classes and functions needed to vectorize text, split data, build the model, and evaluate it.
- Importing only the names that are used keeps each step of the workflow explicit and modular.
2. Vectorize Text
What is it?
Converts the text documents into TF-IDF weighted feature vectors.
Syntax:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
Explanation:
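- TfidfVectorizer converts text to feature vectors whose weights grow with a term's frequency in a document and shrink with the number of documents that contain it.
- .fit_transform() learns the vocabulary and applies the transformation.
- Corpus should be preprocessed (e.g., lowercased, punctuation removed). By default, TfidfVectorizer already lowercases text, and its tokenizer ignores punctuation.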
3. Train/Test Split
What is it?
Divides the dataset into training and testing sets.
Syntax:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Explanation:
- Ensures model evaluation on unseen data.
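- random_state=42 makes the split reproducible.
- test_size=0.3 means 30% of data is used for testing.
- Here X is already vectorized for brevity. In practice, split the raw text first and fit the vectorizer on the training portion only, so that no test-set statistics leak into the features.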
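One possible way to split raw text while keeping class proportions similar in both parts is the stratify argument of train_test_split. The sketch below uses placeholder headlines and labels that are purely hypothetical.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten placeholder headlines across two classes.
texts = [f"headline {i}" for i in range(10)]
labels = ["sports"] * 5 + ["politics"] * 5

# stratify=labels keeps the class proportions similar in both splits.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42
)

print(len(text_train), len(text_test))    # 7 training and 3 test headlines
print(Counter(y_train), Counter(y_test))  # class counts per split
```

Splitting the text rather than the feature matrix leaves the vectorizer free to be fitted on text_train alone.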
4. Train Model
What is it?
Fits a Naive Bayes classifier to the training data.
Syntax:
model = MultinomialNB()
model.fit(X_train, y_train)
Explanation:
- MultinomialNB is well suited to text classification with count features; in practice it also works with fractional TF-IDF features.
- Learns the conditional probability of each term given each class.
- Fast and performs well on text datasets.
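The learned probabilities can be inspected through predict_proba. The following is a minimal sketch on three toy headlines; the query headline is made up, and alpha=1.0 simply spells out the default smoothing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data for illustration only.
train_texts = [
    "Apple unveils new iPhone",
    "Government passes tax reform",
    "Championship ends in a tie",
]
train_labels = ["tech", "politics", "sports"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# alpha is the additive (Laplace) smoothing strength; 1.0 is the default.
model = MultinomialNB(alpha=1.0)
model.fit(X_train, train_labels)

# classes_ gives the column order of predict_proba.
X_new = vectorizer.transform(["Government unveils tax plan"])
proba = model.predict_proba(X_new)[0]
for label, p in zip(model.classes_, proba):
    print(f"{label}: {p:.3f}")
print("Predicted:", model.predict(X_new)[0])
```

With so little data the probabilities are close to one another; the point is only to show where they come from.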
5. Evaluate Accuracy
What is it?
Measures how well the model performs on unseen data.
Syntax:
accuracy = accuracy_score(y_test, model.predict(X_test))
Explanation:
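- Compares predicted vs actual labels to compute accuracy, the fraction of test samples classified correctly.
- Higher accuracy generally indicates better model performance, but it can be misleading when classes are imbalanced.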
- Can also use confusion matrix or classification report for deeper insight.
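A confusion matrix and classification report might be produced as sketched below. The label lists are hypothetical stand-ins for y_test and model.predict(X_test).

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels, standing in for y_test and model.predict(X_test).
y_true = ["tech", "tech", "politics", "politics", "sports", "sports"]
y_hat = ["tech", "politics", "politics", "politics", "sports", "tech"]

labels = ["politics", "sports", "tech"]
print("Accuracy:", accuracy_score(y_true, y_hat))

# Rows are true classes, columns are predicted classes, both in the order of `labels`.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Per-class precision, recall, and F1; zero_division=0 avoids warnings
# for classes that are never predicted.
print(classification_report(y_true, y_hat, labels=labels, zero_division=0))
```

Off-diagonal entries of the matrix show which categories are confused with each other, which a single accuracy number hides.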
Real-Life Project: Classifying BBC News Headlines
Project Overview
Classify BBC headlines into business, politics, tech, etc., using TF-IDF and Naive Bayes.
Code Example
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
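# Toy headlines for illustration only (not taken from a real BBC dataset)
corpus = [
    "Apple unveils new iPhone", "Chipmaker unveils new AI processor",
    "Government passes tax reform", "Parliament passes budget reform",
    "Championship ends in a tie", "Striker scores in championship final",
]
y = ["tech", "tech", "politics", "politics", "sports", "sports"]
# Split the raw text first so the vectorizer never sees the test headlines
text_train, text_test, y_train, y_test = train_test_split(
    corpus, y, test_size=0.5, stratify=y, random_state=42)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)  # learn vocabulary and IDF on training text only
X_test = vectorizer.transform(text_test)        # reuse the fitted vocabulary
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))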
Expected Output
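- Accuracy score printed in terminal, in the form Accuracy: followed by a value between 0.0 and 1.0; with only a handful of headlines the number is not a meaningful estimate
- Optional: Confusion matrix or classification report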
Common Mistakes to Avoid
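- ❌ Using fit_transform() on test data (call transform() instead, so the test set is encoded with the vocabulary learned from the training data)
- ❌ Not preprocessing input corpus
- ❌ Leaving imbalanced class distributions unaddressed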
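One way to guard against the first and third mistakes is to wrap the steps in a scikit-learn pipeline. The sketch below is an assumption-laden example rather than a recommended setup: it swaps in LogisticRegression because that estimator accepts class_weight, which MultinomialNB does not, and the headlines are toy data with a deliberate imbalance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, deliberately imbalanced data for illustration only.
train_texts = [
    "Government passes tax reform",
    "Parliament debates annual budget",
    "Minister announces election date",
    "Championship ends in a tie",
]
train_labels = ["politics", "politics", "politics", "sports"]

# fit() fits the vectorizer on the training text only; predict() then
# calls transform() on new text, so fit_transform() never touches test data.
# class_weight="balanced" reweights classes inversely to their frequency.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(train_texts, train_labels)

print(clf.predict(["Championship match ends in a tie"]))
```

Because the pipeline takes raw strings, the same object can be passed to train_test_split results or cross-validation helpers without any manual vectorizing step.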