News Classification Project with Scikit-learn

News classification is a supervised machine learning task: the goal is to assign news articles or headlines to predefined categories such as sports, politics, business, or technology. Scikit-learn provides tools to preprocess and vectorize text and to train models on it, and its pipelines chain these steps into a single estimator.

Key Characteristics

Works with both short text (e.g., headlines) and full-length articles
Suitable for real-time applications like content filtering or topic tagging
Involves data preprocessing, text vectorization, and model evaluation
Scikit-learn supports multiple models including Naive Bayes, SVM, and Logistic Regression
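
Because these models share the same fit/predict interface, swapping one for another is a one-line change. The sketch below is illustrative only: the four headlines, their labels, and the query headline are made up, and the default hyperparameters are not tuned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy headlines, invented for illustration only.
headlines = [
    "Striker scores twice in cup final",
    "Coach praises team after league win",
    "Parliament votes on new budget bill",
    "Minister resigns after election defeat",
]
labels = ["sports", "sports", "politics", "politics"]

# Three classifiers that all accept sparse TF-IDF features.
candidates = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

predictions = {}
for name, classifier in candidates.items():
    # Each pipeline vectorizes the raw text, then feeds it to the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(headlines, labels)
    predictions[name] = pipeline.predict(["Team wins the league final"])[0]
    print(name, "->", predictions[name])
```

On a real dataset, the choice between these models is better made by comparing held-out scores than by preference.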

Basic Rules

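Clean and preprocess the text consistently (TfidfVectorizer and CountVectorizer lowercase by default; stop words can be removed with the stop_words parameter)
Use TfidfVectorizer or CountVectorizer to turn text into a numerical representation
Start with a simple, fast model such as Naive Bayes or Logistic Regression, and move to something more complex only if held-out results justify it
Split the data into training and test sets to estimate how well the model generalizes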
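
One way to follow these rules together is to split the raw text first and wrap the vectorizer and classifier in a Pipeline, so the vectorizer is fitted on the training text only. This is a minimal sketch: the eight headlines and the step names "tfidf" and "clf" are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, invented for illustration only.
texts = [
    "Striker scores twice in cup final",
    "Coach praises team after league win",
    "Champions lift trophy after penalty shootout",
    "Injured captain misses league match",
    "Parliament votes on new budget bill",
    "Minister resigns after election defeat",
    "Senate passes tax reform bill",
    "President signs new trade agreement",
]
labels = ["sports"] * 4 + ["politics"] * 4

# Split the raw text, keeping the class balance the same in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

news_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # lowercases by default
    ("clf", MultinomialNB()),
])

# fit() runs fit_transform on the training text; predict() only transforms.
news_pipeline.fit(texts_train, y_train)
y_pred = news_pipeline.predict(texts_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

With only two test headlines the printed accuracy says little; the point is the order of operations, not the score.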

Syntax Table

No.	Technique	Syntax Example	Description
1	Import Modules	`from sklearn.feature_extraction.text import TfidfVectorizer`	Load TF-IDF vectorizer
2	Vectorize Text	`X = TfidfVectorizer().fit_transform(corpus)`	Convert text into numeric features
3	Train/Test Split	`train_test_split(X, y, test_size=0.3)`	Prepare train and test sets
4	Train Model	`model = MultinomialNB().fit(X_train, y_train)`	Train classifier on vectorized text
5	Evaluate Accuracy	`accuracy_score(y_test, model.predict(X_test))`	Evaluate model performance

Syntax Explanation

1. Import Modules

What is it?
Imports necessary preprocessing and classification tools from Scikit-learn.

Syntax:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

Explanation:

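Brings in the classes and functions needed to vectorize text (TfidfVectorizer), split the data (train_test_split), build the model (MultinomialNB), and evaluate it (accuracy_score).
Importing only what the script uses keeps the workflow modular and easy to follow.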

2. Vectorize Text

What is it?
Converts the text documents into TF-IDF weighted feature vectors.

Syntax:

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

Explanation:

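TfidfVectorizer converts each document into a sparse vector of TF-IDF weights, so terms that are frequent in a document but rare across the corpus receive higher weights.
.fit_transform() learns the vocabulary and IDF weights from the corpus and returns the transformed matrix in one step.
By default the vectorizer lowercases the text and its tokenizer ignores punctuation and single-character tokens; any further cleaning should be applied to the corpus beforehand.
When the model will be evaluated on a test set, fit the vectorizer on the training text only and call .transform() on the test text.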

3. Train/Test Split

What is it?
Divides the dataset into training and testing sets.

Syntax:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Explanation:

Ensures model evaluation on unseen data.
random_state=42 makes the split reproducible.
test_size=0.3 means 30% of data is used for testing.
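
When some categories are rarer than others, a purely random split can leave a class under-represented in the test set. A possible refinement is the stratify argument of train_test_split, sketched here on made-up placeholder documents with an 8-to-4 class imbalance.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical placeholder documents: 8 sports and 4 politics items.
docs = [f"headline {i}" for i in range(12)]
labels = ["sports"] * 8 + ["politics"] * 4

docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=42, stratify=labels
)

# Both parts keep the original 2:1 ratio between the classes.
print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

Stratification needs at least two examples of every class, so very rare categories may have to be merged or collected further before splitting.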

4. Train Model

What is it?
Fits a Naive Bayes classifier to the training data.

Syntax:

model = MultinomialNB()
model.fit(X_train, y_train)

Explanation:

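MultinomialNB is a common baseline for text classification. It is designed for count features, although TF-IDF features also tend to work in practice.
Learns a prior for each class and the conditional probability of each term given the class.
Trains quickly and often performs well on text datasets.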
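
The learned probabilities can be inspected through predict_proba. The following sketch uses CountVectorizer, since raw counts are what the multinomial model describes; the training headlines and the query headline are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy headlines, invented for illustration only.
train_texts = [
    "Striker scores twice in cup final",
    "Coach praises team after league win",
    "Parliament votes on new budget bill",
    "Minister resigns after election defeat",
]
train_labels = ["sports", "sports", "politics", "politics"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# alpha=1.0 is Laplace smoothing, which is also the default.
model = MultinomialNB(alpha=1.0)
model.fit(X_train, train_labels)

# Words the vectorizer has not seen ("wins", "the") are ignored.
X_new = vectorizer.transform(["Team wins the league final"])
probabilities = model.predict_proba(X_new)[0]

# classes_ gives the label order of the probability columns.
for label, probability in zip(model.classes_, probabilities):
    print(f"{label}: {probability:.3f}")
```

Naive Bayes probabilities are often overconfident, so they are more useful for ranking classes than as calibrated estimates.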

5. Evaluate Accuracy

What is it?
Measures how well the model performs on unseen data.

Syntax:

accuracy = accuracy_score(y_test, model.predict(X_test))

Explanation:

Compares predicted and actual labels and returns the fraction that match.
Higher accuracy generally indicates better performance, but it can be misleading when the classes are imbalanced.
confusion_matrix and classification_report from sklearn.metrics give per-class detail.
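
As a sketch of those per-class metrics, the example below scores two hand-written label lists. They stand in for y_test and model.predict(X_test) and are not results from a real model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels, written by hand for illustration.
y_true = ["sports", "sports", "politics", "tech", "tech", "politics"]
y_pred = ["sports", "politics", "politics", "tech", "tech", "politics"]

# Fix the label order so rows and columns of the matrix are predictable.
labels = ["politics", "sports", "tech"]

acc = accuracy_score(y_true, y_pred)  # 5 of 6 predictions are correct
cm = confusion_matrix(y_true, y_pred, labels=labels)

print("Accuracy:", acc)
print(cm)  # rows are true classes, columns are predicted classes
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

Here the matrix shows that the single error is a sports headline predicted as politics, a detail the accuracy figure alone would hide.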

Real-Life Project: Classifying BBC News Headlines

Project Overview

Classify BBC headlines into business, politics, tech, etc., using TF-IDF and Naive Bayes.

Code Example

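from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy corpus for illustration only. The first headline in each row is the original example;
# the other two are made-up additions so that every class has training examples.
corpus = [
    "Apple unveils new iPhone", "Chipmaker unveils faster phone processor", "New iPhone software update released",
    "Government passes tax reform", "Parliament debates tax bill", "Government announces election date",
    "Championship ends in a tie", "Striker scores in championship final", "Cup final ends in penalty shootout",
]
y = ["tech"] * 3 + ["politics"] * 3 + ["sports"] * 3

# Split the raw text first so the vectorizer never sees the test headlines.
text_train, text_test, y_train, y_test = train_test_split(
    corpus, y, test_size=3, random_state=42, stratify=y
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)  # learn vocabulary and IDF on training text
X_test = vectorizer.transform(text_test)        # reuse them on the test text

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))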

Expected Output

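A single line such as Accuracy: <value>, where the value lies between 0.0 and 1.0. On a corpus of only a few headlines the test set holds just a handful of examples, so the exact number depends heavily on the split and is not a reliable estimate of real-world performance.
Optional: print a confusion matrix or classification report for per-class detail.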

Common Mistakes to Avoid

❌ Using fit_transform() on test data (fit the vectorizer on the training data, then call transform() on the test data)
❌ Not preprocessing the input corpus
❌ Ignoring imbalanced class distributions
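
The first mistake is easiest to see in code. In this minimal sketch the training headlines come from the project example, while the test headline is made up; the vectorizer is fitted once and then reused.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["Apple unveils new iPhone", "Government passes tax reform"]
# Hypothetical test headline, invented for illustration only.
test_texts = ["Apple passes new tax milestone"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training text only.
X_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary; calling fit_transform here would build a
# different vocabulary whose columns no longer match what the model was trained on.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)  # same number of columns: (2, 8) (1, 8)
print(X_test.nnz)                   # 4 known terms; "milestone" is ignored
```

Words that appear only in the test text are silently dropped, which is the intended behavior: the model can only use features it saw during training.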

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
