Text Classification with TF-IDF and Scikit-learn

TF-IDF (Term Frequency-Inverse Document Frequency) is a popular method to convert text into numerical features. Combined with Scikit-learn’s machine learning tools, it enables building robust text classification models.

Key Characteristics

  • Converts raw text into a sparse numerical matrix
  • Weights common terms lower and rare terms higher (see the weighting formula after this list)
  • Useful for spam detection, sentiment analysis, etc.
  • Works well with linear models (e.g., Logistic Regression, SVM)
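
For intuition, the default weighting scikit-learn applies (smooth_idf=True, followed by L2 normalization of each row) is:

tfidf(t, d) = tf(t, d) * idf(t),  where  idf(t) = ln((1 + n) / (1 + df(t))) + 1

Here tf(t, d) is the count of term t in document d, n is the number of documents, and df(t) is the number of documents containing t. Frequent terms receive a small idf and rare terms a large one, which is exactly the weighting described above.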

Basic Rules

  • Always apply preprocessing (lowercase, remove stopwords)
  • Use TfidfVectorizer to transform text
  • Fit the classifier after vectorization
  • Evaluate with cross-validation or test split

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Import TF-IDF | from sklearn.feature_extraction.text import TfidfVectorizer | Loads the TF-IDF vectorizer
2 | Initialize Vectorizer | vectorizer = TfidfVectorizer() | Creates a TF-IDF object
3 | Fit Transform | X_tfidf = vectorizer.fit_transform(corpus) | Converts text to a numeric TF-IDF matrix
4 | Train Classifier | model.fit(X_tfidf, y) | Trains a model using TF-IDF features
5 | Predict on New Text | model.predict(vectorizer.transform([text])) | Predicts a label from new raw input

Syntax Explanation

1. Import TF-IDF

What is it?
Imports the TfidfVectorizer class, which converts documents into weighted numeric features.

Syntax:

from sklearn.feature_extraction.text import TfidfVectorizer

Explanation:

  • Makes TfidfVectorizer available from scikit-learn's feature_extraction.text module.
  • Converts text to a matrix of TF-IDF features, scaling each term's weight inversely with its document frequency.

2. Initialize Vectorizer

What is it?
Creates a TF-IDF vectorizer instance, optionally configured with preprocessing parameters.

Syntax:

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

Explanation:

  • stop_words='english' removes common English words that carry little semantic value.
  • max_features=1000 restricts the vocabulary to the 1000 most frequent terms, improving efficiency.
  • Other options include ngram_range, min_df, and max_df to control vocabulary granularity; a combined sketch follows this list.
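
As a minimal sketch of how those options can be combined (the values here are illustrative, not recommendations):

vectorizer = TfidfVectorizer(
    stop_words='english',  # drop common English words
    ngram_range=(1, 2),    # use unigrams and bigrams
    min_df=2,              # ignore terms seen in fewer than 2 documents
    max_df=0.9             # ignore terms seen in more than 90% of documents
)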

3. Fit Transform Corpus

What is it?
Learns the vocabulary and computes TF-IDF values from a list of text documents.

Syntax:

X_tfidf = vectorizer.fit_transform(corpus)

Explanation:

  • Fits the vectorizer on the text corpus and returns the TF-IDF matrix.
  • Output is a sparse matrix with shape (n_samples, n_features).
  • Word order is not preserved (TF-IDF is a bag-of-words representation); the sparse format avoids dense memory structures. A quick inspection sketch follows.
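
To see what fit_transform returns, a short inspection sketch (assumes corpus is a list of strings and scikit-learn 1.0+ for get_feature_names_out):

X_tfidf = vectorizer.fit_transform(corpus)
print(X_tfidf.shape)                       # (n_samples, n_features)
print(vectorizer.get_feature_names_out())  # learned vocabulary terms
print(X_tfidf.toarray())                   # densify only for small corpora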

4. Train Classifier

What is it?
Trains a supervised model using TF-IDF-transformed input.

Syntax:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_tfidf, y)

Explanation:

  • Any scikit-learn classifier can be trained on the TF-IDF matrix.
  • Logistic Regression is a common choice for text classification because it is fast and handles high-dimensional sparse features well.
  • The label vector y must align row-for-row with the TF-IDF matrix; a cross-validation sketch follows this list.
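
Since the Basic Rules recommend cross-validation, a minimal sketch (assumes X_tfidf and y from the earlier steps, with enough labeled samples per class for 5 folds):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_tfidf, y, cv=5)  # 5-fold cross-validation
print(scores.mean(), scores.std())                 # average accuracy and spread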

5. Predict on New Input

What is it?
Transforms raw text using the vectorizer and predicts its label with the trained model.

Syntax:

new_prediction = model.predict(vectorizer.transform(["This is a test text"]))

Explanation:

  • Applies the same vectorization learned from the training data to the new input.
  • Outputs a label prediction (e.g., spam vs. ham, positive vs. negative).
  • Can be integrated into apps, APIs, or chatbots for real-time text classification; see the probability sketch after this list.
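
When a confidence score is useful, and assuming the classifier exposes predict_proba (Logistic Regression does), a minimal sketch:

probs = model.predict_proba(vectorizer.transform(["This is a test text"]))
print(probs)  # one row per input, one probability per class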

Real-Life Project: Spam Email Classifier

Project Overview

Classify emails as spam or not using TF-IDF and logistic regression.

Code Example

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

# Sample data
corpus = ["Free money now!!!", "Important meeting tomorrow", "Win cash instantly", "Let's schedule a call"]
y = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42, stratify=y)  # stratify keeps both classes in each split

# Train
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Precision, recall, and F1-score for each class
  • With only four sample messages the scores are illustrative; expect a clearer spam/ham separation on a realistically sized dataset

Common Mistakes to Avoid

  • ❌ Not preprocessing text (case-folding, stopword removal, etc.)
  • ❌ Forgetting to transform test data with the same fitted vectorizer (the Pipeline sketch below avoids this)
  • ❌ Setting max_features too high, which can invite overfitting on small datasets
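
One way to rule out the second mistake is to bundle the vectorizer and classifier in a scikit-learn Pipeline, so the same fitted transformation is always applied; a minimal sketch (reusing corpus and y from the project above):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression()),
])
pipe.fit(corpus, y)                          # fits vectorizer and classifier together
print(pipe.predict(["Win cash instantly"]))  # raw text in, predicted label out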

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon