Text Classification with TF-IDF and Scikit-learn

TF-IDF (Term Frequency-Inverse Document Frequency) is a popular method to convert text into numerical features. Combined with Scikit-learn’s machine learning tools, it enables building robust text classification models.

Key Characteristics

  • Converts raw text into a sparse numerical matrix
  • Weights common terms lower and rare terms higher (see the weighting formula after this list)
  • Useful for spam detection, sentiment analysis, etc.
  • Works well with linear models (e.g., Logistic Regression, SVM)
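
For intuition, the default weighting scikit-learn applies (smooth_idf=True, followed by L2 normalization of each row) is:

tfidf(t, d) = tf(t, d) * idf(t),  where  idf(t) = ln((1 + n) / (1 + df(t))) + 1

Here tf(t, d) is the count of term t in document d, n is the number of documents, and df(t) is the number of documents containing t. Frequent terms receive a small idf and rare terms a large one, which is exactly the weighting described above.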

Basic Rules

  • Always apply preprocessing (lowercase, remove stopwords)
  • Use TfidfVectorizer to transform text
  • Fit the classifier after vectorization
  • Evaluate with cross-validation or test split

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Import TF-IDF | from sklearn.feature_extraction.text import TfidfVectorizer | Loads the TF-IDF vectorizer
2 | Initialize Vectorizer | vectorizer = TfidfVectorizer() | Creates a TF-IDF object
3 | Fit Transform | X_tfidf = vectorizer.fit_transform(corpus) | Converts text to a numeric TF-IDF matrix
4 | Train Classifier | model.fit(X_tfidf, y) | Trains a model using TF-IDF features
5 | Predict on New Text | model.predict(vectorizer.transform([text])) | Predicts a label from new raw input

Syntax Explanation

1. Import TF-IDF

What is it?
Imports the TfidfVectorizer class, which converts documents into weighted numeric features.

Syntax:

from sklearn.feature_extraction.text import TfidfVectorizer

Explanation:

  • Makes TfidfVectorizer available from scikit-learn's feature_extraction.text module.
  • Converts text to a matrix of TF-IDF features, scaling each term's weight inversely with its document frequency.

2. Initialize Vectorizer

What is it?
Creates a TF-IDF vectorizer instance, optionally configured with preprocessing parameters.

Syntax:

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

Explanation:

  • stop_words='english' removes common English words that carry little semantic value.
  • max_features=1000 restricts the vocabulary to the 1000 most frequent terms, improving efficiency.
  • Other options include ngram_range, min_df, and max_df to control vocabulary granularity; a combined sketch follows this list.
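
As a minimal sketch of how those options can be combined (the values here are illustrative, not recommendations):

vectorizer = TfidfVectorizer(
    stop_words='english',  # drop common English words
    ngram_range=(1, 2),    # use unigrams and bigrams
    min_df=2,              # ignore terms seen in fewer than 2 documents
    max_df=0.9             # ignore terms seen in more than 90% of documents
)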

3. Fit Transform Corpus

What is it?
Learns the vocabulary and computes TF-IDF values from a list of text documents.

Syntax:

X_tfidf = vectorizer.fit_transform(corpus)

Explanation:

  • Fits the vectorizer on the text corpus and returns the TF-IDF matrix.
  • Output is a sparse matrix with shape (n_samples, n_features).
  • Word order is not preserved (TF-IDF is a bag-of-words representation); the sparse format avoids dense memory structures. A quick inspection sketch follows.
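
To see what fit_transform returns, a short inspection sketch (assumes corpus is a list of strings and scikit-learn 1.0+ for get_feature_names_out):

X_tfidf = vectorizer.fit_transform(corpus)
print(X_tfidf.shape)                       # (n_samples, n_features)
print(vectorizer.get_feature_names_out())  # learned vocabulary terms
print(X_tfidf.toarray())                   # densify only for small corpora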

4. Train Classifier

What is it?
Trains a supervised model using TF-IDF-transformed input.

Syntax:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_tfidf, y)

Explanation:

  • Any scikit-learn classifier can be trained on the TF-IDF matrix.
  • Logistic Regression is a common choice for text classification because it is fast and handles high-dimensional sparse features well.
  • The label vector y must align row-for-row with the TF-IDF matrix; a cross-validation sketch follows this list.
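
Since the Basic Rules recommend cross-validation, a minimal sketch (assumes X_tfidf and y from the earlier steps, with enough labeled samples per class for 5 folds):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_tfidf, y, cv=5)  # 5-fold cross-validation
print(scores.mean(), scores.std())                 # average accuracy and spread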

5. Predict on New Input

What is it?
Transforms raw text using the vectorizer and predicts its label with the trained model.

Syntax:

new_prediction = model.predict(vectorizer.transform(["This is a test text"]))

Explanation:

  • Applies the same vectorization learned from the training data to the new input.
  • Outputs a label prediction (e.g., spam vs. ham, positive vs. negative).
  • Can be integrated into apps, APIs, or chatbots for real-time text classification; see the probability sketch after this list.
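
When a confidence score is useful, and assuming the classifier exposes predict_proba (Logistic Regression does), a minimal sketch:

probs = model.predict_proba(vectorizer.transform(["This is a test text"]))
print(probs)  # one row per input, one probability per class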

Real-Life Project: Spam Email Classifier

Project Overview

Classify emails as spam or not using TF-IDF and logistic regression.

Code Example

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

# Sample data
corpus = ["Free money now!!!", "Important meeting tomorrow", "Win cash instantly", "Let's schedule a call"]
y = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42, stratify=y)  # stratify keeps both classes in each split

# Train
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Precision, recall, and F1-score for each class
  • With only four sample messages the scores are illustrative; expect a clearer spam/ham separation on a realistically sized dataset

Common Mistakes to Avoid

  • ❌ Not preprocessing text (case-folding, stopword removal, etc.)
  • ❌ Forgetting to transform test data with the same fitted vectorizer (the Pipeline sketch below avoids this)
  • ❌ Setting max_features too high, which can invite overfitting on small datasets
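
One way to rule out the second mistake is to bundle the vectorizer and classifier in a scikit-learn Pipeline, so the same fitted transformation is always applied; a minimal sketch (reusing corpus and y from the project above):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression()),
])
pipe.fit(corpus, y)                          # fits vectorizer and classifier together
print(pipe.predict(["Win cash instantly"]))  # raw text in, predicted label out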

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon