TF-IDF (Term Frequency-Inverse Document Frequency) is a popular method to convert text into numerical features. Combined with Scikit-learn’s machine learning tools, it enables building robust text classification models.
Key Characteristics
- Converts raw text into a sparse numerical matrix
- Down-weights terms that appear in most documents and up-weights terms that are frequent in one document but rare across the corpus
- Useful for spam detection, sentiment analysis, etc.
- Works well with linear models (e.g., Logistic Regression, SVM)
Basic Rules
- Apply consistent preprocessing: TfidfVectorizer lowercases text by default, and stopword removal can be enabled with stop_words
- Use TfidfVectorizer to transform text
- Fit the classifier after vectorization
- Evaluate with cross-validation or test split
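One way to follow these rules together is to chain the vectorizer and the classifier in a Pipeline and hand it to cross_val_score, so that the vectorizer is refit on each training fold. The sketch below uses a small invented dataset (texts and labels are assumptions, not data from this article), and cv=2 is chosen only because the toy set is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; 1 = spam, 0 = ham.
texts = [
    "Free money now", "Win cash instantly",
    "Claim your free prize", "Cheap loans approved fast",
    "Important meeting tomorrow", "Let's schedule a call",
    "Project report attached", "Lunch at noon?",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Preprocessing, vectorization, and classification in a single estimator.
pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)

# Each fold fits the vectorizer on its own training portion only.
scores = cross_val_score(pipeline, texts, labels, cv=2)
print(scores)
```

Passing raw text to the pipeline, rather than a precomputed matrix, keeps held-out documents from influencing the vocabulary or the IDF weights.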
Syntax Table
SL NO | Technique | Syntax Example | Description |
---|---|---|---|
1 | Import TF-IDF | from sklearn.feature_extraction.text import TfidfVectorizer | Loads TF-IDF vectorizer |
2 | Initialize Vectorizer | vectorizer = TfidfVectorizer() | Creates TF-IDF object |
3 | Fit Transform | X_tfidf = vectorizer.fit_transform(corpus) | Converts text to numeric TF-IDF matrix |
4 | Train Classifier | model.fit(X_tfidf, y) | Trains model using TF-IDF features |
5 | Predict on New Text | model.predict(vectorizer.transform([text])) | Predicts label from new raw input |
Syntax Explanation
1. Import TF-IDF
What is it?
Imports the TfidfVectorizer class, which converts text into weighted numeric features.
Syntax:
from sklearn.feature_extraction.text import TfidfVectorizer
Explanation:
- Required before a vectorizer instance can be created.
- TfidfVectorizer converts a collection of documents into a matrix of TF-IDF features, scaling each term's frequency in a document down according to how many documents contain that term.
2. Initialize Vectorizer
What is it?
Creates a TF-IDF vectorizer instance with optional preprocessing options.
Syntax:
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
Explanation:
- stop_words='english' removes commonly used words that provide little semantic value.
- max_features=1000 restricts the feature space to the 1000 most frequent terms across the corpus, improving efficiency.
- Other options include ngram_range, min_df, and max_df to control vocabulary granularity.
3. Fit Transform Corpus
What is it?
Learns the vocabulary and computes TF-IDF values from a list of text documents.
Syntax:
X_tfidf = vectorizer.fit_transform(corpus)
Explanation:
- Fits the vectorizer on the text corpus and returns the TF-IDF matrix.
- Output is a sparse matrix with shape (n_samples, n_features).
- Records how important each term is to each document but not word order (it is a bag-of-words representation); the sparse format avoids storing the many zero entries.
4. Train Classifier
What is it?
Trains a supervised model using TF-IDF-transformed input.
Syntax:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_tfidf, y)
Explanation:
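- Most Scikit-learn classifiers accept the sparse TF-IDF matrix directly; a few, such as GaussianNB, require dense input.
- Logistic Regression is commonly used for text classification because linear models train quickly and handle high-dimensional sparse features well.
- Label vector y must align with TF-IDF matrix rows: one label per document, in the same order.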
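Because the vectorization step is independent of the model, other classifiers can be swapped in without changing anything else. The following sketch tries LinearSVC and MultinomialNB as two alternatives (these choices are illustrative assumptions, not recommendations from this article) and checks the row and label alignment before training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

corpus = [
    "Free money now!!!",
    "Important meeting tomorrow",
    "Win cash instantly",
    "Let's schedule a call",
]
y = [1, 0, 1, 0]  # 1 = spam, 0 = ham

X_tfidf = TfidfVectorizer().fit_transform(corpus)

# One label per matrix row, in the same order as the documents.
if X_tfidf.shape[0] != len(y):
    raise ValueError("labels and TF-IDF rows are misaligned")

predictions = {}
for clf in (LinearSVC(), MultinomialNB()):
    clf.fit(X_tfidf, y)
    predictions[type(clf).__name__] = clf.predict(X_tfidf)
    print(type(clf).__name__, predictions[type(clf).__name__])
```

Predicting on the training rows, as done here, only confirms that the code runs; it says nothing about how well either model generalizes.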
5. Predict on New Input
What is it?
Transforms raw text using the vectorizer and predicts its label with the trained model.
Syntax:
new_prediction = model.predict(vectorizer.transform(["This is a test text"]))
Explanation:
- Calling transform (not fit_transform) on the already-fitted vectorizer maps new input to the same vocabulary and IDF weights learned from the training data.
- Outputs label prediction (e.g., spam vs. ham, positive vs. negative).
- Can be integrated into apps, APIs, or chatbots for real-time text classification.
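For integration into an application, the two steps are often wrapped in a small helper. The sketch below is one possible shape for such a helper; the function name classify and its return format are hypothetical, and the model is trained on the four sample emails only so that the example is self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

corpus = [
    "Free money now!!!",
    "Important meeting tomorrow",
    "Win cash instantly",
    "Let's schedule a call",
]
y = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(corpus), y)


def classify(text: str) -> tuple[int, float]:
    """Hypothetical helper: return (predicted label, estimated spam probability)."""
    features = vectorizer.transform([text])  # transform only, never refit here
    label = int(model.predict(features)[0])
    # Column 1 of predict_proba corresponds to model.classes_[1], i.e. label 1.
    spam_probability = float(model.predict_proba(features)[0, 1])
    return label, spam_probability


print(classify("Win free cash now"))
```

Returning the probability alongside the label lets the calling code apply its own threshold, for example flagging borderline messages for review instead of rejecting them outright.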
Real-Life Project: Spam Email Classifier
Project Overview
Classify emails as spam or not using TF-IDF and logistic regression.
Code Example
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
# Sample data
corpus = ["Free money now!!!", "Important meeting tomorrow", "Win cash instantly", "Let's schedule a call"]
y = [1, 0, 1, 0] # 1 = spam, 0 = ham
# Train/Test Split on the raw text; stratify=y keeps both classes in each half
texts_train, texts_test, y_train, y_test = train_test_split(corpus, y, test_size=0.5, random_state=42, stratify=y)
# TF-IDF: fit on the training text only, then reuse the same vectorizer for the test text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)
# Train
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict and evaluate (zero_division=0 avoids warnings when a class is never predicted)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
Expected Output
- Precision, Recall, F1-score for each class
- With only four sample emails the scores are not meaningful; a clear distinction between spam and ham requires a larger labeled dataset
Common Mistakes to Avoid
- ❌ Not preprocessing text (case-folding, stopwords, etc.)
- ❌ Forgetting to transform test data with the same vectorizer
- ❌ Overfitting with too many max_features
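The second mistake is easy to reproduce. In this minimal sketch (the train and test texts are made-up examples), refitting a vectorizer on the test data yields a matrix whose columns no longer match what the classifier was trained on, whereas transform keeps the training feature space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical split of raw text.
train_texts = ["Free money now!!!", "Important meeting tomorrow"]
test_texts = ["Win cash instantly"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Correct: reuse the fitted vectorizer, so columns match the training matrix.
X_test = vectorizer.transform(test_texts)

# Mistake: a freshly fitted vectorizer learns a different vocabulary.
X_wrong = TfidfVectorizer().fit_transform(test_texts)

print("train:", X_train.shape)
print("test (transform):", X_test.shape)
print("test (refit):", X_wrong.shape)
```

Words that never appeared in the training text are simply ignored by transform, so an unseen message can map to an all-zero row; that is expected behavior, not an error.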
Further Reading Recommendation
- Scikit-learn TF-IDF Docs
- Text Classification with Scikit-learn (Kaggle)
- NLTK and SpaCy for preprocessing (optional)