`CountVectorizer` and `TfidfVectorizer` are two fundamental techniques for converting raw text data into numerical features. Both are part of Scikit-learn's `feature_extraction.text` module and are commonly used in natural language processing pipelines.
Key Characteristics
- `CountVectorizer` counts the number of times each word appears.
- `TfidfVectorizer` scales term frequency by how rare the word is across all documents (the sketch after this list contrasts the two).
- Both output sparse matrices used for model training.
- Work well with linear models like Logistic Regression and Naive Bayes.
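To make the contrast concrete, here is a minimal sketch (the three-document corpus is invented for illustration) showing how the same text vectorizes differently under each approach:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]

# CountVectorizer: raw integer counts per word
cv = CountVectorizer()
print(cv.fit_transform(corpus).toarray())
print(cv.get_feature_names_out())  # column order of the matrix above

# TfidfVectorizer: "the", which appears in every document,
# receives a lower weight relative to rarer words
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
```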
Basic Rules
- Always apply preprocessing before vectorization (e.g., lowercasing, stopword removal).
- Use `fit_transform()` on training text; use `transform()` on test text.
- Choose `TfidfVectorizer` for better weighting and improved model performance in many tasks.
Syntax Table
| SL NO | Technique | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Vectorizers | `from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer` | Load vectorizer modules |
| 2 | CountVectorizer | `cv = CountVectorizer()` | Initializes a count vectorizer |
| 3 | TfidfVectorizer | `tfidf = TfidfVectorizer()` | Initializes a TF-IDF vectorizer |
| 4 | Fit and Transform | `X_cv = cv.fit_transform(corpus)` | Converts raw text to a numeric feature matrix |
| 5 | Transform New Text | `X_new = tfidf.transform(["new example"])` | Applies the previously learned vocabulary to new text |
Syntax Explanation
1. Import Vectorizers
What is it?
Imports text vectorization classes from Scikit-learn.
Syntax:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
Explanation:
- `CountVectorizer` transforms documents into count vectors.
- `TfidfVectorizer` does the same but applies term frequency-inverse document frequency weighting.
- Essential imports to use these tools in any text classification or NLP pipeline.
- Allows creation of bag-of-words and weighted feature representations.
- Can be imported once and reused in multiple scripts or notebooks.
2. CountVectorizer
What is it?
Initializes a word-count-based vectorizer to convert text into frequency-based numerical features.
Syntax:
cv = CountVectorizer(stop_words='english')
Explanation:
- Filters out common English stopwords to reduce noise.
- Converts each document into a sparse row vector of word counts.
- Commonly used for Naive Bayes and simple linear classifiers.
- You can customize it (as shown in the sketch after this list) with:
  - `ngram_range=(1, 2)` to capture unigrams and bigrams
  - `max_df` or `min_df` to filter out overly common or rare terms
  - `max_features` to limit the vocabulary size
- Requires text preprocessing beforehand for optimal results (e.g., punctuation removal).
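As referenced above, a minimal sketch of a customized `CountVectorizer` (the corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Stocks rally as markets close",
    "Markets close lower on rate fears",
    "Team wins championship final",
]

cv = CountVectorizer(
    stop_words='english',  # drop common English stopwords
    ngram_range=(1, 2),    # capture unigrams and bigrams
    min_df=1,              # keep terms appearing in at least one document
    max_features=20,       # cap the vocabulary size
)
X_cv = cv.fit_transform(corpus)  # sparse document-term count matrix

print(X_cv.shape)                  # (3, number_of_kept_terms)
print(cv.get_feature_names_out())  # learned vocabulary, including bigrams
```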
3. TfidfVectorizer
What is it?
Creates a TF-IDF vectorizer to assign weights to words based on frequency and rarity.
Syntax:
tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=1000)
Explanation:
- Computes TF-IDF: the product of term frequency and inverse document frequency.
- `ngram_range=(1, 2)` captures unigrams and bigrams, increasing context awareness.
- `max_features=1000` restricts the vocabulary size for performance and overfitting control.
- More advanced than `CountVectorizer`: helps reduce the impact of frequently occurring non-informative terms.
- Automatically normalizes feature values for better compatibility with models like SVM or logistic regression.
- Can incorporate sublinear TF scaling or L2 normalization (see the sketch after this list).
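A minimal sketch combining the options discussed above, including the sublinear TF scaling and L2 normalization mentioned in the last bullet (the corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Stocks rally as markets close",
    "Markets close lower on rate fears",
    "Team wins championship final",
]

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams
    max_features=1000,   # cap the vocabulary size
    sublinear_tf=True,   # use 1 + log(tf) instead of raw term frequency
    norm='l2',           # scale each document vector to unit length (the default)
)
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.shape)  # (3, number_of_kept_terms)
```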
4. Fit and Transform Text
What is it?
Learns the vocabulary from the training corpus and converts the text into a numerical matrix.
Syntax:
X_cv = cv.fit_transform(corpus)
X_tfidf = tfidf.fit_transform(corpus)
Explanation:
- `fit_transform()` both fits the vectorizer to the data and transforms it into a matrix.
- The result is a sparse matrix, which conserves memory.
- Each row is a document; each column is a token or n-gram.
- `.toarray()` can be used to convert the result to a dense format for inspection.
- To avoid data leakage, fit only on the training data and call `.transform()` (never `.fit_transform()`) on test or later data.
- Essential for converting raw text into a form suitable for scikit-learn classifiers (see the sketch after this list).
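A minimal sketch of fitting on a training corpus and inspecting the result (the corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["stocks rally today", "election results declared", "team wins final"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_corpus)  # learns the vocabulary AND transforms in one step

print(X_train.shape)                  # (n_documents, n_terms)
print(type(X_train))                  # a SciPy sparse matrix, which conserves memory
print(X_train.toarray().round(2))     # dense view, for inspection only
print(tfidf.get_feature_names_out())  # one column per learned term
```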
5. Transform New Data
What is it?
Applies the trained vectorizer to convert new/unseen text into numeric features.
Syntax:
X_new = tfidf.transform(["Sample input text"])
Explanation:
- Transforms new documents into the same feature space learned during training.
- Prevents vocabulary mismatch and ensures feature alignment.
- Returns a sparse matrix that can directly be used for prediction.
- Important for evaluating model performance on test sets or live data.
- You must not call `fit_transform()` on new data, as that would overwrite the learned vocabulary (see the sketch after this list).
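A minimal sketch of the pattern, fitting once on training text and then only transforming unseen text (all strings are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["stocks rally today", "election results declared", "team wins final"]
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_corpus)  # fit on training text only

X_new = tfidf.transform(["election rally coverage"])  # reuse the learned vocabulary
print(X_new.shape[1] == X_train.shape[1])  # True: feature spaces are aligned
# A word unseen during training ("coverage") is silently ignored,
# not added to the vocabulary.
```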
Real-Life Project: News Headline Classification
Project Overview
Classify news headlines into categories such as sports, business, and politics.
Code Example
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample corpus (two headlines per category so every class appears in both splits)
corpus = [
    "Stocks rally as markets close",
    "Tech shares lift markets to record high",
    "Election results declared",
    "Parliament passes new budget bill",
    "Team wins championship",
    "Striker scores twice in cup final",
]
y = ["business", "business", "politics", "politics", "sports", "sports"]

# Split the raw text first so the vectorizer never sees the test documents
train_texts, test_texts, y_train, y_test = train_test_split(
    corpus, y, test_size=0.5, random_state=42, stratify=y
)

# TF-IDF: fit on the training text only, then transform the held-out text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
Expected Output
- An accuracy score for the held-out test headlines
- Optionally, a confusion matrix or classification report (a minimal sketch follows below)
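A minimal extension for those optional reports, reusing `y_test` and `y_pred` from the code above:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall/F1 and the raw confusion matrix for the predictions above
print(classification_report(y_test, y_pred, zero_division=0))
print(confusion_matrix(y_test, y_pred))
```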
Common Mistakes to Avoid
- ❌ Re-fitting the vectorizer on test data (causes data leakage)
- ❌ Using raw strings without token preprocessing
- ❌ Forgetting to use `.transform()` for new/unseen text