Vectorizing text is the process of converting textual data into a numerical form that machine learning models can understand. Scikit-learn offers several vectorization techniques, such as CountVectorizer and TfidfVectorizer, for extracting meaningful features from text.
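Both classes expose the same fit/transform interface, so TfidfVectorizer can be swapped in wherever CountVectorizer appears below. A minimal sketch, using a two-sentence toy corpus invented for illustration:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = ["the cat sat on the mat", "the dog sat on the log"]  # toy corpus for illustration
counts = CountVectorizer().fit_transform(corpus)   # raw term counts per document
tfidf = TfidfVectorizer().fit_transform(corpus)    # counts re-weighted by inverse document frequency
print(counts.toarray())  # integer counts
print(tfidf.toarray())   # floats; terms shared by every document (e.g. "the") are down-weighted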
Key Characteristics
- Converts unstructured text into structured numerical data
- Supports bag-of-words and frequency-based encoding
- Compatible with pipelines and transformers
- Enables use of classifiers, regressors, and clustering on text
Basic Rules
- Preprocess text: lowercase, remove punctuation and stopwords
- Choose appropriate vectorizer (Count vs TF-IDF)
- Fit vectorizer on training data only
- Transform test data with the same fitted vectorizer instance (see the sketch below)
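The last two rules prevent data leakage and vocabulary mismatch. A minimal sketch of the fit-on-train / transform-on-test pattern, with hypothetical train_texts and test_texts:
from sklearn.feature_extraction.text import CountVectorizer
train_texts = ["good movie", "bad movie"]   # hypothetical training documents
test_texts = ["good acting overall"]        # hypothetical unseen documents
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # vocabulary is learned here, from training data only
X_test = vectorizer.transform(test_texts)        # same fitted instance; unseen words are simply ignored
print(X_train.shape, X_test.shape)               # same number of columns in both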
Syntax Table
| SL NO | Technique | Syntax Example | Description |
|---|---|---|---|
| 1 | Import CountVectorizer | from sklearn.feature_extraction.text import CountVectorizer | Loads the CountVectorizer class |
| 2 | Initialize Vectorizer | vectorizer = CountVectorizer() | Creates a bag-of-words transformer |
| 3 | Fit and Transform | X = vectorizer.fit_transform(corpus) | Learns the vocabulary and transforms text into vectors |
| 4 | Get Feature Names | vectorizer.get_feature_names_out() | Lists vocabulary terms used in the model |
| 5 | Use in Pipeline | Pipeline([...]) | Combines vectorizer and classifier in one workflow |
Syntax Explanation
1. Import CountVectorizer
What is it?
Imports the CountVectorizer class used for bag-of-words encoding.
Syntax:
from sklearn.feature_extraction.text import CountVectorizer
Explanation:
- Essential to access text vectorization functionality.
- Converts each document into a fixed-length vector based on word counts, as sketched below.
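As a quick illustration of the fixed-length encoding (documents invented for the example), each term is assigned a column index and each document becomes one row of counts:
from sklearn.feature_extraction.text import CountVectorizer
docs = ["red fish blue fish", "one fish two fish"]  # invented documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)  # term -> column index, e.g. {'red': ..., 'fish': ...}
print(X.toarray()[0])          # counts for the first document, one entry per vocabulary term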
2. Initialize Vectorizer
What is it?
Creates an instance of the vectorizer with optional preprocessing parameters.
Syntax:
vectorizer = CountVectorizer(stop_words='english', max_features=1000)
Explanation:
- Removes common English stopwords.
- Limits the vocabulary to the 1000 most frequent terms.
- Can be further customized with n-gram ranges, min_df, max_df, and a custom tokenizer, as sketched below.
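A sketch of a more heavily customized vectorizer; the parameter values below are illustrative, not recommendations:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    min_df=2,              # drop terms that appear in fewer than 2 documents
    max_df=0.9,            # drop terms that appear in more than 90% of documents
    stop_words='english',  # built-in English stopword list
)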
3. Fit and Transform
What is it?
Fits the vectorizer on the training corpus and transforms it into a sparse matrix.
Syntax:
X = vectorizer.fit_transform(corpus)
Explanation:
- Learns the vocabulary and word counts from the corpus.
- Transforms the raw text into a matrix suitable for model training.
- The output matrix is sparse and memory-efficient, as sketched below.
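A short illustration of the sparse output, using an invented three-document corpus:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["cats and dogs", "dogs bark", "cats purr"]  # invented corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(type(X))       # SciPy sparse matrix (CSR format)
print(X.shape)       # (number of documents, size of vocabulary)
print(X.toarray())   # dense view; only sensible for tiny corpora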
4. Get Feature Names
What is it?
Retrieves the list of vocabulary terms generated by the vectorizer.
Syntax:
features = vectorizer.get_feature_names_out()
Explanation:
- Helps in understanding the feature space and interpreting model coefficients.
- Useful for feature analysis and model explanation, as sketched below.
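A sketch that pairs each vocabulary term with its total count across the corpus (corpus again invented for illustration):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["cats and dogs", "dogs bark", "cats purr"]  # invented corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
features = vectorizer.get_feature_names_out()
totals = np.asarray(X.sum(axis=0)).ravel()  # total count of each term over all documents
for term, count in zip(features, totals):
    print(term, count)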
5. Use in Pipeline
What is it?
Combines the vectorizer with a classifier or regressor in a single ML workflow.
Syntax:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())
])
Explanation:
- Ensures reproducibility and reduces preprocessing errors.
- Can be used with GridSearchCV or cross_val_score, as sketched below.
- Simplifies model training and prediction.
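For example, the vectorizer and classifier can be tuned together with GridSearchCV; in this sketch the corpus, labels, and parameter grid are invented for illustration:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
corpus = ["win cash now", "meeting at noon", "free prize now", "lunch with the team"]  # invented
labels = ["spam", "ham", "spam", "ham"]
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())
])
# '<step name>__<parameter>' addresses parameters inside the pipeline steps
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=2)  # cv=2 only because the toy corpus is tiny
search.fit(corpus, labels)
print(search.best_params_)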
Real-Life Project: News Topic Classification
Project Overview
Classify news headlines into topics using CountVectorizer and logistic regression.
Code Example
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Sample data
corpus = ["Economy hits record growth", "New sports championship announced", "Politics heat up in elections"]
y = ["business", "sports", "politics"]
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(corpus, y, test_size=0.33, random_state=42)
# Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())
])
# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
Expected Output
- Precision, recall, and F1-scores per class
- Overall accuracy of the text classifier
Common Mistakes to Avoid
- ❌ Fitting the vectorizer on the full dataset instead of on the training data only
- ❌ Feeding noisy, uncleaned text when the default preprocessing is not enough
- ❌ Skipping lowercasing or punctuation removal during preprocessing
