Vectorizing Text Data using Scikit-learn

Vectorizing text is the process of converting textual data into a numerical form that machine learning models can consume. Scikit-learn offers several vectorizers, such as CountVectorizer (raw term counts) and TfidfVectorizer (term counts reweighted by inverse document frequency), to extract features from text.

Key Characteristics

  • Converts unstructured text into structured numerical data
  • Supports bag-of-words and frequency-based encoding
  • Compatible with pipelines and transformers
  • Enables use of classifiers, regressors, and clustering on text

Basic Rules

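  • Preprocess text: lowercase, remove punctuation and stopwords (CountVectorizer lowercases and drops punctuation by default; stopword removal must be enabled with stop_words)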
  • Choose appropriate vectorizer (Count vs TF-IDF)
  • Fit vectorizer on training data only
  • Transform test data with same vectorizer instance
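The last two rules can be sketched as follows. The training and test sentences are hypothetical. The point is that the vocabulary comes only from the training documents, so words that appear only in the test set are ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only for illustration
train_docs = ["stocks rally on earnings", "team wins the final"]
test_docs = ["earnings lift the team", "unseen words here"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learn vocabulary from training data only
X_test = vectorizer.transform(test_docs)        # reuse the same vocabulary, no refitting

# Both matrices share the same columns
print(X_train.shape, X_test.shape)

# Words never seen during fitting contribute nothing to the test vectors
print(X_test.toarray())
```

Calling fit_transform on the test data instead would build a different vocabulary, and the columns would no longer line up with what the model was trained on.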

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Import CountVectorizer | from sklearn.feature_extraction.text import CountVectorizer | Loads count vectorizer class
2 | Initialize Vectorizer | vectorizer = CountVectorizer() | Creates bag-of-words transformer
3 | Fit and Transform | X = vectorizer.fit_transform(corpus) | Learns vocab and transforms text into vectors
4 | Get Feature Names | vectorizer.get_feature_names_out() | Lists vocabulary terms used in the model
5 | Use in Pipeline | Pipeline([...]) | Combines vectorizer and classifier in one workflow

Syntax Explanation

1. Import CountVectorizer

What is it?
Imports the CountVectorizer class used for bag-of-words encoding.

Syntax:

from sklearn.feature_extraction.text import CountVectorizer

Explanation:

  • Essential to access text vectorization functionality.
  • Converts each document into a fixed-length vector based on word counts.

2. Initialize Vectorizer

What is it?
Creates an instance of the vectorizer with optional preprocessing parameters.

Syntax:

vectorizer = CountVectorizer(stop_words='english', max_features=1000)

Explanation:

  • Removes common English stopwords using scikit-learn's built-in list.
  • Limits the vocabulary to the 1000 terms with the highest total frequency across the corpus.
  • Can be customized with the ngram_range, min_df, max_df, and tokenizer parameters.

3. Fit and Transform

What is it?
Fits the vectorizer on the training corpus and transforms it into a sparse matrix.

Syntax:

X = vectorizer.fit_transform(corpus)

Explanation:

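  • Learns the vocabulary from the corpus and counts each term's occurrences per document.
  • Transforms raw text into matrix form for model training, with one row per document and one column per vocabulary term.
  • Output is a SciPy sparse matrix, which is memory-efficient because most entries are zero.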
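A small sketch of what fit_transform returns, assuming a hypothetical three-sentence corpus. It checks that the result is sparse and prints its shape and the number of stored non-zero entries.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
corpus = ["data science is fun", "science needs data", "fun with text data"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(sparse.issparse(X))  # whether the result is a SciPy sparse container
print(X.shape)             # (number of documents, vocabulary size)
print(X.nnz)               # number of explicitly stored non-zero counts
print(X.toarray())         # dense view; only sensible for small examples
```

On a real corpus with tens of thousands of terms, converting to a dense array with toarray() can exhaust memory, so most estimators are fed the sparse matrix directly.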

4. Get Feature Names

What is it?
Retrieves the list of vocabulary terms generated by the vectorizer.

Syntax:

features = vectorizer.get_feature_names_out()

Explanation:

  • Returns the terms in column order, so position i names column i of the matrix; this helps in understanding the feature space and interpreting model coefficients.
  • Useful for feature analysis or model explanations.

5. Use in Pipeline

What is it?
Combines the vectorizer with a classifier or regressor in a single ML workflow.

Syntax:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())
])

Explanation:

  • Refits the vectorizer together with the model on each training set, which keeps test vocabulary from leaking into training and reduces preprocessing errors.
  • Can be used with GridSearchCV or cross_val_score.
  • Simplifies model training and prediction.
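As a sketch of the GridSearchCV case, the example below tunes one vectorizer parameter and one classifier parameter together. The eight headlines, the two labels, and the grid values are hypothetical and far too small for meaningful scores; they only show how the step names ('vect', 'clf') prefix the parameter names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, used only for illustration
texts = [
    "stocks rally on strong earnings",
    "markets fall as rates rise",
    "bank reports record profit",
    "oil prices climb again",
    "team wins the cup final",
    "striker scores late goal",
    "coach praises young players",
    "club signs new goalkeeper",
]
labels = ["business"] * 4 + ["sports"] * 4

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

# Within each fold the vectorizer is fitted on the training portion only
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer lives inside the pipeline, its settings can be searched alongside the classifier's without any manual refitting.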

Real-Life Project: News Topic Classification

Project Overview

Classify news headlines into topics using CountVectorizer and logistic regression.

Code Example

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

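# Sample data (illustrative toy headlines; a real project needs far more examples)
corpus = [
    "Economy hits record growth",
    "Stock markets rally after earnings",
    "Central bank raises interest rates",
    "New sports championship announced",
    "Team wins the league final",
    "Star striker signs new contract",
    "Politics heat up in elections",
    "Senate passes new budget bill",
    "President addresses parliament on reform",
]
y = ["business"] * 3 + ["sports"] * 3 + ["politics"] * 3

# Train/test split (stratified so every class appears in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    corpus, y, test_size=0.33, random_state=42, stratify=y
)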

# Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())
])

# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

  • Precision, recall, and F1-scores per class
  • Accuracy of overall text classification

Common Mistakes to Avoid

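  • ❌ Fitting the vectorizer on the full dataset instead of only the training data
  • ❌ Passing pre-tokenized lists or uncleaned text (markup, boilerplate) to a default vectorizer, which expects one raw string per document
  • ❌ Disabling lowercasing or replacing the tokenizer without checking how case and punctuation are then handled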

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
