Vectorizing Text Data using Scikit-learn

Vectorizing text is the process of converting textual data into a numerical form that machine learning models can work with. Scikit-learn offers several vectorizers for this, including CountVectorizer (raw term counts) and TfidfVectorizer (counts reweighted by how rare a term is across documents), both of which extract features from text.

Key Characteristics

Converts unstructured text into structured numerical data
Supports bag-of-words and frequency-based encoding
Compatible with pipelines and transformers
Enables use of classifiers, regressors, and clustering on text

Basic Rules

Preprocess text: CountVectorizer lowercases and drops punctuation by default; stopword removal must be enabled with stop_words
Choose appropriate vectorizer (Count vs TF-IDF)
Fit vectorizer on training data only
Transform test data with same vectorizer instance
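The last two rules can be sketched as follows. The documents are made up for illustration; the point is that fit_transform is called on the training text only, and the same fitted instance is reused with transform on the test text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents (invented for this sketch)
train_docs = ["stocks rally on earnings", "team wins the final"]
test_docs = ["stocks fall after the final whistle"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learn the vocabulary from training text only
X_test = vectorizer.transform(test_docs)        # reuse that vocabulary; no refitting

# Both matrices share the same columns; unseen test words are silently ignored
print(X_train.shape, X_test.shape)
print("whistle" in vectorizer.vocabulary_)
```

Words that appear only in the test document (such as "whistle") have no column, so they contribute nothing to X_test. This is the intended behavior: the model never sees vocabulary it was not trained on.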

Syntax Table

No.	Technique	Syntax Example	Description
1	Import CountVectorizer	`from sklearn.feature_extraction.text import CountVectorizer`	Loads count vectorizer class
2	Initialize Vectorizer	`vectorizer = CountVectorizer()`	Creates bag-of-words transformer
3	Fit and Transform	`X = vectorizer.fit_transform(corpus)`	Learns vocab and transforms text into vectors
4	Get Feature Names	`vectorizer.get_feature_names_out()`	Lists vocabulary terms used in the model
5	Use in Pipeline	`Pipeline([...])`	Combines vectorizer and classifier in one workflow

Syntax Explanation

1. Import CountVectorizer

What is it?
Imports the CountVectorizer class used for bag-of-words encoding.

Syntax:

from sklearn.feature_extraction.text import CountVectorizer

Explanation:

Makes the CountVectorizer class available; it lives in the sklearn.feature_extraction.text module.
The class converts each document into a fixed-length vector of word counts, with one entry per vocabulary term.

2. Initialize Vectorizer

What is it?
Creates an instance of the vectorizer with optional preprocessing parameters.

Syntax:

vectorizer = CountVectorizer(stop_words='english', max_features=1000)

Explanation:

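stop_words='english' removes words found on scikit-learn's built-in English stopword list.
max_features=1000 limits the vocabulary to the 1000 terms with the highest total count across the corpus.
Can be customized further with ngram_range, min_df, max_df, and tokenizer settings.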

3. Fit and Transform

What is it?
Fits the vectorizer on the training corpus and transforms it into a sparse matrix.

Syntax:

X = vectorizer.fit_transform(corpus)

Explanation:

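Learns the vocabulary from the corpus and counts each term per document.
Transforms raw text into a document-term matrix for model training.
The output is a SciPy sparse matrix, which stores only the nonzero counts and is therefore memory-efficient.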
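To see what fit_transform returns, the following sketch inspects the result on a small invented corpus. Converting to a dense array with toarray() is only sensible for toy data like this.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus (invented for this sketch)
corpus = ["red apple", "green apple", "red red wine"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(type(X))                 # a SciPy sparse matrix type
print(X.shape)                 # (number of documents, number of vocabulary terms)
print(vectorizer.vocabulary_)  # mapping from term to column index
print(X.toarray())             # dense view of the counts; avoid on large corpora
```

Each row corresponds to one document and each column to one vocabulary term, so the third row records a count of 2 in the "red" column.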

4. Get Feature Names

What is it?
Retrieves the list of vocabulary terms generated by the vectorizer.

Syntax:

features = vectorizer.get_feature_names_out()

Explanation:

Returns an array of terms ordered by column index, so entry i names column i of the matrix.
Useful for inspecting the feature space and for mapping model coefficients back to words.

5. Use in Pipeline

What is it?
Combines the vectorizer with a classifier or regressor in a single ML workflow.

Syntax:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())
])

Explanation:

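Bundles vectorization and modeling so the same preprocessing is applied at fit time and at predict time.
Can be used with GridSearchCV or cross_val_score; the vectorizer is refit on each training fold, which keeps held-out text out of the vocabulary.
Reduces model training and prediction to single fit and predict calls.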
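A pipeline can be passed directly to scikit-learn's model selection tools. The sketch below uses an invented sentiment-style dataset and an arbitrary two-value grid, purely to show the mechanics; the scores it prints carry no meaning on data this small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Illustrative texts and labels (invented for this sketch)
texts = [
    "great movie loved it", "wonderful acting and story",
    "loved the soundtrack", "great fun for everyone",
    "terrible plot and acting", "boring and too long",
    "awful waste of time", "terrible boring film",
]
labels = ["pos"] * 4 + ["neg"] * 4

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression()),
])

# Cross-validation refits the whole pipeline, vectorizer included, on each training fold
scores = cross_val_score(pipeline, texts, labels, cv=2)
print(scores)

# Step parameters are addressed as <step name>__<parameter name>
search = GridSearchCV(pipeline, {"vect__ngram_range": [(1, 1), (1, 2)]}, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

Because the vectorizer lives inside the pipeline, the search can tune text-processing options such as ngram_range alongside classifier settings.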

Real-Life Project: News Topic Classification

Project Overview

Classify news headlines into topics using CountVectorizer and logistic regression.

Code Example

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

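# Sample data: illustrative headlines, three per topic
corpus = [
    "Economy hits record growth",
    "Stock markets rally after strong earnings",
    "Central bank raises interest rates",
    "New sports championship announced",
    "Home team wins the cup final",
    "Star striker signs record transfer",
    "Politics heat up in elections",
    "Parliament votes on new budget law",
    "Candidates clash in televised debate",
]
y = ["business"] * 3 + ["sports"] * 3 + ["politics"] * 3

# Stratified train/test split so every topic appears in both sets
X_train, X_test, y_train, y_test = train_test_split(
    corpus, y, test_size=0.33, random_state=42, stratify=y
)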

# Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())
])

# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

A classification report with precision, recall, and F1-score per class
Overall accuracy of the text classifier (not meaningful on a toy corpus this small)

Common Mistakes to Avoid

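❌ Fitting the vectorizer on the full dataset, or refitting it on test data, instead of fitting on training data only
❌ Passing pre-tokenized lists to a default vectorizer, which expects one raw string per document
❌ Assuming stopwords are removed automatically; lowercasing and punctuation stripping are defaults, but stopword removal requires stop_words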
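Before adding manual cleaning steps, it can help to check what the vectorizer's own preprocessing does to a raw string. A minimal sketch using build_analyzer(), which returns the callable the vectorizer applies to each document; the sample sentence is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = "Hello, World! It's a TEST."

# Default settings: lowercase, split on the default token pattern (2+ word characters)
analyze = CountVectorizer().build_analyzer()
default_tokens = analyze(sample)
print(default_tokens)

# Same analyzer with the built-in English stopword list enabled
analyze_sw = CountVectorizer(stop_words="english").build_analyzer()
filtered_tokens = analyze_sw(sample)
print(filtered_tokens)
```

The default analyzer already lowercases the text and discards punctuation and single-character tokens, while stopwords such as "it" are dropped only when stop_words is set.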

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

