`CountVectorizer` and `TfidfVectorizer` are two fundamental techniques for converting raw text data into numerical features. Both are part of Scikit-learn's `feature_extraction.text` module and are commonly used in natural language processing pipelines.
Key Characteristics
- `CountVectorizer` counts the number of times each word appears.
- `TfidfVectorizer` scales term frequency by how rare the word is across all documents (the sketch after this list contrasts the two).
- Both output sparse matrices used for model training.
- Work well with linear models like Logistic Regression and Naive Bayes.
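To make the contrast concrete, here is a minimal sketch (the three-document corpus is invented for illustration) showing how the same text vectorizes differently under each approach:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]

# CountVectorizer: raw integer counts per word
cv = CountVectorizer()
print(cv.fit_transform(corpus).toarray())
print(cv.get_feature_names_out())  # column order of the matrix above

# TfidfVectorizer: "the", which appears in every document,
# receives a lower weight relative to rarer words
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
```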
Basic Rules
- Always apply preprocessing before vectorization (e.g., lowercasing, stopword removal).
- Use `fit_transform()` on training text; use `transform()` on test text.
- Choose `TfidfVectorizer` for better weighting and improved model performance in many tasks.
Syntax Table
| SL NO | Technique | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Vectorizers | `from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer` | Load vectorizer modules |
| 2 | CountVectorizer | `cv = CountVectorizer()` | Initializes a count vectorizer |
| 3 | TfidfVectorizer | `tfidf = TfidfVectorizer()` | Initializes a TF-IDF vectorizer |
| 4 | Fit and Transform | `X_cv = cv.fit_transform(corpus)` | Converts raw text to a numeric feature matrix |
| 5 | Transform New Text | `X_new = tfidf.transform(["new example"])` | Applies the previously learned vocabulary to new text |
Syntax Explanation
1. Import Vectorizers
What is it?
Imports text vectorization classes from Scikit-learn.
Syntax:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
Explanation:
- `CountVectorizer` transforms documents into count vectors.
- `TfidfVectorizer` does the same but applies term frequency-inverse document frequency weighting.
- Essential imports to use these tools in any text classification or NLP pipeline.
- Allows creation of bag-of-words and weighted feature representations.
- Can be imported once and reused in multiple scripts or notebooks.
2. CountVectorizer
What is it?
Initializes a word-count-based vectorizer to convert text into frequency-based numerical features.
Syntax:
cv = CountVectorizer(stop_words='english')
Explanation:
- Filters out common English stopwords to reduce noise.
- Converts each document into a sparse row vector of word counts.
- Commonly used for Naive Bayes and simple linear classifiers.
- You can customize it (as shown in the sketch after this list) with:
  - `ngram_range=(1, 2)` to capture unigrams and bigrams
  - `max_df` or `min_df` to filter out overly common or rare terms
  - `max_features` to limit the vocabulary size
- Requires text preprocessing beforehand for optimal results (e.g., punctuation removal).
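As referenced above, a minimal sketch of a customized `CountVectorizer` (the corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Stocks rally as markets close",
    "Markets close lower on rate fears",
    "Team wins championship final",
]

cv = CountVectorizer(
    stop_words='english',  # drop common English stopwords
    ngram_range=(1, 2),    # capture unigrams and bigrams
    min_df=1,              # keep terms appearing in at least one document
    max_features=20,       # cap the vocabulary size
)
X_cv = cv.fit_transform(corpus)  # sparse document-term count matrix

print(X_cv.shape)                  # (3, number_of_kept_terms)
print(cv.get_feature_names_out())  # learned vocabulary, including bigrams
```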
3. TfidfVectorizer
What is it?
Creates a TF-IDF vectorizer to assign weights to words based on frequency and rarity.
Syntax:
tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=1000)
Explanation:
- Computes TF-IDF: the product of term frequency and inverse document frequency.
- `ngram_range=(1, 2)` captures unigrams and bigrams, increasing context awareness.
- `max_features=1000` restricts the vocabulary size for performance and overfitting control.
- More advanced than `CountVectorizer`: helps reduce the impact of frequently occurring non-informative terms.
- Automatically normalizes feature values for better compatibility with models like SVM or logistic regression.
- Can incorporate sublinear TF scaling or L2 normalization (see the sketch after this list).
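A minimal sketch combining the options discussed above, including the sublinear TF scaling and L2 normalization mentioned in the last bullet (the corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Stocks rally as markets close",
    "Markets close lower on rate fears",
    "Team wins championship final",
]

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams
    max_features=1000,   # cap the vocabulary size
    sublinear_tf=True,   # use 1 + log(tf) instead of raw term frequency
    norm='l2',           # scale each document vector to unit length (the default)
)
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.shape)  # (3, number_of_kept_terms)
```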
4. Fit and Transform Text
What is it?
Learns the vocabulary from the training corpus and converts the text into a numerical matrix.
Syntax:
X_cv = cv.fit_transform(corpus)
X_tfidf = tfidf.fit_transform(corpus)
Explanation:
- `fit_transform()` both fits the vectorizer to the data and transforms it into a matrix.
- The result is a sparse matrix, which conserves memory.
- Each row is a document; each column is a token or n-gram.
- `.toarray()` can be used to convert the result to a dense format for inspection.
- To avoid data leakage, fit only on the training data and call `.transform()` (never `.fit_transform()`) on test or later data.
- Essential for converting raw text into a form suitable for scikit-learn classifiers (see the sketch after this list).
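A minimal sketch of fitting on a training corpus and inspecting the result (the corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["stocks rally today", "election results declared", "team wins final"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_corpus)  # learns the vocabulary AND transforms in one step

print(X_train.shape)                  # (n_documents, n_terms)
print(type(X_train))                  # a SciPy sparse matrix, which conserves memory
print(X_train.toarray().round(2))     # dense view, for inspection only
print(tfidf.get_feature_names_out())  # one column per learned term
```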
5. Transform New Data
What is it?
Applies the trained vectorizer to convert new/unseen text into numeric features.
Syntax:
X_new = tfidf.transform(["Sample input text"])
Explanation:
- Transforms new documents into the same feature space learned during training.
- Prevents vocabulary mismatch and ensures feature alignment.
- Returns a sparse matrix that can directly be used for prediction.
- Important for evaluating model performance on test sets or live data.
- You must not call `fit_transform()` on new data, as that would overwrite the learned vocabulary (see the sketch after this list).
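A minimal sketch of the pattern, fitting once on training text and then only transforming unseen text (all strings are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["stocks rally today", "election results declared", "team wins final"]
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_corpus)  # fit on training text only

X_new = tfidf.transform(["election rally coverage"])  # reuse the learned vocabulary
print(X_new.shape[1] == X_train.shape[1])  # True: feature spaces are aligned
# A word unseen during training ("coverage") is silently ignored,
# not added to the vocabulary.
```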
Real-Life Project: News Headline Classification
Project Overview
Classify news headlines into categories such as sports, business, and politics.
Code Example
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample corpus (two headlines per category so every class appears in both splits)
corpus = [
    "Stocks rally as markets close",
    "Tech shares lift markets to record high",
    "Election results declared",
    "Parliament passes new budget bill",
    "Team wins championship",
    "Striker scores twice in cup final",
]
y = ["business", "business", "politics", "politics", "sports", "sports"]

# Split the raw text first so the vectorizer never sees the test documents
train_texts, test_texts, y_train, y_test = train_test_split(
    corpus, y, test_size=0.5, random_state=42, stratify=y
)

# TF-IDF: fit on the training text only, then transform the held-out text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
Expected Output
- An accuracy score for the held-out test headlines
- Optionally, a confusion matrix or classification report (a minimal sketch follows below)
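A minimal extension for those optional reports, reusing `y_test` and `y_pred` from the code above:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall/F1 and the raw confusion matrix for the predictions above
print(classification_report(y_test, y_pred, zero_division=0))
print(confusion_matrix(y_test, y_pred))
```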
Common Mistakes to Avoid
- ❌ Re-fitting the vectorizer on test data (causes data leakage)
- ❌ Using raw strings without token preprocessing
- ❌ Forgetting to use `.transform()` for new/unseen text