Sentiment Analysis Project using Scikit-learn

CountVectorizer and TfidfVectorizer are two fundamental techniques for converting raw text data into numerical features. Both are part of Scikit-learn’s feature_extraction.text module and are commonly used in natural language processing pipelines.

Key Characteristics

  • CountVectorizer counts the number of times each word appears.
  • TfidfVectorizer scales term frequency by how rare the word is across all documents.
  • Both output sparse matrices used for model training.
  • Both work well with linear classifiers such as Logistic Regression and with Multinomial Naive Bayes.

Basic Rules

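  • Decide on preprocessing before vectorization. Both vectorizers lowercase text by default and can remove stopwords through the stop_words parameter; steps such as stemming or lemmatization must be applied separately.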
  • Use fit_transform() on training text; use transform() on test text.
  • Prefer TfidfVectorizer when very frequent, uninformative words dominate the counts; it often, though not always, improves model performance.

Syntax Table

SL NO | Technique | Syntax Example | Description
1 | Import Vectorizers | from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer | Loads the vectorizer classes
2 | CountVectorizer | cv = CountVectorizer() | Initializes a count vectorizer
3 | TfidfVectorizer | tfidf = TfidfVectorizer() | Initializes a TF-IDF vectorizer
4 | Fit and Transform | X_cv = cv.fit_transform(corpus) | Converts raw text to a numeric feature matrix
5 | Transform New Text | X_new = tfidf.transform(["new example"]) | Applies the previously learned vocabulary to new text

Syntax Explanation

1. Import Vectorizers

What is it?
Imports text vectorization classes from Scikit-learn.

Syntax:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Explanation:

  • CountVectorizer transforms documents into count vectors.
  • TfidfVectorizer does the same but applies term frequency-inverse document frequency weighting.
  • Essential imports to use these tools in any text classification or NLP pipeline.
  • Allows creation of bag-of-words and weighted feature representations.
  • Can be imported once and reused in multiple scripts or notebooks.

2. CountVectorizer

What is it?
Initializes a word-count-based vectorizer to convert text into frequency-based numerical features.

Syntax:

cv = CountVectorizer(stop_words='english')

Explanation:

  • Filters out common English stopwords to reduce noise.
  • Converts each document into a sparse row vector of word counts.
  • Commonly used for Naive Bayes and simple linear classifiers.
  • You can customize it with:
    • ngram_range=(1,2) to capture unigrams and bigrams
    • max_df or min_df to filter out overly common or rare terms
    • max_features to limit the vocabulary size
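  • The default tokenizer already lowercases text and ignores punctuation; extra preprocessing (e.g., stemming or lemmatization) has to be done beforehand or supplied through the preprocessor or tokenizer parameters.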

3. TfidfVectorizer

What is it?
Creates a TF-IDF vectorizer to assign weights to words based on frequency and rarity.

Syntax:

tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=1000)

Explanation:

  • Computes TF-IDF: the product of term frequency and inverse document frequency.
  • ngram_range=(1,2) captures unigrams and bigrams, increasing context awareness.
  • max_features=1000 keeps only the 1,000 most frequent terms in the corpus, which limits memory use and can reduce overfitting.
  • Equivalent to CountVectorizer followed by TfidfTransformer; the IDF weighting reduces the impact of terms that occur in most documents and carry little information.
  • By default, L2-normalizes each document vector (norm='l2'), which suits models such as SVM or logistic regression.
  • Sublinear TF scaling can be enabled with sublinear_tf=True, which replaces tf with 1 + log(tf).
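As a check on what the weighting does, the sketch below recomputes TF-IDF by hand and compares it with TfidfVectorizer. It assumes the default settings (smooth_idf=True, norm='l2', sublinear_tf=False), under which scikit-learn uses idf = ln((1 + n) / (1 + df)) + 1; the corpus is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "banana" appears in every document, the rest in one.
docs = ["apple banana apple", "banana cherry", "banana date"]

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF (default settings)

manual = counts * idf                                     # tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)   # L2-normalize each row

auto = TfidfVectorizer().fit_transform(docs).toarray()

print(np.allclose(manual, auto))
print(np.linalg.norm(auto, axis=1))
```

Changing any of the defaults (for example smooth_idf=False or sublinear_tf=True) changes the formula, so the manual computation would need to be adjusted to match.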

4. Fit and Transform Text

What is it?
Learns the vocabulary from the training corpus and converts that corpus into a numerical matrix.

Syntax:

X_cv = cv.fit_transform(corpus)
X_tfidf = tfidf.fit_transform(corpus)

Explanation:

  • fit_transform() both fits the vectorizer to the data and transforms it into a matrix.
  • The result is a sparse matrix which conserves memory.
  • Each row is a document, each column is a token or n-gram.
  • .toarray() can be used to convert the result to dense format for inspection.
  • Fit the vectorizer on the training data only; calling fit() or fit_transform() again on validation or test data relearns the vocabulary and IDF weights and leaks information from that data.
  • Essential for converting raw text into a form suitable for scikit-learn classifiers.
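A short sketch of what fit_transform() returns may help here. The corpus is hypothetical; vocabulary_ is the fitted attribute that maps each token to its column index.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat", "the dog sat", "the cat ran"]

cv = CountVectorizer()
X_cv = cv.fit_transform(corpus)  # learn the vocabulary, then build the matrix

dense = X_cv.toarray()           # dense copy, for inspection only

print(sparse.issparse(X_cv), X_cv.shape)  # one row per document, one column per token
print(cv.vocabulary_)                     # token -> column index
print(dense)
```

Calling .toarray() is fine on a toy corpus like this one, but on a real corpus the dense matrix can be very large, so it is usually better to keep the sparse form.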

5. Transform New Data

What is it?
Applies the trained vectorizer to convert new/unseen text into numeric features.

Syntax:

X_new = tfidf.transform(["Sample input text"])

Explanation:

  • Transforms new documents into the same feature space learned during training.
  • Prevents vocabulary mismatch and ensures feature alignment; words that were not seen during training are ignored.
  • Returns a sparse matrix that can directly be used for prediction.
  • Important for evaluating model performance on test sets or live data.
  • Do not call fit_transform() on new data: it would replace the learned vocabulary and IDF weights, so the features would no longer match what the model was trained on.
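The sketch below illustrates how transform() treats words it has never seen. Both the training sentences and the new sentences are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training text, for illustration only.
train_docs = ["stocks rally today", "team wins final"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_docs)  # vocabulary is learned here

# New text: "crash", "weather" and "forecast" were never seen in training.
X_new = tfidf.transform(["stocks crash", "weather forecast"])
new_dense = X_new.toarray()

print(X_train.shape, X_new.shape)  # same number of columns
print(new_dense)
```

The first new document keeps only the column for "stocks", and the second becomes an all-zero row because none of its words are in the learned vocabulary.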

Real-Life Project: News Headline Classification

Project Overview

Classify news headlines into categories like sports, business, politics.

Code Example

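from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy corpus for illustration only; a real project needs far more headlines per class
corpus = [
    "Stocks rally as markets close",
    "Central bank raises interest rates",
    "Election results declared",
    "Parliament passes new budget bill",
    "Team wins championship",
    "Striker scores twice in final",
]
y = ["business", "business", "politics", "politics", "sports", "sports"]

# Split the raw text first so the vectorizer never sees the test headlines
X_train_text, X_test_text, y_train, y_test = train_test_split(
    corpus, y, test_size=0.5, stratify=y, random_state=42
)

# TF-IDF: learn vocabulary and IDF weights from the training text only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))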

Expected Output

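  • Prints the accuracy score on the held-out headlines. With a corpus this small the number is not meaningful; it only confirms that the pipeline runs.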
  • Optionally, show confusion matrix or classification report
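A confusion matrix and classification report could be produced along these lines. The y_test and y_pred lists below are hypothetical stand-ins, not results from the project; in practice they would come from the train/test split and model.predict().

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_test = ["business", "politics", "sports", "sports"]
y_pred = ["business", "sports", "sports", "sports"]
labels = ["business", "politics", "sports"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)

# zero_division=0 avoids warnings for classes that were never predicted.
report = classification_report(y_test, y_pred, labels=labels, zero_division=0)
print(report)
```

Passing labels explicitly fixes the row and column order, which makes the matrix easier to read when some class is missing from the predictions.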

Common Mistakes to Avoid

  • ❌ Re-fitting the vectorizer on test data (causes data leakage)
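  • ❌ Passing noisy raw text to the vectorizer without checking how it is tokenized (the defaults lowercase text and drop punctuation and single-character tokens)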
  • ❌ Forgetting to use .transform() for new/unseen text
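One way to make the first and last mistakes hard to commit is to wrap the vectorizer and the classifier in a scikit-learn Pipeline, so that fit() only ever sees training text and predict() only ever calls transform(). This is a sketch of that approach rather than something the project above requires; the extra headline passed to predict() is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "Stocks rally as markets close",
    "Election results declared",
    "Team wins championship",
]
train_labels = ["business", "politics", "sports"]

# The pipeline fits the vectorizer and the model together on training text.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

# predict() applies transform() internally; the vectorizer is not re-fitted.
pred = clf.predict(["Team wins the cup"])  # hypothetical new headline
print(pred)
```

The same pipeline object can be passed to cross-validation helpers, which then refit the vectorizer inside each fold instead of on the full dataset.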
