Dimensionality Reduction with PCA in Scikit-learn

Principal Component Analysis (PCA) is a widely used dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional form while retaining as much variance as possible. In Scikit-learn, PCA is implemented through the PCA class in the sklearn.decomposition module.

Key Characteristics of PCA

  • Variance Preservation: Maximizes the variance retained in the reduced dimensions.
  • Linear Transformation: Projects data onto orthogonal axes called principal components (verified in the sketch after this list).
  • Unsupervised Technique: Does not use class labels.
  • Useful for Visualization: Reduces data to 2D or 3D for plotting.
  • Preprocessing Step: Often used before clustering or classification.
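
As a quick illustration of the "linear transformation" point above, here is a minimal sketch on synthetic data (the array sizes are arbitrary assumptions) that fits a PCA and confirms the principal axes are orthonormal:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 5 correlated features (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA(n_components=2)
pca.fit(X)

# The rows of components_ are orthonormal direction vectors
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))  # True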

Basic Rules for Using PCA

  • Standardize the features before applying PCA (a sketch follows this list).
  • Use PCA primarily for numerical, continuous data.
  • Choose the number of components to retain based on explained variance.
  • PCA is sensitive to outliers.
  • Avoid applying PCA blindly—check interpretability and effectiveness.
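
A minimal sketch of the first rule, using synthetic data with deliberately mismatched feature scales (an illustrative assumption):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative data: two independent features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])

# Without scaling, the large-variance feature would dominate the components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # Roughly equal shares after scaling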

Syntax Table

SL NO | Function           | Syntax Example                         | Description
----- | ------------------ | -------------------------------------- | ------------------------------------------
1     | Import PCA         | from sklearn.decomposition import PCA  | Load the PCA class
2     | Instantiate PCA    | pca = PCA(n_components=2)              | Set the number of components
3     | Fit and Transform  | X_pca = pca.fit_transform(X_scaled)    | Reduce dimensionality of data
4     | Explained Variance | pca.explained_variance_ratio_          | View variance retained by each component
5     | Components Matrix  | pca.components_                        | Access principal component vectors

Syntax Explanation

1. Import PCA

What is it? Load the PCA class from the decomposition module.

from sklearn.decomposition import PCA

Explanation:

  • This statement loads the PCA class from the sklearn.decomposition module.
  • The PCA class is the Scikit-learn estimator that performs principal component analysis for dimensionality reduction.

2. Instantiate PCA

What is it? Define how many principal components to retain in the reduced dataset.

pca = PCA(n_components=2)

Explanation:

  • n_components determines the dimensionality of the output:
    • If an integer: the number of principal components to keep.
    • If a float between 0 and 1: the minimum fraction of total variance to preserve; Scikit-learn selects the smallest number of components that explains at least that fraction.
  • Optional parameters:
    • svd_solver: Selects the SVD implementation; the default 'auto' chooses one based on the data's shape and n_components.
    • whiten: When True, scales each transformed component to unit variance, which some downstream estimators benefit from.
  • Choosing an optimal number of components retains most of the information while simplifying the model.
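
A short sketch of the float form, on synthetic low-rank data (an illustrative assumption), keeping enough components to preserve at least 95% of the variance:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with roughly 3 dominant directions (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

# Keep the smallest number of components explaining at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1])                   # Number of components PCA selected
print(pca.explained_variance_ratio_.sum())  # At least 0.95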

3. Fit and Transform

What is it? Learn the principal components from the dataset and apply the transformation.

X_pca = pca.fit_transform(X_scaled)

Explanation:

  • X_scaled is the standardized dataset.
  • fit_transform does two tasks:
    1. Learns the principal components.
    2. Projects the data onto those components.
  • The result X_pca is a 2D array of shape (n_samples, n_components).
  • This output can be used for further modeling or visualization.
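
A small sketch of the input and output shapes (the array sizes are arbitrary assumptions):

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a standardized dataset: 150 samples, 6 features
X_scaled = np.random.default_rng(2).normal(size=(150, 6))

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)  # Learn components, then project

print(X_scaled.shape)  # (150, 6)
print(X_pca.shape)     # (150, 2)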

4. Explained Variance

What is it? Shows the proportion of dataset variance explained by each component.

pca.explained_variance_ratio_

Explanation:

  • This attribute is a NumPy array with one entry per principal component.
  • Each entry gives the fraction of the total variance explained by that component.
  • Use this to analyze how many components are sufficient (e.g., keep enough components to explain 90–95% variance).
  • A scree plot is a common way to visualize this distribution.
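
The sketch below (synthetic data, an illustrative assumption) plots the cumulative explained variance so you can read off where a 90–95% threshold is crossed:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = np.random.default_rng(3).normal(size=(200, 8))

# Fit with all components to inspect the full variance spectrum
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
plt.axhline(0.95, linestyle='--')  # 95% variance threshold
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()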

5. Components Matrix

What is it? Shows the directions (vectors) of the principal components.

pca.components_

Explanation:

  • A matrix of shape (n_components, n_features).
  • Each row is a principal component; each column corresponds to an original feature, so each entry is that feature's weight (loading) in the component.
  • Useful for understanding feature importance and directions in reduced space.
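
To tie components back to the original features, a sketch like this (the feature names are hypothetical) wraps components_ in a labeled DataFrame:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

feature_names = ['age', 'income', 'spending']  # Hypothetical names for illustration
X = np.random.default_rng(4).normal(size=(100, 3))

pca = PCA(n_components=2).fit(X)

# Rows are components, columns are original features (the loadings)
loadings = pd.DataFrame(pca.components_, columns=feature_names, index=['PC1', 'PC2'])
print(loadings)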

Real-Life Project: Visualizing Customer Segments with PCA

Project Name

PCA-Based Dimensionality Reduction for Mall Customers

Project Overview

This project applies PCA to reduce a customer dataset with multiple attributes to two dimensions for visualization, which helps reveal patterns and relationships in the data.

Project Goal

  • Reduce dimensionality of customer features.
  • Visualize customer distribution in 2D.
  • Prepare data for clustering or classification tasks.

Code for This Project

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load dataset
data = pd.read_csv('Mall_Customers.csv')
X = data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], edgecolor='k')
plt.title('PCA of Mall Customer Features')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

Expected Output

  • A 2D scatter plot displaying customer distribution after PCA.
  • Simplified representation of complex customer data.
  • Visual clues for potential clustering.
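
Since the reduced data is intended to feed a later clustering step, the following is a hedged continuation of the project code above that applies K-Means in the PCA space (the choice of 5 clusters is an assumption, not derived from the data):

from sklearn.cluster import KMeans

# Cluster in the reduced 2D space; k=5 is an illustrative choice
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, edgecolor='k')
plt.title('K-Means Clusters in PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()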

Common Mistakes to Avoid

  • ❌ Not standardizing data before applying PCA.
  • ❌ Misinterpreting component axes as original features.
  • ❌ Choosing too few components and losing important variance.
  • ❌ Using PCA when interpretability of features is crucial.

Further Reading Recommendation

Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan