Principal Component Analysis (PCA) is a widely used dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional form while retaining as much variance as possible. In Scikit-learn, PCA is implemented through the PCA class in the sklearn.decomposition module.
Key Characteristics of PCA
- Variance Preservation: Maximizes the variance retained in the reduced dimensions.
- Linear Transformation: Projects data onto orthogonal axes (principal components).
- Unsupervised Technique: Does not use class labels.
- Useful for Visualization: Reduces data to 2D or 3D for plotting.
- Preprocessing Step: Often used before clustering or classification.
Basic Rules for Using PCA
- Standardize the features before applying PCA (see the pipeline sketch after this list).
- Use PCA primarily for numerical, continuous data.
- Choose the number of components to retain based on explained variance.
- PCA is sensitive to outliers.
- Avoid applying PCA blindly—check interpretability and effectiveness.
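To make the first three rules concrete, here is a minimal sketch (with placeholder data; the variable names are illustrative) that chains standardization and PCA in a Pipeline and keeps enough components to explain 95% of the variance:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Placeholder data standing in for a real numeric feature matrix.
X = np.random.rand(100, 10)

# Standardize first, then let PCA keep enough components for 95% variance.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)  # (100, k), where the k components explain >= 95% of variance
```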
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import PCA | `from sklearn.decomposition import PCA` | Load the PCA class |
| 2 | Instantiate PCA | `pca = PCA(n_components=2)` | Set the number of components |
| 3 | Fit and Transform | `X_pca = pca.fit_transform(X_scaled)` | Reduce the dimensionality of the data |
| 4 | Explained Variance | `pca.explained_variance_ratio_` | View the variance retained by each component |
| 5 | Components Matrix | `pca.components_` | Access the principal component vectors |
Syntax Explanation
1. Import PCA
What is it? Load the PCA class from the decomposition module.
```python
from sklearn.decomposition import PCA
```
Explanation:
- This statement gives you access to the PCA toolset in Scikit-learn.
- With the class imported, you can perform dimensionality reduction on a dataset using principal component analysis.
2. Instantiate PCA
What is it? Define how many principal components to retain in the reduced dataset.
```python
pca = PCA(n_components=2)
```
Explanation:
- n_components determines the dimensionality of the output:
  - If an integer: the number of principal components to keep.
  - If a float between 0 and 1: the fraction of total variance to preserve.
- Optional parameters:
  - svd_solver: selects the SVD algorithm; the default 'auto' picks one based on the input size.
  - whiten: when set to True, scales the components so the outputs are uncorrelated and have unit variance.
- Choosing an optimal number of components helps retain information while simplifying the model.
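A short sketch of these instantiation options (the values shown are illustrative, not recommendations):

```python
from sklearn.decomposition import PCA

pca_int = PCA(n_components=2)                      # keep exactly 2 components
pca_var = PCA(n_components=0.95)                   # keep enough components for 95% of the variance
pca_white = PCA(n_components=2, whiten=True)       # scale outputs to unit variance
pca_full = PCA(n_components=2, svd_solver='full')  # force the exact full-SVD solver
```

Note that when n_components is a float, the full SVD solver is used so the variance threshold can be applied.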
3. Fit and Transform
What is it? Learn the principal components from the dataset and apply the transformation.
```python
X_pca = pca.fit_transform(X_scaled)
```
Explanation:
- X_scaled is the standardized dataset.
- fit_transform does two tasks:
  - Learns the principal components.
  - Projects the data onto those components.
- The result X_pca is a 2D array with shape (n_samples, n_components).
- This output can be used for further modeling or visualization.
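For instance, a quick shape check on hypothetical data with 200 samples and 5 features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 samples, 5 features.
X = np.random.rand(200, 5)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)  # (200, 2): one row per sample, one column per component
```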
4. Explained Variance
What is it? Shows the proportion of dataset variance explained by each component.
```python
pca.explained_variance_ratio_
```
Explanation:
- This attribute returns an array of floats, one per principal component, sorted in decreasing order.
- Use this to analyze how many components are sufficient (e.g., keep enough components to explain 90–95% variance).
- Plotting a scree plot is a common practice to visualize this distribution.
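A sketch of both practices, using placeholder data and a hypothetical 95% threshold:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit PCA with all components on placeholder data to inspect the spectrum.
X = np.random.rand(200, 8)
pca = PCA().fit(StandardScaler().fit_transform(X))

# Cumulative variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cumulative >= 0.95)) + 1
print(f"{k} components explain >= 95% of the variance")

# Scree plot: explained variance ratio per component.
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker='o')
plt.xlabel('Component')
plt.ylabel('Explained variance ratio')
plt.show()
```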
5. Components Matrix
What is it? Shows the directions (vectors) of the principal components.
```python
pca.components_
```
Explanation:
- A matrix of shape (n_components, n_features).
- Each row is a principal component; each column is that feature's weight in the component.
- Useful for understanding feature importance and directions in reduced space.
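One way to inspect this, sketched with placeholder data and the feature names from the project below:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

feature_names = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
X = np.random.rand(200, 3)  # placeholder standing in for the real columns
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Rows are components, columns are per-feature weights (loadings).
loadings = pd.DataFrame(pca.components_, columns=feature_names,
                        index=['PC1', 'PC2'])
print(loadings)
```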
Real-Life Project: Visualizing Customer Segments with PCA
Project Name
PCA-Based Dimensionality Reduction for Mall Customers
Project Overview
This project uses PCA to reduce a customer dataset with multiple attributes to 2D for visualization purposes. This aids in understanding patterns and relationships in the data.
Project Goal
- Reduce dimensionality of customer features.
- Visualize customer distribution in 2D.
- Prepare data for clustering or classification tasks.
Code for This Project
```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Load dataset
data = pd.read_csv('Mall_Customers.csv')
X = data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Plot results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], edgecolor='k')
plt.title('PCA of Mall Customer Features')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()
```
Expected Output
- A 2D scatter plot displaying customer distribution after PCA.
- Simplified representation of complex customer data.
- Visual clues for potential clustering.
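As a possible next step (not part of the original project), the reduced features could feed a clustering model. A brief sketch continuing from the code above, with an assumed cluster count of 5:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical follow-up: cluster customers in the reduced 2D space,
# reusing X_pca from the project code; n_clusters=5 is an assumption.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, edgecolor='k')
plt.title('K-Means Clusters in PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
```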
Common Mistakes to Avoid
- ❌ Not standardizing data before applying PCA.
- ❌ Misinterpreting component axes as original features.
- ❌ Choosing too few components and losing important variance.
- ❌ Using PCA when interpretability of features is crucial.
Further Reading Recommendation
Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
