Introduction to Unsupervised Learning in Scikit-learn

Unsupervised learning is a type of machine learning that deals with unlabeled data. Unlike supervised learning, where the algorithm learns from input-output pairs, unsupervised learning algorithms aim to find patterns, groupings, or structure in the data without predefined labels. Scikit-learn provides robust tools to perform unsupervised tasks such as clustering, dimensionality reduction, and anomaly detection.

Key Characteristics of Unsupervised Learning

  • No Labels Required: Operates purely on input features.
  • Pattern Discovery: Identifies structure, groupings, or trends in the data.
  • Versatile Applications: Used in customer segmentation, recommendation engines, and anomaly detection.
  • Techniques Include: clustering (e.g., KMeans, DBSCAN), dimensionality reduction (e.g., PCA), and more.
  • Scalability: Many algorithms scale well with larger datasets.

Basic Rules for Using Unsupervised Learning

  • Standardize or normalize features before applying clustering or PCA (see the sketch after this list).
  • Use dimensionality reduction to visualize high-dimensional data.
  • Select the number of clusters using domain knowledge or metrics like the silhouette score.
  • Avoid applying supervised metrics like accuracy directly—use clustering-specific scores.
  • Evaluate cluster validity and stability with different random states.
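
A minimal sketch of the standardization step mentioned above, using StandardScaler on a small array that is made up purely for demonstration:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales
X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))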

Syntax Table

SL NO | Function                      | Syntax Example                         | Description
------|-------------------------------|----------------------------------------|------------------------------------------------
1     | KMeans Clustering             | KMeans(n_clusters=3)                   | Clusters data into a specified number of groups
2     | Principal Component Analysis  | PCA(n_components=2)                    | Reduces data to fewer dimensions
3     | Fit Model                     | model.fit(X)                           | Learns structure from input features
4     | Predict or Transform          | model.transform(X) / model.predict(X)  | Transforms data or assigns cluster labels
5     | Silhouette Score              | silhouette_score(X, labels)            | Evaluates clustering quality

Syntax Explanation

1. KMeans Clustering

  • What is it? Partitions data into k distinct groups by assigning each sample to its nearest cluster centroid.
  • Syntax:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=42)  # fix the seed so runs are reproducible
model.fit(X)                                   # compute centroids from the features
labels = model.labels_                         # cluster index for each sample
  • Explanation:
    • n_clusters: The number of desired clusters.
    • random_state: Ensures reproducibility of results.
    • fit(): Computes cluster centers and assigns labels.
    • labels_: Array containing the cluster index for each sample.
    • inertia_: The sum of squared distances of samples to their closest cluster center, used to evaluate the compactness of the clusters.
    • KMeans assumes roughly spherical, similarly sized clusters and is sensitive to feature scaling, so standardizing the data beforehand is essential.
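
Because inertia_ measures compactness, a common heuristic for choosing k is the "elbow" method: fit KMeans for several values of k and look for the point where inertia stops dropping sharply. A minimal sketch, using make_blobs to generate synthetic data purely for illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Record inertia for a range of k; the "elbow" in this curve suggests a good k
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)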

2. Principal Component Analysis (PCA)

  • What is it? A linear technique to reduce the number of features in a dataset while preserving the most variance.
  • Syntax:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)         # keep the two directions of highest variance
X_reduced = pca.fit_transform(X)  # learn the projection and apply it
  • Explanation:
    • n_components: Number of new dimensions to keep; useful for visualization (2D or 3D).
    • fit_transform(): Learns the projection and applies it.
    • explained_variance_ratio_: Shows how much variance each component captures, useful for deciding the number of components to retain.
    • PCA is sensitive to feature scale, so standardize the features before applying it.
    • Helps visualize clusters and reduce noise in high-dimensional data.
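
A short sketch of using explained_variance_ratio_ to decide how many components to keep; the iris dataset is used here only as a convenient example:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)        # variance captured by each component
print(pca.explained_variance_ratio_.sum())  # total variance retained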

3. Fit Model (General)

  • What is it? The step where the unsupervised algorithm learns from the input data.
  • Syntax:
model.fit(X)
  • Explanation:
    • This method applies to clustering, dimensionality reduction, and decomposition models.
    • The model identifies intrinsic patterns without using labeled outputs.
    • After fitting, models like KMeans expose .labels_ or .cluster_centers_; PCA exposes .components_.
    • Fitting is a key part of the training phase and often followed by transform or prediction steps.
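
A brief sketch of the attributes exposed after fitting, again on synthetic make_blobs data used only for illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:10])      # cluster index per sample
print(km.cluster_centers_)  # learned centroids

pca = PCA(n_components=2).fit(X)
print(pca.components_)      # principal axes in feature space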

4. Predict or Transform

  • What is it? After training, the model can predict cluster labels or transform the data to a new space.
  • Syntax:
# Clustering
labels = model.predict(X)

# Dimensionality Reduction
X_new = model.transform(X)
  • Explanation:
    • predict(): Assigns cluster indices to each sample (e.g., in KMeans).
    • transform(): Projects original data to a new feature space (e.g., in PCA).
    • Ensures model reusability after initial training (fit).
    • fit_transform() combines both steps; call it once on the training data, then reuse transform() or predict() on new samples.
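
A sketch of the fit-once, reuse-later pattern: fit_transform() on the data used for fitting, then plain transform() and predict() on new samples (all arrays here are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 4))  # data used to fit the models
X_new = rng.normal(size=(5, 4))      # unseen samples

pca = PCA(n_components=2)
X_train_2d = pca.fit_transform(X_train)  # learn the projection and apply it
X_new_2d = pca.transform(X_new)          # reuse the learned projection

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train_2d)
print(km.predict(X_new_2d))              # assign new samples to existing clusters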

5. Silhouette Score

  • What is it? A metric that evaluates clustering quality by comparing how close each point is to points in its own cluster versus points in the nearest neighboring cluster.
  • Syntax:
from sklearn.metrics import silhouette_score
score = silhouette_score(X, labels)
  • Explanation:
    • Score ranges from -1 to 1; higher is better.
    • Values near +1 indicate well-clustered data; 0 suggests overlapping clusters; -1 implies incorrect clustering.
    • Useful for determining the ideal number of clusters in KMeans.
    • Best used on standardized data so that distances are measured consistently across features.
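
A sketch of scanning several values of k and keeping the one with the highest silhouette score, on synthetic blobs used for illustration only:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# A higher silhouette score indicates better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))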

Real-Life Project: Customer Segmentation

Project Name

Clustering E-Commerce Customers

Project Overview

Segment e-commerce customers into distinct groups based on their spending behavior using KMeans.

Project Goal

  • Cluster customers by annual income and spending score.
  • Visualize and interpret group characteristics.
  • Optimize cluster count using silhouette score.

Code for This Project

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Load and prepare data
data = pd.read_csv('ecommerce_customers.csv')
X = data[['Annual Income', 'Spending Score']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit KMeans (n_init set explicitly so results do not vary across scikit-learn versions)
model = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = model.fit_predict(X_scaled)

# Evaluate
score = silhouette_score(X_scaled, labels)
print("Silhouette Score:", score)

# Visualize
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.title('Customer Segments')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.show()
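
To interpret the segments in their original units, the scaled centroids can be mapped back through the fitted scaler. A short optional addition to the script above (it reuses the scaler, model, and pd objects already defined):

# Map centroids from standardized space back to original units
centers_original = scaler.inverse_transform(model.cluster_centers_)
print(pd.DataFrame(centers_original, columns=['Annual Income', 'Spending Score']))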

Expected Output

  • 5 well-separated clusters
  • Visual scatter plot showing segments
  • Silhouette score indicating clustering quality

Common Mistakes to Avoid

  • ❌ Not scaling data before clustering
  • ❌ Choosing too many/few clusters arbitrarily
  • ❌ Using KMeans on non-spherical distributions
  • ❌ Relying on cluster labels as ground truth for supervised models

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon

Also explore: