Introduction to Unsupervised Learning in Scikit-learn

Unsupervised learning is a type of machine learning that deals with unlabeled data. Unlike supervised learning, where the algorithm learns from input-output pairs, unsupervised learning algorithms aim to find patterns, groupings, or structure in the data without predefined labels. Scikit-learn provides robust tools to perform unsupervised tasks such as clustering, dimensionality reduction, and anomaly detection.

Key Characteristics of Unsupervised Learning

  • No Labels Required: Operates purely on input features.
  • Pattern Discovery: Identifies structure, groupings, or trends in the data.
  • Versatile Applications: Used in customer segmentation, recommendation engines, and anomaly detection.
  • Techniques Include: clustering (e.g., KMeans, DBSCAN), dimensionality reduction (e.g., PCA), and more.
  • Scalability: Many algorithms scale well with larger datasets.

Basic Rules for Using Unsupervised Learning

  • Standardize or normalize features before applying clustering or PCA (see the sketch after this list).
  • Use dimensionality reduction to visualize high-dimensional data.
  • Select the number of clusters using domain knowledge or metrics like the silhouette score.
  • Avoid applying supervised metrics like accuracy directly—use clustering-specific scores.
  • Evaluate cluster validity and stability with different random states.
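
A minimal sketch of the standardization step mentioned above, using StandardScaler on a small array that is made up purely for demonstration:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales
X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))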

Syntax Table

SL NO | Function                      | Syntax Example                         | Description
------|-------------------------------|----------------------------------------|------------------------------------------------
1     | KMeans Clustering             | KMeans(n_clusters=3)                   | Clusters data into a specified number of groups
2     | Principal Component Analysis  | PCA(n_components=2)                    | Reduces data to fewer dimensions
3     | Fit Model                     | model.fit(X)                           | Learns structure from input features
4     | Predict or Transform          | model.transform(X) / model.predict(X)  | Transforms data or assigns cluster labels
5     | Silhouette Score              | silhouette_score(X, labels)            | Evaluates clustering quality

Syntax Explanation

1. KMeans Clustering

  • What is it? Partitions data into k distinct groups by assigning each sample to its nearest cluster centroid.
  • Syntax:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=42)  # fix the seed so runs are reproducible
model.fit(X)                                   # compute centroids from the features
labels = model.labels_                         # cluster index for each sample
  • Explanation:
    • n_clusters: The number of desired clusters.
    • random_state: Ensures reproducibility of results.
    • fit(): Computes cluster centers and assigns labels.
    • labels_: Array containing the cluster index for each sample.
    • inertia_: The sum of squared distances of samples to their closest cluster center, used to evaluate the compactness of the clusters.
    • KMeans assumes roughly spherical, similarly sized clusters and is sensitive to feature scaling, so standardizing the data beforehand is essential.
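
Because inertia_ measures compactness, a common heuristic for choosing k is the "elbow" method: fit KMeans for several values of k and look for the point where inertia stops dropping sharply. A minimal sketch, using make_blobs to generate synthetic data purely for illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Record inertia for a range of k; the "elbow" in this curve suggests a good k
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)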

2. Principal Component Analysis (PCA)

  • What is it? A linear technique to reduce the number of features in a dataset while preserving the most variance.
  • Syntax:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)         # keep the two directions of highest variance
X_reduced = pca.fit_transform(X)  # learn the projection and apply it
  • Explanation:
    • n_components: Number of new dimensions to keep; useful for visualization (2D or 3D).
    • fit_transform(): Learns the projection and applies it.
    • explained_variance_ratio_: Shows how much variance each component captures, useful for deciding the number of components to retain.
    • PCA is sensitive to feature scale, so standardize the features before applying it.
    • Helps visualize clusters and reduce noise in high-dimensional data.
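
A short sketch of using explained_variance_ratio_ to decide how many components to keep; the iris dataset is used here only as a convenient example:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)        # variance captured by each component
print(pca.explained_variance_ratio_.sum())  # total variance retained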

3. Fit Model (General)

  • What is it? The step where the unsupervised algorithm learns from the input data.
  • Syntax:
model.fit(X)
  • Explanation:
    • This method applies to clustering, dimensionality reduction, and decomposition models.
    • The model identifies intrinsic patterns without using labeled outputs.
    • After fitting, models like KMeans expose .labels_ or .cluster_centers_; PCA exposes .components_.
    • Fitting is a key part of the training phase and often followed by transform or prediction steps.
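
A brief sketch of the attributes exposed after fitting, again on synthetic make_blobs data used only for illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:10])      # cluster index per sample
print(km.cluster_centers_)  # learned centroids

pca = PCA(n_components=2).fit(X)
print(pca.components_)      # principal axes in feature space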

4. Predict or Transform

  • What is it? After training, the model can predict cluster labels or transform the data to a new space.
  • Syntax:
# Clustering
labels = model.predict(X)

# Dimensionality Reduction
X_new = model.transform(X)
  • Explanation:
    • predict(): Assigns cluster indices to each sample (e.g., in KMeans).
    • transform(): Projects original data to a new feature space (e.g., in PCA).
    • Ensures model reusability after initial training (fit).
    • fit_transform() combines both steps; call it once on the training data, then reuse transform() or predict() on new samples.
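
A sketch of the fit-once, reuse-later pattern: fit_transform() on the data used for fitting, then plain transform() and predict() on new samples (all arrays here are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 4))  # data used to fit the models
X_new = rng.normal(size=(5, 4))      # unseen samples

pca = PCA(n_components=2)
X_train_2d = pca.fit_transform(X_train)  # learn the projection and apply it
X_new_2d = pca.transform(X_new)          # reuse the learned projection

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train_2d)
print(km.predict(X_new_2d))              # assign new samples to existing clusters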

5. Silhouette Score

  • What is it? A metric that evaluates clustering quality by comparing how close each point is to points in its own cluster versus points in the nearest neighboring cluster.
  • Syntax:
from sklearn.metrics import silhouette_score
score = silhouette_score(X, labels)
  • Explanation:
    • Score ranges from -1 to 1; higher is better.
    • Values near +1 indicate well-clustered data; 0 suggests overlapping clusters; -1 implies incorrect clustering.
    • Useful for determining the ideal number of clusters in KMeans.
    • Best used on standardized data so that distances are measured consistently across features.
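
A sketch of scanning several values of k and keeping the one with the highest silhouette score, on synthetic blobs used for illustration only:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# A higher silhouette score indicates better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))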

Real-Life Project: Customer Segmentation

Project Name

Clustering E-Commerce Customers

Project Overview

Segment e-commerce customers into distinct groups based on their spending behavior using KMeans.

Project Goal

  • Cluster customers by annual income and spending score.
  • Visualize and interpret group characteristics.
  • Optimize cluster count using silhouette score.

Code for This Project

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Load and prepare data
data = pd.read_csv('ecommerce_customers.csv')
X = data[['Annual Income', 'Spending Score']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit KMeans (n_init set explicitly so results do not vary across scikit-learn versions)
model = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = model.fit_predict(X_scaled)

# Evaluate
score = silhouette_score(X_scaled, labels)
print("Silhouette Score:", score)

# Visualize
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.title('Customer Segments')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.show()
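
To interpret the segments in their original units, the scaled centroids can be mapped back through the fitted scaler. A short optional addition to the script above (it reuses the scaler, model, and pd objects already defined):

# Map centroids from standardized space back to original units
centers_original = scaler.inverse_transform(model.cluster_centers_)
print(pd.DataFrame(centers_original, columns=['Annual Income', 'Spending Score']))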

Expected Output

  • 5 well-separated clusters
  • Visual scatter plot showing segments
  • Silhouette score indicating clustering quality

Common Mistakes to Avoid

  • ❌ Not scaling data before clustering
  • ❌ Choosing too many/few clusters arbitrarily
  • ❌ Using KMeans on non-spherical distributions
  • ❌ Relying on cluster labels as ground truth for supervised models

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon

Also explore: