Hierarchical Clustering with Scikit-learn

Hierarchical clustering is a method of unsupervised learning that builds a hierarchy of clusters either in a bottom-up (agglomerative) or top-down (divisive) fashion. Scikit-learn provides AgglomerativeClustering for bottom-up hierarchical clustering.

Key Characteristics of Hierarchical Clustering

  • Tree-Based Structure: Produces a dendrogram-like cluster tree.
  • No Need to Specify k Initially: You can decide the number of clusters later by cutting the dendrogram (see the sketch after this list).
  • Agglomerative Method: Merges the closest pairs of clusters iteratively.
  • Distance-Based Clustering: Clustering based on pairwise distances.
  • Useful for Hierarchical Grouping: Good for nested grouping like taxonomies.
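
Because the full merge tree is built once, you can extract different numbers of clusters from it without re-running the algorithm. Below is a minimal sketch using SciPy's linkage and fcluster; the toy data is hypothetical:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two loose groups of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Build the full hierarchy once
Z = linkage(X, method='ward')

# Cut the same tree at different depths to get different cluster counts
labels_2 = fcluster(Z, t=2, criterion='maxclust')  # 2 clusters
labels_4 = fcluster(Z, t=4, criterion='maxclust')  # 4 clusters
print(labels_2)
print(labels_4)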

Basic Rules for Using Hierarchical Clustering

  • Standardize data before clustering (see the sketch after this list).
  • Use linkage criteria such as ‘ward’, ‘complete’, or ‘average’.
  • Visualize the dendrogram for easier interpretation.
  • Suitable for small to medium-sized datasets; pairwise-distance computation scales quadratically with the number of samples.
  • AgglomerativeClustering is deterministic; it has no random_state parameter, and repeated runs on the same data give the same labels.
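
Scaling matters because distance-based merging is dominated by whichever feature has the largest numeric range. Here is a minimal sketch of the standardize-then-cluster pattern, on a hypothetical feature matrix:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Hypothetical features with very different scales (e.g. a ratio vs. a salary)
X = np.array([[1.0, 50000.0], [1.2, 52000.0], [5.1, 51000.0], [5.3, 49000.0]])

# Standardize so both features contribute comparably to the distances
X_scaled = StandardScaler().fit_transform(X)

model = AgglomerativeClustering(n_clusters=2, linkage='ward')
print(model.fit_predict(X_scaled))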

Syntax Table

SL NO | Function                   | Syntax Example                                       | Description
1     | Import Class               | from sklearn.cluster import AgglomerativeClustering  | Load the clustering class
2     | Instantiate Model          | model = AgglomerativeClustering(n_clusters=3)        | Define the number of clusters
3     | Fit Model & Get Labels     | labels = model.fit_predict(X)                        | Train the model and assign cluster labels
4     | Linkage Matrix (optional)  | linkage(X, method='ward')                            | Needed for dendrogram visualization (from SciPy)

Syntax Explanation (Expanded)

1. Import Class

  • What is it? Load the required hierarchical clustering class from scikit-learn.
  • Syntax:
from sklearn.cluster import AgglomerativeClustering
  • Explanation:
    • This is the main class for performing bottom-up hierarchical clustering.
    • Importing the class does not fit anything; fit_predict is called later on the dataset.
    • Easy to integrate into a scikit-learn pipeline (see the sketch below).
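
As a sketch of the pipeline point above (make_pipeline and the toy data are illustrative choices, not part of the original example), scaling and clustering can be chained into a single estimator:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 200.0], [1.1, 210.0], [8.0, 90.0], [8.2, 95.0]])

# StandardScaler feeds scaled features straight into the clusterer
pipe = make_pipeline(StandardScaler(), AgglomerativeClustering(n_clusters=2))
labels = pipe.fit_predict(X)
print(labels)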

2. Instantiate the Model

  • What is it? Define how clustering should be performed (number of clusters, distance metric, linkage strategy).
  • Syntax:
model = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward')
  • Explanation:
    • n_clusters: Number of groups to form. Determines how many terminal branches the hierarchy will have.
    • metric: Distance metric used for clustering (usually ‘euclidean’; ‘precomputed’ can also be used). In scikit-learn versions before 1.2 this parameter was named affinity, which has since been removed.
    • linkage: Strategy for merging—
      • 'ward': minimizes variance (works only with ‘euclidean’).
      • 'complete': uses the maximum distance between points.
      • 'average': uses the average of distances between all pairs.
      • 'single': uses the minimum distance between points.
    • Choosing an incompatible linkage/metric combination (e.g. ‘ward’ with a non-Euclidean metric) raises an error, and a poorly chosen one can give suboptimal results; the sketch below compares the four strategies on the same data.
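
To see how the merge strategy changes the result, here is a small comparison sketch (the six toy points are hypothetical):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two tight groups of three points each
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# Cluster the same data under each linkage strategy
for link in ('ward', 'complete', 'average', 'single'):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(link, labels)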

3. Fit Model & Get Labels

  • What is it? Execute the clustering algorithm and assign labels to each data point.
  • Syntax:
labels = model.fit_predict(X)
  • Explanation:
    • X should be a numeric array of shape (n_samples, n_features).
    • Internally, the model builds the tree from individual points upward.
    • At each step, the two closest clusters (according to the linkage criterion) are merged.
    • Returns an array where each entry is the cluster ID of the corresponding sample.
    • You can visualize the final assignments with scatter plots or heatmaps, as in the sketch below.
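
Here is a minimal sketch of plotting the assignments; the two synthetic blobs are illustrative, not from the original example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Synthetic data: two well-separated blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Color each point by its assigned cluster ID
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title('Agglomerative clustering assignments')
plt.show()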

4. Linkage Matrix and Dendrogram (Visualization)

  • What is it? Generate a tree plot (dendrogram) to visually represent the hierarchy.
  • Syntax:
from scipy.cluster.hierarchy import dendrogram, linkage
linked = linkage(X, method='ward')
dendrogram(linked)
  • Explanation:
    • Not part of sklearn, but very useful for understanding cluster structure.
    • linkage() returns the hierarchical merging information.
    • dendrogram() draws the actual tree.
    • Use truncate_mode='lastp' to limit how many merged clusters are shown (see the sketch below).
    • Helps you select n_clusters visually, by cutting the tree at an appropriate height.
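
For datasets with many samples a full dendrogram becomes unreadable. A minimal sketch of truncation (the random data is illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))

linked = linkage(X, method='ward')

# Show only the last 10 merges instead of all 50 leaves
dendrogram(linked, truncate_mode='lastp', p=10)
plt.title('Truncated dendrogram (last 10 merges)')
plt.show()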

Real-Life Project: Customer Segmentation Using Hierarchical Clustering

Project Name

Hierarchical Clustering of Mall Customers

Project Overview

This project applies agglomerative hierarchical clustering on mall customer data to identify distinct customer groups based on annual income and spending score. The project also demonstrates dendrogram visualization for understanding cluster formation.

Project Goal

  • Segment customers into logical clusters based on income and spending patterns.
  • Visualize cluster hierarchies using dendrograms.
  • Apply data preprocessing for optimal cluster formation.

Code for This Project

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler

# Load dataset
data = pd.read_csv('Mall_Customers.csv')
X = data[['Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hierarchical clustering (ward linkage, matching the dendrogram below)
model = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = model.fit_predict(X_scaled)

# Visualize dendrogram
linked = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Customer Dendrogram')
plt.xlabel('Customer Index')
plt.ylabel('Distance')
plt.show()
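
To inspect the five segments directly, a short follow-up sketch (assuming X_scaled and labels from the code above are still in scope) colors each customer by cluster:

# Plot the segments found above (uses X_scaled and labels from the code above)
plt.figure(figsize=(8, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels)
plt.title('Customer Segments')
plt.xlabel('Annual Income (standardized)')
plt.ylabel('Spending Score (standardized)')
plt.show()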

Expected Output

  • A dendrogram displaying hierarchical groupings of customers.
  • A label array classifying each customer into one of the defined clusters.
  • Insightful customer segmentation for marketing or personalization.

Common Mistakes to Avoid

  • ❌ Not scaling input features before clustering.
  • ❌ Using an inappropriate linkage/metric combination for the data (a concrete example follows this list).
  • ❌ Misinterpreting dendrogram distances and splits.
  • ❌ Applying hierarchical clustering on large datasets (high computational cost).
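
One concrete version of the linkage/metric pitfall: metric='precomputed' requires fitting on a distance matrix and cannot be combined with ‘ward’. A minimal sketch, with hypothetical toy points:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
D = pairwise_distances(X)  # precomputed pairwise distance matrix

# 'ward' would raise an error here; 'average' (or 'complete'/'single') works
model = AgglomerativeClustering(n_clusters=2, metric='precomputed', linkage='average')
print(model.fit_predict(D))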

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
