Hierarchical clustering is a method of unsupervised learning that builds a hierarchy of clusters either in a bottom-up (agglomerative) or top-down (divisive) fashion. Scikit-learn provides AgglomerativeClustering for bottom-up hierarchical clustering.
Key Characteristics of Hierarchical Clustering
- Tree-Based Structure: Produces a cluster tree that can be visualized as a dendrogram.
- No Need to Specify k Initially: You can decide the number of clusters by cutting the dendrogram.
- Agglomerative Method: Merges the closest pairs of clusters iteratively.
- Distance-Based Clustering: Clustering based on pairwise distances.
- Useful for Hierarchical Grouping: Good for nested grouping like taxonomies.
Basic Rules for Using Hierarchical Clustering
- Standardize data before clustering.
- Use linkage criteria such as ‘ward’, ‘complete’, or ‘average’.
- Visualize dendrogram for better interpretation.
- Suitable for small to medium-sized datasets (due to computation cost).
- AgglomerativeClustering is deterministic: it has no random_state parameter, so repeated runs on the same data always produce the same labels.
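Taken together, the rules above look like this in practice. The following is a minimal sketch; the synthetic data from make_blobs is an assumption standing in for a real dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Synthetic data standing in for a real dataset
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Rule 1: standardize before clustering
X_scaled = StandardScaler().fit_transform(X)

# Rule 2: pick a linkage criterion ('ward' is the default)
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(X_scaled)

# Identical reruns produce identical labels (no randomness involved)
assert (labels == model.fit_predict(X_scaled)).all()
```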
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Class | from sklearn.cluster import AgglomerativeClustering | Load the clustering class |
| 2 | Instantiate Model | model = AgglomerativeClustering(n_clusters=3) | Define number of clusters |
| 3 | Fit Model & Get Labels | labels = model.fit_predict(X) | Train model and assign cluster labels |
| 4 | Linkage Matrix (optional) | linkage(X, method='ward') | Needed for dendrogram visualization (from scipy) |
Syntax Explanation (Expanded)
1. Import Class
- What is it? Load the required hierarchical clustering class from scikit-learn.
- Syntax:
from sklearn.cluster import AgglomerativeClustering
- Explanation:
- This is the main class for performing bottom-up hierarchical clustering.
- No fitting is needed at import time; fit_predict is called later on the dataset.
- Easy to integrate into a scikit-learn pipeline (see the sketch below).
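As a quick illustration of pipeline integration, here is a minimal sketch (the random placeholder data is an assumption). Pipeline.fit_predict forwards to the final step's fit_predict:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # placeholder data

# Scaling and clustering chained into one estimator
pipe = make_pipeline(StandardScaler(), AgglomerativeClustering(n_clusters=3))
labels = pipe.fit_predict(X)  # delegates to AgglomerativeClustering.fit_predict
print(labels[:10])
```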
2. Instantiate the Model
- What is it? Define how clustering should be performed (number of clusters, distance metric, linkage strategy).
- Syntax:
model = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward')
- Explanation:
- n_clusters: Number of groups to form. Determines how many terminal branches the hierarchy will have.
- metric: Distance metric used for clustering (usually 'euclidean'; 'precomputed' can also be used). This parameter was named affinity in scikit-learn versions before 1.2.
- linkage: Strategy for merging:
  - 'ward': minimizes variance (works only with 'euclidean').
  - 'complete': uses the maximum distance between points.
  - 'average': uses the average of distances between all pairs.
  - 'single': uses the minimum distance between points.
- Choosing the wrong linkage/metric combination can lead to an error or suboptimal results, as the sketch below demonstrates.
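A minimal sketch of valid and invalid combinations, assuming scikit-learn ≥ 1.2 where the parameter is named metric (affinity in older releases); the random data is a placeholder:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # placeholder data

# 'ward' only supports Euclidean distances
AgglomerativeClustering(n_clusters=3, linkage='ward', metric='euclidean').fit(X)

# Other linkages accept additional metrics, e.g. Manhattan
AgglomerativeClustering(n_clusters=3, linkage='average', metric='manhattan').fit(X)

# Mixing 'ward' with a non-Euclidean metric raises a ValueError at fit time
try:
    AgglomerativeClustering(n_clusters=3, linkage='ward', metric='manhattan').fit(X)
except ValueError as err:
    print(err)
```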
3. Fit Model & Get Labels
- What is it? Execute the clustering algorithm and assign labels to each data point.
- Syntax:
labels = model.fit_predict(X)
- Explanation:
- X should be a numeric dataset, typically 2D.
- Internally, the model builds the tree from individual points up.
- At each step, the closest clusters (according to linkage) are merged.
- Returns an array where each entry corresponds to the cluster ID of a sample.
- You can visualize the final assignments with scatter plots or heatmaps.
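For example, here is a minimal sketch on synthetic blobs (an assumed stand-in dataset), visualizing the assignments with a scatter plot:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Three well-separated blobs as a stand-in for real data
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X)  # one cluster ID per row of X

# Color each point by its assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title('Agglomerative clustering assignments')
plt.show()
```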
4. Linkage Matrix and Dendrogram (Visualization)
- What is it? Generate a tree plot (dendrogram) to visually represent the hierarchy.
- Syntax:
from scipy.cluster.hierarchy import dendrogram, linkage
linked = linkage(X, 'ward')
dendrogram(linked)
- Explanation:
- Not part of sklearn, but very useful for understanding cluster structure.
- linkage() returns the hierarchical merging information.
- dendrogram() draws the actual tree.
- Use truncate_mode='lastp' to limit how many clusters are shown.
- Helps to select n_clusters visually (by cutting at an appropriate height).
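For instance, a minimal sketch of a truncated dendrogram (synthetic data assumed), showing only the last few merges:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 2))  # placeholder data

linked = linkage(X, method='ward')

# Show only the last 10 merged clusters to keep the plot readable
dendrogram(linked, truncate_mode='lastp', p=10)
plt.ylabel('Distance')
plt.show()
```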
Real-Life Project: Customer Segmentation Using Hierarchical Clustering
Project Name
Hierarchical Clustering of Mall Customers
Project Overview
This project applies agglomerative hierarchical clustering on mall customer data to identify distinct customer groups based on annual income and spending score. The project also demonstrates dendrogram visualization for understanding cluster formation.
Project Goal
- Segment customers into logical clusters based on income and spending patterns.
- Visualize cluster hierarchies using dendrograms.
- Apply data preprocessing for optimal cluster formation.
Code for This Project
```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler

# Load dataset
data = pd.read_csv('Mall_Customers.csv')
X = data[['Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize the features so both carry equal weight
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hierarchical clustering (default linkage is 'ward')
model = AgglomerativeClustering(n_clusters=5)
labels = model.fit_predict(X_scaled)

# Visualize dendrogram
linked = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Customer Dendrogram')
plt.xlabel('Customer Index')
plt.ylabel('Distance')
plt.show()
```
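To turn the label array into an actionable segmentation, one possible follow-up (continuing the script above, with data and labels already defined) is to summarize each cluster:

```python
# Attach cluster labels and inspect the average profile of each segment
data['Cluster'] = labels
print(data.groupby('Cluster')[['Annual Income (k$)', 'Spending Score (1-100)']].mean())
```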
Expected Output
- A dendrogram displaying hierarchical groupings of customers.
- A label array classifying each customer into one of the defined clusters.
- Insightful customer segmentation for marketing or personalization.
Common Mistakes to Avoid
- ❌ Not scaling input features before clustering.
- ❌ Using inappropriate linkage method for data type.
- ❌ Misinterpreting dendrogram distances and splits.
- ❌ Applying hierarchical clustering on large datasets (high computational cost).
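To illustrate the first mistake, here is a minimal sketch comparing clusterings with and without scaling; the synthetic data and the use of the adjusted Rand index are assumptions for the demonstration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Feature 0 spans roughly 0-1, feature 1 roughly 0-1000:
# without scaling, feature 1 dominates every distance computation
X = np.column_stack([rng.random(100), rng.random(100) * 1000])

labels_raw = AgglomerativeClustering(n_clusters=3).fit_predict(X)
labels_scaled = AgglomerativeClustering(n_clusters=3).fit_predict(
    StandardScaler().fit_transform(X)
)

# A score near 1 would mean the two clusterings agree; here they usually differ
print(adjusted_rand_score(labels_raw, labels_scaled))
```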
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
Also explore:
- 🔗 Scikit-learn Hierarchical Clustering Documentation
- 🔗 SciPy Hierarchy API Reference
- 🔗 Dendrogram Visualization Examples on Kaggle
