Unsupervised learning is a type of machine learning that deals with unlabeled data. Unlike supervised learning, where the algorithm learns from input-output pairs, unsupervised learning algorithms aim to find patterns, groupings, or structure in the data without predefined labels. Scikit-learn provides robust tools to perform unsupervised tasks such as clustering, dimensionality reduction, and anomaly detection.
Key Characteristics of Unsupervised Learning
- No Labels Required: Operates purely on input features.
- Pattern Discovery: Identifies structure, groupings, or trends in the data.
- Versatile Applications: Used in customer segmentation, recommendation engines, and anomaly detection.
- Techniques Include: Clustering (e.g., KMeans, DBSCAN), dimensionality reduction (e.g., PCA), and more.
- Scalability: Many algorithms scale well with larger datasets.
Basic Rules for Using Unsupervised Learning
- Standardize or normalize features before applying clustering or PCA.
- Use dimensionality reduction to visualize high-dimensional data.
- Select the number of clusters using domain knowledge or metrics like the silhouette score.
- Avoid applying supervised metrics such as accuracy directly; use clustering-specific scores instead.
- Evaluate cluster validity and stability across different random states (see the sketch after this list).
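The last rule can be checked directly. Below is a minimal sketch, assuming synthetic stand-in data from make_blobs: two KMeans runs that differ only in random state are compared with the adjusted Rand index, where a value near 1.0 means the runs agree on the structure.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; substitute your own feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Two runs that differ only in random_state.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_scaled)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X_scaled)

# An adjusted Rand index near 1.0 indicates a stable clustering.
print("Stability (ARI):", adjusted_rand_score(labels_a, labels_b))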
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | KMeans Clustering | KMeans(n_clusters=3) | Clusters data into the specified number of groups |
| 2 | Principal Component Analysis | PCA(n_components=2) | Reduces data to lower dimensions |
| 3 | Fit Model | model.fit(X) | Learns structure from input features |
| 4 | Predict or Transform | model.transform(X) / model.predict(X) | Transforms data or assigns cluster labels |
| 5 | Silhouette Score | silhouette_score(X, labels) | Evaluates clustering quality |
Syntax Explanation
1. KMeans Clustering
- What is it? Partitions data into k distinct groups by assigning each sample to the nearest of k centroids.
- Syntax:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=42)
model.fit(X)
labels = model.labels_
- Explanation:
- n_clusters: The number of desired clusters.
- random_state: Ensures reproducibility of results.
- fit(): Computes cluster centers and assigns labels.
- labels_: Array containing the cluster index for each sample.
- inertia_: The sum of squared distances of samples to their closest cluster center, used to evaluate cluster compactness (see the sketch below).
- KMeans assumes roughly spherical clusters and is sensitive to feature scaling, so standardize the data before applying it.
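As a sketch of how inertia_ can guide the choice of k, the loop below (on stand-in make_blobs data) records inertia for several candidate cluster counts; the "elbow" where the curve flattens is a common heuristic.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in data; scale your own features the same way.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Inertia always decreases as k grows; look for the elbow.
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    print(f"k={k}  inertia={model.inertia_:.1f}")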
2. Principal Component Analysis (PCA)
- What is it? A linear technique to reduce the number of features in a dataset while preserving the most variance.
- Syntax:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
- Explanation:
- n_components: Number of new dimensions to keep; useful for visualization (2D or 3D).
- fit_transform(): Learns the projection and applies it in one step.
- explained_variance_ratio_: Shows how much variance each component captures, useful for deciding how many components to retain (see the sketch below).
- PCA is sensitive to scale, so standardizing features beforehand is a best practice.
- Helps visualize clusters and reduce noise in high-dimensional data.
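A minimal sketch of the variance check described above, using the bundled iris dataset as stand-in data:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data; PCA is scale-sensitive, so standardize first.
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each retained component.
print("Per-component variance:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())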
3. Fit Model (General)
- What is it? The step where the unsupervised algorithm learns from the input data.
- Syntax:
model.fit(X)
- Explanation:
- This method applies to clustering, dimensionality reduction, and decomposition models.
- The model identifies intrinsic patterns without using labeled outputs.
- After fitting, models like KMeans expose .labels_ and .cluster_centers_; PCA exposes .components_.
- Fitting is a key part of the training phase and is often followed by a transform or prediction step (see the sketch below).
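A short sketch of the attributes fit() exposes afterwards, using random stand-in data:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Random stand-in data: 100 samples, 4 features.
X = np.random.RandomState(0).rand(100, 4)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])         # cluster index per sample
print(km.cluster_centers_)     # learned centroids (3 x 4)

pca = PCA(n_components=2).fit(X)
print(pca.components_)         # learned projection directions (2 x 4)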
4. Predict or Transform
- What is it? After training, the model can predict cluster labels or transform the data to a new space.
- Syntax:
# Clustering
labels = model.predict(X)
# Dimensionality Reduction
X_new = model.transform(X)
- Explanation:
- predict(): Assigns cluster indices to each sample (e.g., in KMeans).
- transform(): Projects the original data into a new feature space (e.g., in PCA).
- Ensures model reusability after the initial fit (see the sketch below).
- fit_transform(): Combines both steps and is typically applied once, to the training data.
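The reusability point can be sketched as follows, with random stand-in data standing in for a real train/new split:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X_train = rng.rand(200, 3)   # data the models are fitted on
X_new = rng.rand(20, 3)      # data arriving after training

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)
print(km.predict(X_new))             # cluster labels for unseen samples

pca = PCA(n_components=2).fit(X_train)
print(pca.transform(X_new).shape)    # (20, 2): unseen samples projected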
5. Silhouette Score
- What is it? A metric that evaluates clustering quality by measuring how close each point is to points in its own cluster versus points in the nearest neighboring cluster.
- Syntax:
from sklearn.metrics import silhouette_score
score = silhouette_score(X, labels)
- Explanation:
- Score ranges from -1 to 1; higher is better.
- Values near +1 indicate well-clustered data; 0 suggests overlapping clusters; -1 implies incorrect clustering.
- Useful for determining the ideal number of clusters in KMeans.
- Best used with standardized data for consistent measurement (see the sketch below).
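As a sketch of how the score separates good from poor cluster counts, the loop below (stand-in make_blobs data with 3 true centers) should score k=3 highest:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data with 3 true clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

for k in (2, 3, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}")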
Real-Life Project: Customer Segmentation
Project Name
Clustering E-Commerce Customers
Project Overview
Segment e-commerce customers into distinct groups based on their spending behavior using KMeans.
Project Goal
- Cluster customers by annual income and spending score.
- Visualize and interpret group characteristics.
- Optimize the cluster count using the silhouette score (see the extension after the code).
Code for This Project
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Load and prepare data
data = pd.read_csv('ecommerce_customers.csv')
X = data[['Annual Income', 'Spending Score']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit KMeans
model = KMeans(n_clusters=5, random_state=42)
labels = model.fit_predict(X_scaled)
# Evaluate
score = silhouette_score(X_scaled, labels)
print("Silhouette Score:", score)
# Visualize
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.title('Customer Segments')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.show()
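The listing above fixes n_clusters=5; as a sketch of the "optimize cluster count" goal, the extension below (reusing X_scaled and the imports from the listing) sweeps candidate counts and reports the silhouette score for each:
# Sweep candidate cluster counts to confirm or revise the choice of 5.
for k in range(2, 9):
    candidate_labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_scaled)
    print(f"k={k}  silhouette={silhouette_score(X_scaled, candidate_labels):.3f}")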
Expected Output
- 5 well-separated clusters
- Visual scatter plot showing segments
- Silhouette score indicating clustering quality
Common Mistakes to Avoid
- ❌ Not scaling data before clustering
- ❌ Choosing too many/few clusters arbitrarily
- ❌ Using KMeans on non-spherical distributions
- ❌ Relying on cluster labels as ground truth for supervised models
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
Also explore:
- 🔗 Scikit-learn Clustering Guide: https://scikit-learn.org/stable/modules/clustering.html
- 🔗 PCA Guide: https://scikit-learn.org/stable/modules/decomposition.html#pca
- 🔗 Unsupervised learning projects on Kaggle
