K-Means Clustering using Scikit-learn

K-Means is one of the most popular clustering algorithms in unsupervised machine learning. It groups unlabeled data into k clusters based on feature similarity. Scikit-learn provides an easy-to-use and efficient implementation through the KMeans class.

Key Characteristics of K-Means Clustering

  • Centroid-Based: Clusters are formed by minimizing distance to centroids.
  • Iterative Optimization: Alternates assignment and centroid-update steps (an EM-style procedure) to refine clusters.
  • Scalable: Performs well on medium to large datasets.
  • Unsupervised: Doesn’t require labeled data.
  • Deterministic (with random_state): Reproducibility ensured with a fixed seed.

Basic Rules for Using K-Means

  • Standardize features for better clustering performance.
  • Choose k wisely using metrics like inertia or silhouette score.
  • Avoid K-Means for non-spherical or unevenly sized clusters.
  • Run multiple initializations to avoid local minima (see the sketch after this list).
  • Use .fit_predict() to cluster and label in one step.
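
To make the first and fourth rules concrete, here is a minimal sketch on synthetic data (the make_blobs dataset and the choice of k=3 are assumptions for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 300 samples in 3 groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Rule 1: standardize so no single feature dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Rule 4: n_init=10 reruns the algorithm with different centroid seeds
# and keeps the run with the lowest inertia
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)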

Syntax Table

SL NO | Function          | Syntax Example                     | Description
1     | Import KMeans     | from sklearn.cluster import KMeans | Imports the KMeans class
2     | Instantiate Model | model = KMeans(n_clusters=3)       | Creates the clustering model
3     | Fit Model         | model.fit(X)                       | Learns clusters from the data
4     | Predict Labels    | labels = model.predict(X)          | Predicts a cluster index for each sample
5     | Fit and Predict   | labels = model.fit_predict(X)      | Combines fitting and predicting
6     | Cluster Centers   | model.cluster_centers_             | Returns the array of centroid coordinates
7     | Inertia           | model.inertia_                     | Measures the within-cluster sum of squares

Syntax Explanation (Expanded)

1. Import and Instantiate KMeans

  • What is it? Load the KMeans class and create a model instance for clustering.
  • Syntax:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=0)
  • Explanation:
    • n_clusters: Number of clusters to form. Must be an integer ≥ 1.
    • random_state: Fixes random number generation for reproducibility.
    • init: Method for initializing centroids. Default is 'k-means++', which speeds up convergence.
    • n_init: Number of times the algorithm runs with different centroid seeds; the best run (lowest inertia) is kept. Best practice is ≥ 10.
    • max_iter: Maximum number of iterations per run. Default is 300.
    • tol: Tolerance value for convergence. Lower values increase precision but may require more iterations.
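
A short sketch tying these parameters together; the values shown mirror the defaults described above, with random_state added for reproducibility:

from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=3,       # number of clusters to form
    init='k-means++',   # smart centroid seeding (the default)
    n_init=10,          # rerun 10 times, keep the lowest-inertia result
    max_iter=300,       # iteration cap per run (the default)
    tol=1e-4,           # convergence tolerance (the default)
    random_state=0,     # fixed seed for reproducibility
)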

2. Fit the Model

  • What is it? Train the model using the given input data.
  • Syntax:
model.fit(X)
  • Explanation:
    • X should be a 2D NumPy array or DataFrame with numeric values.
    • During fitting, the algorithm:
      • Initializes centroids (randomly, or with k-means++ by default).
      • Assigns each data point to the nearest centroid.
      • Recomputes centroids as the mean of assigned points.
      • Repeats until convergence (no significant change in centroids or reaching max_iter).
    • Use .fit() when you don’t immediately need the labels.
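
A minimal, runnable sketch of fitting (the synthetic make_blobs data and k=3 are assumptions for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-feature data for illustration
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(X)  # learns centroids; training labels end up in model.labels_

print(model.labels_[:10])  # cluster index of the first 10 samples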

3. Predict Labels

  • What is it? Assign each sample to the nearest learned cluster.
  • Syntax:
labels = model.predict(X)
  • Explanation:
    • After fitting, use predict() to label new or existing data points.
    • Returns an array of integers, each representing a cluster.
    • Ideal for applying learned clusters to unseen data (e.g., test sets).
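
For example, a fitted model can label points it has never seen; the two new observations below are made up for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Two hypothetical new observations in the same 2-feature space
X_new = np.array([[0.0, 0.0], [5.0, 5.0]])
print(model.predict(X_new))  # one cluster index per new row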

4. Fit and Predict Together

  • What is it? Train the model and assign cluster labels in one step.
  • Syntax:
labels = model.fit_predict(X)
  • Explanation:
    • Equivalent to calling .fit(X) followed by .predict(X).
    • Efficient and concise for training data.
    • Preferred when immediate access to cluster labels is required.
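
The one-step equivalent of the fit-then-predict sketch above, again on synthetic data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels.shape)  # (200,) -- one cluster index per training sample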

5. Cluster Centers

  • What is it? Retrieve the coordinates of the centroids for each cluster.
  • Syntax:
centroids = model.cluster_centers_
  • Explanation:
    • Returns a 2D array with shape (n_clusters, n_features).
    • Each row is a centroid’s coordinate in feature space.
    • Use for visualization (e.g., plot centroids over data).
    • Helpful for interpreting what each cluster represents.
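
A short sketch overlaying the centroids on the clustered data (synthetic data again, purely illustrative):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

centroids = model.cluster_centers_  # shape: (3, 2)
plt.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap='viridis', s=20)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200)
plt.title('Clusters and Centroids')
plt.show()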

6. Inertia

  • What is it? Sum of squared distances from each point to its assigned cluster center.
  • Syntax:
inertia = model.inertia_
  • Explanation:
    • Measures compactness of clusters.
    • Lower inertia means tighter clusters (better clustering).
    • Use the “Elbow Method” by plotting inertia across various k values to select the optimal cluster count.
    • Keep in mind: inertia always decreases as k increases, so combine with silhouette score or domain knowledge.
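
A minimal sketch of the Elbow Method described above (the candidate range k = 1..10 and the synthetic data are assumptions):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit one model per candidate k and record its inertia
k_values = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in k_values]

plt.plot(k_values, inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()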

Real-Life Project: Customer Segmentation

Project Name

Clustering E-Commerce Customers

Project Overview

Segment e-commerce customers into distinct groups based on their spending behavior using KMeans. This project illustrates how to clean, standardize, and cluster customer data for business insights and marketing strategies.

Project Goal

  • Identify unique customer segments from income and spending behavior.
  • Optimize the number of clusters using silhouette analysis.
  • Visualize customer groups and interpret segment characteristics.

Code for This Project

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Load and prepare data
data = pd.read_csv('ecommerce_customers.csv')
X = data[['Annual Income', 'Spending Score']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit KMeans
model = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = model.fit_predict(X_scaled)

# Evaluate
score = silhouette_score(X_scaled, labels)
print("Silhouette Score:", score)

# Visualize clusters and centroids
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label='Centroids')
plt.title('Customer Segments')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.legend()
plt.show()
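
The project goal also calls for optimizing the cluster count with silhouette analysis; below is a minimal sketch of that selection step, reusing the same data preparation as above (the candidate range 2-10 is an assumption):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Same preparation as the project code above
data = pd.read_csv('ecommerce_customers.csv')
X_scaled = StandardScaler().fit_transform(data[['Annual Income', 'Spending Score']])

# Silhouette score for each candidate k; higher is better
for k in range(2, 11):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    print(k, silhouette_score(X_scaled, labels_k))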

Expected Output

  • Five color-coded customer segments displayed in a scatter plot.
  • Silhouette score to assess clustering quality.
  • Centroid positions identifying average behavior for each group.

Common Mistakes to Avoid

  • ❌ Not standardizing data before clustering.
  • ❌ Arbitrarily selecting number of clusters without validation.
  • ❌ Using non-numeric features without encoding or transformation.
  • ❌ Misinterpreting cluster labels as true categories.

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
