K-Nearest Neighbors Classifier with Scikit-learn

The K-Nearest Neighbors (KNN) algorithm is a simple, non-parametric method that can be used for both classification and regression. In classification tasks, it assigns a query point the most common class among its k nearest neighbors in the training set. It is intuitive and often effective on low-dimensional datasets.

Key Characteristics of KNN Classifier

  • Lazy Learning: No model is built during training; it memorizes the training dataset.
  • Instance-Based: Makes predictions based on the distance to training examples.
  • Distance Metric: Uses Euclidean distance by default (scikit-learn's Minkowski metric with p=2); see the sketch after this list.
  • Non-Linear Decision Boundaries: Effective for non-linear classification problems.
  • No Assumptions: Makes no assumptions about the underlying data distribution, so it works even when classes are not linearly separable.
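
To make the distance metric concrete, here is a minimal NumPy sketch of the Euclidean distance KNN computes between two feature vectors (the customer values are invented for illustration):

import numpy as np

# Two hypothetical customers described by (age, income)
a = np.array([25.0, 40000.0])
b = np.array([32.0, 52000.0])

# Euclidean distance: square root of the sum of squared differences
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # equivalent to np.linalg.norm(a - b)

Note how the income axis, with its much larger scale, dominates the distance; this is exactly why the scaling rule below matters.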

Basic Rules for KNN Classification

  • Always scale features before applying KNN; otherwise large-scale features dominate the distance.
  • Choose an odd value for k in binary problems to avoid tied votes.
  • Use cross-validation to find the optimal k (see the sketch after this list).
  • Avoid very high-dimensional data (curse of dimensionality).
  • KNN is sensitive to irrelevant or redundant features.
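
The first three rules can be applied together. Here is a minimal sketch, using synthetic stand-in data, that wraps scaling in a Pipeline (so the scaler is re-fit inside each cross-validation fold, avoiding leakage) and scores several odd values of k:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; substitute your own features and labels
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

for k in (3, 5, 7, 9, 11):  # odd values avoid ties in binary problems
    pipe = Pipeline([
        ("scaler", StandardScaler()),           # scaling re-fit per CV fold
        ("knn", KNeighborsClassifier(n_neighbors=k)),
    ])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"k={k}: mean CV accuracy {scores.mean():.3f}")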

Syntax Table

SL NO | Function          | Syntax Example                                      | Description
1     | Import KNN Class  | from sklearn.neighbors import KNeighborsClassifier  | Imports the KNN classifier
2     | Instantiate Model | knn = KNeighborsClassifier(n_neighbors=5)           | Creates a KNN model with 5 neighbors
3     | Fit Model         | knn.fit(X_train, y_train)                           | Trains the model using training data
4     | Predict           | knn.predict(X_test)                                 | Predicts labels for test data
5     | Probability Score | knn.predict_proba(X_test)                           | Returns class probabilities

Syntax Explanation

1. Import and Instantiate

  • What is it? Load the KNN class and set the number of neighbors.
  • Syntax:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
  • Explanation:
    • n_neighbors=5 sets the number of neighbors used for prediction.
    • Smaller k yields a more flexible decision boundary (risking overfitting); larger k produces a smoother, more stable boundary (risking underfitting), as the sketch below illustrates.
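
A minimal sketch of that trade-off on a synthetic two-class dataset (exact scores will vary; the point is the gap between training and test accuracy):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data with a non-linear class boundary
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # k=1 memorizes the training set (near-perfect train score) but
    # generalizes worse; k=15 trades training accuracy for stability
    print(f"k={k}: train={knn.score(X_train, y_train):.2f}, "
          f"test={knn.score(X_test, y_test):.2f}")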

2. Training the Model

  • What is it? Fits the KNN model to the training dataset.
  • Syntax:
knn.fit(X_train, y_train)
  • Explanation:
    • Stores the training data (optionally indexed in a ball tree or KD-tree for fast neighbor lookup).
    • No iterative parameter estimation is performed; the real work is deferred to prediction time.
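
A self-contained sketch on a tiny invented dataset, just to show that fit for KNN amounts to storing the data:

from sklearn.neighbors import KNeighborsClassifier

# Toy dataset: one feature, two classes
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)  # stores the data; no coefficients are learned

Because the work is deferred, fitting is fast but predicting can be slow on large training sets.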

3. Making Predictions

  • What is it? Predicts class labels for test data.
  • Syntax:
y_pred = knn.predict(X_test)
  • Explanation:
    • Class label is determined by majority vote from the k nearest neighbors.
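
You can inspect that vote directly with kneighbors, which returns the distances and indices of the nearest training points. A minimal sketch reusing the toy data from the previous step:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

query = [[1.2]]
distances, indices = knn.kneighbors(query)  # 3 closest points, nearest first
print(y_train[indices[0]])                  # [0 1 0] -> majority class is 0
print(knn.predict(query))                   # [0]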

4. Getting Class Probabilities

  • What is it? Predicts the probability for each class.
  • Syntax:
proba = knn.predict_proba(X_test)
  • Explanation:
    • Gives insight into model confidence for each prediction.
    • Useful for ROC curves and threshold tuning.
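
For KNN, the "probability" is simply the fraction of the k neighbors that belong to each class (distance-weighted if weights='distance'). A minimal sketch with the toy data again, including a custom decision threshold for the binary case:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

proba = knn.predict_proba([[1.2]])  # columns are ordered as knn.classes_
print(knn.classes_, proba)          # classes_ is [0 1]; proba ≈ [[0.67, 0.33]]

# Predict class 1 only when its probability clears a stricter threshold
y_custom = (proba[:, 1] >= 0.7).astype(int)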

Real-Life Project: Customer Segmentation

Project Name

Predicting Customer Segments with KNN

Project Overview

This project uses the K-Nearest Neighbors algorithm to classify customers into different marketing segments based on demographic and behavioral data.

Project Goal

  • Classify customers into known segments.
  • Use cross-validation to find the best k value.
  • Evaluate classification accuracy and confusion matrix.

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
data = pd.read_csv('customer_data.csv')
X = data.drop('Segment', axis=1)
y = data['Segment']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict and Evaluate
y_pred = knn.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Expected Output

  • Accuracy score of predictions
  • Confusion matrix of predicted vs. actual segments
  • Classification report with precision, recall, and F1

Common Mistakes to Avoid

  • ❌ Skipping feature scaling → KNN is distance-based
  • ❌ Not tuning k → Default k=5 may not be optimal
  • ❌ Applying KNN to high-dimensional data
  • ❌ Using categorical variables without encoding
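
On the last point: categorical columns must be converted to numbers before KNN can compute distances. A minimal sketch with pandas one-hot encoding (the customer frame is invented for illustration):

import pandas as pd

# Hypothetical customer data with one categorical column
df = pd.DataFrame({
    "age": [25, 32, 41],
    "region": ["north", "south", "north"],
})

# One-hot encode so every column is numeric and distance-comparable
X = pd.get_dummies(df, columns=["region"])
print(X.columns.tolist())  # ['age', 'region_north', 'region_south']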

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon

Also explore: