K-Nearest Neighbors Classifier with Scikit-learn

The K-Nearest Neighbors (KNN) algorithm is a simple, non-parametric method that can be used for both classification and regression. In classification tasks, it assigns a query point the most common class among its k nearest neighbors in the training set. It is intuitive and often effective on low-dimensional datasets.

Key Characteristics of KNN Classifier

  • Lazy Learning: No model is built during training; it memorizes the training dataset.
  • Instance-Based: Makes predictions based on the distance to training examples.
  • Distance Metric: Uses Euclidean distance by default (scikit-learn's Minkowski metric with p=2); see the sketch after this list.
  • Non-Linear Decision Boundaries: Effective for non-linear classification problems.
  • No Assumptions: Makes no assumptions about the underlying data distribution, so it works even when classes are not linearly separable.
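
To make the distance metric concrete, here is a minimal NumPy sketch of the Euclidean distance KNN computes between two feature vectors (the customer values are invented for illustration):

import numpy as np

# Two hypothetical customers described by (age, income)
a = np.array([25.0, 40000.0])
b = np.array([32.0, 52000.0])

# Euclidean distance: square root of the sum of squared differences
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # equivalent to np.linalg.norm(a - b)

Note how the income axis, with its much larger scale, dominates the distance; this is exactly why the scaling rule below matters.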

Basic Rules for KNN Classification

  • Always scale features before applying KNN; otherwise large-scale features dominate the distance.
  • Choose an odd value for k in binary problems to avoid tied votes.
  • Use cross-validation to find the optimal k (see the sketch after this list).
  • Avoid very high-dimensional data (curse of dimensionality).
  • KNN is sensitive to irrelevant or redundant features.
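
The first three rules can be applied together. Here is a minimal sketch, using synthetic stand-in data, that wraps scaling in a Pipeline (so the scaler is re-fit inside each cross-validation fold, avoiding leakage) and scores several odd values of k:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; substitute your own features and labels
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

for k in (3, 5, 7, 9, 11):  # odd values avoid ties in binary problems
    pipe = Pipeline([
        ("scaler", StandardScaler()),           # scaling re-fit per CV fold
        ("knn", KNeighborsClassifier(n_neighbors=k)),
    ])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"k={k}: mean CV accuracy {scores.mean():.3f}")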

Syntax Table

SL NO | Function          | Syntax Example                                      | Description
1     | Import KNN Class  | from sklearn.neighbors import KNeighborsClassifier  | Imports the KNN classifier
2     | Instantiate Model | knn = KNeighborsClassifier(n_neighbors=5)           | Creates a KNN model with 5 neighbors
3     | Fit Model         | knn.fit(X_train, y_train)                           | Trains the model using training data
4     | Predict           | knn.predict(X_test)                                 | Predicts labels for test data
5     | Probability Score | knn.predict_proba(X_test)                           | Returns class probabilities

Syntax Explanation

1. Import and Instantiate

  • What is it? Load the KNN class and set the number of neighbors.
  • Syntax:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
  • Explanation:
    • n_neighbors=5 sets the number of neighbors used for prediction.
    • Smaller k yields a more flexible decision boundary (risking overfitting); larger k produces a smoother, more stable boundary (risking underfitting), as the sketch below illustrates.
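
A minimal sketch of that trade-off on a synthetic two-class dataset (exact scores will vary; the point is the gap between training and test accuracy):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data with a non-linear class boundary
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # k=1 memorizes the training set (near-perfect train score) but
    # generalizes worse; k=15 trades training accuracy for stability
    print(f"k={k}: train={knn.score(X_train, y_train):.2f}, "
          f"test={knn.score(X_test, y_test):.2f}")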

2. Training the Model

  • What is it? Fits the KNN model to the training dataset.
  • Syntax:
knn.fit(X_train, y_train)
  • Explanation:
    • Stores the training data (optionally indexed in a ball tree or KD-tree for fast neighbor lookup).
    • No iterative parameter estimation is performed; the real work is deferred to prediction time.
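
A self-contained sketch on a tiny invented dataset, just to show that fit for KNN amounts to storing the data:

from sklearn.neighbors import KNeighborsClassifier

# Toy dataset: one feature, two classes
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)  # stores the data; no coefficients are learned

Because the work is deferred, fitting is fast but predicting can be slow on large training sets.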

3. Making Predictions

  • What is it? Predicts class labels for test data.
  • Syntax:
y_pred = knn.predict(X_test)
  • Explanation:
    • Class label is determined by majority vote from the k nearest neighbors.
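
You can inspect that vote directly with kneighbors, which returns the distances and indices of the nearest training points. A minimal sketch reusing the toy data from the previous step:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

query = [[1.2]]
distances, indices = knn.kneighbors(query)  # 3 closest points, nearest first
print(y_train[indices[0]])                  # [0 1 0] -> majority class is 0
print(knn.predict(query))                   # [0]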

4. Getting Class Probabilities

  • What is it? Predicts the probability for each class.
  • Syntax:
proba = knn.predict_proba(X_test)
  • Explanation:
    • Gives insight into model confidence for each prediction.
    • Useful for ROC curves and threshold tuning.
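
For KNN, the "probability" is simply the fraction of the k neighbors that belong to each class (distance-weighted if weights='distance'). A minimal sketch with the toy data again, including a custom decision threshold for the binary case:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

proba = knn.predict_proba([[1.2]])  # columns are ordered as knn.classes_
print(knn.classes_, proba)          # classes_ is [0 1]; proba ≈ [[0.67, 0.33]]

# Predict class 1 only when its probability clears a stricter threshold
y_custom = (proba[:, 1] >= 0.7).astype(int)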

Real-Life Project: Customer Segmentation

Project Name

Predicting Customer Segments with KNN

Project Overview

This project uses the K-Nearest Neighbors algorithm to classify customers into different marketing segments based on demographic and behavioral data.

Project Goal

  • Classify customers into known segments.
  • Use cross-validation to find the best k value.
  • Evaluate classification accuracy and confusion matrix.

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
data = pd.read_csv('customer_data.csv')
X = data.drop('Segment', axis=1)
y = data['Segment']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict and Evaluate
y_pred = knn.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Expected Output

  • Accuracy score of predictions
  • Confusion matrix of predicted vs. actual segments
  • Classification report with precision, recall, and F1

Common Mistakes to Avoid

  • ❌ Skipping feature scaling → KNN is distance-based
  • ❌ Not tuning k → Default k=5 may not be optimal
  • ❌ Applying KNN to high-dimensional data
  • ❌ Using categorical variables without encoding
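
On the last point: categorical columns must be converted to numbers before KNN can compute distances. A minimal sketch with pandas one-hot encoding (the customer frame is invented for illustration):

import pandas as pd

# Hypothetical customer data with one categorical column
df = pd.DataFrame({
    "age": [25, 32, 41],
    "region": ["north", "south", "north"],
})

# One-hot encode so every column is numeric and distance-comparable
X = pd.get_dummies(df, columns=["region"])
print(X.columns.tolist())  # ['age', 'region_north', 'region_south']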

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon

Also explore: