The K-Nearest Neighbors (KNN) algorithm is a simple, non-parametric method used for both classification and regression. In classification tasks, it assigns a label based on the most common class among the k nearest neighbors in the training set. It is intuitive and effective on low-dimensional datasets.
Key Characteristics of KNN Classifier
- Lazy Learning: No model is built during training; it memorizes the training dataset.
- Instance-Based: Makes predictions based on the distance to training examples.
- Distance Metric: Typically uses Euclidean distance; other metrics can be swapped in (see the sketch after this list).
- Non-Linear Decision Boundaries: Effective for non-linear classification problems.
- No Assumptions: Makes no assumptions about the underlying data distribution, so it works even when classes are not linearly separable.
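Since the distance metric drives everything KNN does, here is a minimal sketch (using two toy points invented for illustration) of Euclidean versus Manhattan distance, and of how scikit-learn's `metric` parameter selects between them:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Euclidean vs. Manhattan distance between two toy points.
a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(np.linalg.norm(a - b))   # Euclidean: 5.0
print(np.abs(a - b).sum())     # Manhattan: 7.0

# The classifier defaults to Minkowski distance with p=2 (i.e. Euclidean);
# other metrics can be requested via the metric parameter.
knn_default = KNeighborsClassifier(n_neighbors=5)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
```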
Basic Rules for KNN Classification
- Always scale features before applying KNN.
- Choose an odd value for `k` when classes are binary, so that majority votes cannot tie.
- Use cross-validation to find the optimal `k` (a sketch follows this list).
- Avoid high-dimensional data (curse of dimensionality).
- KNN is sensitive to irrelevant or redundant features.
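As a minimal sketch of the cross-validation rule above, the loop below scores odd values of `k`; synthetic data from `make_classification` stands in for a real, already-scaled dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a real, already-scaled feature matrix.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

best_k, best_score = None, 0.0
for k in range(1, 21, 2):  # odd k avoids tied votes in binary problems
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(f"Best k: {best_k} (mean CV accuracy: {best_score:.3f})")
```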
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import KNN Class | `from sklearn.neighbors import KNeighborsClassifier` | Imports the KNN classifier |
| 2 | Instantiate Model | `knn = KNeighborsClassifier(n_neighbors=5)` | Creates a KNN model with 5 neighbors |
| 3 | Fit Model | `knn.fit(X_train, y_train)` | Trains the model using training data |
| 4 | Predict | `knn.predict(X_test)` | Predicts labels for test data |
| 5 | Probability Score | `knn.predict_proba(X_test)` | Returns class probabilities |
Syntax Explanation
1. Import and Instantiate
- What is it? Load the KNN class and set the number of neighbors.
- Syntax:

```python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
```

- Explanation:
  - `n_neighbors=5` sets the number of neighbors used for prediction.
  - A smaller `k` gives a more flexible decision boundary; a larger `k` gives smoother predictions.
2. Training the Model
- What is it? Fits the KNN model to the training dataset.
- Syntax:

```python
knn.fit(X_train, y_train)
```
- Explanation:
- Memorizes training data for use during prediction.
- No actual training or parameter estimation is performed, as the toy sketch below illustrates.
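A minimal toy example (the six 2-D points below are invented for illustration) showing that `fit` simply stores the data, with prediction happening later:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six labeled 2-D points: class 0 clusters near the origin, class 1 near (8, 8).
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)  # no parameters estimated; the data is simply stored

print(knn.predict([[2, 2], [8, 7]]))  # expected: [0 1]
```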
3. Making Predictions
- What is it? Predicts class labels for test data.
- Syntax:

```python
y_pred = knn.predict(X_test)
```
- Explanation:
  - The class label is determined by majority vote among the `k` nearest neighbors; the sketch below inspects those votes directly.
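Continuing the toy example above, the standard `kneighbors` method shows exactly which stored points cast the votes:

```python
# Which stored points vote for the query [2, 2]?
distances, indices = knn.kneighbors([[2, 2]])
print(indices)           # e.g. [[1 2 0]] -- row indices into X_train
print(y_train[indices])  # their labels: [[0 0 0]], so the vote is unanimous
```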
4. Getting Class Probabilities
- What is it? Predicts the probability for each class.
- Syntax:

```python
proba = knn.predict_proba(X_test)
```
- Explanation:
- Gives insight into model confidence for each prediction.
- Useful for ROC curves and threshold tuning.
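Continuing the same toy example, a sketch of thresholding on these probabilities (with `n_neighbors=3` they are vote fractions, i.e. multiples of 1/3; the 0.7 threshold is an arbitrary illustration):

```python
proba = knn.predict_proba([[2, 2], [8, 7]])
print(proba)  # one row per query point, one column per class

# Apply a custom decision threshold to the class-1 probability.
threshold = 0.7
custom_pred = (proba[:, 1] >= threshold).astype(int)
print(custom_pred)  # expected: [0 1]
```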
Real-Life Project: Customer Segmentation
Project Name
Predicting Customer Segments with KNN
Project Overview
This project uses the K-Nearest Neighbors algorithm to classify customers into different marketing segments based on demographic and behavioral data.
Project Goal
- Classify customers into known segments.
- Use cross-validation to find the best `k` value.
- Evaluate classification accuracy and the confusion matrix.
Code for This Project
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
data = pd.read_csv('customer_data.csv')
X = data.drop('Segment', axis=1)
y = data['Segment']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features (fit the scaler on training data only, to avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
```
Expected Output
- Accuracy score of predictions
- Confusion matrix (printed, or plotted via the optional sketch above)
- Classification report with precision, recall, and F1
Common Mistakes to Avoid
- ❌ Skipping feature scaling → KNN is distance-based
- ❌ Not tuning `k` → the default `k=5` may not be optimal (a tuning sketch follows this list)
- ❌ Applying KNN to high-dimensional data
- ❌ Using categorical variables without encoding
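A sketch of how the tuning mistake above could be addressed with `GridSearchCV`, reusing `X_train_scaled` and `y_train` from the project code (searching `weights` as well is an extra assumption, not part of the original project):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search both the neighbor count and the vote-weighting scheme.
param_grid = {"n_neighbors": list(range(1, 21, 2)), "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)
```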
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
Also explore:
- 🔗 Scikit-learn KNN Docs: https://scikit-learn.org/stable/modules/neighbors.html
- 🔗 KNN Visual Intuition (YouTube, TowardsDataScience)
- 🔗 Hyperparameter tuning with GridSearchCV
