Anomaly Detection Techniques using Scikit-learn

Anomaly detection is the process of identifying rare items, events, or observations that differ significantly from the majority of the data. Scikit-learn provides several models to perform unsupervised anomaly detection, including One-Class SVM, Isolation Forest, and Elliptic Envelope.
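
All three detectors share the same fit/predict interface, which makes them easy to compare. A minimal sketch on synthetic two-dimensional data (the dataset and parameter values below are illustrative, not taken from a real use case):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

# Synthetic "normal" cluster plus a handful of obvious outliers
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=1.0, random_state=42)
X_outliers = np.random.RandomState(42).uniform(low=-8, high=8, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

# Every detector returns -1 for outliers and 1 for inliers
for detector in (OneClassSVM(nu=0.05),
                 IsolationForest(contamination=0.05, random_state=42),
                 EllipticEnvelope(contamination=0.05)):
    labels = detector.fit(X).predict(X)
    print(type(detector).__name__, 'flagged', (labels == -1).sum(), 'points')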

Key Characteristics

  • Detects outliers or rare events in datasets
  • Often used in fraud detection, network security, and monitoring systems
  • Works in unsupervised settings (without labeled data)
  • Sensitive to feature scaling and data distribution

Basic Rules

  • Normalize or standardize features before applying models (see the pipeline sketch after this list)
  • Use domain knowledge to validate detected anomalies
  • Evaluate using precision-recall or domain-specific metrics
  • Suitable for high-dimensional data when models are properly tuned
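
The scaling rule above can be enforced automatically by chaining the scaler and the detector in a single Pipeline, so new data is always transformed the same way as the training data. A brief sketch (the estimator choices below are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

# Scaling and anomaly detection chained together
pipeline = make_pipeline(StandardScaler(),
                         IsolationForest(contamination=0.1, random_state=42))
# pipeline.fit(X_train)             # fit on (mostly) normal data
# labels = pipeline.predict(X_new)  # -1 = outlier, 1 = inlier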

Syntax Table

SL NO | Technique         | Syntax Example                                | Description
1     | One-Class SVM     | OneClassSVM(kernel='rbf', nu=0.1)             | Learns a decision function for outliers
2     | Isolation Forest  | IsolationForest(contamination=0.1)            | Isolates anomalies based on tree splits
3     | Elliptic Envelope | EllipticEnvelope(contamination=0.1)           | Assumes Gaussian distribution
4     | Fit Model         | model.fit(X)                                  | Trains on normal (inlier) data
5     | Predict Outliers  | model.predict(X)  # -1 = outlier, 1 = inlier  | Identifies anomalies in the dataset

Syntax Explanation

1. One-Class SVM

What is it? Learns a boundary that surrounds the inliers in feature space.

from sklearn.svm import OneClassSVM
model = OneClassSVM(kernel='rbf', nu=0.05)  # RBF boundary; nu ~ expected outlier fraction
  • nu sets an upper bound on the fraction of training points treated as outliers
  • Sensitive to kernel choice and feature scaling
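
A short usage sketch continuing the snippet above; X_train and X_test are placeholder names for scaled training and test arrays:

model.fit(X_train)                        # learn the boundary around the inliers
pred = model.predict(X_test)              # -1 = outlier, 1 = inlier
scores = model.decision_function(X_test)  # negative values lie outside the boundary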

2. Isolation Forest

What is it? Randomly partitions data and isolates anomalies with fewer splits.

from sklearn.ensemble import IsolationForest
model = IsolationForest(contamination=0.1)  # contamination = expected share of anomalies
  • Works well with high-dimensional data
  • Fast and efficient for large datasets
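
Besides the -1/1 labels, IsolationForest also exposes anomaly scores that are useful for ranking cases; X_train and X_test below are placeholders:

model.fit(X_train)                    # build the ensemble of random trees
labels = model.predict(X_test)        # -1 = anomaly, 1 = normal
scores = model.score_samples(X_test)  # lower scores indicate more anomalous points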

3. Elliptic Envelope

What is it? Fits a Gaussian distribution to the dataset and detects outliers as points far from the center.

from sklearn.covariance import EllipticEnvelope
model = EllipticEnvelope(contamination=0.1)  # assumes the inliers are roughly Gaussian
  • Best suited to data that is approximately Gaussian (unimodal and elliptical)
  • Unreliable for multimodal or heavily skewed features
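
Because EllipticEnvelope fits a Gaussian, it can also report how far each point lies from the fitted center; X_train and X_test below are placeholders:

model.fit(X_train)              # estimate the mean and covariance of the inliers
labels = model.predict(X_test)  # -1 = outlier, 1 = inlier
dist = model.mahalanobis(X_test)  # squared Mahalanobis distance from the fitted center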

4. Fit the Model

model.fit(X_train)
  • Trains the detector; X_train should contain only (or at least mostly) normal cases, as in the sketch below
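
If some ground-truth labels happen to exist, one way to build an inlier-only training set is to filter on them first; X, y, and the 0 = normal convention below are placeholder assumptions:

import numpy as np

mask = (np.asarray(y) == 0)  # keep only rows known to be normal
X_train = X[mask]
model.fit(X_train)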

5. Predict Outliers

pred = model.predict(X_test)  # -1 = anomaly, 1 = normal
  • Use output for further investigation or alert systems
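
A small sketch of turning the -1/1 output into a boolean flag for alerting or manual review; X_test is a placeholder:

import numpy as np

pred = model.predict(X_test)                 # -1 = anomaly, 1 = normal
anomaly_mask = (pred == -1)                  # True where a point was flagged
flagged_rows = np.flatnonzero(anomaly_mask)  # row indices to investigate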

Real-Life Project: Detecting Fraudulent Transactions

Project Overview

Identify potentially fraudulent financial transactions using Isolation Forest.

Code Example

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Load dataset
df = pd.read_csv('transactions.csv')
X = df.drop(columns=['is_fraud'])
y = df['is_fraud']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit model
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X_scaled)

# Predict
y_pred = model.predict(X_scaled)
y_pred = [1 if p == -1 else 0 for p in y_pred]  # map -1 (anomaly) to 1 = fraud, 1 (normal) to 0

print(classification_report(y, y_pred))

Expected Output

  • Recall for the fraud class (1) should be high if fraudulent transactions are well separated from normal ones
  • Precision depends largely on the contamination setting

Common Mistakes to Avoid

  • ❌ Using raw unscaled data
  • ❌ Ignoring contamination rate tuning
  • ❌ Applying it to fully labeled problems that a supervised classifier would handle better

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon