Anomaly Detection Techniques using Scikit-learn

Anomaly detection is the process of identifying rare items, events, or observations that differ significantly from the majority of the data. Scikit-learn provides several models to perform unsupervised anomaly detection, including One-Class SVM, Isolation Forest, and Elliptic Envelope.
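
All three detectors share the same fit/predict interface, which makes them easy to compare. A minimal sketch on synthetic two-dimensional data (the dataset and parameter values below are illustrative, not taken from a real use case):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

# Synthetic "normal" cluster plus a handful of obvious outliers
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=1.0, random_state=42)
X_outliers = np.random.RandomState(42).uniform(low=-8, high=8, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

# Every detector returns -1 for outliers and 1 for inliers
for detector in (OneClassSVM(nu=0.05),
                 IsolationForest(contamination=0.05, random_state=42),
                 EllipticEnvelope(contamination=0.05)):
    labels = detector.fit(X).predict(X)
    print(type(detector).__name__, 'flagged', (labels == -1).sum(), 'points')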

Key Characteristics

  • Detects outliers or rare events in datasets
  • Often used in fraud detection, network security, and monitoring systems
  • Works in unsupervised settings (without labeled data)
  • Sensitive to feature scaling and data distribution

Basic Rules

  • Normalize or standardize features before applying models (see the pipeline sketch after this list)
  • Use domain knowledge to validate detected anomalies
  • Evaluate using precision-recall or domain-specific metrics
  • Suitable for high-dimensional data when models are properly tuned
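
The scaling rule above can be enforced automatically by chaining the scaler and the detector in a single Pipeline, so new data is always transformed the same way as the training data. A brief sketch (the estimator choices below are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

# Scaling and anomaly detection chained together
pipeline = make_pipeline(StandardScaler(),
                         IsolationForest(contamination=0.1, random_state=42))
# pipeline.fit(X_train)             # fit on (mostly) normal data
# labels = pipeline.predict(X_new)  # -1 = outlier, 1 = inlier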

Syntax Table

SL NO | Technique         | Syntax Example                                | Description
1     | One-Class SVM     | OneClassSVM(kernel='rbf', nu=0.1)             | Learns a decision function for outliers
2     | Isolation Forest  | IsolationForest(contamination=0.1)            | Isolates anomalies based on tree splits
3     | Elliptic Envelope | EllipticEnvelope(contamination=0.1)           | Assumes Gaussian distribution
4     | Fit Model         | model.fit(X)                                  | Trains on normal (inlier) data
5     | Predict Outliers  | model.predict(X)  # -1 = outlier, 1 = inlier  | Identifies anomalies in the dataset

Syntax Explanation

1. One-Class SVM

What is it? Learns a boundary that surrounds the inliers in feature space.

from sklearn.svm import OneClassSVM
model = OneClassSVM(kernel='rbf', nu=0.05)  # RBF boundary; nu ~ expected outlier fraction
  • nu sets an upper bound on the fraction of training points treated as outliers
  • Sensitive to kernel choice and feature scaling
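
A short usage sketch continuing the snippet above; X_train and X_test are placeholder names for scaled training and test arrays:

model.fit(X_train)                        # learn the boundary around the inliers
pred = model.predict(X_test)              # -1 = outlier, 1 = inlier
scores = model.decision_function(X_test)  # negative values lie outside the boundary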

2. Isolation Forest

What is it? Randomly partitions data and isolates anomalies with fewer splits.

from sklearn.ensemble import IsolationForest
model = IsolationForest(contamination=0.1)  # contamination = expected share of anomalies
  • Works well with high-dimensional data
  • Fast and efficient for large datasets
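
Besides the -1/1 labels, IsolationForest also exposes anomaly scores that are useful for ranking cases; X_train and X_test below are placeholders:

model.fit(X_train)                    # build the ensemble of random trees
labels = model.predict(X_test)        # -1 = anomaly, 1 = normal
scores = model.score_samples(X_test)  # lower scores indicate more anomalous points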

3. Elliptic Envelope

What is it? Fits a Gaussian distribution to the dataset and detects outliers as points far from the center.

from sklearn.covariance import EllipticEnvelope
model = EllipticEnvelope(contamination=0.1)  # assumes the inliers are roughly Gaussian
  • Best suited to data that is approximately Gaussian (unimodal and elliptical)
  • Unreliable for multimodal or heavily skewed features
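
Because EllipticEnvelope fits a Gaussian, it can also report how far each point lies from the fitted center; X_train and X_test below are placeholders:

model.fit(X_train)              # estimate the mean and covariance of the inliers
labels = model.predict(X_test)  # -1 = outlier, 1 = inlier
dist = model.mahalanobis(X_test)  # squared Mahalanobis distance from the fitted center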

4. Fit the Model

model.fit(X_train)
  • Trains the detector; X_train should contain only (or at least mostly) normal cases, as in the sketch below
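
If some ground-truth labels happen to exist, one way to build an inlier-only training set is to filter on them first; X, y, and the 0 = normal convention below are placeholder assumptions:

import numpy as np

mask = (np.asarray(y) == 0)  # keep only rows known to be normal
X_train = X[mask]
model.fit(X_train)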

5. Predict Outliers

pred = model.predict(X_test)  # -1 = anomaly, 1 = normal
  • Use output for further investigation or alert systems
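
A small sketch of turning the -1/1 output into a boolean flag for alerting or manual review; X_test is a placeholder:

import numpy as np

pred = model.predict(X_test)                 # -1 = anomaly, 1 = normal
anomaly_mask = (pred == -1)                  # True where a point was flagged
flagged_rows = np.flatnonzero(anomaly_mask)  # row indices to investigate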

Real-Life Project: Detecting Fraudulent Transactions

Project Overview

Identify potentially fraudulent financial transactions using Isolation Forest.

Code Example

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Load dataset
df = pd.read_csv('transactions.csv')
X = df.drop(columns=['is_fraud'])
y = df['is_fraud']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit model
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X_scaled)

# Predict
y_pred = model.predict(X_scaled)
y_pred = [1 if p == -1 else 0 for p in y_pred]  # map -1 (anomaly) to 1 = fraud, 1 (normal) to 0

print(classification_report(y, y_pred))

Expected Output

  • Recall for the fraud class (1) should be high if fraudulent transactions are well separated from normal ones
  • Precision depends largely on the contamination setting

Common Mistakes to Avoid

  • ❌ Using raw unscaled data
  • ❌ Ignoring contamination rate tuning
  • ❌ Applying it to fully labeled problems that a supervised classifier would handle better

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon