Anomaly detection is the process of identifying rare items, events, or observations that differ significantly from the majority of the data. Scikit-learn provides several models to perform unsupervised anomaly detection, including One-Class SVM, Isolation Forest, and Elliptic Envelope.
Key Characteristics
- Detects outliers or rare events in datasets
- Often used in fraud detection, network security, and monitoring systems
- Works in unsupervised settings (without labeled data)
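- Results depend on the data distribution; kernel- and distance-based models such as One-Class SVM are also sensitive to feature scaling, while Isolation Forest is largely unaffected by it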
Basic Rules
- Normalize or standardize features before applying scale-sensitive models such as One-Class SVM
- Use domain knowledge to validate detected anomalies
- Evaluate using precision-recall or domain-specific metrics
- Suitable for high-dimensional data when models are properly tuned
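The evaluation rule above can be sketched as follows when some labels are available for validation. This is a minimal illustration on synthetic data; the data, the variable names, and the choice of Isolation Forest are assumptions, not part of the original text.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.RandomState(0)
# Synthetic data (assumption): 950 inliers near the origin, 50 widely spread outliers
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(950, 2))
X_outliers = rng.uniform(low=-8.0, high=8.0, size=(50, 2))
X = np.vstack([X_inliers, X_outliers])
y_true = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]  # 1 = anomaly

model = IsolationForest(random_state=0).fit(X)

# score_samples returns higher values for more normal points,
# so negate it to get a score where higher means more anomalous
anomaly_score = -model.score_samples(X)

# Threshold-free summary of the precision-recall trade-off
ap = average_precision_score(y_true, anomaly_score)
precision, recall, thresholds = precision_recall_curve(y_true, anomaly_score)
print(f"Average precision: {ap:.3f}")
```

Scoring with a continuous anomaly score instead of hard -1/1 labels lets you inspect the whole precision-recall curve before committing to a threshold.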
Syntax Table
SL NO | Technique | Syntax Example | Description |
---|---|---|---|
1 | One-Class SVM | OneClassSVM(kernel='rbf', nu=0.1) | Learns a decision function for outliers |
2 | Isolation Forest | IsolationForest(contamination=0.1) | Isolates anomalies based on tree splits |
3 | Elliptic Envelope | EllipticEnvelope(contamination=0.1) | Assumes Gaussian distribution |
4 | Fit Model | model.fit(X) | Trains on normal (inlier) data |
5 | Predict Outliers | model.predict(X) # -1 = outlier, 1 = inlier | Identifies anomalies in the dataset |
Syntax Explanation
1. One-Class SVM
What is it? Learns a boundary that surrounds the inliers in feature space.
from sklearn.svm import OneClassSVM
model = OneClassSVM(kernel='rbf', nu=0.05)
- nu is an upper bound on the fraction of training points treated as outliers (and a lower bound on the fraction of support vectors)
- Sensitive to kernel choice and scaling
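As a rough sketch of why scaling matters here, the snippet below wraps One-Class SVM in a pipeline with StandardScaler. The synthetic data, the gamma="scale" setting, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
# Synthetic training data (assumption): two features on very different scales
X_train = rng.normal(loc=[0.0, 100.0], scale=[1.0, 25.0], size=(500, 2))

# Scale first so the RBF kernel treats both features comparably
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
model.fit(X_train)

train_pred = model.predict(X_train)  # -1 = outlier, 1 = inlier
outlier_fraction = np.mean(train_pred == -1)
print(f"Fraction of training points flagged: {outlier_fraction:.3f}")
```

The flagged fraction on the training set should land in the neighbourhood of nu, which is one quick way to sanity-check the setting.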
2. Isolation Forest
What is it? Randomly partitions data and isolates anomalies with fewer splits.
from sklearn.ensemble import IsolationForest
model = IsolationForest(contamination=0.1)
- Works well with high-dimensional data
- Fast and efficient for large datasets
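To make the isolation idea concrete, here is a small sketch in which one point is placed far from the bulk of the data. The data and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_bulk = rng.normal(size=(300, 2))      # bulk of the data around the origin
far_point = np.array([[8.0, 8.0]])      # one obvious anomaly (assumption)
X_all = np.vstack([X_bulk, far_point])  # the anomaly is the last row

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
model.fit(X_all)

# decision_function: negative values are flagged as outliers by predict
scores = model.decision_function(X_all)
labels = model.predict(X_all)           # -1 = outlier, 1 = inlier

print("Score of the far point:", scores[-1])
print("Label of the far point:", labels[-1])
```

Points that are easy to separate with few random splits receive low scores, which is why the isolated point ends up well below the typical score.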
3. Elliptic Envelope
What is it? Fits a Gaussian distribution to the dataset and detects outliers as points far from the center.
from sklearn.covariance import EllipticEnvelope
model = EllipticEnvelope(contamination=0.1)
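- Best suited to data whose inliers are roughly Gaussian (unimodal and elliptical)
- Can perform poorly when that assumption does not hold, for example on multimodal or strongly skewed data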
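A minimal sketch of the "far from the center" idea, assuming correlated Gaussian data generated for the example. The point [3, -3] runs against the correlation, so its Mahalanobis distance is large even though each coordinate on its own is not extreme.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(1)
# Synthetic inliers (assumption): strongly correlated 2-D Gaussian
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=400)

model = EllipticEnvelope(contamination=0.05, random_state=1).fit(X)

X_new = np.array([
    [0.0, 0.0],    # at the center of the distribution
    [3.0, -3.0],   # against the correlation direction
])
new_pred = model.predict(X_new)   # -1 = outlier, 1 = inlier
d2 = model.mahalanobis(X_new)     # squared Mahalanobis distances to the fitted center

print(new_pred)
print(d2)
```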
4. Fit the Model
model.fit(X_train)
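- Trains the model on X_train. For novelty detection, X_train should contain only inliers (normal cases); for outlier detection, it may contain some outliers, whose expected share is set through contamination (or nu)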
5. Predict Outliers
pred = model.predict(X_test) # -1 = anomaly, 1 = normal
- Use output for further investigation or alert systems
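The fit and predict steps can be combined into a short end-to-end sketch. The training and test arrays below are synthetic assumptions; the two rows at 10.0 stand in for events that should trigger an alert.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(7)
X_train = rng.normal(size=(500, 3))     # historical, mostly normal data (assumption)
X_test = np.vstack([
    rng.normal(size=(20, 3)),           # 20 ordinary new observations
    np.full((2, 3), 10.0),              # 2 extreme observations (rows 20 and 21)
])

model = IsolationForest(contamination=0.05, random_state=7).fit(X_train)

pred = model.predict(X_test)            # -1 = anomaly, 1 = normal
anomaly_idx = np.flatnonzero(pred == -1)

# Hand the flagged rows to whatever investigation or alerting step follows
for i in anomaly_idx:
    print(f"Row {i} flagged for review: {X_test[i]}")
```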
Real-Life Project: Detecting Fraudulent Transactions
Project Overview
Identify potentially fraudulent financial transactions using Isolation Forest.
Code Example
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
# Load dataset
df = pd.read_csv('transactions.csv')
X = df.drop(columns=['is_fraud'])
y = df['is_fraud']
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit model
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X_scaled)
# Predict
y_pred = model.predict(X_scaled)
y_pred = [1 if p == -1 else 0 for p in y_pred] # convert -1 to 1 (fraud)
print(classification_report(y, y_pred))
Expected Output
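- A classification report with precision, recall, and F1-score for each class (0 = normal, 1 = fraud)
- Recall and precision for the fraud class depend on the contamination setting and on how separable fraud is in the available features; high recall is not guaranteed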
Common Mistakes to Avoid
- ❌ Using raw unscaled data with scale-sensitive models such as One-Class SVM
- ❌ Ignoring contamination rate tuning
- ❌ Applying to labeled supervised data (better handled with classifiers)
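One way to approach contamination tuning, when a small labeled validation set exists, is to sweep a few values and compare precision and recall. The sketch below uses synthetic data and arbitrary candidate values; both are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.RandomState(3)
# Synthetic validation data (assumption): 950 normal points, 50 anomalies
X = np.vstack([
    rng.normal(size=(950, 2)),
    rng.uniform(-8.0, 8.0, size=(50, 2)),
])
y_true = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]  # 1 = anomaly

results = {}
for c in (0.01, 0.05, 0.10):
    model = IsolationForest(contamination=c, random_state=3).fit(X)
    y_pred = (model.predict(X) == -1).astype(int)  # map -1 to 1 (anomaly)
    results[c] = (
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred),
    )
    print(f"contamination={c:.2f}  precision={results[c][0]:.2f}  recall={results[c][1]:.2f}")
```

Raising contamination flags more points, so recall can only stay the same or rise while precision typically falls; the right balance depends on the cost of missed anomalies versus false alarms.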