Support Vector Machines (SVM) with Scikit-learn

Support Vector Machines (SVMs) are versatile classifiers that look for the hyperplane separating the classes with the widest possible margin. They are particularly effective in high-dimensional spaces and on datasets where the classes are clearly separated. Scikit-learn provides the SVC class for classification tasks.

Key Characteristics of SVM

Effective in High Dimensions: Works well even with thousands of features.
Margin Maximization: Finds the widest margin between classes.
Kernel Trick: Supports linear and non-linear classification using kernels.
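Robust to Overfitting: The regularization parameter C limits overfitting, provided it is tuned.
Binary Classifier: Extended to multi-class by combining binary classifiers; SVC uses a one-vs-one scheme internally, while LinearSVC uses one-vs-rest.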
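The multi-class behaviour can be checked directly. The sketch below is illustrative only: it uses a synthetic four-class dataset from make_blobs (an assumption, not data from this tutorial) and compares the shape of decision_function under the two settings of decision_function_shape. In scikit-learn, SVC always trains one-vs-one classifiers; the parameter only changes how the scores are reported.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic data with 4 classes, used only to illustrate output shapes
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)

# One-vs-one reports one score per pair of classes: 4 * 3 / 2 = 6 columns
ovo_shape = ovo.decision_function(X[:3]).shape
# The "ovr" shape reports one aggregated score per class: 4 columns
ovr_shape = ovr.decision_function(X[:3]).shape

print(ovo_shape)  # (3, 6)
print(ovr_shape)  # (3, 4)
```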

Basic Rules for Using SVM

Use StandardScaler to standardize features before training, and fit it on the training data only.
Select kernel type (linear, rbf, poly) based on the problem.
Tune C (all kernels) and gamma (rbf, poly, and sigmoid kernels) for better performance.
For large datasets, use LinearSVC for speed.
Always evaluate with cross-validation.
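The rules above can be combined in a single workflow. The following is a minimal sketch, assuming the built-in breast cancer dataset as stand-in data and an arbitrary small parameter grid; neither comes from this tutorial. Placing the scaler inside a pipeline means it is re-fit on the training folds only during cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset chosen for illustration
X, y = load_breast_cancer(return_X_y=True)

# Scaling and the classifier are bundled so each CV fold scales independently
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Baseline: 5-fold cross-validation with default C and gamma
scores = cross_val_score(pipe, X, y, cv=5)
print("Baseline CV accuracy: %.3f" % scores.mean())

# Small illustrative grid; step names follow make_pipeline's lowercase convention
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV accuracy: %.3f" % search.best_score_)
```

A wider grid or a logarithmic range for C and gamma is a common next step once a promising region has been found.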

Syntax Table

SL NO	Function	Syntax Example	Description
1	Import SVM	`from sklearn.svm import SVC`	Imports the SVM classifier
2	Instantiate Model	`model = SVC(kernel='rbf')`	Initializes the SVM with RBF kernel
3	Fit Model	`model.fit(X_train, y_train)`	Trains the model
4	Predict Labels	`model.predict(X_test)`	Predicts class labels
5	Feature Scaling	`StandardScaler().fit_transform(X_train)`	Standardizes features for SVM (fit on training data)

Syntax Explanation

1. Import and Instantiate

What is it? Import the SVC class and create a classifier with the chosen kernel.
Syntax:

from sklearn.svm import SVC
model = SVC(kernel='rbf', C=1.0, gamma='scale')

Explanation:
- kernel='rbf' enables non-linear decision boundaries.
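- C controls the trade-off between a wide margin and misclassified training points: a larger C penalizes errors more heavily and narrows the margin.
- gamma sets the reach of each training point in the RBF kernel: a larger gamma gives a narrower kernel and a more flexible boundary.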
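To make the role of gamma concrete, the sketch below computes the RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2) by hand and compares it with scikit-learn's rbf_kernel helper. The random input matrix and the value gamma = 0.5 are arbitrary assumptions for illustration. The last line shows the value that gamma='scale' resolves to according to the scikit-learn documentation, 1 / (n_features * X.var()).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Arbitrary small matrix: 5 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

gamma = 0.5  # illustrative value

# Squared Euclidean distance between every pair of rows
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

# Reference implementation from scikit-learn
K_sklearn = rbf_kernel(X, X, gamma=gamma)
print("Kernels match:", np.allclose(K_manual, K_sklearn))

# Value used when SVC is created with gamma='scale'
gamma_scale = 1.0 / (X.shape[1] * X.var())
print("gamma='scale' would be:", gamma_scale)
```

Because the kernel value decays with distance at a rate set by gamma, a large gamma makes only very close points look similar to each other.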

2. Fit the Model

What is it? Train the classifier.
Syntax:

model.fit(X_train, y_train)

Explanation:
- Finds the maximum-margin hyperplane; the training points that lie on or inside the margin become the support vectors and fully determine the boundary.
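After fitting, the support vectors can be inspected through attributes of the trained model. This is a small sketch on synthetic two-class data from make_blobs (an assumption for illustration); coef_ is available here only because the kernel is linear.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class, two-feature data for illustration
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

model = SVC(kernel="linear", C=1.0)
model.fit(X, y)

# Number of support vectors found for each class
print("Support vectors per class:", model.n_support_)
# Row indices (into X) of the support vectors
print("Support vector indices:", model.support_)
# Hyperplane parameters w and b; coef_ exists only for the linear kernel
print("Weights:", model.coef_, "Intercept:", model.intercept_)
```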

3. Predict Labels

What is it? Predict class of unseen instances.
Syntax:

y_pred = model.predict(X_test)

Explanation:
- Uses the learned boundary to classify inputs.
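For a binary problem, predict is equivalent to checking which side of the boundary a sample falls on. The sketch below, again on synthetic make_blobs data chosen for illustration, derives the labels from the sign of decision_function and compares them with predict.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data for illustration
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
model = SVC(kernel="linear").fit(X, y)

# Signed score per sample; the magnitude grows with distance from the boundary
scores = model.decision_function(X[:5])
labels = model.predict(X[:5])

# Positive score maps to classes_[1], negative score to classes_[0]
from_scores = model.classes_[(scores > 0).astype(int)]

print("Scores:", scores)
print("Labels match:", np.array_equal(labels, from_scores))
```

The raw scores are useful when a custom decision threshold is needed, for example to trade recall for precision.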

4. Feature Scaling

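What is it? Standardize input features to zero mean and unit variance.
Syntax:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on the training split only, then reuse the same statistics for the test split
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)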

Explanation:
- Improves SVM performance by centering and scaling features.
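The effect of scaling can be checked on a dataset whose features have very different ranges. This sketch assumes the built-in wine dataset as an example; the scaler is fit on the training split only, and the same statistics are applied to the test split. The exact accuracies depend on the data and the split, so none are quoted here.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Example dataset with features on very different scales
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Learn mean and standard deviation from the training split only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Same default RBF model, trained on raw and on standardized features
raw_acc = SVC().fit(X_train, y_train).score(X_test, y_test)
scaled_acc = SVC().fit(X_train_s, y_train).score(X_test_s, y_test)

print("Test accuracy without scaling: %.3f" % raw_acc)
print("Test accuracy with scaling:    %.3f" % scaled_acc)
```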

Real-Life Project: Spam Email Detection with SVM

Project Name

SVM-based Spam Classifier

Project Overview

Build a binary classifier using SVM to distinguish between spam and legitimate emails using TF-IDF features.

Project Goal

Train an SVM model on textual email data
Evaluate using precision, recall, and F1
Vectorize the text with TF-IDF before training (extra scaling is optional because TF-IDF vectors are already L2-normalized)

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Load dataset
data = pd.read_csv('emails.csv')
X = data['text']
y = data['label']  # spam or ham

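# Train-test split (done before vectorization so the test emails stay unseen)
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Text vectorization: learn vocabulary and IDF weights from the training set only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Scaling (optional: TF-IDF rows are already L2-normalized; with_mean=False keeps the matrix sparse)
# scaler = StandardScaler(with_mean=False)
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)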

# Train model
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Expected Output

Overall accuracy printed to the console (the value depends on the dataset)
Per-class precision, recall, and F1 report
Linear kernel SVM trained on email features

Common Mistakes to Avoid

❌ Using unscaled features → reduces performance
❌ Not tuning hyperparameters (C, gamma, kernel)
❌ Using SVM on large datasets without approximation (slow training)
❌ Ignoring class imbalance in spam/ham datasets
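Two of these mistakes, slow training on large datasets and ignored class imbalance, can be addressed together. The sketch below is a hedged illustration: the messages are made-up toy examples (not from emails.csv), LinearSVC stands in for SVC(kernel='linear') as the faster linear solver, and class_weight='balanced' reweights classes inversely to their frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up messages for illustration only (imbalanced: 2 spam, 6 ham)
texts = [
    "win free prize now claim",
    "free money win cash prize",
    "meeting tomorrow at noon",
    "see you at lunch",
    "project report attached",
    "call me about the meeting",
    "can we reschedule the meeting tomorrow",
    "notes from the project call",
]
labels = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]

# Vectorizer and classifier in one pipeline; balanced weights offset the imbalance
clf = make_pipeline(
    TfidfVectorizer(),
    LinearSVC(class_weight="balanced", C=1.0),
)
clf.fit(texts, labels)

new_messages = ["win a free prize", "see you at the meeting tomorrow"]
pred = clf.predict(new_messages)
print(list(zip(new_messages, pred)))
```

On a real dataset the same pipeline can be passed to cross_val_score or GridSearchCV, scoring with F1 or recall on the spam class rather than plain accuracy.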