Support Vector Machines (SVM) with Scikit-learn

Support Vector Machines (SVMs) are powerful and versatile classifiers that aim to find the optimal hyperplane separating different classes. SVMs are particularly effective in high-dimensional spaces and for datasets with a clear margin of separation. Scikit-learn provides SVC for classification tasks.

Key Characteristics of SVM

  • Effective in High Dimensions: Works well even with thousands of features.
  • Margin Maximization: Finds the widest margin between classes.
  • Kernel Trick: Supports linear and non-linear classification using kernels.
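  • Resistant to Overfitting: Margin maximization acts as regularization, provided C (and gamma for non-linear kernels) is tuned.
  • Binary Classifier: Binary by design; scikit-learn extends it to multi-class with a one-vs-one scheme in SVC and a one-vs-rest scheme in LinearSVC.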
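
To make the kernel trick concrete, here is a minimal sketch comparing a linear and an RBF kernel on a dataset that no straight line can separate. The synthetic make_circles data and the chosen parameters are assumptions for illustration, not part of the original material.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: the classes are not linearly separable in 2-D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Same data, same default C; only the kernel changes.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"rbf kernel accuracy:    {rbf_acc:.2f}")
```

On data like this the RBF kernel should score clearly higher, because it can draw a closed boundary around the inner ring.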

Basic Rules for Using SVM

  • Use StandardScaler to standardize features (zero mean, unit variance) before training.
  • Select the kernel type (linear, rbf, poly) based on the problem.
  • Tune C (and gamma for rbf and poly kernels) for better performance.
  • For large datasets, use LinearSVC for speed.
  • Always evaluate with cross-validation.
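
The rules above can be combined into one workflow. The following is a sketch, assuming the built-in breast cancer dataset as a stand-in and a small, arbitrary parameter grid: scaling sits inside a pipeline so that each cross-validation fold is scaled using only its own training portion.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# make_pipeline names each step after its class, so SVC becomes "svc".
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative grid only; sensible ranges depend on the dataset.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01],  # ignored by the linear kernel
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```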

Syntax Table

SL NO | Function        | Syntax Example                           | Description
1     | Import SVM      | from sklearn.svm import SVC              | Imports the SVM classifier
2     | Instantiate Model | model = SVC(kernel='rbf')              | Initializes the SVM with RBF kernel
3     | Fit Model       | model.fit(X_train, y_train)              | Trains the model
4     | Predict Labels  | model.predict(X_test)                    | Predicts class labels
5     | Feature Scaling | StandardScaler().fit_transform(X)        | Standardizes features for SVM

Syntax Explanation

1. Import and Instantiate

  • What is it? Load the SVM classifier and create it with a specified kernel.
  • Syntax:
from sklearn.svm import SVC
model = SVC(kernel='rbf', C=1.0, gamma='scale')
  • Explanation:
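    • kernel='rbf' enables non-linear decision boundaries.
    • C is the regularization parameter: a larger C penalizes misclassified training points more heavily (narrower margin), while a smaller C allows a wider, more tolerant margin.
    • gamma sets the reach of each training point in the RBF kernel: a larger gamma gives a narrower kernel and a more flexible boundary. gamma='scale' derives the value from the data.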
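
As a small illustration of what gamma='scale' resolves to: the scikit-learn documentation gives the value as 1 / (n_features * X.var()). The sketch below, which assumes the built-in iris dataset, computes that number by hand and checks that passing it explicitly yields the same decision values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Documented meaning of gamma='scale': 1 / (n_features * variance of all values in X).
gamma_value = 1.0 / (X.shape[1] * X.var())

auto_model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
manual_model = SVC(kernel="rbf", C=1.0, gamma=gamma_value).fit(X, y)

# Both models should produce the same decision values on the training data.
same = np.allclose(
    auto_model.decision_function(X), manual_model.decision_function(X)
)
print(f"gamma value: {gamma_value:.4f}")
print("identical decision values:", same)
```

Because the value depends on the variance of X, it changes when the features are rescaled, which is one more reason to scale before fitting.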

2. Fit the Model

  • What is it? Train the classifier.
  • Syntax:
model.fit(X_train, y_train)
  • Explanation:
    • Solves for the maximum-margin hyperplane; the training points that end up defining it are the support vectors.
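
After fitting, the support vectors can be inspected directly. This is a minimal sketch on synthetic make_blobs data (an assumption for illustration); support_vectors_ and n_support_ are attributes of a fitted SVC.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two compact, well-separated clusters in 2-D.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.6, random_state=0)

model = SVC(kernel="linear", C=1.0).fit(X, y)

print("training points:", X.shape[0])
print("support vectors per class:", model.n_support_)
print("support vector array shape:", model.support_vectors_.shape)
```

Typically only a small fraction of the training points become support vectors; the rest could be removed without changing the fitted boundary.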

3. Predict Labels

  • What is it? Predict the class of unseen instances.
  • Syntax:
y_pred = model.predict(X_test)
  • Explanation:
    • Uses the learned boundary to classify inputs.
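
To show how predict relates to the learned boundary, the sketch below compares it with decision_function for a binary problem. The synthetic data is an assumption for illustration; in the binary case, scikit-learn documents that a positive decision value corresponds to classes_[1].

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = SVC(kernel="rbf").fit(X_train, y_train)

scores = model.decision_function(X_test)  # one signed score per test sample
y_pred = model.predict(X_test)

# Positive score -> classes_[1], otherwise classes_[0] (binary case only).
from_scores = model.classes_[(scores > 0).astype(int)]
matches = np.array_equal(y_pred, from_scores)
print("predict agrees with the sign of decision_function:", matches)
```

The magnitude of the score can serve as a rough confidence ranking, although it is not a probability.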

4. Feature Scaling

  • What is it? Standardize input features to zero mean and unit variance.
  • Syntax:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
  • Explanation:
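    • Improves SVM performance by centering and scaling features, so that no single feature dominates the distance computations.
    • When evaluating on held-out data, fit the scaler on the training set only and reuse it to transform the test set, so no test-set statistics leak into training.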
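
A short sketch of scaling with a train/test split, assuming the built-in wine dataset (whose features have very different ranges) as an example. The scaler learns its statistics from the training data and is then reused unchanged on the test data.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test data

unscaled_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)
scaled_acc = SVC(kernel="rbf").fit(X_train_scaled, y_train).score(X_test_scaled, y_test)

print(f"accuracy without scaling: {unscaled_acc:.2f}")
print(f"accuracy with scaling:    {scaled_acc:.2f}")
```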

Real-Life Project: Spam Email Detection with SVM

Project Name

SVM-based Spam Classifier

Project Overview

Build a binary classifier using SVM to distinguish between spam and legitimate emails using TF-IDF features.

Project Goal

  • Train an SVM model on textual email data
  • Evaluate using precision, recall, and F1
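  • Optionally apply feature scaling before model training (TF-IDF vectors are already L2-normalized by default, so this step is often skipped for text)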

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Load dataset
data = pd.read_csv('emails.csv')
X = data['text']
y = data['label']  # spam or ham

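# Train-test split (split the raw text first so the test set stays unseen)
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Text vectorization: fit TF-IDF on the training text only, then reuse it on the test text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Scaling (optional for TF-IDF features, but shown for completeness)
# scaler = StandardScaler(with_mean=False)  # with_mean=False keeps the matrix sparse
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)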

# Train model
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Expected Output

  • An accuracy score on the held-out test set (the exact value depends on the dataset)
  • Detailed precision/recall/F1 report
  • Linear kernel SVM trained on email features
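
Since emails.csv is not included here, the following self-contained sketch runs the same idea on a handful of toy messages. The messages and labels are hypothetical and invented purely for illustration; wrapping the vectorizer and classifier in a pipeline is a design choice of this sketch, not something the project code above does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy messages, invented for illustration only.
texts = [
    "win a free prize now",
    "claim your free cash reward",
    "free offer click now",
    "urgent prize waiting claim now",
    "meeting moved to monday",
    "see you at lunch tomorrow",
    "please review the attached report",
    "notes from the project meeting",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# The pipeline fits TF-IDF and the SVM together, so raw strings go in directly.
spam_clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
spam_clf.fit(texts, labels)

new_messages = ["free prize claim now", "project meeting on monday"]
predictions = spam_clf.predict(new_messages)
for message, label in zip(new_messages, predictions):
    print(f"{label:>4}: {message}")
```

A pipeline also guarantees that new text is transformed with the vocabulary learned during training, which avoids refitting the vectorizer by accident.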

Common Mistakes to Avoid

  • ❌ Using unscaled features → reduces performance
  • ❌ Not tuning hyperparameters (C, gamma, kernel)
  • ❌ Using SVM on large datasets without approximation (slow training)
  • ❌ Ignoring class imbalance in spam/ham datasets
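
Two of these mistakes can be addressed together. The sketch below uses LinearSVC (the faster linear implementation mentioned earlier) with class_weight='balanced' on a synthetic imbalanced dataset. The data is an assumed stand-in for a spam/ham set with roughly 10% positives.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for an imbalanced dataset: about 10% of samples are class 1.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1],
    class_sep=0.8, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

plain = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)
# 'balanced' reweights errors inversely to class frequency in the training data.
balanced = LinearSVC(C=1.0, class_weight="balanced", max_iter=10000).fit(X_train, y_train)

plain_recall = recall_score(y_test, plain.predict(X_test))
balanced_recall = recall_score(y_test, balanced.predict(X_test))

print(f"minority recall, default weights:  {plain_recall:.2f}")
print(f"minority recall, balanced weights: {balanced_recall:.2f}")
```

Balanced weights usually raise recall on the minority class at some cost in precision, so it is worth checking the full classification report before choosing.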

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
