Integration of Scikit-learn with Matplotlib and Seaborn

Integrating Scikit-learn with Matplotlib and Seaborn allows users to visualize data distributions, model performance, feature relationships, and decision boundaries. These visual insights are crucial for model evaluation, diagnostics, and presentations.

Key Characteristics

  • Enhances interpretability through visualizations
  • Useful for EDA (Exploratory Data Analysis) and model diagnostics
  • Compatible with Scikit-learn outputs such as predictions, feature importances, and confusion matrices
  • Enables plotting decision boundaries, correlation heatmaps, and distribution plots

Basic Rules

  • Use Matplotlib for low-level, customizable plotting
  • Use Seaborn for high-level, attractive statistical plots
  • Integrate visualizations at various steps: before training (EDA), during model evaluation, and after prediction
  • Convert NumPy arrays or Scikit-learn outputs into Pandas DataFrames for Seaborn compatibility (see the sketch below)
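
As an illustration of the last rule, a minimal sketch that wraps Scikit-learn's NumPy output in a DataFrame; the Iris dataset is used purely as an example:

import pandas as pd
from sklearn.datasets import load_iris

# Scikit-learn returns plain NumPy arrays; a DataFrame gives Seaborn named columns
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # labels become a named column usable for hue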

Syntax Table

SL NO | Task | Syntax Example | Description
1 | Import Libraries | import matplotlib.pyplot as plt; import seaborn as sns | Loads Matplotlib and Seaborn
2 | Plot Confusion Matrix | sns.heatmap(cm, annot=True) | Visualizes classification performance
3 | Plot Feature Distribution | sns.histplot(df['feature']) | Shows the distribution of a single feature
4 | Scatter Plot with Hue | sns.scatterplot(x=..., y=..., hue=...) | Visualizes feature relationships by class
5 | Decision Boundary (2D) | plt.contourf(xx, yy, Z) | Plots classifier decision regions

Syntax Explanation

1. Import Libraries

What is it?
Loads Matplotlib and Seaborn.

Syntax:

import matplotlib.pyplot as plt
import seaborn as sns

Explanation:

  • matplotlib.pyplot is used for flexible, low-level charting.
  • seaborn is built on top of Matplotlib, offering a simplified interface for statistical plots with built-in themes.
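
As a small illustration of that layering, a Seaborn theme restyles even plain Matplotlib figures. A minimal sketch, assuming a recent Seaborn version (0.11+) where sns.set_theme() is available:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")  # Seaborn styling now applies to all Matplotlib figures
plt.plot([1, 2, 3], [2, 4, 1])    # even a low-level Matplotlib plot picks up the theme
plt.show()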

2. Plot Confusion Matrix

What is it?
Displays confusion matrix results as a heatmap.

Syntax:

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")

Explanation:

  • cm is typically obtained via confusion_matrix(y_test, y_pred).
  • annot=True displays the numbers inside cells.
  • fmt='d' specifies integer format.
  • cmap='Blues' applies a blue gradient for clarity.
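
Putting it together, a minimal end-to-end sketch; the Iris dataset and logistic regression are used purely for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

# Train a simple classifier and compute its confusion matrix
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))

# Render the matrix as an annotated heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()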

3. Plot Feature Distribution

What is it?
Visualizes the distribution of a single feature or class.

Syntax:

sns.histplot(df['feature'], kde=True)

Explanation:

  • Shows the frequency of data points within intervals.
  • kde=True overlays a Kernel Density Estimate curve.
  • Helpful for checking normality or skew in data.
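
A runnable sketch, using one Iris feature as an example (the column name comes from the dataset's own feature_names):

from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Histogram of one feature with a KDE overlay to judge skew and modality
sns.histplot(df['petal length (cm)'], kde=True)
plt.show()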

4. Scatter Plot with Hue

What is it?
Plots relationships between two numeric features, colored by class.

Syntax:

sns.scatterplot(x='feature1', y='feature2', hue='label', data=df)

Explanation:

  • Useful for visualizing separation or clusters by label.
  • hue maps point colors to the values of a categorical column.
  • Common in binary or multiclass classification visuals.
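
A minimal sketch on the Iris data; the 'label' column is added here just to drive the hue mapping:

from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['label'] = data.target  # categorical column used for color mapping

# Two numeric features, colored by class label
sns.scatterplot(x='petal length (cm)', y='petal width (cm)', hue='label', data=df)
plt.show()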

5. Plot Decision Boundary

What is it?
Shows the boundary regions learned by a classifier in 2D.

Syntax:

plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')

Explanation:

  • Requires a meshgrid (xx, yy) covering the feature space and predictions Z = model.predict(...) reshaped to match xx.shape.
  • contourf() fills the regions assigned to each predicted class.
  • Effective for classifiers like SVM, Logistic Regression, and KNN in 2D; the project below walks through the full workflow.

Real-Life Project: Visualizing Decision Boundaries in Iris Dataset

Project Overview

Visualize how a classifier (e.g., Logistic Regression) separates classes in the Iris dataset.

Code Example

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and prepare data
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.iloc[:, [2, 3]].values  # use petal length and width
y = df['target']

# Standardize the two features (fit on the full set here to keep the plot simple;
# in a modeling pipeline, fit the scaler on the training split only)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

# Meshgrid for plotting
x_min, x_max = X_scaled[:, 0].min() - 1, X_scaled[:, 0].max() + 1
y_min, y_max = X_scaled[:, 1].min() - 1, X_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                     np.arange(y_min, y_max, 0.01))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(10,6))
plt.contourf(xx, yy, Z, alpha=0.3)
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=y, palette="deep")
plt.title("Decision Boundary - Logistic Regression on Iris")
plt.xlabel("Petal Length (standardized)")
plt.ylabel("Petal Width (standardized)")
plt.show()

Expected Output

  • Scatter plot overlaid with decision regions
  • Differentiated classes via color-coded hues

Common Mistakes to Avoid

  • ❌ Passing raw NumPy arrays to Seaborn when a DataFrame with named columns would give clearer labels and legends
  • ❌ Training on scaled data but building the meshgrid or scatter in unscaled space (keep the model and the plot in the same feature space)
  • ❌ Forgetting to adjust figure size, titles, or axis labels for clarity

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon