Integrating Scikit-learn with Matplotlib and Seaborn allows users to visualize data distributions, model performance, feature relationships, and decision boundaries. These visual insights are crucial for model evaluation, diagnostics, and presentations.
Key Characteristics
- Enhances interpretability through visualizations
- Useful for EDA (Exploratory Data Analysis) and model diagnostics
- Compatible with Scikit-learn's outputs like predictions, feature importance, confusion matrices, etc.
- Enables plotting decision boundaries, correlation heatmaps, and distribution plots
Basic Rules
- Use Matplotlib for low-level, customizable plotting
- Use Seaborn for high-level, attractive statistical plots
- Integrate visualizations at various steps: before training (EDA), during model evaluation, and after prediction
- Convert NumPy arrays or Scikit-learn outputs into Pandas DataFrames for Seaborn compatibility (see the sketch below)
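As a minimal sketch of that last rule (the Iris dataset serves purely as an example source of NumPy arrays):
import pandas as pd
from sklearn.datasets import load_iris
# Scikit-learn returns plain NumPy arrays; a DataFrame adds the named
# columns that Seaborn's plotting functions expect
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # attach labels as an ordinary column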
Syntax Table
| SL NO | Task | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Libraries | import matplotlib.pyplot as plt; import seaborn as sns | Loads Matplotlib for plotting and Seaborn for statistical plots |
| 2 | Plot Confusion Matrix | sns.heatmap(cm, annot=True) | Visualizes classification performance |
| 3 | Plot Feature Distribution | sns.histplot(df['feature']) | Shows distribution of a single feature |
| 4 | Scatter Plot with Hue | sns.scatterplot(x=..., y=..., hue=...) | Visualizes feature relationships |
| 5 | Decision Boundary (2D) | plt.contourf(xx, yy, Z) | Plots classifier decision boundaries |
Syntax Explanation
1. Import Libraries
What is it?
Loads Matplotlib and Seaborn.
Syntax:
import matplotlib.pyplot as plt
import seaborn as sns
Explanation:
matplotlib.pyplot is used for flexible, low-level charting. seaborn is built on top of Matplotlib, offering a simplified interface for statistical plots with built-in themes.
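Since Seaborn ships built-in themes, one common pattern is to activate them once so that later Matplotlib figures inherit the styling; a minimal sketch:
import matplotlib.pyplot as plt
import seaborn as sns
# Apply Seaborn's default theme to all subsequent Matplotlib figures
sns.set_theme()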
2. Plot Confusion Matrix
What is it?
Displays confusion matrix results as a heatmap.
Syntax:
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
Explanation:
- cm is typically obtained via confusion_matrix(y_test, y_pred).
- annot=True displays the numbers inside cells.
- fmt='d' specifies integer format.
- cmap='Blues' applies a blue gradient for clarity.
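To make this concrete, a minimal end-to-end sketch (the Iris dataset and logistic regression here are illustrative choices, not requirements):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
# Train a simple classifier to obtain predictions
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Compute the confusion matrix and draw it as an annotated heatmap
cm = confusion_matrix(y_test, model.predict(X_test))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()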
3. Plot Feature Distribution
What is it?
Visualizes the distribution of a single feature or class.
Syntax:
sns.histplot(df['feature'], kde=True)
Explanation:
- Shows the frequency of data points within intervals.
- kde=True overlays a Kernel Density Estimate curve.
- Helpful for checking normality or skew in data.
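A self-contained sketch, using the Iris data as a stand-in for df:
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Wrap the NumPy feature matrix in a DataFrame for named columns
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Histogram with a KDE overlay to check skew and normality
sns.histplot(df['petal length (cm)'], kde=True)
plt.show()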
4. Scatter Plot with Hue
What is it?
Plots relationships between two numeric features, colored by class.
Syntax:
sns.scatterplot(x='feature1', y='feature2', hue='label', data=df)
Explanation:
- Useful for visualizing separation or clusters by label.
- hue defines the color mapping based on a categorical column.
- Common in binary or multiclass classification visuals.
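A short sketch with Iris; mapping the integer class codes to species names gives hue a discrete legend:
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['label'] = data.target_names[data.target]  # integer codes -> species names
# Color each point by class to reveal separation between species
sns.scatterplot(x='petal length (cm)', y='petal width (cm)', hue='label', data=df)
plt.show()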
5. Plot Decision Boundary
What is it?
Shows the boundary regions learned by a classifier in 2D.
Syntax:
plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
Explanation:
- Requires a meshgrid (xx, yy) and predictions Z = model.predict(...).
- contourf() fills the regions separated by class.
- Effective for classifiers like SVM, Logistic Regression, and KNN in 2D (see the full project example below).
Real-Life Project: Visualizing Decision Boundaries in Iris Dataset
Project Overview
Visualize how a classifier (e.g., Logistic Regression) separates classes in the Iris dataset.
Code Example
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load and prepare data
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.iloc[:, [2, 3]].values # use petal length and width
y = df['target']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)
# Train a logistic regression classifier on the two scaled features
model = LogisticRegression()
model.fit(X_train, y_train)
# Meshgrid for plotting
x_min, x_max = X_scaled[:, 0].min() - 1, X_scaled[:, 0].max() + 1
y_min, y_max = X_scaled[:, 1].min() - 1, X_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
np.arange(y_min, y_max, 0.01))
# Classify every grid point, then reshape to match the meshgrid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot
plt.figure(figsize=(10,6))
plt.contourf(xx, yy, Z, alpha=0.3)
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=y, palette="deep")
plt.title("Decision Boundary - Logistic Regression on Iris")
plt.xlabel("Petal Length (standardized)")
plt.ylabel("Petal Width (standardized)")
plt.show()
Expected Output
- Scatter plot overlaid with decision regions
- Differentiated classes via color-coded hues
Common Mistakes to Avoid
- ❌ Using raw NumPy arrays directly in Seaborn (prefer Pandas DataFrames)
- ❌ Not standardizing data before plotting decision boundaries
- ❌ Forgetting to adjust figure size or labels for clarity
Further Reading Recommendation
Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
Available on Amazon
