t-SNE Visualization using Scikit-learn

t-SNE (t-distributed Stochastic Neighbor Embedding) is a powerful non-linear technique for visualizing high-dimensional data in 2 or 3 dimensions. It’s particularly effective at preserving local structure and separating clusters when visualized.

Key Characteristics of t-SNE

  • Non-linear Dimensionality Reduction
  • Preserves Local Neighbor Relationships
  • Ideal for Visualization
  • Effective on Complex Datasets
  • Computationally Intensive

Basic Rules for Using t-SNE

  • Standardize data beforehand.
  • Use it primarily for visualization (not downstream modeling).
  • Start with default parameters, then tune perplexity and learning_rate.
  • Works best on datasets with fewer than ~10,000 samples.
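The rules above can be sketched in a few lines. This is a minimal, illustrative example on synthetic data (the array sizes and seed are assumptions, not from the original text): standardize first, start near the defaults, and fix the seed for reproducibility.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# Hypothetical dataset: 150 samples, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

# Standardize before t-SNE, then embed with near-default parameters
X_scaled = StandardScaler().fit_transform(X)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_scaled)

print(X_embedded.shape)  # (150, 2)
```

The embedding is used only for plotting afterwards, never as input features to a downstream model.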

Syntax Table

| SL NO | Function          | Syntax Example                        | Description                      |
|-------|-------------------|---------------------------------------|----------------------------------|
| 1     | Import t-SNE      | from sklearn.manifold import TSNE     | Load the t-SNE class             |
| 2     | Instantiate Model | tsne = TSNE(n_components=2)           | Define the target dimensions     |
| 3     | Fit and Transform | X_tsne = tsne.fit_transform(X_scaled) | Reduce to 2D/3D for visualization |

Syntax Explanation

1. Import t-SNE

from sklearn.manifold import TSNE
  • Imports the TSNE class from Scikit-learn’s manifold module.
  • Required to create and configure a t-SNE model.
  • The class provides all methods necessary for fitting and transforming data.

2. Instantiate Model

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200)
  • n_components: Number of output dimensions (commonly 2 for visualization).
  • perplexity: Balances local versus global aspects of the data; roughly the effective number of neighbors each point considers. Tune it to the dataset size (typical values range from 5 to 50, and it must be smaller than the number of samples).
  • learning_rate: Controls the step size of the optimization.
    • Too low can produce cramped, poor embeddings.
    • Too high can cause the optimization to diverge.
  • Other important optional parameters:
    • max_iter: Number of optimization iterations (default 1000; named n_iter in scikit-learn versions before 1.5).
    • init: Initialization method ('random' or 'pca'; 'pca' is usually more stable).
    • random_state: Fixes randomness for reproducibility.
    • metric: Distance measure (default is 'euclidean').
  • t-SNE is sensitive to parameter values, so adjust them and compare the resulting plots iteratively.
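Because t-SNE is parameter-sensitive, a common workflow is to sweep perplexity and compare runs. A minimal sketch on synthetic data (sizes and parameter values are illustrative assumptions), using the fitted model's `kl_divergence_` attribute as a rough quality signal:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical standardized data: 120 samples, 5 features
rng = np.random.default_rng(1)
X_scaled = rng.normal(size=(120, 5))

# Sweep perplexity and record the final KL divergence of each embedding
results = {}
for perplexity in (5, 15, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity,
                learning_rate=200, init='pca', random_state=42)
    X_emb = tsne.fit_transform(X_scaled)
    results[perplexity] = tsne.kl_divergence_
    print(perplexity, X_emb.shape)
```

A lower final KL divergence suggests a better fit at that perplexity, but it is not a substitute for visually inspecting the resulting plots.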

3. Fit and Transform

X_tsne = tsne.fit_transform(X_scaled)
  • Applies t-SNE to reduce dimensionality of X_scaled.
  • fit_transform combines learning the embedding and transforming the data in one step.
  • Input X_scaled should be a preprocessed (standardized or normalized) numerical dataset.
  • Output X_tsne is a NumPy array of shape (n_samples, n_components).
  • The resulting low-dimensional data can be visualized using scatter plots.
  • Keep in mind that t-SNE is non-deterministic unless a random_state is set, so results may vary slightly between runs.
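The reproducibility point can be verified directly. A small sketch on synthetic data (shapes and seed are illustrative assumptions): with the same `random_state`, two fits on the same input produce the same embedding.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical preprocessed data: 100 samples, 3 features
rng = np.random.default_rng(2)
X_scaled = rng.normal(size=(100, 3))

# Two runs with the same fixed seed yield identical embeddings
a = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X_scaled)
b = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X_scaled)
print(a.shape)  # (100, 2)
```

Omitting `random_state` leaves the initialization random, so the layout (orientation and cluster positions) can differ between runs even though the neighborhood structure is similar.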

Real-Life Project: Visualizing Customer Segments with t-SNE

Project Overview

Visualize customer groupings from mall data using t-SNE to reveal non-linear patterns not captured by PCA.

Code

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# Load dataset
data = pd.read_csv('Mall_Customers.csv')
X = data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=40, learning_rate=200, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], edgecolor='k')
plt.title('t-SNE Visualization of Customer Features')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.grid(True)
plt.show()

Expected Output

  • A 2D scatter plot in which visually distinct customer groupings emerge.
  • Richer pattern discovery than PCA when the underlying clusters are non-linear.

Common Mistakes to Avoid

  • ❌ Using raw (unscaled) data.
  • ❌ Expecting consistent results across runs without random_state.
  • ❌ Using t-SNE output for training predictive models.

Further Reading