t-SNE (t-distributed Stochastic Neighbor Embedding) is a powerful non-linear technique for visualizing high-dimensional data in 2 or 3 dimensions. It’s particularly effective at preserving local structure and separating clusters when visualized.
Key Characteristics of t-SNE
- Non-linear Dimensionality Reduction
- Preserves Local Neighbor Relationships
- Ideal for Visualization
- Effective on Complex Datasets
- Computationally Intensive
Basic Rules for Using t-SNE
- Standardize data beforehand.
- Use it primarily for visualization (not downstream modeling).
- Start with default parameters, then tune `perplexity` and `learning_rate`.
- Works best on datasets with fewer than ~10,000 samples.
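The rules above can be sketched in a small, self-contained example. The dataset here (scikit-learn's bundled Iris data) is only a stand-in for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# Small illustrative dataset: 150 samples, 4 features
X, _ = load_iris(return_X_y=True)

# Rule 1: standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# Rule 2: start from defaults; fix random_state so runs are comparable
embedding = TSNE(n_components=2, random_state=42).fit_transform(X_scaled)

print(embedding.shape)  # (150, 2)
```

With 150 samples, the default `perplexity=30` is already a reasonable starting point; tuning only matters once the default embedding looks unsatisfactory.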
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import t-SNE | `from sklearn.manifold import TSNE` | Load the t-SNE class |
| 2 | Instantiate Model | `tsne = TSNE(n_components=2)` | Define target dimensions |
| 3 | Fit and Transform | `X_tsne = tsne.fit_transform(X_scaled)` | Reduce to 2D/3D for visualization |
Syntax Explanation
1. Import t-SNE
from sklearn.manifold import TSNE
- Imports the `TSNE` class from Scikit-learn's `manifold` module.
- Required to create and configure a t-SNE model.
- The class provides all methods necessary for fitting and transforming data.
2. Instantiate Model
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200)
- `n_components`: Number of output dimensions (commonly 2 for visualization).
- `perplexity`: Influences the balance between local and global aspects of the data. It should be tuned based on dataset size (typical values range from 5 to 50).
- `learning_rate`: Controls how quickly the model converges.
  - Too low can result in poor embeddings.
  - Too high can lead to divergence.
- Other important optional parameters:
  - `n_iter`: Number of optimization iterations (default 1000).
  - `init`: Initialization method (`'random'` or `'pca'`).
  - `random_state`: Fixes randomness for reproducibility.
  - `metric`: Distance measure (default is `'euclidean'`).
- t-SNE is sensitive to parameter values; adjust and visualize outcomes iteratively.
3. Fit and Transform
X_tsne = tsne.fit_transform(X_scaled)
- Applies t-SNE to reduce the dimensionality of `X_scaled`.
- `fit_transform` combines learning the embedding and transforming the data in one step. Note that t-SNE offers no separate `transform` method for new, unseen points.
- Input `X_scaled` should be a preprocessed (standardized or normalized) numerical dataset.
- Output `X_tsne` is a NumPy array of shape `(n_samples, n_components)`.
- The resulting low-dimensional data can be visualized using scatter plots.
- Keep in mind that t-SNE is non-deterministic unless a `random_state` is set, so results may vary slightly between runs.
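To illustrate that last point, fixing `random_state` (together with a deterministic initialization) makes repeated runs agree. A minimal check, again using Iris as a stand-in dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

X_scaled = StandardScaler().fit_transform(load_iris(return_X_y=True)[0])

def embed(seed):
    # method='exact' is fine for this small dataset and keeps the run deterministic
    return TSNE(n_components=2, init='pca', method='exact',
                random_state=seed).fit_transform(X_scaled)

same = np.allclose(embed(0), embed(0))
print(same)  # True
```

Without a fixed seed, each run starts from a different random configuration and can converge to a differently oriented (or differently arranged) embedding.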
Real-Life Project: Visualizing Customer Segments with t-SNE
Project Overview
Visualize customer groupings from mall data using t-SNE to reveal non-linear patterns not captured by PCA.
Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
# Load dataset
data = pd.read_csv('Mall_Customers.csv')
X = data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=40, learning_rate=200, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
# Plot
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], edgecolor='k')
plt.title('t-SNE Visualization of Customer Features')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.grid(True)
plt.show()
Expected Output
- 2D plot where visually distinct clusters emerge.
- Richer pattern discovery than PCA when clusters are nonlinear.
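If `Mall_Customers.csv` is not at hand, the PCA comparison mentioned above can be sketched on any standardized dataset; the digits data is used here purely as a stand-in:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X[:500])

# Linear projection vs. non-linear embedding of the same data
X_pca = PCA(n_components=2).fit_transform(X_scaled)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_scaled)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y[:500], cmap='tab10', s=15)
ax1.set_title('PCA (linear)')
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y[:500], cmap='tab10', s=15)
ax2.set_title('t-SNE (non-linear)')
plt.show()
```

Typically the t-SNE panel shows tighter, better-separated digit clusters than the PCA panel, since class boundaries in this data are non-linear.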
Common Mistakes to Avoid
- ❌ Using raw (unscaled) data.
- ❌ Expecting consistent results across runs without `random_state`.
- ❌ Using t-SNE output for training predictive models.
Further Reading
- Scikit-learn t-SNE Documentation
- Visualizing High-Dimensional Data with t-SNE (Blog)
- t-SNE Examples on Kaggle
