t-SNE Visualization using Scikit-learn

t-SNE (t-distributed Stochastic Neighbor Embedding) is a powerful non-linear technique for visualizing high-dimensional data in 2 or 3 dimensions. It’s particularly effective at preserving local structure and separating clusters when visualized.

Key Characteristics of t-SNE

  • Non-linear Dimensionality Reduction
  • Preserves Local Neighbor Relationships
  • Ideal for Visualization
  • Effective on Complex Datasets
  • Computationally Intensive

Basic Rules for Using t-SNE

  • Standardize data beforehand.
  • Use it primarily for visualization (not downstream modeling).
  • Start with default parameters, then tune perplexity and learning_rate.
  • Works best on datasets with fewer than ~10,000 samples.
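The rules above can be sketched in a few lines. This is a minimal, illustrative example on synthetic data (the array sizes and seed are assumptions, not from the original text): standardize first, start near the defaults, and fix the seed for reproducibility.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# Hypothetical dataset: 150 samples, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

# Standardize before t-SNE, then embed with near-default parameters
X_scaled = StandardScaler().fit_transform(X)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_scaled)

print(X_embedded.shape)  # (150, 2)
```

The embedding is used only for plotting afterwards, never as input features to a downstream model.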

Syntax Table

| SL NO | Function          | Syntax Example                        | Description                      |
|-------|-------------------|---------------------------------------|----------------------------------|
| 1     | Import t-SNE      | from sklearn.manifold import TSNE     | Load the t-SNE class             |
| 2     | Instantiate Model | tsne = TSNE(n_components=2)           | Define the target dimensions     |
| 3     | Fit and Transform | X_tsne = tsne.fit_transform(X_scaled) | Reduce to 2D/3D for visualization |

Syntax Explanation

1. Import t-SNE

from sklearn.manifold import TSNE
  • Imports the TSNE class from Scikit-learn’s manifold module.
  • Required to create and configure a t-SNE model.
  • The class provides all methods necessary for fitting and transforming data.

2. Instantiate Model

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200)
  • n_components: Number of output dimensions (commonly 2 for visualization).
  • perplexity: Balances local versus global aspects of the data; roughly the effective number of neighbors each point considers. Tune it to the dataset size (typical values range from 5 to 50, and it must be smaller than the number of samples).
  • learning_rate: Controls the step size of the optimization.
    • Too low can produce cramped, poor embeddings.
    • Too high can cause the optimization to diverge.
  • Other important optional parameters:
    • max_iter: Number of optimization iterations (default 1000; named n_iter in scikit-learn versions before 1.5).
    • init: Initialization method ('random' or 'pca'; 'pca' is usually more stable).
    • random_state: Fixes randomness for reproducibility.
    • metric: Distance measure (default is 'euclidean').
  • t-SNE is sensitive to parameter values, so adjust them and compare the resulting plots iteratively.
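Because t-SNE is parameter-sensitive, a common workflow is to sweep perplexity and compare runs. A minimal sketch on synthetic data (sizes and parameter values are illustrative assumptions), using the fitted model's `kl_divergence_` attribute as a rough quality signal:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical standardized data: 120 samples, 5 features
rng = np.random.default_rng(1)
X_scaled = rng.normal(size=(120, 5))

# Sweep perplexity and record the final KL divergence of each embedding
results = {}
for perplexity in (5, 15, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity,
                learning_rate=200, init='pca', random_state=42)
    X_emb = tsne.fit_transform(X_scaled)
    results[perplexity] = tsne.kl_divergence_
    print(perplexity, X_emb.shape)
```

A lower final KL divergence suggests a better fit at that perplexity, but it is not a substitute for visually inspecting the resulting plots.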

3. Fit and Transform

X_tsne = tsne.fit_transform(X_scaled)
  • Applies t-SNE to reduce dimensionality of X_scaled.
  • fit_transform combines learning the embedding and transforming the data in one step.
  • Input X_scaled should be a preprocessed (standardized or normalized) numerical dataset.
  • Output X_tsne is a NumPy array of shape (n_samples, n_components).
  • The resulting low-dimensional data can be visualized using scatter plots.
  • Keep in mind that t-SNE is non-deterministic unless a random_state is set, so results may vary slightly between runs.
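The reproducibility point can be verified directly. A small sketch on synthetic data (shapes and seed are illustrative assumptions): with the same `random_state`, two fits on the same input produce the same embedding.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical preprocessed data: 100 samples, 3 features
rng = np.random.default_rng(2)
X_scaled = rng.normal(size=(100, 3))

# Two runs with the same fixed seed yield identical embeddings
a = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X_scaled)
b = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X_scaled)
print(a.shape)  # (100, 2)
```

Omitting `random_state` leaves the initialization random, so the layout (orientation and cluster positions) can differ between runs even though the neighborhood structure is similar.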

Real-Life Project: Visualizing Customer Segments with t-SNE

Project Overview

Visualize customer groupings from mall data using t-SNE to reveal non-linear patterns not captured by PCA.

Code

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# Load dataset
data = pd.read_csv('Mall_Customers.csv')
X = data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=40, learning_rate=200, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], edgecolor='k')
plt.title('t-SNE Visualization of Customer Features')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.grid(True)
plt.show()

Expected Output

  • A 2D scatter plot in which visually distinct customer groupings emerge.
  • Richer pattern discovery than PCA when the underlying clusters are non-linear.

Common Mistakes to Avoid

  • ❌ Using raw (unscaled) data.
  • ❌ Expecting consistent results across runs without random_state.
  • ❌ Using t-SNE output for training predictive models.

Further Reading