Train-test splitting is a fundamental concept in machine learning. It ensures that models are trained on one portion of the data and evaluated on another, which gives an honest estimate of how well the model generalizes and makes overfitting visible. Scikit-learn provides a simple and reliable utility for splitting datasets.
Key Characteristics of Train-Test Splitting
- Ensures Generalization: Evaluates model performance on unseen data.
- Randomization Support: Shuffles the dataset by default before splitting.
- Custom Split Ratios: Allows flexible train/test proportions.
- Stratification: Maintains class balance during classification splits.
- Reproducibility: Controlled with a random seed (random_state).
Basic Rules for Train-Test Splits
- Always split before preprocessing or model training.
- Use train_test_split() from sklearn.model_selection.
- Stratify on target variable when dealing with classification problems.
- Avoid data leakage by ensuring test data is untouched during training.
- Use a fixed random_state to ensure reproducibility.
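The rules above can be sketched in a few lines. This is a minimal illustration, assuming synthetic NumPy data and StandardScaler as the preprocessing step (neither comes from the original text); the point is only the order of operations: split first, fit on the training portion, then transform both.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1. Split first, stratifying on the classification target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 2. Fit the preprocessing step on the training portion only.
scaler = StandardScaler().fit(X_train)

# 3. Apply the already-fitted transform to both portions.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Because the scaler never sees the test rows while fitting, the test set's statistics cannot leak into training.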
Syntax Table
| No. | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Function | from sklearn.model_selection import train_test_split | Imports splitter from Scikit-learn |
| 2 | Basic Split | X_train, X_test, y_train, y_test = train_test_split(X, y) | Splits data into train/test |
| 3 | Custom Ratio | train_test_split(X, y, test_size=0.3) | 70/30 split example |
| 4 | Set Seed | train_test_split(X, y, random_state=42) | Ensures reproducible results |
| 5 | Stratified Split | train_test_split(X, y, stratify=y) | Maintains label proportions |
Syntax Explanation
1. Basic Train-Test Split
- What is it? Separates features and target into training and testing groups.
- Syntax:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
- Explanation:
- Default split is 75% training and 25% testing.
- Random shuffling is performed before splitting.
- Keeps features (X) and target (y) aligned.
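To see the default behaviour in action, here is a small sketch on toy data. The arrays are made up for illustration: each label equals the first feature of its row, which makes alignment easy to check after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, with y[i] equal to the first feature of row i.
X = np.arange(200).reshape(100, 2)
y = X[:, 0].copy()

# No test_size given, so the default 75/25 split applies.
X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape, X_test.shape)  # (75, 2) (25, 2)

# Rows and labels stay paired even though the order was shuffled.
print(np.array_equal(X_train[:, 0], y_train))  # True
print(np.array_equal(X_test[:, 0], y_test))    # True
```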
2. Custom Split Ratio
- What is it? Allows control over the percentage allocated to the test set.
- Syntax:
train_test_split(X, y, test_size=0.2)
- Explanation:
- 80% of data for training and 20% for testing.
- Accepts float (0.2 = 20%) or int (e.g., 100 samples).
- Ensure test size is not too small for model evaluation.
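The difference between a float and an int test_size can be checked directly. The sketch below uses a made-up array of 100 samples; the sizes are chosen only to make the arithmetic obvious.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with 100 samples (illustrative only).
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Float: fraction of the data reserved for testing.
X_tr_frac, X_te_frac, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# Int: absolute number of test samples.
X_tr_abs, X_te_abs, _, _ = train_test_split(X, y, test_size=30, random_state=0)

print(len(X_tr_frac), len(X_te_frac))  # 80 20
print(len(X_tr_abs), len(X_te_abs))    # 70 30
```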
3. Stratified Splitting
- What is it? Maintains label balance between train and test sets.
- Syntax:
train_test_split(X, y, stratify=y)
- Explanation:
- Especially useful for imbalanced datasets.
- Ensures proportion of each class is consistent.
- Crucial for fair performance evaluation.
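A quick way to confirm what stratification does is to count classes on each side of the split. The example below assumes a toy label vector with a 90/10 imbalance, invented for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Per-class counts: the 90/10 ratio is kept in both subsets.
print(np.bincount(y_train).tolist())  # [72, 8]
print(np.bincount(y_test).tolist())   # [18, 2]
```

Without stratify=y, a random 20-sample test set drawn from this data could easily contain too few positives, or none at all, which would make metrics such as recall unreliable.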
4. Reproducibility with Random Seed
- What is it? Ensures the same random split on every run.
- Syntax:
train_test_split(X, y, random_state=42)
- Explanation:
- Without a seed, the shuffle differs on each run, so evaluation results can change.
- Setting random_state makes results reproducible.
- Use same seed across experiments for consistency.
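The effect of the seed is easy to verify: two calls with the same random_state should return identical partitions. The sketch below uses small made-up arrays, and the seed values 42 and 7 are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only).
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two calls with the same seed return identical partitions.
a_train, a_test, _, _ = train_test_split(X, y, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, random_state=42)
print(np.array_equal(a_train, b_train))  # True
print(np.array_equal(a_test, b_test))    # True

# A different seed will generally select a different test set.
c_train, c_test, _, _ = train_test_split(X, y, random_state=7)
```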
Real-Life Project: Splitting Heart Disease Dataset
Project Name
Train-Test Split for Predicting Heart Disease
Project Overview
The dataset includes various health metrics and a binary label indicating presence of heart disease. Proper train-test splitting will allow unbiased model evaluation.
Project Goal
- Split data into train/test sets
- Maintain label balance using stratification
- Prepare data for preprocessing and modeling
Code for This Project
import pandas as pd
from sklearn.model_selection import train_test_split
# Load dataset
data = pd.read_csv('heart_disease.csv')
X = data.drop('target', axis=1)
y = data['target']
# Split with stratification and seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
Expected Output
- 80% of data in X_train, y_train
- 20% in X_test, y_test
- Class distribution preserved
- Reproducible split for modeling workflows
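These expectations can be checked programmatically. Since heart_disease.csv is not included here, the sketch below builds a synthetic stand-in DataFrame; the feature columns (age, chol), the row count, and the 40% positive rate are hypothetical, and only the target column name comes from the project code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heart_disease.csv (hypothetical columns and values).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(30, 80, size=1000),
    "chol": rng.integers(150, 350, size=1000),
    "target": np.array([1] * 400 + [0] * 600),
})
X = data.drop("target", axis=1)
y = data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Sizes: 80% and 20% of the rows.
print(len(X_train), len(X_test))  # 800 200

# Share of positive labels in the full data and in each subset.
print(y.mean(), y_train.mean(), y_test.mean())  # 0.4 0.4 0.4

# Features and labels remain aligned by index.
print(X_train.index.equals(y_train.index))  # True
```

On the real dataset the same three checks apply; only the concrete numbers will differ.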
Common Mistakes to Avoid
- ❌ Fitting preprocessing before splitting → causes data leakage
- ❌ Ignoring class imbalance → skews evaluation metrics
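- ❌ Forgetting random_state → inconsistent results
- ❌ Unpacking the returned values in the wrong order (it is X_train, X_test, y_train, y_test) → features and labels end up in the wrong variables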
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
Also explore:
- 🔗 Scikit-learn Docs: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- 🔗 Cross-validation Guide: https://scikit-learn.org/stable/modules/cross_validation.html
- 🔗 Tutorials on Data Splitting in ML Workflows (YouTube, Kaggle)
