Splitting Data into Train and Test Sets using Scikit-learn

Train-test splitting is a fundamental practice in machine learning. A model is trained on one portion of the data and evaluated on another, held-out portion, which gives an honest estimate of how well it generalizes and makes overfitting visible. Scikit-learn provides a simple and reliable utility for splitting datasets.

Key Characteristics of Train-Test Splitting

Ensures Generalization: Evaluates model performance on unseen data.
Randomization Support: Shuffles the rows before splitting by default (shuffle=True).
Custom Split Ratios: Allows flexible train/test proportions.
Stratification: Preserves the original class proportions in both sets for classification problems.
Reproducibility: Controlled with random seed (random_state).

Basic Rules for Train-Test Splits

Always split before fitting any preprocessing step or model.
Use train_test_split() from sklearn.model_selection.
Stratify on target variable when dealing with classification problems.
Avoid data leakage by keeping test data out of every fitting step, including scalers, encoders, and feature selection.
Use a fixed random_state to ensure reproducibility.

Syntax Table

No.	Function	Syntax Example	Description
1	Import Function	`from sklearn.model_selection import train_test_split`	Imports splitter from Scikit-learn
2	Basic Split	`X_train, X_test, y_train, y_test = train_test_split(X, y)`	Splits data into train/test
3	Custom Ratio	`train_test_split(X, y, test_size=0.3)`	70/30 split example
4	Set Seed	`train_test_split(X, y, random_state=42)`	Ensures reproducible results
5	Stratified Split	`train_test_split(X, y, stratify=y)`	Maintains label proportions

Syntax Explanation

1. Basic Train-Test Split

What is it? Separates features and target into training and testing groups.
Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

Explanation:
- Default split is 75% training and 25% testing.
- Random shuffling is performed before splitting.
- Keeps feature (X) and target (y) aligned.
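The default behavior described above can be checked with a small sketch. The toy arrays here are an assumption made for illustration: each target value equals its row index, so alignment between X and y is easy to verify after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 samples, 2 features.
# Row i of X is [2*i, 2*i + 1] and y[i] == i.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# No test_size given, so the default 75/25 split applies.
X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape, X_test.shape)  # (75, 2) (25, 2)

# X[:, 0] // 2 recovers the original row index, which must match y.
aligned = np.array_equal(X_train[:, 0] // 2, y_train)
print(aligned)  # True
```

Because no random_state is set, the rows that land in each set differ from run to run, but the sizes and the alignment do not.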

2. Custom Split Ratio

What is it? Allows control over the percentage allocated to the test set.
Syntax:

train_test_split(X, y, test_size=0.2)

Explanation:
- 80% of data for training and 20% for testing.
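- Accepts a float between 0 and 1 (0.2 = 20% of the samples) or an int (an absolute count, e.g., 100 samples).
- Keep the test set large enough for stable metrics; with very few test samples, a single misclassification shifts the score noticeably.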
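A short sketch of the two ways to specify test_size, using made-up data of 500 samples. The variable names are illustrative, not part of the library.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 500 samples, 2 features.
X = np.arange(1000).reshape(500, 2)
y = np.arange(500)

# Float: a fraction of the dataset goes to the test set.
X_tr_frac, X_te_frac, y_tr_frac, y_te_frac = train_test_split(X, y, test_size=0.2)

# Int: an absolute number of test samples.
X_tr_abs, X_te_abs, y_tr_abs, y_te_abs = train_test_split(X, y, test_size=50)

print(len(X_tr_frac), len(X_te_frac))  # 400 100
print(len(X_tr_abs), len(X_te_abs))    # 450 50
```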

3. Stratified Splitting

What is it? Preserves the class proportions of the full dataset in both the train and test sets.
Syntax:

train_test_split(X, y, stratify=y)

Explanation:
- Especially useful for imbalanced datasets.
- Keeps the proportion of each class approximately the same in the train and test sets as in the full dataset.
- Crucial for fair performance evaluation.
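To see the effect of stratification, the sketch below uses an invented 90/10 imbalanced label array and counts the classes in each set. The data and the choice of test_size=0.2 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both sets keep the 90/10 ratio of the full dataset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts)  # [72  8]
print(test_counts)   # [18  2]
```

Without stratify=y, a random 20-sample test set drawn from this data could easily contain zero or only one minority-class sample, which makes metrics such as recall unreliable.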

4. Reproducibility with Random Seed

What is it? Ensures the same random split on every run.
Syntax:

train_test_split(X, y, random_state=42)

Explanation:
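- Without a seed, shuffling produces a different split on each run, so reported metrics vary.
- Setting random_state to a fixed integer makes the split reproducible.
- Use the same seed across experiments so that differences in results come from the model, not from the split.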
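The reproducibility claim can be verified directly: two calls with the same random_state should return identical arrays. This is a minimal check on made-up data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 50 samples, 2 features.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Two independent calls with the same seed.
first = train_test_split(X, y, random_state=42)
second = train_test_split(X, y, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```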

Real-Life Project: Splitting Heart Disease Dataset

Project Name

Train-Test Split for Predicting Heart Disease

Project Overview

The dataset includes various health metrics and a binary label indicating the presence of heart disease. A proper train-test split keeps the evaluation data separate from training, which allows an unbiased estimate of model performance.

Project Goal

Split data into train/test sets
Preserve class proportions using stratification
Prepare data for preprocessing and modeling

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split

# Load dataset
data = pd.read_csv('heart_disease.csv')
X = data.drop('target', axis=1)
y = data['target']

# Split with stratification and seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Expected Output

80% of data in X_train, y_train
20% in X_test, y_test
Class distribution preserved
Reproducible split for modeling workflows
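The expected output can be confirmed with a few checks after the split. Since heart_disease.csv is not included here, the sketch below substitutes a synthetic DataFrame; the column names age and chol, the row count of 200, and the 30% positive rate are all assumptions made only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (hypothetical columns and values).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(30, 80, size=200),
    "chol": rng.integers(150, 300, size=200),
    "target": np.array([1] * 60 + [0] * 140),
})
X = data.drop("target", axis=1)
y = data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Check the 80/20 sizes.
print(len(X_train), len(X_test))  # 160 40

# Check that the share of positive labels matches across sets.
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

With the real CSV, the same checks apply once the file is loaded with pd.read_csv as in the project code.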

Common Mistakes to Avoid

❌ Fitting preprocessing before splitting → causes data leakage
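❌ Skipping stratify on imbalanced data → class proportions differ between train and test, skewing evaluation metrics
❌ Forgetting random_state → a different split, and different metrics, on every run
❌ Mixing up the order of X and y, in the call or when unpacking (outputs come back as X_train, X_test, y_train, y_test) → features and labels assigned to the wrong variables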
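The first mistake in the list is the easiest to make, so here is one way to avoid it: split first, then fit the preprocessing step on the training data only. StandardScaler is used as an example preprocessor and the random data is invented; neither comes from the project above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data: 200 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = np.array([0] * 100 + [1] * 100)

# Split BEFORE any preprocessing is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
# fit_transform learns the mean and std from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the training statistics; the test rows are never fitted on.
X_test_scaled = scaler.transform(X_test)
```

Calling scaler.fit_transform(X) on the full dataset before splitting would let the test rows influence the learned mean and standard deviation, which is the leakage the list warns about.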