Splitting Data into Train and Test Sets using Scikit-learn

Train-test splitting is a fundamental practice in machine learning. The model is trained on one portion of the data and evaluated on a separate, held-out portion, which gives an estimate of how well it generalizes and helps reveal overfitting. Scikit-learn provides the train_test_split() utility for splitting datasets this way.

Key Characteristics of Train-Test Splitting

  • Ensures Generalization: Evaluates model performance on unseen data.
  • Randomization Support: Shuffles the data before splitting by default (shuffle=True).
  • Custom Split Ratios: Allows flexible train/test proportions.
  • Stratification: Keeps class proportions similar in the train and test sets for classification problems.
  • Reproducibility: Controlled with random seed (random_state).

Basic Rules for Train-Test Splits

  • Always split before fitting any preprocessing step or model.
  • Use train_test_split() from sklearn.model_selection.
  • Stratify on the target variable when dealing with classification problems.
  • Avoid data leakage by ensuring test data is untouched during training.
  • Use a fixed random_state to ensure reproducibility.
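
The rules above can be sketched in a few lines. This is a minimal illustration, not a prescribed workflow: the data is synthetic, and StandardScaler is used only as an example of a preprocessing step that learns statistics from the data. The point is that the scaler is fitted on the training set alone and then applied unchanged to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any numeric feature matrix works here).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1. Split first, with a fixed seed and stratification on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Fit preprocessing on the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Apply the already-fitted scaler to the test data (no refitting).
X_test_scaled = scaler.transform(X_test)
```

Because the test rows never influence the scaler's mean and variance, the later evaluation is not contaminated by information from the test set.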

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Import Function | from sklearn.model_selection import train_test_split | Imports the splitter from Scikit-learn
2 | Basic Split | X_train, X_test, y_train, y_test = train_test_split(X, y) | Splits data into train/test sets
3 | Custom Ratio | train_test_split(X, y, test_size=0.3) | 70/30 split example
4 | Set Seed | train_test_split(X, y, random_state=42) | Ensures reproducible results
5 | Stratified Split | train_test_split(X, y, stratify=y) | Maintains label proportions

Syntax Explanation

1. Basic Train-Test Split

  • What is it? Separates features and target into training and testing groups.
  • Syntax:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
  • Explanation:
    • Default split is 75% training and 25% testing.
    • Random shuffling is performed before splitting.
    • Keeps feature (X) and target (y) aligned.
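
To make the default behaviour concrete, here is a small sketch on synthetic data (the arrays are illustrative, not from any real dataset). Each row of X is built so that its original index can be recovered, which lets us check that features and targets stay aligned after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 samples, 3 features; row i is [3i, 3i+1, 3i+2] and y[i] = i.
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Default proportions: 75 training rows and 25 test rows.
print(X_train.shape, X_test.shape)

# The first column divided by 3 recovers the original row index,
# which should match the corresponding target value.
aligned = np.array_equal(X_train[:, 0] // 3, y_train)
print("Rows still aligned:", aligned)
```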

2. Custom Split Ratio

  • What is it? Allows control over the percentage allocated to the test set.
  • Syntax:
train_test_split(X, y, test_size=0.2)
  • Explanation:
    • 80% of data for training and 20% for testing.
    • Accepts float (0.2 = 20%) or int (e.g., 100 samples).
    • Ensure test size is not too small for model evaluation.
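
The difference between a float and an int test_size can be seen in a short sketch. The data below is synthetic and the sizes are chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 500 synthetic samples with 2 features each.
X = np.arange(1000).reshape(500, 2)
y = np.arange(500)

# Float: interpreted as a fraction of the dataset (20% of 500 = 100 rows).
X_train_frac, X_test_frac, _, _ = train_test_split(X, y, test_size=0.2)

# Int: interpreted as an absolute number of test samples (exactly 50 rows).
X_train_abs, X_test_abs, _, _ = train_test_split(X, y, test_size=50)

print(len(X_train_frac), len(X_test_frac))
print(len(X_train_abs), len(X_test_abs))
```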

3. Stratified Splitting

  • What is it? Maintains label balance between train and test sets.
  • Syntax:
train_test_split(X, y, stratify=y)
  • Explanation:
    • Especially useful for imbalanced datasets.
    • Ensures proportion of each class is consistent.
    • Crucial for fair performance evaluation.
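
A quick way to see what stratification does is to compare class proportions before and after the split. The sketch below uses a synthetic, deliberately imbalanced label vector (90 negatives, 10 positives); the numbers are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced synthetic labels: 10% positive class.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# With stratify=y, the positive rate in both subsets matches the
# overall rate as closely as the subset sizes allow.
train_pos_rate = y_train.mean()
test_pos_rate = y_test.mean()
print(train_pos_rate, test_pos_rate)
```

Without stratify=y, a small test set drawn from data like this could end up with very few positive samples, or none, which would make metrics such as recall unreliable.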

4. Reproducibility with Random Seed

  • What is it? Ensures the same random split on every run.
  • Syntax:
train_test_split(X, y, random_state=42)
  • Explanation:
    • Without a fixed seed, the shuffle differs between runs, so the split and any metrics computed on it can change.
    • Setting random_state makes results reproducible.
    • Use same seed across experiments for consistency.
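
The effect of random_state can be checked directly by splitting the same data twice with the same seed and comparing the results. This is a minimal sketch on synthetic arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two independent calls with the same seed.
split_a = train_test_split(X, y, random_state=42)
split_b = train_test_split(X, y, random_state=42)

# Each call returns [X_train, X_test, y_train, y_test]; compare pairwise.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)
```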

Real-Life Project: Splitting Heart Disease Dataset

Project Name

Train-Test Split for Predicting Heart Disease

Project Overview

The dataset includes various health metrics and a binary label indicating presence of heart disease. Proper train-test splitting will allow unbiased model evaluation.

Project Goal

  • Split data into train/test sets
  • Maintain label balance using stratification
  • Prepare data for preprocessing and modeling

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split

# Load dataset
data = pd.read_csv('heart_disease.csv')
X = data.drop('target', axis=1)
y = data['target']

# Split with stratification and seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Expected Output

  • 80% of data in X_train, y_train
  • 20% in X_test, y_test
  • Class distribution preserved
  • Reproducible split for modeling workflows
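
The project code itself prints nothing, so it can help to verify these properties explicitly. Since heart_disease.csv is not included here, the sketch below uses a small hypothetical DataFrame as a stand-in; the column names age and chol and all values are invented for illustration, and only the target column name comes from the project code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for heart_disease.csv: 50 rows, 40% positive labels.
data = pd.DataFrame({
    "age": range(30, 80),
    "chol": range(180, 280, 2),
    "target": [0] * 30 + [1] * 20,
})
X = data.drop("target", axis=1)
y = data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Check the split sizes and the class balance in each subset.
print(len(X_train), len(X_test))
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```

The same three print statements can be run after the project code on the real dataset to confirm the sizes and label proportions.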

Common Mistakes to Avoid

  • ❌ Fitting preprocessing before splitting → causes data leakage
  • ❌ Ignoring class imbalance → skews evaluation metrics
  • ❌ Forgetting random_state → inconsistent results
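  • ❌ Unpacking the outputs in the wrong order → the function returns X_train, X_test, y_train, y_test, so a swapped order puts features and labels in the wrong variables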

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan