Train-Test Split Simulator

Overview

The Train-Test Split Simulator demonstrates data splitting strategies for machine learning model evaluation. It visualizes how random, stratified, time-based, and group-based splits affect data distribution and class balance, helping you choose the splitting method appropriate for your use case and avoid common pitfalls such as data leakage.
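The leakage the simulator visualizes can be shown with a toy sketch in plain Python (the data and split sizes here are illustrative, not part of the tool): a random split lets "future" rows into training, while a time-based split keeps training strictly before testing.

```python
import random

timestamps = list(range(100))          # toy sequential data, one row per step

# Random split: shuffled indices mix past and future.
random.seed(0)
shuffled = timestamps[:]
random.shuffle(shuffled)
rand_train, rand_test = shuffled[:80], shuffled[80:]

# Time-based split: training rows strictly precede test rows.
time_train, time_test = timestamps[:80], timestamps[80:]

# With the random split, some training rows come after some test rows
# (temporal leakage); the time-based split has no such overlap.
leaky = any(tr > min(rand_test) for tr in rand_train)
clean = all(tr < min(time_test) for tr in time_train)
```

For forecasting tasks, `leaky` ends up true and `clean` false only for the random split, which is why the time-based strategy exists.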


Tips

  • Use stratified splitting for classification problems with imbalanced classes to ensure both sets have representative class distributions
  • Always use time-based splits for sequential data such as time series or forecasting; random splits leak future information into training
  • Apply group-based splits when you have multiple samples per subject (e.g., patient studies, user behavior) to prevent information leakage
  • Set a random seed and document it for reproducible experiments and fair model comparisons
  • For small datasets (<1000 samples), use 20-30% for testing; for large datasets (>10,000), 10-20% is sufficient
  • Visualize the class distribution charts to verify your test set truly represents your data distribution
  • Consider implementing a three-way split (train/validation/test) when you need separate sets for hyperparameter tuning and final evaluation
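Taken together, the tips above can be sketched with scikit-learn (a minimal sketch, assuming scikit-learn and NumPy are installed; the dataset, group IDs, and seed below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.1).astype(int)   # imbalanced: ~10% positive class
groups = rng.integers(0, 100, size=1000)   # e.g. 100 patients, several samples each

# Stratified split: both sets keep roughly the same class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Time-based split: no shuffling, so the test set is strictly "after" training.
cut = int(0.8 * len(X))
X_train_time, X_test_time = X[:cut], X[cut:]

# Group-based split: all samples from one subject land on the same side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))

# Three-way split: hold out the test set first, then carve validation
# out of the remainder (0.25 * 0.8 = 0.2 of the full data).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)
```

Fixing `random_state` throughout, as in the seed tip, makes every split reproducible for fair model comparisons.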