Train-Test Split Simulator

Overview

The Train-Test Split Simulator demonstrates data splitting strategies for machine learning model evaluation. It visualizes how random, stratified, time-based, and group-based splits affect data distribution and class balance, helping you choose the splitting method appropriate for your use case and avoid common pitfalls such as data leakage.
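The leakage the simulator visualizes can be shown with a toy sketch in plain Python (the data and split sizes here are illustrative, not part of the tool): a random split lets "future" rows into training, while a time-based split keeps training strictly before testing.

```python
import random

timestamps = list(range(100))          # toy sequential data, one row per step

# Random split: shuffled indices mix past and future.
random.seed(0)
shuffled = timestamps[:]
random.shuffle(shuffled)
rand_train, rand_test = shuffled[:80], shuffled[80:]

# Time-based split: training rows strictly precede test rows.
time_train, time_test = timestamps[:80], timestamps[80:]

# With the random split, some training rows come after some test rows
# (temporal leakage); the time-based split has no such overlap.
leaky = any(tr > min(rand_test) for tr in rand_train)
clean = all(tr < min(time_test) for tr in time_train)
```

For forecasting tasks, `leaky` ends up true and `clean` false only for the random split, which is why the time-based strategy exists.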


Tips

  • Use stratified splitting for classification problems with imbalanced classes to ensure both sets have representative class distributions
  • Always use time-based splits for sequential data such as time series or forecasting; random splits leak future information into training
  • Apply group-based splits when you have multiple samples per subject (e.g., patient studies, user behavior) to prevent information leakage
  • Set a random seed and document it for reproducible experiments and fair model comparisons
  • For small datasets (<1000 samples), use 20-30% for testing; for large datasets (>10,000), 10-20% is sufficient
  • Visualize the class distribution charts to verify your test set truly represents your data distribution
  • Consider implementing a three-way split (train/validation/test) when you need separate sets for hyperparameter tuning and final evaluation
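Taken together, the tips above can be sketched with scikit-learn (a minimal sketch, assuming scikit-learn and NumPy are installed; the dataset, group IDs, and seed below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.1).astype(int)   # imbalanced: ~10% positive class
groups = rng.integers(0, 100, size=1000)   # e.g. 100 patients, several samples each

# Stratified split: both sets keep roughly the same class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Time-based split: no shuffling, so the test set is strictly "after" training.
cut = int(0.8 * len(X))
X_train_time, X_test_time = X[:cut], X[cut:]

# Group-based split: all samples from one subject land on the same side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))

# Three-way split: hold out the test set first, then carve validation
# out of the remainder (0.25 * 0.8 = 0.2 of the full data).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)
```

Fixing `random_state` throughout, as in the seed tip, makes every split reproducible for fair model comparisons.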