Train-Test Split Simulator

Visualize splitting strategies, stratified splits, and class distribution


What is Train-Test Split?

Train-test split divides a dataset into two subsets:

  • Training Set: Used to train/fit the model (typically 70-80%)
  • Test Set: Used to evaluate model performance (typically 20-30%)

This prevents overfitting by evaluating on unseen data.
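The split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the simulator's actual implementation; the `train_test_split` helper below is a hypothetical name chosen for clarity.

```python
import random

def train_test_split(data, test_size=0.2, seed=42):
    """Minimal random split sketch: shuffle indices with a
    fixed seed (same seed -> same split), then carve off the
    last `test_size` fraction as the test set."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [data[i] for i in train_idx], [data[i] for i in test_idx]

data = list(range(100))
train, test = train_test_split(data, test_size=0.2)
print(len(train), len(test))  # 80 20
```

Every sample lands in exactly one subset, and rerunning with the same seed reproduces the identical split.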

Split Methods
  • Random Split:
    Randomly shuffle and divide data
    ✓ Simple and unbiased
    ✗ May create imbalanced class distributions
  • Stratified Split:
    Maintains class proportions in both sets
    ✓ Ensures representative splits for classification
    ✓ Recommended for imbalanced datasets
    Requires: Label/target column
  • Time-Based Split:
    Earliest observations for training, most recent for testing (e.g., first 80% train, last 20% test)
    ✓ Respects temporal ordering
    ✓ Prevents data leakage in time series
    Use for: Time series, sequential data
  • Group-Based Split:
    Keep groups together (don't split within groups)
    ✓ Prevents data leakage from related samples
    Use for: Patient studies, user data, hierarchical data
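Of the methods above, the stratified split is the least obvious to implement, so here is a stdlib-only sketch (the `stratified_split` name is illustrative, not an API of any particular library): shuffle and split each class separately, so the class proportions carry over to both subsets.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Stratified split sketch: group indices by class, then
    split each class independently so both subsets keep
    roughly the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

labels = ["A"] * 90 + ["B"] * 10      # imbalanced: 90% A, 10% B
train_idx, test_idx = stratified_split(labels, test_size=0.2)
# the 20-sample test set keeps the 90/10 ratio: 18 A and 2 B
```

A purely random split on the same data could easily place 0 or 4 "B" samples in the test set; stratification removes that variance.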
Best Practices
  • Test Size: Typically 20-30% of dataset
    • Smaller datasets: 20-25% test
    • Larger datasets: Can use smaller % (e.g., 10%)
  • Random Seed: Set for reproducibility
  • Stratification: Use for classification with imbalanced classes
  • Time Series: Always use time-based split, never random
  • Validation Set: Consider train-validation-test (e.g., 60-20-20)
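The three-way train-validation-test split mentioned above can be done with one shuffle and two cuts. A small sketch, assuming a 60-20-20 ratio; `three_way_split` is an illustrative helper name:

```python
import random

def three_way_split(n, val_size=0.2, test_size=0.2, seed=7):
    """Shuffle indices once, carve off the test and validation
    portions, and keep the remainder (60% here) for training."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_size)
    n_val = int(n * val_size)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(100)
print(len(train), len(val), len(test))  # 60 20 20
```

Tune hyperparameters against the validation set and touch the test set only once, for the final performance estimate.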
Common Issues
  • Data Leakage:
    Information from test set influences training
    Solutions: Time-based split for temporal data, group split for related samples
  • Class Imbalance:
    Random split may create very imbalanced test set
    Solution: Use stratified split
  • Small Datasets:
    Too little data for reliable test set
    Solution: Use cross-validation instead
  • Temporal Ordering:
    Random split breaks time dependencies
    Solution: Use time-based split
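For the small-dataset issue above, k-fold cross-validation uses every sample for testing exactly once. A stdlib-only sketch of contiguous (unshuffled) folds; `kfold_indices` is an illustrative name, not a specific library's API:

```python
def kfold_indices(n, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds;
    each sample appears in exactly one test fold, and fold
    sizes differ by at most one when k does not divide n."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(20, k=5))
# 5 folds, each with 16 training and 4 test samples
```

Averaging the metric across the k folds gives a more stable estimate than a single small test set.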
Use Cases
  • Model Evaluation: Assess performance on unseen data
  • Hyperparameter Tuning: Select settings using a validation set (or cross-validation); reserve the test set for final evaluation
  • Feature Selection: Select features using training data only
  • Comparison: Compare different models fairly
  • Deployment Planning: Estimate real-world performance
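The "training data only" rule for feature work above also applies to preprocessing: statistics such as the mean and standard deviation used for scaling must come from the training set alone, or test information leaks into training. A minimal sketch (the `fit_standardizer` helper is hypothetical):

```python
def fit_standardizer(train):
    """Compute mean/std from the TRAINING set only, and return
    a function that applies those fixed statistics to any data.
    Fitting on the full dataset would leak test information."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    std = var ** 0.5 or 1.0          # guard against zero variance
    return lambda xs: [(x - mean) / std for x in xs]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0]
scale = fit_standardizer(train)      # statistics from train only
scaled_test = scale(test)            # test transformed with train stats
```

The same pattern applies to imputation, encoding, and feature selection: fit on train, then apply the frozen transform to test.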