Visualize splitting strategies, stratified splits, and class distribution
Dataset Configuration
Class Configuration (for Stratified)
Same seed produces same split
Split Results
Split Visualization
Class Distribution Comparison
Split Details
Help
What is Train-Test Split?
Train-test split divides a dataset into two subsets:
Training Set: Used to train/fit the model (typically 70-80%)
Test Set: Used to evaluate model performance (typically 20-30%)
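Evaluating on unseen data does not prevent overfitting, but it exposes it: a model that scores well on the training set and poorly on the test set has memorized rather than generalized.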
Split Methods
Random Split: Randomly shuffle and divide data
✓ Simple and unbiased
✗ Class proportions in train and test may drift from the full dataset, especially with small or imbalanced data
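A random split can be sketched with the standard library alone. This is a minimal illustration, not the tool's own implementation; the function name random_split and its parameters are hypothetical.

```python
import random

def random_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and cut it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG: same seed, same order
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = random_split(range(100), test_fraction=0.2, seed=42)
print(len(train), len(test))  # 80 20
```

Calling the function twice with the same seed returns the same two lists, which is what makes a split reproducible.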
Stratified Split: Maintains class proportions in both sets
✓ Ensures representative splits for classification
✓ Recommended for imbalanced datasets
Requires: Label/target column
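One way to stratify is to split each class separately and merge the results. The sketch below assumes labels are sortable values such as strings; stratified_split is a hypothetical name, and rounding per class means a very small class can end up with no test samples.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)          # group row indices by class
    train_idx, test_idx = [], []
    for label in sorted(by_class):         # fixed class order keeps runs repeatable
        idx = by_class[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

labels = ["neg"] * 90 + ["pos"] * 10       # 90/10 class imbalance
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_counts = Counter(labels[i] for i in test_idx)
train_counts = Counter(labels[i] for i in train_idx)
print(test_counts["neg"], test_counts["pos"])    # 18 2
print(train_counts["neg"], train_counts["pos"])  # 72 8
```

Both sets keep the original 9:1 ratio, which a plain random split does not guarantee.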
Time-Based Split: Earliest N% of records for training, most recent M% for testing
✓ Respects temporal ordering
✓ Prevents data leakage in time series
Use for: Time series, sequential data
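A time-based split sorts by timestamp and cuts once, with no shuffling. The sketch below assumes each record is a tuple whose first element is a date; time_based_split and the sample records are hypothetical.

```python
from datetime import date, timedelta

def time_based_split(records, test_fraction=0.2, key=lambda r: r[0]):
    """Sort by time, then keep the earliest rows for training."""
    ordered = sorted(records, key=key)     # never shuffle sequential data
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]

start = date(2024, 1, 1)
records = [(start + timedelta(days=d), d * 1.5) for d in range(10)]
# Pass the records in reverse to show that input order does not matter.
train, test = time_based_split(records[::-1], test_fraction=0.3)
print(len(train), len(test))               # 7 3
print(train[-1][0] < test[0][0])           # True
```

Every training record precedes every test record, so the model never sees information from the period it is evaluated on.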
Group-Based Split: Keep groups together (don't split within groups)
✓ Prevents data leakage from related samples
Use for: Patient studies, user data, hierarchical data
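A group-based split assigns whole groups, not individual rows, to one side. In this sketch the name group_split and the patient identifiers are hypothetical, and the test fraction applies to the number of groups, so the row-level ratio can deviate when group sizes differ.

```python
import random

def group_split(groups, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) such that no group spans both sets."""
    unique = sorted(set(groups))           # sort first so the shuffle is repeatable
    random.Random(seed).shuffle(unique)
    n_test_groups = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

groups = [f"patient_{i // 3}" for i in range(30)]  # 10 patients, 3 samples each
train_idx, test_idx = group_split(groups, test_fraction=0.2)
train_groups = {groups[i] for i in train_idx}
held_out_groups = {groups[i] for i in test_idx}
print(len(train_idx), len(test_idx))               # 24 6
print(train_groups & held_out_groups)              # set()
```

The empty intersection confirms that no patient contributes samples to both sets.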
Best Practices
Test Size: Typically 20-30% of dataset
Smaller datasets: 20-25% test
Larger datasets: Can use smaller % (e.g., 10%)
Random Seed: Set for reproducibility
Stratification: Use for classification with imbalanced classes
Time Series: Use a time-based split, not a random one; a random split lets the model train on data from after the test period