Overfitting: What It Is and How to Prevent It
Key takeaways
- Overfitting occurs when a model fits its training data too closely and fails to generalize to new data.
- Overfit models show low bias but high variance: they perform well on training data and poorly on unseen data.
- Common prevention strategies include cross-validation, ensembling, simplifying the model, and augmenting or expanding the dataset.
- The opposite problem—underfitting—occurs when a model is too simple and cannot capture underlying patterns.
What is overfitting?
Overfitting is a modeling error that arises when a function or model is tailored too closely to a limited dataset. The model captures noise and idiosyncrasies in the training data rather than the true underlying pattern. As a result, its predictive power on new, unseen data is reduced or lost.
Overfitting often appears when models become unnecessarily complex relative to the amount or quality of available data. Real-world data contain measurement errors and random variation; forcing a model to conform tightly to those imperfections leads to misleadingly strong performance on the training set but poor generalization.
Why overfitting happens
- Excessive model complexity (too many parameters or unnecessary features).
- Limited or unrepresentative training data.
- Training on noisy data without accounting for variability.
- Feature redundancy or overlapping information that confuses the model.
Overfitting vs. underfitting
- Overfitting: low bias and high variance — the model is too flexible and learns noise.
- Underfitting: high bias and low variance — the model is too simple and misses important structure.
Balancing bias and variance is central to building an effective predictive model.
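The bias–variance trade-off above can be made concrete with a small sketch (a minimal illustration, not a full workflow): fitting polynomials of increasing degree to noisy quadratic data and comparing training error with error on held-out data. The specific degrees, noise level, and random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a quadratic ground truth
x = rng.uniform(-1, 1, 40)
y = x**2 + rng.normal(0, 0.1, 40)

# Hold out half the data for validation
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

def fit_and_errors(degree):
    """Fit a polynomial of the given degree and return (train MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    val_err = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    return train_err, val_err

# Degree 1 underfits (high bias), degree 15 overfits (high variance),
# degree 2 matches the true structure
for degree in (1, 2, 15):
    train_err, val_err = fit_and_errors(degree)
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, val MSE {val_err:.4f}")
```

Because higher-degree polynomials nest lower-degree ones, training error can only shrink as degree grows; it is the validation error that reveals when added flexibility stops helping.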
How to detect overfitting
- Very high accuracy on training data but significantly worse performance on validation or test data.
- Large differences between training error and validation/test error.
- Model complexity that seems disproportionate to the size of the dataset.
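The first two symptoms reduce to one number: the gap between validation and training error. A tiny helper makes the check explicit (the `tol` threshold is an illustrative assumption, not a standard value, and should be chosen per problem):

```python
def overfit_gap(train_err, val_err, tol=0.1):
    """Return the generalization gap and whether it exceeds `tol`.

    `tol` is an illustrative threshold; a suitable value depends on
    the error metric and the problem at hand.
    """
    gap = val_err - train_err
    return gap, gap > tol

gap, flagged = overfit_gap(train_err=0.02, val_err=0.35)
print(round(gap, 2), flagged)  # a large gap flags likely overfitting
```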
How to prevent or reduce overfitting
Practical strategies include:
- Cross-validation: split the data into folds and evaluate model performance across them to get a reliable estimate of generalization error.
- Ensembling: combine predictions from multiple independent models to reduce variance.
- Data augmentation and expansion: increase the diversity and size of the training set so the model learns broader patterns.
- Model simplification and feature selection: remove irrelevant or redundant features and prefer simpler models when appropriate.
- Regularization and early stopping: penalize large parameter values and halt training before the model memorizes noise; both limit complexity and help generalization.
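The first of these strategies, k-fold cross-validation, can be sketched in a few lines (a minimal numpy version, assuming a polynomial model for concreteness; the data and degree are illustrative):

```python
import numpy as np

def k_fold_cv_mse(x, y, degree, k=5):
    """Estimate generalization MSE of a polynomial fit via k-fold cross-validation.

    Each fold serves once as the validation set while the remaining
    folds are used for training; the k validation errors are averaged.
    """
    folds = np.array_split(np.arange(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[val])
        errors.append(np.mean((pred - y[val]) ** 2))
    return float(np.mean(errors))

# Illustrative data: a noisy linear relationship
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = 2 * x + rng.normal(0, 0.2, 60)
print(k_fold_cv_mse(x, y, degree=1))
```

Because every point is validated exactly once, the averaged error is a far more reliable estimate of out-of-sample performance than a single train/test split.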
Example
A university builds a model to predict which applicants will graduate. Training on 5,000 applicants, the model achieves 98% accuracy on that dataset. When applied to a different group of 5,000 applicants, accuracy drops to 50%. The model was overfit to the peculiarities of the first dataset and did not generalize.
Practical advice
- Always evaluate models on data that were not used for training.
- Monitor training vs. validation performance to spot divergence.
- Prefer simpler models when they perform similarly to more complex ones.
- Collect more and higher-quality data whenever feasible.
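Monitoring training versus validation performance pairs naturally with early stopping. A minimal sketch of the stopping rule (the `patience` parameter and the validation-error history are illustrative assumptions):

```python
def early_stopping(val_errors, patience=3):
    """Return the epoch with the best validation error, stopping the scan
    once error has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # divergence: validation error is no longer improving
    return best_epoch

# Validation error improves, then rises as the model starts to overfit
history = [0.9, 0.6, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping(history))  # → 3
```

Stopping at the epoch where validation error bottoms out keeps the model parameters from drifting toward a noise-memorizing fit.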
Conclusion
Overfitting undermines a model’s usefulness as a predictive tool. Awareness of overfitting, careful validation, appropriate model complexity, and techniques such as cross-validation and ensembling help create models that generalize well to new data.