Overfitting: What It Is and How to Prevent It
Key takeaways
- Overfitting occurs when a model fits its training data too closely and fails to generalize to new data.
- Overfit models show low bias but high variance: they perform well on training data and poorly on unseen data.
- Common prevention strategies include cross-validation, ensembling, simplifying the model, and augmenting or expanding the dataset.
- The opposite problem—underfitting—occurs when a model is too simple and cannot capture underlying patterns.
What is overfitting?
Overfitting is a modeling error that arises when a function or model is tailored too closely to a limited dataset. The model captures noise and idiosyncrasies in the training data rather than the true underlying pattern. As a result, its predictive power on new, unseen data is reduced or lost.
Overfitting often appears when models become unnecessarily complex relative to the amount or quality of available data. Real-world data contain measurement errors and random variation; forcing a model to conform tightly to those imperfections leads to misleadingly strong performance on the training set but poor generalization.
Why overfitting happens
- Excessive model complexity (too many parameters or unnecessary features).
- Limited or unrepresentative training data.
- Training on noisy data without accounting for variability.
- Feature redundancy or overlapping information that confuses the model.
Overfitting vs. underfitting
- Overfitting: low bias and high variance — the model is too flexible and learns noise.
- Underfitting: high bias and low variance — the model is too simple and misses important structure.
Balancing bias and variance is central to building an effective predictive model.
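The bias–variance trade-off above can be made concrete with a small sketch (a minimal illustration, not a full workflow): fitting polynomials of increasing degree to noisy quadratic data and comparing training error with error on held-out data. The specific degrees, noise level, and random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a quadratic ground truth
x = rng.uniform(-1, 1, 40)
y = x**2 + rng.normal(0, 0.1, 40)

# Hold out half the data for validation
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

def fit_and_errors(degree):
    """Fit a polynomial of the given degree and return (train MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    val_err = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    return train_err, val_err

# Degree 1 underfits (high bias), degree 15 overfits (high variance),
# degree 2 matches the true structure
for degree in (1, 2, 15):
    train_err, val_err = fit_and_errors(degree)
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, val MSE {val_err:.4f}")
```

Because higher-degree polynomials nest lower-degree ones, training error can only shrink as degree grows; it is the validation error that reveals when added flexibility stops helping.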
How to detect overfitting
- Very high accuracy on training data but significantly worse performance on validation or test data.
- Large differences between training error and validation/test error.
- Model complexity that seems disproportionate to the size of the dataset.
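The first two symptoms reduce to one number: the gap between validation and training error. A tiny helper makes the check explicit (the `tol` threshold is an illustrative assumption, not a standard value, and should be chosen per problem):

```python
def overfit_gap(train_err, val_err, tol=0.1):
    """Return the generalization gap and whether it exceeds `tol`.

    `tol` is an illustrative threshold; a suitable value depends on
    the error metric and the problem at hand.
    """
    gap = val_err - train_err
    return gap, gap > tol

gap, flagged = overfit_gap(train_err=0.02, val_err=0.35)
print(round(gap, 2), flagged)  # a large gap flags likely overfitting
```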
How to prevent or reduce overfitting
Practical strategies include:
- Cross-validation: split the data into folds and evaluate model performance across them to get a reliable estimate of generalization error.
- Ensembling: combine predictions from multiple independent models to reduce variance.
- Data augmentation and expansion: increase the diversity and size of the training set so the model learns broader patterns.
- Model simplification and feature selection: remove irrelevant or redundant features and prefer simpler models when appropriate.
- Regularization and early stopping: penalize large parameter values and halt training before the model memorizes noise; both limit complexity and help generalization.
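The first of these strategies, k-fold cross-validation, can be sketched in a few lines (a minimal numpy version, assuming a polynomial model for concreteness; the data and degree are illustrative):

```python
import numpy as np

def k_fold_cv_mse(x, y, degree, k=5):
    """Estimate generalization MSE of a polynomial fit via k-fold cross-validation.

    Each fold serves once as the validation set while the remaining
    folds are used for training; the k validation errors are averaged.
    """
    folds = np.array_split(np.arange(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[val])
        errors.append(np.mean((pred - y[val]) ** 2))
    return float(np.mean(errors))

# Illustrative data: a noisy linear relationship
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = 2 * x + rng.normal(0, 0.2, 60)
print(k_fold_cv_mse(x, y, degree=1))
```

Because every point is validated exactly once, the averaged error is a far more reliable estimate of out-of-sample performance than a single train/test split.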
Example
A university builds a model to predict which applicants will graduate. Training on 5,000 applicants, the model achieves 98% accuracy on that dataset. When applied to a different group of 5,000 applicants, accuracy drops to 50%. The model was overfit to the peculiarities of the first dataset and did not generalize.
Practical advice
- Always evaluate models on data that were not used for training.
- Monitor training vs. validation performance to spot divergence.
- Prefer simpler models when they perform similarly to more complex ones.
- Collect more and higher-quality data whenever feasible.
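Monitoring training versus validation performance pairs naturally with early stopping. A minimal sketch of the stopping rule (the `patience` parameter and the validation-error history are illustrative assumptions):

```python
def early_stopping(val_errors, patience=3):
    """Return the epoch with the best validation error, stopping the scan
    once error has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # divergence: validation error is no longer improving
    return best_epoch

# Validation error improves, then rises as the model starts to overfit
history = [0.9, 0.6, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping(history))  # → 3
```

Stopping at the epoch where validation error bottoms out keeps the model parameters from drifting toward a noise-memorizing fit.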
Conclusion
Overfitting undermines a model’s usefulness as a predictive tool. Awareness of overfitting, careful validation, appropriate model complexity, and techniques such as cross-validation and ensembling help create models that generalize well to new data.