Goodness-of-Fit
Key takeaways
* Goodness-of-fit tests assess how well sample data match an assumed probability distribution or expected frequencies.
* Common tests: chi-square (categorical data), Kolmogorov–Smirnov (continuous distributions), Anderson–Darling (sensitive to tails), Shapiro–Wilk (normality for small samples).
* Choose a test based on the data type, sample size, and whether tail behavior matters. Ensure expected counts are sufficient (rule of thumb: ≥5 per group for chi-square).
What is goodness-of-fit?
Goodness-of-fit is a family of statistical procedures that compare observed data with what a chosen model, distribution, or set of expected frequencies predicts. These tests help determine whether a sample is likely to have come from a particular population distribution or whether observed counts depart meaningfully from expectations.
How goodness-of-fit tests work
All goodness-of-fit tests share a basic hypothesis-testing framework:
* Null hypothesis (H0): the data follow the specified distribution or match expected frequencies.
* Alternative hypothesis (H1): the data do not follow the specified distribution or differ from expectations.
* Choose a significance level (alpha, commonly 0.05) and compute a test statistic. Compare it to a critical value or use a p-value to decide whether to reject H0.
Required inputs typically include:
* Observed values (from the sample).
* Expected values (from the hypothesized distribution or prior assumptions).
* Sample size and, for some tests, degrees of freedom.
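A minimal sketch of this workflow in Python with SciPy (the simulated sample, the seed, and the choice of a K–S test here are illustrative assumptions, not requirements):

```python
import numpy as np
from scipy import stats

# Simulated "observed values"; in practice these come from your sample.
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# H0: the sample follows a standard normal distribution (fully specified).
stat, p_value = stats.kstest(sample, "norm")

alpha = 0.05  # chosen significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"D = {stat:.3f}, p = {p_value:.3f} -> {decision}")
```

The same decide-by-p-value pattern applies regardless of which test supplies the statistic.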
Common tests and when to use them
Chi-square goodness-of-fit
* Use for categorical data divided into mutually exclusive classes (bins).
* Compares observed and expected counts across categories.
* Requires a sufficiently large sample and typically at least ~5 expected observations per group.
* Does not indicate direction or strength of an association—only whether observed frequencies differ from expected.
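A quick sketch with SciPy's chisquare (the counts and hypothesized proportions below are invented for illustration):

```python
from scipy.stats import chisquare

# Hypothetical: 120 observations across four mutually exclusive categories,
# with H0 proportions of 40/30/20/10 percent.
observed = [52, 31, 25, 12]
p0 = [0.40, 0.30, 0.20, 0.10]
expected = [sum(observed) * p for p in p0]  # [48.0, 36.0, 24.0, 12.0]

# Rule of thumb: every expected count should be at least ~5.
assert all(e >= 5 for e in expected)

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.3f}, df = {len(observed) - 1}, p = {p:.3f}")
```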
Kolmogorov–Smirnov (K–S) test
* Nonparametric test for continuous distributions.
* Compares the empirical distribution function of the sample to the cumulative distribution of the hypothesized distribution.
* Better suited to larger samples; its D statistic is the maximum distance between the empirical and hypothesized distribution functions. Most sensitive to deviations near the center of the distribution and less so in the tails.
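To make the D statistic concrete, here is a sketch that computes it directly from the empirical distribution function and checks it against SciPy (the exponential sample and the deliberately poor N(0, 1) hypothesis are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.sort(rng.exponential(scale=1.0, size=300))
n = len(x)

# Empirical CDF just after and just before each ordered observation.
ecdf_hi = np.arange(1, n + 1) / n
ecdf_lo = np.arange(0, n) / n
cdf = stats.norm.cdf(x)  # hypothesized N(0, 1) CDF (a poor fit here)

# D = sup |ECDF - CDF|, taking both sides of each step.
D = max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))
print(f"manual D = {D:.4f}")

res = stats.kstest(x, "norm")
print(f"scipy  D = {res.statistic:.4f}, p = {res.pvalue:.3g}")
```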
Anderson–Darling (A–D) test
* Variant of the K–S approach with greater sensitivity to differences in the distribution tails.
* Useful when tail behavior matters (e.g., finance, risk analysis).
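A short sketch with SciPy's anderson, which reports critical values at fixed significance levels rather than a p-value (the heavy-tailed t sample is an invented example where tail behavior drives the result):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.standard_t(df=3, size=500)  # heavy tails relative to a normal

res = stats.anderson(x, dist="norm")
print(f"A-D statistic: {res.statistic:.3f}")
for sl, cv in zip(res.significance_level, res.critical_values):
    verdict = "reject" if res.statistic > cv else "fail to reject"
    print(f"  alpha = {sl / 100:.3f}: critical value {cv:.3f} -> {verdict} H0")
```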
Shapiro–Wilk test
* Designed specifically to test normality for a single continuous variable.
* Recommended for small to moderate sample sizes (commonly cited for n ≤ 2000).
* Often accompanied by a Q–Q plot to visualize deviations from normality.
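A sketch pairing shapiro with the Q–Q plot mentioned above (the skewed lognormal sample is invented, and matplotlib is assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=0.5, size=80)  # skewed, so not normal

stat, p = stats.shapiro(x)
print(f"W = {stat:.4f}, p = {p:.4f}")  # a small p suggests non-normality

# Companion Q-Q plot: points should hug the line if the data are normal.
fig, ax = plt.subplots()
stats.probplot(x, dist="norm", plot=ax)
plt.show()
```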
Other goodness-of-fit measures
* Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC): compare models by balancing fit and complexity (used for model selection rather than a formal hypothesis test).
* Cramér–von Mises (CVM): assesses fit via integrated squared differences between empirical and theoretical cumulative distributions.
* Hosmer–Lemeshow test: compares observed vs. expected frequencies for binary outcomes across grouped predicted-probability intervals.
* Kuiper’s test: similar to K–S but equally sensitive in tails and center.
* Moran’s I: evaluates spatial autocorrelation (useful when spatial dependence matters).
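As an illustration of the model-selection measures above, a sketch comparing a normal and a Student's t fit by AIC on simulated data (BIC would replace 2k with k·ln(n)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.standard_t(df=4, size=400)  # heavy-tailed data

def aic(log_likelihood: float, k: int) -> float:
    """AIC = 2k - 2 ln L; lower is better."""
    return 2 * k - 2 * log_likelihood

mu, sigma = stats.norm.fit(data)     # 2 fitted parameters
df_, loc, scale = stats.t.fit(data)  # 3 fitted parameters

ll_norm = stats.norm.logpdf(data, mu, sigma).sum()
ll_t = stats.t.logpdf(data, df_, loc, scale).sum()

print(f"AIC normal: {aic(ll_norm, 2):.1f}")
print(f"AIC t:      {aic(ll_t, 3):.1f}")  # typically lower for these data
```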
Practical tips
* Match the test to data type: categorical → chi-square; continuous/distributional tests → K–S, A–D, Shapiro–Wilk.
* Check assumptions: independence of observations, adequate expected counts (for chi-square), and appropriate sample size.
* When in doubt about tail differences, prefer A–D or tests designed for tail sensitivity.
* Use software packages to compute test statistics and p-values; many tests have tabulated critical values for common alpha levels.
Why goodness-of-fit matters
* Model validation: confirms whether a chosen model or distribution is consistent with observed data.
* Model selection: helps choose among competing models or indicate need for model refinement.
* Outlier and anomaly detection: significant misfit can reveal outliers, measurement errors, or structural model problems.
* Informed decisions: reliable inferences and predictions require that model assumptions hold; goodness-of-fit testing is a key check.
Goodness-of-fit vs. independence test
* Goodness-of-fit tests evaluate whether observed data match a specified distribution or expected frequencies.
* Independence tests (often implemented via a chi-square test of independence) assess whether two categorical variables are associated or independent.
* Use a goodness-of-fit test when assessing one distribution; use an independence test when assessing a relationship between two categorical variables.
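The contrast in SciPy terms (all counts invented): chisquare takes one observed series plus expected counts, while chi2_contingency takes a two-way table and derives its expected counts under independence.

```python
from scipy.stats import chisquare, chi2_contingency

# Goodness-of-fit: one categorical variable vs. hypothesized counts.
stat_gof, p_gof = chisquare(f_obs=[30, 45, 25], f_exp=[33, 34, 33])

# Independence: two categorical variables cross-tabulated
# (e.g., rows = group, columns = preference).
table = [[20, 30, 10],
         [25, 20, 15]]
stat_ind, p_ind, dof, expected = chi2_contingency(table)

print(f"goodness-of-fit p = {p_gof:.3f}")
print(f"independence    p = {p_ind:.3f} (df = {dof})")
```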
Example (illustrative)
A gym owner assumes attendance patterns: high on Mondays/Tuesdays/Saturdays, average on Wednesdays/Thursdays, and low on Fridays/Sundays. After collecting six weeks of daily attendance, the owner uses a chi-square goodness-of-fit test to compare observed counts with the expected pattern. If the test rejects the null hypothesis, the owner can revise staffing levels and schedules based on the observed distribution.
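A sketch of that calculation (all attendance figures and hypothesized shares are invented to match the scenario):

```python
from scipy.stats import chisquare

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Assumed pattern: high Mon/Tue/Sat, average Wed/Thu, low Fri/Sun,
# expressed as shares of 2,000 total visits over six weeks.
shares = [0.17, 0.17, 0.14, 0.14, 0.105, 0.17, 0.105]
expected = [2000 * s for s in shares]

observed = [300, 320, 310, 295, 260, 330, 185]  # hypothetical daily totals

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, df = {len(days) - 1}, p = {p:.4f}")
# A small p-value would signal that the assumed pattern does not hold,
# prompting the owner to revisit staffing and schedules.
```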
Bottom line
Goodness-of-fit testing is a core statistical tool for verifying whether data conform to assumed distributions or expected frequencies. Selecting the appropriate test depends on data type, sample size, and whether sensitivity to tail behavior or model complexity is important. Proper use of these tests strengthens model validation, improves inference, and guides data-driven decisions.