https://github.com/rhyslwells/ML_Tools/blob/main/Explorations/Preprocess/Distribution_Analysis.py
The goodness-of-fit results are the p-values from the Kolmogorov-Smirnov (KS) test, which assesses how well the data fits each candidate distribution. Here’s how to interpret these values (a code sketch follows the list):
- Higher p-value → The distribution is a better fit.
- Lower p-value → The distribution is a poor fit (likely not the correct model for the data).
- Threshold: A common significance level is 0.05.
    - If p > 0.05, we do not reject the hypothesis that the data follows this distribution.
    - If p < 0.05, we reject the hypothesis, meaning the data likely does not follow that distribution.
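For a single candidate distribution, this boils down to fitting the distribution's parameters and then calling `scipy.stats.kstest` against the fitted CDF. A minimal sketch (the synthetic data here is illustrative, not the script's actual code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=17, scale=2, size=300)  # illustrative sample

# Fit the candidate distribution to the data, then run the KS test
# against the fitted CDF.
params = stats.norm.fit(data)
statistic, p_value = stats.kstest(data, "norm", args=params)

# p > 0.05: no evidence against the fit; p < 0.05: reject the fit.
print(f"KS statistic = {statistic:.4f}, p-value = {p_value:.4f}")
```

Note that estimating the parameters from the same data makes the KS p-values somewhat optimistic, so they are best read as a relative ranking of candidate distributions rather than exact probabilities.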
Example using penguins.csv column “bill_depth_mm”
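The numbers below could be reproduced roughly as follows. This is a hypothetical sketch, not the script's actual code: it assumes the seaborn-bundled `penguins` dataset matches `penguins.csv` and maps the distribution names to their scipy counterparts.

```python
import seaborn as sns
from scipy import stats

data = sns.load_dataset("penguins")["bill_depth_mm"].dropna().to_numpy()

# Candidate distributions, assumed to mirror those tested in the script.
candidates = {
    "Gaussian": stats.norm,
    "T": stats.t,
    "Chi-squared": stats.chi2,
    "Exponential": stats.expon,
    "Logistic": stats.logistic,
}

print("Goodness-of-fit results for bill_depth_mm")
for name, dist in candidates.items():
    params = dist.fit(data)  # estimate shape, loc, and scale from the data
    _, p_value = stats.kstest(data, dist.cdf, args=params)
    print(f"{name}: {p_value}")
```

Exact values depend on the CSV used and on the fitting details, so they may differ slightly from the output below.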
Goodness-of-fit results for bill_depth_mm:
- Gaussian: 0.026308596409291618
- T: 0.025906678848475195
- Chi-squared: 1.4504381882536289e-15
- Exponential: 2.8020502445188308e-14
- Logistic: 0.05019989765502264
- Gaussian (0.0263) → Poor fit (p < 0.05). The data likely does not follow a normal distribution.
- T (0.0259) → Poor fit (p < 0.05). The data does not fit a t-distribution well.
- Chi-squared (1.45e-15) → Very poor fit (extremely low p-value). The data is highly unlikely to follow a chi-squared distribution.
- Exponential (2.80e-14) → Very poor fit (extremely low p-value). The data is not exponentially distributed.
- Logistic (0.0502) → Borderline fit (p just above 0.05). The data could plausibly follow a logistic distribution.
The Logistic distribution has the highest p-value (0.0502), making it the best candidate among the tested distributions. However, since it’s borderline (≈0.05), you may want to visualize the distribution and compare the fits.
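One way to compare the fits visually (a hypothetical sketch, reusing the `data` and `candidates` variables from the sketch above) is to overlay each fitted PDF on a histogram of the observed values:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(data.min(), data.max(), 200)

plt.hist(data, bins=30, density=True, alpha=0.4, label="bill_depth_mm")
for name, dist in candidates.items():
    params = dist.fit(data)
    plt.plot(x, dist.pdf(x, *params), label=name)

plt.xlabel("bill_depth_mm")
plt.ylabel("Density")
plt.legend()
plt.title("Candidate distribution fits for bill_depth_mm")
plt.show()
```

A quantile-quantile plot against the fitted logistic distribution is another quick check on whether the borderline KS result holds up.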