https://github.com/rhyslwells/ML_Tools/blob/main/Explorations/Preprocess/Distribution_Analysis.py
The goodness-of-fit results are the p-values from the Kolmogorov-Smirnov (KS) test, which assesses how well the data fits each candidate distribution. Here’s how to interpret these values (a code sketch follows the list):
- Higher p-value → The distribution is a better fit.
- Lower p-value → The distribution is a poor fit (likely not the correct model for the data).
- Threshold: A common significance level is 0.05.
    - If p > 0.05, we do not reject the hypothesis that the data follows this distribution.
    - If p < 0.05, we reject the hypothesis, meaning the data likely does not follow that distribution.
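For a single candidate distribution, this boils down to fitting the distribution's parameters and then calling `scipy.stats.kstest` against the fitted CDF. A minimal sketch (the synthetic data here is illustrative, not the script's actual code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=17, scale=2, size=300)  # illustrative sample

# Fit the candidate distribution to the data, then run the KS test
# against the fitted CDF.
params = stats.norm.fit(data)
statistic, p_value = stats.kstest(data, "norm", args=params)

# p > 0.05: no evidence against the fit; p < 0.05: reject the fit.
print(f"KS statistic = {statistic:.4f}, p-value = {p_value:.4f}")
```

Note that estimating the parameters from the same data makes the KS p-values somewhat optimistic, so they are best read as a relative ranking of candidate distributions rather than exact probabilities.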
Example using penguins.csv column “bill_depth_mm”
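The numbers below could be reproduced roughly as follows. This is a hypothetical sketch, not the script's actual code: it assumes the seaborn-bundled `penguins` dataset matches `penguins.csv` and maps the distribution names to their scipy counterparts.

```python
import seaborn as sns
from scipy import stats

data = sns.load_dataset("penguins")["bill_depth_mm"].dropna().to_numpy()

# Candidate distributions, assumed to mirror those tested in the script.
candidates = {
    "Gaussian": stats.norm,
    "T": stats.t,
    "Chi-squared": stats.chi2,
    "Exponential": stats.expon,
    "Logistic": stats.logistic,
}

print("Goodness-of-fit results for bill_depth_mm")
for name, dist in candidates.items():
    params = dist.fit(data)  # estimate shape, loc, and scale from the data
    _, p_value = stats.kstest(data, dist.cdf, args=params)
    print(f"{name}: {p_value}")
```

Exact values depend on the CSV used and on the fitting details, so they may differ slightly from the output below.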
Goodness-of-fit results for bill_depth_mm:
- Gaussian: 0.026308596409291618
- T: 0.025906678848475195
- Chi-squared: 1.4504381882536289e-15
- Exponential: 2.8020502445188308e-14
- Logistic: 0.05019989765502264
- Gaussian (0.0263) → Poor fit (p < 0.05). The data likely does not follow a normal distribution.
- T (0.0259) → Poor fit (p < 0.05). The data does not fit a t-distribution well.
- Chi-squared (1.45e-15) → Very poor fit (extremely low p-value). The data is highly unlikely to follow a chi-squared distribution.
- Exponential (2.80e-14) → Very poor fit (extremely low p-value). The data is not exponentially distributed.
- Logistic (0.0502) → Borderline fit (p just above 0.05). The data could plausibly follow a logistic distribution.
The Logistic distribution has the highest p-value (0.0502), making it the best candidate among the tested distributions. However, since it’s borderline (≈0.05), you may want to visualize the distribution and compare the fits.
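One way to compare the fits visually (a hypothetical sketch, reusing the `data` and `candidates` variables from the sketch above) is to overlay each fitted PDF on a histogram of the observed values:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(data.min(), data.max(), 200)

plt.hist(data, bins=30, density=True, alpha=0.4, label="bill_depth_mm")
for name, dist in candidates.items():
    params = dist.fit(data)
    plt.plot(x, dist.pdf(x, *params), label=name)

plt.xlabel("bill_depth_mm")
plt.ylabel("Density")
plt.legend()
plt.title("Candidate distribution fits for bill_depth_mm")
plt.show()
```

A quantile-quantile plot against the fitted logistic distribution is another quick check on whether the borderline KS result holds up.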