https://github.com/rhyslwells/ML_Tools/blob/main/Explorations/Preprocess/Distribution_Analysis.py

The goodness-of-fit results are the p-values from the Kolmogorov-Smirnov (KS) test, which assesses how well the data fits each candidate distribution. Here’s how to interpret these values (a code sketch for computing them follows the list):

  1. Higher p-value → The distribution is a better fit.
  2. Lower p-value → The distribution is a poor fit (likely not the correct model for the data).
  3. Threshold: A common significance level is 0.05.
    • If p > 0.05, we do not reject the hypothesis that the data follows this distribution.
    • If p < 0.05, we reject the hypothesis, meaning the data likely does not follow that distribution.
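
A minimal sketch of how such p-values can be computed with scipy.stats.kstest, assuming each candidate distribution's parameters are first estimated from the data; the linked script may differ in its details, and the SciPy distribution names below are my mapping of the labels used in the example output:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Candidate distributions, mapped to SciPy's names
CANDIDATES = {
    "Gaussian": "norm",
    "T": "t",
    "Chi-squared": "chi2",
    "Exponential": "expon",
    "Logistic": "logistic",
}

def ks_goodness_of_fit(data):
    """Fit each candidate distribution and return its KS-test p-value."""
    data = np.asarray(data, dtype=float)
    data = data[~np.isnan(data)]
    results = {}
    for label, name in CANDIDATES.items():
        dist = getattr(stats, name)
        params = dist.fit(data)  # estimate shape/loc/scale from the data
        # Note: estimating parameters from the same sample makes these
        # KS p-values approximate rather than exact.
        _, p_value = stats.kstest(data, name, args=params)
        results[label] = p_value
    return results

# Usage with the penguins column discussed below
bill_depth = pd.read_csv("penguins.csv")["bill_depth_mm"]
for label, p in ks_goodness_of_fit(bill_depth).items():
    print(f"{label}: {p}")
```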

Example using penguins.csv column “bill_depth_mm”

Goodness-of-fit results for bill_depth_mm:
  Gaussian:     0.026308596409291618
  T:            0.025906678848475195
  Chi-squared:  1.4504381882536289e-15
  Exponential:  2.8020502445188308e-14
  Logistic:     0.05019989765502264

  • Gaussian (0.0263): Poor fit (p < 0.05). The data likely does not follow a normal distribution.
  • T (0.0259): Poor fit (p < 0.05). The data does not fit a t-distribution well.
  • Chi-squared (1.45e-15): Very poor fit (extremely low p-value). The data is highly unlikely to follow a chi-squared distribution.
  • Exponential (2.80e-14): Very poor fit (extremely low p-value). The data is not exponentially distributed.
  • Logistic (0.0502): Acceptable fit (p just above 0.05). The data could plausibly follow a logistic distribution.

The Logistic distribution has the highest p-value (0.0502), making it the best candidate among the tested distributions. However, since it’s borderline (≈0.05), you may want to visualize the distribution and compare the fits.
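
One way to do that visual comparison, as a sketch (assuming matplotlib is installed and penguins.csv is available as in the example above; overlaying the fitted logistic and Gaussian densities on a histogram of the data):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

data = pd.read_csv("penguins.csv")["bill_depth_mm"].dropna().to_numpy()

# Histogram of the observed data, normalised to a density
plt.hist(data, bins=30, density=True, alpha=0.5, label="bill_depth_mm")

# Overlay the fitted logistic and Gaussian PDFs for comparison
x = np.linspace(data.min(), data.max(), 300)
for name in ("logistic", "norm"):
    dist = getattr(stats, name)
    params = dist.fit(data)
    plt.plot(x, dist.pdf(x, *params), label=f"fitted {name}")

plt.legend()
plt.show()
```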