File Path: \ML_Tools\Build\Clustering\KMeans\K_Means.py
Key Concepts Used in the Script
- Data Loading:
  - The script reads data from a CSV file (penguins.csv) and also uses a sample dataset with random features for demonstration purposes.
- Data Preprocessing:
  - Standardization: Features are standardized using sklearn.preprocessing.scale and StandardScaler so that all features contribute equally to the clustering process.
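The two standardization routes named above can be compared in a minimal sketch. The array values below are made up for illustration and are not taken from penguins.csv; the variable names are assumptions, not the script's own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

# Made-up measurements in two columns (illustrative values only).
X = np.array([
    [39.1, 18.7],
    [46.5, 17.9],
    [50.0, 15.2],
    [36.7, 19.3],
])

# Function form: standardizes in one call and keeps no fitted state.
X_scaled_fn = scale(X)

# Estimator form: stores the per-column mean and scale, so the same
# transformation can later be applied to new data with scaler.transform().
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Both routes yield columns with zero mean and unit variance.
print(np.allclose(X_scaled_fn, X_scaled))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

The estimator form is usually the more convenient choice when the same scaling has to be reused on data that was not part of the original fit.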
- Feature Selection:
  - Specific features, such as bill_length_mm and bill_depth_mm, are selected for clustering.
- K-Means Clustering:
  - The core clustering algorithm is applied with n_clusters=3.
  - Outputs include cluster centroids and labels for each data point.
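As a rough sketch of this step, the snippet below fits scikit-learn's KMeans with n_clusters=3 and reads back the centroids and labels. The data is synthetic (three random blobs), and the n_init and random_state values are assumptions chosen for reproducibility, not settings confirmed by the script.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# Fit K-Means with three clusters; fit_predict returns one label per row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One centroid per cluster, one label per data point.
print(kmeans.cluster_centers_.shape)  # (3, 2)
print(labels.shape)                   # (150,)
```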
- Visualization:
  - Scatter plots are used to display the clustering results, highlighting the cluster centroids.
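One possible way to draw such a plot with matplotlib is sketched below. The data is synthetic, and the colors, marker style, axis labels, and output filename are all illustrative assumptions rather than the script's actual choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three 2-D blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_

fig, ax = plt.subplots()
# Data points, colored by their assigned cluster label.
ax.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="viridis", s=20)
# Centroids drawn on top with a larger, distinct marker.
ax.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="X", s=200, label="Centroids")
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.legend()
fig.savefig("kmeans_clusters.png")  # hypothetical output filename
```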
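- Evaluation of Optimal Clusters:
  - Elbow Method: The script fits the model for a range of cluster counts and records the within-cluster sum of squares (WCSS) for each. The optimal number of clusters is taken at the "elbow", the point beyond which adding clusters no longer reduces WCSS substantially.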
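The elbow loop can be sketched as follows, using the inertia_ attribute of a fitted KMeans model as the WCSS. The synthetic data, the range of k values, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three 2-D blobs, so the elbow should appear near k=3.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

k_values = range(1, 8)
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest centroid.
    wcss.append(model.inertia_)

# Print the curve; the elbow is where the decrease flattens out.
for k, value in zip(k_values, wcss):
    print(k, round(value, 1))
```

Plotting wcss against k_values makes the bend easier to spot than reading the numbers directly.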
- Cluster Assignment:
  - Labels are assigned to data points, and the results are visualized to show the clustering outcome.
- Exploratory Analysis:
  - The script examines the impact of different numbers of clusters using an example function (scatter_elbow).