USE: WCSS (within-cluster sum of squares)
WCSS is a measure developed within the ANOVA framework. It gives a very good idea about the different distance between different clusters and within clusters, thus providing us a rule for deciding the appropriate number of clusters.
The plot will resemble an “elbow,” and the goal is to find the point where the decrease in WCSS slows down, forming an elbow-like shape.
Elbow numbers are the point where the rate of decrease in WCSS starts to flatten out
The rationale behind the elbow method is that
Rationale: as you increase the number of clusters (K), the WCSS will generally decrease because each cluster becomes smaller. However, there is a point where the addition of more clusters provides diminishing returns in terms of reducing WCSS. The elbow point represents a good balance between capturing the variance in the data and avoiding excessive fragmentation.
Code
# Use WCSS and elbow method
# number of clusters
wcss=[]
start=2
end=10
# Create all possible cluster solutions with a loop
for i in range(start,end):
# Cluster solution with i clusters
kmeans = KMeans(i)
# Fit the data
kmeans.fit(df_scaled)
# Find WCSS for the current iteration
wcss_iter = kmeans.inertia_
# Append the value to the WCSS list
wcss.append(wcss_iter)
# Create a variable containing the numbers from 1 to 6, so we can use it as X axis of the future plot
number_clusters = range(start,end)
# Plot the number of clusters vs WCSS
plt.plot(number_clusters,wcss)
# Name your graph
plt.title('The Elbow Method')
# Name the x-axis
plt.xlabel('Number of clusters')
# Name the y-axis
plt.ylabel('Within-cluster Sum of Squares')
# Identify the elbow numbers (there may be more than one thats best)
elbow_nums=[4,5,6,7,8]
plotting
# function to give scatter for each elbow number
def scatter_elbow(X, elbow_num, var1, var2):
"""
Apply clustering with elbow method and plot a scatter plot with cluster information.
Parameters:
- X: DataFrame, input data for clustering
- elbow_num: int, number of clusters determined by elbow method
- var1, var2: str, names of the variables for the scatter plot
Returns:
None (plots the scatter plot)
"""
# Apply [[clustering]] with elbow number
kmeans = KMeans(elbow_num)
kmeans.fit(X)
# Add cluster information
identified_clusters = kmeans.fit_predict(X)
X['Cluster'] = identified_clusters
# Plot
plt.scatter(X[var1], X[var2], c=X['Cluster'], cmap='rainbow')
plt.xlabel(var1)
plt.ylabel(var2)
plt.title(f"{elbow_num}-Clustering for {var1}-{var2}")
plt.show()
# Example usage:
# scatter_elbow(data, elbow_num, 'var1', 'var2')
for elbow_num in elbow_nums:
scatter_elbow(df, elbow_num, var1, var2)