You can access a significantly larger sample of the platform's content for free by logging in with your Gmail account. Sign in now to explore.

k-means is a very popular clustering algorithm:

A. How does it work? B. How do you choose k? C. How do you prepare your data for k-means? D. How do you choose starting points? E. Can you think of scenarios where k-means will fail?

Below we discuss the answers to the above questions:

**A.** The k-means works as follows:

First, we initialize the \(k\)centroids (assume vector \(\mu\) stores the centroids)

Then, we assign data points to the centroid that they are closest to (Euclidean distance between the data point and the centroid).

Finally, we update the centroids by recalculating the mean of all the data points assigned to a centroid.

We repeat these steps until convergence

**B.** The K-means algorithm seeks to minimize the sum of squared distances between each data point and its assigned centroid, which is also known as the within-cluster sum of squares (WCSS). The algorithm aims to find the optimal K number of clusters that minimizes the WCSS. The optimal value of K can be determined using techniques such as the **elbow** method, which involves plotting the WCSS against the number of clusters and selecting the value of K where the curve starts to level off.

An alternative approach to select \(K\) is through the **silhouette** score (see here: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html )

In general, we use two metrics to evaluate the quality of our k-means clusters:

**Distortion:**It is calculated as the average of the squared distances from the cluster centers of the respective clusters. Typically, the Euclidean distance metric is used.**Inertia:**It is the sum of squared distances of samples to their closest cluster center.

When choosing k, the goal is to minimize distortion (and/or inertia): https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/

Note that sometimes we might choose a convenient k based on the problem we are solving.

**C.** K-Means uses distance-based measurements (e.g., Euclidean Distance) to calculate how similar each data point is to centroids using values from all the features. These features usually take values in incomparable units (e.g., currency in dollars, weight in kg, temperature in Fahrenheit). In order to produce a fair result, it is recommended to standardize the raw data. We can transform the raw data to have a mean of zero and a standard deviation of one so that all the features have equal weights.

**D.** We can use something like **K-Means++** that allows overcoming getting stuck in a poor local optimum, hence improving the quality of the clustering. The intuition is simple. We will pick up initial centroids that are far away from each other so that it is more likely to pick the points from different clusters. K-Means++ can be implemented in the following steps.

Step 1: We randomly pick a data point as the first centroid.

Step 2: We compute the distance between the remaining points with the nearest centroid.

Step 3: We pick the next centroid such that the probability of picking a given point as the centroid is proportional to the distance between this given point and its nearest chosen centroid. In other words, the farther a point is away from the chosen centroids, the more likely it will be picked as the next centroid.

Repeat steps 2–3 until K centroids are picked.

**E.** In general, k-means will fail when clusters are defined by functions that are not symmetric around a centroid.

There are some nice visualizations in the following links:

k-means

- Analyze prediction error Medium (Prediction error, Bias-variance tradeoff, Diagnostics, Learning curves)
- Imbalanced dataset Medium (Class imbalance)
- Encoding categorical features Easy (Categorical features, Embeddings, One-hot encoding, Hashing)
- Multicollinearity Medium (Multicollinearity, Linear regression)
- Linear regression assumptions Easy (Linear regression)