Introduction

Clustering is a fundamental technique in data science, machine learning, and applied mathematics. At its core, it provides a way to uncover hidden structure and patterns in data by grouping similar items (often represented as vectors) based on how close they are to one another in some feature space.

In this article, I explore the problem of clustering a collection of vectors into meaningful groups, where similarity is typically measured by the distance between pairs of vectors. The focus is on one of the most well-known and widely used clustering methods: the K-Means algorithm. Along the way, I’ll demonstrate typical practical applications and provide Python examples for further illustration.

The Value of Clustering

Clustering is a fundamental and versatile technique used across a wide range of fields—including data science, machine learning, and mathematics—to uncover hidden structure and patterns in data.

In data science, clustering is often used in exploratory data analysis to reveal natural groupings within a dataset. These groupings may not be obvious at first glance, but once identified, can provide critical insights. For example, in a marketing context, clustering can help segment customers based on behaviour or purchasing patterns, making large, unwieldy datasets more manageable—and ultimately more useful. These segments can also improve the performance of downstream machine learning models by focusing them on more coherent subsets of data. Clustering can even help identify noise or anomalies, making it useful for tasks like fraud detection or outlier analysis.

Another important use of clustering is in feature engineering. By assigning each data point to a cluster, we can create new features that encode group membership. Because these group similarities carry meaningful information, the engineered features can improve the performance of predictive models.
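
As a brief illustration, the sketch below (a minimal example, assuming a synthetic dataset generated with scikit-learn's make_blobs; both make_blobs and KMeans are introduced more fully later in this article) appends each point's cluster label to the feature matrix as an extra, engineered feature.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic stand-in for a real dataset.
X, _ = make_blobs(n_samples=200, centers=4, n_features=2, random_state=0)

# Cluster the data and treat the cluster label as a new feature.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
X_with_cluster = np.column_stack([X, kmeans.labels_])

print(X_with_cluster.shape)  # (200, 3): two original features plus the cluster label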

In machine learning, clustering algorithms usually fall under the umbrella of unsupervised learning; that is, models that derive their insights from unlabeled data. Since it doesn’t require labeled data, clustering is particularly useful in domains where labels are expensive or difficult to obtain. It can serve as a preprocessing step (akin to feature engineering) to generate pseudo-labels for semi-supervised learning, or to break a dataset into smaller, more homogeneous distributions that can be modeled independently.

Clustering is also helpful in dimensionality reduction. By summarizing groups of similar data points, it allows us to simplify complex, high-dimensional datasets into more digestible forms (typically with fewer dimensions). This can then aid in visualization and interpretation, especially in domains like image processing (e.g., segmenting images into meaningful regions) or natural language processing (e.g., topic modeling via semantic similarity of sentences). In fact, I used clustering for topic discovery in my presentation, “The Intersection of Artificial Intelligence and Ophthalmology” at XXX Congreso Boliviano de Oftalmología (see my review article here).

Finally, from a mathematical perspective, clustering helps identify geometric and statistical structure in data. It can reveal relationships between points in a metric space or model distributions within data, as in the case of Gaussian Mixture Models, which are often used in fields like population dynamics.

Types of Clustering

Before we get into the details of the K-Means clustering algorithm, let’s briefly review some of the other algorithms that are popular within the data science and machine learning community. Each method has its strengths and trade-offs, and the right choice often depends on the shape of your data, the level of noise, and whether or not you know how many clusters to expect.

Hierarchical Clustering

This method builds a hierarchy of clusters, i.e. a tree-like structure. It can be subdivided into two main types:

  • Agglomerative (bottom-up): Starts with each data point as its own cluster and iteratively merges pairs of clusters based on a similarity metric.
  • Divisive (top-down): Starts with all points in a single cluster and recursively splits them.

The strength of hierarchical clustering is that there is no need to predefine the number of clusters. This is advantageous since, with a new dataset, you may have no reason to choose a certain number of clusters over another, especially if you don’t have specific domain knowledge.

Its weakness is that it is potentially computationally expensive for large datasets. Although GPU processing may alleviate this, it can still represent a high barrier to entry for starting your analysis.
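
As a minimal sketch (assuming a toy dataset from make_blobs and an arbitrary distance threshold), agglomerative clustering can be run via scikit-learn without fixing the number of clusters in advance:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# With n_clusters=None, the cluster tree is cut where merging would exceed
# distance_threshold, so the number of clusters is not specified up front.
# The threshold value here is an assumption chosen for this toy data.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0, linkage="ward")
labels = agg.fit_predict(X)

print(agg.n_clusters_, "clusters found")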

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that identifies clusters as regions of high data density separated by more sparse regions. Points are classified as core points, border points, or noise based on their neighborhood density.

Its strengths are that it can detect clusters of arbitrary shape and that it handles noise and outliers well. Its weakness is that it is sensitive to parameter choices and tuning (especially the eps and min_samples parameters).
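
A minimal sketch follows, assuming the two-moons toy dataset and that the eps and min_samples values below are reasonable for it:

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaved half-circles: clusters that are not spherical.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius; min_samples is the minimum number of
# neighbours required for a point to be considered a core point.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # the label -1 marks points treated as noise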

Mean Shift Clustering

Mean Shift is a centroid-based algorithm that iteratively shifts candidate centroids towards regions of higher point density. Each candidate centroid is moved to the mean of the data points within a defined bandwidth (radius), effectively shifting it toward the local “center of mass” of the data.

The strengths of Mean Shift clustering include being able to find clusters without specifying their number in advance, which is again an advantage when you have no prior (domain) knowledge of what an appropriate number of clusters should be. It also works well for non-spherical clusters. Its weaknesses include being computationally intensive and highly sensitive to the bandwidth (radius) selection.
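
A minimal sketch, assuming the bandwidth is estimated from the data with scikit-learn's estimate_bandwidth helper (the quantile value is an illustrative assumption) rather than tuned by hand:

from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# The quantile controls how local the estimated bandwidth is.
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth).fit(X)
print(len(ms.cluster_centers_), "clusters found")  # the number of clusters was not specified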

Gaussian Mixture Models

A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data are generated from a mixture of several Gaussian distributions, each representing a cluster. Unlike K-Means, a GMM assigns soft cluster memberships to each point, i.e. each point has a probability of belonging to each cluster, rather than a hard cluster membership assignment.

Regarding their strengths, GMMs are flexible and can handle overlapping clusters and non-spherical shapes. Their weaknesses include being computationally expensive and sensitive to initialization settings. GMMs may also converge to a local optimum rather than the global one.
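
A minimal sketch of soft memberships with scikit-learn's GaussianMixture, assuming three mixture components for a toy dataset:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

hard_labels = gmm.predict(X)              # most likely component for each point
soft_memberships = gmm.predict_proba(X)   # one probability per component, per point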

Spectral Clustering

Spectral Clustering uses the eigenvalues of a similarity matrix (derived from a graph representation of the data) to reduce dimensionality before applying a clustering algorithm like K-Means. It is particularly effective for non-convex clusters, i.e. when clusters are not linearly separable.

Its strengths include being effective for clusters with complex shapes and not being limited to linear separations of the data. Its weaknesses are that it requires a well-defined, meaningful similarity graph to operate on, and that it can be computationally demanding.
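
A minimal sketch, assuming a nearest-neighbours similarity graph over the two-moons toy dataset:

from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Build the similarity graph from each point's 10 nearest neighbours.
sc = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=0
)
labels = sc.fit_predict(X)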

OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based algorithm similar to DBSCAN, but better suited to data with varying densities. It generates an augmented ordering of the dataset from which a reachability plot can be created, helping to visualize the clustering structure.

OPTICS is more robust than DBSCAN when the clustering structure varies in density, but interpreting the reachability plot can be challenging.
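
A minimal sketch, assuming a toy dataset whose three blobs have deliberately different spreads (densities); the reachability values are what underpin the reachability plot mentioned above:

from sklearn.datasets import make_blobs
from sklearn.cluster import OPTICS

# Three blobs with different standard deviations, i.e. different densities.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.5, 3.0], random_state=0)

optics = OPTICS(min_samples=10).fit(X)
labels = optics.labels_                                  # -1 marks noise
reachability = optics.reachability_[optics.ordering_]    # values behind the reachability plot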

Affinity Propagation

This algorithm treats every point as a potential cluster center (i.e., an ‘exemplar’ or representative point) and uses message passing between points to identify exemplars and define their associated clusters. Unlike K-Means, it does not require the number of clusters to be predefined.

Its strengths are that it can find clusters of varying sizes, does not require the number of clusters to be specified in advance, and is less sensitive to initialization. Its weaknesses are that it can be computationally expensive and can sometimes over-fragment the dataset into too many small clusters.
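
A minimal sketch follows; the number of clusters is not specified, although the damping and preference parameters influence how many exemplars emerge (the damping value below is an illustrative assumption):

from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)

exemplars = ap.cluster_centers_  # the chosen representative points
labels = ap.labels_
print(len(exemplars), "clusters found")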

BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed for very large datasets. It incrementally clusters incoming data points by summarizing them using a specialized tree structure called the Clustering Feature Tree.

BIRCH scales well to very large datasets while remaining memory-efficient. However, it assumes the clusters are spherical and so may not perform well with arbitrary or complex cluster shapes.
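
A minimal sketch, assuming a threshold suited to this toy dataset; for genuinely large data, Birch's partial_fit can be called on successive batches:

from sklearn.datasets import make_blobs
from sklearn.cluster import Birch

X, _ = make_blobs(n_samples=10_000, centers=3, n_features=2, random_state=0)

# The threshold bounds the radius of each leaf subcluster in the CF tree.
birch = Birch(threshold=0.5, n_clusters=3).fit(X)
labels = birch.predict(X)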

CURE

CURE (Clustering Using Representatives) identifies clusters by selecting multiple representative points for each cluster, and then shrinks them towards the cluster’s centroid. Doing so dampens the influence of outliers and thus improves robustness against noise.

It handles clusters of varying shapes and sizes better than more traditional centroid-based methods but it is more computationally intensive (than those traditional, simpler methods) for large datasets.

The Goal of Clustering

Suppose that we have N n-vectors: x_1, ..., x_N.

That is to say, each vector has n components, and there are N such vectors. In this context, the goal of clustering is to group or partition the vectors (if possible) into k groups or clusters, with the vectors in each group close to each other.

Part of the task of clustering a collection of vectors is to determine whether or not the vectors can be divided into k groups, with vectors in each group near each other. Of course this depends on k, the number of clusters, and the particular data, i.e., the vectors x_1, ..., x_N.

Normally we have k much smaller than N, i.e., there are many more vectors than there are groups. Typical applications of clustering algorithms use values of k that range from a handful to a few hundred or more, while values of N can range from hundreds to billions.

Toy dataset
from sklearn.datasets import make_blobs

# Seed for reproducibility, used by make_blobs and KMeans below.
random_state = 69

# X - ndarray of shape (n_samples, n_features)
#     The generated samples.
# y - ndarray of shape (n_samples,)
#     The integer labels for cluster membership of each sample.
# centers - ndarray of shape (n_centers, n_features)
#           The centers of each cluster.
#           Only returned if return_centers=True.

X, y, centers = make_blobs(
    # The total number of points equally divided among clusters.
    n_samples=500,
    # The fixed center locations.
    centers=[[-5, 3], [5, -5], [9, 9]],
    # The number of features for each sample.
    n_features=2,
    # The standard deviation of the clusters.
    # Smaller value means tighter clusters.
    cluster_std=2.0,
    center_box=(-10.0, 10.0),
    random_state=random_state,
    return_centers=True
)
K-Means clustering
from sklearn.cluster import KMeans

# Initialize the KMeans model.
kmeans = KMeans(
    # Number of clusters.
    n_clusters=3,
    # Initialization method.
    init='k-means++',
    # Number of times the K-Means algorithm will
    # be run with different centroid seeds.
    n_init=10,
    # Maximum number of iterations.
    max_iter=300,
    random_state=random_state
)

# Fit the model to the data.
kmeans.fit(X)

# Get the cluster labels for each data point.
cluster_labels = kmeans.labels_

# Get the coordinates of the final centroids.
kmeans_centroids = kmeans.cluster_centers_

Illustration of a toy dataset (left) and the result of clustering with k=3 (right) using the Python scikit-learn library: the make_blobs function and the KMeans class, respectively.

The illustration above shows a simple example, with N = 500 2-vectors, shown as small circles. We can easily see that this collection of vectors can be divided into k = 3 clusters, shown on the right with the colors representing the different clusters. It may well be possible to partition these data into other numbers of clusters, but clearly k = 3 is a good value. For reference, we note that the K-Means algorithm uses the Euclidean distance as its distance metric (more will be said about the algorithm itself later on).

Note of course that this example is not typical in several ways:

  • First, the vectors have dimension n = 2. Clustering any set of 2-vectors is easy: we simply create a scatter plot of our dataset and check visually whether the points are clustered, and if so, how many clusters there are. In almost all applications n is larger than 2 (and typically much larger than 2), in which case this simple visual method cannot be used. In that more general case, visual inspection must be replaced by formal methods such as the silhouette score or the elbow method, especially in high-dimensional spaces (see the sketch after this list).
  • Secondly, the points are very well clustered. In most applications the data will not be as cleanly clustered as in this toy example; data from the real world is usually messy, with several if not many points lying in between clusters.
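
As a minimal sketch of those formal methods (assuming X is the toy dataset generated above), we can compare the K-Means inertia (the basis of the elbow method) and the silhouette score over a range of candidate values of k:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(
        f"k={k}: inertia={km.inertia_:.1f}, "
        f"silhouette={silhouette_score(X, km.labels_):.3f}"
    )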

Even when the data are not as cleanly separated as in the example above, and the best value of k is not as obvious, clustering is still very useful in practice.

Common applications of clustering

Clustering is widely used across many technical, data-focused fields to find meaningful structure in unlabeled data. Here are some typical applications, each framed in terms of clustering vectors that describe real-world entities.

Topic discovery

Suppose x_1, ..., x_N are word histograms associated with N documents; thus x_i is a word histogram representing document i, using counts from a shared dictionary (a “vocabulary”). A clustering algorithm partitions the documents into k groups, based on word usage. Typically these clusters can be interpreted as corresponding to groups of documents with the same or similar topics, genre, or author. Since the clustering algorithm runs automatically and without any understanding of what the words in the dictionary mean, this is sometimes referred to as automatic topic discovery.

Patient clustering

Let x_1, ..., x_N represent feature vectors associated with N patients admitted to a hospital, such that x_i is the feature vector describing patient i (e.g., lab results, demographics, medical history). A clustering algorithm clusters the patients into k cohorts of similar patients (at least in terms of their feature vectors), such that the clusters correspond to shared conditions, treatment responses, or care pathways, supporting both clinical insight and personalized medicine.

Customer market segmentation

Suppose the vector x_i gives the quantities (or dollar values) of items purchased by customer i over some period of time. A clustering algorithm will group the N customers into k market segments, representing groups of customers with similar purchasing patterns, enabling targeted marketing, recommendation systems, and customer lifecycle analysis.

ZIP or Post Code clustering

Suppose that x_i is a vector giving quantities or statistics for the residents of ZIP code i, such as numbers of residents in various age groups, household size, education statistics, and income statistics. Imagine we have N = 40,000 such vectors, one per ZIP code; a clustering algorithm might be used to cluster the 40,000 ZIP codes into, say, k groups of ZIP codes with similar statistics, which reveal regions with similar socioeconomic profiles. These clusters could then be used for purposes of urban planning, public policy, or retail expansion.

Student clustering

Suppose the vector x_i encodes the detailed grading record of student i in a course, i.e., her grades on each question in the tutorial assignments, homework assignments, and formal exams. A clustering algorithm might be used to cluster the N students into k groups, revealing those students who performed similarly, those who consistently scored highly, those who were late improvers, or students struggling in specific areas. This in turn might support adaptive learning interventions.

Survey response clustering

A group of N people respond to a survey with n questions. Each question contains a statement, such as “The movie was too long”, followed by some ordered options such as Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree. This is called a Likert scale, named after the psychologist Rensis Likert. Suppose the n-vector x_i encodes the selections of respondent i on the n questions, using the numerical coding -2, -1, 0, +1, +2 for the responses above. A clustering algorithm can be used to cluster the respondents into k groups, each with similar responses to the survey, revealing attitudinal or behavioural segments, which would be useful in market research, public opinion analysis, and social science.

Weather zones

For each of N counties we have a 24-dimensional vector x_i that gives the average monthly temperature in the first 12 entries and the average monthly rainfall in the last 12 entries. We can standardize all the temperatures, and all the rainfall data, so they have a typical range between -1 and +1. The vector x_i summarizes the annual weather pattern in county i. A clustering algorithm can be used to cluster the counties into k groups that have similar weather patterns, which we might refer to as “weather zones”. This clustering can be shown on a map, and used to recommend land use depending on zone, such as landscape plantings, agriculture, energy planning, or regional landscaping advice.

Daily energy use patterns

The 24-vectors x_1, ..., x_N give the average (electric) energy use for N customers over some period (say, a month) for each hour of the day. A clustering algorithm partitions the customers into k groups, each with similar patterns of daily energy consumption. We might expect a clustering algorithm to “discover” usage archetypes, such as which customers have a swimming pool, an electric water heater, or solar panels. Utility companies can then use these groupings for demand forecasting and rate design.

Financial sectors

For each of N companies we have an n-vector x_i whose components are financial and business attributes such as total capitalization, quarterly returns and risks, trading volume, profit and loss, or dividends paid. These quantities would typically be scaled so as to have similar ranges of values. A clustering algorithm would group the companies into k sectors, i.e., groups of companies with similar attributes, financial profiles, industry sectors, or investment strategies.

Clustering Objective

Let’s now formalize the clustering procedure, and introduce a natural measure to quantify the quality of a given clustering.

Specifying the cluster assignments

We specify a clustering of the N vectors by stating which cluster or group each vector belongs to. We label the groups 1, ..., k, and specify a clustering or assignment of the given vectors to groups using an N-vector c, where c_i is the group (number) that the vector x_i is assigned to.

  • As a simple example with N = 5 vectors and k = 3 groups, c = (3, 1, 1, 1, 2) means that:
    • x_1 is assigned to group 3.
    • x_2, x_3, and x_4 are assigned to group 1.
    • x_5 is assigned to group 2.

We also describe the clustering by the sets of indices for each group; let G_j be the set of indices corresponding to group j.

For our simple example above, we have G_1 = {2, 3, 4}, G_2 = {5}, and G_3 = {1}.

Formally, we can express these index sets in terms of the group assignment vector c as G_j = {i : c_i = j}, which means that G_j is the set of all indices i for which c_i = j.
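
As a minimal sketch (assuming the example assignment vector above), NumPy can recover the index sets G_j directly from c:

import numpy as np

c = np.array([3, 1, 1, 1, 2])  # c_i is the group that vector x_i is assigned to
k = 3

# G_j is the set of (1-based) indices i for which c_i equals j.
G = {j: (np.where(c == j)[0] + 1).tolist() for j in range(1, k + 1)}
print(G)  # {1: [2, 3, 4], 2: [5], 3: [1]}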

Group representatives

With each of the k groups we associate a group representative n-vector, which we denote z_1, ..., z_k. These representatives can be any n-vectors; they do not need to be one of the given vectors.

We want each representative to be close to the vectors in its associated group, i.e., we want the distances ||x_i - z_{c_i}|| to be small for all i.

Note that x_i is in group j = c_i, so z_{c_i} is the representative vector associated with data vector x_i.

The Clustering Objective

We can now give a single number that we use to judge a choice of clustering, along with a choice of the group representatives.

We define

J^clust = ( ||x_1 - z_{c_1}||^2 + ... + ||x_N - z_{c_N}||^2 ) / N,

which is the mean square distance from the vectors to their associated representatives. This is an “ensemble average” of the kind used in statistical mechanics, where the mean square deviation of particles from a reference position measures the spatial extent of their random motion; here, it measures how far, on average, the data vectors lie from their cluster representatives.

Note that J^clust depends on the cluster assignments (i.e., the vector c), as well as the choice of the group representatives z_1, ..., z_k. The smaller J^clust is, the better the clustering.

An extreme case is J^clust = 0, which means that the distance between every original vector and its assigned representative is zero. This happens only when the original collection of vectors takes only k different values, and each vector is assigned to the representative it is equal to.
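
As a minimal sketch (assuming X, labels, and the representatives are NumPy arrays shaped as in the scikit-learn example above, with 0-based labels), the objective J^clust can be computed directly:

import numpy as np

def j_clust(X, labels, reps):
    """Mean square distance from each vector to its assigned representative."""
    diffs = X - reps[labels]  # x_i - z_{c_i} for every i
    return np.mean(np.sum(diffs ** 2, axis=1))

# Example usage with the fitted KMeans model from earlier:
# j_clust(X, kmeans.labels_, kmeans.cluster_centers_)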

Our choice of clustering objective makes sense, since it encourages all points to be near their associated representative, but there are other reasonable choices.

  • For example, it is possible to use an objective that encourages more balanced groupings.
  • But we will stick with this basic (and very common) choice of clustering objective.

The k-Means Algorithm

The two calculations we need to solve

We have two calculations to carry out:

  1. Choosing the group assignments.
  2. Choosing the group representatives.

Both choices are made with the aim of minimizing the clustering objective, i.e., making J^clust as small as possible.

Solving the two calculations

The two choices/calculations we need to make are circular, i.e., each depends on the other. Thus, instead of solving the problem directly, we rely on a very old idea in computation: we simply iterate between the two choices.

This means that we repeatedly alternate between updating the group assignments, and then updating the representatives. In each step the objective J^clust gets better (i.e., goes down) unless the step does not change the choice.

Iterating between choosing the group representatives and choosing the group assignments is the celebrated K-Means algorithm for clustering a collection of vectors.

And so (finally!) to the K-Means algorithm

The K-Means algorithm was first proposed in 1957 by Stuart Lloyd, and independently by Hugo Steinhaus. It is sometimes called the Lloyd algorithm. The name “K-Means” has been used since the 1960s.

In pseudocode the algorithm reads like the following:

Given: a list of N vectors x1, ..., xN, and an initial list of k group representative vectors z1, ..., zk

Repeat until convergence:

  1. Partition the vectors into k groups. For each vector i=1, …, N, assign xi to the group associated with the nearest representative.
  2. Update representatives. For each group j=1, …, k, set zj to be the mean of the vectors in group j.

Regarding convergence, the fact that J^clust decreases in each step implies that the K-Means algorithm converges in a finite number of steps.

However, depending on the initial choice of representatives, the algorithm can, and does, converge to different final partitions, with different objective values. The K-Means algorithm is a heuristic, which means it cannot guarantee that the partition it finds minimizes our objective J^clust.

For this reason it is common to run the K-Means algorithm several times, with different initial representatives, and choose the one among them with the smallest final value of J^clust.

Let’s create a gif animation to demonstrate the iteration of the K-Means algorithm, as we’ve just defined it. First, create a toy dataset using the make_blobs function from scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.animation import FuncAnimation
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

# X - ndarray of shape (n_samples, n_features)
#     The generated samples.

# Generate synthetic data using make_blobs
X, _ = make_blobs(
    # Effectively the number of rows in the matrix X.
    n_samples=500,
    centers=[[-5, 3], [5, -5], [9, 9]],
    # Effectively the number of columns in the matrix X.
    n_features=2,
    cluster_std=2.0,
    random_state=69,
    return_centers=False
)

Then, define a similar target for K-Means clustering, again as we did previously. We expect there to be 3 clusters, so we set n_clusters=3. We can set as many iterations as we feel are necessary with max_iter, knowing that in the code below there will be an if clause that checks for convergence. Let’s also define the locations of our centroids. These are initialized to points randomly selected from X, i.e., they are taken from the dataset itself.

# K-Means parameters
n_clusters = 3
max_iter = 10
plot_ranges = [[-9.5, 13.5], [-10, 14]]

np.random.seed(0)

# Randomly select 'n_clusters' unique indices from
# the range [0, len(X)) i.e. to len(X) - 1.
# False ensures that the same point is not selected more than once.
# Thus the 'n_clusters' centroids' initial positions are initialized
# to points randomly selected from X (i.e. taken from the dataset itself).
centroids = X[np.random.choice(len(X), n_clusters, replace=False)]

Initialize lists to store the results of each iteration, starting with the initial locations of the centroids and the associated initial cluster label assignments.

# Initial assignment of points to the nearest centroid.
# This function computes for each row in 'X', the index of
# the row of 'centroids' which is closest to it.
# The default distance metric used is the Euclidean distance.
initial_labels = pairwise_distances_argmin(X, centroids)
labels_history = [initial_labels]

centroids_history = [centroids.copy()]

# Calculate initial mean square distance.
squared_distances = np.sum((X - centroids[initial_labels]) ** 2, axis=1)
mean_square_distance = np.mean(squared_distances)
print(f"Initial iteration: Mean Square Distance = {mean_square_distance}")

Iterate over the steps we defined in our pseudocode above.

# Iterate the K-Means algorithm based on our pseudocode.
for iteration in range(1, max_iter):
    # Step 1: Assign points to the nearest centroid.
    labels = pairwise_distances_argmin(X, centroids)

    # Step 2: Compute new centroids.
    new_centroids = []
    for i in range(n_clusters):
        points_in_cluster = X[labels == i]
        if len(points_in_cluster) > 0:
            new_centroids.append(points_in_cluster.mean(axis=0))
        else:
            # No points assigned to this cluster, keep the centroid the same.
            new_centroids.append(centroids[i])

    new_centroids = np.array(new_centroids)

    # Calculate mean square distance
    squared_distances = np.sum((X - new_centroids[labels]) ** 2, axis=1)
    mean_square_distance = np.mean(squared_distances)

    # Check for convergence
    if np.allclose(centroids, new_centroids):
        print(f"\nConverged after {len(centroids_history) - 1} iterations.\nThe centroid positions after the {len(centroids_history)} iteration being identical to the previous iteration.")
        break

    print(f"Iteration {iteration}: Mean Square Distance = {mean_square_distance}")

    # Record this iteration's labels and centroids.
    # They are appended only if convergence has not been reached, which
    # prevents the final two history entries from being identical.
    labels_history.append(labels)
    centroids_history.append(new_centroids.copy())

    centroids = new_centroids

# Ensure both histories are the same length
try:
    assert len(centroids_history) == len(labels_history)
except AssertionError:
    print(f"centroids_history:\n{centroids_history}\n")
    print(f"labels_history:\n{labels_history}\n")
    raise AssertionError(f"Length of centroids_history ({len(centroids_history)}) and labels_history ({len(labels_history)}) do not match.")
else:
    # Check if the final two entries of 'centroids_history' are identical.
    if np.array_equal(centroids_history[-1], centroids_history[-2]):
        print("The final two entries of 'centroids_history' are identical.")
    # Likewise for 'labels_history'.
    if np.array_equal(labels_history[-1], labels_history[-2]):
        print("The final two entries of 'labels_history' are identical.")

Plot the result, step by step, to a gif.

# Create animation
fig, ax = plt.subplots(figsize=(10, 10))

def init():
    ax.clear()
    ax.set_xlim(plot_ranges[0])
    ax.set_ylim(plot_ranges[1])
    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.xaxis.set_tick_params(labelbottom=False, length=0)
    ax.yaxis.set_tick_params(labelleft=False, length=0)
    return ax,

def update(frame):
    ax.clear()
    sns.scatterplot(
      x=X[:, 0], y=X[:, 1],
      hue=labels_history[frame],
      palette="bright",
      s=40,
      ax=ax,
      legend=None
    )
    ax.scatter(
      centroids_history[frame][:, 0], centroids_history[frame][:, 1],
      s=60, c='red', marker='X', edgecolor='black'
    )
    ax.set_xlim(plot_ranges[0])
    ax.set_ylim(plot_ranges[1])
    ax.set_title(f"K-Means Clustering - Iteration {frame + 1}")
    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.xaxis.set_tick_params(labelbottom=False, length=0)
    ax.yaxis.set_tick_params(labelleft=False, length=0)
    return ax,

min_frames = min(len(centroids_history), len(labels_history))

ani = FuncAnimation(
  fig, update, frames=min_frames, init_func=init, blit=False
)

# Save the animation as a GIF.
ani.save(
  "kmeans_clustering_animation.gif", writer="imagemagick", fps=1
)

plt.close(fig)

Giving us the following iterative progress.

Animated gif of K-Means clustering iterations.
Iterations of K-Means algorithm versus mean square distance.

Concluding thoughts

To conclude, let us simply remind ourselves that clustering algorithms serve as foundational techniques across the domains of data science, machine learning, and mathematics, offering a way to uncover inherent structures and patterns in data. We focused specifically on the K-Means algorithm, it being one of the most popular unsupervised techniques available to the data science community. We discussed many different applications of clustering, followed by a walkthrough of the K-Means algorithm itself, and finished with an animated gif of its iterative progress.

References