Using cluster analysis to segment your data
Image by Pexels

Machine learning (ML for short) is not just about making predictions. There are also unsupervised tasks, of which clustering is one that stands out. This article introduces clustering and cluster analysis, and highlights the potential of cluster analysis for segmenting, analyzing, and gaining insights from groups of similar data.

What is clustering?

Simply put, clustering is a synonym for grouping similar data items. It is akin to organizing and grouping similar types of fruits and vegetables in a grocery store.

Let’s elaborate on this concept: clustering is a form of unsupervised learning task: a broad family of machine learning approaches in which data is treated a priori as unlabeled or uncategorized, and the goal is to discover the patterns or insights underlying it. The goal of clustering in particular is to discover groups of data observations with comparable characteristics or properties.

This is where clustering sits within the spectrum of ML techniques:

Clustering within the ML landscape

To better understand the concept of clustering, think of finding segments of customers in a supermarket with similar shopping behavior, or grouping a large number of products in an e-commerce portal into categories or similar items. These are common examples of real-world scenarios with clustering processes.

Common clustering techniques

There are several methods for clustering data. Three of the most popular families of methods are:

  • Iterative clustering: these algorithms iteratively assign (and sometimes re-assign) data points to their respective clusters until they converge to a “good enough” solution. The most popular iterative clustering algorithm is k-means, which assigns data points to clusters defined by representative points (cluster centroids) and gradually updates these centroids until convergence is achieved.
  • Hierarchical clustering: as the name suggests, these algorithms build a hierarchical tree structure using either a top-down approach (splitting the set of data points until the desired number of subgroups is reached) or a bottom-up approach (gradually merging similar data points into increasingly larger groups). Agglomerative hierarchical clustering (AHC) is a common example of a bottom-up hierarchical clustering algorithm.
  • Density-based clustering: these methods identify areas with a high density of data points to form clusters. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular algorithm in this category.
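To make the three families concrete, here is a minimal sketch that runs one representative of each on the same synthetic data. It uses scikit-learn with toy data from make_blobs (a stand-in, not the penguin dataset used later), and the eps and min_samples values for DBSCAN are illustrative assumptions:

```python
# Sketch: one algorithm from each clustering family on the same toy data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# 300 points drawn around 3 Gaussian centers
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Iterative: k-means with k=3
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Hierarchical (bottom-up): agglomerative clustering
ahc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Density-based: DBSCAN needs no k; eps sets the neighborhood radius
# (points labeled -1 are treated as noise rather than forced into a cluster)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(len(set(km_labels)), len(set(ahc_labels)))
```

Note that only DBSCAN can leave points unassigned (label -1): the other two always place every point in some cluster.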

Are clustering and cluster analysis the same?

The burning question at this point might be: do clustering and cluster analysis refer to the same concept?
There is no doubt that they are closely related, but they are not the same. There are subtle differences.

  • Clustering is the process of grouping similar data so that two objects in the same group or cluster are more similar than two objects in different groups.
  • Meanwhile, cluster analysis is a broader term that includes not only the process of grouping (clustering) data, but also the analysis, evaluation, and interpretation of the obtained clusters within a specific domain context.

The diagram below illustrates the difference and relationship between these two often confused terms.

Clustering versus cluster analysis

Practical example

From now on, let’s focus on cluster analysis through a practical example in which we:

  1. Segment a set of data.
  2. Analyze the obtained segments.

NOTE: The accompanying code in this example assumes some familiarity with the basics of the Python language and libraries such as sklearn (for training cluster models), pandas (for data management), and matplotlib (for data visualization).

We will illustrate the cluster analysis on the Penguins of the Palmer Archipelago dataset, which contains data observations on penguin specimens classified into three different species: Adelie, Gentoo and Chinstrap. This dataset is quite popular for training classification models, but also has a lot to say about finding clusters of data within it. All we need to do after loading the dataset file is assume that the ‘species’ class feature is unknown.

import pandas as pd
penguins = pd.read_csv('penguins_size.csv').dropna()
X = penguins.drop('species', axis=1)

We will also remove two categorical features from the dataset, describing the sex of the penguin and the island where the specimen was observed, leaving only the numerical features. We also store the known labels (species) in a separate variable: they will be useful later to compare the obtained clusters against the actual penguin species in the dataset.

X = X.drop(['island', 'sex'], axis=1)
y = penguins.species.astype("category").cat.codes

With the following few lines of code, it is possible to apply the k-means clustering algorithm available in the scikit-learn library to find a number of clusters in our data. We only need to specify the number of clusters we want to find. In this case, we group the data into k=3 clusters:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, n_init=100)
X["cluster"] = kmeans.fit_predict(X)

The last line in the above code stores the cluster result, namely the id of the cluster assigned to each data instance, in a new attribute named ‘cluster’.
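In this example we know in advance that there are three penguin species, but in many real scenarios the number of clusters is unknown. A common heuristic is the “elbow” method: fit k-means for increasing values of k and look for the point where the drop in inertia (the within-cluster sum of squared distances) flattens out. Here is a minimal sketch on synthetic data from make_blobs (a stand-in, since the method only needs a numeric feature matrix like our X):

```python
# Sketch: the "elbow" heuristic for choosing k with k-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data generated around 3 centers
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# Inertia (within-cluster sum of squared distances) for k = 1..6
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
            for k in range(1, 7)]

# Inertia always decreases as k grows; the "elbow" where the decrease
# flattens (here, around k=3) suggests a reasonable number of clusters.
print(inertias)
```

Plotting inertias against k makes the elbow easier to spot visually.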

Time to generate some visualizations of our clusters to analyze and interpret them! The following code snippet is a bit long, but it boils down to generating two data visualizations: the first shows a scatterplot over two data features – culmen length and flipper length – and the cluster each observation belongs to, while the second shows the actual penguin species each data point belongs to.

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4.5))
# Visualize the clusters obtained for two of the data attributes: culmen length and flipper length
plt.subplot(121)
plt.plot(X[X["cluster"]==0]["culmen_length_mm"],
         X[X["cluster"]==0]["flipper_length_mm"], "mo", label="First cluster")
plt.plot(X[X["cluster"]==1]["culmen_length_mm"],
         X[X["cluster"]==1]["flipper_length_mm"], "ro", label="Second cluster")
plt.plot(X[X["cluster"]==2]["culmen_length_mm"],
         X[X["cluster"]==2]["flipper_length_mm"], "go", label="Third cluster")
plt.plot(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 2], "kD", label="Cluster centroid")
plt.xlabel("Culmen length (mm)", fontsize=14)
plt.ylabel("Flipper length (mm)", fontsize=14)
plt.legend(fontsize=10)

# Compare against the actual ground-truth class labels (real penguin species)
plt.subplot(122)
plt.plot(X[y==0]["culmen_length_mm"], X[y==0]["flipper_length_mm"], "mo", label="Adelie")
plt.plot(X[y==1]["culmen_length_mm"], X[y==1]["flipper_length_mm"], "ro", label="Chinstrap")
plt.plot(X[y==2]["culmen_length_mm"], X[y==2]["flipper_length_mm"], "go", label="Gentoo")
plt.xlabel("Culmen length (mm)", fontsize=14)
plt.ylabel("Flipper length (mm)", fontsize=14)
plt.legend(fontsize=12)
plt.show()

Here are the visualizations:

Clustering of penguin data

By observing the clusters we can gain some initial insight:

  • There is a subtle, but not very clear, separation between the data points (penguins) assigned to the different clusters, with some slight overlap between the subgroups found. This does not necessarily lead us to conclude that the clustering results are good or bad: we applied the k-means algorithm to several features of the dataset, but this visualization shows how data points are positioned across clusters in terms of just two features: ‘culmen length’ and ‘flipper length’. There may be other pairs of features under which the clusters appear more clearly separated from each other.

This leads to the question: what if we try to visualize our clusters using two other variables that we used for training the model?

Let’s try to visualize the body mass (grams) and culmen length (mm) of the penguins.

plt.plot(X[X["cluster"]==0]["body_mass_g"],
         X[X["cluster"]==0]["culmen_length_mm"], "mo", label="First cluster")
plt.plot(X[X["cluster"]==1]["body_mass_g"],
         X[X["cluster"]==1]["culmen_length_mm"], "ro", label="Second cluster")
plt.plot(X[X["cluster"]==2]["body_mass_g"],
         X[X["cluster"]==2]["culmen_length_mm"], "go", label="Third cluster")
plt.plot(kmeans.cluster_centers_[:, 3], kmeans.cluster_centers_[:, 0], "kD", label="Cluster centroid")
plt.xlabel("Body mass (g)", fontsize=14)
plt.ylabel("Culmen length (mm)", fontsize=14)
plt.legend(fontsize=10)
plt.show()

Clustering of penguin data

This one looks crystal clear! Now we have divided our data into three distinct groups. And we can gain additional insights by further analyzing our visualization:

  • There is a strong relationship between the clusters found and the values of the features ‘body mass’ and ‘culmen length’. Moving from the bottom left to the top right of the chart, penguins in the first group are characterized by being small, with low values of ‘body mass’, but they show widely varying culmen lengths. Penguins in the second group are of average size, with average-to-high values of ‘culmen length’. Finally, penguins in the third group are characterized by being larger and having longer culmens.
  • It can also be noted that there are a few outliers, i.e. data observations with atypical values that lie far from the majority. This is especially noticeable with the dot at the very top of the visualization area, indicating some observed penguins with unusually long culmens compared to the rest of their group.
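Beyond eyeballing the scatterplots, the agreement between the clusters we found and the true species labels can also be quantified, for example with the adjusted Rand index from scikit-learn, which compares two label vectors (such as our y and X["cluster"]) while ignoring how the cluster ids are numbered. A minimal sketch on stand-in label vectors:

```python
# Sketch: measuring cluster/label agreement with the adjusted Rand index.
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Stand-in labels: the grouping is identical, only the cluster ids differ
species = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
clusters = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])

# ARI is 1.0 for identical partitions, near 0.0 for random assignments
print(adjusted_rand_score(species, clusters))  # → 1.0
```

Because the index is invariant to relabeling, it suits exactly this situation, where k-means has no way of knowing which cluster id should correspond to which species.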

Wrapping up

This post illustrated the concept and practical application of cluster analysis as the process of finding subgroups of elements with similar characteristics or properties in your data and analyzing these subgroups to extract valuable or actionable insights from them. From marketing to e-commerce to ecology projects, cluster analysis is widely applied in a variety of real-world domains.

Ivan Palomares Carrascosa is a thought leader, author, speaker and advisor in AI, machine learning, deep learning and LLMs. He trains and mentors others in leveraging AI in the real world.