- On April 20, 2017
- analysis, cluster, data, technique
Cluster analysis is an essential weapon in every data scientist’s armoury. It is a great way to get a sense of your data and is probably one of the most popular analysis techniques going around. This blog post will give you a quick primer on how to perform cluster analysis in R.
Clustering aims to divide data into groups (clusters). Imagine you are a strength and conditioning coach who wants to create different training regimes for players with different 40m sprint times (or different blood lactate levels or different deadlift numbers or whatever). It may not be feasible to create a different regime for each individual player, ideal as that might be, so the next best thing would be to find 4-5 groups of players who have similar values and then create a different training regime for each group.
Clustering could help you find the best way to divide those players into groups. Thankfully we have programs like R, which help us find the best way to group the data without us having to do the messy mathematics ourselves.
When clustering goes well, the data points within each cluster will be similar to each other, but the data points in different clusters will be distinct from each other. Not ALL data is ‘clusterable’ though. It’s a bit like linear regression: you can input non-linear data into a linear model and still get a result, but it’s not going to mean anything. Similarly with clustering, you can input non-clustered data into a clustering algorithm and still get a result, but it will be meaningless.
An example with the Iris data set
As an example of what clustering looks like, take a look at the scatter plot matrix of the ‘iris’ data set below. The iris data set comes built into R and contains the petal and sepal lengths and widths (sepals are the small green leaves under the petals) of 150 iris flowers, 50 from each of three species: Setosa, Versicolor and Virginica. The points in each plot are coloured according to species, and you can see that some of the plots show distinct clustering.
Figure 1: Setosa is pink, Versicolor is green and Virginica is blue
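A plot along these lines can be produced with base R's `pairs()` function. This is a minimal sketch rather than the exact code behind Figure 1; the colour mapping is simply chosen to match the caption.

```r
# iris ships with R: 150 flowers, four measurements and a Species column
data(iris)

# Map each species to the colours named in the Figure 1 caption
species_cols <- c(setosa = "pink", versicolor = "green", virginica = "blue")
point_cols <- species_cols[as.character(iris$Species)]

# Scatter plot matrix of the four numeric measurements
pairs(iris[, 1:4], col = point_cols, pch = 19,
      main = "Iris measurements by species")
```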
Clustering AIS data: K-means and Hierarchical Clustering
So let’s do some clustering of our own.
I’m using data from a 1991 Australian Institute of Sport study that looked at how various characteristics of blood varied with the sport, body size and sex of the athlete. The data contains 202 observations of 13 variables, such as red blood cell count, haemoglobin concentration, body mass index, sum of skin folds, and more. I got it from the downloadable ‘DAAG’ R package.
Here’s a scatter plot matrix of some of the variables.
Figure 2. Scatter plot matrix of AIS data; pcBfat vs lbm seems to show some clustering
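Loading the data and drawing a scatter plot matrix might look something like the sketch below. It assumes the DAAG package is installed and that the data set is the one named `ais`; the subset of variables plotted is an arbitrary choice for illustration and may not match Figure 2.

```r
# Assumes the DAAG package is installed: install.packages("DAAG")
data("ais", package = "DAAG")

# Quick look at the size of the data: rows are athletes, columns are variables
dim(ais)

# Scatter plot matrix of a handful of the numeric variables
# (this subset is an assumption, picked only for illustration)
vars <- c("rcc", "hg", "bmi", "ssf", "pcBfat", "lbm")
pairs(ais[, vars], pch = 19, cex = 0.5)
```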
It looks like the pcBfat (percentage of body fat) vs lbm (lean body mass) variables produce some clustering, so I’m going to use the k-means clustering algorithm on those variables to see what it returns.
Figure 3. Lean Body Mass vs Percentage Body Fat. It seems like there are clusters of data on the bottom and top left
With k-means, you have to supply the algorithm with how many clusters you want beforehand. That can be annoying since we may not know how many clusters exist in the data. What we can do is run k-means several times with different numbers of clusters and then see which number of clusters best fits the data.
Figure 4. Error in the model for different numbers of clusters
We’re looking for the ‘elbow’ in Figure 4, which I would say happens at either two or four clusters. Basically, we want to minimise the number of clusters we use while also minimising the amount of error in the model fit, and the elbow in the graph roughly tells us where those conditions intersect. I’m going to choose two clusters.
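An elbow plot like this can be sketched by running `kmeans()` over a range of cluster counts and recording the fit each time. The sketch assumes that the 'error' is the total within-cluster sum of squares, that the variables are left unscaled, and that 1 to 10 clusters is a sensible range to try; none of these are confirmed as the settings behind Figure 4.

```r
# Assumes the DAAG package is installed: install.packages("DAAG")
data("ais", package = "DAAG")
xy <- ais[, c("lbm", "pcBfat")]

set.seed(42)  # k-means starts from random centres, so fix the seed

# Fit k-means for k = 1..10 and keep the total within-cluster sum of squares
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(xy, centers = k, nstart = 25)$tot.withinss
})

# The 'elbow' is where adding another cluster stops reducing the error much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Using `nstart = 25` reruns each fit from 25 random starts and keeps the best one, which makes the curve less dependent on a single unlucky initialisation.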
So let’s calculate k-means for two clusters.
Figure 5. Lean Body Mass vs Percentage Body Fat with clusters highlighted
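A two-cluster fit and a plot in the spirit of Figure 5 could be produced roughly as follows. This is a sketch under assumptions: the seed, the `nstart` value, the axis orientation and the plotting style are illustrative choices, not the settings used for the figure.

```r
# Assumes the DAAG package is installed: install.packages("DAAG")
data("ais", package = "DAAG")
xy <- ais[, c("lbm", "pcBfat")]

set.seed(42)
fit <- kmeans(xy, centers = 2, nstart = 25)

fit$size     # number of athletes assigned to each cluster
fit$centers  # cluster centres in (lbm, pcBfat) space

# Colour each athlete by the cluster they were assigned to
plot(xy$lbm, xy$pcBfat, col = fit$cluster, pch = 19,
     xlab = "Lean body mass (kg)", ylab = "Percentage body fat")

# Mark the two cluster centres with crosses
points(fit$centers, pch = 4, cex = 2, lwd = 2)
```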
You can see in Figure 5 how the k-means algorithm has distinctly picked out two clusters. We can also use something called the Hopkins statistic to give us an idea of whether the data is clusterable.
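Under the convention used here, a Hopkins statistic well below 0.5 suggests the data is clusterable, while a value close to 0.5 suggests the data is uniformly (randomly) distributed. Be aware that some implementations report the statistic the other way around, with values near 1 indicating clusterable data, so check the documentation of whichever function you use. For the lbm vs pcBfat data I calculated a Hopkins statistic of 0.2, so we can conclude that the data is indeed clusterable.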
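To make the idea concrete, here is a hand-rolled sketch of the Hopkins statistic in base R, using the convention where low values indicate clusterable data. The function name `hopkins_stat`, the sample size `m` and the use of scaled data are all assumptions for illustration. It is not the function used to get the 0.2 above, and because the statistic depends on random sampling it will not necessarily reproduce that number.

```r
# Hopkins statistic, convention: near 0 = clusterable, near 0.5 = uniform.
# hopkins_stat is a hypothetical helper written for this post.
hopkins_stat <- function(x, m = ceiling(nrow(x) / 10)) {
  x <- as.matrix(x)
  n <- nrow(x)

  # m points drawn uniformly at random from the bounding box of the data
  rand <- apply(x, 2, function(col) runif(m, min(col), max(col)))

  # m real observations sampled without replacement
  idx <- sample(n, m)

  # u: distance from each random point to its nearest real observation
  u <- apply(rand, 1, function(pt) min(sqrt(colSums((t(x) - pt)^2))))

  # w: distance from each sampled observation to its nearest other observation
  w <- sapply(idx, function(i) {
    others <- x[-i, , drop = FALSE]
    min(sqrt(colSums((t(others) - x[i, ])^2)))
  })

  sum(w) / (sum(u) + sum(w))
}

# Assumes the DAAG package is installed: install.packages("DAAG")
data("ais", package = "DAAG")
xy <- scale(ais[, c("lbm", "pcBfat")])  # scaling here is an assumption

set.seed(42)
hopkins_stat(xy)
```

In clustered data the real points sit close to their neighbours while random points tend to land in empty space, so `w` is small relative to `u` and the ratio falls towards zero.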
There is also a metric called the intracluster correlation coefficient (ICC), which can tell you how related the data points within each cluster are compared to data points in different clusters. I won’t go into the mechanics of the ICC here but it is something to keep in your back pocket.
K-means is one type of clustering technique, but I now want to quickly look at another popular clustering method called hierarchical clustering. The major benefit of hierarchical clustering is that you don’t need to specify the number of clusters beforehand: the algorithm builds a full tree of nested clusters and you decide where to cut it afterwards.
Let’s use hierarchical clustering on the lbm and pcBfat variables. Note that, since hierarchical clustering works by measuring the distances between data points, it is often a good idea to scale your data before entering it into the clustering algorithm.
Figure 6. Dendrogram of Lean Body Mass vs Percentage Body Fat hierarchical clustering
It’s a bit jumbled, but at the bottom of the dendrogram are the row numbers of each observation. From Figure 6, the optimal number of clusters would seem to be somewhere between two and four, just like in k-means. What we are looking for here is the biggest vertical ‘gap’ between a parent split and its child splits, since a tall gap means the groups being merged at that point are far apart.
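A dendrogram like this can be built with base R's `dist()`, `hclust()` and `cutree()`. The sketch below assumes Euclidean distance and complete linkage (the `hclust()` default); the post does not say which linkage was actually used, so treat those choices as placeholders.

```r
# Assumes the DAAG package is installed: install.packages("DAAG")
data("ais", package = "DAAG")

# Scale both variables so neither dominates the distance calculation
xy <- scale(ais[, c("lbm", "pcBfat")])

d <- dist(xy)                         # Euclidean distances between athletes
hc <- hclust(d, method = "complete")  # linkage method is an assumption

# Draw the dendrogram; leaf labels are the row names of the data
plot(hc, cex = 0.4, xlab = "", sub = "")

# Cut the tree into two groups and count the athletes in each
groups <- cutree(hc, k = 2)
table(groups)
```

Changing `k` in `cutree()` lets you compare the two, three and four cluster solutions without refitting the tree.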
What to Take Away
So we have looked at two common clustering techniques, k-means and hierarchical. K-means works best when your clusters are roughly spherical and you already have a good idea of how many exist. Hierarchical clustering works best when the clusters are nested within each other in a hierarchical fashion. However, hierarchical clustering does not scale well to large data sets; k-means is faster.
There are many more methods to choose from that I’ve not mentioned here, like the expectation-maximisation (EM) algorithm, DBSCAN, k-means++, fuzzy c-means, etc. It’s a big world and there are a lot of options.
At the end of the day, clustering is what you make of it. It probably is not going to give you all the answers but it’s a great way to get your head around how your data is structured. I have glossed over a lot of info in this post but hopefully it will get you started on your own journey into clustering.