- On March 30, 2017
If you were at the 2017 Melbourne Sports Analytics Conference or the 2017 Gold Coast Smartabase User Conference, you would have seen High Performance Director for GWS, David Joyce, argue against a statistic that we all take for granted:
In David’s experience, no athlete is ‘average’. A one-number summary like the group average will never capture the needs of any individual athlete.
— Sam Robertson (@Robertson_SJ) March 8, 2017
I think David is spot on. Nonetheless, his talks got me thinking more about the mathematics underlying the mean (and its alternatives) rather than the individual analysis vs group analysis debate (you might call this the N = 1 problem).
What is best practice for when you actually want to summarise (with one number) the data from a group of athletes? Well, as always in statistics, it depends on the context.
The mean is often called a measure of ‘central tendency’. So given a set of numbers, the mean should represent the ‘centre’ of those numbers … right? Well, any statistician worth their salt will know it is not so simple.
Suppose you are a sports scientist for a rugby league team and you are interested in reporting the average deadlift one-rep max (1RM) of your squad’s front row. A quick calculation finds that the range is 200 kg to 300 kg and the mean is 250 kg.
But then your team signs an absolute behemoth of a human being who has a 1RM of 450 kg (I’m basically imagining this guy). So, you re-calculate your team’s average deadlift 1RM while considering the new guy’s stats and, guess what, the team’s average shoots up to 290 kg!
The problem with the mean is that it is highly influenced by outliers. Evidently, the 1RM of this guy is an outlier — it dragged up the group mean a bit like gravity pulls us down to Earth. The bigger the outlier, the greater the pull on the mean.
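To make that pull concrete, here is a minimal Python sketch. The four individual lifts are hypothetical: they are chosen only so that they match the range (200 kg to 300 kg) and the mean (250 kg) quoted above, since the real squad numbers are not given.

```python
from statistics import mean

# Hypothetical 1RMs in kg, picked to span 200-300 kg with a mean of 250 kg.
front_row = [200, 240, 260, 300]
new_signing = 450  # the outlier from the example above

before = mean(front_row)
after = mean(front_row + [new_signing])

print(f"Mean before the signing: {before} kg")
print(f"Mean after the signing:  {after} kg")
print(f"Shift caused by one athlete: {after - before} kg")
```

With these assumed values, a single athlete moves the group mean by 40 kg, even though nobody else in the group got any stronger.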
If your aim is to find the ‘typical’ observation, the mean can be misleading because it is so susceptible to huge (or tiny) values. For the same reasons, skewed distributions will pull the mean in the direction of the skew.
Where is the Middle?
On the other hand, another measure of central tendency still works pretty well when there are outliers / skew. It’s the robust cousin of the mean, the median.
Let’s take the following group of numbers: 1, 20, 38, 41, 64, 80 and 500. A quick calculation shows me the mean is about 106, but I don’t even need my calculator to see that the median is 41.
The median is the point in your data at which half of the numbers lie above it and half lie below. So in this case, three numbers lie above 41 (64, 80 and 500) and three lie below 41 (1, 20 and 38).
Also note the large difference between the calculated mean and median values: 106 vs 41. The problem is that there is an outlier, the number 500, which is inflating the mean. If we take out 500 from the data set, the mean and median would be much closer: 40.7 vs 39.5.
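If you would rather check those figures than take my word for it, the short sketch below reproduces them with Python’s built-in statistics module. The variable names are just illustrative.

```python
from statistics import mean, median

data = [1, 20, 38, 41, 64, 80, 500]

# Same data with the outlier (500) removed.
without_outlier = [x for x in data if x != 500]

print("With 500:    mean =", round(mean(data), 1), "median =", median(data))
print("Without 500: mean =", round(mean(without_outlier), 1),
      "median =", median(without_outlier))
```

Removing one value moves the mean from roughly 106 to roughly 41, while the median barely shifts (41 to 39.5). That stability is what ‘robust’ means here.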
Skewed distributions have the same effect. To illustrate, Figure 1 shows three different types of distributions: one with a long left-tail, one with a long right-tail, and the trusty normal distribution in the middle. Within each distribution, the mean is marked with a solid line and the median with a dotted line.
See how the mean and the median sit on top of each other in the normal distribution but don’t in the skewed distributions? In the skewed cases, the mean is drawn towards the direction of the skew.
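You can see the same behaviour with simulated data. The sketch below is only an illustration and is not the data behind Figure 1: it assumes an exponential distribution as a stand-in for a right-skewed variable, mirrors it to get a left-skewed one, and uses a standard normal for the symmetric case.

```python
import random
from statistics import mean, median

rng = random.Random(2017)  # fixed seed so the run is repeatable
n = 10_000

# Assumed stand-ins for the three shapes discussed above.
right_skewed = [rng.expovariate(1.0) for _ in range(n)]  # long right tail
left_skewed = [-x for x in right_skewed]                 # mirror image: long left tail
symmetric = [rng.gauss(0.0, 1.0) for _ in range(n)]      # normal distribution

samples = [
    ("left-skewed", left_skewed),
    ("normal", symmetric),
    ("right-skewed", right_skewed),
]
for name, sample in samples:
    print(f"{name:>12}: mean = {mean(sample):+.2f}, median = {median(sample):+.2f}")
```

In the skewed samples the mean sits further out along the long tail than the median does; in the normal sample the two are nearly identical.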
If your data is skewed, or there are large outliers, then use the median to find the centre of the data. Better yet, report both the mean and the median since any differences will reveal information about the presence of skew/outliers. If the data is normally distributed (or even just non-skewed), feel free to use the mean. The mean is easier to communicate and so if you can use it, use it.
A more subtle rule: if you are more concerned with the total sum, rather than the typical value, use the mean. For instance, if you have a salary cap and you are interested in the average salary of your players, use the mean. In this case, the mean is biased towards the high earners, and you really care about the high earners because they are the ones who are eating up your salary cap.
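The reason is simple arithmetic: the mean multiplied by the number of players gives you back the total, and the median does not. A quick sketch with made-up salaries (hypothetical figures, not from any real club) shows the gap.

```python
from statistics import mean, median

# Hypothetical player salaries: four regular contracts and one marquee signing.
salaries = [150_000, 180_000, 200_000, 220_000, 1_000_000]

total_spend = sum(salaries)
from_mean = mean(salaries) * len(salaries)      # recovers the true total
from_median = median(salaries) * len(salaries)  # ignores the marquee contract

print("Actual spend against the cap:", total_spend)
print("Mean x number of players:    ", from_mean)
print("Median x number of players:  ", from_median)
```

With these assumed numbers, budgeting from the median would understate the cap spend by 750,000, which is exactly the kind of surprise you do not want.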
A more controversial rule: do not use the mean with Likert scale data. Let’s say that in a survey with a 1-5 scale of Very Bad, Bad, Neutral, Good and Very Good categories, the mean result across many participants came out to be 3.5. But … what does 3.5 even mean in this context? Halfway between Neutral and Good? Neutood? In terms of best practice, use the median when describing the centre of Likert data. Some may even argue for only using the mode on Likert data (note to self: future blog post). To be fair, Likert scale analysis is what statisticians fight about over drinks.
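Here is a rough sketch of what that looks like in practice. The eight responses are invented purely so that the mean lands on 3.5, and the use of median_low (which always returns a value that actually appears in the data) is my own choice for this illustration rather than a settled convention.

```python
from statistics import mean, median_low, mode

labels = {1: "Very Bad", 2: "Bad", 3: "Neutral", 4: "Good", 5: "Very Good"}

# Hypothetical survey responses on the 1-5 scale.
responses = [1, 3, 3, 4, 4, 4, 4, 5]

avg = mean(responses)            # falls between two categories
middle = median_low(responses)   # always one of the observed responses
most_common = mode(responses)    # the single most frequent response

print("Mean:  ", avg, "(no matching category)")
print("Median:", middle, "->", labels[middle])
print("Mode:  ", most_common, "->", labels[most_common])
```

Both the median and the mode map straight back to a category a respondent could actually have ticked, which makes them much easier to report than a 3.5.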
In short, the mean isn’t evil — just be aware of its limitations.
At Fusion Sport, we work with large quantities of data from all different kinds of athletes on a daily basis. Interested in how we deliver the best and most accurate athlete analysis to our clients across the globe? SMARTABASE has the answers for you.