- On May 11, 2017
- analysis, data, model, R, values
Data analysts spend most of their time cleaning data.
Data with unexplainable outliers, duplicate entries, erratic recording frequencies, inconsistent units, misspellings or data entry errors can make it difficult to perform a good analysis, especially if you are trying to build an automated data pipeline.
But a dataset with more than 30% missing values will simply ruin an analyst’s day.
Missing data is a serial pest for analysts working in human performance (or the social sciences). The problem with missing values is that it is hard to explain why the data is missing.
Because … it’s missing! It’s difficult to model an effect you can’t measure.
The problems with missing data
1) Missing values reduce the size of your data
This can have severe consequences if you have an already-small data set.
Take Table 1. If you wanted to use linear regression to investigate the relationship between haemoglobin concentration and red blood cell count, athletes 2, 3, 4, 6 and 8 would be removed by default, because most regression routines drop any row that contains a missing value. That turns small data into tiny data, reducing your statistical power and your ability to draw meaningful conclusions.
Table 1: Haemoglobin concentration and red blood cell count
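To see how quickly default row removal shrinks a small dataset, here is a minimal sketch in Python. The values are invented and the total of eight athletes is an assumption; only the pattern of missingness (athletes 2, 3, 4, 6 and 8) follows the description above.

```python
# Hypothetical stand-in for Table 1. None marks a missing value.
athletes = [
    {"athlete": 1, "haemoglobin": 14.8, "rbc": 4.9},
    {"athlete": 2, "haemoglobin": None, "rbc": 4.6},
    {"athlete": 3, "haemoglobin": 15.1, "rbc": None},
    {"athlete": 4, "haemoglobin": None, "rbc": None},
    {"athlete": 5, "haemoglobin": 13.9, "rbc": 4.4},
    {"athlete": 6, "haemoglobin": None, "rbc": 5.0},
    {"athlete": 7, "haemoglobin": 15.4, "rbc": 5.2},
    {"athlete": 8, "haemoglobin": 14.2, "rbc": None},
]


def complete_cases(rows, columns):
    """Keep only rows that have a value in every listed column."""
    return [row for row in rows if all(row[col] is not None for col in columns)]


kept = complete_cases(athletes, ["haemoglobin", "rbc"])
kept_ids = {row["athlete"] for row in kept}
dropped = [row["athlete"] for row in athletes if row["athlete"] not in kept_ids]

print(f"{len(kept)} of {len(athletes)} athletes remain; dropped: {dropped}")
# 3 of 8 athletes remain; dropped: [2, 3, 4, 6, 8]
```

A regression fitted to these three remaining rows would have very little to work with.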
2) Missing values mislead
Look again at Table 1: younger athletes tend to have more missings. If a linear model drops every observation with a missing value, younger athletes will be underrepresented in its results.
In short, missingness can become an extra effect that you have to account for in your models. All too often though, the ill effects of missingness pass by unnoticed.
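One quick way to spot this kind of pattern is to compare the rate of missingness between groups. The sketch below is in Python with invented ages and values; the cut-off at age 20 is an arbitrary assumption chosen for illustration.

```python
from collections import defaultdict

# Hypothetical rows of (age, haemoglobin). None marks a missing value.
rows = [(17, None), (18, None), (18, 14.1), (19, None),
        (24, 15.0), (26, 14.6), (27, None), (29, 15.3)]


def missing_rate_by_group(rows, group_of):
    """Return {group: share of rows whose measurement is missing}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for age, value in rows:
        group = group_of(age)
        counts[group][0] += value is None
        counts[group][1] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


rates = missing_rate_by_group(
    rows, lambda age: "under 20" if age < 20 else "20 and over"
)
print(rates)  # {'under 20': 0.75, '20 and over': 0.25}
```

A gap this large between groups suggests that missingness is related to age, so dropping incomplete rows would tilt the sample towards older athletes.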
The structure of missingness: MCAR, MAR, MNAR
The ideal situation is when your data is Missing Completely At Random (MCAR). This is when there is no relationship between the missingness of the data and the data itself. It basically means you can remove or replace rows that contain missing observations without too much worry.
The second category is Missing At Random (MAR). This is when the missingness is related to other observed variables in the data, but not to the missing values themselves. So going back to Table 1, missingness appears to depend on age. If there is no relationship between age and haemoglobin concentration (or red blood cell count), it may be OK to remove or replace the missing observations without too much worry; if there is one, age needs to be accounted for in the model.
The worst category is Missing Not At Random (MNAR), i.e. when the missingness relates to the values of the variable(s) of interest. So, say that you are investigating the ‘Soreness’ variable of a wellness survey but you find out that the athletes who are particularly sore seem to be skipping that question (perhaps because they are afraid that they will be put on the bench if the coach finds out how sore they are). We would need to account for this effect in our models, which is not always an easy thing to do.
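The soreness example can be made concrete with a few invented numbers. This Python sketch assumes, purely for illustration, that every athlete scoring 8 or more skips the question.

```python
from statistics import mean

# Hypothetical true soreness scores (1 = fine, 10 = very sore) for ten athletes.
true_soreness = [2, 3, 3, 4, 5, 5, 6, 8, 9, 9]

# MNAR assumption: athletes scoring 8 or more leave the question blank.
reported = [score for score in true_soreness if score < 8]

print(f"true mean:     {mean(true_soreness):.1f}")  # true mean:     5.4
print(f"reported mean: {mean(reported):.1f}")       # reported mean: 4.0
```

The squad looks less sore than it really is, and nothing in the reported data alone reveals the gap.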
Key takeaway: plan how you are going to minimise missings before you collect the data.
Using R to understand the structure of missingness
One of my favourite ways to get a handle on missingness comes from this paper: Using decision trees to understand structure in missing data. Decision trees are a well-known non-linear statistical procedure for detecting structure in data.
As an example, I have some training data of some Australian track athletes. Variables include things like distance ran, power, heart rate, training location, and some training load variables.
I then sum the number of missings per row and divide this number by the number of variables per row. This gives me the percentage of variables that are missing in each row, which I make into a new variable. I then pool missingness below 5% into a category called ‘cat1_low’, missingness between 5% and 15% into ‘cat2_med’, and missingness above 15% into ‘cat3_high’.
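The row-level calculation and binning can be sketched as follows. This is Python rather than R, and the variable names and values are invented; only the 5% and 15% cut-offs and the category labels come from the description above. Treating exactly 5% and exactly 15% as medium is an assumption.

```python
def percent_missing(row):
    """Percentage of a row's variables that are missing (None)."""
    return 100 * sum(value is None for value in row.values()) / len(row)


def missingness_category(pct):
    """Bin a percentage using the 5% and 15% cut-offs."""
    if pct < 5:
        return "cat1_low"
    if pct <= 15:
        return "cat2_med"
    return "cat3_high"


# Hypothetical variable names for a training session.
variables = ["distance", "power", "heart_rate", "location", "time_of_day",
             "duration", "load_acute", "load_chronic", "rpe", "cadence"]

# Hypothetical sessions. None marks a missing value.
sessions = [
    dict(zip(variables, [5200, 310, 152, "Location 1", "morning", 42, 410, 380, 6, 88])),
    dict(zip(variables, [900, None, 139, "Location 3", "evening", 12, 150, 360, 4, 85])),
    dict(zip(variables, [None, None, None, "Location 7", "evening", 30, None, 300, 5, None])),
]

categories = [missingness_category(percent_missing(s)) for s in sessions]
print(categories)  # ['cat1_low', 'cat2_med', 'cat3_high']
```

The three sessions have 0%, 10% and 50% of their variables missing, so each lands in a different category.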
I then feed the data into the decision tree algorithm, where the missingness variable is set as the output variable. The goal here is to get the algorithm to predict which situations lead to low, medium and high missingness. You can see the results in Figure 1.
Figure 1. Decision tree to predict the percentage of missings per row of data. Missingness is categorised as either low, medium or high.
The way the plot works is that at the top of the tree are the important predictors, and at the bottom are the predictions. What this plot is saying is that time of training, distance ran, and training location are the most important variables for predicting the percentage of missings in each row of the data.
If you follow the tree from top to bottom you will find the contexts in which low, medium or high missingness is most likely to occur (the percentages represent how much of the data resides in that node; i.e., the percentages of the child nodes should sum to that of their parent node).
One conclusion you could make from this graph is that if the training time is in the morning or afternoon, and the distance ran is equal to or greater than 1007m, then there is likely to be low missingness.
If, however, distance ran is less than 1007m, and the training location is not Location 10, then it is more likely that there will be medium missingness in the observation.
Likewise, if the training time is not early morning or afternoon, and the training location is not 1, 2, 5, 6, 9, 10, 12, 13, 14 or 16, then it is more likely that there will be high missingness.
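To give a feel for how a tree decides which predictor goes at the top, the sketch below finds a single best split by Gini impurity, one common splitting criterion. It is written in Python rather than R, the data is invented, and it is a deliberate simplification: it only tests "predictor equals value" on categorical variables and stops after one split, whereas a full tree algorithm also handles numeric thresholds and keeps splitting the resulting groups. It is not the procedure used to produce Figure 1.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 when all labels agree, larger when they are mixed."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, target, predictors):
    """Return (weighted impurity, predictor, value) for the best equality test."""
    best = None
    for predictor in predictors:
        for value in sorted({row[predictor] for row in rows}):
            left = [r[target] for r in rows if r[predictor] == value]
            right = [r[target] for r in rows if r[predictor] != value]
            if not left or not right:
                continue  # a split must send rows both ways
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, predictor, value)
    return best


# Hypothetical sessions with their missingness category.
sessions = [
    {"time_of_day": "morning", "location": "Location 1", "missingness": "cat1_low"},
    {"time_of_day": "morning", "location": "Location 2", "missingness": "cat1_low"},
    {"time_of_day": "afternoon", "location": "Location 1", "missingness": "cat1_low"},
    {"time_of_day": "afternoon", "location": "Location 2", "missingness": "cat1_low"},
    {"time_of_day": "evening", "location": "Location 1", "missingness": "cat3_high"},
    {"time_of_day": "evening", "location": "Location 2", "missingness": "cat3_high"},
]

score, predictor, value = best_split(sessions, "missingness", ["time_of_day", "location"])
print(f"split on {predictor} == {value!r}, weighted Gini {score:.2f}")
```

In this toy data, splitting on whether the session was in the evening separates high from low missingness perfectly, so time of day would sit at the top of the tree and location would add nothing.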
I find that creating these ‘Missings Trees’ is a great way to provide actionable insights into how to reduce missing data. If some of the technical details flew over your head, not to worry, it takes a bit of getting used to. Don’t hesitate to contact us at Fusion Sport for a more thorough explanation.
Header image: Caleb Roenigk