Do you have missing data in your dataset? … Don’t ignore it.

Missing data is a common issue in statistical analysis, and it can bias results or make them inaccurate. In this series of posts, we explore the different missing data mechanisms to understand how to identify them and address the issue.

Previously On…

Previously, we discussed the concept of Missing Completely at Random (MCAR), in which missingness is unrelated to any of the data values, observed or unobserved. Now, let’s delve into Missing at Random (MAR).

In a Couple of Words

Example

Consider a study where researchers are investigating the relationship between income and health status. They collect data from a sample of participants and ask them to provide their annual income and self-reported health status on a scale from 1 to 10. However, due to privacy concerns, some participants choose not to disclose their income.

Example of MAR Data.
Participant  Income  Health_Status
1            45000   7
2            62000   6
3            NA      4
4            55000   8
5            38000   5
6            NA      9

In this example, the missing income data is related to the participants’ health status. Those with lower health status scores (like Participant 3) might be less inclined to disclose their income, while those with higher scores might be more comfortable sharing it. The relationship is a tendency rather than a rule: Participant 6 reports a health status of 9 and still leaves income blank. The missingness of income depends on an observed variable (health status), but once health status is taken into account, it does not depend on the missing income values themselves.

This scenario demonstrates “Missing At Random” (MAR): the probability that income is missing can be explained by other observed variables, which makes it possible to handle the missingness with statistical techniques that account for these relationships. MAR is a weaker assumption than MCAR, so it covers more situations and is often more realistic.
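To make the mechanism concrete, here is a minimal simulation sketch using only the Python standard library. All numbers in it are assumptions chosen for illustration: the income distribution, the sample size, and the hypothetical rule linking the probability of a missing income to health status. The key property is that the missingness rule looks only at health status, never at income itself.

```python
import random


def simulate_mar(n=10_000, seed=0):
    """Simulate income/health data where income is missing at random (MAR)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        health = rng.randint(1, 10)         # always observed, scale 1 to 10
        income = rng.gauss(50_000, 10_000)  # true income (illustrative distribution)
        # Hypothetical rule: the chance of withholding income falls as health rises.
        # It uses health only, so missingness never depends on income directly.
        p_missing = 0.6 - 0.05 * health     # 0.55 at health=1, 0.10 at health=10
        observed_income = None if rng.random() < p_missing else income
        rows.append((health, income, observed_income))
    return rows


def missing_rate(rows, lo, hi):
    """Share of rows with missing income among health scores in [lo, hi]."""
    group = [r for r in rows if lo <= r[0] <= hi]
    return sum(r[2] is None for r in group) / len(group)


rows = simulate_mar()
low_rate = missing_rate(rows, 1, 5)
high_rate = missing_rate(rows, 6, 10)
print(f"missing rate, health 1-5:  {low_rate:.2f}")
print(f"missing rate, health 6-10: {high_rate:.2f}")
```

Under these assumptions, the share of missing incomes should be noticeably higher in the low-health group than in the high-health group, which is the kind of pattern that points toward MAR rather than MCAR.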

Graphical representations of (a) missing completely at random (MCAR), (b) missing at random (MAR), and (c) missing not at random (MNAR) in a univariate missing-data pattern. X represents variables that are completely observed, Y represents a variable that is partly missing, Z represents the component of the causes of missingness unrelated to X and Y, and R represents the missingness.
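Using the symbols from the caption, the three mechanisms can be sketched as statements about what the missingness R is allowed to depend on. This is one common way to write them, not necessarily the exact notation of the original figure. In the income example, X would be health status, Y would be income, and R would indicate whether income is missing.

```latex
\begin{align*}
\text{MCAR:} \quad & P(R \mid X, Y, Z) = P(R \mid Z) \\
\text{MAR:}  \quad & P(R \mid X, Y, Z) = P(R \mid X, Z) \\
\text{MNAR:} \quad & P(R \mid X, Y, Z) \text{ still depends on } Y \text{ after conditioning on } X \text{ and } Z
\end{align*}
```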