Do you have missing data in your dataset? … Don’t ignore it.
Missing data is a common issue in statistical analysis, and it can cause bias or inaccurate results. In these posts, we will explore various types of missing data mechanisms to understand how to identify them and address the issue.
Previously, we discussed Missing Completely at Random (MCAR), where the probability that a value is missing is unrelated to any characteristics of the data, observed or unobserved. Now, let’s delve into Missing at Random (MAR).
Consider a study where researchers are investigating the relationship between income and health status. They collect data from a sample of participants and ask them to provide their annual income and self-reported health status on a scale from 1 to 10. However, due to privacy concerns, some participants choose not to disclose their income.
| Participant | Income | Health_Status |
|---|---|---|
| 1 | 45000 | 7 |
| 2 | 62000 | 6 |
| 3 | NA | 4 |
| 4 | 55000 | 8 |
| 5 | 38000 | 5 |
| 6 | NA | 9 |
Suppose the missing income data is related to the participants’ health status: those with lower health status scores (like Participant 3, who scored 4) might be less inclined to disclose their income, while those with higher scores (like Participant 4, who scored 8) might be more comfortable sharing it. The relationship is probabilistic rather than deterministic, which is why a participant with a high score (like Participant 6, who scored 9) can still have a missing income. The missingness of income depends on the observed variable (health status), but once health status is taken into account, it does not depend on the unobserved income values themselves.
This scenario demonstrates “Missing At Random” (MAR): the probability of income being missing can be explained by other observed variables, which makes it possible to handle the missingness with statistical techniques that account for these relationships. MAR is a weaker assumption than MCAR (MCAR is the special case in which missingness depends on nothing at all), so it is more general and usually more realistic.
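To make the mechanism concrete, here is a minimal simulation sketch of the income and health example. Everything numeric in it is an assumption chosen for illustration, not a value from the study: the link between health and income, the missingness probabilities (0.6 for health scores of 5 or below, 0.1 otherwise), and the sample size. The key point is that the probability of a missing income is computed from health status only and never looks at the income value.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_mar(n=10_000):
    """Simulate (health, true_income, observed_income) with income MAR given health."""
    records = []
    for _ in range(n):
        health = random.randint(1, 10)  # always observed
        # Hypothetical link (assumption): healthier participants earn more on average.
        income = random.gauss(30_000 + 3_000 * health, 8_000)
        # MAR mechanism (assumed probabilities): depends on health only, never on income.
        p_missing = 0.6 if health <= 5 else 0.1
        observed = None if random.random() < p_missing else income
        records.append((health, income, observed))
    return records


def missing_rate(rows):
    """Fraction of rows whose observed income is missing."""
    return sum(obs is None for _, _, obs in rows) / len(rows)


records = simulate_mar()
low = [r for r in records if r[0] <= 5]
high = [r for r in records if r[0] > 5]

# The true mean is only available because this is a simulation.
true_mean = mean(inc for _, inc, _ in records)

# Naive approach: drop rows with missing income and average the rest.
complete_case_mean = mean(obs for _, _, obs in records if obs is not None)

# Adjustment using the observed variable: average the observed incomes within
# each health score, then weight each score by its number of participants.
by_health = {}
for health, _, obs in records:
    by_health.setdefault(health, []).append(obs)
adjusted_mean = sum(
    len(vals) * mean(v for v in vals if v is not None)
    for vals in by_health.values()
) / len(records)

print(f"missing rate, health <= 5: {missing_rate(low):.2f}")
print(f"missing rate, health  > 5: {missing_rate(high):.2f}")
print(f"true mean income:          {true_mean:,.0f}")
print(f"complete-case mean income: {complete_case_mean:,.0f}")
print(f"health-adjusted mean:      {adjusted_mean:,.0f}")
```

Under these assumptions, simply dropping the incomplete rows overstates average income, because participants with low health scores (and, in this sketch, lower incomes) are underrepresented among those who answered. Averaging within each health score and reweighting recovers an estimate much closer to the truth, which is what "handling missingness through the observed variables" means in practice.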