Do you have missing data in your dataset? Don’t ignore it.
Missing data is a common issue in statistical analysis, and it can cause bias or inaccurate results. In this series of posts, we explore the different missing data mechanisms to understand how to identify each one and address it.
Previously, we discussed the Missing At Random (MAR) mechanism and how it is linked to certain characteristics of the observed data. Now, let’s examine the most challenging mechanism to handle, Missing Not At Random (MNAR).
Imagine a study that aims to understand the relationship between income and job satisfaction. Participants are asked to provide their annual income and rate their job satisfaction on a scale from 1 to 10. In this scenario, whether income is missing is not only related to participants’ job satisfaction, which we observe, but also depends on the actual income itself, which we do not observe for nonrespondents. For example, participants with higher incomes might be less willing to disclose their income, whatever their job satisfaction level.
```r
# Example of MNAR Data
library(knitr)

data_mnar <- data.frame(
  Participant = 1:6,
  Income = c(60000, NA, 48000, NA, 55000, NA),
  Job_Satisfaction = c(8, 6, 5, 9, 7, 4)
)

kable(data_mnar, caption = "Example of MNAR Data.")
```
| Participant | Income | Job_Satisfaction |
|---|---|---|
| 1 | 60000 | 8 |
| 2 | NA | 6 |
| 3 | 48000 | 5 |
| 4 | NA | 9 |
| 5 | 55000 | 7 |
| 6 | NA | 4 |
In this example, whether income is missing depends on the actual income itself, possibly in addition to job satisfaction. Participants with higher incomes might choose not to disclose their income whatever their job satisfaction, while those with lower incomes might be more inclined to share it. The incomes behind the `NA` values are never seen, so this dependence cannot be read off the table.
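As a rough illustration, the sketch below summarises the missingness in the toy data using base R only. The helper names `income_missing` and `satisfaction_by_missing` are introduced here for illustration and are not part of the original example.

```r
# Recreate the toy data so this snippet runs on its own
data_mnar <- data.frame(
  Participant = 1:6,
  Income = c(60000, NA, 48000, NA, 55000, NA),
  Job_Satisfaction = c(8, 6, 5, 9, 7, 4)
)

# Logical indicator: TRUE where income was not reported
income_missing <- is.na(data_mnar$Income)

# How many incomes are missing, and what share of the sample is that?
n_missing <- sum(income_missing)
prop_missing <- mean(income_missing)

# Mean job satisfaction for respondents (FALSE) and nonrespondents (TRUE)
satisfaction_by_missing <- tapply(
  data_mnar$Job_Satisfaction, income_missing, mean
)

print(n_missing)
print(prop_missing)
print(satisfaction_by_missing)
```

In this tiny sample, the mean job satisfaction is similar in the two groups (about 6.7 for respondents and 6.3 for nonrespondents), so the observed variable gives little hint about why income is missing. More generally, a comparison like this can only reveal a link between missingness and observed variables; it cannot show whether missingness depends on the unreported incomes.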
This type of missingness is referred to as “Missing Not At Random” (MNAR) because the missingness is not solely dependent on observed variables like job satisfaction; it’s influenced by the values of the missing data (income) itself. Handling MNAR data can be more challenging since the missingness mechanism is not fully explained by observable factors, making it necessary to carefully consider the potential biases introduced by the missing data in any analysis.
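To make the bias concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the sample size, the normal income distribution, and the logistic rule under which the chance of withholding income rises with income. None of these values come from a real study.

```r
set.seed(123)

n <- 10000

# Hypothetical "true" incomes; in real data these would be partly unobserved
income <- rnorm(n, mean = 55000, sd = 10000)

# Assumed MNAR rule: the probability of withholding income
# increases with the income value itself
p_missing <- plogis((income - 55000) / 5000)
is_missing <- rbinom(n, size = 1, prob = p_missing) == 1

# What the analyst actually sees
income_observed <- ifelse(is_missing, NA, income)

true_mean <- mean(income)
observed_mean <- mean(income_observed, na.rm = TRUE)

# Negative bias: high earners are under-represented among respondents
bias <- observed_mean - true_mean

print(round(c(true_mean = true_mean,
              observed_mean = observed_mean,
              bias = bias)))
```

Under these assumptions, the mean of the reported incomes falls below the true mean, because the people most likely to withhold their income are the ones who earn the most. Simply dropping the incomplete rows would therefore understate average income.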