Introduction to Missing Data (8/22/2012)

Types of Missing Data

Missing completely at random (MCAR)
- Administrative errors/accidents
Missing at random (MAR) (“Although missing at random (MAR) is a non-testable assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation models”)
- Missingness related to known patient characteristics, time or place (“MAR on \( x \)”), or to the outcome (“MAR on \( y \)”)
Informative missing (non-ignorable non-response)
- Missingness related to value of predictor or characteristics not available in the analysis
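
The three mechanisms can be stated compactly in probability notation. The symbols below are assumptions introduced for illustration and do not appear in these notes: \( R \) is the indicator of which values are missing, \( Z_{\text{obs}} \) is the observed part of the data (\( x \) and \( y \) together), and \( Z_{\text{mis}} \) is the unobserved part.

\[
\begin{aligned}
\text{MCAR:} \quad & \Pr(R \mid Z_{\text{obs}}, Z_{\text{mis}}) = \Pr(R) \\
\text{MAR:} \quad & \Pr(R \mid Z_{\text{obs}}, Z_{\text{mis}}) = \Pr(R \mid Z_{\text{obs}}) \\
\text{Informative:} \quad & \Pr(R \mid Z_{\text{obs}}, Z_{\text{mis}}) \text{ still depends on } Z_{\text{mis}}
\end{aligned}
\]

Under this reading, MAR is untestable because checking it would require the values in \( Z_{\text{mis}} \), which are by definition unavailable.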

Prelude to Modeling

Quantify extent of missing data
Characterize types of subjects with missing data
Find sets of variables missing on same subjects
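
The three steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the toy records, the variable names (`age`, `sbp`, `chol`, `ldl`), and the use of `None` as the missing-value code are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 54, "sbp": 130, "chol": 210, "ldl": 120},
    {"age": 61, "sbp": None, "chol": 190, "ldl": 105},
    {"age": 47, "sbp": 118, "chol": None, "ldl": None},
    {"age": 72, "sbp": 141, "chol": None, "ldl": None},
    {"age": 66, "sbp": None, "chol": 230, "ldl": 150},
    {"age": 39, "sbp": 122, "chol": 180, "ldl": 100},
]
variables = ["age", "sbp", "chol", "ldl"]
n = len(records)

# 1. Quantify extent: fraction of records missing each variable.
frac_missing = {v: sum(r[v] is None for r in records) / n for v in variables}

# 2. Find sets of variables missing on the same subjects:
#    count each distinct missingness pattern.
patterns = Counter(
    tuple(v for v in variables if r[v] is None) for r in records
)

# 3. Characterize subjects with missing data:
#    compare mean age by whether chol is missing.
age_chol_missing = mean([r["age"] for r in records if r["chol"] is None])
age_chol_observed = mean([r["age"] for r in records if r["chol"] is not None])

print(frac_missing)
print(patterns)
print(age_chol_missing, age_chol_observed)
```

In this toy data `chol` and `ldl` are always missing together, which is the kind of pattern the third step of the prelude is meant to reveal.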

Missing Values for Different Response Variables

Serial data with subjects dropping out (not covered in this course)
\( y \)=time to event, follow-up curtailed: covered under survival analysis (White and Royston provide a method for multiply imputing missing covariate values using censored survival time data.)
Often discard observations with completely missing \( y \) but sometimes wasteful
Characterize missings in \( y \) before dropping obs.

Problems with simple alternatives to imputation

Deletion of records—

Badly biases parameter estimates when the probability of a case being incomplete is related to \( y \) and not just \( x \)
Deletion because of a subset of \( x \) being missing always results in inefficient estimates
Deletion of records with missing \( y \) can result in biases but is the preferred approach under MCAR (Multiple imputation of \( y \) in that case does not improve the analysis and assumes the imputation model is correct.)
However, von Hippel found advantages to a “use all variables to impute all variables then drop observations with missing \( y \)” approach
Only discard obs. when
- MCAR can be justified
- Rarely missing predictor of overriding importance that can't be imputed from other data
- Fraction of obs. with missings small and \( n \) is large
No advantage of deletion except savings of analyst time
Making up missing data better than throwing away real data
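
A small simulation can illustrate the first point about deletion. This is a sketch under assumed settings, not part of the original notes: the true model \( y = 1 + 2x + \text{noise} \), the deletion probabilities, the sample size, and the helper names `fit_line` and `complete_case_fit` are all hypothetical choices. Records are deleted under three mechanisms (completely at random, depending only on \( x \), depending on \( y \)) and the complete-case slope is compared with the true value of 2.

```python
import random

def fit_line(pairs):
    """Closed-form least squares of y on x; returns (slope, intercept)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def complete_case_fit(data, p_incomplete, rng):
    """Delete each record with probability p_incomplete(x, y), then fit."""
    kept = [(x, y) for x, y in data if rng.random() >= p_incomplete(x, y)]
    return fit_line(kept)

rng = random.Random(2012)  # fixed seed so the run is reproducible

# Assumed true model: y = 1 + 2x + standard normal noise.
data = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    data.append((x, 1 + 2 * x + rng.gauss(0, 1)))

# Deletion unrelated to anything (MCAR).
fit_mcar = complete_case_fit(data, lambda x, y: 0.5, rng)
# Deletion related to x only.
fit_on_x = complete_case_fit(data, lambda x, y: 0.9 if x > 0 else 0.0, rng)
# Deletion related to y.
fit_on_y = complete_case_fit(data, lambda x, y: 0.9 if y > 1 else 0.0, rng)

for label, (slope, intercept) in [("MCAR", fit_mcar),
                                  ("depends on x", fit_on_x),
                                  ("depends on y", fit_on_y)]:
    print(f"{label:13s} slope={slope:.3f} intercept={intercept:.3f}")
```

Under these assumptions the slope should stay close to 2 for the first two mechanisms and fall noticeably below 2 when deletion depends on \( y \). The first two fits still discard a large share of the records, which is the inefficiency noted above.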

Adding extra categories of categorical predictors—

Including missing data but adding a category “missing” causes serious biases
Problem acute when values missing because subject too sick
Difficult to interpret
Fails even under MCAR
May be OK if values are “missing” because of “not applicable”, e.g., you have a measure of marital happiness, dichotomized as high or low, but your sample contains some unmarried people. OK to have a 3-category variable with values high, low, and unmarried.

Likewise, serious problems are caused by setting missing continuous
predictors to a constant (e.g., zero) and adding an indicator variable
to try to estimate the effect of missing values.
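
The constant-plus-indicator problem can be seen in a simulation even when values are missing completely at random. The sketch below rests on assumptions that are not in the original notes: two predictors with correlation 0.7, true coefficients of 1 on each, half of the \( x_1 \) values deleted at random, and a hand-rolled least-squares helper `ols` written for this example so that no external library is needed.

```python
import random

def ols(rows, ys):
    """Least squares via the normal equations and Gauss-Jordan elimination.
    rows: list of predictor lists; an intercept column is prepended.
    Returns [intercept, coefficient_1, coefficient_2, ...]."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, ys))] for i in range(k)]
    for c in range(k):
        # Partial pivoting, then clear column c from every other row.
        pivot = max(range(c, k), key=lambda row: abs(A[row][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(2012)  # fixed seed so the run is reproducible
rho = 0.7                  # assumed correlation between x1 and x2
x1s, x2s, ys, miss = [], [], [], []
for _ in range(20000):
    x1 = rng.gauss(0, 1)
    x2 = rho * x1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
    x1s.append(x1)
    x2s.append(x2)
    ys.append(x1 + x2 + rng.gauss(0, 1))  # true coefficients: 1 and 1
    miss.append(rng.random() < 0.5)       # x1 missing completely at random

# Complete-case analysis: drop records with missing x1.
cc = ols([[a, b] for a, b, m in zip(x1s, x2s, miss) if not m],
         [y for y, m in zip(ys, miss) if not m])

# Set missing x1 to zero and add a missingness indicator.
ind = ols([[0.0 if m else a, float(m), b]
           for a, b, m in zip(x1s, x2s, miss)], ys)

print("complete case: x1=%.3f x2=%.3f" % (cc[1], cc[2]))
print("indicator:     x1=%.3f missing=%.3f x2=%.3f" % (ind[1], ind[2], ind[3]))
```

Under these assumptions the complete-case coefficients should land near 1, while the indicator approach should understate the \( x_1 \) coefficient and overstate the \( x_2 \) coefficient: in the records where \( x_1 \) was zeroed out, \( x_2 \) picks up part of the effect of its correlated partner.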