Introduction to Missing Data (8/22/2012)
Types of Missing Data
- Missing completely at random (MCAR)
- Administrative errors/accidents
- Missing at random (MAR) (“Although missing at random (MAR) is a non-testable assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation models”)
- Missingness related to known patient characteristics, time or place (“MAR on \( x \)”), or to the outcome (“MAR on \( y \)”)
- Informative missing (non-ignorable non-response)
- Missingness related to value of predictor or characteristics not available in the analysis
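The three mechanisms above can be illustrated with a small simulation. This is only a sketch under assumed settings: the variable names (`x` for an always-observed characteristic, `z` for a predictor subject to missingness), the logistic missingness probabilities, and the sample size are all hypothetical choices, not taken from the notes. The true mean of `z` and of `z - x` is 0 in the full data, so departures from 0 among the observed values show what each mechanism does.

```python
import math
import random

def simulate(mechanism, n=20000, seed=1):
    """Return (x, z) pairs; z is None when it is missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)        # fully observed characteristic
        z = x + rng.gauss(0, 1)    # predictor subject to missingness
        if mechanism == "MCAR":
            p_miss = 0.3                            # unrelated to any data
        elif mechanism == "MAR":
            p_miss = 1 / (1 + math.exp(-(x - 1)))   # depends on observed x only
        elif mechanism == "informative":
            p_miss = 1 / (1 + math.exp(-(z - 1)))   # depends on z itself
        else:
            raise ValueError(mechanism)
        rows.append((x, None if rng.random() < p_miss else z))
    return rows

def mean(values):
    return sum(values) / len(values)

summary = {}
for mechanism in ("MCAR", "MAR", "informative"):
    observed = [(x, z) for x, z in simulate(mechanism) if z is not None]
    summary[mechanism] = {
        # Marginal mean of the observed z values (0 in the full data).
        "mean_z": mean([z for _, z in observed]),
        # Mean of z after removing the part explained by x (0 in the full data).
        "mean_resid": mean([z - x for x, z in observed]),
    }
    print(mechanism, {k: round(v, 2) for k, v in summary[mechanism].items()})
```

In this setup the observed mean of `z` should be distorted under both MAR and informative missingness, but once the observed `x` is accounted for, only the informative mechanism should leave a distortion. That is the sense in which MAR missingness can be handled using variables available to the analysis.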
Prelude to Modeling
- Quantify extent of missing data
- Characterize types of subjects with missing data
- Find sets of variables missing on same subjects
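The three steps above can be sketched on a toy dataset. The records and variable names (`age`, `chol`, `sbp`) are made up for illustration, and `None` is assumed to mark a missing value; real analyses would normally use a statistical package's own tools for this.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 61, "chol": 210, "sbp": 130},
    {"age": 74, "chol": None, "sbp": None},
    {"age": 58, "chol": 190, "sbp": 125},
    {"age": 80, "chol": None, "sbp": None},
    {"age": 49, "chol": 230, "sbp": None},
    {"age": 66, "chol": 205, "sbp": 140},
]

def fraction_missing(records):
    """Quantify extent: fraction of subjects missing each variable."""
    n = len(records)
    return {v: sum(r[v] is None for r in records) / n for v in records[0]}

def mean_by_missingness(records, target, by):
    """Characterize subjects: mean of `by` with and without `target` missing."""
    groups = {"missing": [], "observed": []}
    for r in records:
        if r[by] is not None:
            key = "missing" if r[target] is None else "observed"
            groups[key].append(r[by])
    return {k: sum(v) / len(v) for k, v in groups.items() if v}

def missing_patterns(records):
    """Count each distinct set of variables missing on the same subject."""
    return Counter(
        frozenset(v for v, val in r.items() if val is None) for r in records
    )

frac = fraction_missing(records)
age_by_chol = mean_by_missingness(records, "chol", "age")
patterns = missing_patterns(records)

print(frac)
print(age_by_chol)
for pattern, count in patterns.most_common():
    print(sorted(pattern) or "complete", count)
```

In this toy data, subjects missing `chol` are older on average, and `chol` is never missing without `sbp` also being missing. Those are the kinds of findings these steps are meant to surface before modeling.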
Missing Values for Different Response Variables
- Serial data with subjects dropping out (not covered in this
course)
- \( y \)=time to event, follow-up curtailed: covered under survival
analysis (White and Royston provide a method
for multiply imputing missing covariate values using censored
survival time data.)
- Often discard observations with completely missing \( y \) but
sometimes wasteful
- Characterize missings in \( y \) before dropping obs.
Problems with simple alternatives to imputation
Deletion of records—
- Badly biases parameter estimates when the probability of a
case being incomplete is related to \( y \) and not just
\( x \)
- Deletion because of a subset of \( x \) being missing
always results in inefficient estimates
- Deletion of records with missing \( y \) can result in
biases but is the preferred approach
under MCAR (Multiple imputation of \( y \) in that case
does not improve the analysis and assumes the imputation
model is correct.)
However, von Hippel found advantages to a “use
all variables to impute all variables then drop observations with
missing \( y \)” approach.
Only discard obs. when
- MCAR can be justified
- Rarely missing predictor of overriding importance that can't be
imputed from other data
- Fraction of obs. with missings small and \( n \) is large
No advantage of deletion except savings of analyst time
Making up missing data better than throwing away real data
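The point about deletion bias can be checked with a simulation sketch. Everything here is an assumed setup for illustration: a simple linear model with true slope 2, and logistic deletion probabilities that depend either on \( x \) alone or on \( y \).

```python
import math
import random

def slope(pairs):
    """Ordinary least-squares slope of y on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

def logistic(t):
    return 1 / (1 + math.exp(-t))

rng = random.Random(2)
full = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    full.append((x, 1 + 2 * x + rng.gauss(0, 1)))   # true slope is 2

# Delete each record with a probability driven by x only, or by y.
kept_x = [(x, y) for x, y in full if rng.random() > logistic(2 * x)]
kept_y = [(x, y) for x, y in full if rng.random() > logistic(2 * (y - 1))]

slope_full = slope(full)
slope_x = slope(kept_x)   # deletion related to x only
slope_y = slope(kept_y)   # deletion related to y

print("all records:        n =", len(full), "slope =", round(slope_full, 2))
print("deletion on x only: n =", len(kept_x), "slope =", round(slope_x, 2))
print("deletion on y:      n =", len(kept_y), "slope =", round(slope_y, 2))
```

Under these assumptions, deletion related only to \( x \) should cost sample size (efficiency) but leave the slope near its true value, whereas deletion related to \( y \) should pull the slope toward zero.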
Adding extra categories of categorical predictors—
- Including missing data but adding a category “missing” causes
serious biases
- Problem acute when values missing because subject too sick
- Difficult to interpret
- Fails even under MCAR
- May be OK if values are
“missing” because of “not
applicable”, e.g. you have a measure of marital
happiness, dichotomized as high or low, but your sample contains some
unmarried people. OK to have a 3-category
variable with values high, low, and unmarried.
Likewise, serious problems are caused by setting missing continuous
predictors to a constant (e.g., zero) and adding an indicator variable
to try to estimate the effect of missing values.
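A small simulation sketch of the constant-plus-indicator approach, under assumed conditions: two correlated predictors `x1` and `x2` (correlation 0.7), true coefficients of 1 for both, and `x1` missing completely at random half the time. The `ols` helper is a hypothetical minimal solver written for this example, not a library routine.

```python
import math
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * ac for a, ac in zip(A[r], A[c])]
                b[r] -= f * b[c]
    return [b[i] / A[i][i] for i in range(k)]

rho = 0.7
rng = random.Random(3)
rows = []
for _ in range(20000):
    x1 = rng.gauss(0, 1)
    x2 = rho * x1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    y = x1 + x2 + rng.gauss(0, 1)     # true coefficients are both 1
    missing = rng.random() < 0.5      # x1 is MCAR with probability 0.5
    rows.append((x1, x2, y, missing))

# Constant fill plus indicator: intercept, x1 set to 0 if missing, indicator, x2.
X_ind = [[1.0, 0.0 if m else x1, 1.0 if m else 0.0, x2] for x1, x2, _, m in rows]
b_ind = ols(X_ind, [y for _, _, y, _ in rows])

# Complete-case fit for comparison: intercept, x1, x2.
cc = [(x1, x2, y) for x1, x2, y, m in rows if not m]
b_cc = ols([[1.0, x1, x2] for x1, x2, _ in cc], [y for _, _, y in cc])

print("x2 coefficient, constant fill + indicator:", round(b_ind[3], 2))
print("x2 coefficient, complete cases:", round(b_cc[2], 2))
```

Even though the missingness here is MCAR, the constant-plus-indicator fit should overstate the `x2` coefficient: in records with `x1` zeroed out, `x2` absorbs part of the effect of the correlated `x1`. The complete-case fit should stay near the true value, at the cost of using only about half the records.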