This document introduces methods for dealing with missing data in R using the MICE package by van Buuren and Groothuis-Oudshoorn (2011). The note also explores some sensitivity analyses for understanding the missingness mechanism.
Two datasets are used in this work:
egdata: Data containing the Weight, Length and Sex of 159 babies with missing values on Length and Weight.
MACA_score: Data containing information on whether children and teenagers care for a parent or guardian, their age, and their MACA (Multidimensional Assessment of Caring Activities) score, denoted by MACATotalSum. Missing values occur on age and MACATotalSum.
We will work through egdata; I recommend that you use MACA_score for your personal practice and to explore more ideas about dealing with missing data.
The datasets are available to download from: https://liveuclac-my.sharepoint.com/:f:/g/personal/sejjcog_ucl_ac_uk/Emki-iuqFfdHqIz5v-Hb7TYBF_2Pmkmw3926djYBpKZBYw?e=OcyWlO
Install and load the MICE package for imputation of missing values.
install.packages('mice')
require(mice)
Install and load the VIM package for displaying missing data patterns.
install.packages("VIM")
require(VIM)
Next, we will load the egdata dataset into R using the read.csv function.
egdata<-read.csv('egdata.csv')
The contents of the dataset are shown below:
head(egdata)
Use the summary function to summarise the variables. For numeric variables, this displays the mean, median and quartiles, while for categorical variables we need the frequency in each category.
summary(egdata)
## Sex Length Weight
## Min. :0.0000 Min. : 51.40 Min. : 2.95
## 1st Qu.:0.0000 1st Qu.: 66.60 1st Qu.: 6.77
## Median :1.0000 Median : 76.65 Median : 9.87
## Mean :0.5031 Mean : 81.75 Mean :11.61
## 3rd Qu.:1.0000 3rd Qu.:100.08 3rd Qu.:16.70
## Max. :1.0000 Max. :118.40 Max. :26.20
## NA's :11 NA's :22
The variable Sex is wrongly summarised as a numeric variable, so we will define it as categorical using the factor function.
egdata$Sex<-factor(x=egdata$Sex,levels = c(0,1),labels=c("girl","boy"))
After running the factor command, the variables are summarised correctly as follows:
summary(egdata)
## Sex Length Weight
## girl:79 Min. : 51.40 Min. : 2.95
## boy :80 1st Qu.: 66.60 1st Qu.: 6.77
## Median : 76.65 Median : 9.87
## Mean : 81.75 Mean :11.61
## 3rd Qu.:100.08 3rd Qu.:16.70
## Max. :118.40 Max. :26.20
## NA's :11 NA's :22
The summary statistics show that there are 11 missing values for Length and 22 missing values for Weight. Missing values in R are denoted by NA.
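A quick way to count the missing values in each variable directly is with is.na:
colSums(is.na(egdata))#Count of NAs per variable (0 for Sex, 11 for Length, 22 for Weight)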
We will use the aggr function to explore the missing
data patterns.
MP_plot_egdata <- aggr(egdata, col=c('red','blue'),
numbers=TRUE, sortVars=TRUE,
labels=names(egdata), cex.axis=.7,
gap=3, ylab=c("Proportion of Missing Data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Weight 0.13836478
## Length 0.06918239
## Sex 0.00000000
md.pattern(egdata)
## Sex Length Weight
## 127 1 1 1 0
## 21 1 1 0 1
## 10 1 0 1 1
## 1 1 0 0 2
## 0 11 22 33
The output shows that Weight has the most missing values and that 79.87% of cases (127 of the 159 babies) were fully observed. Only 0.63% of cases (a single baby) have missing values on both Weight and Length.
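These percentages can be verified directly; for example, using complete.cases:
mean(complete.cases(egdata))#Proportion of fully observed cases, approximately 0.7987
mean(is.na(egdata$Length) & is.na(egdata$Weight))#Proportion missing both Length and Weight, approximately 0.0063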
To explore how missingness on a variable relates to the observed values of other variables, we begin by creating missing data indicator variables. The function is.na is used for this purpose.
missing_Length<-is.na(egdata$Length)#Missing indicator for Length
missing_Weight<-is.na(egdata$Weight)#Missing indicator for Weight
Using the missing data indicator variables created above, let us compare the mean Weight of those with missing values on Length against those whose Length was observed. We will use the tapply function for this purpose. Note that the missing data indicators are objects of type logical, taking the value TRUE if a value is missing and FALSE otherwise.
tapply(X=egdata$Weight,INDEX=missing_Length,FUN=mean,na.rm=T)
## FALSE TRUE
## 11.50898 12.84000
We will repeat this code to compare the Length of those with and without missing values on Weight.
tapply(X=egdata$Length,INDEX=missing_Weight,FUN=mean,na.rm=T)
## FALSE TRUE
## 82.34567 78.17143
The results show that those with missing values on Length had a slightly higher mean Weight than those whose Length was observed (a difference of 1.33). For babies with missing values on Weight, the average Length was slightly lower than for babies whose Weight was observed (a difference of 4.17).
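These differences can also be computed directly by combining diff with tapply:
diff(tapply(X=egdata$Weight,INDEX=missing_Length,FUN=mean,na.rm=T))#Mean Weight: missing Length minus observed Length
diff(tapply(X=egdata$Length,INDEX=missing_Weight,FUN=mean,na.rm=T))#Mean Length: missing Weight minus observed Weight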
To make the comparisons more formal, we use a two sample t-test to compare the means, with a null hypothesis of no difference in means.
t.test(egdata$Weight~missing_Length)#Compare Weight for those missing Length
##
## Welch Two Sample t-test
##
## data: egdata$Weight by missing_Length
## t = -0.53361, df = 9.7835, p-value = 0.6055
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## -6.905552 4.243505
## sample estimates:
## mean in group FALSE mean in group TRUE
## 11.50898 12.84000
The p-value of 0.61 means that, if there were truly no difference in mean Weight between those missing Length and those not, we would expect to see a difference at least this extreme about 61 times in 100. The observed difference is therefore not statistically significant.
t.test(egdata$Length~missing_Weight)#Compare Length for those missing Weight
##
## Welch Two Sample t-test
##
## data: egdata$Length by missing_Weight
## t = 0.96306, df = 28.463, p-value = 0.3436
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## -4.697723 13.046205
## sample estimates:
## mean in group FALSE mean in group TRUE
## 82.34567 78.17143
We repeat the same process to compare the mean Length of those missing Weight against those whose Weight was observed. The p-value of 0.34 implies that there is no significant difference in the mean Length.
To display this information graphically, we use dotplots. Extreme values are defined as those outside the interval \((Q_1 - 1.5\times IQR,\ Q_3 + 1.5\times IQR)\).
Using the following lines of code, we produce the graphs below. The dotplots support the t-test results of no difference.
plot(jitter(as.numeric(missing_Length)),egdata$Weight,ylim=c(-7,35),
xaxt="n",
xlab="Length Missingness",
ylab="Weight")
axis(1, at=c(0, 1), labels =c("Not missing", "Missing"))
abline(h=quantile(egdata$Weight,1/4,na.rm=T)-1.5*IQR(egdata$Weight,na.rm=T),col="red")
abline(h=quantile(egdata$Weight,3/4,na.rm=T)+1.5*IQR(egdata$Weight,na.rm=T),col="red")
plot(jitter(as.numeric(missing_Weight)),egdata$Length,ylim=c(20,150),
xaxt="n",
xlab="Weight Missingness",
ylab="Length")
axis(1, at=c(0, 1), labels =c("Not missing", "Missing"))
abline(h=quantile(egdata$Length,1/4,na.rm=T)-1.5*IQR(egdata$Length,na.rm=T),col="red")
abline(h=quantile(egdata$Length,3/4,na.rm=T)+1.5*IQR(egdata$Length,na.rm=T),col="red")
Next we will investigate whether missingness on Length and Weight is associated with Sex. A way to display this information is a contingency table, and formal testing is done using a chi-square test.
Contingency table for Sex versus missing Length:
tbl_Length<-table(egdata$Sex,missing_Length)
prop.table(tbl_Length,margin = 1)
## missing_Length
## FALSE TRUE
## girl 0.91139241 0.08860759
## boy 0.95000000 0.05000000
chisq.test(tbl_Length)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl_Length
## X-squared = 0.41817, df = 1, p-value = 0.5178
Contingency table for Sex versus missing Weight:
tbl_Weight<-table(egdata$Sex,missing_Weight)
prop.table(tbl_Weight,margin=1)
## missing_Weight
## FALSE TRUE
## girl 0.8607595 0.1392405
## boy 0.8625000 0.1375000
chisq.test(tbl_Weight)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl_Weight
## X-squared = 8.292e-31, df = 1, p-value = 1
The p-values of 0.52 and 1 imply that there is no evidence of an association between missingness on Length or Weight and the observed values of Sex.
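Because only a small number of babies have missing values in each Sex group, Fisher's exact test is sometimes preferred to the chi-square approximation; as an optional check, the same tables can be tested exactly:
fisher.test(tbl_Length)#Exact test of association between Sex and missingness on Length
fisher.test(tbl_Weight)#Exact test of association between Sex and missingness on Weight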
The results obtained so far are an indication that the missingness mechanism is not systematic; the missing values are possibly missing completely at random (MCAR). This is based on a null hypothesis that the missingness mechanism is MCAR. If we suspect MNAR (missing not at random), then the above checks are not valid for confirming or rejecting that claim.
In this section, we consider imputation of missing values using
multiple imputation (MI), mean imputation and regression imputation. We
will compare parameter estimates from these imputation methods to
listwise deletion and pairwise deletion. All of the imputation methods
will be implemented using the mice function within the
MICE package.
We will impute missing values for Weight and Length using multiple
imputation with 5 imputed datasets. The following input arguments are
used in mice for multiple imputation:
m: Number of imputed datasets (default is
m=5)
seed: Random seed for reproducible results
method: the method used to impute missing values (the default method for imputation of numeric variables is pmm). For single random imputation using linear regression (without PMM), set method to norm.nob. Specifying method implies that all variables with missing values will be imputed using this approach.
defaultMethod: this argument sets the default methods for imputing different types of variables. It is a vector of 4 values specifying the default methods for imputing numeric, binary, nominal and ordinal variables. The default value is c("pmm", "logreg", "polyreg", "polr"). This default implies that numeric variables will be imputed using "pmm" (predictive mean matching), binary variables using "logreg" (binary logistic regression), nominal variables using "polyreg" (multinomial logistic regression) and ordinal variables using "polr" (ordinal logistic regression).
PMM stands for predictive mean matching. This approach combines the ideas of hot deck imputation and regression imputation: imputed values are a random draw from the set of observed values whose predicted values are close to the predicted value for the missing case. This helps to avoid out-of-range imputed values and is particularly useful when the variable(s) with missing values have a skewed or multimodal distribution. A useful reference describing this method: http://stefvanbuuren.name/fimd/sec-pmm.html.
maxit: Maximum number of iterations of the imputation algorithm (default: maxit=5).
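For reference, the defaults above can be written out explicitly. The call below is only an illustrative sketch using the documented default values and produces the same result as the shorter call we actually run next:
MI5_explicit <- mice(egdata, m=5, seed=500,#Number of imputed datasets and random seed for reproducibility
                     defaultMethod=c("pmm","logreg","polyreg","polr"),#Default imputation methods by variable type
                     maxit=5)#Number of iterations of the chained-equations algorithm
Next we apply the mice function to create the multiply imputed datasets.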
MI5_egdata <- mice(egdata, m=5, seed = 500)
Using summary, we see the details of the imputation
technique used.
summary(MI5_egdata)
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## Sex Length Weight
## "" "pmm" "pmm"
## PredictorMatrix:
## Sex Length Weight
## Sex 0 1 1
## Length 1 0 1
## Weight 1 1 0
The output shows that missing values for Length and Weight were estimated using PMM and there was no imputation for Sex (no missing values).
To see the imputed values for each variable, we use the command below. For example, the imputed values for Length are obtained as follows:
MI5_egdata$imp$Length #Imputed values for Length
The rows correspond to the cases with missing values on Length, while the columns correspond to the imputed dataset that the values are taken from.
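As a quick check, this object should have one row per case with a missing Length value and one column per imputed dataset:
dim(MI5_egdata$imp$Length)#Should return 11 (missing Length values) and 5 (imputed datasets)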
A similar command with Length replaced by
Weight gives imputed values for Weight. The output is not
shown here.
MI5_egdata$imp$Weight #Imputed values for Weight
To view the multiply imputed datasets, use the complete
function.
For example, to extract the first imputed dataset we use the following command:
MI5_egdata_1 <- complete(MI5_egdata,1)#First complete dataset with imputed values
To view the dataset in a spreadsheet format in R we run the
View function.
View(MI5_egdata_1)
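If you prefer all five imputed datasets stacked in a single data frame, complete also accepts a 'long' action; in the sketch below, the .imp column identifies the imputation number and .id the original case:
MI5_egdata_long <- complete(MI5_egdata, action="long")#All 5 imputed datasets stacked together
head(MI5_egdata_long)#.imp is the imputation number, .id the original row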
In this section, we are interested in comparing the Weight of boys and girls using a two-sample t-test. The R function for this is t.test; however, we will use the lm function (used for regression analysis) to obtain the p-value of the two-sample t-test when doing multiple imputation. This is because the t.test function does not produce the standard error of the estimated difference, which is required by the pool function to combine the analysis results. Both approaches give the same p-value, because squaring the t statistic (the value used to get the p-value in the t-test) gives the F-statistic (the value used to get the p-value in the regression). The lm function is run inside the with function, which performs the comparison on each of the 5 imputed datasets.
ttest_Weight_MI5<-with(data=MI5_egdata,expr=lm(Weight~Sex))
#Compare weight of boys and girls using multiply imputed datasets
Next we combine the results of the significance tests using pool.
combine_ttest_Weight_MI5 <- pool(ttest_Weight_MI5)
The function summary produces the combined results.
summary(combine_ttest_Weight_MI5)
From the combined results, the mean Weight of boys in the sample is 1.13kg higher than that of the girls. With a p-value of 0.22, this difference in Weight is not statistically significant.
The mean imputation method replaces missing values on a variable with the mean of the observed values of that variable. Under MCAR this gives unbiased estimates of the mean, although it understates variability and distorts relationships between variables. We will implement mean imputation of missing values within the mice function, with method set to 'mean' and m=1.
Mean_egdata <- mice(egdata, m=1, method = 'mean')
MeanData_egdata<-complete(data=Mean_egdata)
The complete function is used to create a new dataframe
containing the observed and imputed values. To view the imputed values
for Length, we use the following command:
Mean_egdata$imp$Length
Next, we will repeat our analysis (comparing the mean Weight of boys and girls) using the mean imputed dataset.
#T-test for comparing Weight of boys and girls
t.test(data=MeanData_egdata,Weight~Sex)
##
## Welch Two Sample t-test
##
## data: Weight by Sex
## t = -1.024, df = 156.53, p-value = 0.3074
## alternative hypothesis: true difference in means between group girl and group boy is not equal to 0
## 95 percent confidence interval:
## -2.5736420 0.8162425
## sample estimates:
## mean in group girl mean in group boy
## 11.16402 12.04272
diff(tapply(X=MeanData_egdata$Weight,INDEX=MeanData_egdata$Sex,FUN = mean))
## boy
## 0.8786998
Based on the results obtained, in this sample, boys are 0.88kg heavier than girls on average. The p-value of 0.31 implies that the observed difference in mean Weight is not statistically significant.
This result agrees with our conclusion when missing values were imputed using multiple imputation (even though the p-value here is slightly larger).
In this section, we will estimate missing values using the regression
mean imputation method. In this approach, each variable with missing
values will be used as the dependent variable in a regression model and
predicted values are our missing data estimates. This imputation method
is also implemented using the mice function with method set to
'norm.predict' and \(m=1\). To implement single random
imputation (regression imputation with noise added), set method to
'norm.nob'.
Regression_egdata <- mice(egdata, m=1, method = 'norm.predict')
RegressionData_egdata<-complete(data=Regression_egdata)
The imputed Length values are given as follows:
Regression_egdata$imp$Length
Let us compare the mean Weight of boys and girls using the regression imputed dataset.
#T-test for comparing Weight of boys and girls
t.test(data=RegressionData_egdata,Weight~Sex)
##
## Welch Two Sample t-test
##
## data: Weight by Sex
## t = -1.1811, df = 156.52, p-value = 0.2394
## alternative hypothesis: true difference in means between group girl and group boy is not equal to 0
## 95 percent confidence interval:
## -2.8704977 0.7222262
## sample estimates:
## mean in group girl mean in group boy
## 10.86619 11.94033
diff(tapply(X=RegressionData_egdata$Weight,INDEX=RegressionData_egdata$Sex,FUN = mean))
## boy
## 1.074136
The results suggest that boys are heavier than girls by 1.07kg on average. This difference in Weight is not statistically significant.
The t-test result agrees with our conclusion from estimating missing values using the multiple imputation and mean imputation methods.
The default in most R functions is to delete cases with missing values on at least one of the variables used in the analysis. So, we will simply run the analysis directly on the original data with missing values. This is listwise deletion.
t.test(data=egdata,Weight~Sex)
##
## Welch Two Sample t-test
##
## data: Weight by Sex
## t = -1.0236, df = 134.64, p-value = 0.3079
## alternative hypothesis: true difference in means between group girl and group boy is not equal to 0
## 95 percent confidence interval:
## -2.9902880 0.9506503
## sample estimates:
## mean in group girl mean in group boy
## 11.09250 12.11232
diff(tapply(X=egdata$Weight,INDEX=egdata$Sex,FUN = mean,na.rm=T))
## boy
## 1.019819
This test has a p-value of 0.31, which implies that the observed difference in mean Weight of 1.02kg is not statistically significant.
The above conclusion corresponds with the results we obtained by imputing missing data using multiple imputation, mean imputation and regression imputation. This may be an indication that the missingness mechanism is not systematic (MCAR). This claim is plausible given the sensitivity analysis so far and the imputation results; it is, however, best to use your understanding of the data and the data collection process to support any such intuition. NB: when the missingness mechanism is MCAR, multiple imputation, mean imputation, regression imputation and listwise deletion all give unbiased estimates of the means, although mean and regression imputation understate the uncertainty of those estimates.
Note that given that the analysis of interest here is a two sample t-test (and we do not need to calculate a covariance matrix), the listwise and pairwise deletion approaches will lead to exactly the same results.
From our analysis so far, the different methods for dealing with missing data lead to the same conclusion when comparing the Weight of boys and girls. This suggests that our conclusions are quite robust to the missing values. We recommend multiple imputation for dealing with missing data: it preserves power and precision and gives unbiased parameter estimates under the MCAR and MAR assumptions. That said, the sensitivity analysis above suggests that the missing values have had minimal impact on this particular analysis.
To perform pairwise deletion in regression analysis, you need to
install the regtools package and run the regression
analysis using the lmac function instead of
lm.
#Install package for pairwise deletion in regression analysis
install.packages('regtools')
library(regtools)
To show an example where the output of listwise deletion and pairwise deletion will differ, we will fit a linear regression model to predict Weight of babies using Length and Sex. From the estimated regression coefficients, notice the difference in values obtained using listwise deletion and pairwise deletion.
Listwise deletion regression coefficients:
lm_listwise<-lm(Weight~Length+Sex,data=egdata)
coef(lm_listwise)
## (Intercept) Length Sexboy
## -11.8765968 0.2807237 0.5179966
Pairwise deletion regression coefficients:
lm_pairwise<-lmac(cbind(egdata$Length,egdata$Sex,egdata$Weight))
coef(lm_pairwise)
## [1] -12.5721999 0.2872227 0.4636315