1 Introduction and Dataset

This document introduces methods for dealing with missing data in R using the MICE package by van Buuren and Groothius-Oudshoom (2011). The note also explores some sensitivity analysis for understanding the missingness mechanism.

Two datasets are used in this work:

  1. egdata: Data containing the Weight, Length and Sex of 159 babies with missing values on Length and Weight.

  2. MACA_score: Data containing the information on whether children and teenagers care for a parent or guardian, their age and MACA (Multidimensional Assessment of Caring Activities) score, denoted by MACATotalSum. Missing values are on age and MACATotalSum.

We will work through egdata, I recommend that you use MACA_score for your personal practice and to explore more ideas about dealing with missing data.

The datasets are available to download from: https://liveuclac-my.sharepoint.com/:f:/g/personal/sejjcog_ucl_ac_uk/Emki-iuqFfdHqIz5v-Hb7TYBF_2Pmkmw3926djYBpKZBYw?e=OcyWlO

Install and load MICE package for imputation of missing values.

install.packages('mice')
require(mice)

Install and load VIM package for displaying missing data patterns.

install.packages("VIM")
require(VIM)

2 Exploratory Analysis

Next, we will load the egdata into R using the read.csv function.

egdata<-read.csv('egdata.csv')

The contents of the dataset are shown below:

head(egdata)

Use the summary function to summarise variables. For numeric variables, this should display the mean and median etc, while for categoric variables, we need the frequency in each category.

summary(egdata)
##       Sex             Length           Weight     
##  Min.   :0.0000   Min.   : 51.40   Min.   : 2.95  
##  1st Qu.:0.0000   1st Qu.: 66.60   1st Qu.: 6.77  
##  Median :1.0000   Median : 76.65   Median : 9.87  
##  Mean   :0.5031   Mean   : 81.75   Mean   :11.61  
##  3rd Qu.:1.0000   3rd Qu.:100.08   3rd Qu.:16.70  
##  Max.   :1.0000   Max.   :118.40   Max.   :26.20  
##                   NA's   :11       NA's   :22

The variable Sex is wrongly summarised as a numeric variable, so we will define it as categoric using the factor function.

egdata$Sex<-factor(x=egdata$Sex,levels = c(0,1),labels=c("girl","boy"))

After running the factor command, we get our summary of the variables correctly as follows:

summary(egdata)
##    Sex         Length           Weight     
##  girl:79   Min.   : 51.40   Min.   : 2.95  
##  boy :80   1st Qu.: 66.60   1st Qu.: 6.77  
##            Median : 76.65   Median : 9.87  
##            Mean   : 81.75   Mean   :11.61  
##            3rd Qu.:100.08   3rd Qu.:16.70  
##            Max.   :118.40   Max.   :26.20  
##            NA's   :11       NA's   :22

The summary statistics shows there are 11 missing values for Length and 22 missing values for Weight. The missing values in R are denoted as NA.

We will use the aggr function to explore the missing data patterns.

MP_plot_egdata <- aggr(egdata, col=c('red','blue'),
                       numbers=TRUE, sortVars=TRUE,
                       labels=names(egdata), cex.axis=.7,
                       gap=3, ylab=c("Proportion of Missing Data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##    Weight 0.13836478
##    Length 0.06918239
##       Sex 0.00000000
md.pattern(egdata)

##     Sex Length Weight   
## 127   1      1      1  0
## 21    1      1      0  1
## 10    1      0      1  1
## 1     1      0      0  2
##       0     11     22 33

The output shows that Weight has the most missing values and 79.87% of cases were fully observed with no missing values. The percentage of cases with missing values for both Weight and Length is 0.63%.

To explore how missingness on a variables relate to observed values on other variables, we begin by creating missing data indicator variables. The function is.na is used for this purpose.

missing_Length<-is.na(egdata$Length)#Missing indicator for Length

missing_Weight<-is.na(egdata$Weight)#Missing indicator for Weight

Using the missing data indicator variable created, let us compare the mean Weight of those with missing values on Length against those whose Length was observed. We will use the tapply function for this purpose. Note that the missing data indicator binary variables are objects of type logical taking values TRUE if a value is missing and FALSE if otherwise.

tapply(X=egdata$Weight,INDEX=missing_Length,FUN=mean,na.rm=T)
##    FALSE     TRUE 
## 11.50898 12.84000

We will repeat this code to compare the Length of those with and without missing values on Weight.

tapply(X=egdata$Length,INDEX=missing_Weight,FUN=mean,na.rm=T)
##    FALSE     TRUE 
## 82.34567 78.17143

The result shows that those with missing values on Length had slightly higher mean Weight than those whose Length was observed. Difference of 1.33. For babies with missing values on Weight, their average Length was slightly lower than those of the babies whose Weight was observed. Difference of 4.17.

To make the comparisons more formal, we use a two sample t-test to compare the means, with a null hypothesis of no difference in means.

t.test(egdata$Weight~missing_Length)#Compare Weight for those missing Length
## 
##  Welch Two Sample t-test
## 
## data:  egdata$Weight by missing_Length
## t = -0.53361, df = 9.7835, p-value = 0.6055
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -6.905552  4.243505
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            11.50898            12.84000

The p-value of 0.61 implies that if there was no difference in the mean Weight of those missing Length or not, then we would expect to see a difference this extreme 61 in 100 times. Therefore the observed difference is not statistically significant.

t.test(egdata$Length~missing_Weight)#Compare Length for those missing Length
## 
##  Welch Two Sample t-test
## 
## data:  egdata$Length by missing_Weight
## t = 0.96306, df = 28.463, p-value = 0.3436
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -4.697723 13.046205
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            82.34567            78.17143

We repeat the same process to compare the mean Length of those missing Weight against those whose Weight was observed. The p-value of 0.34 implies that there is no significant difference in the mean Length.

To display this information graphically, we use dotplots. Extreme values are defined as those outside the interval \[(Q1-1.5*IQR,\ Q3+1.5*IQR)\].

Using the following lines of code, we produce the graphs below. The dotplots support the t-test results of no difference.

plot(jitter(as.numeric(missing_Length)),egdata$Weight,ylim=c(-7,35),
     xaxt="n",
     xlab="Length Missingness",
     ylab="Weight")

axis(1, at=c(0, 1), labels =c("Not missing", "Missing"))


abline(h=quantile(egdata$Weight,1/4,na.rm=T)-1.5*IQR(egdata$Weight,na.rm=T),col="red")
abline(h=quantile(egdata$Weight,3/4,na.rm=T)+1.5*IQR(egdata$Weight,na.rm=T),col="red")

plot(jitter(as.numeric(missing_Weight)),egdata$Length,ylim=c(20,150), 
     xaxt="n",
     xlab="Weight Missingness",
     ylab="Length")

axis(1, at=c(0, 1), labels =c("Not missing", "Missing"))


abline(h=quantile(egdata$Length,1/4,na.rm=T)-1.5*IQR(egdata$Length,na.rm=T),col="red")
abline(h=quantile(egdata$Length,3/4,na.rm=T)+1.5*IQR(egdata$Length,na.rm=T),col="red")

Next we will investigate if missingness on Length and Weight is associated with Sex. A way to display this information is by using a contingency table and formal testing is done using a Chi-square test.

Contingency table for Sex versus missing Length:

tbl_Length<-table(egdata$Sex,missing_Length)
prop.table(tbl_Length,margin = 1)
##       missing_Length
##             FALSE       TRUE
##   girl 0.91139241 0.08860759
##   boy  0.95000000 0.05000000
chisq.test(tbl_Length)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tbl_Length
## X-squared = 0.41817, df = 1, p-value = 0.5178

Contigency table for Sex versus missing Weight:

tbl_Weight<-table(egdata$Sex,missing_Weight)
prop.table(tbl_Weight,margin=1)
##       missing_Weight
##            FALSE      TRUE
##   girl 0.8607595 0.1392405
##   boy  0.8625000 0.1375000
chisq.test(tbl_Weight)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tbl_Weight
## X-squared = 8.292e-31, df = 1, p-value = 1

The p-values of 0.52 and 1 implies that there is no association between missingness on Length and Weight and the observed values of Sex.

The results obtained so far is an indication that missingness mechanism is not systematic. Missing values are possibly missing completely at random (MCAR). This is based on a null hypothesis that the missingness mechanism is MCAR. If we suspect MNAR, then the above checks are not valid for confirming or rejecting the claim.

3 Imputation of Missing Data

In this section, we consider imputation of missing values using multiple imputation (MI), mean imputation and regression imputation. We will compare parameter estimates from these imputation methods to listwise deletion and pairwise deletion. All of the imputation methods will be implemented using the mice function within the MICE package.

3.1 Multiple Imputation

We will impute missing values for Weight and Length using multiple imputation with 5 imputed datasets. The following input arguments are used in mice for multiple imputation:

  • m: Number of imputed datasets (default is m=5)

  • seed: Random seed for reproducible results

  • : method to use to impute missing values (default method for imputation of numeric variables is pmm). For single random imputation using linear regression (without PMM), set method to norm.nob. Specifying method implies that all missing variables will be imputed using this approach.

  • defaultMethod: this argument sets default methods for imputing different types of variables. This is a vector containing 4 values and specifies the default methods for imputing numeric, binary, nominal and ordinal variables. The default value is c(“pmm”, “logreg”, “polyreg”, “polr”). This default implies that numeric values will be imputed using “pmm” (predictive mean matching), binary variables will be imputed using “logreg” (binary logistic regression), nominal variables will be imputed using “polyreg” (multinomial logistic regression) and ordinal variables will be imputed using “polr” (ordinal logistic regression).

PMM is predictive mean matching. This approach combines the ideas of hot deck imputation and regression imputation. Imputed values are a random draw from a set of observed values that have predicted values close to the predicted missing value. This approach helps to avoid out of range missing value estimates when data. This is mostly used when the variable(s) with missing values have a skewed or multimodal distribution. Useful reference describing this method: http://stefvanbuuren.name/fimd/sec-pmm.html.

  • maxit: Maximum number of iterations to use for imputation algorithms (default: maxit=5).

Next we apply the mice function to create the multiply imputed datasets.

MI5_egdata <- mice(egdata, m=5, seed = 500)

Using summary, we see the details of the imputation technique used.

summary(MI5_egdata)
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##    Sex Length Weight 
##     ""  "pmm"  "pmm" 
## PredictorMatrix:
##        Sex Length Weight
## Sex      0      1      1
## Length   1      0      1
## Weight   1      1      0

The output shows that missing values for Length and Weight were estimated using PMM and there was no imputation for Sex (no missing values).

To see the imputed values by for each variable, we use the command below.

For example, the imputed values for Length are obtained as follows:

MI5_egdata$imp$Length #Imputed values for Length

The rows correspond to cases with missing values on Length while the columns correspond to imputed dataset that the values are taken from.

A similar command with Length replaced by Weight gives imputed values for Weight. The output is not shown here.

MI5_egdata$imp$Weight #Imputed values for Weight

To view the multiply imputed datasets, use the complete function.

For example, to extract the first imputed dataset we use the following command:

MI5_egdata_1 <- complete(MI5_egdata,1)#First complete dataset with imputed values

To view the dataset in a spreadsheet format in R we run the View function.

View(MI5_egdata_1)

3.2 Analysis of Multiply Imputed Datasets

In this section, we are interested in comparing the Weight of boys and girls using a two-sample t-test. The R function for this is ttest, however we will using the lm function (used for regression analysis) to obtain the p-value of the two sample t-test when doing multiple imputation. This is because the ttest function does not produce standard error of the estimated difference, which is required by the pool function to combine analysis results. This approach is correct because the p-value produce by both approaches are the same, given that squaring the t-test statistic (value used to get p-value in t-test) gives the F-statistic (value used to get p-value in regression). The lm function is implemented within the function with which performs a two-sample t-test on each of the 5 imputed data sets.

ttest_Weight_MI5<-with(data=MI5_egdata,expr=lm(Weight~Sex))
#Compare weight of boys and girls using multiply imputed datasets

Next we combine the results of the significant tests using pool.

combine_ttest_Weight_MI5 <- pool(ttest_Weight_MI5)

The function summary produces the combined results.

summary(combine_ttest_Weight_MI5)

From the combined results, the mean Weight of boys in the sample is 1.13kg higher than that of the girls. With a p-value of 0.22, this difference in Weight is not statistically significant.

3.3 Mean Imputation

The mean imputation method replaces missing values on a variable by the mean of observed values of that variable. This method leads to unbiased parameter estimates under the assumption of MCAR. We will implement mean imputation of missing values within the mice function, with method set to 'mean' and m=1.

Mean_egdata <- mice(egdata, m=1, method = 'mean')
MeanData_egdata<-complete(data=Mean_egdata)

The complete function is used to create a new dataframe containing the observed and imputed values. To view the imputed values for Length, we use the following command:

Mean_egdata$imp$Length

Next, we will repeat our analysis (comparing the mean Weight of boys and girls) using the mean imputed dataset.

#T-test for comparing Weight of boys and girls
t.test(data=MeanData_egdata,Weight~Sex)
## 
##  Welch Two Sample t-test
## 
## data:  Weight by Sex
## t = -1.024, df = 156.53, p-value = 0.3074
## alternative hypothesis: true difference in means between group girl and group boy is not equal to 0
## 95 percent confidence interval:
##  -2.5736420  0.8162425
## sample estimates:
## mean in group girl  mean in group boy 
##           11.16402           12.04272
diff(tapply(X=MeanData_egdata$Weight,INDEX=MeanData_egdata$Sex,FUN = mean))
##       boy 
## 0.8786998

Based on the results obtained, in this sample, boys are 0.88kg heavier than girls on average. The p-value of 0.31 implies that the observed difference in mean Weight is not statistically significant.

This result agrees with our conclusion when missing values were imputed using multiple imputation (even though the p-value here is slightly larger).

3.4 Regression Imputation

In this section, we will estimate missing values using the regression mean imputation method. In this approach, each variable with missing values will be used as the dependent variable in a regression model and predicted values are our missing data estimates. This imputation method is also implemented using the mice function with method set to 'norm.predict' and \(m=1\). To implement single random imputation (regression imputation with noise added), set method to 'norm.nob'.

Regression_egdata <- mice(egdata, m=1, method = 'norm.predict')
RegressionData_egdata<-complete(data=Regression_egdata)

The imputed Length values are given as follows:

Regression_egdata$imp$Length

Let us compare the mean Weight of boys and girls using the regression imputed dataset.

#T-test for comparing Weight of boys and girls
t.test(data=RegressionData_egdata,Weight~Sex)
## 
##  Welch Two Sample t-test
## 
## data:  Weight by Sex
## t = -1.1811, df = 156.52, p-value = 0.2394
## alternative hypothesis: true difference in means between group girl and group boy is not equal to 0
## 95 percent confidence interval:
##  -2.8704977  0.7222262
## sample estimates:
## mean in group girl  mean in group boy 
##           10.86619           11.94033
diff(tapply(X=RegressionData_egdata$Weight,INDEX=RegressionData_egdata$Sex,FUN = mean))
##      boy 
## 1.074136

The results suggests that boys are heavier than girls on average by 1.22kg. This difference in Weight is not statistically significant.

The t-test result agrees with our conlcusion from estimating missing values using the multiple imputation and mean imputation methods.

3.5 Listwise deletion and Pairwise deletion

The R default method for dealing with missing values in most functions is to delete cases with missing values on at least a variable. So, we will just go directly to implement the data analysis on the actual data with missing values. This is listwise deletion.

t.test(data=egdata,Weight~Sex)
## 
##  Welch Two Sample t-test
## 
## data:  Weight by Sex
## t = -1.0236, df = 134.64, p-value = 0.3079
## alternative hypothesis: true difference in means between group girl and group boy is not equal to 0
## 95 percent confidence interval:
##  -2.9902880  0.9506503
## sample estimates:
## mean in group girl  mean in group boy 
##           11.09250           12.11232
diff(tapply(X=egdata$Weight,INDEX=egdata$Sex,FUN = mean,na.rm=T))
##      boy 
## 1.019819

This test has a p-value of 0.31, which implies that the observed difference in mean Weight of 1.02kg is not statistically significant.

The above conclusion corresponds with the results we obtained by imputation of missing data using multiple imputation, mean imputation and regression imputation. This may be an indication that missingness mechanism is not systematic (MCAR). This claim is plausible given the sensitivity analysis so far and the impuation results, it is however best to use your understanding of the data and the data collection process to support any intuition. NB: When the missingness mechanism is MCAR, multiple imputation, mean imputation, regression imputation and listwise deletion lead to unbiased parameter estimates.

Note that given that the analysis of interest here is a two sample t-test (and we do not need to calculate a covariance matrix), the listwise and pairwise deletion approaches will lead to exactly the same results.

From our analysis so far, the different methods for dealing with missing data lead to same conclusions when comparing the weight of boys and girls. This implies that our data is quite robust to the missing values. We recommend using multiple imputation for dealing with missing data to achieve high power, high precision and to get unbiased parameter estimates under the MCAR and MAR assumptions. However, the sensitivity analysis above suggests that the missing values have had a minimal impact on our analysis.

To perform pairwise deletion in regression analysis, you need to install the regtools package and run the regression analysis using the lmac function instead of lm.

#Install package for pairwise deletion in regression analysis
install.packages('regtools')
library(regtools)

To show an example where the output of listwise deletion and pairwise deletion will differ, we will fit a linear regression model to predict Weight of babies using Length and Sex. From the estimated regression coefficients, notice the difference in values obtained using listwise deletion and pairwise deletion.

Listwise deletion regression coefficients:

lm_listwise<-lm(Weight~Length+Sex,data=egdata)
coef(lm_listwise)
## (Intercept)      Length      Sexboy 
## -11.8765968   0.2807237   0.5179966

Pairwise deletion regression coefficients:

lm_pairwise<-lmac(cbind(egdata$Length,egdata$Sex,egdata$Weight))
coef(lm_pairwise)
## [1] -12.5721999   0.2872227   0.4636315