In this post, we work through a data set that has missing values. We’ll highlight what can be done if one doesn’t want to remove the observations that contain them.
Let’s first look at the famous data set from Francis Anscombe: Anscombe’s quartet (wiki)
Let’s load and look at the data before introducing the missing values:
dat_orig <- anscombe #original data set
par(mfrow = c(2,2))
attach(dat_orig)
plot(x1, y1)
abline(lm(y1 ~ x1), col = "red")
plot(x2,y2)
abline(lm(y2 ~ x2), col = "red")
plot(x3, y3)
abline(lm(y3 ~ x3), col = "red")
plot(x4, y4)
abline(lm(y4 ~ x4), col = "red")
detach(dat_orig)
We are not going into detail here, but as you can see, all four setups have the same linear regression line. A closer look would reveal that mean, sample variance, correlation, etc. are the same, too (up to a certain accuracy).
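To verify this numerically, here is a small added check (not part of the original walkthrough) that fits all four regressions in a loop and compares the coefficients:
# added sketch: compare the four regression fits numerically
coefs <- sapply(1:4, function(i) {
  coef(lm(dat_orig[[paste0("y", i)]] ~ dat_orig[[paste0("x", i)]]))
})
round(coefs, 2) # intercept ~ 3.0 and slope ~ 0.5 in every column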
#check this out
summary(dat_orig)
#> x1 x2 x3 x4
#> Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8
#> 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8
#> Median : 9.0 Median : 9.0 Median : 9.0 Median : 8
#> Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9
#> 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8
#> Max. :14.0 Max. :14.0 Max. :14.0 Max. :19
#> y1 y2 y3 y4
#> Min. : 4.260 Min. :3.100 Min. : 5.39 Min. : 5.250
#> 1st Qu.: 6.315 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 6.170
#> Median : 7.580 Median :8.140 Median : 7.11 Median : 7.040
#> Mean : 7.501 Mean :7.501 Mean : 7.50 Mean : 7.501
#> 3rd Qu.: 8.570 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8.190
#> Max. :10.840 Max. :9.260 Max. :12.74 Max. :12.500
Let’s introduce some missing values and then see what we can do about them.
dat <- within(dat_orig, {
y1[1:3] <- NA
y4[3:5] <- NA
})
knitr::kable(dat)
| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | 8 | NA | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | NA | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | NA | 8.74 | 12.74 | NA |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | NA |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | NA |
| 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
| 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
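Before reaching for any package, a base-R one-liner already counts the missing values per column (an added sanity check):
colSums(is.na(dat)) # y1 and y4 each have 3 NAs, all other columns have 0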
Wait - but why do we need to look at missing values? Missing is missing, right?
Well, unfortunately no, my friend.
In classical regression (as well as most other models), R automatically excludes all cases that include missing values (an approach called complete-case analysis). This can drastically limit the amount of information available. Therefore, it can be essential to fill in values where they are missing, but doing so can be very tricky (and also interesting and rewarding).
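To see complete-case analysis in action, here is a minimal added sketch: we fit a regression on the data with NAs and count the cases R actually used.
fit <- lm(y1 ~ x1, data = dat) # rows 1 to 3 have NA in y1
nobs(fit) # 8 cases used instead of 11: the incomplete rows were dropped silently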
Here, we look at a few methods for displaying the NA values and their patterns, ranging from basic to not-so-basic approaches.
# first thing to do - always - is
summary(dat$y1)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 4.260 5.465 7.785 7.492 9.098 10.840 3
# summary(dat$y4) #let's stick with y1
# Table
table(dat$y1) # attention: NAs are not displayed here
#>
#> 4.26 4.82 5.68 7.24 8.33 8.81 9.96 10.84
#> 1 1 1 1 1 1 1 1
table(dat$y1, useNA = "ifany") # better
#>
#> 4.26 4.82 5.68 7.24 8.33 8.81 9.96 10.84 <NA>
#> 1 1 1 1 1 1 1 1 3
The MICE package (the name stands for Multiple Imputation by Chained Equations) can help in several ways. Here, we’ll use it for finding the NA patterns:
require(mice)
#> Loading required package: mice
#> Warning: package 'mice' was built under R version 3.3.3
md.pattern(dat) # md: missing data...pattern
#> x1 x2 x3 x4 y2 y3 y1 y4
#> 6 1 1 1 1 1 1 1 1 0
#> 2 1 1 1 1 1 1 0 1 1
#> 2 1 1 1 1 1 1 1 0 1
#> 1 1 1 1 1 1 1 0 0 2
#> 0 0 0 0 0 0 3 3 6
We see that there are 4 patterns: 6 observations with complete information, 2 observations with an NA in y1, 2 others with an NA in y4, and 1 observation with an NA in both y1 and y4.
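The same counts can be reproduced in base R with a cross-table of the two missingness indicators (an added sketch):
table(y1_NA = is.na(dat$y1), y4_NA = is.na(dat$y4)) # 6 complete, 2 only-y1, 2 only-y4, 1 both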
The VIM package might help us to see the patterns visually (VIM is an abbreviation for Visualization and Imputation of Missing Values):
require(VIM)
#> Loading required package: VIM
#> Warning: package 'VIM' was built under R version 3.3.3
#> Loading required package: colorspace
#> Loading required package: grid
#> Loading required package: data.table
#> VIM is ready to use.
#> Since version 4.0.0 the GUI is in its own package VIMGUI.
#>
#> Please use the package to use the new (and old) GUI.
#> Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
#>
#> Attaching package: 'VIM'
#> The following object is masked from 'package:datasets':
#>
#> sleep
marginplot(dat[c(5, 8)])
For the plot, we only specified columns 5 and 8, i.e. y1 and y4. We can therefore see the pattern for these two columns as follows:
- The blue dots represent individuals that have observed values in both y1 and y4.
- The blue boxes are their box plots.
- The red dots represent observations where y1 is missing but y4 is observed (left vertical margin), or vice versa: where y4 is missing but y1 is observed (bottom horizontal margin).
- The red boxes are the box plots of these two groups.
Under the MCAR (Missing Completely At Random) assumption, the red and the blue box plots should be identical.
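A numeric counterpart to the margin box plots is easy to add (a small sketch, not part of the original analysis): compare the distribution of y4 between rows where y1 is observed and rows where it is missing.
by(dat$y4, is.na(dat$y1), summary) # under MCAR, both groups should look roughly alike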
To examine the MAR assumption (Missing At Random), we can use the pbox() function:
# from the VIM package:
# Parallel boxplots:
pbox(dat, pos = 2)
The MAR assumption states that missingness depends on the other observed variable(s) but not on the missing variable(s) itself. In the above plot, we are looking at the missing values and how they are distributed w.r.t. x2 (= position 2 in our case).
Evidence for MAR would be missing-value distributions that sit much higher or much lower than those of the non-missing values. In our example, however, the box plots mostly overlap: the missing values in y1 and y4 seem to be randomly distributed w.r.t. x2, which supports the MCAR assumption rather than the MAR assumption.
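Again, a quick numeric check along the same lines (an added sketch): group x2 by whether y1 is missing and compare the means.
tapply(dat$x2, is.na(dat$y1), mean) # similar group means are consistent with MCAR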
The function aggr() is an additional tool to visualize the missing values. The left plot shows which variables are affected, and the right plot shows the pattern, like md.pattern() from the mice package before.
aggr(dat, numbers = TRUE, sortVars = TRUE, labels = names(dat), cex.axis = .7, gap = 3, ylab = c("Proportion of missingness", "Missingness Pattern"))
#>
#> Variables sorted by number of missings:
#> Variable Count
#> y1 0.2727273
#> y4 0.2727273
#> x1 0.0000000
#> x2 0.0000000
#> x3 0.0000000
#> x4 0.0000000
#> y2 0.0000000
#> y3 0.0000000
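The proportions in this list are simply column-wise NA shares, which we can recompute by hand (an added one-liner):
round(colMeans(is.na(dat)), 7) # 3 NAs / 11 rows = 0.2727273 for y1 and y4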
Let’s go back to the MICE package and impute the values. The main function for imputing is mice(). Note the argument m = 7: we create 7 imputed datasets (5 is the default):
imp1 <- mice(dat, m = 7)
#>
#> iter imp variable
#> 1 1 y1 y4
#> 1 2 y1 y4
#> 1 3 y1 y4
#> 1 4 y1 y4
#> 1 5 y1 y4
#> 1 6 y1 y4
#> 1 7 y1 y4
#> 2 1 y1 y4
#> 2 2 y1 y4
#> 2 3 y1 y4
#> 2 4 y1 y4
#> 2 5 y1 y4
#> 2 6 y1 y4
#> 2 7 y1 y4
#> 3 1 y1 y4
#> 3 2 y1 y4
#> 3 3 y1 y4
#> 3 4 y1 y4
#> 3 5 y1 y4
#> 3 6 y1 y4
#> 3 7 y1 y4
#> 4 1 y1 y4
#> 4 2 y1 y4
#> 4 3 y1 y4
#> 4 4 y1 y4
#> 4 5 y1 y4
#> 4 6 y1 y4
#> 4 7 y1 y4
#> 5 1 y1 y4
#> 5 2 y1 y4
#> 5 3 y1 y4
#> 5 4 y1 y4
#> 5 5 y1 y4
#> 5 6 y1 y4
#> 5 7 y1 y4
So what happened here exactly?
The output object is imp1 and its summary looks like this:
summary(imp1$imp)
#> Length Class Mode
#> x1 0 -none- NULL
#> x2 0 -none- NULL
#> x3 0 -none- NULL
#> x4 0 -none- NULL
#> y1 7 data.frame list
#> y2 0 -none- NULL
#> y3 0 -none- NULL
#> y4 7 data.frame list
We see beautifully that we have a data frame for each of the two variables that had missing values. Let’s look at y1:
imp1$imp$y1
#> 1 2 3 4 5 6 7
#> 1 4.82 7.24 9.96 7.24 4.82 7.24 5.68
#> 2 5.68 9.96 8.33 10.84 8.81 8.33 4.82
#> 3 10.84 10.84 8.81 9.96 8.33 9.96 9.96
imp1$method
#> x1 x2 x3 x4 y1 y2 y3 y4
#> "" "" "" "" "pmm" "" "" "pmm"You remember, we used 7 iterations and per variable, we introduced 3 missing values. The method used to impute was pmm (predictive mean matching). Now we have to decide which dataset (from the 7 possibilities) to use to fill the missing values. Here, we use all 7 as follows:
dat_imputed <- complete(imp1, 7)
The next step is to check how good the fit is.
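As a side note (an added sketch, not in the original walkthrough): if you need all 7 completed datasets at once rather than a single one, complete() also accepts action = "long" and stacks them with .imp and .id index columns.
dat_long <- complete(imp1, action = "long")
dim(dat_long) # 77 rows = 7 imputations x 11 observations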
To visualize the result, the xyplot() comes in quite handy. Remember that we had missing data in y1 and y4.
require(lattice)
#> Loading required package: lattice
#> Warning: package 'lattice' was built under R version 3.3.3
#original:
plot(dat_orig$x1, dat_orig$y1)
xyplot(imp1, y1 ~ x1 | .imp, pch = 20, cex = 1.4) # lattice-style plot method provided by mice
Here again, the blue points are the observed data and the red points are the imputed data. Ideally, the red points should be similar to the blue ones, so that the imputed values look plausible.
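Besides xyplot(), mice also provides a stripplot() method that shows observed and imputed values per imputation side by side (an added sketch):
stripplot(imp1, pch = 20, cex = 1.2) # blue = observed, red = imputed, one strip per variable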
Just as with the xyplot(), the red (imputed) densities should be similar to the blue (observed) density for the imputations to be plausible:
densityplot(imp1)
We now have 7 different sets of imputed values, which differ somewhat from each other (refer to the xyplot and the densityplot). To create a more robust model, we can use all 7 imputed datasets for modelling.
# create the 7 models
# stick to a simple linear regression line
lm_7_models <- with(imp1, lm(y1 ~ y4 + x1))
# pool the results together
# mice::pool()
comb_7_models <- pool(lm_7_models)
summary(comb_7_models)
#> est se t df Pr(>|t|) lo 95
#> (Intercept) 5.6488756 2.9390958 1.921977 4.900900 0.11380986 -1.95249287
#> y4 -0.3173652 0.2605539 -1.218041 5.452826 0.27327404 -0.97075160
#> x1 0.4703296 0.1604511 2.931295 5.465491 0.02929274 0.06822196
#> hi 95 nmis fmi lambda
#> (Intercept) 13.2502440 NA 0.4176955 0.2203343
#> y4 0.3360211 3 0.3508315 0.1496271
#> x1 0.8724372 0 0.3492844 0.1479956
This gives us one pooled set of coefficient estimates for our linear regression model, combining all 7 imputations.
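For comparison, here is the complete-case fit on the data with NAs (an added sketch); it uses only the 6 fully observed rows and is therefore based on much less information:
fit_cc <- lm(y1 ~ y4 + x1, data = dat) # listwise deletion: only rows 6 to 11 remain
summary(fit_cc)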