1 - Libraries Used

library(ggplot2) #for plots
library(gridExtra) #arranging multiple plots
library(plyr) #ddply for grouped summaries
library(zscorer) #data set anthro3
library(rjags) #JAGS
library(rethinking) #Statistical Rethinking library
library(kableExtra) #pretty tables

2 - Introduction

3 - Types of Missing Data

3.1 MCAR

3.2 MAR

Here is an example of Missing at Random data in the predictor space. The data set anthro3 used here comes from the library zscorer.

library(zscorer)
head(anthro3)
##   psu age sex weight height muac oedema
## 1   1  10   1    5.7   64.2  125      2
## 2   1  10   2    5.8   64.4  121      2
## 3   1   9   2    6.5   62.2  139      2
## 4   1  11   9    6.5   64.9  129      2
## 5   1  24   2    6.5   72.9  120      2
## 6   1  12   2    6.6   69.4  126      2

The data set as given has no missing values, so we will introduce missingness as a function of height: taller subjects are more likely to have a missing weight value. Since height is positively correlated with weight, we expect the two groups (missing vs. observed) to have dissimilar weight distributions.

missing.total <- floor(nrow(anthro3) * .30)
#sampling probabilities proportional to height ('weight' here means
#sampling weight, not body weight)
naive.weight <- anthro3$height/sum(anthro3$height)
#make the height dependence more extreme: the shortest subjects never go missing
naive.weight <- ifelse(naive.weight < 0.0042, 0, naive.weight)
set.seed(10)
p <- list()
for(i in 1:4) {
  #draw the missing rows with probability proportional to height (MAR)
  miss.index <- sample(1:nrow(anthro3), size = missing.total, prob = naive.weight)
  w.obs <- anthro3$weight[-miss.index]
  w.miss <- anthro3$weight[miss.index]
  group <- c(rep("Obs", length(w.obs)), rep("Miss", length(w.miss)))
  df <- data.frame(group = group, Weight = c(w.obs, w.miss))
  p[[i]] <- ggplot(df, aes(x = group, y = Weight)) +
    geom_boxplot() + ggtitle('Weight Miss vs Obs MAR')
}
grid.arrange(p[[1]], p[[2]], p[[3]], p[[4]], nrow = 2)

As we can see, across four samples in which the missingness process was MAR (i.e., a function of another predictor variable), the two groups ‘Miss’ and ‘Obs’ did not show similar distributions. In fact, the ‘Miss’ group consistently had higher weights. Let’s compare this to an MCAR situation, in which the missingness process is ‘completely at random’ (i.e., just a random indexing of the rows).

p <- list()
set.seed(10)
for(i in 1:4) {
  #draw the missing rows uniformly at random, without replacement (MCAR),
  #so exactly missing.total distinct rows go missing, as in the MAR example
  rand.index <- sample(1:nrow(anthro3), size = missing.total)
  w.obs <- anthro3$weight[-rand.index]
  w.miss <- anthro3$weight[rand.index]
  group <- c(rep("Obs", length(w.obs)), rep("Miss", length(w.miss)))
  df <- data.frame(group = group, Weight = c(w.obs, w.miss))
  p[[i]] <- ggplot(df, aes(x = group, y = Weight)) +
    geom_boxplot() + ggtitle('Weight Miss vs Obs MCAR')
}
grid.arrange(p[[1]], p[[2]], p[[3]], p[[4]], nrow = 2)

In this MCAR example, the two groups ‘Miss’ and ‘Obs’ had similar weight distributions; in 2 of the 4 samples the ‘Obs’ group in fact had slightly higher weights. This illustrates that under MCAR the missing and observed groups show similar distributions, because the missingness process is NOT a function of any of the other predictor variables.
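
As a quick numeric check (a sketch added here on top of the plots), we can repeat both schemes many times and compare the average weight gap between the ‘Miss’ and ‘Obs’ groups. Under MAR the gap should be consistently positive; under MCAR it should hover around zero.

#repeat both missingness schemes and record the 'Miss' minus 'Obs' mean gap
set.seed(10)
reps <- 1000
gap.mar <- gap.mcar <- numeric(reps)
for(i in 1:reps) {
  mar.index <- sample(1:nrow(anthro3), size = missing.total, prob = naive.weight)
  mcar.index <- sample(1:nrow(anthro3), size = missing.total)
  gap.mar[i] <- mean(anthro3$weight[mar.index]) - mean(anthro3$weight[-mar.index])
  gap.mcar[i] <- mean(anthro3$weight[mcar.index]) - mean(anthro3$weight[-mcar.index])
}
#average gap over the replications: positive under MAR, near zero under MCAR
c(MAR = mean(gap.mar), MCAR = mean(gap.mcar))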

3.3 MNAR

4 - Example 1.) Missing predictor variable using JAGS

4.1 - Data and Model

We will follow example 15.2.2 from Chapter 15 of Statistical Rethinking, 2nd edition (McElreath, 2019).

The data set data(milk) (from the rethinking library) records, for a number of primate species, the neocortex percent of the brain (\(B\)), the body mass (\(M\)), and the energy content of the milk in kcal per gram (\(K\)). We consider the following statistical model.

\[ (K_i|\mu_i, \sigma) \sim Normal(\mu_i, \sigma)\\ \mu_i = \alpha + \beta_B B_i + \beta_M \log(M_i) \]

In other words, the mean energy K (kcal/gram) in the species milk is a linear combination of the species’ neocortex percent (\(B\)) and the \(\log()\) of the body mass (\(M\)).

Let’s take a look at our data.

data(milk)
head(milk[,-1])
##              species kcal.per.g perc.fat perc.protein perc.lactose mass
## 1     Eulemur fulvus       0.49    16.60        15.42        67.98 1.95
## 2           E macaco       0.51    19.27        16.91        63.82 2.09
## 3           E mongoz       0.46    14.11        16.85        69.04 2.51
## 4      E rubriventer       0.48    14.91        13.18        71.91 1.62
## 5        Lemur catta       0.60    27.28        19.50        53.22 2.19
## 6 Alouatta seniculus       0.47    21.22        23.58        55.20 5.25
##   neocortex.perc
## 1          55.16
## 2             NA
## 3             NA
## 4             NA
## 5             NA
## 6          64.54

As we can see, the neocortex percent column has quite a few missing values. In fact, of the 29 recorded species, 12 are missing these values.
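
A quick count confirms this:

sum(is.na(milk$neocortex.perc))
## [1] 12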

We could of course just delete all the missing cases. But that would be eliminating about 40% of our data. Instead, we will impute the missing values. The simplest model (that ignores correlation with body mass \(M\)) will impute the missing values of neocortex percent \(B\) from its own Normal distribution (McElreath, 2019).

One thing to note before proceeding: is this the best prior? No. As already stated, it ignores the positive correlation that \(B\) has with \(M\). Furthermore, \(B\) is a percent, so once rescaled it is naturally constrained to values between 0 and 1. That said, we’ll start with this example and then try to improve upon it in later sections.
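
For instance, one could impute the raw proportion directly on the \((0, 1)\) scale with a Beta distribution. A hypothetical JAGS fragment illustrating the idea (a sketch only; B.prop, a, and b are illustrative names and hyperpriors, not part of the model fit below):

# hypothetical alternative: impute the unstandardized proportion with a Beta,
# which respects the (0, 1) scale (sketch only; not used below)
for(i in 1:29) {
   B.prop[i] ~ dbeta(a, b)
}
a ~ dexp(0.1)
b ~ dexp(0.1)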

So our full model (as stated in McElreath, 2019):

\[ (K_i|\mu_i, \sigma) \sim Normal(\mu_i, \sigma)\\ \mu_i = \alpha + \beta_B B_i + \beta_M \log(M_i)\\ B_i \sim Normal(\nu, \sigma_B)\\ \alpha \sim Normal(0, 0.5)\\ \beta_B \sim Normal(0, 0.5)\\ \beta_M \sim Normal(0, 0.5)\\ \sigma \sim Exponential(1)\\ \nu \sim Normal(0.5, 1)\\ \sigma_B \sim Exponential(1) \]

4.2 - JAGS Implementation and imputed \(B\) values

We will implement this model using JAGS.

First write our static model to the file ‘milk.txt’.

cat("model {
   for(i in 1:29) {
      K[i] ~ dnorm(mu[i], sigma)
      mu[i] <- alpha + beta.B*(B[i]) + beta.M*(M[i])
   }

  # model for missing exposure variable
  for(i in 1:29) {  
      B[i] ~ dnorm(nu, sigma.b) 
  }
   
   #prior on sigma, nu, and sigma.b
   sigma ~ dexp(1)
   nu ~ dnorm(0.5, 1)
   sigma.b ~ dexp(1)
  
  # priors on regression coefficients
   alpha ~ dnorm(0,0.5)
   beta.B ~ dnorm(0,0.5) 
   beta.M ~ dnorm(0, 0.5)

}", file="milk.txt")
#data: standardize everything
na.index <- which(is.na(milk$neocortex.perc))
milk$neocortex.prop <- milk$neocortex.perc/100
stand.K <- (milk$kcal.per.g - mean(milk$kcal.per.g))/sd(milk$kcal.per.g)
stand.B <- (milk$neocortex.prop - mean(milk$neocortex.prop, na.rm = TRUE))/sd(milk$neocortex.prop, na.rm = TRUE)
stand.logM <- (log(milk$mass) - mean(log(milk$mass)))/sd(log(milk$mass))
dat <- list(
  K = stand.K,
  B = stand.B,
  M = stand.logM)
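
Note that dat$B still contains the 12 NA entries. These are exactly what JAGS treats as unobserved stochastic nodes to be sampled (imputed) alongside the 6 model parameters, which accounts for the 18 unobserved stochastic nodes reported below.

#sanity check: the NAs in B are the nodes JAGS will impute
sum(is.na(dat$B))
## [1] 12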

#Run the model and get mean estimates
milk.model <- jags.model(file = 'milk.txt', data = dat, n.chains = 2, n.adapt = 1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 46
##    Unobserved stochastic nodes: 18
##    Total graph size: 180
## 
## Initializing model
#keep the first of the two chains for the summaries below
milk.samples <- coda.samples(milk.model, variable.names = c("B", "M", "alpha", "nu", "beta.B", "beta.M"), n.iter = 5000)[[1]]
burnin <- 2000
param.estimates <- milk.samples[-c(1:burnin),]
param.estimates.means <- apply(param.estimates, 2, mean)
#the first 29 columns of the samples are B[1], ..., B[29]
imputes <- as.numeric(param.estimates.means[c(1:29)])

#present results of imputed B vector
orig <- stand.B
string.vec <- rep('', 29)
for(i in 1:29) {
  string.vec[i] <- paste('B[', i, ']')
}

results.df <- data.frame(estimate.names = string.vec, imputes = imputes, original.data = orig)
x_html <- knitr::kable(results.df, "html")
kable_styling(x_html, "striped", position = "left", font_size = 11)
estimate.names imputes original.data
B[ 1 ] -2.0801960 -2.0801960
B[ 2 ] -0.6420022 NA
B[ 3 ] -0.7729520 NA
B[ 4 ] -0.8122292 NA
B[ 5 ] -0.3302219 NA
B[ 6 ] -0.5086413 -0.5086413
B[ 7 ] -0.5086413 -0.5086413
B[ 8 ] 0.0107425 0.0107425
B[ 9 ] 0.4420388 NA
B[ 10 ] 0.2134697 0.2134697
B[ 11 ] -1.4619618 -1.4619618
B[ 12 ] -0.9861393 -0.9861393
B[ 13 ] -1.2156734 -1.2156734
B[ 14 ] -0.2603690 NA
B[ 15 ] 0.2044777 NA
B[ 16 ] 0.4011180 0.4011180
B[ 17 ] 0.3382774 NA
B[ 18 ] 0.4748370 0.4748370
B[ 19 ] 0.5791706 NA
B[ 20 ] 0.9757910 0.9757910
B[ 21 ] -0.4585621 NA
B[ 22 ] -0.0076873 -0.0076873
B[ 23 ] -0.2834924 NA
B[ 24 ] 0.6172487 0.6172487
B[ 25 ] 0.8417565 0.8417565
B[ 26 ] 0.3538085 NA
B[ 27 ] 0.4463547 0.4463547
B[ 28 ] 1.4616661 1.4616661
B[ 29 ] 1.3259562 1.3259562

As we can see, the elements of \(B\) (in its standardized form) that were already observed were left unchanged; only the missing values were imputed.
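
We can verify this programmatically, using the na.index vector computed earlier (all.equal compares up to numerical tolerance):

#observed entries pass through unchanged; only the NA positions were filled
all.equal(imputes[-na.index], orig[-na.index])
## [1] TRUE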

4.3 - Comparing Deletion and Imputing (Coefficients of \(\mu_i = \alpha + \beta_B B_i + \beta_M \log(M_i)\))

For comparison, we will also run the model after deleting all rows with missing values, and compare the slope coefficient estimates \(\beta_B\) and \(\beta_M\) across the two models.

Write our static JAGS model to ‘milk_obs.txt’

cat("model {
   for(i in 1:17) {
      K[i] ~ dnorm(mu[i], sigma)
      mu[i] <- alpha + beta.B*(B[i]) + beta.M*(M[i])
   }

  # model for missing exposure variable
  for(i in 1:17) {  
      B[i] ~ dnorm(nu, sigma.b) 
  }
   
   #prior on sigma, nu, and sigma.b
   sigma ~ dexp(1)
   nu ~ dnorm(0.5, 1)
   sigma.b ~ dexp(1)
  
  # priors on regression coefficients
   alpha ~ dnorm(0,0.5)
   beta.B ~ dnorm(0,0.5) 
   beta.M ~ dnorm(0, 0.5)

}", file="milk_obs.txt")
#run model if we drop all missing cases
obs.index <- which(!is.na(milk$neocortex.prop))
dat.list.obs <- list(
  K = dat$K[obs.index],
  B = dat$B[obs.index],
  M = dat$M[obs.index])

milk.model.obs <- jags.model(file = 'milk_obs.txt', data = dat.list.obs, n.chains = 2, n.adapt = 1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 34
##    Unobserved stochastic nodes: 6
##    Total graph size: 109
## 
## Initializing model
milk.samples.obs <- coda.samples(milk.model.obs, variable.names = c("B", "M", "alpha", "nu", "beta.B", "beta.M"), n.iter = 5000)[[1]]

Graphing our distributions.

#Compare the distributions of beta's from our two models (observed only vs imputed)
burnin <- 2000
impute <- milk.samples[-c(1:burnin),]
#index the coda columns by name rather than by position
impute.beta.b <- impute[, "beta.B"]
impute.beta.m <- impute[, "beta.M"]


obs <- milk.samples.obs[-c(1:burnin),]
obs.beta.b <- obs[, "beta.B"]
obs.beta.m <- obs[, "beta.M"]

#------------------------------------------
##beta.b
df.beta.b <- data.frame(
              method = factor(rep(c("Impute", "Obs.Only"), each = length(impute.beta.b))), 
              beta.b = c(impute.beta.b, obs.beta.b)
              )
mu.b <- ddply(df.beta.b, "method", summarise, grp.mean = mean(beta.b))
                        
#overlaid density plots
p.b <- ggplot(df.beta.b, aes(x=beta.b, color=method)) +
  geom_density() + 
  geom_vline(data=mu.b, aes(xintercept=grp.mean, color=method),
                              linetype="dashed") +
  ggtitle('Beta.B Posterior Distribution')
#------------------------------------------
##beta.m
df.beta.m <- data.frame(
  method = factor(rep(c("Impute", "Obs.Only"), each = length(impute.beta.b))), 
  beta.m = c(impute.beta.m, obs.beta.m)
)
mu.m <- ddply(df.beta.m, "method", summarise, grp.mean = mean(beta.m))

#overlaid density plots
p.m <- ggplot(df.beta.m, aes(x=beta.m, color=method)) +
  geom_density() + 
  geom_vline(data=mu.m, aes(xintercept=grp.mean, color=method),
             linetype="dashed") +
  ggtitle('Beta.M Posterior Distribution')

grid.arrange(p.b, p.m, nrow = 2)

Some things to note from our graphs.

1.) Imputing decreased the variance of the \(\hat{\beta}_B\) and \(\hat{\beta}_M\) posterior distributions. This makes intuitive sense, since we used more information.

2.) Imputing caused shrinkage in the coefficients. This suggests that the mean \(\mu_i\) of the milk energy \(K\) is less influenced by body mass \(M\) and neocortex percent \(B\) than the complete-case analysis implies.
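
Both observations can be read off numerically with the same ddply() summaries used for the plot means (a sketch; this check was not part of the original output). The ‘Impute’ rows should show smaller standard deviations and means closer to zero.

#posterior mean and sd of each coefficient, by method
ddply(df.beta.b, "method", summarise,
      post.mean = mean(beta.b), post.sd = sd(beta.b))
ddply(df.beta.m, "method", summarise,
      post.mean = mean(beta.m), post.sd = sd(beta.m))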

5 - Example 2.) ?