During office hours, Dr. Matthews explained that a realistic case when it may make sense to regard part of the missingness indicators as missing is really to find a case when When you’re not sure what the missing value means. We are tasked to think of a situation where a missing value is simply missing or does the missingness carry some value. We came up with several examples:
How many dogs do you own? This is a good example because a blank value could mean that they have 0 dogs, or that the data is simply missing. Have you ever had a baby? This is a good example because if a man were answering this, what would he respond? Would the missingness imply a man was taking the survey, a woman was taking the survey and she didn’t have kids, or that the data simply missing. Have you ever had prostate cancer? This is similar to the above scenario, as a woman may leave this question blank because it is not applicable to her.
Part A: The sample mean of the MCAR data is xbar = 98.43, ybar = 70.25
Part B: The sample mean of the MAR data is xbar = 98.13, ybar =70.25
Part C: The sample mean of the MNAR data is xbar = 101.18, ybar= 70.25
The covariance matrices for each missingness mechanism are computed below.
Part D: We notice that covariances are not as good as we slide down the missingness mechanism scale from MCAR to MAR to MNAR. This is because we are removing data points and we have a bias response. It is getting more and more difficult to estimate values as we approach MNAR. This simulation helps us understand the difficulty in MNAR problems and the value in having MCAR problems.
set.seed(1234)
x<-rnorm(100,100,10)
y<-rnorm(100, 70, 6)
dat<-data.frame(x,y)
head(dat)
## x y
## 1 87.92934 72.48714
## 2 102.77429 67.15169
## 3 110.84441 70.39596
## 4 76.54302 66.98513
## 5 104.29125 65.04401
## 6 105.06056 71.00194
summary(dat)
## x y
## Min. : 76.54 Min. :52.87
## 1st Qu.: 91.05 1st Qu.:66.64
## Median : 96.15 Median :70.20
## Mean : 98.43 Mean :70.25
## 3rd Qu.:104.71 3rd Qu.:73.77
## Max. :125.49 Max. :88.26
cov(dat)
## x y
## x 100.883001 -1.578917
## y -1.578917 38.354786
#PART A MCAR
#missingness is unrelated to observed and the missing data points
datMCAR<-dat
datMCAR$x[runif(100<0.5)]<-NA
summary(datMCAR)
## x y
## Min. : 76.54 Min. :52.87
## 1st Qu.: 91.05 1st Qu.:66.64
## Median : 96.15 Median :70.20
## Mean : 98.43 Mean :70.25
## 3rd Qu.:104.71 3rd Qu.:73.77
## Max. :125.49 Max. :88.26
aggr(datMCAR)
matrixplot(datMCAR)
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
cov(datMCAR[!is.na(datMCAR$x),1:2])
## x y
## x 100.883001 -1.578917
## y -1.578917 38.354786
summary(datMCAR)
## x y
## Min. : 76.54 Min. :52.87
## 1st Qu.: 91.05 1st Qu.:66.64
## Median : 96.15 Median :70.20
## Mean : 98.43 Mean :70.25
## 3rd Qu.:104.71 3rd Qu.:73.77
## Max. :125.49 Max. :88.26
#PART B MAR
#all things you want to know about missing can be can learn through missing datas
datMAR <-dat
datMAR$x[datMAR$y>0 & runif(100)>0.5] <-NA
summary(datMAR)
## x y
## Min. : 78.20 Min. :52.87
## 1st Qu.: 92.48 1st Qu.:66.64
## Median : 99.27 Median :70.20
## Mean : 99.53 Mean :70.25
## 3rd Qu.:105.95 3rd Qu.:73.77
## Max. :124.16 Max. :88.26
## NA's :60
aggr(datMAR)
matrixplot(datMAR)
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
cov(datMAR[!is.na(datMAR$x), 1:2])
## x y
## x 111.28175 12.51922
## y 12.51922 32.80739
summary(datMAR)
## x y
## Min. : 78.20 Min. :52.87
## 1st Qu.: 92.48 1st Qu.:66.64
## Median : 99.27 Median :70.20
## Mean : 99.53 Mean :70.25
## 3rd Qu.:105.95 3rd Qu.:73.77
## Max. :124.16 Max. :88.26
## NA's :60
#covariances are not as good because we are removing data points and we have a bias response
#PART C MNAR
#only way to know is through knowing what the data is
datMNAR<-dat
datMNAR$x[datMNAR$x<100 & runif(100)<0.5]<-NA
summary(datMNAR)
## x y
## Min. : 78.20 Min. :52.87
## 1st Qu.: 92.09 1st Qu.:66.64
## Median : 99.18 Median :70.20
## Mean :100.36 Mean :70.25
## 3rd Qu.:106.67 3rd Qu.:73.77
## Max. :125.49 Max. :88.26
## NA's :24
aggr(datMNAR)
matrixplot(datMNAR)
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
cov(datMNAR[!is.na(datMNAR$x), 1:2])
## x y
## x 107.561066 -7.319667
## y -7.319667 35.973143
summary(datMNAR)
## x y
## Min. : 78.20 Min. :52.87
## 1st Qu.: 92.09 1st Qu.:66.64
## Median : 99.18 Median :70.20
## Mean :100.36 Mean :70.25
## 3rd Qu.:106.67 3rd Qu.:73.77
## Max. :125.49 Max. :88.26
## NA's :24
From this logistic regression, we see that x1 and x2 are statistically significant. Probabilititeis of missingness changes with values of x this means missingness is related to values of x1 changing. This means. Thus, this is not MCAR. We cannot conclude anything else.
datata <- read.csv("C:/Users/Kajal/Downloads/datHW1MissingData.csv")
aggr(datata)
matrixplot(datata)
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
logM<- glm(M~ x2, data=datata, family= "binomial")
summary(logM)
##
## Call:
## glm(formula = M ~ x2, family = "binomial", data = datata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5455 -0.6293 -0.4513 -0.2016 1.9156
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7115 0.3202 -5.344 9.08e-08 ***
## x2 1.1738 0.3211 3.655 0.000257 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 102.791 on 99 degrees of freedom
## Residual deviance: 84.997 on 98 degrees of freedom
## AIC: 88.997
##
## Number of Fisher Scoring iterations: 5
#Probably MAR because it is not connected to X2?
#HOW DO YOU READ THIS? ASK IN OFFICE HOURS
#Try to show something is not MCAR,
In this problem, we would follow the rules of the EM algorithm that we learned in class. First, we begin with an initializing step. Next, we would comput the expectation, or the e step. In this case, we would be computing \(\hat{y}\). Following the expectation step, we would compute the maximization step which is simply maximizing the linear model \(x_1 + x_2\). We would repeat the expectation step and compute the \(\hat{y}\) and then follow with the maximization and repeat until the values converge.