Homework 10

##Kevin Kuipers ##October 30, 2018

##Problem 1 )

Consider the respiratory data from the HSAUR3 package.

First I will load the some of the libraries that are needed and the data set. I will also use the ?repsiratory to underatand the variables and what they are.

library(HSAUR3)
library(gee)
library(lme4)
data(respiratory)

#?repsiratory

##Problem 1 A)

a. Investigate the use of other correlational structures 
than the independence and exchangeable structures used in 
the text for the respiratory data.

#Looking at the function to solve Generalized Estimation Equaion Model

First I will use the ?gee() command to explore the arguments of the Generalized Estimation Equation Model are and specifically look at the corstr to see what other correlation structures are permitted.

#Just uncomment to look at the generalized estimation equation model.
#gee()

It appears there are 5 different correlation structures other than independence and exchangeable. I will fit a model for each one:

Fixed stat_M_dep non_stat_M_dep AR-M unstructured

#Rearranging the data

Before fitting the models I will use the same code found in the textbook for creating a baseline status and converting the data to a data frame. The baseline will be the status for month = 0. The nstat variable is a dummy coding variable. The code used was found the Textbook from Chapter 14.

resp <- subset(respiratory, month > '0')
resp$baseline <- rep(subset(respiratory, month == '0')$status,
                     rep(4, 111))
resp$nstat <- as.numeric(resp$status == 'good')
resp$month <- resp$month[, drop = TRUE]


names(resp)[names(resp) == 'treatment'] <- 'trt'
levels(resp$trt)[2] <- 'trt'

#Fitting Models

library(MESS)
#
#resp_fixed <- gee(nstat ~ centre + trt + gender + baseline + age,
#                  data = resp, family = 'binomial', id = subject,
#                  corstr = 'fixed', scale.fix = TRUE,
#                  scale.value = 1)


#Fitting a generalized estimation equation with a correlation structure as sat_M_dep
resp_stat_M_dep <- gee(nstat ~ centre + trt + gender + baseline + age,
                  data = resp, family = 'binomial', id = subject,
                  corstr = 'stat_M_dep', scale.fix = TRUE,
                  scale.value = 1)

##  (Intercept)      centre2       trttrt   gendermale baselinegood 
##  -0.90017133   0.67160098   1.29921589   0.11924365   1.88202860 
##          age 
##  -0.01816588

#Fitting a generalized estimation equation with a correlation structure as non_sat_M_dep
resp_non_stat_M_Dep <- gee(nstat ~ centre + trt + gender + baseline + age,
                  data = resp, family = 'binomial', id = subject,
                  corstr = 'non_stat_M_dep', scale.fix = TRUE,
                  scale.value = 1)

##  (Intercept)      centre2       trttrt   gendermale baselinegood 
##  -0.90017133   0.67160098   1.29921589   0.11924365   1.88202860 
##          age 
##  -0.01816588

#Fitting a generalized estimation equation with a correlation structure as AR-M
resp_arm <- gee(nstat ~ centre + trt + gender + baseline + age,
                  data = resp, family = 'binomial', id = subject,
                  corstr = 'AR-M', scale.fix = TRUE,
                  scale.value = 1)

##  (Intercept)      centre2       trttrt   gendermale baselinegood 
##  -0.90017133   0.67160098   1.29921589   0.11924365   1.88202860 
##          age 
##  -0.01816588

#Fitting a generalized estimation equation with a correlation structure as unstructured
resp_unstructured <- gee(nstat ~ centre + trt + gender + baseline + age,
                  data = resp, family = 'binomial', id = subject,
                  corstr = 'unstructured', scale.fix = TRUE,
                  scale.value = 1)

##  (Intercept)      centre2       trttrt   gendermale baselinegood 
##  -0.90017133   0.67160098   1.29921589   0.11924365   1.88202860 
##          age 
##  -0.01816588

It turns out that I was not able to run a gee model for a correlation structure as fixed. Rstudio become hung up everytime. So it is still in the code chunk but just commeted it.

##Problem 1 B)

b. Which model is the best? Compare the following models: independent, 
exchangable, and what ever models you tried in part 
(a). Justify your answer. (Hint: use QIC  in MESS library), 
MSE, misclassification rate, comparison of naive vs robust Z-score, or another method, 
be sure to state your method)

First I will fit a generalized estimation equation model for both independence and exchangeable correlation structures. Then I will run QIC and MSE tests to see which models perform the best.

library(geepack)

#Fitting a generalized estimation equation with a correlation structure as independence
resp_independence <- gee(nstat ~ centre + trt + gender + baseline + age,
                  data = resp, family = 'binomial', id = subject,
                  corstr = 'independence', scale.fix = TRUE,
                  scale.value = 1)

##  (Intercept)      centre2       trttrt   gendermale baselinegood 
##  -0.90017133   0.67160098   1.29921589   0.11924365   1.88202860 
##          age 
##  -0.01816588

#Fitting a generalized estimation equation with a correlation structure as exchangeable
resp_exchangeable <- gee(nstat ~ centre + trt + gender + baseline + age,
                  data = resp, family = 'binomial', id = subject,
                  corstr = 'exchangeable', scale.fix = TRUE,
                  scale.value = 1)

##  (Intercept)      centre2       trttrt   gendermale baselinegood 
##  -0.90017133   0.67160098   1.29921589   0.11924365   1.88202860 
##          age 
##  -0.01816588

library('MuMIn')
#QIC Values for Each model 
resp_stat_M_dep_QIC <- QIC(resp_stat_M_dep)
resp_non_stat_M_Dep_QIC <- QIC(resp_non_stat_M_Dep)
resp_arm_QIC <- QIC(resp_arm)
resp_unstructured_QIC <- QIC(resp_unstructured)
resp_independence_QIC <- QIC(resp_independence)
resp_exchangeable_QIC <- QIC(resp_exchangeable)


#MSE Values for each Model
resp_stat_M_dep_MSE <- mean((resp_stat_M_dep$fitted.values-subset(resp$nstat, resp$nstat != "NA"))^2)
resp_non_stat_M_Dep_MSE <- mean((resp_non_stat_M_Dep$fitted.values-subset(resp$nstat, resp$nstat != "NA"))^2)
resp_arm_MSE <- mean((resp_arm$fitted.values-subset(resp$nstat, resp$nstat != "NA"))^2)
resp_unstructured_MSE <- mean((resp_unstructured$fitted.values-subset(resp$nstat, resp$nstat != "NA"))^2)
resp_independence_MSE <- mean((resp_independence$fitted.values-subset(resp$nstat, resp$nstat != "NA"))^2)
resp_exchangeable_MSE <- mean((resp_exchangeable$fitted.values-subset(resp$nstat, resp$nstat != "NA"))^2)

qic_values <-c(resp_stat_M_dep_QIC,resp_non_stat_M_Dep_QIC, resp_arm_QIC, resp_unstructured_QIC, resp_independence_QIC, resp_exchangeable_QIC)

MSE_values <-c(resp_stat_M_dep_MSE,resp_non_stat_M_Dep_MSE, resp_arm_MSE, resp_unstructured_MSE, resp_independence_MSE, resp_exchangeable_MSE)


#creating a table of QIC values and MSE Values for each model correlation structure 
df <- data.frame(QIC = (qic_values), MSE = (MSE_values))
rownames(df) <- c('stat_M_dep','non_stat_M_Dep','AR-M','Unstructured','Independence','Exchangeable')
df

#Unstructured Naive and Robust values

summary(resp_unstructured)$coefficient

##                 Estimate Naive S.E.    Naive z Robust S.E.   Robust z
## (Intercept)  -0.93127982 0.47918516 -1.9434655  0.46124991 -2.0190352
## centre2       0.67279473 0.33907786  1.9841895  0.35482020  1.8961568
## trttrt        1.27891545 0.33544088  3.8126404  0.34944996  3.6597956
## gendermale    0.09467354 0.41729637  0.2268736  0.44362952  0.2134068
## baselinegood  1.93462516 0.34281837  5.6432949  0.34804681  5.5585200
## age          -0.01688917 0.01255736 -1.3449620  0.01290535 -1.3086948

#Inpendence Naive and Robust values

summary(resp_independence)$coefficient

##                 Estimate  Naive S.E.   Naive z Robust S.E.   Robust z
## (Intercept)  -0.90017133 0.337653052 -2.665965  0.46032700 -1.9555041
## centre2       0.67160098 0.239566599  2.803400  0.35681913  1.8821889
## trttrt        1.29921589 0.236841017  5.485603  0.35077797  3.7038127
## gendermale    0.11924365 0.294671045  0.404667  0.44320235  0.2690501
## baselinegood  1.88202860 0.241290221  7.799854  0.35005152  5.3764332
## age          -0.01816588 0.008864403 -2.049306  0.01300426 -1.3969169

#Exchangeable Naive and Robust values

summary(resp_exchangeable)$coefficient

##                 Estimate Naive S.E.    Naive z Robust S.E.   Robust z
## (Intercept)  -0.90017133  0.4784634 -1.8813796  0.46032700 -1.9555041
## centre2       0.67160098  0.3394723  1.9783676  0.35681913  1.8821889
## trttrt        1.29921589  0.3356101  3.8712064  0.35077797  3.7038127
## gendermale    0.11924365  0.4175568  0.2855747  0.44320235  0.2690501
## baselinegood  1.88202860  0.3419147  5.5043802  0.35005152  5.3764332
## age          -0.01816588  0.0125611 -1.4462014  0.01300426 -1.3969169

##Discusion:

After fitting several Generalized Estimation Equation models with different correlation structures it appears that Independence and Exchangeable models have the lowest MSE values. However, Unstructured has the lowest QIC value with Indepedence and Exchangeable close behind with the same MSE and QIC values. Due to this closeness between QIC and MSE with these three models I decided to look at the Naive and Robust z values. It appears that independence has the biggest changes between the Naive z and Robust z values. I prefer a model have not a drastic difference between these two sets of z-values. Therefore, I want to turn my attention between Exchangeable and unstructured. They both do not contain as much variance between for the variables in the model and their Naive z vs their Robust z. Due to lower MSE between these two models I would choose the exchangeable model for the model of selection. But they both seem to be performing close the same.

##Problem 2

The data set schizophrenia2 from HSAUR3 package were collected in a follow-up study of women patients with schizophrenia (Davis, 2002). The binary response recorded at 0, 2, 6, 8 and 10 months after hospitalization was ``thought disorder’’ (absent or present). The single covariate is the factor indicating whether a patient had suffered early or late onset of her condition (age of onset less than 20 years or age of onset 20 years or above). The question of interest is whether the course of the illness differs between patients with early and late onset schizophrenia.

#Loading Data set
data(schizophrenia2)

##Problem 2 A)

a. plots and summary statistics

#Summary Statistics

I will use describeBy on the data set to acquire summary statistics of the variables. This command is found in the psych library.

#Onset Less than 20 years old summary

library(psych)
describeBy(subset(schizophrenia2, onset=='< 20 yrs'))

#Onset Greater than 20 years old summary

describeBy(subset(schizophrenia2, onset!='< 20 yrs'))

#Disorder present summary

describeBy(subset(schizophrenia2, disorder=='present'))

#Disorder absent summary

describeBy(subset(schizophrenia2, disorder=='absent'))

#Mosaic Plots

The mosaic plots will also include the missing data as an onset category. I will create binary onset category. If less than 20 years then 0 if the onset category is greater than 20 years then it is 1. I will also create number variables for the different types of disorder categoryes: Missing is 2 Present is 0 Absent is 1

After doing all this I will put it back into a data frame.

#creating a binary variable for onset, 0 if less than 20, 1 if greater than 20 years old
onsetb <- ifelse(schizophrenia2$onset=='< 20 yrs', 0 , 1)

#Creating a category for missing data
disorderb <- ifelse(is.na(schizophrenia2$disorder), 'missing', ifelse(schizophrenia2$disorder=='present','present',ifelse(schizophrenia2$disorder=='absent','absent','')))
#creating a data frame with the binary variables
disorderc <- ifelse(is.na(schizophrenia2$disorder), '2', ifelse(schizophrenia2$disorder=='present','0',ifelse(schizophrenia2$disorder=='absent','1','')))

sciz <- data.frame (
  subject = schizophrenia2$subject,
    onset = schizophrenia2$onset,
    disorder = schizophrenia2$disorder,
    month = schizophrenia2$month,
  onsetb = onsetb,
  disorderb = disorderb,
  disorderc = disorderc
)


library(vcd)
library(ggmosaic)

mosaicplot(onset ~ month + disorderb, data=sciz, color=4:6, las=1, main='Mosaic Plot of Onset Schizophrenia Explained by Month & Disorder', xlab='Onset Category',ylab='Month')

ggplot(data=sciz) + geom_mosaic(aes(x=product(month, onset), fill=disorderb, offset=0.5)) + labs(title='Mosaic Plot of Onset Schizophrenia Explained by Month & Disorder',x='Onset Category',y='Months') + scale_y_continuous(position='left',labels=c(0,2,6,8,10))

#Boxplots

#Boxplots of Month vs Disorder category

layout(matrix(1:2, ncol=2))
boxplot(month ~ disorderb, data=subset(sciz, onset=='< 20 yrs'), ylab='Month', xlab='Disorder Category', main = 'Less than 20 Years')

boxplot(month ~ disorderb, data=subset(sciz, onset!='< 20 yrs'), ylab='Month', xlab='Disorder Category', main = 'Greater than 20 Years')

ggplot(sciz, aes(disorderb, month)) + geom_boxplot() + facet_grid(~onset) + labs(title='Boxplots of Disorder Category vs Month',x='Disorder Category',y='Months')

##Problem 2 B)

b. the GEE approach

I will fit GEE models using several different correlation structures. I will use all the the same correlation structures found in problem 1. I will then compare the Model’s QIC score. The formula for fitting the models will be disorder is explained by monthg and onset category. The library I will use is MuMIn for calculating the QIC scores for the model seleciton pertaining to the GEE model correlation structures. The lowest QIC score would be the I would choose.

library(geepack)
library(MESS)
library(gee)

#Fitting a generalized estimation equation with a correlation structure as sat_M_dep
sciz_stat_M_dep <- gee(disorderc ~ month + onsetb, 
                       data=sciz, family = 'binomial', id=subject, 
                       corstr = 'stat_M_dep', scale.fix=TRUE, scale.value = 1)

## (Intercept)       month      onsetb 
##  -0.7543183   0.3048683   0.3595812

#Fitting a generalized estimation equation with a correlation structure as non_sat_M_dep
sciz_non_stat_M_dep <- gee(disorderc ~ month + onsetb, data=sciz, family='binomial', id=subject, 
    corstr = 'non_stat_M_dep', scale.fix=TRUE, scale.value = 1)

## (Intercept)       month      onsetb 
##  -0.7543183   0.3048683   0.3595812

#Fitting a generalized estimation equation with a correlation structure as AR-M
sciz_arm <- gee(disorderc ~ month + onsetb, data=sciz, family='binomial', id=subject, 
    corstr = 'AR-M', scale.fix=TRUE, scale.value = 1)

## (Intercept)       month      onsetb 
##  -0.7543183   0.3048683   0.3595812

#Fitting a generalized estimation equation with a correlation structure as AR-M
sciz_unstructured <- gee(disorderc ~ month + onsetb, data=sciz, family='binomial', id=subject, 
    corstr = 'unstructured', scale.fix=TRUE, scale.value=1)

## (Intercept)       month      onsetb 
##  -0.7543183   0.3048683   0.3595812

#Fitting a generalized estimation equation with a correlation structure as independence
sciz_independence <- gee(disorderc ~ month + onsetb, data=sciz, family='binomial', id=subject, 
    corstr = 'independence', scale.fix=TRUE, scale.value = 1)

## (Intercept)       month      onsetb 
##  -0.7543183   0.3048683   0.3595812

#Fitting a generalized estimation equation with a correlation structure as exchangeable
sciz_exchangeable <- gee(disorderc ~ month + onsetb, data=sciz, family='binomial', id=subject, 
    corstr = 'exchangeable', scale.fix=TRUE, scale.value=1)

## (Intercept)       month      onsetb 
##  -0.7543183   0.3048683   0.3595812

library(MuMIn)

#Model selection on QIC
model.sel(sciz_stat_M_dep, sciz_non_stat_M_dep, sciz_arm, sciz_unstructured, sciz_independence, sciz_exchangeable, rank=QIC)

It appears the unstructured correlation structure produces the lowest QIC value.

##Problem 2 C)

c. mixed effects model (lmer) from previous chapter

#Fitting lmer mixed model with disorder explained by onset & month & (1|subject)

sciz1 <- data.frame (
  subject = as.numeric(schizophrenia2$subject),
    onset = schizophrenia2$onset,
    disorder = schizophrenia2$disorder,
    month = as.numeric(schizophrenia2$month),
  onsetb = as.numeric(onsetb),
  disorderb = disorderb,
  disorderc = as.numeric(disorderc)
)


library("gee")
library("lme4")
library("Matrix")
library("multcomp")
library('tidyverse')

sciz_lmer <- lmer(disorderc ~ onsetb + month + (1|subject), data=sciz1, REML=FALSE, na.action=na.omit)

#Problem 2 D)

d. Is there a difference? What model(s) work best? Describe your results.

I will look at the summaries of each model

#Summary of unstructured correlation

summary(sciz_unstructured)

## 
##  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
##  gee S-function, version 4.13 modified 98/01/27 (1998) 
## 
## Model:
##  Link:                      Logit 
##  Variance to Mean Relation: Binomial 
##  Correlation Structure:     Unstructured 
## 
## Call:
## gee(formula = disorderc ~ month + onsetb, id = subject, data = sciz, 
##     family = "binomial", corstr = "unstructured", scale.fix = TRUE, 
##     scale.value = 1)
## 
## Summary of Residuals:
##         Min          1Q      Median          3Q         Max 
## -0.99126755 -0.31537323  0.02581796  0.11786844  1.41914258 
## 
## 
## Coefficients:
##               Estimate Naive S.E.   Naive z Robust S.E.   Robust z
## (Intercept) -0.7751170  0.3013160 -2.572439   0.4199193 -1.8458713
## month        0.5507056  0.1146183  4.804691   0.3717890  1.4812317
## onsetb      -0.5163445  0.5142307 -1.004111   0.9228023 -0.5595396
## 
## Estimated Scale Parameter:  1
## Number of Iterations:  5
## 
## Working Correlation
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 1.00000000 0.05326946 0.05486696 0.08754649 0.09336227
## [2,] 0.05326946 1.00000000 0.06912575 0.10586554 0.15979905
## [3,] 0.05486696 0.06912575 1.00000000 0.47494142 0.90699664
## [4,] 0.08754649 0.10586554 0.47494142 1.00000000 1.48092437
## [5,] 0.09336227 0.15979905 0.90699664 1.48092437 1.00000000

#Summary of mixed models approach

summary(sciz_lmer)

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: disorderc ~ onsetb + month + (1 | subject)
##    Data: sciz1
## 
##      AIC      BIC   logLik deviance df.resid 
##    298.2    315.1   -144.1    288.2      215 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4676 -0.6317  0.0239  0.7361  2.1974 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 0.09675  0.3110  
##  Residual             0.16498  0.4062  
## Number of obs: 220, groups:  subject, 44
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 0.363782   0.074352   4.893
## onsetb      0.006250   0.121928   0.051
## month       0.073071   0.007382   9.898
## 
## Correlation of Fixed Effects:
##        (Intr) onsetb
## onsetb -0.447       
## month  -0.516  0.000

cftest(sciz_lmer)

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lmer(formula = disorderc ~ onsetb + month + (1 | subject), data = sciz1, 
##     REML = FALSE, na.action = na.omit)
## 
## Linear Hypotheses:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept) == 0 0.363782   0.074352   4.893 9.95e-07 ***
## onsetb == 0      0.006250   0.121928   0.051    0.959    
## month == 0       0.073071   0.007382   9.898  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Univariate p values reported)

#Discussion:

It appears that there are trade offs between both model approaches. If we assume the AIC and QIC are equivelant scores it appears that the GEE method produces better results. However, if comparing the standard error values in the lmer method to the robust and Naive standard errors, mixed model has lower values. Something that is concerning with the GEE approach is that there is alittle jump in values between the naive and robust standard errors, not huge but something worth mentioning. The robust Z values are lower than the Z values for the mixed model. When looking at the summary of the residuals between the two models the (Min, Q1, Median, Q3, and Max) the GEE model has lower values. Even though both models perform better in certain ways, I believe the GEE approach outweighs the mixed model.

Homework 10

STAT 601