Exercise One

For this exercise, you will use the Myopia Study dataset and you will need to refer to the results from Week Two, Exercise 1. If you need the file you can download the MYOPIA.dta Stata file, or you can also access the data through the CSV file.

ex1Dt <- read.csv("MYOPIA-fixed.csv")
# logistic regression from Week 2, Ex 1
logEx1.1 <- glm(MYOPIC ~ SPHEQ, family = binomial(), data = ex1Dt)
# I'll also plot the model just for the sake of it
ggplot(ex1Dt, aes(x = SPHEQ, y = MYOPIC)) + geom_jitter(shape = "O", position = position_jitter(height = .02)) + stat_smooth(method="glm", family="binomial", se= FALSE)

Complete the following:

  1. Using the results from Week Two, Exercise 1, compute 95 percent confidence intervals for the slope coefficient SPHEQ. Write a sentence interpreting this confidence.

    There are 2 types of confidence interval functions

    • confint(): CIs using profiled log-likelihood
    • confint.defaut() : CIs using standard errors (we’ll use this one)
    cbind("beta" = logEx1.1$coef, confint.default(logEx1.1))
    ##                    beta      2.5 %     97.5 %
    ## (Intercept)  0.05397314 -0.3512461  0.4591923
    ## SPHEQ       -3.83309758 -4.6530870 -3.0131081
  2. Use Stata (or your preferred analysis software tool) to obtain the estimated covariance matrix. Then compute the logit and estimated logistic probability for a subject with SPHEQ = 2. Then evaluate the endpoints of the 95 percent confidence intervals for the logit and estimated logistic probability. Write a sentence interpreting the estimated probability and its confidence interval.

    The covariance matrix is calculated as:

    vcov(logEx1.1)
    ##             (Intercept)       SPHEQ
    ## (Intercept)  0.04274486 -0.06337768
    ## SPHEQ       -0.06337768  0.17503316

    For the prediction:

    # Determine the SPHEQ value to use in predictions
    data2predict <- data.frame(SPHEQ = 2)
    
    # Calculate the predicted logit and 95% CI limits
    logitPrd <- predict(logEx1.1, data2predict, type = "link", se = TRUE)
    logitPrd.Lims <- logitPrd$fit + c(-1, 1) * qnorm(0.975) * logitPrd$se.fit
    logitPrd.DF <- data.frame(cbind(logitPrd$fit, rbind(logitPrd.Lims)))
    
    # Calculate the predicted probability and 95% CI limits using plogis
    probPrd <- sapply(logitPrd.DF[1, ], plogis)
    
    # Neat output
    prd <- data.frame(rbind(logitPrd.DF, probPrd))
    rownames(prd) <- c("LOGIT", "PROB")
    colnames(prd) <- c("PRD", "LL", "UL")

    The prediction table prd becomes:

    ##                 PRD            LL          UL
    ## LOGIT -7.6122220300 -8.9833100196 -6.24113404
    ## PROB   0.0004941278  0.0001254711  0.00194386

Exercise Two

For this exercise, you will use the ICU Study dataset and you will need to refer to the results from Week One, Exercise 2, Part (d). If you need the file you can download the icu.dta Stata file, or you can also access the data through the CSV file.

ex2Dt <- read.csv("icu.csv")
# logistic regression from Week 1, Ex 2
logEx2.1 <- glm(STA ~ AGE, family = binomial(), data = ex2Dt)
# I'll also plot the model just for the sake of it
ggplot(ex2Dt, aes(x = AGE, y = STA)) + geom_jitter(shape = "O", position = position_jitter(height = .02)) + stat_smooth(method="glm", family="binomial", se= FALSE)

Complete the following:

  1. Using the results from Week One, Exercise 2, Part (d), compute 95 percent confidence intervals for the slope and constant term. Write a sentence interpreting the confidence interval for the slope.

    cbind("beta" = logEx2.1$coef, confint.default(logEx2.1))
    ##                    beta       2.5 %      97.5 %
    ## (Intercept) -3.05851323 -4.42280739 -1.69421908
    ## AGE          0.02754261  0.00683723  0.04824799
  2. Obtain the estimated covariance matrix for the model fit from Week One, Exercise 2, Part (d). Then compute the logit and estimated logistic probability for a 60-year old subject. Then compute a 95 percent confidence intervals for the logit and estimated logistic probability. Write a sentence or two interpreting the estimated probability and its confidence interval.

    # Covariance Matrix:
    vcov(logEx2.1)
    ##              (Intercept)           AGE
    ## (Intercept)  0.484529087 -0.0071029945
    ## AGE         -0.007102994  0.0001116015
    # Determine the SPHEQ value to use in predictions
    data2predict.2 <- data.frame(AGE = 60)
    
    # Calculate the predicted logit and 95% CI limits
    logitPrd.2 <- predict(logEx2.1, data2predict.2, type = "link", se = TRUE)
    logitPrd.2.Lims <- logitPrd.2$fit + c(-1, 1) * qnorm(0.975) * logitPrd.2$se.fit
    logitPrd.2.DF <- data.frame(cbind(logitPrd.2$fit, rbind(logitPrd.2.Lims)))
    
    # Calculate the predicted probability and 95% CI limits using plogis
    probPrd.2 <- sapply(logitPrd.2.DF[1, ], plogis)
    
    # Neat output
    prd.2 <- data.frame(rbind(logitPrd.2.DF, probPrd.2))
    rownames(prd.2) <- c("LOGIT", "PROB")
    colnames(prd.2) <- c("PRD", "LL", "UL")

    The prediction table prd.2 becomes:

    ##              PRD         LL         UL
    ## LOGIT -1.4059567 -1.7670123 -1.0449011
    ## PROB   0.1968726  0.1459143  0.2602054

    It is observed that a 60-year old individual has \(0.2\) probability of dying in the ICU. And 95 times out of 100 similar samples will have ratios between \([0.15, 0.26]\)

Exercise Three

For this exercise, you will use the ICU Study dataset. If you need the file you can download the icu.dta Stata file, or you can also access the data through the CSV file.

First, using the ICU data, consider the multiple logistic regression model of vital status, STA, on age (AGE), cancer part of the present problem (CAN), CPR prior to ICU admission (CPR), infection probable at ICU admission (INF), and race (RACE).*

Then, complete the following:

  1. The variable RACE is coded at three levels. Prepare a table showing the coding of the two design variables necessary for including this variable in a logistic regression model.

    RACE RC02 RC03
    White 0 0
    Black 1 0
    Other 0 1
  2. Write down the equation for the logistic regression model of STA on AGE, CAN, CPR, INF, and RACE. Write down the equation for the logit transformation of this logistic regression model. How many parameters does this model contain?

    The logistic regression model is as follows: \[ \pi(X) = \frac{e^{\beta_0 + \beta_1(AGE) + \beta_2(CAN) + \beta_3(CPR) + \beta_4(INF) + \beta_5(RC02) + \beta_6(RC03)}}{1-e^{\beta_0 + \beta_1(AGE) + \beta_2(CAN) + \beta_3(CPR) + \beta_4(INF) + \beta_5(RC02) + \beta_6(RC03)}} \] Where, \(X = \text{vector of covariates}\)

    The logit transformation equation is as follows:

    \[ g(X) = \beta_0 + \beta_1(AGE) + \beta_2(CAN) + \beta_3(CPR) + \beta_4(INF) + \beta_5(RC02) + \beta_6(RC03)\] The model contains \(7\) parameters

  3. Write down an expression for the likelihood and log likelihood for the logistic regression model in part (b). How many likelihood equations are there? Write down an expression for a typical likelihood equation for this problem.

    Likelihood function:

    \[l(\beta) = \prod_{i=0}^{n} \zeta(X_i)\] where; \[ X = \text{set of covariates} \] \[ \zeta(X_i) = \pi(X_i)^{y_i}(1-\pi(X_i))^{1-y_i} \] \[ y_i = \begin{cases} 0 & \quad \text{if patient lived}\\ 1 & \quad \text{if patient died}\\ \end{cases} \]

  4. Using a logistic regression package, obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (b). Using these estimates write down the equation for the fitted values, that is, the estimated logistic probabilities.

    ex2Dt$RACE <- as.factor(ex2Dt$RACE)
    
    logEx3.1 <- glm(STA ~ AGE + CAN + CPR + INF + RACE, family = binomial(), data = ex2Dt)
    
    summary(logEx3.1)
    ## 
    ## Call:
    ## glm(formula = STA ~ AGE + CAN + CPR + INF + RACE, family = binomial(), 
    ##     data = ex2Dt)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.3703  -0.6823  -0.5421  -0.3082   2.5124  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -3.51152    0.81443  -4.312 1.62e-05 ***
    ## AGE          0.02712    0.01159   2.340  0.01926 *  
    ## CAN          0.24451    0.61681   0.396  0.69180    
    ## CPR          1.64650    0.62341   2.641  0.00826 ** 
    ## INF          0.68067    0.38042   1.789  0.07357 .  
    ## RACE2       -0.95708    1.08445  -0.883  0.37748    
    ## RACE3        0.25975    0.87127   0.298  0.76561    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 200.16  on 199  degrees of freedom
    ## Residual deviance: 179.30  on 193  degrees of freedom
    ## AIC: 193.3
    ## 
    ## Number of Fisher Scoring iterations: 5

    Logit transformation equation:

    \[ g(X) = -3.512 + 0.027(AGE) + 0.245(CAN) + 1.647(CPR) + 0.681(INF) + -0.957(RC02) + 0.260(RC03)\]

    Logistic regression model:

    \[ \pi(X) = \frac{e^{-3.512 + 0.027(AGE) + 0.245(CAN) + 1.647(CPR) + 0.681(INF) + -0.957(RC02) + 0.260(RC03)}}{1-e^{-3.512 + 0.027(AGE) + 0.245(CAN) + 1.647(CPR) + 0.681(INF) + -0.957(RC02) + 0.260(RC03)}} \]

  5. Using the results of the output from the logistic regression package used in part (d), assess the significance of the slope coefficients for the variables in the model using the likelihood ratio test. What assumptions are needed for the p-values computed for this test to be valid? What is the value of the deviance for the fitted model?

    Deviance is found as 179.3

    For the likelihood ratio test one should perdorm a \(\chi^2\) test with \(6\) degrees of freedom on the difference between null deviance and the model deviance.

    pchisq(logEx3.1$null.deviance - logEx3.1$deviance, 6, lower.tail = FALSE)
    ## [1] 0.001943745

    Since \(p<0.05\) the model is significant and at least one of the parameters has predicting value on death probability.

  6. Use the Wald statistics to obtain an approximation to the significance of the individual slope coefficients for the variables in the model. Fit a reduced model that eliminates those variables with nonsignificant Wald statistics. Assess the joint (conditional) significance of the variables excluded from the model. Present the results of fitting the reduced model in a table.

    Wald Statistics are shown in the summary table above. All the parameters with \(p<0.5\) are considered significant (also expressed with ‘.’ and ’*’). One could perform backwards elimination according to Wald statistics and obtain:

    ## 
    ## Call:
    ## glm(formula = STA ~ AGE + CPR + INF, family = binomial(), data = ex2Dt)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.3633  -0.6810  -0.5524  -0.3091   2.4868  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -3.57604    0.77306  -4.626 3.73e-06 ***
    ## AGE          0.02792    0.01136   2.458  0.01397 *  
    ## CPR          1.63066    0.61553   2.649  0.00807 ** 
    ## INF          0.69708    0.37750   1.847  0.06481 .  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 200.16  on 199  degrees of freedom
    ## Residual deviance: 180.51  on 196  degrees of freedom
    ## AIC: 188.51
    ## 
    ## Number of Fisher Scoring iterations: 5

    Here INF can be either significant or not, let’s first check with the likelihood ratio test to compare with the previous model. Note the \(\chi^2\) degree of freedom is \(3\) because there are 3 parameters in logEx3.2 vs. 6 parameters in logEx3.1

    pchisq(logEx3.2$deviance - logEx3.1$deviance, 3, lower.tail = FALSE)
    ## [1] 0.74996

    Since \(p>0.05\) the new model is not significantly different than the full model.

    Let’s now do the same to see whether INF has any statistically significant effect. Note the \(\chi^2\) degree of freedom is \(1\)

    ## 
    ## Call:
    ## glm(formula = STA ~ AGE + CPR, family = binomial(), data = ex2Dt)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.3583  -0.7006  -0.5730  -0.3454   2.3871  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -3.35196    0.74550  -4.496 6.92e-06 ***
    ## AGE          0.02961    0.01115   2.656  0.00792 ** 
    ## CPR          1.78409    0.60730   2.938  0.00331 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 200.16  on 199  degrees of freedom
    ## Residual deviance: 183.95  on 197  degrees of freedom
    ## AIC: 189.95
    ## 
    ## Number of Fisher Scoring iterations: 5
    ## [1] 0.06366446

    Since \(p>0.05\) the new model is not significantly different and INF is not a necessary parameter

  7. Using the results from part (f), compute 95 percent confidence intervals for all coefficients in the model. Write a sentence interpreting the confidence intervals for the non-constant covariates.

    cbind("beta" = logEx3.3$coef, confint.default(logEx3.3))
    ##                    beta        2.5 %      97.5 %
    ## (Intercept) -3.35195550 -4.813107596 -1.89080341
    ## AGE          0.02960741  0.007755898  0.05145892
    ## CPR          1.78409209  0.593811578  2.97437261