For this exercise, you will use the ICU dataset. You can download the icu.dta Stata file, or you can also access the data through this CSV file.
a) Consider the ICU data and use as the outcome variable vital status (STA) and race (RACE) as a covariate. Prepare a table showing the coding of the two design variables for RACE using the value RACE = 1, white, as the reference group.
icu_cc <- read.csv("icu_correct_coding.csv")
names(icu_cc) <- tolower(names(icu_cc))
icu_cc[,2] <- as.factor(icu_cc[,2])
icu_cc[,5] <- factor(icu_cc[,5], levels = c("1", "2", "3"), labels = c("White", "Black", "Other"))
fit.icu <- glm(sta ~ race, data = icu_cc, family = binomial(link = logit))
summary(fit.icu)
##
## Call:
## glm(formula = sta ~ race, family = binomial(link = logit), data = icu_cc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6893 -0.6893 -0.6893 -0.3715 2.3272
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.31634 0.18513 -7.110 1.16e-12 ***
## raceBlack -1.32272 1.05152 -1.258 0.208
## raceOther -0.06996 0.81196 -0.086 0.931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 200.16 on 199 degrees of freedom
## Residual deviance: 197.90 on 197 degrees of freedom
## AIC: 203.9
##
## Number of Fisher Scoring iterations: 5
contrasts(icu_cc$race)
## Black Other
## White 0 0
## Black 1 0
## Other 0 1
b) Show that the estimated log-odds ratios obtained from the cross-classification of STA by RACE, using RACE = 1 as the reference group, are identical to estimated slope coefficients for the two design variables from the logistic regression of STA on RACE.
\[logit(race = White) = \ln{\pi}{1 - \pi} = \beta_0 - 1.323*Black - 0.07*Other\]
library(car)
(LogOdds1 <- logit(predict(fit.icu, newdata = data.frame(race = "White"), type = "response")))
## 1
## -1.316336
(LogOdds2 <- logit(predict(fit.icu, newdata = data.frame(race = "Black"), type = "response")))
## 1
## -2.639057
(LogOdds3 <- logit(predict(fit.icu, newdata = data.frame(race = "Other"), type = "response")))
## 1
## -1.386294
(OR21 <- exp(LogOdds2 - LogOdds1)) #odds ratio of dying for people of black race compared to people of white race
## 1
## 0.2664093
log(OR21) # taking natural logarithm of odds ratio
## 1
## -1.322722
coef(fit.icu)[2]
## raceBlack
## -1.322722
(OR31 <- exp(LogOdds3 - LogOdds1)) # ratio of odds of death for people of other races, that are not black, versus whites.
## 1
## 0.9324324
log(OR31) #log of odds ratio
## 1
## -0.06995859
coef(fit.icu)[3] #coefficient for race = 3, or "Other" from logistic regression output
## raceOther
## -0.06995859
c) Verify that the estimated standard errors of the estimated slope coefficients for the two design variables for RACE are identical to the square root of the sum of the inverse of the cell frequencies from the cross-classification of STA by RACE used to calculate the odds ratio.
(smt <- summary(fit.icu)$coef)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.31633577 0.1851308 -7.11030229 1.157891e-12
## raceBlack -1.32272156 1.0515203 -1.25791343 2.084231e-01
## raceOther -0.06995859 0.8119565 -0.08616051 9.313388e-01
(tbl <- table(icu_cc$sta, icu_cc$race))
##
## White Black Other
## 0 138 14 8
## 1 37 1 2
(beta_B <- smt[2,1])
## [1] -1.322722
(SE_Black <- sqrt(1/138 + 1/37 + 1/14 + 1/1))
## [1] 1.051524
smt[2, 2]
## [1] 1.05152
(beta_O <- smt[3, 1])
## [1] -0.06995859
(SE_Other <- sqrt(1/138 + 1/37 + 1/8 + 1/2))
## [1] 0.8119565
smt[3, 2]
## [1] 0.8119565
d) Use either set of computations to compute the 95% Confidence Interval for the odds ratios.
(CI_OR21 <- exp(beta_B + c(-1, 1) * qnorm(0.975, lower.tail = TRUE) * SE_Black))
## [1] 0.03392273 2.09222254
(CI31 <- exp(beta_O + c(-1, 1) * qnorm(0.975, lower.tail = TRUE) * SE_Other))
## [1] 0.1898798 4.5788458
Odds of dying are from 93% less to 2 times higher for blacks versus whites, while odds of dying for people of other races than black or white are from 81% lower to 4.5 times higher than odds of whites of dying when addmitted to intensive care unit.
For this exercise, you will continue using the ICU dataset.
a) Create design variables for RACE using the method typically employed in ANOVA.
# "method typically employed in ANOVA" -> deviation from means coding:
(mt <- rbind( c( -1, -1), diag( c( 1, 1))))
## [,1] [,2]
## [1,] -1 -1
## [2,] 1 0
## [3,] 0 1
contrasts(icu_cc$race)
## Black Other
## White 0 0
## Black 1 0
## Other 0 1
contrasts(icu_cc$race) <- mt
contrasts(icu_cc$race) #deviation from means coding
## [,1] [,2]
## White -1 -1
## Black 1 0
## Other 0 1
b) Perform the logistic regression of STA on RACE.
fit.icu.dfmc <- glm(sta ~ race, data = icu_cc, family = binomial(link = logit))
summary(fit.icu.dfmc)
##
## Call:
## glm(formula = sta ~ race, family = binomial(link = logit), data = icu_cc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6893 -0.6893 -0.6893 -0.3715 2.3272
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7806 0.4385 -4.060 4.9e-05 ***
## race1 -0.8585 0.7412 -1.158 0.247
## race2 0.3943 0.6330 0.623 0.533
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 200.16 on 199 degrees of freedom
## Residual deviance: 197.90 on 197 degrees of freedom
## AIC: 203.9
##
## Number of Fisher Scoring iterations: 5
(coefs <- fit.icu.dfmc$coef)
## (Intercept) race1 race2
## -1.7805625 -0.8584948 0.3942681
c) Show by calculation that the estimated logit differences of RACE = 2 versus RACE = 1 and RACE = 3 versus RACE = 1 are equivalent to the values of the log-odds ratio obtained in exercise 1.
(logitW <- logit(predict(fit.icu.dfmc, newdata = data.frame(race = "White"), type = "response")))
## 1
## -1.316336
coefs[1] + coefs[2] * (-1) + coefs[3] * (-1) # equivalent to logitW, as well as the intercept from the reference cell coding fit
## (Intercept)
## -1.316336
(logitB <- logit(predict(fit.icu.dfmc, newdata = data.frame(race = "Black"), type = "response")))
## 1
## -2.639057
coefs[1] + coefs[2] * 1 + coefs[3] * 0
## (Intercept)
## -2.639057
(logitO <- logit(predict(fit.icu.dfmc, newdata = data.frame(race = "Other"), type = "response")))
## 1
## -1.386294
coefs[1] + coefs[2] * 0 + coefs[3] * 1
## (Intercept)
## -1.386294
coef(fit.icu) # coefficients from reference cell coding
## (Intercept) raceBlack raceOther
## -1.31633577 -1.32272156 -0.06995859
(logOR21 <- logitB - logitW) # identical to coefficient for race = "Black" from ref. cell coding
## 1
## -1.322722
(logOR31 <- logitO - logitW) # identical to coefficient for race = "Other" from rcc
## 1
## -0.06995859
d) Use the results of the logistic regression to obtain the 95% Confidence Interval for the odds ratios and verify that they are the same limits as obtained in exercise 1. Note that the estimated covariance matrix for the estimated coefficients is needed to obtain the estimated variances of the logit differences.
\(in\ deviation\ from\ means\ coding:\) \[logit(race = "White") = \beta_0 + \beta_1 * (-1) + \beta_2 * (-1)\] \[logit(race = "Black") = \beta_0 + \beta_1 * 1 + \beta_2 * 0\] \[logit(race = "Other") = \beta_0 + \beta_1 * 0 + \beta_2 * 1\] \[\log(OR("Black", "White")) = logit(race = "Black") - logit(race = "White")\]
\[\log(OR("Black", "White")) = \beta_0 + \beta_1 - \beta_0 + \beta_1 + \beta_2 = 2 * \beta_1 + \beta_2\] \[Var(log(OR("Black", "White"))) = (2 * \beta_1 + \beta_2)^2\] \[Var(log(OR("Black", "White"))) = 4 * Var(\beta_1) + 4 * Cov(\beta_1, \beta_2) + Var(\beta_2) \]
\[\log(OR("Other", "White")) = logit(race = "Other") - logit(race = "White") \] \[\log(OR("Other", "White")) = \beta_0 + \beta_2 - \beta_0 + \beta_1 + \beta_2 = 2 * \beta_2 + \beta_1\] \[Var(log(OR("Other", "White"))) = (2 * \beta_2 + \beta_1)^2\] \[Var(log(OR("Other", "White"))) = 4 * Var(\beta_2) + 4 * Cov(\beta_1, \beta_2) + Var(\beta_1) \]
(vc_dfmc <- vcov(fit.icu.dfmc))
## (Intercept) race1 race2
## (Intercept) 0.19229945 0.1648411 0.01603389
## race1 0.16484109 0.5494400 -0.37317442
## race2 0.01603389 -0.3731744 0.40063278
(VarOR_BW <- 4 * vc_dfmc[2,2] + 4 * vc_dfmc[2,3] + vc_dfmc[3,3])
## [1] 1.105695
(seOR_BW <- sqrt(VarOR_BW))
## [1] 1.05152
(VarOR_OW <- 4 * vc_dfmc[3,3] + 4 * vc_dfmc[2,3] + vc_dfmc[2,2])
## [1] 0.6592734
(seOR_OW <- sqrt(VarOR_OW))
## [1] 0.8119565
(ciOR_BW <- exp(logOR21 + c(-1, 1) * qnorm(0.975, lower.tail = TRUE) * seOR_BW))
## [1] 0.03392295 2.09220897
(ciOR_OW <- exp(logOR31 + c(-1, 1) * qnorm(0.975, lower.tail = TRUE) * seOR_OW))
## [1] 0.1898798 4.5788458