For this exercise, you will use the ICU dataset (this is the same data we used in weeks 1 through 3). You can download the icu.dta Stata file, or you can also access the data through this CSV file.
library(knitr)
knitr::opts_chunk$set(message = FALSE,
warning = FALSE,
collapse = TRUE,
fig.align = 'center',
cache = TRUE,
echo = TRUE,
results = 'asis')
library(ztable)
## Welcome to package ztable ver 0.1.5
options(ztable.type = 'html')
filename <- "icu.csv"
url = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1"
if (!file.exists(filename)) {
download.file(url, "icu.csv")
}
icu <- read.csv("icu.CSV")
library(dplyr)
names(icu) <- tolower(names(icu))
icu <- tbl_df(icu)
# icu
From the ICU data, use as the outcome variable vital status (STA) and CPR prior to ICU admission (CPR) as a covariate.
fit <- glm(sta ~ cpr, data = icu, family = binomial(link = logit))
ztable(fit)
Estimate | Std. Error | z value | Pr(>|z|) | OR | lcl | ucl | |
---|---|---|---|---|---|---|---|
(Intercept) | -1.5404 | 0.1918 | -8.03 | 0.0000 | 0.21 | 0.14 | 0.31 |
a) Demonstrate that the value of the log-odds ratio obtained from the cross-classification of STA by CPR is identical to the estimated slope coefficient from the logistic regression of STA on CPR. Verify that the estimated standard error of the estimated slope coefficient for CPR obtained from the logistic regression package is identical to the square root of the sum of the inverse of the cell frequencies from the cross-classification of STA by CPR. Use either set of computations to obtain 95% CI for the odds ratio. What aspect concerning the coding of the variable CPR makes the calculations for the two methods equivalent?
library(pander)
pander(tbl <- icu %>%
select(sta, cpr) %>%
table)
0 | 1 | |
---|---|---|
0 | 154 | 6 |
1 | 33 | 7 |
Odds ratio of dying for people who had received a CPR compared to people who didn’t can be calculated from the above table as:
\[OR(cpr = 1, cpr = 0) = \frac{\frac{7}{33}}{\frac{6}{154}}= 5.4\]
Standard error of the log of odds ratio, i.e. the coefficient of CPR
from the regression fit can be found from the formula:
\[Var(\ln(OR)) = \frac{1}{7}+\frac{1}{33}+\frac{1}{6}+\frac{1}{154} \]
(OR.cpr <- (7/33)/(6/154)) # OR calculated from the table
[1] 5.444444
(beta.cpr <- coef(fit)[2])
cpr
1.694596
exp(beta.cpr) # OR as exponentiated coefficient for CPR from the logistic regression fit
cpr
5.444444
# Standard error of the coefficient for CPR:
(se.cpr <- sqrt(1/7+1/33+1/6+1/154))
[1] 0.5884899
(se.cpr.fit <- summary(fit)$coef[2,2])
[1] 0.5884899
# 95 % confident interval for the odds ratio
(CI <- exp(beta.cpr + c(-1,1)*qnorm(0.975, lower.tail = TRUE)*se.cpr))
[1] 1.718027 17.253495
exp(confint(fit))
2.5 % 97.5 %
(Intercept) 0.1447167 0.3077347 cpr 1.7047493 17.9389765
exp(confint.default(fit))
2.5 % 97.5 %
(Intercept) 0.1471337 0.312086 cpr 1.7180273 17.253494
Odds ratio as calculated from the table and the one calculated by exponentiating the coefficient for the CPR
from the logistic regression fit are the same because we used the reference cell coding.
b) For purposes of illustration, use a data transformation statement to recode, for this problem only, the variable CPR as follows: 4 = no and 2 = yes. Perform the logistic regression of STA on CPR (recoded). Demonstrate how the calculation of the logit difference of CPR = yes versus CPR = no is equivalent to the value of the log-odds ratio obtained in exercise 1-a. Use the results from the logistic regression to obtain the 95% CI for the odds ratio and verify that they are the same limits as obtained in Exercise 1-a.
icu$cpr.recoded <- ifelse(icu$cpr == 0, 4, 2)
fit.recoded <- glm(sta ~ cpr.recoded, data = icu, family = binomial(link = logit))
ztable(fit.recoded)
Estimate | Std. Error | z value | Pr(>|z|) | OR | lcl | ucl | |
---|---|---|---|---|---|---|---|
(Intercept) | 1.8487 | 1.1291 | 1.64 | 0.1016 | 6.35 | 0.68 | 63.13 |
\[Odds(STA = 1| CPR) = \frac{\pi_{CPR}}{1-\pi_{CPR}} \]
\[OR(CPR = yes, CPR = no) = \frac{Odds(STA=1|CPR=yes)}{Odds(STA=1|CPR = no)} \] \[=\dfrac{\dfrac{\pi_{CPR = yes}}{1-\pi_{CPR = yes}}} {\dfrac{\pi_{CPR = no}}{1-\pi_{CPR = no}}} = e^{g(CPR = yes) - g(CPR = no)} \]
\[= e^{2\cdot \beta_1 - 4 \cdot \beta_1 } = e^{-2\beta_1}\]
\[Var(\log OR.recoded) = (-2\beta_1)\cdot(-2\beta_1 = 4 \cdot \beta_1^2=4 \cdot Var(\beta_1)=\]
(beta1.recoded <- coef(fit.recoded)[2])
cpr.recoded -0.8472979
(OR.recoded <- exp(-2*beta1.recoded))
cpr.recoded 5.444444
OR.cpr
[1] 5.444444
(var.beta1.recoded <- (summary(fit.recoded)$coef[2,2])^2) #variance of the slope from the recoded fit
[1] 0.08658009
(Var.OR.recoded <- 4*var.beta1.recoded)
[1] 0.3463203
(SE.OR.recoded <- sqrt(Var.OR.recoded))
[1] 0.5884899
(CI.recoded <- exp(-2*beta1.recoded + c(-1,1)*qnorm(0.975, lower.tail = TRUE)*SE.OR.recoded))
[1] 1.718027 17.253494
ztable(exp(confint(fit)))
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | 0.14 | 0.31 |
ztable(exp(confint.default(fit)))
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | 0.15 | 0.31 |
c) Consider the ICU data and use as the outcome variable vital status (STA) and race (RACE) as a covariate. Prepare a table showing the coding of the two design variables for RACE using the value RACE = 1, white, as the reference group. Show that the estimated log-odds ratios obtained from the cross-classification of STA by RACE, using RACE = 1 as the reference group, are identical to estimated slope coefficients for the two design variables from the logistic regression of STA on RACE. Verify that the estimated standard errors of the estimated slope coefficients for the two design variables for RACE are identical to the square root of the sum of the inverse of the cell frequencies from the cross-classification of STA by RACE used to calculate the odds ratio. Use either set of computations to compute the 95% CI for the odds ratios.
# fitting the model:
icu$race <- as.factor(icu$race)
fit2 <- glm(sta ~ race, data = icu, family = binomial(link = logit))
ztable(fit2)
Estimate | Std. Error | z value | Pr(>|z|) | OR | lcl | ucl | |
---|---|---|---|---|---|---|---|
(Intercept) | -1.3163 | 0.1851 | -7.11 | 0.0000 | 0.27 | 0.18 | 0.38 |
ztable(contrasts(icu$race))
2 | 3 | |
---|---|---|
1 | 0.00 | 0.00 |
pander(table(icu$race))
1 | 2 | 3 |
---|---|---|
175 | 15 | 10 |
pander(tbl <- icu %>%
select(sta, race) %>%
table)
1 | 2 | 3 | |
---|---|---|---|
0 | 138 | 14 | 8 |
1 | 37 | 1 | 2 |
(OR.2.1 <- (1/14)/(37/138))
[1] 0.2664093
(SE.log.OR.2 <- sqrt(1/1+1/14+1/37+1/138))
[1] 1.051524
(OR.2.fit2 <- exp(coef(fit2)[2]))
race2
0.2664093
(SE.log.OR.2.fit2 <- summary(fit2)$coef[2,2])
[1] 1.05152
(OR.3.1 <- (2/8)/(37/138))
[1] 0.9324324
(OR.2.fit2 <- exp(coef(fit2)[3]))
race3
0.9324324
(SE.log.OR.3 <- sqrt(1/2+1/8+1/37+1/138))
[1] 0.8119565
(SE.log.OR.3.fit2 <- summary(fit2)$coef[3,2])
[1] 0.8119565
coefs2 <- fit2$coef
CI2 <- exp(coefs2[2] + c(-1,1)*qnorm(0.975, lower.tail = TRUE) * SE.log.OR.2)
CI3 <- exp(coefs2[3] + c(-1,1) * qnorm(0.975, lower.tail = TRUE) * SE.log.OR.3)
df <- data.frame(lower = c(CI2[1], CI3[1]), OR = c(OR.2.1, OR.3.1), upper = c(CI2[2], CI3[2]))
rownames(df) <- c("race2", "race3")
pander(df)
lower | OR | upper | |
---|---|---|---|
race2 | 0.03392 | 0.2664 | 2.092 |
race3 | 0.1899 | 0.9324 | 4.579 |
d) Create design variables for RACE using the method typically employed in ANOVA. Perform the logistic regression of STA on RACE. Show by calculation that the estimated logit differences of RACE = 2 versus RACE = 1 and RACE = 3 versus RACE = 1 are equivalent to the values of the log-odds ratio obtained in problem 1(c). Use the results of the logistic regression to obtain the 95% CI for the odds ratios and verify that they are the same limits as obtained in Exercise 1(c). Note that the estimated covariance matrix for the estimated coefficients is needed to obtain the estimated variances of the logit differences.
dfm <- rbind(c(-1, -1), diag(nrow = 2,ncol = 2))
pander(dfm)
-1 | -1 |
1 | 0 |
0 | 1 |
contrasts(icu$race) <- dfm
pander(contrasts(icu$race))
-1 | -1 |
1 | 0 |
0 | 1 |
fit2.recoded <- glm(sta ~ race, data = icu, family = binomial(link = logit))
ztable(fit2.recoded)
Estimate | Std. Error | z value | Pr(>|z|) | OR | lcl | ucl | |
---|---|---|---|---|---|---|---|
(Intercept) | -1.7806 | 0.4385 | -4.06 | 0.0000 | 0.17 | 0.06 | 0.36 |
e) Consider the logistic regression of STA on CRN and AGE. Consider CRN to be the risk factor and show that AGE is a confounder of the association of CRN with STA. Addition of the interaction of AGE by CRN presents an interesting modeling dilemma. Examine the main effects only and interaction models graphically. Using the graphical results and any significance tests you feel are needed, select the best model (main effects or interaction) and justify your choice. Estimate relevant odds ratios. Repeat this analysis of confounding and interaction for a model that includes CPR as the risk factor and AGE as the potential confounding variable.
`crn` = history of chronic renal failure
fit3a <- glm(sta ~ crn, data = icu, family = binomial(link = logit))
fit3b <- glm(sta ~ crn + age, data = icu, family = binomial(link = logit))
(loglik3a <- fit3a$deviance / (-2))
[1] -97.36837
(loglik3b <- fit3b$deviance / (-2))
[1] -94.30229
(G <- (-2) * (loglik3a - loglik3b))
[1] 6.132161
(pval <- 2 * pchisq(G, df = 1, lower.tail = FALSE))
[1] 0.02654891
(delta.beta.percent <- (fit3a$coef[2] - fit3b$coef[2])/fit3b$coef[2] * 100)
crn
19.60082
library(car)
icu$logit.crn1 <- logit(predict(fit3b, newdata = data.frame(age = icu$age, crn = 1), type = "response"))
icu$logit.crn0 <- logit(predict(fit3b, newdata = data.frame(age = icu$age, crn = 0), type = "response"))
library(ggplot2)
ggplot(data = icu) +
geom_line(aes(x = age, y = logit.crn1), color = "darkgreen") +
geom_line(aes(x = age, y = logit.crn0), color = "red") +
labs(y = "logit") +
annotate(geom = "text", x= c(75, 75), y = c(-1.3, 0), label = c("CRN = 0", "CRN = 1")) +
ggtitle("Model without interaction")
fit3b.interaction <- glm(sta ~ crn * age, data = icu, family = binomial(link = logit))
ztable(fit3b.interaction)
Estimate | Std. Error | z value | Pr(>|z|) | OR | lcl | ucl | |
---|---|---|---|---|---|---|---|
(Intercept) | -3.2979 | 0.7705 | -4.28 | 0.0000 | 0.04 | 0.01 | 0.15 |
icu$logit.crn1.int <- logit(predict(fit3b.interaction, newdata = data.frame(age = icu$age, crn = 1), type = "response"))
icu$logit.crn0.int <- logit(predict(fit3b.interaction, newdata = data.frame(age = icu$age, crn = 0), type = "response"))
max(icu$age)
[1] 92
ggplot(data = icu) +
geom_line(aes(x = age, y = logit.crn1.int), color = "darkgreen") +
geom_line(aes(x = age, y = logit.crn0.int), color = "red") +
ylab("logit") +
ggtitle("Model with interaction") +
annotate(geom = "text", x = c(62.5, 62.5,88), y = c(-0.2, -1.3, -2.3), label = c("CRN = 1", "CRN = 0", "max \n age = 92")) +
geom_vline(xintercept = 92, linetype = "dashed", color = "black")
(loglik3b.int <- fit3b.interaction$deviance / (-2))
[1] -93.68108
(Gb <- (-2) * (loglik3b - loglik3b.int))
[1] 1.242437
(pval.b <- 2 * pchisq(Gb, df = 1, lower.tail = FALSE))
[1] 0.5300038
Based on the value of \(\Delta \beta \%\)(19.6008213) and p-value for the coefficient of age
(0.0197698), as well as the p-value of the likelihood ratio test, I decided that age
is a confounder of the relationship between crn
and sta
.
On the other hand, age
is not a modifier of the relationship between the risk factor and outcome, because the graph of the logit for the model with only the main effects show that the two logits are parallel, and graph for the model with interaction shows that the two logits actually don’t cross in the range of the data - the crossing of the two lines can be only imagined in extrapolation. Also, the likelihood ratio test for models without and with an interaction term (p-value = 0.5300038) says there is no difference between the two models in their predictability of the outcome variable sta
.
Therefore, I decided that age
is a confounder, but not a modifier of the relationship between history of chronic renal failure and chances of dying as a function of being admitted to an intensive care unit.
f) Consider an analysis for confounding and interaction for the model with STA as the outcome, CAN as the risk factor, and TYP as the potential confounding variable. Perform this analysis using logistic regression modeling and Mantel-Haenszel analysis. Compare the results of the two approaches.
fit4a <- glm(sta ~ can, data = icu, family = binomial(link = logit))
ztable(fit4a)
Estimate | Std. Error | z value | Pr(>|z|) | OR | lcl | ucl | |
---|---|---|---|---|---|---|---|
(Intercept) | -1.3863 | 0.1863 | -7.44 | 0.0000 | 0.25 | 0.17 | 0.36 |
fit4b <- glm(sta ~ can + typ, data = icu, family = binomial(link = logit))
ztable(fit4b)
Estimate | Std. Error | z value | Pr(>|z|) | OR | lcl | ucl | |
---|---|---|---|---|---|---|---|
(Intercept) | -3.8202 | 0.8564 | -4.46 | 0.0000 | 0.02 | 0.00 | 0.09 |
(loglik4a <- fit4a$deviance / (-2))
[1] -100.0805
(loglik4b <- fit4b$deviance / (-2))
[1] -91.01196
(G4ab <- (-2) * (loglik4a - loglik4b))
[1] 18.13706
(pval.4ab <- 2 * pchisq(G4ab, df = 1, lower.tail = FALSE))
[1] 4.111236e-05
(delta.beta.percent.4ab <- 100 * (fit4a$coef[2] - fit4b$coef[2]) / fit4b$coef[2])
can -100
P-value for the significance of can
as a risk factor for death as a function of being admitted to ICU says that can
solely is not a significant predictor: an information that cancer is a part of patient’s present problem solely on itself bears no predictive value. However, information that cancer is a part of patient’s present problem coupled with the information that type of admission to hospital is emergency rather than elective, bears more significant value, in sense that they both have positive coefficients, hence they work together in raising the chances of a lethal outcome. Also, likelihood ratio test says that model with type of admission as a confounder is significantly better than the model with only can
as a predictor (p-value = 4.111235710^{-5}), and value of \(\Delta \beta \%\) cannot be higher (\(\Delta \beta \%\) = -100).
#perform a test of homogeneity of odds ratios (odds of sta = 1 for can=1 to can = 0) over strata of variable `typ`
library(vcd)
pander::pander(tbl <-ftable(sta ~ can + typ, data = icu))
“sta” | “0” | “1” | ||
“can” | “typ” | |||
“0” | “0” | 37 | 1 | |
“1” | 107 | 35 | ||
“1” | “0” | 14 | 1 | |
“1” | 2 | 3 |
pander(wt <- woolf_test(table(icu$sta, icu$can, icu$typ))) # i.e. woolf_test for the significance of interaction
Test statistic | df | P value |
---|---|---|
0.1023 | 1 | 0.7491 |
mht <- mantelhaen.test(x = icu$can, y = icu$sta, z = icu$typ) #calculates common odds ratio
pander(mht$estimate)
common odds ratio |
---|
3.893 |
P-value of woolf_test(package = 'vcd')
is 0.7491291 says that there is homogeneity of odd ratios of outcome (sta = 1) between can=1
and can = 0
across strata of typ
. That is, the odds ratios across strata are homogenous.
fit4b.interaction <- glm(sta ~ can * typ, data = icu, family = binomial(link = logit))
ztable(fit4b.interaction)
Estimate | Std. Error | z value | Pr(>|z|) | OR | lcl | ucl | |
---|---|---|---|---|---|---|---|
(Intercept) | -3.6109 | 1.0134 | -3.56 | 0.0004 | 0.03 | 0.00 | 0.12 |
(loglik4b.int <- fit4b.interaction$deviance / (-2))
[1] -90.96089
(G4b.int <- (-2) * (loglik4b - loglik4b.int))
[1] 0.1021223
(pval.4bint <- 2 * pchisq(G4b.int, df = 1, lower.tail = FALSE))
[1] 1.498595
We can see that p-value of the Wald statistic for the coefficient of the interaction term (0.7491285) is approximately equal to the p-value of the woolf_test
(0.7491291)