Applied Logistic Regression Week 6 Homework

Exercise One

For this exercise, you will use the ICU dataset (this is the same data we used in weeks 1 through 3). You can download the icu.dta Stata file, or you can also access the data through this CSV file.

library(knitr)
knitr::opts_chunk$set(message = FALSE,
                      warning = FALSE,
                      collapse = TRUE,
                      fig.align = 'center',
                      cache = TRUE,
                      echo = TRUE,
                      results = 'asis')
library(ztable)

## Welcome to package ztable ver 0.1.5

options(ztable.type = 'html')

filename <- "icu.csv"
url = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1"
if (!file.exists(filename)) {
    download.file(url, "icu.csv")
}
icu <- read.csv("icu.CSV")

library(dplyr)
names(icu) <- tolower(names(icu))
icu <- tbl_df(icu)   
# icu

From the ICU data, use as the outcome variable vital status (STA) and CPR prior to ICU admission (CPR) as a covariate.

fit <- glm(sta ~ cpr, data = icu, family = binomial(link = logit))
ztable(fit)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	-1.5404	0.1918	-8.03	0.0000	0.21	0.14	0.31
cpr	1.6946	0.5885	2.88	0.0040	5.44	1.70	17.94
Call: glm(formula = sta ~ cpr, family = binomial(link = logit), data = icu)

Complete the following:

a) Demonstrate that the value of the log-odds ratio obtained from the cross-classification of STA by CPR is identical to the estimated slope coefficient from the logistic regression of STA on CPR. Verify that the estimated standard error of the estimated slope coefficient for CPR obtained from the logistic regression package is identical to the square root of the sum of the inverse of the cell frequencies from the cross-classification of STA by CPR. Use either set of computations to obtain 95% CI for the odds ratio. What aspect concerning the coding of the variable CPR makes the calculations for the two methods equivalent?

library(pander)
pander(tbl <- icu %>%
     select(sta, cpr) %>%
     table)

	0	1
0	154	6
1	33	7

Odds ratio of dying for people who had received a CPR compared to people who didn’t can be calculated from the above table as:
\[OR(cpr = 1, cpr = 0) = \frac{\frac{7}{33}}{\frac{6}{154}}= 5.4\]

Standard error of the log of odds ratio, i.e. the coefficient of CPR from the regression fit can be found from the formula:
\[Var(\ln(OR)) = \frac{1}{7}+\frac{1}{33}+\frac{1}{6}+\frac{1}{154} \]

(OR.cpr <- (7/33)/(6/154))  # OR calculated from the table

[1] 5.444444

(beta.cpr <- coef(fit)[2])

cpr

1.694596

exp(beta.cpr)   # OR as exponentiated coefficient for CPR from the logistic regression fit

cpr

5.444444


# Standard error of the coefficient for CPR:
(se.cpr <- sqrt(1/7+1/33+1/6+1/154))

[1] 0.5884899

(se.cpr.fit <- summary(fit)$coef[2,2])

[1] 0.5884899


# 95 % confident interval for the odds ratio
(CI <- exp(beta.cpr + c(-1,1)*qnorm(0.975, lower.tail = TRUE)*se.cpr))

[1] 1.718027 17.253495

exp(confint(fit))

            2.5 %     97.5 %

(Intercept) 0.1447167 0.3077347 cpr 1.7047493 17.9389765

exp(confint.default(fit))

            2.5 %    97.5 %

(Intercept) 0.1471337 0.312086 cpr 1.7180273 17.253494

Odds ratio as calculated from the table and the one calculated by exponentiating the coefficient for the CPR from the logistic regression fit are the same because we used the reference cell coding.

b) For purposes of illustration, use a data transformation statement to recode, for this problem only, the variable CPR as follows: 4 = no and 2 = yes. Perform the logistic regression of STA on CPR (recoded). Demonstrate how the calculation of the logit difference of CPR = yes versus CPR = no is equivalent to the value of the log-odds ratio obtained in exercise 1-a. Use the results from the logistic regression to obtain the 95% CI for the odds ratio and verify that they are the same limits as obtained in Exercise 1-a.

icu$cpr.recoded <- ifelse(icu$cpr == 0, 4, 2)
fit.recoded <- glm(sta ~ cpr.recoded, data = icu, family = binomial(link = logit))
ztable(fit.recoded)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	1.8487	1.1291	1.64	0.1016	6.35	0.68	63.13
cpr.recoded	-0.8473	0.2942	-2.88	0.0040	0.43	0.24	0.77
Call: glm(formula = sta ~ cpr.recoded, family = binomial(link = logit), data = icu)

\[Odds(STA = 1| CPR) = \frac{\pi_{CPR}}{1-\pi_{CPR}} \]
\[OR(CPR = yes, CPR = no) = \frac{Odds(STA=1|CPR=yes)}{Odds(STA=1|CPR = no)} \] \[=\dfrac{\dfrac{\pi_{CPR = yes}}{1-\pi_{CPR = yes}}} {\dfrac{\pi_{CPR = no}}{1-\pi_{CPR = no}}} = e^{g(CPR = yes) - g(CPR = no)} \]
\[= e^{2\cdot \beta_1 - 4 \cdot \beta_1 } = e^{-2\beta_1}\]

\[Var(\log OR.recoded) = (-2\beta_1)\cdot(-2\beta_1 = 4 \cdot \beta_1^2=4 \cdot Var(\beta_1)=\]

(beta1.recoded <- coef(fit.recoded)[2])

cpr.recoded -0.8472979

(OR.recoded <- exp(-2*beta1.recoded))

cpr.recoded 5.444444

OR.cpr

[1] 5.444444

(var.beta1.recoded <- (summary(fit.recoded)$coef[2,2])^2) #variance of the slope from the recoded fit

[1] 0.08658009

(Var.OR.recoded <- 4*var.beta1.recoded)

[1] 0.3463203

(SE.OR.recoded <- sqrt(Var.OR.recoded))

[1] 0.5884899

(CI.recoded <- exp(-2*beta1.recoded + c(-1,1)*qnorm(0.975, lower.tail = TRUE)*SE.OR.recoded))

[1] 1.718027 17.253494

ztable(exp(confint(fit)))


	2.5 %	97.5 %
(Intercept)	0.14	0.31
cpr	1.70	17.94

ztable(exp(confint.default(fit)))


	2.5 %	97.5 %
(Intercept)	0.15	0.31
cpr	1.72	17.25

c) Consider the ICU data and use as the outcome variable vital status (STA) and race (RACE) as a covariate. Prepare a table showing the coding of the two design variables for RACE using the value RACE = 1, white, as the reference group. Show that the estimated log-odds ratios obtained from the cross-classification of STA by RACE, using RACE = 1 as the reference group, are identical to estimated slope coefficients for the two design variables from the logistic regression of STA on RACE. Verify that the estimated standard errors of the estimated slope coefficients for the two design variables for RACE are identical to the square root of the sum of the inverse of the cell frequencies from the cross-classification of STA by RACE used to calculate the odds ratio. Use either set of computations to compute the 95% CI for the odds ratios.

# fitting the model:
icu$race <- as.factor(icu$race)
fit2 <- glm(sta ~ race, data = icu, family = binomial(link = logit))
ztable(fit2)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	-1.3163	0.1851	-7.11	0.0000	0.27	0.18	0.38
race2	-1.3227	1.0515	-1.26	0.2084	0.27	0.01	1.39
race3	-0.0700	0.8120	-0.09	0.9313	0.93	0.14	3.92
Call: glm(formula = sta ~ race, family = binomial(link = logit), data = icu)

ztable(contrasts(icu$race))


	2	3
1	0.00	0.00
2	1.00	0.00
3	0.00	1.00


pander(table(icu$race))

1	2	3
175	15	10

pander(tbl <- icu %>%
           select(sta, race) %>%
           table)

	1	2	3
0	138	14	8
1	37	1	2


(OR.2.1 <- (1/14)/(37/138))

[1] 0.2664093

(SE.log.OR.2 <- sqrt(1/1+1/14+1/37+1/138))

[1] 1.051524

(OR.2.fit2 <- exp(coef(fit2)[2]))

race2

0.2664093

(SE.log.OR.2.fit2 <- summary(fit2)$coef[2,2])

[1] 1.05152


(OR.3.1 <- (2/8)/(37/138))

[1] 0.9324324

(OR.2.fit2 <- exp(coef(fit2)[3]))

race3

0.9324324

(SE.log.OR.3 <- sqrt(1/2+1/8+1/37+1/138))

[1] 0.8119565

(SE.log.OR.3.fit2 <- summary(fit2)$coef[3,2])

[1] 0.8119565

coefs2 <- fit2$coef
CI2 <- exp(coefs2[2] + c(-1,1)*qnorm(0.975, lower.tail = TRUE) * SE.log.OR.2)
CI3 <- exp(coefs2[3] + c(-1,1) * qnorm(0.975, lower.tail = TRUE) * SE.log.OR.3)
df <- data.frame(lower = c(CI2[1], CI3[1]), OR = c(OR.2.1, OR.3.1), upper = c(CI2[2], CI3[2]))
rownames(df) <- c("race2", "race3")
pander(df)

	lower	OR	upper
race2	0.03392	0.2664	2.092
race3	0.1899	0.9324	4.579

d) Create design variables for RACE using the method typically employed in ANOVA. Perform the logistic regression of STA on RACE. Show by calculation that the estimated logit differences of RACE = 2 versus RACE = 1 and RACE = 3 versus RACE = 1 are equivalent to the values of the log-odds ratio obtained in problem 1(c). Use the results of the logistic regression to obtain the 95% CI for the odds ratios and verify that they are the same limits as obtained in Exercise 1(c). Note that the estimated covariance matrix for the estimated coefficients is needed to obtain the estimated variances of the logit differences.

dfm <- rbind(c(-1, -1), diag(nrow = 2,ncol = 2))
pander(dfm)

-1	-1
1	0
0	1

contrasts(icu$race) <- dfm
pander(contrasts(icu$race))

-1	-1
1	0
0	1

fit2.recoded <- glm(sta ~ race, data = icu, family = binomial(link = logit))
ztable(fit2.recoded)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	-1.7806	0.4385	-4.06	0.0000	0.17	0.06	0.36
race1	-0.8585	0.7412	-1.16	0.2468	0.42	0.06	1.47
race2	0.3943	0.6330	0.62	0.5334	1.48	0.37	5.40
Call: glm(formula = sta ~ race, family = binomial(link = logit), data = icu)

e) Consider the logistic regression of STA on CRN and AGE. Consider CRN to be the risk factor and show that AGE is a confounder of the association of CRN with STA. Addition of the interaction of AGE by CRN presents an interesting modeling dilemma. Examine the main effects only and interaction models graphically. Using the graphical results and any significance tests you feel are needed, select the best model (main effects or interaction) and justify your choice. Estimate relevant odds ratios. Repeat this analysis of confounding and interaction for a model that includes CPR as the risk factor and AGE as the potential confounding variable.    
`crn` = history of chronic renal failure

fit3a <- glm(sta ~ crn, data = icu, family = binomial(link = logit))
fit3b <- glm(sta ~ crn + age, data = icu, family = binomial(link = logit))
(loglik3a <- fit3a$deviance / (-2))

[1] -97.36837

(loglik3b <- fit3b$deviance / (-2))

[1] -94.30229

(G <- (-2) * (loglik3a - loglik3b))

[1] 6.132161

(pval <- 2 * pchisq(G, df = 1, lower.tail = FALSE))

[1] 0.02654891


(delta.beta.percent <- (fit3a$coef[2] - fit3b$coef[2])/fit3b$coef[2] * 100)

crn

19.60082


library(car)
icu$logit.crn1 <- logit(predict(fit3b, newdata = data.frame(age = icu$age, crn = 1), type = "response"))
icu$logit.crn0 <- logit(predict(fit3b, newdata = data.frame(age = icu$age, crn = 0), type = "response"))

library(ggplot2)
ggplot(data = icu)  +
    geom_line(aes(x = age, y = logit.crn1), color = "darkgreen")  +
    geom_line(aes(x = age, y = logit.crn0), color = "red")  +
    labs(y = "logit")  +
    annotate(geom = "text", x= c(75, 75), y = c(-1.3, 0), label = c("CRN = 0", "CRN = 1"))  +
    ggtitle("Model without interaction")

                  

  
fit3b.interaction <- glm(sta ~ crn * age, data = icu, family = binomial(link = logit))  
ztable(fit3b.interaction)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	-3.2979	0.7705	-4.28	0.0000	0.04	0.01	0.15
crn	3.5731	2.3223	1.54	0.1239	35.63	0.32	4312.07
age	0.0292	0.0117	2.49	0.0126	1.03	1.01	1.06
crn:age	-0.0381	0.0341	-1.12	0.2634	0.96	0.90	1.03
Call: glm(formula = sta ~ crn * age, family = binomial(link = logit), data = icu)


icu$logit.crn1.int <- logit(predict(fit3b.interaction, newdata = data.frame(age = icu$age, crn = 1), type = "response"))
icu$logit.crn0.int <- logit(predict(fit3b.interaction, newdata = data.frame(age = icu$age, crn = 0), type = "response"))
max(icu$age)

[1] 92

ggplot(data = icu)  +
    geom_line(aes(x = age, y = logit.crn1.int), color = "darkgreen")  +
    geom_line(aes(x = age, y = logit.crn0.int), color = "red")  +
    ylab("logit")  +
    ggtitle("Model with interaction")  +
    annotate(geom = "text", x = c(62.5, 62.5,88), y = c(-0.2, -1.3, -2.3), label = c("CRN = 1", "CRN = 0", "max \n age = 92"))  +
    geom_vline(xintercept = 92, linetype = "dashed", color = "black")


(loglik3b.int <- fit3b.interaction$deviance / (-2))

[1] -93.68108

(Gb <- (-2) * (loglik3b - loglik3b.int))

[1] 1.242437

(pval.b <- 2 * pchisq(Gb, df = 1, lower.tail = FALSE))

[1] 0.5300038

Based on the value of $\Delta \beta \%$(19.6008213) and p-value for the coefficient of age(0.0197698), as well as the p-value of the likelihood ratio test, I decided that age is a confounder of the relationship between crn and sta.
On the other hand, age is not a modifier of the relationship between the risk factor and outcome, because the graph of the logit for the model with only the main effects show that the two logits are parallel, and graph for the model with interaction shows that the two logits actually don’t cross in the range of the data - the crossing of the two lines can be only imagined in extrapolation. Also, the likelihood ratio test for models without and with an interaction term (p-value = 0.5300038) says there is no difference between the two models in their predictability of the outcome variable sta.
Therefore, I decided that age is a confounder, but not a modifier of the relationship between history of chronic renal failure and chances of dying as a function of being admitted to an intensive care unit.

f) Consider an analysis for confounding and interaction for the model with STA as the outcome, CAN as the risk factor, and TYP as the potential confounding variable. Perform this analysis using logistic regression modeling and Mantel-Haenszel analysis. Compare the results of the two approaches.

fit4a <- glm(sta ~ can, data = icu, family = binomial(link = logit))
ztable(fit4a)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	-1.3863	0.1863	-7.44	0.0000	0.25	0.17	0.36
can	-0.0000	0.5893	-0.00	1.0000	1.00	0.27	2.93
Call: glm(formula = sta ~ can, family = binomial(link = logit), data = icu)

fit4b <- glm(sta ~ can + typ, data = icu, family = binomial(link = logit))
ztable(fit4b)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	-3.8202	0.8564	-4.46	0.0000	0.02	0.00	0.09
can	1.3640	0.7817	1.74	0.0810	3.91	0.83	20.45
typ	2.7097	0.8598	3.15	0.0016	15.03	3.57	116.36
Call: glm(formula = sta ~ can + typ, family = binomial(link = logit), data = icu)

(loglik4a <- fit4a$deviance / (-2))

[1] -100.0805

(loglik4b <- fit4b$deviance / (-2))

[1] -91.01196

(G4ab <- (-2) * (loglik4a - loglik4b))

[1] 18.13706

(pval.4ab <- 2 * pchisq(G4ab, df = 1, lower.tail = FALSE))

[1] 4.111236e-05

(delta.beta.percent.4ab <- 100 * (fit4a$coef[2] - fit4b$coef[2]) / fit4b$coef[2])

can -100

P-value for the significance of can as a risk factor for death as a function of being admitted to ICU says that can solely is not a significant predictor: an information that cancer is a part of patient’s present problem solely on itself bears no predictive value. However, information that cancer is a part of patient’s present problem coupled with the information that type of admission to hospital is emergency rather than elective, bears more significant value, in sense that they both have positive coefficients, hence they work together in raising the chances of a lethal outcome. Also, likelihood ratio test says that model with type of admission as a confounder is significantly better than the model with only can as a predictor (p-value = 4.111235710^{-5}), and value of $\Delta \beta \%$ cannot be higher ($\Delta \beta \%$ = -100).

#perform a test of homogeneity of odds ratios (odds of sta = 1 for can=1 to can = 0) over strata of variable `typ`
library(vcd)
pander::pander(tbl <-ftable(sta ~ can + typ, data = icu))

		“sta”	“0”	“1”
“can”	“typ”
“0”	“0”		37	1
	“1”		107	35
“1”	“0”		14	1
	“1”		2	3


pander(wt <- woolf_test(table(icu$sta, icu$can, icu$typ)))  # i.e. woolf_test for the significance of interaction

Woolf-test on Homogeneity of Odds Ratios (no 3-Way assoc.): `table(icu$sta, icu$can, icu$typ)`
Test statistic	df	P value
0.1023	1	0.7491

mht <- mantelhaen.test(x = icu$can, y = icu$sta, z = icu$typ) #calculates common odds ratio
pander(mht$estimate)

common odds ratio
3.893

P-value of woolf_test(package = 'vcd') is 0.7491291 says that there is homogeneity of odd ratios of outcome (sta = 1) between can=1 and can = 0 across strata of typ. That is, the odds ratios across strata are homogenous.

fit4b.interaction <- glm(sta ~ can * typ, data = icu, family = binomial(link = logit))
ztable(fit4b.interaction)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	-3.6109	1.0134	-3.56	0.0004	0.03	0.00	0.12
can	0.9719	1.4486	0.67	0.5023	2.64	0.10	69.97
typ	2.4934	1.0320	2.42	0.0157	12.10	2.47	218.77
can:typ	0.5511	1.7233	0.32	0.7491	1.74	0.04	70.11
Call: glm(formula = sta ~ can * typ, family = binomial(link = logit), data = icu)

(loglik4b.int <- fit4b.interaction$deviance / (-2))

[1] -90.96089

(G4b.int <- (-2) * (loglik4b - loglik4b.int))

[1] 0.1021223

(pval.4bint <- 2 * pchisq(G4b.int, df = 1, lower.tail = FALSE))

[1] 1.498595

We can see that p-value of the Wald statistic for the coefficient of the interaction term (0.7491285) is approximately equal to the p-value of the woolf_test(0.7491291)