For this exercise, you will use the Hyponatremia dataset (this is the same data from week 4). You can download the hyponatremia.dta Stata file, or you can also access the data through this CSV.
If you recall from last week, the data for this homework exercise derive from an epidemiological study of hyponatremia (a life-threatening condition) among runners of the 2002 Boston Marathon. Hyponatremia is defined as an electrolyte disturbance in which the serum sodium concentration is lower than normal (<135 mmol/l).
knitr::opts_chunk$set(collapse = TRUE,
fig.align = "center",
cache = TRUE,
echo = TRUE,
warning = FALSE,
message = FALSE,
results = 'asis')
library(ztable)
## Welcome to package ztable ver 0.1.5
options(ztable.type = 'html')
a) Assess the association between hyponatremia (dichotomous variable nas135) and sex (variable female) by making a 2 by 2 table. Calculate the odds ratio of hyponatremia of a female compared to a male. Compute the 95% confidence interval for this odds ratio. Interpret the findings.
library(aplore3)
library(ztable)
filename <- "hyponatremia.csv"
if (! file.exists("hyponatremia.csv")) {
fileURL <- "https://learn.canvas.net/courses/1179/files/461758/download?download_frd=1"
download.file(fileURL, filename)
}
hn <- read.csv(filename, header = TRUE, sep = ",")
ztable(head(hn))
| <th | obs | <thna | <thnas135 | <thfemale | <thurinat3p | <thwtdiff | <thbmi | <thruntime |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 141 | 0 | 0 | 0 | -2.87 | 25.90 | 230 |
zt <- table(hn$female, hn$nas135)
library(pander)
pander::pander(zt) # ztable can't handle object of class table
| 0 | 1 | |
|---|---|---|
| 0 | 297 | 25 |
| 1 | 129 | 37 |
nas135 is a categorical binary variable with codes 0 and 1. 0 = eunatremia and 1 = hyponatremia. \[OR(female, male) = \frac{\frac{37}{129}}{\frac{25}{297}} = 3.41\]
(SE_logOR <- sqrt(1/37 + 1/129 + 1/25 + 1/297))
[1] 0.279546
(CI_OR <- exp(log(3.41) + c(-1, 1) * qnorm(0.975, lower.tail = TRUE) * SE_logOR))
[1] 1.971535 5.897995
library(epitools)
ztable(oddsratio(hn$female, hn$nas135)$measure)
| <th | estimate | <thlower | <thupper |
|---|---|---|---|
| 0 | 1.00 | ||
Female contestants of Boston Marathon are 3.4 times more likely to experience hyponatremia than their male counterparts, i.e. with 95% certainty we can say they are as low as 1.97 or as high as 5.9 times more likely to suffer from hyponatremia.
b) Perform a logistic regression analysis with Stata using nas135 as dependent variable and female as the only independent variable. Use the Likelihood Ratio test to assess the significance of the model. Is the model with female a better model than the naïve model?
fit <- glm(nas135 ~ female, data = hn, family = binomial(link = logit))
ztable(fit)
| <th | Estimate | <thStd. Error | <thz value | <thPr(>|z|) | <thOR | <thlcl | <thucl |
|---|---|---|---|---|---|---|---|
| (Intercept) | -2.4749 | 0.2082 | -11.88 | 0.0000 | 0.08 | 0.05 | 0.12 |
(LogLikNull <- fit$null.deviance / (-2))
[1] -185.8004
(LogLikFemale <- fit$deviance / (-2))
[1] -175.9655
(G <- -2 * (LogLikNull - LogLikFemale))
[1] 19.66992
(pval <- 2 * pchisq(G, df = 1, lower.tail = F))
[1] 1.84078e-05
Model with female as predictor is a better model than the naive model, since the p-value of the likelihood ratio test is 1.840780210^{-5}, which is well below the customary 0.05 significance level.
If you are using Stata, you can use their built-in statistical functions to obtain p-values (type help functions).
c) What is the nadve model? What is the probability of hyponatremia that this model predicts?
Naive model is a model with no predictors, only intercept. That model predicts that the percentage of people with hyponatremia is 12.704918, which is equal to the percentage of the hyponatremic people in the data:
d) Run a logistic regression analyses with no independent variables. Transform the coefficient obtained from this model into a probability.
nrow(hn[hn$nas135 == 1, ])/nrow(hn) * 100
[1] 12.70492
fit1 <- glm(nas135 ~ 1, data = hn, family = binomial(link = logit))
unique(predict(fit1, type = "response"))
[1] 0.1270492
d) Using the model with female as independent variable, compute the estimated probability of hyponatremia per males and females. Write down the equation for the logit.
dd <- data.frame(predict(fit, newdata = data.frame(female = c(1, 0)), type = "response"))
rownames(dd)= c("female", "male")
colnames(dd) = c("probability of hyponatremia")
pander::pander(dd)
| probability of hyponatremia | |
|---|---|
| female | 0.2229 |
| male | 0.07764 |
\[g(female) = \beta_0 + \beta_1*female\] \[g(female) = -2.475 + 1.226 *female\]
f) Use the Wald test to assess the significance of the coefficient for female.
ztable(sumtab <- summary(fit)$coef)
| <th | Estimate | <thStd. Error | <thz value | <thPr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -2.47 | 0.21 | -11.88 | 0.00 |
(wald <- (sumtab[2,1] - 0 )/ sumtab[2,2])
[1] 4.385547
(pval.w <- 2 * pnorm(wald, lower.tail = F))
[1] 1.156946e-05
Since p-value of the Wald test is well below the significance level of 0.05, we can reject the null hypothesis that the slope of the female variable is equal to zero.
g) Fit a model with runtime as the only independent variable. Assess the significance of the model.
pander::pander(summary(hn$runtime))
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
|---|---|---|---|---|---|---|
| 142 | 195 | 220 | 225.5 | 248 | 400 | 11 |
fit.run <- glm(nas135 ~ runtime, data = hn, family = binomial(link = logit))
(LogLikRuntime <- fit.run$deviance / (-2))
[1] -167.7718
(G.run <- -2 * (LogLikNull - LogLikRuntime))
[1] 36.05718
(pval.run <- 2 * pchisq(G.run, df = 1, lower.tail = FALSE))
[1] 3.832238e-09
Compared to the null model, model with runtime is significantly better, p-value from the likelihood ratio test is well below the 0.05 cutoff.
h) Calculate the probability of hyponatremia of a runner who takes 4 hours (240 minutes) to complete the marathon.
predict(fit.run, newdata = data.frame(runtime = 240), type = "response")
1
0.1332953
A runner with runtime of 240 minutes has 13.3 percent chances to experience hyponatremia.
i) Fit a model with female and runtime as independent variables. Assess the significance of the model. Which null hypothesis is tested?
fit.fr <- glm(nas135 ~ female + runtime, data = hn, family = binomial(link = logit))
ztable(fit.fr)
| <th | Estimate | <thStd. Error | <thz value | <thPr(>|z|) | <thOR | <thlcl | <thucl |
|---|---|---|---|---|---|---|---|
| (Intercept) | -5.7211 | 0.8233 | -6.95 | 0.0000 | 0.00 | 0.00 | 0.02 |
(LogLikFR <- fit.fr$deviance / (-2))
[1] -162.2398
(G.fr <- -2 * (LogLikNull - LogLikFR))
[1] 47.12115
(pvalFR <- 2 * pchisq(G.fr, df = 2, lower.tail = FALSE))
[1] 1.171659e-10
The likelihood ratio test, which tests the significance of the model compared to the null model ( in this example), yielded a very small p-value. That means that the model with female and runtime as predictors is better than the model with only intercept. The null hypothesis in this test states that the model with female and runtime as predictors is no better than the model with only intercept.
eval.model <- function(numerator, denominator, dfs, labels) {
# numerator = log-likelihood of a smaller of the two models that are compared; could be a vector of models
#denominator = log-likelihood of the bigger model in which significance we are interested in
#dfs = vector of degrees of freedom corresponding to each comparison
#labels = vector of titles for the table to denote the models
df <- data.frame(matrix(nrow = length(numerator)+1, ncol = 2))
for (i in 1:length(numerator)) {
G <- -2 * (numerator[i] - denominator)
pval <- 2 * pchisq(G, dfs[i], lower.tail = FALSE)
df[i, 1] <- round(G, 4)
df[i, 2] <- round(pval, 2)
}
df[length(numerator) + 1, ] <- c("-", "-")
rownames(df) <- labels
colnames(df) <- c("G", "p-value")
return(df)
}
numerator <- c(LogLikFemale, LogLikRuntime)
denominator <- LogLikFR
dfs <- c(1,1)
labels <- c("female", "runtime", "female + runtime")
ztable(eval.model(numerator = numerator, denominator = denominator, dfs = dfs, labels = labels))
| <th | G | <thp-value |
|---|---|---|
| female | 27.4512 | 0 |
Likelihood ratios of the model with the two predictors compared with models with each of the predictors as a single predictor reveals that the two-predictor model is significantly better than either of the single-predictor models.