Exercise One

For this exercise, you will use the Hyponatremia dataset (this is the same data from week 4). You can download the hyponatremia.dta Stata file, or you can also access the data through this CSV.

If you recall from last week, the data for this homework exercise derive from an epidemiological study of hyponatremia (a life-threatening condition) among runners of the 2002 Boston Marathon. Hyponatremia is defined as an electrolyte disturbance in which the serum sodium concentration is lower than normal (<135 mmol/l).

knitr::opts_chunk$set(collapse = TRUE,
                      fig.align = "center",
                      cache = TRUE,
                      echo = TRUE, 
                      warning = FALSE,
                      message = FALSE,
                      results = 'asis')
library(ztable)
## Welcome to package ztable ver 0.1.5
options(ztable.type = 'html')

Complete the following:

a) Assess the association between hyponatremia (dichotomous variable nas135) and sex (variable female) by making a 2 by 2 table. Calculate the odds ratio of hyponatremia of a female compared to a male. Compute the 95% confidence interval for this odds ratio. Interpret the findings.   
library(aplore3)
library(ztable)
filename <- "hyponatremia.csv"

if (! file.exists("hyponatremia.csv")) {
    fileURL <- "https://learn.canvas.net/courses/1179/files/461758/download?download_frd=1"
    download.file(fileURL, filename)
} 
hn <- read.csv(filename, header = TRUE, sep = ",")
ztable(head(hn))
<th <th <th <th <th <th <th <th
  obs na nas135 female urinat3p wtdiff bmi runtime
1 1 141 0 0 0 -2.87 25.90 230
2 2 135 1 1 0 0.00 31.70 400
3 3 140 0 0 0 -1.36 22.92 216
4 4 143 0 0 0 -1.82 24.25 201
5 5 128 1 1 0 -0.91 18.28 223
6 6 148 0 0 0 -0.45 22.60 275
zt <- table(hn$female, hn$nas135)

library(pander)
pander::pander(zt)   # ztable can't handle object of class table
  0 1
0 297 25
1 129 37

nas135 is a categorical binary variable with codes 0 and 1. 0 = eunatremia and 1 = hyponatremia. \[OR(female, male) = \frac{\frac{37}{129}}{\frac{25}{297}} = 3.41\]

(SE_logOR <- sqrt(1/37 + 1/129 + 1/25 + 1/297))

[1] 0.279546

(CI_OR <- exp(log(3.41) + c(-1, 1) * qnorm(0.975, lower.tail = TRUE) * SE_logOR))

[1] 1.971535 5.897995

library(epitools)
ztable(oddsratio(hn$female, hn$nas135)$measure)
<th <th <th
  estimate lower upper
0 1.00
1 3.39 1.97 5.94

Female contestants of Boston Marathon are 3.4 times more likely to experience hyponatremia than their male counterparts, i.e. with 95% certainty we can say they are as low as 1.97 or as high as 5.9 times more likely to suffer from hyponatremia.

b) Perform a logistic regression analysis with Stata using nas135 as dependent variable and female as the only independent variable. Use the Likelihood Ratio test to assess the significance of the model. Is the model with female a better model than the naïve model? 
fit <- glm(nas135 ~ female, data = hn, family = binomial(link = logit))
ztable(fit)
<th <th <th <th <th <th <th
  Estimate Std. Error z value Pr(>|z|) OR lcl ucl
(Intercept) -2.4749 0.2082 -11.88 0.0000 0.08 0.05 0.12
female 1.2260 0.2795 4.39 0.0000 3.41 1.98 5.95
Call: glm(formula = nas135 ~ female, family = binomial(link = logit), data = hn)
(LogLikNull <- fit$null.deviance / (-2))

[1] -185.8004

(LogLikFemale <- fit$deviance / (-2))

[1] -175.9655

(G <- -2 * (LogLikNull - LogLikFemale))

[1] 19.66992

(pval <- 2 * pchisq(G, df = 1, lower.tail = F))

[1] 1.84078e-05

Model with female as predictor is a better model than the naive model, since the p-value of the likelihood ratio test is 1.840780210^{-5}, which is well below the customary 0.05 significance level.

If you are using Stata, you can use their built-in statistical functions to obtain p-values (type help functions).

c) What is the nadve model? What is the probability of hyponatremia that this model predicts?    

Naive model is a model with no predictors, only intercept. That model predicts that the percentage of people with hyponatremia is 12.704918, which is equal to the percentage of the hyponatremic people in the data:

d) Run a logistic regression analyses with no independent variables. Transform the coefficient obtained from this model into a probability.    
nrow(hn[hn$nas135 == 1, ])/nrow(hn) * 100

[1] 12.70492

fit1 <- glm(nas135 ~ 1, data = hn, family = binomial(link = logit))
unique(predict(fit1, type = "response"))

[1] 0.1270492

d) Using the model with female as independent variable, compute the estimated probability of hyponatremia per males and females. Write down the equation for the logit.   
dd <- data.frame(predict(fit, newdata = data.frame(female = c(1, 0)), type = "response"))
rownames(dd)= c("female", "male")
colnames(dd) = c("probability of hyponatremia")
pander::pander(dd)
  probability of hyponatremia
female 0.2229
male 0.07764

\[g(female) = \beta_0 + \beta_1*female\] \[g(female) = -2.475 + 1.226 *female\]

f) Use the Wald test to assess the significance of the coefficient for female.   
ztable(sumtab <- summary(fit)$coef)
<th <th <th <th
  Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.47 0.21 -11.88 0.00
female 1.23 0.28 4.39 0.00
(wald <- (sumtab[2,1] - 0 )/ sumtab[2,2])

[1] 4.385547

(pval.w <- 2 * pnorm(wald, lower.tail = F))

[1] 1.156946e-05

Since p-value of the Wald test is well below the significance level of 0.05, we can reject the null hypothesis that the slope of the female variable is equal to zero.

g) Fit a model with runtime as the only independent variable. Assess the significance of the model.  
pander::pander(summary(hn$runtime))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
142 195 220 225.5 248 400 11
fit.run <- glm(nas135 ~ runtime, data = hn, family = binomial(link = logit))
(LogLikRuntime <- fit.run$deviance / (-2))

[1] -167.7718

(G.run <- -2 * (LogLikNull - LogLikRuntime))

[1] 36.05718

(pval.run <- 2 * pchisq(G.run, df = 1, lower.tail = FALSE))

[1] 3.832238e-09

Compared to the null model, model with runtime is significantly better, p-value from the likelihood ratio test is well below the 0.05 cutoff.

h) Calculate the probability of hyponatremia of a runner who takes 4 hours (240 minutes) to complete the marathon.      
predict(fit.run, newdata = data.frame(runtime = 240), type = "response")
    1 

0.1332953

A runner with runtime of 240 minutes has 13.3 percent chances to experience hyponatremia.

i) Fit a model with female and runtime as independent variables. Assess the significance of the model. Which null hypothesis is tested?   
fit.fr <- glm(nas135 ~ female + runtime, data = hn, family = binomial(link = logit))
ztable(fit.fr)
<th <th <th <th <th <th <th
  Estimate Std. Error z value Pr(>|z|) OR lcl ucl
(Intercept) -5.7211 0.8233 -6.95 0.0000 0.00 0.00 0.02
female 0.9638 0.2910 3.31 0.0009 2.62 1.49 4.67
runtime 0.0142 0.0033 4.31 0.0000 1.01 1.01 1.02
Call: glm(formula = nas135 ~ female + runtime, family = binomial(link = logit), data = hn)
(LogLikFR <- fit.fr$deviance / (-2))

[1] -162.2398

(G.fr <- -2 * (LogLikNull - LogLikFR))

[1] 47.12115

(pvalFR <- 2 * pchisq(G.fr, df = 2, lower.tail = FALSE))

[1] 1.171659e-10

The likelihood ratio test, which tests the significance of the model compared to the null model ( in this example), yielded a very small p-value. That means that the model with female and runtime as predictors is better than the model with only intercept. The null hypothesis in this test states that the model with female and runtime as predictors is no better than the model with only intercept.

eval.model <- function(numerator, denominator, dfs, labels) {
    # numerator = log-likelihood of a smaller of the two models that are compared; could be a vector of models
    #denominator = log-likelihood of the bigger model in which significance we are interested in 
    #dfs = vector of degrees of freedom corresponding to each comparison
    #labels = vector of titles for the table to denote the models
    df <- data.frame(matrix(nrow = length(numerator)+1, ncol = 2))
    for (i in 1:length(numerator)) {
        G <- -2 * (numerator[i] - denominator)
        pval <- 2 * pchisq(G, dfs[i], lower.tail = FALSE)
        df[i, 1] <- round(G, 4)
        df[i, 2] <- round(pval, 2)
    }
    df[length(numerator) + 1, ] <- c("-", "-")
    rownames(df) <- labels
    colnames(df) <- c("G", "p-value")
    return(df)
}
numerator <- c(LogLikFemale, LogLikRuntime)
denominator <- LogLikFR
dfs <- c(1,1)
labels <- c("female", "runtime", "female + runtime")
ztable(eval.model(numerator = numerator, denominator = denominator, dfs = dfs, labels = labels))
<th <th
  G p-value
female 27.4512 0
runtime 11.064 0
female + runtime - -

Likelihood ratios of the model with the two predictors compared with models with each of the predictors as a single predictor reveals that the two-predictor model is significantly better than either of the single-predictor models.