Exercise One

For this exercise, you will use the Hyponatremia dataset (this is the same data from week 4). You can download the hyponatremia.dta Stata file, or you can also access the data through this CSV.

If you recall from last week, the data for this homework exercise derive from an epidemiological study of hyponatremia (a life-threatening condition) among runners of the 2002 Boston Marathon. Hyponatremia is defined as an electrolyte disturbance in which the serum sodium concentration is lower than normal (<135 mmol/l).

knitr::opts_chunk$set(collapse = TRUE,
                      fig.align = "center",
                      cache = TRUE,
                      echo = TRUE, 
                      warning = FALSE,
                      message = FALSE,
                      results = 'asis')
library(ztable)

## Welcome to package ztable ver 0.1.5

options(ztable.type = 'html')

Complete the following:

a) Assess the association between hyponatremia (dichotomous variable nas135) and sex (variable female) by making a 2 by 2 table. Calculate the odds ratio of hyponatremia of a female compared to a male. Compute the 95% confidence interval for this odds ratio. Interpret the findings.

library(aplore3)
library(ztable)
filename <- "hyponatremia.csv"

if (! file.exists("hyponatremia.csv")) {
    fileURL <- "https://learn.canvas.net/courses/1179/files/461758/download?download_frd=1"
    download.file(fileURL, filename)
} 
hn <- read.csv(filename, header = TRUE, sep = ",")
ztable(head(hn))

<th <th <th <th <th <th <th <th


	obs	na	nas135	female	urinat3p	wtdiff	bmi	runtime
1	1	141	0	0	0	-2.87	25.90	230
2	2	135	1	1	0	0.00	31.70	400
3	3	140	0	0	0	-1.36	22.92	216
4	4	143	0	0	0	-1.82	24.25	201
5	5	128	1	1	0	-0.91	18.28	223
6	6	148	0	0	0	-0.45	22.60	275

zt <- table(hn$female, hn$nas135)

library(pander)
pander::pander(zt)   # ztable can't handle object of class table

	0	1
0	297	25
1	129	37

nas135 is a categorical binary variable with codes 0 and 1. 0 = eunatremia and 1 = hyponatremia. \[OR(female, male) = \frac{\frac{37}{129}}{\frac{25}{297}} = 3.41\]

(SE_logOR <- sqrt(1/37 + 1/129 + 1/25 + 1/297))

[1] 0.279546

(CI_OR <- exp(log(3.41) + c(-1, 1) * qnorm(0.975, lower.tail = TRUE) * SE_logOR))

[1] 1.971535 5.897995

library(epitools)
ztable(oddsratio(hn$female, hn$nas135)$measure)

<th <th <th


	estimate	lower	upper
0	1.00
1	3.39	1.97	5.94

Female contestants of Boston Marathon are 3.4 times more likely to experience hyponatremia than their male counterparts, i.e. with 95% certainty we can say they are as low as 1.97 or as high as 5.9 times more likely to suffer from hyponatremia.

b) Perform a logistic regression analysis with Stata using nas135 as dependent variable and female as the only independent variable. Use the Likelihood Ratio test to assess the significance of the model. Is the model with female a better model than the naïve model?

fit <- glm(nas135 ~ female, data = hn, family = binomial(link = logit))
ztable(fit)

<th <th <th <th <th <th <th


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	-2.4749	0.2082	-11.88	0.0000	0.08	0.05	0.12
female	1.2260	0.2795	4.39	0.0000	3.41	1.98	5.95
Call: glm(formula = nas135 ~ female, family = binomial(link = logit), data = hn)

(LogLikNull <- fit$null.deviance / (-2))

[1] -185.8004

(LogLikFemale <- fit$deviance / (-2))

[1] -175.9655

(G <- -2 * (LogLikNull - LogLikFemale))

[1] 19.66992

(pval <- 2 * pchisq(G, df = 1, lower.tail = F))

[1] 1.84078e-05

Model with female as predictor is a better model than the naive model, since the p-value of the likelihood ratio test is 1.840780210^{-5}, which is well below the customary 0.05 significance level.

If you are using Stata, you can use their built-in statistical functions to obtain p-values (type help functions).

c) What is the nadve model? What is the probability of hyponatremia that this model predicts?

Naive model is a model with no predictors, only intercept. That model predicts that the percentage of people with hyponatremia is 12.704918, which is equal to the percentage of the hyponatremic people in the data:

d) Run a logistic regression analyses with no independent variables. Transform the coefficient obtained from this model into a probability.

nrow(hn[hn$nas135 == 1, ])/nrow(hn) * 100

[1] 12.70492

fit1 <- glm(nas135 ~ 1, data = hn, family = binomial(link = logit))
unique(predict(fit1, type = "response"))

[1] 0.1270492

d) Using the model with female as independent variable, compute the estimated probability of hyponatremia per males and females. Write down the equation for the logit.

dd <- data.frame(predict(fit, newdata = data.frame(female = c(1, 0)), type = "response"))
rownames(dd)= c("female", "male")
colnames(dd) = c("probability of hyponatremia")
pander::pander(dd)

	probability of hyponatremia
female	0.2229
male	0.07764

\[g(female) = \beta_0 + \beta_1*female\] \[g(female) = -2.475 + 1.226 *female\]

f) Use the Wald test to assess the significance of the coefficient for female.

ztable(sumtab <- summary(fit)$coef)

<th <th <th <th


	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-2.47	0.21	-11.88	0.00
female	1.23	0.28	4.39	0.00

(wald <- (sumtab[2,1] - 0 )/ sumtab[2,2])

[1] 4.385547

(pval.w <- 2 * pnorm(wald, lower.tail = F))

[1] 1.156946e-05

Since p-value of the Wald test is well below the significance level of 0.05, we can reject the null hypothesis that the slope of the female variable is equal to zero.

g) Fit a model with runtime as the only independent variable. Assess the significance of the model.

pander::pander(summary(hn$runtime))

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.	NA’s
142	195	220	225.5	248	400	11

fit.run <- glm(nas135 ~ runtime, data = hn, family = binomial(link = logit))
(LogLikRuntime <- fit.run$deviance / (-2))

[1] -167.7718

(G.run <- -2 * (LogLikNull - LogLikRuntime))

[1] 36.05718

(pval.run <- 2 * pchisq(G.run, df = 1, lower.tail = FALSE))

[1] 3.832238e-09

Compared to the null model, model with runtime is significantly better, p-value from the likelihood ratio test is well below the 0.05 cutoff.

h) Calculate the probability of hyponatremia of a runner who takes 4 hours (240 minutes) to complete the marathon.

predict(fit.run, newdata = data.frame(runtime = 240), type = "response")

0.1332953

A runner with runtime of 240 minutes has 13.3 percent chances to experience hyponatremia.

i) Fit a model with female and runtime as independent variables. Assess the significance of the model. Which null hypothesis is tested?

fit.fr <- glm(nas135 ~ female + runtime, data = hn, family = binomial(link = logit))
ztable(fit.fr)

<th <th <th <th <th <th <th


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	-5.7211	0.8233	-6.95	0.0000	0.00	0.00	0.02
female	0.9638	0.2910	3.31	0.0009	2.62	1.49	4.67
runtime	0.0142	0.0033	4.31	0.0000	1.01	1.01	1.02
Call: glm(formula = nas135 ~ female + runtime, family = binomial(link = logit), data = hn)

(LogLikFR <- fit.fr$deviance / (-2))

[1] -162.2398

(G.fr <- -2 * (LogLikNull - LogLikFR))

[1] 47.12115

(pvalFR <- 2 * pchisq(G.fr, df = 2, lower.tail = FALSE))

[1] 1.171659e-10

The likelihood ratio test, which tests the significance of the model compared to the null model ( in this example), yielded a very small p-value. That means that the model with female and runtime as predictors is better than the model with only intercept. The null hypothesis in this test states that the model with female and runtime as predictors is no better than the model with only intercept.

eval.model <- function(numerator, denominator, dfs, labels) {
    # numerator = log-likelihood of a smaller of the two models that are compared; could be a vector of models
    #denominator = log-likelihood of the bigger model in which significance we are interested in 
    #dfs = vector of degrees of freedom corresponding to each comparison
    #labels = vector of titles for the table to denote the models
    df <- data.frame(matrix(nrow = length(numerator)+1, ncol = 2))
    for (i in 1:length(numerator)) {
        G <- -2 * (numerator[i] - denominator)
        pval <- 2 * pchisq(G, dfs[i], lower.tail = FALSE)
        df[i, 1] <- round(G, 4)
        df[i, 2] <- round(pval, 2)
    }
    df[length(numerator) + 1, ] <- c("-", "-")
    rownames(df) <- labels
    colnames(df) <- c("G", "p-value")
    return(df)
}
numerator <- c(LogLikFemale, LogLikRuntime)
denominator <- LogLikFR
dfs <- c(1,1)
labels <- c("female", "runtime", "female + runtime")
ztable(eval.model(numerator = numerator, denominator = denominator, dfs = dfs, labels = labels))

<th <th


	G	p-value
female	27.4512	0
runtime	11.064	0
female + runtime	-	-

Likelihood ratios of the model with the two predictors compared with models with each of the predictors as a single predictor reveals that the two-predictor model is significantly better than either of the single-predictor models.

Week 5 Homework

Exercise One

Complete the following: