packages <- c('MASS', 'tidyverse', 'modelr', 'ggthemes', 'knitr', 'kableExtra', 
              'broom', "GGally", 'scales', 'ggsci', 'gridExtra', 'formattable',
              'caret', 'glmnet', 'bestglm')
sapply(packages, require, character.only = TRUE, quietly = TRUE)
rm(packages)
options(knitr.table.format = "html", knitr.kable.NA = "") 
theme_set(theme_tufte(base_size = 14) +
                theme(panel.border = element_rect(fill = NA),
                      panel.grid.major = element_line(color = "gray78"),
                      legend.background = element_rect(),
                      legend.position = "top",
                      axis.text.x = element_text(angle = 30, hjust = 1),
                      strip.background = element_rect(fill = "grey90", linetype = "blank")))
fig_num <- 0
table_num <- 0

This HTML file uses code_folding. You can look at any accompanying code by pressing the little code boxes and then hide it again to get a smoother look at the report. Each project also uses a tabset layout. Under the headlines are tabs, each containing answers to a part of the project.

Project 1 (25%)

Summary

Discharge can be modeled as a function of water level with the following relationship \[ Q(h) = a h^{f(h)} \]

where $Q$ is the discharge, $h$ is the water level, $a$ is a parameter and $f(h)$ is a function of water level. Let $y(h) = \log(Q(h))$ and $\beta_0=\log(a)$. The relationship between $y$ and $h$ is given by

\[ y(h) = \beta_0 + f(h)\log(h). \] Assume that $f(h)$ can be modeled as

\[ f(h) = \beta_1 + \beta_2 (h - h_{\textrm{c}}) + \beta_3 (h - h_{\textrm{c}})^2 + \beta_4 (h - h_{\textrm{c}})^3 + \beta_5 (h - h_{\textrm{c}})^4 + \beta_6 (h - h_{\textrm{c}})^5 \] where $h_{\textrm{c}}$ is a constant close to the sample mean of the observed water level found in the dataset that is being analyzed. Based on this, the following statistical model for the observed pairs $(y_i,h_i)$ is proposed, namely,

\[ y_i = \beta_0 + \{\beta_1 + \beta_2 (h_i - h_{\textrm{c}}) + \beta_3 (h_i - h_{\textrm{c}})^2 + \beta_4 (h_i - h_{\textrm{c}})^3 + \beta_5 (h_i - h_{\textrm{c}})^4 + \beta_6 (h_i - h_{\textrm{c}})^5\}\log(h_i) + \epsilon_i \]

where $\epsilon_i \sim \textrm{N}(0,\sigma^2)$, $i=1,\ldots,n$, and $\epsilon_i$ is independent of $\epsilon_j$ if $i\neq j$.

The file dc.csv contains $n=124$ simulated pairs of $Q$ and $h$ ($Q$ in column 2; $h$ in column 3). Here $h_{\textrm{c}}=2.25$ m. The file also contains vectors with the following variables

\[ x_{i1} = \log(h_i), \quad x_{i2}=(h_i - h_{\textrm{c}})\log(h_i) , \] \[ x_{i3}=(h_i - h_{\textrm{c}})^2 \log(h_i), \quad x_{i4}=(h_i - h_{\textrm{c}})^3 \log(h_i), \] \[ x_{i5}=(h_i - h_{\textrm{c}})^4 \log(h_i), \quad x_{i6}=(h_i - h_{\textrm{c}})^5 \log(h_i), \]

where $x_{i1}$ is in column 4, $x_{i2}$ is in column 5, and so on.
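
As a rough sketch of how these columns relate to the water level, the base R snippet below rebuilds $x_1, \ldots, x_6$ from a few made-up water levels with $h_{\textrm{c}}=2.25$. The helper name `make_design` and the example values in `h_example` are illustrative assumptions and are not taken from dc.csv.

```r
# Hypothetical helper (not part of the original analysis):
# build x1..x6 from water levels h, centred at h_c
make_design <- function(h, h_c = 2.25, degree = 5) {
  # column j + 1 holds (h - h_c)^j * log(h) for j = 0, ..., degree
  cols <- sapply(0:degree, function(j) (h - h_c)^j * log(h))
  cols <- matrix(cols, nrow = length(h))
  colnames(cols) <- paste0("x", seq_len(degree + 1))
  as.data.frame(cols)
}

h_example <- c(1.5, 2.25, 3.19)  # made-up water levels in metres
design <- make_design(h_example)
print(design)
```

Because every column is a known function of $h$, the model stays linear in $\beta_0, \ldots, \beta_6$ and can be fitted with ordinary least squares.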

dc <- read_csv("dc.csv") %>%
      select(-X1)

(a)

Plot the discharge observations versus the water level observations (water level on the x-axis). Is the relationship linear? Plot the logarithm of the discharge observations versus the logarithm of the water level observations. Is the relationship linear?

Figure 1 below shows that the relationship is not perfectly linear. It is convex near the origin but then seems to stabilize into a linear function. There is also a trace of heteroskedasticity.

cap <- "Discharge versus water level plotted on a linear scale and log-log scale."
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ". ", cap)

dc %>%
      gather(variable, true, Discharge, Water.level) %>%
      mutate(log_value = log(true)) %>%
      gather(type, value, true, log_value) %>%
      spread(variable, value) %>%
      mutate(type = factor(type, 
                           levels = c("true", "log_value"),
                           labels = c("Linear", "Log-Log"))) %>%
      ggplot(aes(Water.level, Discharge)) +
      geom_point() +
      facet_wrap("type", scales = "free")

Figure 1. Discharge versus water level plotted on a linear scale and log-log scale.

(b)

Fit the model $y_i = \beta_0 + \beta_1\log(h_i) + \epsilon_i$. Plot the estimate of the line $y = \beta_0 + \beta_1\log(h)$ as a function of $\log(h)$ along with the observed data (new figure). Does this model appear to describe the expected value of $y$ given $h$ adequately well? Also, plot the residuals as a function of $\log(h)$. Does the residual plot indicate that the standard deviation of the error term, $\epsilon$, is a constant as a function of $\log(h)$?

The simple linear model does not capture the parabolic curvature in the data. This can be seen both when comparing the fitted line to the observed data and in the obvious parabolic trend in the residuals. The residual plot shows no obvious sign of heteroskedasticity, but the parabolic trend makes this harder to judge.

cap <- "Left: Observed points overlaid on predicted line. Right: Residuals plotted by log(waterlevel)"
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
dc <- dc %>% mutate(log_discharge = log(Discharge), log_waterlevel = log(Water.level))

mod1 <- lm(log_discharge ~ log_waterlevel, data = dc)

dc %>%
      select(-contains("x")) %>%
      mutate(preds = predict(mod1),
             resids = residuals(mod1)) %>%
      gather(variable, value, log_discharge, resids) %>%
      mutate(preds = ifelse(variable == "resids", 0, preds)) %>%
      ggplot(aes(log_waterlevel, value)) +
      geom_line(aes(y = preds), size = 2, col = "grey50") +
      geom_point() +
      facet_wrap("variable", scales = "free") +
      labs(x = "log(waterlevel)",
           y = "log(discharge) / residuals")

Figure 2: Left: Observed points overlaid on predicted line. Right: Residuals plotted by log(waterlevel)

(c)

Fit the model $y_i = \beta_0 + \beta_1 x_{i1} + ...+ \beta_6 x_{i6} + \epsilon_i$. Plot the estimate of the curve $y = \beta_0 + \beta_1 x_{1} + ...+ \beta_6 x_{6}$ as a function of $\log(h)$ along with the observed data (new figure). Does this model appear to describe the expected value of $y$ given $h$ adequately well? Also, plot the residuals as a function of $\log(h)$. Does the residual plot indicate that the standard deviation of the error term, $\epsilon$, is a constant as a function of $\log(h)$?

This model fits the data well. The left part of figure 3 shows that the predicted mean responses lie well inside the observed data. On the right of figure 3 we see that there is no discernible pattern in the residuals, but there might be a trace of heteroskedasticity.

cap <- "Left: Observed points overlaid on predicted line. Right: Residuals plotted by log(waterlevel)"
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)


mod2 <- lm(log_discharge ~ x1 + x2 + x3 + x4 + x5 + x6, data = dc)
dc %>%
      mutate(preds = predict(mod2),
             resids = residuals(mod2)) %>%
      gather(variable, value, log_discharge, resids) %>%
      mutate(preds = ifelse(variable == "resids", 0, preds)) %>%
      ggplot(aes(log_waterlevel, value)) +
      geom_line(aes(y = preds), size = 2, col = "grey50") +
      geom_point() +
      facet_wrap("variable", scales = "free") +
      labs(x = "log(waterlevel)",
           y = "log(discharge) / residuals")

Figure 3: Left: Observed points overlaid on predicted line. Right: Residuals plotted by log(waterlevel)

The absolute values of the residuals were regressed on the fitted values as a quick check for unequal variances. Table 1 shows a trend of slightly decreasing variance, but the effect is not significant (slope $-0.003$, $p = 0.202$). Figure 4 shows this weak trend with an overlaid local smoother.

table_cap <- paste0("Table ", table_num <- table_num + 1, ": ")
lm(abs(mod2$res) ~ mod2$fit) %>% 
      tidy %>% 
      set_names(c("Term", "Estimate", "SE", "t", "p")) %>% 
      kable(caption = paste0(table_cap, "Residuals regressed on fitted values."),
            digits = 3) %>% 
      kable_styling()

Table 1: Residuals regressed on fitted values.
Term	Estimate	SE	t	p
(Intercept)	0.050	0.013	3.815	0.000
mod2$fit	-0.003	0.002	-1.283	0.202

cap <- "Absolute values of residuals plotted by log(waterlevel)"
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)

dc %>%
      mutate(preds = predict(mod2),
             resids = residuals(mod2),
             abs_resids = abs(resids)) %>%
      ggplot(aes(log_waterlevel, abs_resids)) +
      geom_point() +
      geom_smooth() +
      labs(x = "log(waterlevel)",
           y = "abs(residuals)")

Figure 4: Absolute values of residuals plotted by log(waterlevel)

(d)

Use AIC to find which of the models below is best in terms of predicting discharge. The models that should be tested are $y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \epsilon_i$ where $k=1,2,3,4,5,6$.

One model was fitted for each value of $k$. Figure 5 and Table 2 show that the lowest AIC was found for $k = 3$ ($-421.73$, compared with $-419.78$ for $k = 4$).

cap <- "AIC plotted for each K = 1, 2, ..., 6"
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
formulae(~ log_discharge,
         "1" = ~ x1,
         "2" = ~ x1 + x2,
         "3" = ~ x1 + x2 + x3,
         "4" = ~ x1 + x2 + x3 + x4,
         "5" = ~ x1 + x2 + x3 + x4 + x5,
         "6" = ~ x1 + x2 + x3 + x4 + x5 + x6) %>%
      map(~ lm(., data = dc)) %>%
      tibble(k = names(.), model = .) %>%
      mutate(aic = map_dbl(model, AIC)) %>%
      ggplot(aes(k, aic, group = "none")) +
      geom_line() +
      geom_point() +
      labs(x = "K",
           y = "AIC")

Figure 5: AIC plotted for each K = 1, 2, …, 6

table_cap <- paste0("Table ", table_num <- table_num + 1, ": ")
formulae(~ log_discharge,
         "1" = ~ x1,
         "2" = ~ x1 + x2,
         "3" = ~ x1 + x2 + x3,
         "4" = ~ x1 + x2 + x3 + x4,
         "5" = ~ x1 + x2 + x3 + x4 + x5,
         "6" = ~ x1 + x2 + x3 + x4 + x5 + x6) %>%
      map(~ lm(., data = dc)) %>%
      tibble(k = names(.), model = .) %>%
      mutate(aic = map_dbl(model, AIC)) %>%
      arrange(aic) %>%
      select(-model) %>%
      set_names(c("K", "AIC")) %>%
      kable(caption = paste0(table_cap, "Model AIC for each K = 1, 2, ..., 6"),
            digits = 2) %>%
      kable_styling(bootstrap_options = c("striped", "hover"))

Table 2: Model AIC for each K = 1, 2, …, 6
K	AIC
3	-421.73
4	-419.78
6	-418.27
5	-418.04
2	-417.67
1	-174.98
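
To make the comparison above more concrete, here is a minimal sketch of how the AIC of each nested model can be computed directly from the Gaussian log-likelihood and checked against R's `AIC()`. It runs on simulated stand-in data, not on dc.csv, and the names `sim`, `fits` and `aic_by_hand` are assumptions made for this illustration.

```r
# Simulated stand-in data (NOT the dc.csv values); all names are illustrative
set.seed(1)
n <- 124
h <- runif(n, 1, 4)
x <- sapply(0:5, function(j) (h - 2.25)^j * log(h))
colnames(x) <- paste0("x", 1:6)
sim <- data.frame(y = 1 + 2 * x[, 1] + 0.3 * x[, 2] + rnorm(n, sd = 0.05), x)

# Fit the nested models with k = 1, ..., 6 predictors
fits <- lapply(1:6, function(k) {
  lm(reformulate(paste0("x", 1:k), response = "y"), data = sim)
})

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
aic_by_hand <- function(fit) {
  n <- nobs(fit)
  rss <- sum(residuals(fit)^2)
  n_par <- fit$rank + 1  # regression coefficients plus the error variance
  n * (log(2 * pi) + log(rss / n) + 1) + 2 * n_par
}

aic_table <- data.frame(k = 1:6,
                        AIC = sapply(fits, AIC),
                        by_hand = sapply(fits, aic_by_hand))
print(aic_table)
```

The penalty term grows by 2 for each added predictor, so a larger model is preferred only if it reduces the residual sum of squares enough to offset that.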

(e)

Compute an estimate and a 95% confidence interval for the expected value of $y$ when $h=3.19$ based on the model selected in (d), that is, the quantity of interest is $\textrm{E}(y(h)) = \beta_0 + \sum_{j=1}^{k^*} \beta_j x_{j}(h)$ where $k^*$ was selected in (d). Denote the estimate of $\textrm{E}(y(h))$ with $\hat{y}(h)$ where $\hat{y}(h) = \hat{\beta}_0 + \sum_{j=1}^{k^*} \hat{\beta}_j x_{j}(h)$. Also, compute an estimate of $\exp(\textrm{E}(y(h)))$, and a 95% confidence interval for $\exp(\textrm{E}(y(h)))$ when $h=3.19$.

Table 3 shows the estimated expected values with 95% confidence intervals for both log(discharge) and discharge. The column labelled $\hat \sigma$ is the standard error of the fitted mean. On the log scale it is additive, whereas after exponentiation it acts as a multiplicative factor on the original scale (1.007, that is, roughly 0.7%).

final_model <- lm(log_discharge ~ x1 + x2 + x3, data  = dc)
table_cap <- paste0("Table ", table_num <- table_num + 1, ": ")
tibble(h = 3.19) %>%
      mutate(x1 = log(h),
             x2 = (h - 2.25) * log(h),
             x3 = (h - 2.25)^2 * log(h)) %>%
      # recent broom versions only return .se.fit when se_fit = TRUE
      augment(final_model, newdata = ., se_fit = TRUE) %>%
      mutate(lower = .fitted - qt(0.975, df.residual(final_model)) * .se.fit,
             upper = .fitted + qt(0.975, df.residual(final_model)) * .se.fit,
             type = "log(discharge)") %>%
      select(type, .fitted, .se.fit, lower, upper) %>%
      rbind(mutate_at(., -1, exp) %>% mutate(type = "discharge")) %>%
      set_names(c("Type", "Fitted", "$\\hat \\sigma$", "Lower", "Upper")) %>% 
      kable(caption = paste0(table_cap, 
                             "Prediction and 95% confidence interval for discharge at log-scale and regular scale."),
            digits = 3,
            escape = FALSE) %>% 
      kable_styling() %>% 
      add_header_above(c("", "", "", "95% confidence interval" = 2))

Table 3: Prediction and 95% confidence interval for discharge at log-scale and regular scale.
			95% confidence interval
Type	Fitted	$\hat \sigma$	Lower	Upper
log(discharge)	6.147	0.007	6.133	6.160
discharge	467.192	1.007	460.941	473.527
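
The interval construction behind Table 3 can be sketched in base R as follows. This is an illustration on simulated stand-in data, not on dc.csv, and the object names (`sim`, `fit`, `new`, `ci_log`, `ci_orig`) are assumptions made for the example.

```r
# Simulated stand-in data (NOT the dc.csv values); object names are illustrative
set.seed(2)
h <- runif(124, 1, 4)
sim <- data.frame(x1 = log(h),
                  x2 = (h - 2.25) * log(h),
                  x3 = (h - 2.25)^2 * log(h))
sim$y <- 1 + 2 * sim$x1 + 0.3 * sim$x2 + rnorm(124, sd = 0.05)
fit <- lm(y ~ x1 + x2 + x3, data = sim)

# Predictor values at the water level of interest
h0 <- 3.19
new <- data.frame(x1 = log(h0),
                  x2 = (h0 - 2.25) * log(h0),
                  x3 = (h0 - 2.25)^2 * log(h0))

# Log scale: fitted value +/- t quantile * standard error of the fitted mean
pred <- predict(fit, newdata = new, se.fit = TRUE)
t_crit <- qt(0.975, df.residual(fit))
ci_log <- c(fit = unname(pred$fit),
            lwr = unname(pred$fit - t_crit * pred$se.fit),
            upr = unname(pred$fit + t_crit * pred$se.fit))

# exp() is monotone, so the endpoints can be transformed directly
ci_orig <- exp(ci_log)
print(rbind(log_scale = ci_log, original_scale = ci_orig))
```

The interval on the original scale is no longer symmetric around the point estimate, which is why the standard error is better read as a multiplicative factor there.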

Project 2 (50%)

Summary

The goal of this project is to find a linear regression model that can be used to predict body fat from body circumference measurements as well as age, height and weight. Accurate measurements of body fat are costly, so it is practical to have a simple and low-cost method for estimating body fat. In this project we are given a dataset with estimates of the percentage of body fat determined by underwater weighing for 252 men along with their various body circumference measurements. More information about this dataset can be found at

http://lib.stat.cmu.edu/datasets/bodyfat https://cran.r-project.org/web/packages/mfp/mfp.pdf

The dataset is available in R as bodyfat in the package mfp, see

https://www.rdocumentation.org/packages/mfp/versions/1.5.2/topics/bodyfat

The dataset contains 252 observations and 17 variables:

case: index
brozek: percentage of body fat using Brozek’s equation: 457/density - 414.2
siri: percentage of body fat using Siri’s equation: 495/density - 450
density: density determined from underwater weighing
age: in years
weight: in lbs
height: in inches
neck: circumference in cm
chest: circumference in cm
abdomen: circumference in cm
hip: circumference in cm
thigh: circumference in cm
knee: circumference in cm
ankle: circumference in cm
biceps: circumference in cm
forearm: circumference in cm
wrist: circumference in cm

(a)

Fit a linear model for percentage of body fat based on Siri’s equation. Start with setting the height of individual nr. 42 equal to 69.50 inches. Also, remove individual 182 from the dataset. Use all the potential predictors, that is, use age, weight, height and all the circumference variables. Report the summary of the estimation.

Table 4 shows summary statistics for the model. Five predictors were significant at the $\alpha = .05$ level: age and the neck, abdomen, forearm and wrist circumferences. The model $R^2$ is $0.744$, $R^2_{adj} = 0.73$ and $AIC = 1461$ on $237$ residual degrees of freedom.

data(bodyfat, package = "mfp")
bodyfat <- as_tibble(bodyfat) %>%
      filter(case != 182) %>% 
      mutate(height = ifelse(case == 42, 69.50, height))
mod3 <- lm(siri ~ age + weight + height + neck + chest + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist, data = bodyfat)
table_cap <- paste0("Table ", table_num <- table_num + 1, ": ")
tidy(mod3) %>%
      set_names(c("Term", "Estimate", "SE", "t.val", "p")) %>%
      mutate(col = ifelse(p > 0.05,
                          "black",
                          ifelse(Estimate < 0,
                                 "red",
                                 "blue")),
             Estimate = cell_spec(round(Estimate, 3), color = col),
             t.val = cell_spec(round(t.val, 3), 
                               color = col,
                               font_size = spec_font_size(t.val)),
             p = cell_spec(round(p, 3), bold = ifelse(p < 0.05, T, F)),
             p = str_replace(p, ">0<", "> < .001 <"),
             p = str_replace(p, ">0.", ">."), 
             Term = str_to_title(Term)) %>%
      select(-col) %>%
      kable(caption = paste0(table_cap, "Summary table for model"),
            digits = 3,
            escape = F,
            align = c("l", rep("c", 4))) %>%
      kable_styling(bootstrap_options = c("hover", "striped")) %>%
      footnote(general = paste0("$R^2$ = ", round(glance(mod3)$r.squared, 3), ", ",
                                "Adjusted $R^2$ = ", round(glance(mod3)$adj.r.squared, 3), ", ",
                                "AIC = ", round(glance(mod3)$AIC), ", ",
                                "Residual DF = ", glance(mod3)$df.residual), 
               escape = FALSE, general_title = "Model fit:")

Table 4: Summary table for model
Term	Estimate	SE	t.val	p
(Intercept)	-18.1	22.398	-0.808	.42
Age	0.064	0.032	1.975	.049
Weight	-0.089	0.062	-1.439	.152
Height	-0.06	0.179	-0.334	.739
Neck	-0.474	0.236	-2.012	.045
Chest	-0.032	0.104	-0.31	.757
Abdomen	0.955	0.090	10.595	< .001
Hip	-0.192	0.145	-1.326	.186
Thigh	0.231	0.147	1.571	.117
Knee	0.015	0.248	0.059	.953
Ankle	0.167	0.223	0.751	.454
Biceps	0.192	0.173	1.112	.267
Forearm	0.444	0.200	2.225	.027
Wrist	-1.669	0.533	-3.129	.002
Model fit:
$R^2$ = 0.744, Adjusted $R^2$ = 0.73, AIC = 1461, Residual DF = 237
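
As a quick consistency check on the model fit reported above, the adjusted $R^2$ can be recovered from $R^2$ and the degrees of freedom. This sketch assumes $n = 251$ (the 252 men minus individual 182) and $p = 14$ estimated coefficients, which agrees with the 237 residual degrees of freedom:

\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p} = 1 - (1 - 0.744) \cdot \frac{250}{237} \approx 0.730 \]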

(b)

Use the model in (a) to determine whether a Box–Cox transformation is needed or no transformation is needed. Explain your reasoning. If a Box–Cox transformation is needed, explain how the value of $\lambda$ was selected. If the Box–Cox transformation is needed then use it in the remaining scenarios.

The Box–Cox transformation applies the function

\[ y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \lambda \neq 0 \\ \log(y), & \lambda = 0 \end{cases} \]

to the dependent variable. The profile log-likelihood of $\lambda$ is then used to judge whether a transformation is needed. It was evaluated on the grid $\lambda = 0, 0.05, ..., 1.95, 2$, and the maximum was found at $\lambda = 1$, that is, with no transformation of the dependent variable. Further analysis was therefore performed with the dependent variable as is.
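
For reference, the profile log-likelihood evaluated over the grid can be written as follows, up to an additive constant, where $\mathrm{RSS}(\lambda)$ denotes the residual sum of squares of the model fitted to the transformed response and $n$ is the number of observations:

\[ \ell(\lambda) = -\frac{n}{2} \log\left(\frac{\mathrm{RSS}(\lambda)}{n}\right) + (\lambda - 1) \sum_{i=1}^{n} \log(y_i) \]

A common rule of thumb, stated here as a standard result rather than something computed in this report, is that an approximate 95% confidence interval for $\lambda$ consists of all values satisfying

\[ \ell(\lambda) \geq \ell(\hat{\lambda}) - \tfrac{1}{2} \chi^2_{1, 0.95} \approx \ell(\hat{\lambda}) - 1.92 . \]

If that interval contains $\lambda = 1$, the data give no clear reason to transform the response.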

cap <- "log-Likelihood plot for Box-Cox transformation of bodyfat data"
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
tibble(lambda = seq(0, 2, 0.05)) %>%
      add_column(data = list(bodyfat)) %>%
      unnest(data) %>%
      mutate(siri = ifelse(lambda == 0, 
                           log(siri),
                           (siri^lambda - 1) / lambda)) %>%
      group_by(lambda) %>%
      summarize(model = list(lm(siri ~ age + weight + height + neck + chest + 
                                      abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist))) %>%
      mutate(AIC = map_dbl(model, AIC),
             loglik = map2_dbl(model, lambda,
                               function(model, lambda) {
                                     # profile log-likelihood of lambda, up to an additive constant
                                     res <- residuals(model)
                                     y <- bodyfat$siri
                                     n <- length(res)
                                     - (n / 2) * log(sum(res^2)/n) + (lambda - 1) * sum(log(y))
                               })) %>%
      mutate(max_lam = lambda[which.max(loglik)],
             max_lam = ifelse(max_lam == lambda,
                              max_lam, NA)) %>%
      ggplot(aes(lambda, loglik)) +
      geom_line(col = "grey50") +
      geom_segment(aes(y = min(loglik), yend = loglik, x = max_lam, xend = max_lam), lty = 2) +
      geom_hline(aes(yintercept = min(loglik)), lty = 2, alpha = 0.7) +
      geom_point() +
      xlab(expression(lambda)) +
      ylab("log-Likelihood")

Figure 6: log-Likelihood plot for Box-Cox transformation of bodyfat data

(c)

Draw four figures showing scatterplots and sample correlation with the percentage of body fat variable included in each figure. Can you see any potential problems with the data through these figures? You can group as follows:

siri, age, weight, height
siri, neck, chest, abdomen, hip
siri, thigh, knee, ankle
siri, biceps, forearm, wrist

Figures 7-10 below show that there is substantial multicollinearity among the predictor variables. This could pose problems for the model fitting process. Figure 11 shows a correlation plot with the predictors arranged according to hierarchical clustering. We see that there are clusters of variables that are more collinear than others. For example, knee, thigh and hip circumference and weight are more correlated with each other than they are with age, height and forearm circumference.

cap <- "Scatterplot matrix for siri, age, weight and height"
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
bodyfat %>%
      select(siri, age, weight, height) %>%
      ggpairs

Figure 7: Scatterplot matrix for siri, age, weight and height

cap <- "Scatterplot matrix for siri, neck, chest, abdomen and hip"
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
bodyfat %>%
      select(siri, neck, chest, abdomen, hip) %>%
      ggpairs

Figure 8: Scatterplot matrix for siri, neck, chest, abdomen and hip

cap <- "Scatterplot matrix for siri, thigh, knee and ankle"
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
bodyfat %>%
      select(siri, thigh, knee, ankle) %>%
      ggpairs

Figure 9: Scatterplot matrix for siri, thigh, knee and ankle

cap <- "Scatterplot matrix for siri, biceps, forearm and wrist"
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
bodyfat %>%
      select(siri, biceps, forearm, wrist) %>%
      ggpairs

Figure 10: Scatterplot matrix for siri, biceps, forearm and wrist

cap <- "Correlation plot for the bodyfat data"
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
bodyfat %>%
      select(-case, -brozek, -density) %>%
      cor %>%
      (function(x) {
            colnames(x) <- str_to_title(colnames(x))
            rownames(x) <- str_to_title(rownames(x))
            x
      }) %>%
      corrplot::corrplot(method = "color", order = "hclust", tl.col = "black", tl.cex = 1.5)

Figure 11: Correlation plot for the bodyfat data

(d)

Use the diagnostic tools to detect possible problems with the model in (a) and to identify outliers. In particular, use the following tools: (i) plot studentized residuals versus index (`case`); (ii) plot leverage versus index; (iii) plot Jackknife residuals versus index; (iv) plot a qq-plot of the Jackknife residuals with prediction bounds for each of the ordered Jackknife residual (under the assumption that the error terms, $\epsilon_i$, are normally distributed). Furthermore, apply a Bonferroni outlier test to the three most extreme Jackknife residuals. Is there a reason to remove any observations from the dataset?

Figure 12 shows residuals, leverages and Cook's distances plotted against case index and predicted values. For the Jackknife residuals, 95% Bonferroni-corrected bounds are added for the outlier test. No Jackknife residual falls outside these bounds, so no point is considered an outlier. Figure 13 below shows qq-plots for standardized, studentized and Jackknife residuals along with 95% confidence bands. All residuals are within the bounds, so there is no reason to remove any observations.

diag_data <- bodyfat %>%
      mutate(pred = predict(mod3)) %>%
      select(case, siri, pred) %>%
      mutate(resid = siri - pred,
             lev = hatvalues(mod3),
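             # internally studentized residuals
             studres = resid / (summary(mod3)$sigma * sqrt(1 - lev)),
             # jackknife (externally studentized) residuals; df.residual(mod3) equals n - p
             jackknife = studres * sqrt((df.residual(mod3) - 1) / (df.residual(mod3) - studres^2)),
             # Cook's distance, with p = number of estimated coefficients
             cook = studres^2 / length(coef(mod3)) * lev / (1 - lev),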
             mean_lev = mean(lev),
             type = ifelse(lev > 2 * mean_lev, 
                           "High leverage", 
                           ifelse(abs(jackknife) > 2,
                                  "High residual",
                                  "Normal")),
             label = ifelse(type != "Normal", as.character(case), NA)) %>%
      select(-mean_lev)
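
As a cross-check on diagnostics of this kind, the sketch below computes studentized residuals, jackknife residuals and Cook's distance by hand and compares them with base R's `rstandard()`, `rstudent()` and `cooks.distance()`. It uses the built-in `mtcars` data as a stand-in for bodyfat, and the model and object names are assumptions made for the illustration. Here `p` is the number of estimated coefficients, including the intercept.

```r
# Stand-in model on built-in data (not the bodyfat model)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
n <- nobs(fit)
p <- length(coef(fit))  # number of estimated coefficients, incl. intercept
h <- hatvalues(fit)     # leverages
e <- residuals(fit)
s <- summary(fit)$sigma

# Internally studentized residuals
r_int <- e / (s * sqrt(1 - h))
# Jackknife (externally studentized) residuals
r_jack <- r_int * sqrt((n - p - 1) / (n - p - r_int^2))
# Cook's distance
cook <- r_int^2 / p * h / (1 - h)

# Each comparison returns TRUE when the hand computation matches base R
print(all.equal(r_int, rstandard(fit)))
print(all.equal(r_jack, rstudent(fit)))
print(all.equal(cook, cooks.distance(fit)))

# Bonferroni-corrected two-sided 5% critical value for the jackknife residuals
crit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)
print(which(abs(r_jack) > crit))  # indices of flagged observations, if any
```

Dividing the significance level by $2n$ accounts for testing every observation in both tails, which is what makes the bound suitable for the most extreme residuals.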

cap <- "Diagnostics plot for bodyfat model. Dashed lines are 95% Bonferroni-corrected bounds for the jackknife outlier test."
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
diag_data %>%
      gather(diagnostic, Y, resid, studres, jackknife, cook, lev) %>%
      gather(id, X, case, pred) %>%
      mutate(jackknife_level = ifelse(diagnostic == "jackknife",
                                      # jackknife residuals follow a t distribution with n - p - 1 df
                                      qt(0.05 / (2 * nrow(bodyfat)), 
                                         df.residual(mod3) - 1) * c(-1, 1),
                                      NA)) %>% 
      mutate(diagnostic = factor(diagnostic, 
                                 levels = c("resid", "studres", "jackknife", 
                                            "lev", "cook"),
                                 labels = c("Residual", "Studentized", "Jackknife", 
                                            "Leverage", "Cooks Distance")),
             id = factor(id, 
                         levels = c("case", "pred"),
                         labels = c("Case", "Predicted Value")),
             alpha = ifelse(!is.na(label), 0.1, 1)) %>%
      ggplot(aes(X, Y, col = type)) +
      geom_point(aes(alpha = alpha)) +
      geom_text(aes(label = label), show.legend = FALSE) +
      geom_hline(aes(yintercept = jackknife_level), lty = 2) +
      scale_x_continuous(breaks = pretty_breaks(8)) +
      scale_y_continuous(breaks = pretty_breaks(5)) +
      guides(col = guide_legend(title = "Type of observation",
                                title.position = "top",
                                title.hjust = 0.5,
                                labels = c("High leverage", "High residual", "Normal")),
             alpha = "none",
             label = "none") +
      scale_color_jama() +
      scale_alpha_continuous(range = c(0.1, 1)) +
      facet_grid(diagnostic ~ id, scales = "free") +
      labs(x = "",
           y = "")

Figure 12: Diagnostics plot for bodyfat model. Dashed lines are 95% Bonferroni-corrected bounds for the jackknife outlier test.

cap <- "QQ plot for raw (normal) and studentized (t) residuals."
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
diag_data %>%
      select(resid, studres, jackknife, label, type) %>%
      mutate(resid = (resid - mean(resid)) / sd(resid)) %>%
      gather(resid_type, residual, resid, studres, jackknife) %>%
      arrange(resid_type, residual) %>%
      group_by(resid_type) %>%
      mutate(obs_q = row_number() / (n() + 1),
             # compare on resid_type (the kind of residual), not type (the observation class)
             q = ifelse(resid_type == "studres", 
                        qt(obs_q, df.residual(mod3)),
                        qnorm(obs_q)),
             qlow = ifelse(resid_type == "studres",
                           qt(qbeta(0.05/2, row_number(), n() - row_number() + 1), df.residual(mod3)),
                           qnorm(qbeta(0.05/2, row_number(), n() - row_number() + 1))),
             qupp = ifelse(resid_type == "studres",
                           qt(qbeta(1 - 0.05/2, row_number(), n() - row_number() + 1), df.residual(mod3)),
                           qnorm(qbeta(1 - 0.05/2, row_number(), n() - row_number() + 1)))) %>%
      ungroup %>%
      mutate(resid_type = factor(resid_type, levels = c("resid", "studres", "jackknife"),
                                 labels = c("Raw", "Studentized", "Jackknife")),
             alpha = ifelse(type == "Normal", 1, 0.2)) %>%
      ggplot(aes(q, residual, col = type)) +
      geom_abline(intercept = 0, slope = 1, lty = 2, col = "grey50") +
      geom_line(aes(y = qlow), lty = 3, col = "grey30") +
      geom_line(aes(y = qupp), lty = 3, col = "grey30") +
      geom_ribbon(aes(ymin = qlow, ymax = qupp), alpha = 0.1, col = NA) +
      geom_point(aes(alpha = alpha)) +
      geom_text(aes(label = label), size = 5, show.legend = FALSE) +
      scale_color_jama() +
      scale_alpha_continuous(range = c(0.5, 1)) +
      guides(col = guide_legend(title = "Type of observation",
                                title.position = "top",
                                title.hjust = 0.5),
             alpha = "none") +
      facet_grid("resid_type", scales = "free") +
      # x holds the theoretical quantiles, y the observed residuals
      labs(x = "Theoretical",
           y = "Empirical")

Figure 13: QQ plot for raw (normal) and studentized (t) residuals.
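
The pointwise bands in a qq-plot like this one can be sketched as follows. The example assumes normal reference quantiles and a made-up sample size of 50. The names `n`, `i` and `band` are illustrative and the numbers do not come from the bodyfat model.

```r
# Pointwise 95% band for a normal qq-plot of n ordered residuals (n is made up)
n <- 50
i <- seq_len(n)

# The i-th smallest of n uniform draws follows a Beta(i, n - i + 1) distribution,
# so its quantiles mapped through qnorm() bound the i-th ordered residual
band <- data.frame(theoretical = qnorm(i / (n + 1)),
                   lower = qnorm(qbeta(0.025, i, n - i + 1)),
                   upper = qnorm(qbeta(0.975, i, n - i + 1)))
print(head(band, 3))
```

The bands are widest in the tails, where the extreme order statistics are most variable, so moderate departures from the reference line there are less alarming than the same departures in the middle.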

(e)

Plot the residuals versus (i) the fitted values; (ii) age; (iii) abdomen circumference; (iv) wrist circumference. Do these plot indicate that the proposed linear model is inadequate? In particular, is the variance constant? Does $y$ vary non-linearly with these predictors? Also, plot the residuals versus the other predictors without showing the plots. Report if any of these plots indicate that the proposed linear regression model is inadequate.

Figure 14 shows residuals plotted against predicted values, and the predictors age, and abdomen and wrist circumference. There is no sign of heteroskedasticity in the residuals and there does not seem to be a need for quadratic or cubic functions of the predictors. The same analysis was performed for other predictors with the same results.

cap <- "Residuals versus predicted values, age, and abdomen and wrist circumference."
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)
bodyfat %>%
      select(siri, !!!syms(names(coef(mod3))[-1])) %>%
      mutate(pred = predict(mod3),
             resid = siri - pred) %>% 
      select(resid, pred, age, abdomen, wrist) %>%
      gather(variable, value, -resid) %>% 
      mutate(variable = str_to_title(variable)) %>% 
      ggplot(aes(value, resid)) +
      geom_point(alpha = 0.6) +
      geom_smooth(span = 0.8) +
      facet_grid("variable", scales = "free_x") +
      labs(x = "", 
           y = "Residual")

Figure 14: Residuals versus predicted values, age, and abdomen and wrist circumference.

(f)

Here we use the dataset with removed observations (if suggested by (d)) and a model with additional quadratic and cubic terms (if suggested by (e)). The task is to find a good model for predicting the percentage of body fat using a subset of the predictors suggested by AIC. You can use the function `stepAIC` in the `MASS` package or another selection method based on AIC. Present the final prediction model with a formula.

Since (d) and (e) suggested neither removing observations nor adding polynomial terms, the model from (a) was used as the starting point. A model was then selected by stepwise search based on AIC, using the stepAIC function from the MASS package. Table 5 below shows summary statistics for the coefficients and model fit, and the model formula can be seen below that.

step_mod <- stepAIC(mod3, direction = "both", trace = 0)
table_cap <- paste0("Table ", table_num <- table_num + 1, ": ")
tidy(step_mod) %>% 
      set_names(c("Term", "Estimate", "SE", "t.val", "p")) %>%
      mutate(col = ifelse(p > 0.05,
                          "black",
                          ifelse(Estimate < 0,
                                 "red",
                                 "blue")),
             Estimate = cell_spec(round(Estimate, 3), color = col),
             t.val = cell_spec(round(t.val, 3), 
                               color = col,
                               font_size = spec_font_size(t.val)),
             p = cell_spec(round(p, 3), bold = ifelse(p < 0.05, T, F)),
             p = str_replace(p, ">0<", "> < .001 <"),
             p = str_replace(p, ">0.", ">."), 
             Term = str_to_title(Term)) %>%
      select(-col) %>%
      kable(caption = paste0(table_cap, "Summary table for model selected by step-wise AIC comparison."),
            digits = 3,
            escape = F,
            align = c("l", rep("c", 4))) %>%
      kable_styling(bootstrap_options = c("hover", "striped")) %>%
      footnote(general = paste0("$R^2$ = ", round(glance(step_mod)$r.squared, 3), ", ",
                                "Adjusted $R^2$ = ", round(glance(step_mod)$adj.r.squared, 3), ", ",
                                "AIC = ", round(glance(step_mod)$AIC), ", ",
                                "Residual DF = ", glance(step_mod)$df.residual), 
               escape = FALSE, general_title = "Model fit:")

Table 5: Summary table for model selected by step-wise AIC comparison.
Term	Estimate	SE	t.val	p
(Intercept)	-21.781	11.749	-1.854	.065
Age	0.065	0.031	2.111	.036
Weight	-0.089	0.040	-2.22	.027
Neck	-0.46	0.225	-2.048	.042
Abdomen	0.942	0.072	13.092	< .001
Hip	-0.194	0.138	-1.404	.162
Thigh	0.293	0.129	2.264	.024
Forearm	0.505	0.187	2.708	.007
Wrist	-1.553	0.510	-3.048	.003
Model fit:
$R^2$ = 0.742, Adjusted $R^2$ = 0.734, AIC = 1453, Residual DF = 242

Below is the formula for predicting bodyfat based on Siri’s equation using the variables chosen via stepwise AIC comparisons.

model_fmla <- tidy(step_mod) %>% 
      select(term, estimate) %>% 
      mutate(estimate = round(estimate, 3),
             term = ifelse(str_detect(term, "Intercept"),
                           "",
                           paste0(" \\cdot ", term)),
             sign = ifelse(estimate < 0,
                           "&-",
                           "&+"),
             fmla = paste0(sign, " ", abs(estimate), term)) %>% 
      .$fmla %>% 
      str_flatten(collapse = " \\\\ ")

\[ \begin{aligned} \textrm{siri} = &- 21.781 \\ &+ 0.065 \cdot age \\ &- 0.089 \cdot weight \\ &- 0.46 \cdot neck \\ &+ 0.942 \cdot abdomen \\ &- 0.194 \cdot hip \\ &+ 0.293 \cdot thigh \\ &+ 0.505 \cdot forearm \\ &- 1.553 \cdot wrist \end{aligned} \]

(g)

Interpret the parameters in the model selected in (f) (except for the intercept).

all_coefs <- tidy(step_mod) %>% 
      select(term, estimate) %>% 
      filter(!str_detect(term, "Intercept")) %>% 
      mutate(estimate = round(estimate, 3)) %>% 
      spread(term, estimate)

Each coefficient gives the expected change in a person’s Siri score (percentage of body fat) for a one-unit increase in that predictor, holding the other predictors in the model fixed.

age: Each additional year of age increases a person’s score by 0.065
weight: Each additional pound (lb) of weight decreases a person’s score by 0.089

The remaining coefficients relate to the circumference of body parts in centimeters, so that each extra centimeter in circumference

neck: decreases Siri score by 0.46
abdomen: increases Siri score by 0.942
hip: decreases Siri score by 0.194
thigh: increases Siri score by 0.293
forearm: increases Siri score by 0.505
wrist: decreases Siri score by 1.553
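
As a worked example of reading these coefficients, consider a hypothetical increase of 5 cm in abdomen circumference with all other predictors held fixed (the 5 cm value is chosen only for illustration). The predicted Siri score then changes by

\[ \Delta \hat{y} = \hat\beta_{\textrm{abdomen}} \cdot \Delta x_{\textrm{abdomen}} = 0.942 \cdot 5 = 4.71 \]

percentage points of body fat.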

(h)

Predict the percentage of body fat and compute a 95% confidence interval for an individual with the following predictor values using the model selected in (f).

$\texttt{age}=40$ years old

$\texttt{weight}=180$ lbs

$\texttt{height}=71$ inches

$\texttt{neck}=38$ cm

$\texttt{chest}=95$ cm

$\texttt{abdomen}=92$ cm

$\texttt{hip}=96$ cm

$\texttt{thigh}=58$ cm

$\texttt{knee}=38$ cm

$\texttt{ankle}=22$ cm

$\texttt{biceps}=32$ cm

$\texttt{forearm}=29$ cm

$\texttt{wrist}=18$ cm

The interval computed below uses the standard error of the fitted mean, $\hat\sigma\sqrt{x_0^T (X^TX)^{-1} x_0}$, so it is a 95% confidence interval for the expected Siri score of individuals with these predictor values. An interval for the score of one specific individual (a prediction interval) would also include the residual variance, replacing the square root with $\sqrt{1 + x_0^T (X^TX)^{-1} x_0}$, and would therefore be wider. The point prediction and the 95% confidence interval can be seen in table 6.

# Predictor values in the same order as coef(step_mod)
x0 <- tibble(intercept = 1, age = 40, weight = 180, neck = 38, abdomen = 92, 
             hip = 96, thigh = 58, forearm = 29, wrist = 18) %>%
      as.matrix
betas <- matrix(coef(step_mod), ncol = 1)
xtx <- model.matrix(step_mod) %>% (function(x) t(x) %*% x)
sigma_hat <- sigma(step_mod)
# Point prediction and standard error of the fitted mean
y0 <- as.numeric(x0 %*% betas)
sd_y0 <- as.numeric(sigma_hat * sqrt(x0 %*% solve(xtx) %*% t(x0)))
t_q <- qt(1 - 0.05/2, df.residual(step_mod))
y_interval <- y0 + c(-1, 0, 1) * t_q * sd_y0

table_cap <- paste0("Table ", table_num <- table_num + 1, ": ")

tibble(var = c("Lower", "Prediction", "Upper"), 
       value = y_interval) %>%
      spread(var, value) %>% 
      select(Prediction, Lower, Upper) %>%
      kable(caption = paste0(table_cap, "Predicted value and 95% confidence interval"),
            align = c("l", "c", "c"),
            digits = 3) %>%
      kable_styling %>%
      add_header_above(c("", "95% Confidence Interval" = 2))

Table 6: Predicted value and 95% confidence interval
	95% Confidence Interval
Prediction	Lower	Upper
19.101	17.896	20.307
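
The difference between a confidence interval for the mean response and a prediction interval for a single individual can be sketched with base R's `predict()`. The example below is a minimal sketch that assumes a toy model fitted to the built-in `mtcars` data, not the body fat model; the names `fit`, `new_obs`, `conf_int` and `pred_int` are illustrative only.

```r
# Toy linear model on the built-in mtcars data (an assumption for illustration,
# not the body fat model from the report).
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_obs <- data.frame(wt = 3, hp = 120)

# Interval for the mean response at new_obs
conf_int <- predict(fit, new_obs, interval = "confidence", level = 0.95)
# Interval for a single new observation at new_obs
pred_int <- predict(fit, new_obs, interval = "prediction", level = 0.95)

# The same quantities by hand
x0 <- c(1, 3, 120)                                 # intercept, wt, hp
se_mean <- sqrt(drop(t(x0) %*% vcov(fit) %*% x0))  # sigma * sqrt(x0' (X'X)^-1 x0)
se_pred <- sqrt(sigma(fit)^2 + se_mean^2)          # sigma * sqrt(1 + x0' (X'X)^-1 x0)
t_q <- qt(1 - 0.05 / 2, df.residual(fit))
y0 <- sum(x0 * coef(fit))

manual_conf <- y0 + c(-1, 1) * t_q * se_mean
manual_pred <- y0 + c(-1, 1) * t_q * se_pred
```

The prediction interval is always the wider of the two because it adds the residual variance of a single observation to the uncertainty in the fitted mean.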

Project 3 (25%)

Summary

Here we analyze data that can be used to predict the presence of chronic kidney disease. The response variable takes the values ‘has chronic kidney disease’ and ‘doesn’t have chronic kidney disease’.

The dataset contains 203 observations and 12 variables:

ckdmem: there are 2 classes, ckd or notckd
age: in years
blood.pressure: in mm/Hg
blood.glucose.random: in mgs/dl
blood.urea: in mgs/dl
serum.creatinine: in mgs/dl
sodium: in mEq/L
potassium: in mEq/L
hemoglobin: in gms
packed.cell.volume
white.blood.cell.count: in cells/cmm
red.blood.cell.count: in cells/cmm

The dataset is available in R as ckd in the package teigen, see

https://www.rdocumentation.org/packages/teigen/versions/2.2.2/topics/ckd

The goal is to predict the probability of not having chronic kidney disease for an individual with given values of the above predictors.

The response variable is \[\begin{equation} y_{i} = \left\{\begin{array}{ll} 1 & \hbox{if the $i$-th individual doesn't have chronic kidney disease,} \\ 0 & \hbox{if the $i$-th individual has chronic kidney disease,} \\ \end{array} \right. \nonumber \end{equation}\] where $i=1,...,n$ and $n=203$. Here we assume that the $y_i$’s are independent Bernoulli random variables, that is, $y_i \sim \hbox{Bin}(1,\mu_i)$, $i=1,...,n$. The following generalized linear model is proposed; \[\begin{equation} \eta_i=g(\mu_i) = \log(\mu_i)-\log(1-\mu_i) = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij}, \nonumber \end{equation}\] for $i=1,...,n$, so \[\begin{equation} \mu_i = g^{-1}(\eta_i) = \frac{\exp\left(\beta_0+\sum_{j=1}^{p-1} \beta_j x_{ij}\right)}{1+\exp\left(\beta_0+\sum_{j=1}^{p-1} \beta_j x_{ij}\right)}. \nonumber \end{equation}\]

(a)

Here we look at logistic regression models for the chronic kidney disease dataset that use only two predictors. Find the two predictor model that gives the lowest AIC. Report a summary of the estimates of the selected model.

data(ckd, package = "teigen")
ckd <- ckd %>% as_tibble

First a two-predictor model was found with LASSO regression using the glmnet package. The number of coefficients for each value of the penalizing parameter $\lambda$ as well as which covariates were kept in the model can be seen in table 7. The optimal two-predictor model was found to be one with hemoglobin and packed.cell.volume.

best_log <- glmnet(x = ckd %>% select(-ckdmem) %>% as.matrix, 
                   y = ckd$ckdmem,
                   family = "binomial",
                   alpha = 1,
                   lambda = seq(0, 0.5, 0.01))
table_cap <- paste0("Table ", table_num <- table_num + 1, ": ")
tidy(best_log) %>% 
      filter(!str_detect(term, "Intercept")) %>% 
      group_by(step) %>% 
      mutate(num_coef = n(),
             estimate = 1) %>% 
      ungroup %>% 
      group_by(num_coef) %>% 
      mutate(max_dev = max(dev.ratio),
             best_lambda = lambda[which.max(dev.ratio)]) %>% 
      distinct(num_coef, estimate, best_lambda, term) %>% 
      unite(group, num_coef, best_lambda) %>% 
      spread(term, estimate) %>% 
      separate(group, c("num_coef", "best_lambda"), sep = "_") %>% 
      mutate_at(1:2, as.numeric) %>%
      arrange(num_coef) %>% 
      rename("N. Coef" = num_coef, "$\\lambda$" = best_lambda) %>% 
      kable(caption = paste0(table_cap, "Predictors in LASSO model for differing values of $\\lambda$"),
            digits = 3,
            escape = FALSE) %>% 
      kable_styling(bootstrap_options = c("striped", "hover")) %>% 
      column_spec(1, width = "10em") %>% 
      add_header_above(c("", "", "Predictors in model" = 9))

Table 7: Predictors in LASSO model for differing values of $\lambda$
		Predictors in model
N. Coef	$\lambda$	age	blood.glucose.random	blood.pressure	hemoglobin	packed.cell.volume	red.blood.cell.count	serum.creatinine	sodium	white.blood.cell.count
1	0.37				1
2	0.16				1	1
3	0.08		1		1	1
4	0.07	1	1		1	1
5	0.06	1	1	1	1	1
8	0.03	1	1	1	1	1	1	1	1
9	0.01	1	1	1	1	1	1	1	1	1

The model chosen via LASSO was refit as an unpenalized logistic regression using glm. Summary statistics for the fitted parameters as well as model fit can be seen in table 8. McFadden’s $R^2$ was found by comparing the log-likelihood of the fitted model, $\log(L_c)$, to that of the intercept-only null model, $\log(L_{null})$, via:

\[ R^2_{MF} = 1 - \frac{\log(L_c)}{\log(L_{null})} \]

mod_log <- glm(ckdmem ~ hemoglobin + packed.cell.volume,
               family = binomial,
               data = ckd)
null_mod <- glm(ckdmem ~ 1, family = binomial, data = ckd)
table_cap <- paste0("Table ", table_num <- table_num + 1, ": ")
tidy(mod_log) %>%
      set_names(c("Term", "Estimate", "SE", "t.val", "p")) %>%
      mutate(col = ifelse(p > 0.05,
                          "black",
                          ifelse(Estimate < 0,
                                 "red",
                                 "blue")),
             Estimate = cell_spec(round(Estimate, 3), color = col),
             t.val = cell_spec(round(t.val, 3), 
                               color = col,
                               font_size = spec_font_size(t.val)),
             p = cell_spec(round(p, 3), bold = ifelse(p < 0.05, T, F)),
             p = str_replace(p, ">0<", "> < .001 <"),
             p = str_replace(p, ">0.", ">."), 
             Term = str_to_title(Term)) %>%
      select(-col) %>%
      kable(caption = paste0(table_cap, "Summary table for model chosen via LASSO"),
            digits = 3,
            escape = F,
            align = c("l", rep("c", 4))) %>%
      kable_styling(bootstrap_options = c("hover", "striped")) %>%
      footnote(general = paste0("McFadden's $R^2$ = ", round(1 - as.numeric(logLik(mod_log) / logLik(null_mod)), 
                                                             3), ", ",
                                "AIC = ", round(glance(mod_log)$AIC), ", ",
                                "Residual DF = ", glance(mod_log)$df.residual), 
               escape = FALSE, general_title = "Model fit:")

Table 8: Summary table for model chosen via LASSO
Term	Estimate	SE	t.val	p
(Intercept)	-31.625	5.909	-5.352	< .001
Hemoglobin	1.371	0.423	3.244	.001
Packed.cell.volume	0.332	0.118	2.806	.005
Model fit:
McFadden’s $R^2$ = 0.814, AIC = 56, Residual DF = 200

As a sanity check the package bestglm was used to perform model selection by exhaustive search over the predictors passed to it. This procedure also chose hemoglobin as one of the two predictors but the second one chosen was serum.creatinine. Summary statistics for the model parameters and model fit can be seen in table 9. This model had a higher $R^2_{MF}$ (0.881 compared to 0.814) and a lower AIC (38 compared to 56) than the LASSO model so it was used for the rest of the analysis.

mod_log_bestglm <- bestglm(Xy = ckd %>% 
                                 select(age, blood.pressure, blood.glucose.random, blood.urea,
                                        serum.creatinine, sodium, potassium, hemoglobin, ckdmem) %>% 
                                 as.data.frame,
                           family = binomial,
                           IC = "AIC", 
                           nvmax = 2) %>% 
      .$BestModel
table_cap <- paste0("Table ", table_num <- table_num + 1, ": ")
mod_log_bestglm %>% 
      tidy %>% 
      set_names(c("Term", "Estimate", "SE", "t.val", "p")) %>%
      mutate(col = ifelse(p > 0.05,
                          "black",
                          ifelse(Estimate < 0,
                                 "red",
                                 "blue")),
             Estimate = cell_spec(round(Estimate, 3), color = col),
             t.val = cell_spec(round(t.val, 3), 
                               color = col,
                               font_size = spec_font_size(t.val)),
             p = cell_spec(round(p, 3), bold = ifelse(p < 0.05, T, F)),
             p = str_replace(p, ">0<", "> < .001 <"),
             p = str_replace(p, ">0.", ">."), 
             Term = str_to_title(Term)) %>%
      select(-col) %>%
      kable(caption = paste0(table_cap, "Summary table for model chosen via exhaustive search with bestglm"),
            digits = 3,
            escape = F,
            align = c("l", rep("c", 4))) %>%
      kable_styling(bootstrap_options = c("hover", "striped")) %>%
      footnote(general = paste0("McFadden's $R^2$ = ", round(1 - as.numeric(logLik(mod_log_bestglm) / logLik(null_mod)), 
                                                             3), ", ",
                                "AIC = ", round(glance(mod_log_bestglm)$AIC), ", ",
                                "Residual DF = ", glance(mod_log_bestglm)$df.residual), 
               escape = FALSE, general_title = "Model fit:")

Table 9: Summary table for model chosen via exhaustive search with bestglm
Term	Estimate	SE	t.val	p
(Intercept)	-15.028	5.395	-2.786	.005
Serum.creatinine	-6.17	2.175	-2.836	.005
Hemoglobin	1.717	0.427	4.022	< .001
Model fit:
McFadden’s $R^2$ = 0.881, AIC = 38, Residual DF = 200

(b)

Draw a normal probability plot of the deviance residuals. Do the deviance residuals appear to follow a normal distribution? Draw the deviance residuals versus all the explanatory variables. Are any outliers found when looking at these plots? The deviance residuals are evaluated by the `glm` function, with help of the `summary` function.

Figure 15 below shows a density plot of the deviance residuals as well as the residuals plotted against each of the two model predictors. The density of the residuals is roughly bell-shaped but has a sharper peak and longer tails than a normal distribution. There are two cases with large residuals. Both of them are ckd cases but both have high hemoglobin and low serum.creatinine, values that are more common among cases without ckd.

cap <- "Top: Density plot of deviance residuals. Bottom: Deviance residuals plotted against predictor variables."
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)


mod_aug <- augment(mod_log_bestglm)

p1 <- ggplot(mod_aug, aes(.resid)) +
      geom_density()

p2 <- mod_aug %>% 
      gather(variable, value, serum.creatinine, hemoglobin) %>% 
      ggplot(aes(value, .resid, col = y)) +
      geom_point(alpha = 0.5) +
      facet_grid("variable", scales = "free") +
      scale_color_jama(guide = guide_legend(title = "class",
                                            title.position = "top",
                                            title.hjust = 0.5,
                                            nrow = 1)) +
      theme(legend.position = c(0.8, 0.25))

grid.arrange(arrangeGrob(p1, p2, ncol = 1))

Figure 15: Top: Density plot of deviance residuals. Bottom: Deviance residuals plotted against predictor variables.
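
A normal probability plot of the deviance residuals could be drawn along the following lines. This is a minimal sketch that assumes a toy logistic regression on the built-in `mtcars` data in place of the fitted kidney disease model; `fit` and `dev_res` are illustrative names.

```r
# Toy logistic regression on the built-in mtcars data (an assumption for
# illustration, not the kidney disease model from the report).
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Deviance residuals, one per observation
dev_res <- residuals(fit, type = "deviance")

# Normal probability (Q-Q) plot with a reference line through the quartiles
qq <- qqnorm(dev_res, main = "Normal Q-Q plot of deviance residuals")
qqline(dev_res)
```

Points that fall far from the reference line at either end indicate heavier tails than a normal distribution would produce.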

(c)

Interpret the parameters in the model selected in (a) (except for the intercept).

all_coefs <- tidy(mod_log_bestglm) %>% 
      select(term, estimate) %>% 
      filter(!str_detect(term, "Intercept")) %>% 
      mutate(estimate = round(exp(estimate), 3)) %>% 
      spread(term, estimate)

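hemoglobin: For each additional unit (gms) of hemoglobin, with serum.creatinine held fixed, the odds of not having CKD are multiplied by 5.568, thereby decreasing the risk of CKD.
serum.creatinine: For each additional mgs/dl of serum creatinine, with hemoglobin held fixed, the odds of not having CKD are multiplied by 0.002, thereby increasing the risk of CKD.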
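
These multipliers are the exponentiated coefficients from table 9, since a one-unit increase in a predictor adds $\beta_j$ to the log-odds of not having CKD:

\[ e^{1.717} \approx 5.568, \qquad e^{-6.17} \approx 0.002 \]

Because serum.creatinine varies over a narrow range, a smaller step may be easier to interpret. As an illustration, an increase of 0.1 mgs/dl multiplies the odds of not having CKD by roughly

\[ e^{-6.17 \cdot 0.1} = e^{-0.617} \approx 0.54 \]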

(d)

What is the probability of not having the chronic kidney disease in the case of an individual that has the following explanatory variables;

$\texttt{age}=54$ years old,
$\texttt{blood.pressure}=83$ mm/Hg,
$\texttt{blood.glucose.random}=135$ mgs/dl,
$\texttt{blood.urea}= 15.3$ mgs/dl,
$\texttt{serum.creatinine}=0.21$ mgs/dl,
$\texttt{sodium}=134$ mEq/L,
$\texttt{potassium}=4.6$ mEq/L,
$\texttt{hemoglobin}=12.9$ gms,
$\texttt{packed.cell.volume}=42$,
$\texttt{white.blood.cell.count}=6310$ cells/cmm,
$\texttt{red.blood.cell.count}=5.1$ cells/cmm.

chance <- tibble(hemoglobin = 12.9, serum.creatinine = 0.21) %>% 
      predict(mod_log_bestglm, ., type = "response") %>% 
      (function(x) round(100 * x, 2) %>% paste0("%"))

The probability of not having chronic kidney disease, as predicted by the fitted model, is 99.71%
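
The calculation can be checked by hand with the rounded estimates from table 9. Only hemoglobin and serum.creatinine enter the selected model, so the other values listed above do not affect the prediction:

\[ \hat\eta = -15.028 - 6.17 \cdot 0.21 + 1.717 \cdot 12.9 \approx 5.826, \qquad \hat\mu = \frac{1}{1 + e^{-5.826}} \approx 0.997 \]

This agrees with the reported value up to rounding of the coefficients.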

(e)

Compute an estimate and a 95% confidence interval for this probability based on the model selected in (a). Here an estimate of the covariance of the estimator for $\beta$ is needed.

The interval is computed on the scale of the linear predictor (the log-odds) and then mapped to the probability scale with the inverse logit, which keeps the bounds between 0 and 1. The predicted probability as well as a 95% confidence interval is shown in table 10.

x0 <- tibble(intercept = 1, serum.creatinine = 0.21, hemoglobin = 12.9) %>% 
      as.matrix
betas <- matrix(coef(mod_log_bestglm), ncol = 1)
xtx <- model.matrix(mod_log_bestglm) %>% (function(x) t(x) %*% x)
sigma_hat <- sigma(mod_log_bestglm)
y0 <- as.numeric(x0 %*% betas)
sd_y0 <- as.numeric(sigma_hat * sqrt(x0 %*% solve(xtx) %*% t(x0)))
t_q <- qt(1 - 0.05/2, df.residual(mod_log_bestglm))
y_interval <- y0 + c(-1, 0, 1) * t_q * sd_y0

y_interval <- 1 / (1 + exp(-y_interval))

table_cap <- paste0("Table ", table_num <- table_num + 1, ": ")

tibble(var = c("Lower", "Prediction", "Upper"), 
       value = paste0(round(y_interval * 100, 2), "%")) %>%
      spread(var, value) %>% 
      select(Prediction, Lower, Upper) %>%
      kable(caption = paste0(table_cap, "Predicted probability of not having CKD and 95% confidence interval"),
            align = c("l", "c", "c"),
            digits = 3) %>%
      kable_styling %>%
      add_header_above(c("", "95% Confidence Interval" = 2))

Table 10: Predicted probability of not having CKD and 95% confidence interval
	95% Confidence Interval
Prediction	Lower	Upper
99.71%	99.68%	99.73%
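
For a generalized linear model, a common way to obtain such an interval is a Wald interval on the link scale, where the covariance of $\hat\beta$ is taken from `vcov()`. For logistic regression this covariance is based on $(X^T W X)^{-1}$ rather than the $\hat\sigma^2 (X^T X)^{-1}$ of the linear model. The following is a minimal sketch under the assumption of a toy logistic regression on the built-in `mtcars` data, not the kidney disease model; all object names are illustrative.

```r
# Toy logistic regression on the built-in mtcars data (an assumption for
# illustration, not the kidney disease model from the report).
fit <- glm(am ~ wt, family = binomial, data = mtcars)
new_obs <- data.frame(wt = 3)

# Fitted log-odds and its standard error as returned by predict()
link_pred <- predict(fit, new_obs, type = "link", se.fit = TRUE)

# The same standard error from the estimated covariance of the coefficients
x0 <- c(1, 3)                                     # intercept, wt
se_eta <- sqrt(drop(t(x0) %*% vcov(fit) %*% x0))

# Wald interval on the log-odds scale, mapped to probabilities by the inverse logit
z_q <- qnorm(1 - 0.05 / 2)
eta_interval <- unname(link_pred$fit) + c(-1, 0, 1) * z_q * se_eta
prob_interval <- plogis(eta_interval)
names(prob_interval) <- c("Lower", "Prediction", "Upper")
```

Building the interval on the link scale and transforming it afterwards guarantees that both bounds are valid probabilities.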

(f)

Plot the individuals that don’t have chronic kidney disease with blue dots and the individuals that have chronic kidney disease with red dots on a graph with the first predictor on the x-axis and the second predictor on the y-axis. Plot a line on this graph such that the estimated probability of having chronic kidney disease is equal to $0.5$ along the line. Explain what this graph is showing.

Figure 16 shows the linear separation between cases having chronic kidney disease and those without based on hemoglobin and serum.creatinine. Individuals that land in the grey part of the plot are predicted to have a greater than $50\%$ chance of chronic kidney disease and those in the orange part have a less than $50\%$ chance. The figure shows a clear linear separation between the classes.
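
Because the inverse logit equals $0.5$ exactly when the linear predictor is zero, the separating line can also be written in closed form. Using the rounded estimates from table 9:

\[ -15.028 - 6.17 \cdot \textrm{serum.creatinine} + 1.717 \cdot \textrm{hemoglobin} = 0 \quad \Longleftrightarrow \quad \textrm{serum.creatinine} \approx 0.278 \cdot \textrm{hemoglobin} - 2.436 \]

Points below this line (lower serum.creatinine for a given hemoglobin) have a predicted probability of not having CKD above $0.5$.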

cap <- "Linear separation plot of chronic kidney disease."
fig_num <- fig_num + 1
cap <- paste0("Figure ", fig_num, ": ", cap)

separator_dat_fill <- ckd %>% 
      data_grid(hemoglobin = seq_range(hemoglobin, n = 300), 
                serum.creatinine = seq_range(serum.creatinine, n = 300)) %>% 
      add_predictions(mod_log_bestglm) %>% 
      mutate(pred = 1 / (1 + exp(-pred)),
             ckdmem = ifelse(pred < 0.5, 
                             "Yes",
                             "No"),
             ckdmem = factor(ckdmem, 
                             levels = c("Yes", "No"), 
                             labels = c("Yes", "No")))

separator_dat_line <- ckd %>% 
      data_grid(hemoglobin = seq_range(hemoglobin, n = 400), 
                serum.creatinine = seq_range(serum.creatinine, n = 400)) %>% 
      add_predictions(mod_log_bestglm) %>% 
      mutate(pred = 1 / (1 + exp(-pred))) %>% 
      filter(abs(pred - 0.5) < 0.01)


ckd %>% 
      mutate(ckdmem = factor(ckdmem, 
                             levels = c("ckd", "notckd"),
                             labels = c("Yes", "No"))) %>% 
      ggplot(aes(hemoglobin, serum.creatinine, col = ckdmem)) +
      geom_raster(data = separator_dat_fill, aes(fill = ckdmem, col = NULL), 
                  alpha = 0.5) +
      geom_line(data = separator_dat_line, size = 1, alpha = 0.6,
                color = "royalblue1", show.legend = FALSE) +
      geom_point() +
      scale_color_jama(guide = guide_legend(title = "CKD",
                                            title.position = "top",
                                            title.hjust = 0.5)) +
      scale_fill_jama(guide = guide_legend(title = "CKD",
                                            title.position = "top",
                                            title.hjust = 0.5)) +
      coord_cartesian(expand = F)

Figure 16: Linear separation plot of chronic kidney disease.

Take home test

Applied linear statistical models

Brynjólfur Gauti Jónsson

Teacher: Birgir Hrafnkelsson

2018-10-21

