Advanced Biostatistics | Assignment 2

Author
Affiliation

Ariel Shatsky

Northeastern University

Published

March 25, 2025

Instructions

Please respond to the questions by inserting your code and any required interpretation into this assignment template file. Submit the source Quarto and rendered HTML versions of your assignment via Canvas.

Task 1

Question 1.1

df1 <- read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn2_prob1.csv")
#create a numerical version of the response called alzheimers_num by converting
#the original categorical variable alzheimers into a (discrete) numerical variable
str(df1)
'data.frame':   400 obs. of  4 variables:
 $ years     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ plaque    : chr  "absent" "absent" "absent" "absent" ...
 $ rep       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ alzheimers: chr  "negative" "negative" "negative" "positive" ...
df1$alzheimers_num <- ifelse(df1$alzheimers == "positive", 1, 0)
table(df1$alzheimers, df1$alzheimers_num)
          
             0   1
  negative 225   0
  positive   0 175

Question 1.2

# plot this new variable (y axis) as a function of time (x axis) using different symbols and colours for patients with vs. without amyloid plaques and briefly describe these patterns.
library(ggplot2)

df1$plaque <- as.factor(df1$plaque)
# Scatter plot with jitter to prevent overlap
library(ggplot2)

# Ensure plaque is a factor
df1$plaque <- as.factor(df1$plaque)

# Create the jittered scatter plot
ggplot(df1, aes(x = years, y = jitter(alzheimers_num), colour = plaque, shape = plaque)) +
  geom_point(size = 3, alpha = 0.7) +  
  labs(x = "Years", y = "Alzheimer's Diagnosis (0 = No, 1 = Yes)",
       title = "Alzheimer's Diagnosis Over Time by Plaque Presence",
       colour = "Plaque Presence", shape = "Plaque Presence") +
  scale_colour_manual(values = c("absent" = "blue", "present" = "red")) +  # Corrected mapping
  theme_minimal()

Question 1.3

#load the car package

library(car)
Loading required package: carData
# logistic regression model
mod.glm <- glm(alzheimers_num ~ plaque * years, data = df1, family = binomial)
# summary of model
summary(mod.glm)

Call:
glm(formula = alzheimers_num ~ plaque * years, family = binomial, 
    data = df1)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.28438    0.41876  -5.455 4.89e-08 ***
plaquepresent        1.14475    0.53283   2.148  0.03168 *  
years                0.09246    0.03142   2.943  0.00325 ** 
plaquepresent:years  0.08998    0.04496   2.001  0.04536 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 548.25  on 399  degrees of freedom
Residual deviance: 422.78  on 396  degrees of freedom
AIC: 430.78

Number of Fisher Scoring iterations: 4
#run anova table for the glm model,anodev table
Anova(mod.glm, test = "LR")
Analysis of Deviance Table (Type II tests)

Response: alzheimers_num
             LR Chisq Df Pr(>Chisq)    
plaque         85.000  1  < 2.2e-16 ***
years          45.432  1   1.58e-11 ***
plaque:years    4.030  1     0.0447 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation 1.3

Describe the exact nature of the model and interpret each row of the ANOVA table.

In this question, I fit a logistic regression model with a binomial response variable. The model will predict the probability that an individual has Alzheimer’s based on the presence or absense of plaque and the years they were observed during the study.

The plaque row of the ANODEV table indicates the df column, which = 1 so there is just 1 predictor variable, plaque with 2 levels: present or absent. The LR Chisq column of this row shows how much the fit of the model improves when we include the plaque variable, which is 85 in this case, so the model is improved when plaque is included. The p-value in this row is < 0.05, which tells us that the presence of plaque has a significant effect on the presence/likelihood of Alzheimer’s.

The years row of the table shows the same df column, which is also = 1. The next column indicates how well the model fits the data, which shows that the model’s fit improves when years is included, but not as much as it improves when plaque is included in the model. The p-value column also indicates that the p value is < 0.5, so the number of years signficantly affects the likelihood of Alzheimer’s.

Question 1.4

plot(mod.glm)

acf(residuals(mod.glm, type = "pearson"))

# pearson residuals
pearson_residuals <- residuals(mod.glm, type = "pearson")
df_residuals <- mod.glm$df.residual
dispersion_parameter <- sum(pearson_residuals^2)/df_residuals 
dispersion_parameter
[1] 1.028507

Interpretation 1.4

The Q-Q residuals plot indicates that the residuals are somewhat normally distributed until the upper right side in which the points diverge from the line of normality.

The ACF plot seems to suggest no autocorrelation because the residuals seem to fluctuate above and below the zero line randomly, while mostly remaining within the 95% confidence interval (the blue lines) other than slightly exceeding the upper limit at a lag = about 14. Since there is no pattern amongst the residuals, this indicates that the model is showing time-dependent effects.

The dispersion parameter = 1.04, which is very close to 1, suggesting that there is no overdispersion, which means the model fits the data pretty well.

Question 1.5

# the odds ratio is comparing the odds of developing Alzheimer's after 20 years with vs. without plaques, using the logistic regression model. 
#odds ratio = Odds_plaques/O_no plaques = [exp (beta0 + beta1 _ 20 beta2)]/exp (beta0 + 20beta2)
#simplified, this is odds ratio = exp(beta1) which represents the increased or decreased odds of getting Alzheimer's for patients with plaque vs. without plaque. 
mod.glm

Call:  glm(formula = alzheimers_num ~ plaque * years, family = binomial, 
    data = df1)

Coefficients:
        (Intercept)        plaquepresent                years  
           -2.28438              1.14475              0.09246  
plaquepresent:years  
            0.08998  

Degrees of Freedom: 399 Total (i.e. Null);  396 Residual
Null Deviance:      548.3 
Residual Deviance: 422.8    AIC: 430.8
odds_present <- predict(mod.glm, newdata = data.frame(years = 20, plaque = "present"), type = "response")
exp(predict(mod.glm, newdata = data.frame(years = 20, plaque = "present")))/(exp(predict(mod.glm, newdata = data.frame(years = 20, plaque = "present")))+ 1) # checked 
        1 
0.9247836 
odds_absent <- predict(mod.glm, newdata = data.frame(years = 20, plaque = "absent"), type = "response")

odds_ratio<- odds_present/odds_absent

odds_ratio
       1 
2.353779 
##2.3x more likely to get alzheimers if you already have plaques

Question 1.6

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ dplyr::recode() masks car::recode()
✖ purrr::some()   masks car::some()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- df1 %>% select(alzheimers_num, plaque, years)

my_glm_fun <- function(data, form, niters = 1000, tol = 1e-6, phi = 1, a = 1) {
  # Prepare model components
  mf <- model.frame(formula = form, data = data)
  y <- model.response(mf)
  X <- model.matrix(attr(mf, "terms"), data = mf)
  
  n <- nrow(X)
  p <- ncol(X)
  beta <- rep(0, p)
  betas <- matrix(NA, nrow = niters, ncol = p)
  colnames(betas) <- colnames(X)
  
  # Define link and inverse link for logistic regression
  linkinv <- function(eta) 1 / (1 + exp(-eta))
  mu_eta <- function(eta) linkinv(eta) * (1 - linkinv(eta))
  var_fun <- function(mu) mu * (1 - mu)  # binomial variance
  
  for (i in 1:niters) {
    eta <- X %*% beta
    mu <- linkinv(eta)
    V <- var_fun(mu)
    
    # Calculate weights (W inverse)
    W_inv_diag <- (phi * V) / (a^2)
    W_inv <- diag(as.vector(W_inv_diag))
    
    # Calculate adjusted response Y'
    Y_prime <- (a * (y - mu)) / (phi * V)
    
    # IRLS update
    Xt_Winv <- t(X) %*% solve(W_inv)
    beta_new <- solve(Xt_Winv %*% X) %*% Xt_Winv %*% Y_prime
    
    betas[i, ] <- beta_new
    
    # Check for convergence
    if (sum(abs(beta_new - beta)) < tol) {
      beta <- beta_new
      break
    }
    
    beta <- beta_new
  }
  
  betas <- betas[1:i, , drop = FALSE]
  
  # Final estimates and standard errors
  eta <- X %*% beta
  mu <- linkinv(eta)
  V <- var_fun(mu)
  W_inv_diag <- (phi * V) / (a^2)
  W_inv <- diag(as.vector(W_inv_diag))
  
  Xt_WinvX_inv <- solve(t(X) %*% solve(W_inv) %*% X)
  se <- sqrt(phi * diag(Xt_WinvX_inv))
  z_val <- beta / se
  p_val <- 2 * pnorm(-abs(z_val))
  
  tab <- cbind(Estimate = beta, `Std. Error` = se,
               `z value` = z_val, `Pr(>|z|)` = p_val)
  rownames(tab) <- colnames(X)
  
  return(list(trace = betas, iter = i, tab = tab))
}

mod.glm <- glm(alzheimers_num ~ plaque * years, data = data, family = binomial)
summary(mod.glm)

Call:
glm(formula = alzheimers_num ~ plaque * years, family = binomial, 
    data = data)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.28438    0.41876  -5.455 4.89e-08 ***
plaquepresent        1.14475    0.53283   2.148  0.03168 *  
years                0.09246    0.03142   2.943  0.00325 ** 
plaquepresent:years  0.08998    0.04496   2.001  0.04536 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 548.25  on 399  degrees of freedom
Residual deviance: 422.78  on 396  degrees of freedom
AIC: 430.78

Number of Fisher Scoring iterations: 4
my_output <- my_glm_fun(data, alzheimers_num ~ plaque * years)
print(my_output$tab)
                                 Std. Error                        
(Intercept)         -0.98575028 0.067901427 -14.517372 9.404727e-48
plaquepresent        0.49726258 0.098866801   5.029621 4.914493e-07
years                0.03725549 0.005815955   6.405738 1.496434e-10
plaquepresent:years  0.03995492 0.008213742   4.864398 1.148055e-06

Task 2

Question 2.1

df2 <- read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn2_prob2.csv")
 
df2
   generation.time relative.fitness
1          87.7012        0.0676207
2         527.3160        0.3465690
3        1058.9100        0.4649690
4        1501.2100        0.6002560
5        1945.2600        0.6425840
6        2476.3700        0.7863370
7        3010.9700        0.7441730
8        3543.3500        0.8203200
9        3898.5900        0.8541830
10       4431.9200        0.8796250
11       5055.3300        0.8459280
12       6033.1800        0.8883460
13       7099.2100        0.9730340
14       8076.1100        1.0661600
15       9053.6500        1.1254800
16      10033.1000        1.0833900
17      10923.7000        1.0328400
18      11902.2000        1.0414500
19      12968.5000        1.1092400
20      14124.8000        1.1263300
21      15012.9000        1.2109900
22      15991.5000        1.2111600
23      17059.6000        1.1859900
24      18123.0000        1.4143400
25      19104.6000        1.2539400
26      19993.7000        1.2878900
27      22129.2000        1.2713500
28      24085.5000        1.3223800
29      25952.6000        1.3903100
30      28088.9000        1.3315100
31      29956.5000        1.3740800
32      32000.3000        1.5011900
33      33958.3000        1.4677200
34      36003.9000        1.5018600
35      38047.9000        1.6205200
36      40093.2000        1.6715700
37      41965.8000        1.4437200
38      44099.2000        1.5454900
39      46058.4000        1.4444100
40      47924.0000        1.5883900
41      49970.6000        1.5718300
names(df2) <- c("time", "fitness")  # rename
df2
         time   fitness
1     87.7012 0.0676207
2    527.3160 0.3465690
3   1058.9100 0.4649690
4   1501.2100 0.6002560
5   1945.2600 0.6425840
6   2476.3700 0.7863370
7   3010.9700 0.7441730
8   3543.3500 0.8203200
9   3898.5900 0.8541830
10  4431.9200 0.8796250
11  5055.3300 0.8459280
12  6033.1800 0.8883460
13  7099.2100 0.9730340
14  8076.1100 1.0661600
15  9053.6500 1.1254800
16 10033.1000 1.0833900
17 10923.7000 1.0328400
18 11902.2000 1.0414500
19 12968.5000 1.1092400
20 14124.8000 1.1263300
21 15012.9000 1.2109900
22 15991.5000 1.2111600
23 17059.6000 1.1859900
24 18123.0000 1.4143400
25 19104.6000 1.2539400
26 19993.7000 1.2878900
27 22129.2000 1.2713500
28 24085.5000 1.3223800
29 25952.6000 1.3903100
30 28088.9000 1.3315100
31 29956.5000 1.3740800
32 32000.3000 1.5011900
33 33958.3000 1.4677200
34 36003.9000 1.5018600
35 38047.9000 1.6205200
36 40093.2000 1.6715700
37 41965.8000 1.4437200
38 44099.2000 1.5454900
39 46058.4000 1.4444100
40 47924.0000 1.5883900
41 49970.6000 1.5718300
head(df2)
       time   fitness
1   87.7012 0.0676207
2  527.3160 0.3465690
3 1058.9100 0.4649690
4 1501.2100 0.6002560
5 1945.2600 0.6425840
6 2476.3700 0.7863370
# Load minpack.lm package
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("minpack.lm")

The downloaded binary packages are in
    /var/folders/h4/wb2_p5v15cv6lbm_fkdwbghw0000gn/T//RtmpkVIUsv/downloaded_packages
library(minpack.lm)

# Initial parameter values
start_vals <- list(a = 0.01, b = 0.1)

# Linear model: y = a + b * time
mod_linear <- lm(fitness ~ time, data = df2)

# Power-law model: y = a * time^b
mod_power <- nlsLM(fitness ~ a * time^b, data = df2, start = start_vals)

# Hyperbolic model: y = (b * time) / (a + time)
mod_hyperbolic <- nlsLM(fitness ~ (b * time) / (a + time), data = df2, start = start_vals)

Question 2.2

# setting up data for plotting
df2$fit_linear <- predict(mod_linear)
df2$fit_power <- predict(mod_power)
df2$fit_hyperbolic <- predict(mod_hyperbolic)

# plotting the data
plot(df2$time, df2$fitness, pch = 16, col = "black", xlab = "Time (generations)",
     ylab = "Relative Fitness", main = "Model Fits to Lenski's E. coli Data")

lines(df2$time, df2$fit_linear, col = "blue", lwd = 2)
lines(df2$time, df2$fit_power, col = "red", lwd = 2)
lines(df2$time, df2$fit_hyperbolic, col = "green", lwd = 2)

legend("bottomright", legend = c("Linear", "Power", "Hyperbolic"),
       col = c("blue", "red", "green"), lwd = 2, bty = "n")

#compare the models
AIC(mod_linear, mod_power, mod_hyperbolic)
               df       AIC
mod_linear      3 -20.13691
mod_power       3 -91.11292
mod_hyperbolic  3 -69.64347

Interpretation 2.2 It looks like the power law model fits the data best since it has the lowest AIC value among the models. Since this model has a low AIC value, it indicates that the power law model fits the data well without overfitting. In terms of the biological explanation, it allows for fitness to increase more rapidly when adaptation is easier (early in the model), and allows fitness to keep increasing, just at a slower rate.

Question 2.3

# getting R^2 values to compare the models

# store models in a list

models <- list(
  linear = mod_linear,
  power = mod_power,
  hyperbolic = mod_hyperbolic
)
# get R^2 value

r_squared <- lapply(models, function(model) {
  pred <- predict(model)
  cor(df2$fitness, pred)^2
})

r_squared
$linear
[1] 0.7672747

$power
[1] 0.9597081

$hyperbolic
[1] 0.9360057

Interpretation 2.3

The R^2 values align with the previous AIC results because the power law model has the R^2 value closest to 1, at 0.96. The other models produced slightly lower R^2 values ( linear = 0.77, and hyperbolic = 0.94). This indicates that the power model has a strong correlation between predicted and actual fitness values. The other models, mainly the hyperbolic model still had a decent fit but was not as strong of a correlation as the power law model. The linear model had the lowest R^2 value and the lowest correlation between the predicted actual fitness values.

Question 2.4

my_aicc <- function(mod_list, model_names, n, k) {
  #compute aic values from model list
  aic_vals <- sapply(mod_list, AIC)
  #compute AICc: AIC + (2k(k+1))/(n-k-1)
  aicc <- aic_vals + (2*k*(k+1))/ (n - k - 1)
  #compute delta AICs: difference from best model
  delta_aicc <- aicc - min(aicc)
  #compute aic weights
  rel_likelihoods <- exp(-0.5 * delta_aicc)
  weights_aicc <- rel_likelihoods / sum(rel_likelihoods)
  
  #return data as data frame
  return(data.frame(Model = model_names,
                    AICc = round(aicc, 4),
                    Delta_AICc = round(delta_aicc, 4),
                    AICc_weights = round(weights_aicc, 4)))
}

mod_list <- list(mod_linear, mod_power, mod_hyperbolic)
model_names <- c("Linear", "Power Law", "Hyperbolic")
n <- nrow(df2)
k <- c(2,2,2) #number of parameters in each model
my_aicc(mod_list, model_names, n, k)
       Model     AICc Delta_AICc AICc_weights
1     Linear -19.8211    70.9760            0
2  Power Law -90.7971     0.0000            1
3 Hyperbolic -69.3277    21.4694            0

Interpretation 2.4 The power law remains the best fit based on it having the lowest value for AICc. Moving to the next column, the delta AICc indicates how much worse the other models are, compared to the best, the power law model, which tells us that the linear model is a poor fit compared to the other models, and the hyperbolic model is not as poor as the linear model but is still not as a strong as the power law model. The AICc weight for the power law is at 1 (100%) which means it has the highest chance of being the best model for these data. The hyperbolic and linear models are less than 1, hovering around 0 indicating much less support for these models being the best fit for these data.