Advanced Biostatistics | Assignment 1

Instructions

Respond to the questions by inserting your code and any required interpretation into this assignment template file. Submit your knitted HTML file on Canvas.

Task 1

Question 1.1

install.packages("ggplot2", repos = "https://cloud.r-project.org/")


The downloaded binary packages are in
    /var/folders/ml/3wlw38rj1yg4whh3kfh97xdw0000gp/T//RtmpYtsxkR/downloaded_packages

library(ggplot2)
df1 <- read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn1_prob1.csv")
ggplot(df1, aes(x = dist, y = corr, color = species)) +
  geom_line() +         
  geom_point() +        
  labs(x = "Distance", 
       y = "Synchrony (Correlation)", 
       title = "Synchrony vs. Distance for Each Species",
       color = "Species") +  
  theme_minimal() +     
  theme(legend.position = "right")

Question 1.2

mod.raw <- lm(corr ~ dist * species, data = df1)  
anova_result <- anova(mod.raw)  
print(anova_result)

Analysis of Variance Table

Response: corr
             Df  Sum Sq Mean Sq F value    Pr(>F)    
dist          1 1.79586 1.79586  87.595 3.543e-15 ***
species       1 0.23284 0.23284  11.357  0.001084 ** 
dist:species  1 0.63480 0.63480  30.963 2.376e-07 ***
Residuals    96 1.96817 0.02050                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow = c(2, 2)) 
plot(mod.raw)

Interpretation 1.2 - dist: F value is high, p-value is =<0.001 (highly significant) <- dist has strong effect on synchrony - species: f value above 10, but not super high, p-value is =<0.001. species variation has moderate effect on synchrony - dist:species = interaction. f value is high, p-value =<0.001. effect of distance on synchrony = not equal for all species - residuals: mean sq = unknown variation <- lower than dist/species/dist:species means that the model captures a lot of synchrony variation

Question 1.3

Interpretation 1.3 - Residuals v Fitted: red line is curved <-potential linearity violation? - residuals fan out, indicating heteroscedasticity - Outliers may be impacting fit (24, 14, 51 i’m lookin at you) - QQ Residuals: most points fall along diagnal line, mostly normal! - again with the outliers!! indicates potential heavy tails. - Scale-Location: red line NOT flat <- variance increases slightly for higher fitted values? 51! you’re crazy. - Residuals v Leverage: point 51 has high tandardized residuals AND leverage <- strong influence on mode. - overall: potential non-linearity, heteroscedascticity, influential point 51.

Question 1.4

#df1$corr_trans <- ifelse(df1$corr > 0, log(df1$corr), NA)
# Fit new model
mod.trans <- lm(log(corr +1) ~ dist * species, data = df1)
anova_trans <- anova(mod.trans)
print(anova_trans)

Analysis of Variance Table

Response: log(corr + 1)
             Df  Sum Sq Mean Sq F value    Pr(>F)    
dist          1 0.91938 0.91938  87.624 3.515e-15 ***
species       1 0.10938 0.10938  10.425  0.001702 ** 
dist:species  1 0.32895 0.32895  31.351 2.043e-07 ***
Residuals    96 1.00725 0.01049                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow = c(2, 2))
plot(mod.trans)

#CHANGE this to one line, can simplify based on other code

Interpretation 1.4 - ANOVA - point 51 is no longer as influential of an outlier (xcept in R v L) - the differences in p-values were amplified <- species is no longer significant - residuals mean sq increased - PLOTs - RvF: more linear 51 not an issue (but now 28, 24, 84 DEFINITELY are) - QQ: less influenced by 51, now more influenced by 28, 24, 84? - S-L: line of best fit = flatter, but still heteroscedastic - RvL: 51 has significantly less leverage, 24 and 84 pulling things down

Question 1.5

Interpretation 1.5 Autocorrection impacts: - Coefficient Estimates - if synchrony is spatially/temporally dependent (ex: dist), the effect may be under or over estimated depending on autocorrect structure - interaction between dist:spec may be misleading <- confusion between shared envir conditionvs vs. species differences - Standard Errors - underestimated due to assumption of each variable being independent - overestimation of precision of estimates, making them seem more reliable than they are ^^^ - overly narrow confidence intervals due to overestimation of estimate precision - P-values - standard errors UNDERestimated <- p-values will be TOO small - could cause wrong rejection of null hypothesis whe its true - ex: species became non-significant in transformed model: this may be a mis-calculation

Question 1.6

perm_glm <- function(y, x, nperms = 999) {
  #model base
  mod.base <- lm(y ~ dist*species,data=x)
  observed_fvals <- anova(mod.base)$ 'F value'[1:3]
  #vector for storing fvals from permutations
  perm_fvals <- matrix(NA, nrow=nperms+1, ncol=length(observed_fvals))
  
  #monte carlo simulations
  for (i in 1:nperms) {
    y_perm <- sample(y)  # randomly shuffle response variable
    x_perm <- as.data.frame(apply(x,2,sample))
    mod_perm <- lm(y_perm ~ dist*species,data=x_perm)  #fit model
    perm_fvals[i, ] <- anova(mod_perm)$'F value'[1:3] 
  }
  # Compute p-values as the proportion of permuted F-values greater than or equal to observed F-values
  perm_fvals[nperms+1, ] <- observed_fvals
  pval <- apply(perm_fvals, 2, function(x) mean(x >= x[nperms+1])) 
  names(pval) <- names(observed_fvals)
  
  return(pval)
}

Interpretation 1.6 - ANOVA - point 51 is no longer as influential of an outlier (xcept in R v L) - the differences in p-values were amplified <- species is no longer significant - residuals mean sq increased - PLOTs - RvF: more linear 51 not an issue (but now 28, 24, 84 DEFINITELY are) - QQ: less influenced by 51, now more influenced by 28, 24, 84? - S-L: line of best fit = flatter, but still heteroscedastic - RvL: 51 has significantly less leverage, 24 and 84 pulling things down

Question 1.7

y <- df1$corr
x <- subset(df1, select=c('dist','species'))
p_values_mc <- perm_glm(y, x, nperms = 999)
print(p_values_mc)

[1] 0.001 0.007 0.001

Task 2

Question 2.1

df2 <- read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn1_prob2.csv")
#identify response and predictor variables
y_2 <- df2$wbd_incidence  # Response variable
X_2 <- as.matrix(df2[, -which(names(df2) == "wbd_incidence")])  # Extract microbial abundance and convert to matrix
#fit multiple regression model
mod_wbd <- lm(y_2 ~ X_2)
#check model summary
summary(mod_wbd)


Call:
lm(formula = y_2 ~ X_2)

Residuals:
ALL 20 residuals are 0: no residual degrees of freedom!

Coefficients: (11 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        -93.65355        NaN     NaN      NaN
X_2microbe_abund1   -0.08226        NaN     NaN      NaN
X_2microbe_abund2    1.69854        NaN     NaN      NaN
X_2microbe_abund3    0.25740        NaN     NaN      NaN
X_2microbe_abund4    0.25370        NaN     NaN      NaN
X_2microbe_abund5  -19.24956        NaN     NaN      NaN
X_2microbe_abund6    1.31754        NaN     NaN      NaN
X_2microbe_abund7    1.32377        NaN     NaN      NaN
X_2microbe_abund8   19.91991        NaN     NaN      NaN
X_2microbe_abund9    0.66329        NaN     NaN      NaN
X_2microbe_abund10   0.30254        NaN     NaN      NaN
X_2microbe_abund11  -1.43707        NaN     NaN      NaN
X_2microbe_abund12   1.16239        NaN     NaN      NaN
X_2microbe_abund13  -1.11403        NaN     NaN      NaN
X_2microbe_abund14  -0.98747        NaN     NaN      NaN
X_2microbe_abund15  -0.28218        NaN     NaN      NaN
X_2microbe_abund16  -0.57312        NaN     NaN      NaN
X_2microbe_abund17  -0.44631        NaN     NaN      NaN
X_2microbe_abund18  -0.18795        NaN     NaN      NaN
X_2microbe_abund19   0.19997        NaN     NaN      NaN
X_2microbe_abund20        NA         NA      NA       NA
X_2microbe_abund21        NA         NA      NA       NA
X_2microbe_abund22        NA         NA      NA       NA
X_2microbe_abund23        NA         NA      NA       NA
X_2microbe_abund24        NA         NA      NA       NA
X_2microbe_abund25        NA         NA      NA       NA
X_2microbe_abund26        NA         NA      NA       NA
X_2microbe_abund27        NA         NA      NA       NA
X_2microbe_abund28        NA         NA      NA       NA
X_2microbe_abund29        NA         NA      NA       NA
X_2microbe_abund30        NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 19 and 0 DF,  p-value: NA

Interpretation 2.1 - the model didn’t really work: - no degrees of freedom suggests the same number or predictors and observations, because there are more variables than observations.!.
- alternatively, there could be extremely high collinearities between predictors “some coefficitients are”NA”

Question 2.2

options(repos = c(CRAN = "https://cloud.r-project.org/"))
library(glmnet)

Loading required package: Matrix

Loaded glmnet 4.1-8

lasso_model <- cv.glmnet(X_2, y_2, alpha = 1, nfolds = 5)  # alpha = 1 for LASSO
# Print best lambda
print(lasso_model$lambda.min)  # Lambda  minimizes cross-validation error

[1] 0.2390713

print(lasso_model$lambda.1se)  # More conservative choice

[1] 0.5785832

plot(lasso_model)  # Visualize cross-validation error vs. lambda

best_lambda <- lasso_model$lambda.min  #lambda.min or lambda.1se
lasso_coefs <- coef(lasso_model, s = best_lambda)  #selected features
print(lasso_coefs)

31 x 1 sparse Matrix of class "dgCMatrix"
                           s1
(Intercept)      0.9812231991
microbe_abund1   .           
microbe_abund2   .           
microbe_abund3   .           
microbe_abund4   .           
microbe_abund5  -0.0004631324
microbe_abund6   0.0022531169
microbe_abund7  -0.0182030651
microbe_abund8   .           
microbe_abund9   .           
microbe_abund10  .           
microbe_abund11  .           
microbe_abund12  .           
microbe_abund13  .           
microbe_abund14  .           
microbe_abund15  .           
microbe_abund16  .           
microbe_abund17  0.0053227539
microbe_abund18  .           
microbe_abund19  .           
microbe_abund20  .           
microbe_abund21  .           
microbe_abund22 -0.0007034883
microbe_abund23  0.9918729383
microbe_abund24  .           
microbe_abund25  .           
microbe_abund26  .           
microbe_abund27  .           
microbe_abund28  .           
microbe_abund29  .           
microbe_abund30  .

Interpretation 2.2 - many variables removed in lasso_coef - the variable with the highest s1 value (23 = 0.99) has the strongest predicted effect on wbd_incidence - positive coefficients: 23, 17, 6 increase likelihood of wbd_incidence - negative coefficients: 5, 7, 22 decrease wbd_incidence - this model works better because it removes insignificant and xtra large coefficients - this indicates that the multiple regression model was overfitted, and a simplier model is better (like LASSO)

Question 2.3

ridge_model <- cv.glmnet(X_2, y_2, alpha = 0, nfolds = 5)  #ridge regression
# Print best lambda
print(ridge_model$lambda.min)  #best lambda minimizing cross-validation error

[1] 207.932

print(ridge_model$lambda.1se)  #conservative lambda choice

[1] 1109.671

# Plot ridge regression cross-validation results
plot(ridge_model)

# Extract coefficients for best lambda
ridge_coefs <- coef(ridge_model, s = ridge_model$lambda.min)
print(ridge_coefs)

31 x 1 sparse Matrix of class "dgCMatrix"
                           s1
(Intercept)     47.9932359643
microbe_abund1  -0.0037851001
microbe_abund2   0.0061945568
microbe_abund3  -0.0004803214
microbe_abund4  -0.0347708536
microbe_abund5  -0.0251174884
microbe_abund6  -0.0035852412
microbe_abund7  -0.0096654090
microbe_abund8  -0.0250056295
microbe_abund9   0.0211742086
microbe_abund10  0.0184975423
microbe_abund11 -0.0045124438
microbe_abund12  0.0156161908
microbe_abund13 -0.0009162257
microbe_abund14 -0.0083157625
microbe_abund15  0.0071137495
microbe_abund16 -0.0105998261
microbe_abund17  0.0019878660
microbe_abund18 -0.0194080462
microbe_abund19  0.0180567708
microbe_abund20  0.0090027585
microbe_abund21  0.0008470935
microbe_abund22 -0.0141577470
microbe_abund23  0.0777229443
microbe_abund24 -0.0158122611
microbe_abund25 -0.0341506831
microbe_abund26  0.0146782722
microbe_abund27  0.0227067313
microbe_abund28 -0.0059243072
microbe_abund29 -0.0095566572
microbe_abund30  0.0216233415

Interpretation 2.3 - retains all variables, but shrunk toward zero - ridge regression assumes all variables impact WBD to some extent - if there is high multicollinearity in the data, RR might distirbute effect across correlated variables instead of “picking” one - so LASSO singled out varibables to deal with collinearity, culling xtras. RR distributed that collinearity, assuming all variables are at least somewhat impactful.

Question 2.4

my_ridge_fun <- function(x2k, y2k, lambda = 0) {
  K <- ncol(x2k)
  N <- nrow(x2k)
  # Scale the explanatory variables
  x2k <- as.matrix(scale(x2k))
  y2k <- as.matrix(y2k)
  # Coefficient estimates
  I_K <- diag(1, K, K) # K x K identity matrix
  # intercept is mean of Y when X standardized
  betas <- c(mean(y2k), solve(t(x2k) %*% x2k + lambda * I_K) %*% t(x2k) %*% y2k)
  betas <- as.numeric(betas)
  names(betas) <- c("Intercept", colnames(x2k))
  return(list(beta = betas))
}
kfold_ridge <- function(y2k, x2k, k = 5, lambda = seq(from = 0.1, to = 100, len = 20)) {
  # num observations
  N <- length(y2k)
  # k folds
  fold_idx <- sample(rep(1:k, length.out = N))
  # vector to store MSE for each lambda
  mse_values <- numeric(length(lambda))
  # Loop over lambda values
  for (i in seq_along(lambda)) {
    mse_fold <- numeric(k)
    #k-fold cross-validation
    for (j in 1:k) {
      test_idx <- which(fold_idx == j)
      train_idx <- setdiff(1:N, test_idx)
      #train data
      x_train <- x2k[train_idx, , drop = FALSE]
      y_train <- y2k[train_idx]
      #test data
      x_test <- x2k[test_idx, , drop = FALSE]
      y_test <- y2k[test_idx]
      # Fit ridge regression model on training data
      ridge_model <- my_ridge_fun(x_train, y_train, lambda = lambda[i])
    
      #Add intercept, make predictions on test data
      y_pred <- cbind(1, scale(x_test)) %*% ridge_model$beta  # Add intercept
      # Compute MSQ for this fold
      mse_fold[j] <- mean((y_test - y_pred)^2)
    }
    # Store mean MSE across all folds for current lambda
    mse_values[i] <- mean(mse_fold)
  }
  # Identify the best lambda (minimizing MSE)
  min_lambda <- lambda[which.min(mse_values)]
  return(list(mse = mse_values, lambda = lambda, min_lambda = min_lambda))
}

# Run k-fold cross-validation for Ridge regression
ridge_results <- kfold_ridge( y2k = df2$wbd_incidence, x2k = df2[, -1], k = 5, lambda = seq(0.1, 100, length.out = 20))
# Print results
print(ridge_results$min_lambda)  # Best lambda

[1] 0.1

print(ridge_results$mse)  # MSE values for each lambda

 [1] 227.8739 264.7185 286.9287 303.6250 317.0925 328.3650 338.0262 346.4463
 [9] 353.8787 360.5059 366.4638 371.8573 376.7687 381.2641 385.3971 389.2122
[17] 392.7465 396.0312 399.0928 401.9541

# Optional: Plot lambda vs. MSE to visualize selection
plot(ridge_results$lambda, ridge_results$mse, type = "b",
     xlab = "Lambda", ylab = "Mean Squared Error",
     main = "Ridge Regression Cross-Validation")