Advanced Biostatistics | Assignment 1

Author
Affiliation

Maya Weil-Cooley

Northeastern University

Published

February 27, 2025

Instructions

Respond to the questions by inserting your code and any required interpretation into this assignment template file. Submit your knitted HTML file on Canvas.

Task 1

Question 1.1

install.packages("ggplot2", repos = "https://cloud.r-project.org/")

The downloaded binary packages are in
    /var/folders/ml/3wlw38rj1yg4whh3kfh97xdw0000gp/T//RtmpYtsxkR/downloaded_packages
library(ggplot2)
df1 <- read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn1_prob1.csv")
ggplot(df1, aes(x = dist, y = corr, color = species)) +
  geom_line() +         
  geom_point() +        
  labs(x = "Distance", 
       y = "Synchrony (Correlation)", 
       title = "Synchrony vs. Distance for Each Species",
       color = "Species") +  
  theme_minimal() +     
  theme(legend.position = "right")  

Question 1.2

mod.raw <- lm(corr ~ dist * species, data = df1)  
anova_result <- anova(mod.raw)  
print(anova_result) 
Analysis of Variance Table

Response: corr
             Df  Sum Sq Mean Sq F value    Pr(>F)    
dist          1 1.79586 1.79586  87.595 3.543e-15 ***
species       1 0.23284 0.23284  11.357  0.001084 ** 
dist:species  1 0.63480 0.63480  30.963 2.376e-07 ***
Residuals    96 1.96817 0.02050                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2)) 
plot(mod.raw)  

Interpretation 1.2 - dist: F value is high, p-value is =<0.001 (highly significant) <- dist has strong effect on synchrony - species: f value above 10, but not super high, p-value is =<0.001. species variation has moderate effect on synchrony - dist:species = interaction. f value is high, p-value =<0.001. effect of distance on synchrony = not equal for all species - residuals: mean sq = unknown variation <- lower than dist/species/dist:species means that the model captures a lot of synchrony variation

Question 1.3

Interpretation 1.3 - Residuals v Fitted: red line is curved <-potential linearity violation? - residuals fan out, indicating heteroscedasticity - Outliers may be impacting fit (24, 14, 51 i’m lookin at you) - QQ Residuals: most points fall along diagnal line, mostly normal! - again with the outliers!! indicates potential heavy tails. - Scale-Location: red line NOT flat <- variance increases slightly for higher fitted values? 51! you’re crazy. - Residuals v Leverage: point 51 has high tandardized residuals AND leverage <- strong influence on mode. - overall: potential non-linearity, heteroscedascticity, influential point 51.

Question 1.4

#df1$corr_trans <- ifelse(df1$corr > 0, log(df1$corr), NA)
# Fit new model
mod.trans <- lm(log(corr +1) ~ dist * species, data = df1)
anova_trans <- anova(mod.trans)
print(anova_trans)
Analysis of Variance Table

Response: log(corr + 1)
             Df  Sum Sq Mean Sq F value    Pr(>F)    
dist          1 0.91938 0.91938  87.624 3.515e-15 ***
species       1 0.10938 0.10938  10.425  0.001702 ** 
dist:species  1 0.32895 0.32895  31.351 2.043e-07 ***
Residuals    96 1.00725 0.01049                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2))
plot(mod.trans) 

#CHANGE this to one line, can simplify based on other code

Interpretation 1.4 - ANOVA - point 51 is no longer as influential of an outlier (xcept in R v L) - the differences in p-values were amplified <- species is no longer significant - residuals mean sq increased - PLOTs - RvF: more linear 51 not an issue (but now 28, 24, 84 DEFINITELY are) - QQ: less influenced by 51, now more influenced by 28, 24, 84? - S-L: line of best fit = flatter, but still heteroscedastic - RvL: 51 has significantly less leverage, 24 and 84 pulling things down

Question 1.5

Interpretation 1.5 Autocorrection impacts: - Coefficient Estimates - if synchrony is spatially/temporally dependent (ex: dist), the effect may be under or over estimated depending on autocorrect structure - interaction between dist:spec may be misleading <- confusion between shared envir conditionvs vs. species differences - Standard Errors - underestimated due to assumption of each variable being independent - overestimation of precision of estimates, making them seem more reliable than they are ^^^ - overly narrow confidence intervals due to overestimation of estimate precision - P-values - standard errors UNDERestimated <- p-values will be TOO small - could cause wrong rejection of null hypothesis whe its true - ex: species became non-significant in transformed model: this may be a mis-calculation

Question 1.6

perm_glm <- function(y, x, nperms = 999) {
  #model base
  mod.base <- lm(y ~ dist*species,data=x)
  observed_fvals <- anova(mod.base)$ 'F value'[1:3]
  #vector for storing fvals from permutations
  perm_fvals <- matrix(NA, nrow=nperms+1, ncol=length(observed_fvals))
  
  #monte carlo simulations
  for (i in 1:nperms) {
    y_perm <- sample(y)  # randomly shuffle response variable
    x_perm <- as.data.frame(apply(x,2,sample))
    mod_perm <- lm(y_perm ~ dist*species,data=x_perm)  #fit model
    perm_fvals[i, ] <- anova(mod_perm)$'F value'[1:3] 
  }
  # Compute p-values as the proportion of permuted F-values greater than or equal to observed F-values
  perm_fvals[nperms+1, ] <- observed_fvals
  pval <- apply(perm_fvals, 2, function(x) mean(x >= x[nperms+1])) 
  names(pval) <- names(observed_fvals)
  
  return(pval)
}

Interpretation 1.6 - ANOVA - point 51 is no longer as influential of an outlier (xcept in R v L) - the differences in p-values were amplified <- species is no longer significant - residuals mean sq increased - PLOTs - RvF: more linear 51 not an issue (but now 28, 24, 84 DEFINITELY are) - QQ: less influenced by 51, now more influenced by 28, 24, 84? - S-L: line of best fit = flatter, but still heteroscedastic - RvL: 51 has significantly less leverage, 24 and 84 pulling things down

Question 1.7

y <- df1$corr
x <- subset(df1, select=c('dist','species'))
p_values_mc <- perm_glm(y, x, nperms = 999)
print(p_values_mc)
[1] 0.001 0.007 0.001

Task 2

Question 2.1

df2 <- read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn1_prob2.csv")
#identify response and predictor variables
y_2 <- df2$wbd_incidence  # Response variable
X_2 <- as.matrix(df2[, -which(names(df2) == "wbd_incidence")])  # Extract microbial abundance and convert to matrix
#fit multiple regression model
mod_wbd <- lm(y_2 ~ X_2)
#check model summary
summary(mod_wbd)

Call:
lm(formula = y_2 ~ X_2)

Residuals:
ALL 20 residuals are 0: no residual degrees of freedom!

Coefficients: (11 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        -93.65355        NaN     NaN      NaN
X_2microbe_abund1   -0.08226        NaN     NaN      NaN
X_2microbe_abund2    1.69854        NaN     NaN      NaN
X_2microbe_abund3    0.25740        NaN     NaN      NaN
X_2microbe_abund4    0.25370        NaN     NaN      NaN
X_2microbe_abund5  -19.24956        NaN     NaN      NaN
X_2microbe_abund6    1.31754        NaN     NaN      NaN
X_2microbe_abund7    1.32377        NaN     NaN      NaN
X_2microbe_abund8   19.91991        NaN     NaN      NaN
X_2microbe_abund9    0.66329        NaN     NaN      NaN
X_2microbe_abund10   0.30254        NaN     NaN      NaN
X_2microbe_abund11  -1.43707        NaN     NaN      NaN
X_2microbe_abund12   1.16239        NaN     NaN      NaN
X_2microbe_abund13  -1.11403        NaN     NaN      NaN
X_2microbe_abund14  -0.98747        NaN     NaN      NaN
X_2microbe_abund15  -0.28218        NaN     NaN      NaN
X_2microbe_abund16  -0.57312        NaN     NaN      NaN
X_2microbe_abund17  -0.44631        NaN     NaN      NaN
X_2microbe_abund18  -0.18795        NaN     NaN      NaN
X_2microbe_abund19   0.19997        NaN     NaN      NaN
X_2microbe_abund20        NA         NA      NA       NA
X_2microbe_abund21        NA         NA      NA       NA
X_2microbe_abund22        NA         NA      NA       NA
X_2microbe_abund23        NA         NA      NA       NA
X_2microbe_abund24        NA         NA      NA       NA
X_2microbe_abund25        NA         NA      NA       NA
X_2microbe_abund26        NA         NA      NA       NA
X_2microbe_abund27        NA         NA      NA       NA
X_2microbe_abund28        NA         NA      NA       NA
X_2microbe_abund29        NA         NA      NA       NA
X_2microbe_abund30        NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 19 and 0 DF,  p-value: NA

Interpretation 2.1 - the model didn’t really work: - no degrees of freedom suggests the same number or predictors and observations, because there are more variables than observations.!.
- alternatively, there could be extremely high collinearities between predictors “some coefficitients are”NA”

Question 2.2

options(repos = c(CRAN = "https://cloud.r-project.org/"))
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
lasso_model <- cv.glmnet(X_2, y_2, alpha = 1, nfolds = 5)  # alpha = 1 for LASSO
# Print best lambda
print(lasso_model$lambda.min)  # Lambda  minimizes cross-validation error
[1] 0.2390713
print(lasso_model$lambda.1se)  # More conservative choice
[1] 0.5785832
plot(lasso_model)  # Visualize cross-validation error vs. lambda

best_lambda <- lasso_model$lambda.min  #lambda.min or lambda.1se
lasso_coefs <- coef(lasso_model, s = best_lambda)  #selected features
print(lasso_coefs)
31 x 1 sparse Matrix of class "dgCMatrix"
                           s1
(Intercept)      0.9812231991
microbe_abund1   .           
microbe_abund2   .           
microbe_abund3   .           
microbe_abund4   .           
microbe_abund5  -0.0004631324
microbe_abund6   0.0022531169
microbe_abund7  -0.0182030651
microbe_abund8   .           
microbe_abund9   .           
microbe_abund10  .           
microbe_abund11  .           
microbe_abund12  .           
microbe_abund13  .           
microbe_abund14  .           
microbe_abund15  .           
microbe_abund16  .           
microbe_abund17  0.0053227539
microbe_abund18  .           
microbe_abund19  .           
microbe_abund20  .           
microbe_abund21  .           
microbe_abund22 -0.0007034883
microbe_abund23  0.9918729383
microbe_abund24  .           
microbe_abund25  .           
microbe_abund26  .           
microbe_abund27  .           
microbe_abund28  .           
microbe_abund29  .           
microbe_abund30  .           

Interpretation 2.2 - many variables removed in lasso_coef - the variable with the highest s1 value (23 = 0.99) has the strongest predicted effect on wbd_incidence - positive coefficients: 23, 17, 6 increase likelihood of wbd_incidence - negative coefficients: 5, 7, 22 decrease wbd_incidence - this model works better because it removes insignificant and xtra large coefficients - this indicates that the multiple regression model was overfitted, and a simplier model is better (like LASSO)

Question 2.3

ridge_model <- cv.glmnet(X_2, y_2, alpha = 0, nfolds = 5)  #ridge regression
# Print best lambda
print(ridge_model$lambda.min)  #best lambda minimizing cross-validation error
[1] 207.932
print(ridge_model$lambda.1se)  #conservative lambda choice
[1] 1109.671
# Plot ridge regression cross-validation results
plot(ridge_model)

# Extract coefficients for best lambda
ridge_coefs <- coef(ridge_model, s = ridge_model$lambda.min)
print(ridge_coefs)
31 x 1 sparse Matrix of class "dgCMatrix"
                           s1
(Intercept)     47.9932359643
microbe_abund1  -0.0037851001
microbe_abund2   0.0061945568
microbe_abund3  -0.0004803214
microbe_abund4  -0.0347708536
microbe_abund5  -0.0251174884
microbe_abund6  -0.0035852412
microbe_abund7  -0.0096654090
microbe_abund8  -0.0250056295
microbe_abund9   0.0211742086
microbe_abund10  0.0184975423
microbe_abund11 -0.0045124438
microbe_abund12  0.0156161908
microbe_abund13 -0.0009162257
microbe_abund14 -0.0083157625
microbe_abund15  0.0071137495
microbe_abund16 -0.0105998261
microbe_abund17  0.0019878660
microbe_abund18 -0.0194080462
microbe_abund19  0.0180567708
microbe_abund20  0.0090027585
microbe_abund21  0.0008470935
microbe_abund22 -0.0141577470
microbe_abund23  0.0777229443
microbe_abund24 -0.0158122611
microbe_abund25 -0.0341506831
microbe_abund26  0.0146782722
microbe_abund27  0.0227067313
microbe_abund28 -0.0059243072
microbe_abund29 -0.0095566572
microbe_abund30  0.0216233415

Interpretation 2.3 - retains all variables, but shrunk toward zero - ridge regression assumes all variables impact WBD to some extent - if there is high multicollinearity in the data, RR might distirbute effect across correlated variables instead of “picking” one - so LASSO singled out varibables to deal with collinearity, culling xtras. RR distributed that collinearity, assuming all variables are at least somewhat impactful.

Question 2.4

my_ridge_fun <- function(x2k, y2k, lambda = 0) {
  K <- ncol(x2k)
  N <- nrow(x2k)
  # Scale the explanatory variables
  x2k <- as.matrix(scale(x2k))
  y2k <- as.matrix(y2k)
  # Coefficient estimates
  I_K <- diag(1, K, K) # K x K identity matrix
  # intercept is mean of Y when X standardized
  betas <- c(mean(y2k), solve(t(x2k) %*% x2k + lambda * I_K) %*% t(x2k) %*% y2k)
  betas <- as.numeric(betas)
  names(betas) <- c("Intercept", colnames(x2k))
  return(list(beta = betas))
}
kfold_ridge <- function(y2k, x2k, k = 5, lambda = seq(from = 0.1, to = 100, len = 20)) {
  # num observations
  N <- length(y2k)
  # k folds
  fold_idx <- sample(rep(1:k, length.out = N))
  # vector to store MSE for each lambda
  mse_values <- numeric(length(lambda))
  # Loop over lambda values
  for (i in seq_along(lambda)) {
    mse_fold <- numeric(k)
    #k-fold cross-validation
    for (j in 1:k) {
      test_idx <- which(fold_idx == j)
      train_idx <- setdiff(1:N, test_idx)
      #train data
      x_train <- x2k[train_idx, , drop = FALSE]
      y_train <- y2k[train_idx]
      #test data
      x_test <- x2k[test_idx, , drop = FALSE]
      y_test <- y2k[test_idx]
      # Fit ridge regression model on training data
      ridge_model <- my_ridge_fun(x_train, y_train, lambda = lambda[i])
    
      #Add intercept, make predictions on test data
      y_pred <- cbind(1, scale(x_test)) %*% ridge_model$beta  # Add intercept
      # Compute MSQ for this fold
      mse_fold[j] <- mean((y_test - y_pred)^2)
    }
    # Store mean MSE across all folds for current lambda
    mse_values[i] <- mean(mse_fold)
  }
  # Identify the best lambda (minimizing MSE)
  min_lambda <- lambda[which.min(mse_values)]
  return(list(mse = mse_values, lambda = lambda, min_lambda = min_lambda))
}

# Run k-fold cross-validation for Ridge regression
ridge_results <- kfold_ridge( y2k = df2$wbd_incidence, x2k = df2[, -1], k = 5, lambda = seq(0.1, 100, length.out = 20))
# Print results
print(ridge_results$min_lambda)  # Best lambda
[1] 0.1
print(ridge_results$mse)  # MSE values for each lambda
 [1] 227.8739 264.7185 286.9287 303.6250 317.0925 328.3650 338.0262 346.4463
 [9] 353.8787 360.5059 366.4638 371.8573 376.7687 381.2641 385.3971 389.2122
[17] 392.7465 396.0312 399.0928 401.9541
# Optional: Plot lambda vs. MSE to visualize selection
plot(ridge_results$lambda, ridge_results$mse, type = "b",
     xlab = "Lambda", ylab = "Mean Squared Error",
     main = "Ridge Regression Cross-Validation")