Respond to the questions by inserting your code and any required interpretation into this assignment template file. Submit your knitted HTML file on Canvas.
The downloaded binary packages are in
/var/folders/ml/3wlw38rj1yg4whh3kfh97xdw0000gp/T//RtmpYtsxkR/downloaded_packages
library(ggplot2)df1 <-read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn1_prob1.csv")ggplot(df1, aes(x = dist, y = corr, color = species)) +geom_line() +geom_point() +labs(x ="Distance", y ="Synchrony (Correlation)", title ="Synchrony vs. Distance for Each Species",color ="Species") +theme_minimal() +theme(legend.position ="right")
Analysis of Variance Table
Response: corr
Df Sum Sq Mean Sq F value Pr(>F)
dist 1 1.79586 1.79586 87.595 3.543e-15 ***
species 1 0.23284 0.23284 11.357 0.001084 **
dist:species 1 0.63480 0.63480 30.963 2.376e-07 ***
Residuals 96 1.96817 0.02050
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow =c(2, 2)) plot(mod.raw)
Interpretation 1.2 - dist: F value is high, p-value is =<0.001 (highly significant) <- dist has strong effect on synchrony - species: f value above 10, but not super high, p-value is =<0.001. species variation has moderate effect on synchrony - dist:species = interaction. f value is high, p-value =<0.001. effect of distance on synchrony = not equal for all species - residuals: mean sq = unknown variation <- lower than dist/species/dist:species means that the model captures a lot of synchrony variation
Question 1.3
Interpretation 1.3 - Residuals v Fitted: red line is curved <-potential linearity violation? - residuals fan out, indicating heteroscedasticity - Outliers may be impacting fit (24, 14, 51 i’m lookin at you) - QQ Residuals: most points fall along diagnal line, mostly normal! - again with the outliers!! indicates potential heavy tails. - Scale-Location: red line NOT flat <- variance increases slightly for higher fitted values? 51! you’re crazy. - Residuals v Leverage: point 51 has high tandardized residuals AND leverage <- strong influence on mode. - overall: potential non-linearity, heteroscedascticity, influential point 51.
Question 1.4
#df1$corr_trans <- ifelse(df1$corr > 0, log(df1$corr), NA)# Fit new modelmod.trans <-lm(log(corr +1) ~ dist * species, data = df1)anova_trans <-anova(mod.trans)print(anova_trans)
Analysis of Variance Table
Response: log(corr + 1)
Df Sum Sq Mean Sq F value Pr(>F)
dist 1 0.91938 0.91938 87.624 3.515e-15 ***
species 1 0.10938 0.10938 10.425 0.001702 **
dist:species 1 0.32895 0.32895 31.351 2.043e-07 ***
Residuals 96 1.00725 0.01049
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow =c(2, 2))plot(mod.trans)
#CHANGE this to one line, can simplify based on other code
Interpretation 1.4 - ANOVA - point 51 is no longer as influential of an outlier (xcept in R v L) - the differences in p-values were amplified <- species is no longer significant - residuals mean sq increased - PLOTs - RvF: more linear 51 not an issue (but now 28, 24, 84 DEFINITELY are) - QQ: less influenced by 51, now more influenced by 28, 24, 84? - S-L: line of best fit = flatter, but still heteroscedastic - RvL: 51 has significantly less leverage, 24 and 84 pulling things down
Question 1.5
Interpretation 1.5 Autocorrection impacts: - Coefficient Estimates - if synchrony is spatially/temporally dependent (ex: dist), the effect may be under or over estimated depending on autocorrect structure - interaction between dist:spec may be misleading <- confusion between shared envir conditionvs vs. species differences - Standard Errors - underestimated due to assumption of each variable being independent - overestimation of precision of estimates, making them seem more reliable than they are ^^^ - overly narrow confidence intervals due to overestimation of estimate precision - P-values - standard errors UNDERestimated <- p-values will be TOO small - could cause wrong rejection of null hypothesis whe its true - ex: species became non-significant in transformed model: this may be a mis-calculation
Question 1.6
perm_glm <-function(y, x, nperms =999) {#model base mod.base <-lm(y ~ dist*species,data=x) observed_fvals <-anova(mod.base)$'F value'[1:3]#vector for storing fvals from permutations perm_fvals <-matrix(NA, nrow=nperms+1, ncol=length(observed_fvals))#monte carlo simulationsfor (i in1:nperms) { y_perm <-sample(y) # randomly shuffle response variable x_perm <-as.data.frame(apply(x,2,sample)) mod_perm <-lm(y_perm ~ dist*species,data=x_perm) #fit model perm_fvals[i, ] <-anova(mod_perm)$'F value'[1:3] }# Compute p-values as the proportion of permuted F-values greater than or equal to observed F-values perm_fvals[nperms+1, ] <- observed_fvals pval <-apply(perm_fvals, 2, function(x) mean(x >= x[nperms+1])) names(pval) <-names(observed_fvals)return(pval)}
Interpretation 1.6 - ANOVA - point 51 is no longer as influential of an outlier (xcept in R v L) - the differences in p-values were amplified <- species is no longer significant - residuals mean sq increased - PLOTs - RvF: more linear 51 not an issue (but now 28, 24, 84 DEFINITELY are) - QQ: less influenced by 51, now more influenced by 28, 24, 84? - S-L: line of best fit = flatter, but still heteroscedastic - RvL: 51 has significantly less leverage, 24 and 84 pulling things down
Question 1.7
y <- df1$corrx <-subset(df1, select=c('dist','species'))p_values_mc <-perm_glm(y, x, nperms =999)print(p_values_mc)
[1] 0.001 0.007 0.001
Task 2
Question 2.1
df2 <-read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn1_prob2.csv")#identify response and predictor variablesy_2 <- df2$wbd_incidence # Response variableX_2 <-as.matrix(df2[, -which(names(df2) =="wbd_incidence")]) # Extract microbial abundance and convert to matrix#fit multiple regression modelmod_wbd <-lm(y_2 ~ X_2)#check model summarysummary(mod_wbd)
Call:
lm(formula = y_2 ~ X_2)
Residuals:
ALL 20 residuals are 0: no residual degrees of freedom!
Coefficients: (11 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -93.65355 NaN NaN NaN
X_2microbe_abund1 -0.08226 NaN NaN NaN
X_2microbe_abund2 1.69854 NaN NaN NaN
X_2microbe_abund3 0.25740 NaN NaN NaN
X_2microbe_abund4 0.25370 NaN NaN NaN
X_2microbe_abund5 -19.24956 NaN NaN NaN
X_2microbe_abund6 1.31754 NaN NaN NaN
X_2microbe_abund7 1.32377 NaN NaN NaN
X_2microbe_abund8 19.91991 NaN NaN NaN
X_2microbe_abund9 0.66329 NaN NaN NaN
X_2microbe_abund10 0.30254 NaN NaN NaN
X_2microbe_abund11 -1.43707 NaN NaN NaN
X_2microbe_abund12 1.16239 NaN NaN NaN
X_2microbe_abund13 -1.11403 NaN NaN NaN
X_2microbe_abund14 -0.98747 NaN NaN NaN
X_2microbe_abund15 -0.28218 NaN NaN NaN
X_2microbe_abund16 -0.57312 NaN NaN NaN
X_2microbe_abund17 -0.44631 NaN NaN NaN
X_2microbe_abund18 -0.18795 NaN NaN NaN
X_2microbe_abund19 0.19997 NaN NaN NaN
X_2microbe_abund20 NA NA NA NA
X_2microbe_abund21 NA NA NA NA
X_2microbe_abund22 NA NA NA NA
X_2microbe_abund23 NA NA NA NA
X_2microbe_abund24 NA NA NA NA
X_2microbe_abund25 NA NA NA NA
X_2microbe_abund26 NA NA NA NA
X_2microbe_abund27 NA NA NA NA
X_2microbe_abund28 NA NA NA NA
X_2microbe_abund29 NA NA NA NA
X_2microbe_abund30 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 19 and 0 DF, p-value: NA
Interpretation 2.1 - the model didn’t really work: - no degrees of freedom suggests the same number or predictors and observations, because there are more variables than observations.!.
- alternatively, there could be extremely high collinearities between predictors “some coefficitients are”NA”
Interpretation 2.2 - many variables removed in lasso_coef - the variable with the highest s1 value (23 = 0.99) has the strongest predicted effect on wbd_incidence - positive coefficients: 23, 17, 6 increase likelihood of wbd_incidence - negative coefficients: 5, 7, 22 decrease wbd_incidence - this model works better because it removes insignificant and xtra large coefficients - this indicates that the multiple regression model was overfitted, and a simplier model is better (like LASSO)
Interpretation 2.3 - retains all variables, but shrunk toward zero - ridge regression assumes all variables impact WBD to some extent - if there is high multicollinearity in the data, RR might distirbute effect across correlated variables instead of “picking” one - so LASSO singled out varibables to deal with collinearity, culling xtras. RR distributed that collinearity, assuming all variables are at least somewhat impactful.
Question 2.4
my_ridge_fun <-function(x2k, y2k, lambda =0) { K <-ncol(x2k) N <-nrow(x2k)# Scale the explanatory variables x2k <-as.matrix(scale(x2k)) y2k <-as.matrix(y2k)# Coefficient estimates I_K <-diag(1, K, K) # K x K identity matrix# intercept is mean of Y when X standardized betas <-c(mean(y2k), solve(t(x2k) %*% x2k + lambda * I_K) %*%t(x2k) %*% y2k) betas <-as.numeric(betas)names(betas) <-c("Intercept", colnames(x2k))return(list(beta = betas))}kfold_ridge <-function(y2k, x2k, k =5, lambda =seq(from =0.1, to =100, len =20)) {# num observations N <-length(y2k)# k folds fold_idx <-sample(rep(1:k, length.out = N))# vector to store MSE for each lambda mse_values <-numeric(length(lambda))# Loop over lambda valuesfor (i inseq_along(lambda)) { mse_fold <-numeric(k)#k-fold cross-validationfor (j in1:k) { test_idx <-which(fold_idx == j) train_idx <-setdiff(1:N, test_idx)#train data x_train <- x2k[train_idx, , drop =FALSE] y_train <- y2k[train_idx]#test data x_test <- x2k[test_idx, , drop =FALSE] y_test <- y2k[test_idx]# Fit ridge regression model on training data ridge_model <-my_ridge_fun(x_train, y_train, lambda = lambda[i])#Add intercept, make predictions on test data y_pred <-cbind(1, scale(x_test)) %*% ridge_model$beta # Add intercept# Compute MSQ for this fold mse_fold[j] <-mean((y_test - y_pred)^2) }# Store mean MSE across all folds for current lambda mse_values[i] <-mean(mse_fold) }# Identify the best lambda (minimizing MSE) min_lambda <- lambda[which.min(mse_values)]return(list(mse = mse_values, lambda = lambda, min_lambda = min_lambda))}# Run k-fold cross-validation for Ridge regressionridge_results <-kfold_ridge( y2k = df2$wbd_incidence, x2k = df2[, -1], k =5, lambda =seq(0.1, 100, length.out =20))# Print resultsprint(ridge_results$min_lambda) # Best lambda
[1] 0.1
print(ridge_results$mse) # MSE values for each lambda