Advanced Biostatistics | Assignment 1

Author
Affiliation

Ariel Shatsky

Northeastern University

Published

February 25, 2025

Instructions

Respond to the questions by inserting your code and any required interpretation into this assignment template file. Submit your knitted HTML file on Canvas.

Task 1

Question 1.1

# Code chunk 1.1
df1 <- read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn1_prob1.csv") 
library(ggplot2) 
df1
      species       corr       dist
1     Mussels 0.91897467  0.0000000
2     Mussels 0.92943378  0.8163265
3     Mussels 0.76248642  1.6326531
4     Mussels 0.94565410  2.4489796
5     Mussels 0.76385089  3.2653061
6     Mussels 0.59795650  4.0816327
7     Mussels 0.68183924  4.8979592
8     Mussels 0.66369883  5.7142857
9     Mussels 0.60760329  6.5306122
10    Mussels 0.48276825  7.3469388
11    Mussels 0.63064545  8.1632653
12    Mussels 0.48726459  8.9795918
13    Mussels 0.35741388  9.7959184
14    Mussels 0.17157872 10.6122449
15    Mussels 0.48112904 11.4285714
16    Mussels 0.34164354 12.2448980
17    Mussels 0.32378252 13.0612245
18    Mussels 0.40067524 13.8775510
19    Mussels 0.37080184 14.6938776
20    Mussels 0.33183854 15.5102041
21    Mussels 0.34938719 16.3265306
22    Mussels 0.32191676 17.1428571
23    Mussels 0.23845401 17.9591837
24    Mussels 0.02035271 18.7755102
25    Mussels 0.27047873 19.5918367
26    Mussels 0.19293752 20.4081633
27    Mussels 0.17380473 21.2244898
28    Mussels 0.03386145 22.0408163
29    Mussels 0.12533630 22.8571429
30    Mussels 0.20777036 23.6734694
31    Mussels 0.29523152 24.4897959
32    Mussels 0.14299051 25.3061224
33    Mussels 0.18641990 26.1224490
34    Mussels 0.13709596 26.9387755
35    Mussels 0.00000000 27.7551020
36    Mussels 0.09180995 28.5714286
37    Mussels 0.08982851 29.3877551
38    Mussels 0.11959188 30.2040816
39    Mussels 0.23208421 31.0204082
40    Mussels 0.19522748 31.8367347
41    Mussels 0.09953441 32.6530612
42    Mussels 0.08795661 33.4693878
43    Mussels 0.18050630 34.2857143
44    Mussels 0.16418810 35.1020408
45    Mussels 0.03753740 35.9183673
46    Mussels 0.03371995 36.7346939
47    Mussels 0.13913650 37.5510204
48    Mussels 0.17788085 38.3673469
49    Mussels 0.08827159 39.1836735
50    Mussels 0.18621490 40.0000000
51  Barnacles 1.02030475  0.0000000
52  Barnacles 0.47293356  0.8163265
53  Barnacles 0.37093449  1.6326531
54  Barnacles 0.13666320  2.4489796
55  Barnacles 0.35434473  3.2653061
56  Barnacles 0.39203806  4.0816327
57  Barnacles 0.14974145  4.8979592
58  Barnacles 0.07871950  5.7142857
59  Barnacles 0.23863262  6.5306122
60  Barnacles 0.16750435  7.3469388
61  Barnacles 0.42088388  8.1632653
62  Barnacles 0.17667093  8.9795918
63  Barnacles 0.24951264  9.7959184
64  Barnacles 0.18331407 10.6122449
65  Barnacles 0.10617555 11.4285714
66  Barnacles 0.19937724 12.2448980
67  Barnacles 0.00000000 13.0612245
68  Barnacles 0.32705040 13.8775510
69  Barnacles 0.19581983 14.6938776
70  Barnacles 0.39775548 15.5102041
71  Barnacles 0.22804518 16.3265306
72  Barnacles 0.10949955 17.1428571
73  Barnacles 0.24156681 17.9591837
74  Barnacles 0.08708440 18.7755102
75  Barnacles 0.05513082 19.5918367
76  Barnacles 0.20963879 20.4081633
77  Barnacles 0.13616498 21.2244898
78  Barnacles 0.18060470 22.0408163
79  Barnacles 0.18792829 22.8571429
80  Barnacles 0.12154207 23.6734694
81  Barnacles 0.12362729 24.4897959
82  Barnacles 0.16697630 25.3061224
83  Barnacles 0.29830286 26.1224490
84  Barnacles 0.02813748 26.9387755
85  Barnacles 0.23988878 27.7551020
86  Barnacles 0.21378920 28.5714286
87  Barnacles 0.28680415 29.3877551
88  Barnacles 0.15007577 30.2040816
89  Barnacles 0.21749604 31.0204082
90  Barnacles 0.20720404 31.8367347
91  Barnacles 0.12624216 32.6530612
92  Barnacles 0.30128094 33.4693878
93  Barnacles 0.29653442 34.2857143
94  Barnacles 0.25051553 35.1020408
95  Barnacles 0.33917751 35.9183673
96  Barnacles 0.23634280 36.7346939
97  Barnacles 0.05283494 37.5510204
98  Barnacles 0.12316762 38.3673469
99  Barnacles 0.05803290 39.1836735
100 Barnacles 0.13315410 40.0000000
ggplot(data = df1, mapping = aes (x = dist, y = corr, colour = species)) + geom_point()

Interpretation 1.1

Question 1.2

mod.raw <- lm(corr ~ dist*species, data = df1)
print(summary(mod.raw))

Call:
lm(formula = corr ~ dist * species, data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.31927 -0.07954 -0.01913  0.09255  0.70715 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          0.313151   0.039899   7.849 5.93e-12 ***
dist                -0.004612   0.001719  -2.683  0.00859 ** 
speciesMussels       0.367042   0.056425   6.505 3.51e-09 ***
dist:speciesMussels -0.013527   0.002431  -5.564 2.38e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1432 on 96 degrees of freedom
Multiple R-squared:  0.5751,    Adjusted R-squared:  0.5618 
F-statistic: 43.31 on 3 and 96 DF,  p-value: < 2.2e-16
anova(mod.raw)
Analysis of Variance Table

Response: corr
             Df  Sum Sq Mean Sq F value    Pr(>F)    
dist          1 1.79586 1.79586  87.595 3.543e-15 ***
species       1 0.23284 0.23284  11.357  0.001084 ** 
dist:species  1 0.63480 0.63480  30.963 2.376e-07 ***
Residuals    96 1.96817 0.02050                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation 1.2 The ANOVA result is significant evidenced by p-values less than 0.05. The table indicates there are significant differences in correlation between the treatments of distance, species, and the interaction between species and distance.

Question 1.3

plot(mod.raw)

Interpretation 1.3 The QQ-residuals plot shows the majority of the data is on the line, but it strays from the line near then end indicating that our residuals are likely normally distributed, but maybe another model could fit the results a bit more accurately.

Question 1.4

mod.trans <- lm(formula = log(corr + 1) ~ dist*species, data = df1)
anova(mod.trans)
Analysis of Variance Table

Response: log(corr + 1)
             Df  Sum Sq Mean Sq F value    Pr(>F)    
dist          1 0.91938 0.91938  87.624 3.515e-15 ***
species       1 0.10938 0.10938  10.425  0.001702 ** 
dist:species  1 0.32895 0.32895  31.351 2.043e-07 ***
Residuals    96 1.00725 0.01049                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(mod.trans)

Interpretation 1.4 There was a slight improvement visible to the normality on the QQ-residuals plot in which the data aligns better with the line, but at both the low and high ends, it strays from the line.

Interpretation 1.5 If autocorrelation was present in our data I would expect that means that some of the residuals are not independent. I think this would impact the modeling results by affecting the significance, in which model coefficient estimates and p-values which may be overestimated and standard error to be underestimated.

Question 1.6

perm_glm <- function(x, y, nperms = 999) {
  mod_raw <- lm(y ~ species * dist, data = x)
  observed_fvals <- na.omit(anova(mod_raw)$`F value`)
 
  perm_fvals <- matrix(NA, nrow=nperms + 1, ncol = length(observed_fvals))
  
  #monte carlo sim
  for (i in 1:nperms) {
    y_perm <- sample(y) 
    x_perm <- apply(x, 2, sample)
    mod_perm <- lm(y_perm ~ species * dist, data = as.data.frame(x_perm)) #fit model
    perm_fvals[i,] <- na.omit(anova(mod_perm)$`F value`)
  }

  perm_fvals[nperms + 1,] <- observed_fvals
#compute p values 
  pval <- apply(perm_fvals, 2, function(x) mean(x >= x[nperms + 1]))
  names(pval) <- names(observed_fvals)

  return(pval)
}

Question 1.7

y <- df1$corr
x <- subset(df1, select = c("dist", "species"))
p_values_monte <- perm_glm(x, y, nperms = 999)
print(p_values_monte)
[1] 0.004 0.001 0.001

Interpretation 1.7

Each of the p values increased a bit following the monte carlo simulation. Monte carlo simulations give us an approximation of the p-value, so sometimes it increases slightly during simulations. It is also possible that the original test underestimated the variance of the null distribution, and the Monte Carlo better accounts for randomness, so now the result is less significant than the initial test.

Task 2

Question 2.1

# Code chunk 2.1
df2 <- read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn1_prob2.csv")
str(df2)
'data.frame':   20 obs. of  31 variables:
 $ wbd_incidence  : num  56.2 35.5 30.7 77.6 72.8 ...
 $ microbe_abund1 : num  7.69 52.8 96.23 70.87 55.35 ...
 $ microbe_abund2 : num  81.2 78.2 26.8 76.2 98.6 ...
 $ microbe_abund3 : num  42.4 53.1 94.3 71.2 72.4 ...
 $ microbe_abund4 : num  33.2 79.5 72.1 20.4 52.6 ...
 $ microbe_abund5 : num  72 19.9 54.8 33.8 32.3 ...
 $ microbe_abund6 : num  33.36 57.83 43.28 5.16 72.98 ...
 $ microbe_abund7 : num  18.978 0.184 87.758 13.411 2.274 ...
 $ microbe_abund8 : num  73.2 20 55.2 32.8 31.4 ...
 $ microbe_abund9 : num  46.9 17.2 36.9 72.5 48.6 ...
 $ microbe_abund10: num  91.4 19.9 36.9 67.1 76.8 ...
 $ microbe_abund11: num  35.1 39.4 95.1 10.7 93.5 ...
 $ microbe_abund12: num  8.53 93.26 83.84 87.94 93.57 ...
 $ microbe_abund13: num  6.45 75.47 62.04 16.96 6.22 ...
 $ microbe_abund14: num  71.6 16.3 47.6 69 46.1 ...
 $ microbe_abund15: num  79.7 83.2 11.4 96.3 14.7 ...
 $ microbe_abund16: num  99.6 37.9 56.2 73.3 87.1 ...
 $ microbe_abund17: num  36.7 94.3 12.4 7 96.4 ...
 $ microbe_abund18: num  3.08 86.2 68.51 94.21 67.59 ...
 $ microbe_abund19: num  38.61 42.61 1.18 91.93 7.94 ...
 $ microbe_abund20: num  25.6 63.5 48.6 93.8 85.8 ...
 $ microbe_abund21: num  47.3 83.6 12.1 67.6 49.7 ...
 $ microbe_abund22: num  17.64 81.34 6.84 40.04 14.11 ...
 $ microbe_abund23: num  56 34.1 32.1 77.9 70.5 ...
 $ microbe_abund24: num  7.12 70.28 69.88 46.4 43.69 ...
 $ microbe_abund25: num  31.9 80.1 72.6 19.4 55.8 ...
 $ microbe_abund26: num  80.4 75.9 95.7 99.4 60.6 ...
 $ microbe_abund27: num  37.04 78.1 1.11 94.03 99.37 ...
 $ microbe_abund28: num  32.1 15.5 13.2 22.1 22.6 ...
 $ microbe_abund29: num  40.2 85 75.7 53.3 87.4 ...
 $ microbe_abund30: num  24.2 64.4 28.1 95.8 15.8 ...
head(df2)
  wbd_incidence microbe_abund1 microbe_abund2 microbe_abund3 microbe_abund4
1      56.22359       7.689704       81.23051       42.35584       33.18790
2      35.45888      52.804888       78.21821       53.08244       79.48331
3      30.73228      96.233331       26.78781       94.27048       72.11040
4      77.55857      70.874005       76.21515       71.22246       20.41545
5      72.79447      55.347570       98.63116       72.44906       52.60227
6      20.57576      24.295661       29.36055       47.01286       34.34406
  microbe_abund5 microbe_abund6 microbe_abund7 microbe_abund8 microbe_abund9
1       71.99167      33.360552     18.9779918       73.20798      46.925094
2       19.88514      57.828427      0.1836858       20.03184      17.180008
3       54.80967      43.277313     87.7578062       55.18989      36.918946
4       33.77944       5.159532     13.4111338       32.78699      72.540527
5       32.30500      72.980329      2.2741224       31.37119      48.614910
6       81.22108      54.817041     93.9136706       80.44817       6.380247
  microbe_abund10 microbe_abund11 microbe_abund12 microbe_abund13
1        91.37846        35.06256        8.531101        6.445754
2        19.94422        39.39491       93.257193       75.470562
3        36.90836        95.09510       83.838407       62.041003
4        67.14083        10.66483       87.943330       16.957677
5        76.81446        93.47601       93.571247        6.221405
6        52.22483        34.61621        7.246063       10.902927
  microbe_abund14 microbe_abund15 microbe_abund16 microbe_abund17
1        71.58333        79.68361        99.59655       36.656362
2        16.30855        83.16717        37.92767       94.250532
3        47.61880        11.35077        56.19881       12.372357
4        69.02567        96.33120        73.27180        7.003268
5        46.08952        14.73229        87.08056       96.431704
6        95.51467        14.36269        57.21703       44.251011
  microbe_abund18 microbe_abund19 microbe_abund20 microbe_abund21
1        3.075657       38.609963        25.59822        47.25581
2       86.203404       42.609810        63.48758        83.55179
3       68.510751        1.175973        48.56721        12.14715
4       94.207491       91.933177        93.81769        67.58081
5       67.585376        7.943969        85.75015        49.66852
6       84.312015       50.737425        37.08835        90.22377
  microbe_abund22 microbe_abund23 microbe_abund24 microbe_abund25
1       17.635071        55.99903        7.115402        31.90442
2       81.343521        34.12953       70.277874        80.11477
3        6.844664        32.13540       69.882454        72.60808
4       40.044975        77.94594       46.396238        19.44886
5       14.114433        70.49000       43.693111        55.77443
6       19.330986        21.39664       56.217679        35.91584
  microbe_abund26 microbe_abund27 microbe_abund28 microbe_abund29
1       80.439077       37.035707        32.14921        40.21971
2       75.869681       78.101754        15.48316        84.95521
3       95.724989        1.114951        13.22282        75.66449
4       99.391388       94.030871        22.13059        53.26012
5       60.644099       99.374923        22.63808        87.41497
6        2.937717       35.740575        13.14165        46.71151
  microbe_abund30
1        24.15792
2        64.41072
3        28.07502
4        95.76365
5        15.83960
6        41.83390
df2
   wbd_incidence microbe_abund1 microbe_abund2 microbe_abund3 microbe_abund4
1       56.22359       7.689704      81.230509     42.3558418       33.18790
2       35.45888      52.804888      78.218212     53.0824364       79.48331
3       30.73228      96.233331      26.787813     94.2704762       72.11040
4       77.55857      70.874005      76.215153     71.2224559       20.41545
5       72.79447      55.347570      98.631159     72.4490575       52.60227
6       20.57576      24.295661      29.360555     47.0128618       34.34406
7       54.70537      77.804253      39.935111     12.0282256       30.65204
8       36.00933      65.194094      81.213152     78.3097671       79.54620
9       32.16553      83.024570       7.715167     43.8157397       73.14090
10      80.34924      64.855095      36.369681     43.1455980       21.59729
11      72.58983      47.983579      44.259247      2.7497879       54.35633
12      21.39755      49.506411      15.671413     14.6561843       34.57788
13      54.79555      37.987259      58.220527     42.2595158       32.23066
14      33.09488      45.048544      97.016218     76.7137174       80.43727
15      31.96542      81.425175      98.949983      0.4766445       70.55324
16      79.70336      92.877723      17.645204     60.3595692       19.46981
17      71.52192      14.748105      54.213042     90.5577261       55.12174
18      19.15905      74.982166      38.430389     70.6661532       36.05616
19      54.24636      97.565735      67.616405     26.2537159       31.05991
20      35.48680      97.479246      26.929378     85.1076219       80.90662
   microbe_abund5 microbe_abund6 microbe_abund7 microbe_abund8 microbe_abund9
1        71.99167      33.360552     18.9779918       73.20798      46.925094
2        19.88514      57.828427      0.1836858       20.03184      17.180008
3        54.80967      43.277313     87.7578062       55.18989      36.918946
4        33.77944       5.159532     13.4111338       32.78699      72.540527
5        32.30500      72.980329      2.2741224       31.37119      48.614910
6        81.22108      54.817041     93.9136706       80.44817       6.380247
7        71.11023      75.122199     29.2948723       70.74228      78.454623
8        19.53912       5.077116     16.4326574       20.22956      41.832164
9        55.13233      71.496621     39.9102556       55.83556      98.101808
10       34.83329      29.769446     45.9575412       35.23809      28.288396
11       33.23978      28.347721     43.4030849       32.42738      84.788215
12       79.52348      82.987713     51.7009826       79.98302       8.223923
13       72.12384       8.639538     84.6245753       72.57819      88.645875
14       20.39282       4.265712      5.5164286       19.19371      47.193073
15       53.83160      34.874081     55.4177061       54.22199      10.910096
16       34.95716      54.233601     68.8275238       34.97852      33.327798
17       29.59893      60.946086     65.8057554       30.79420      83.741657
18       80.81654      27.137210     66.3342725       79.99342      27.684984
19       71.56505      20.523093     47.2234203       71.33117      58.703514
20       22.38170      38.163215     96.9528166       22.45476      83.673227
   microbe_abund10 microbe_abund11 microbe_abund12 microbe_abund13
1         91.37846       35.062557        8.531101        6.445754
2         19.94422       39.394906       93.257193       75.470562
3         36.90836       95.095101       83.838407       62.041003
4         67.14083       10.664832       87.943330       16.957677
5         76.81446       93.476012       93.571247        6.221405
6         52.22483       34.616210        7.246063       10.902927
7         82.80750       53.306061       37.875944       38.171635
8         52.70962       53.879430       53.786492       16.931091
9         50.17548       71.471795       10.505014       29.865254
10        41.99733       40.579050       80.168771       19.220954
11        36.22983       15.278814       73.964175       25.717002
12        12.34289       34.023276        5.214901       18.123182
13        29.81615       62.665485       48.216957       47.731371
14        27.66765        5.737268       92.051784       77.073704
15        77.02254       85.166764        4.152843        2.778712
16        77.81813       21.264535       29.399180       52.731078
17        14.37873       53.946203       50.085049       88.031907
18        51.55262       13.648759       60.974894       37.306337
19        59.72405       32.486514       26.424905        4.795913
20        50.58430       62.107629       42.309861       13.862825
   microbe_abund14 microbe_abund15 microbe_abund16 microbe_abund17
1         71.58333        79.68361       99.596548      36.6563616
2         16.30855        83.16717       37.927673      94.2505322
3         47.61880        11.35077       56.198809      12.3723565
4         69.02567        96.33120       73.271802       7.0032679
5         46.08952        14.73229       87.080555      96.4317038
6         95.51467        14.36269       57.217026      44.2510108
7         71.25401        92.52299        1.103607      37.0272380
8         39.71479        50.70356       90.631526      17.0243580
9         11.77206        15.48510       77.065363       5.4190429
10        24.01163        34.83021       38.250462      65.7828069
11        86.36306        65.98210        9.404589      57.8161917
12        43.59764        31.17724        4.965358      98.7101764
13        49.78681        35.15734       82.116232      60.3792401
14        69.19277        14.78457       82.932430       6.4949919
15        76.03133        65.88776       65.473288      16.2109082
16        15.54012        18.50700       13.282781      47.5397920
17        84.94571        95.43781       34.180990       0.1932835
18        94.68178        89.78485       73.137158      44.1459143
19        58.84192        94.36971       90.729142      26.0929737
20        50.22508        72.36908       69.619700      93.8413745
   microbe_abund18 microbe_abund19 microbe_abund20 microbe_abund21
1         3.075657       38.609963       25.598225        47.25581
2        86.203404       42.609810       63.487580        83.55179
3        68.510751        1.175973       48.567211        12.14715
4        94.207491       91.933177       93.817692        67.58081
5        67.585376        7.943969       85.750154        49.66852
6        84.312015       50.737425       37.088354        90.22377
7        36.189418       82.017162       31.420183        55.12640
8        39.236589       59.839542       82.853436        12.85647
9        56.768740       42.415353       45.184151        44.20098
10        9.515213       55.931027       31.587841        19.22977
11       19.378407       78.909447        9.780854        43.49006
12       58.806639       16.771526        6.490054        22.51334
13       75.150417       97.045173       68.945737        96.10963
14       86.723879       47.350310       66.805060        44.84275
15       37.179574       92.974321       90.454665        77.71450
16       79.881455       90.093927       30.169327        15.82198
17        5.831439       75.088219       93.280870        86.68086
18       62.343571       67.656877       20.198302        20.61456
19       35.664141       64.801345       79.237876        17.79497
20       58.792793        7.324687       22.463051        16.48911
   microbe_abund22 microbe_abund23 microbe_abund24 microbe_abund25
1        17.635071        55.99903        7.115402        31.90442
2        81.343521        34.12953       70.277874        80.11477
3         6.844664        32.13540       69.882454        72.60808
4        40.044975        77.94594       46.396238        19.44886
5        14.114433        70.49000       43.693111        55.77443
6        19.330986        21.39664       56.217679        35.91584
7        84.135172        54.00936       92.848323        29.54893
8        71.991399        34.94779       23.046641        78.86322
9        26.721208        32.01817       22.181375        71.28057
10       49.500164        79.33901       42.021589        20.18266
11        8.311390        71.81146       33.352081        54.89484
12       35.388424        20.53797       86.480755        34.91118
13       96.920881        56.46572       17.719454        31.19188
14       62.471419        33.08574       49.331873        78.71037
15       66.461825        31.78128       42.971337        70.65753
16       31.248966        80.01571       56.426384        21.68623
17       40.568961        72.47588       65.616232        54.73248
18       99.607737        20.59233       97.855406        35.82099
19       85.508236        55.07981       23.216115        31.34123
20       95.354840        36.27698       24.081160        82.26092
   microbe_abund26 microbe_abund27 microbe_abund28 microbe_abund29
1        80.439077       37.035707       32.149212      40.2197063
2        75.869681       78.101754       15.483161      84.9552104
3        95.724989        1.114951       13.222817      75.6644907
4        99.391388       94.030871       22.130593      53.2601219
5        60.644099       99.374923       22.638080      87.4149661
6         2.937717       35.740575       13.141653      46.7115116
7        33.644536       74.763506       98.156346       0.8128455
8        27.765809       79.290902       32.701373      72.7767005
9        11.719755       70.585901       50.693950      71.6589479
10        4.321826       47.582504       68.144251      18.7426371
11       37.030979       49.465453        9.916910      64.6067277
12       33.687831       30.805245       11.890256      54.1979280
13       17.365255       69.501225        5.043966      33.5320760
14       62.177328       82.279331       92.925392      63.7908723
15       39.784363       43.471764       67.371223      82.9201064
16       95.567577       51.473265        9.485786      70.8975198
17       65.334944       66.301097       49.259612      34.8550351
18       32.874372       14.316659       46.155184      12.8327875
19       19.714671       34.448739       37.521653      38.8078489
20       11.534232       40.576358       99.109922      92.8177548
   microbe_abund30
1        24.157922
2        64.410723
3        28.075021
4        95.763652
5        15.839596
6        41.833899
7        25.200980
8         9.439027
9        82.771777
10       52.530550
11       66.774763
12       40.827776
13       84.258990
14       73.730547
15       34.822440
16       94.893818
17       64.667919
18        3.527777
19       59.644846
20       41.531800
microbe_matrix <- as.matrix(df2[,-1]) #remove the response (wbd_incidence)
head(microbe_matrix)
     microbe_abund1 microbe_abund2 microbe_abund3 microbe_abund4 microbe_abund5
[1,]       7.689704       81.23051       42.35584       33.18790       71.99167
[2,]      52.804888       78.21821       53.08244       79.48331       19.88514
[3,]      96.233331       26.78781       94.27048       72.11040       54.80967
[4,]      70.874005       76.21515       71.22246       20.41545       33.77944
[5,]      55.347570       98.63116       72.44906       52.60227       32.30500
[6,]      24.295661       29.36055       47.01286       34.34406       81.22108
     microbe_abund6 microbe_abund7 microbe_abund8 microbe_abund9
[1,]      33.360552     18.9779918       73.20798      46.925094
[2,]      57.828427      0.1836858       20.03184      17.180008
[3,]      43.277313     87.7578062       55.18989      36.918946
[4,]       5.159532     13.4111338       32.78699      72.540527
[5,]      72.980329      2.2741224       31.37119      48.614910
[6,]      54.817041     93.9136706       80.44817       6.380247
     microbe_abund10 microbe_abund11 microbe_abund12 microbe_abund13
[1,]        91.37846        35.06256        8.531101        6.445754
[2,]        19.94422        39.39491       93.257193       75.470562
[3,]        36.90836        95.09510       83.838407       62.041003
[4,]        67.14083        10.66483       87.943330       16.957677
[5,]        76.81446        93.47601       93.571247        6.221405
[6,]        52.22483        34.61621        7.246063       10.902927
     microbe_abund14 microbe_abund15 microbe_abund16 microbe_abund17
[1,]        71.58333        79.68361        99.59655       36.656362
[2,]        16.30855        83.16717        37.92767       94.250532
[3,]        47.61880        11.35077        56.19881       12.372357
[4,]        69.02567        96.33120        73.27180        7.003268
[5,]        46.08952        14.73229        87.08056       96.431704
[6,]        95.51467        14.36269        57.21703       44.251011
     microbe_abund18 microbe_abund19 microbe_abund20 microbe_abund21
[1,]        3.075657       38.609963        25.59822        47.25581
[2,]       86.203404       42.609810        63.48758        83.55179
[3,]       68.510751        1.175973        48.56721        12.14715
[4,]       94.207491       91.933177        93.81769        67.58081
[5,]       67.585376        7.943969        85.75015        49.66852
[6,]       84.312015       50.737425        37.08835        90.22377
     microbe_abund22 microbe_abund23 microbe_abund24 microbe_abund25
[1,]       17.635071        55.99903        7.115402        31.90442
[2,]       81.343521        34.12953       70.277874        80.11477
[3,]        6.844664        32.13540       69.882454        72.60808
[4,]       40.044975        77.94594       46.396238        19.44886
[5,]       14.114433        70.49000       43.693111        55.77443
[6,]       19.330986        21.39664       56.217679        35.91584
     microbe_abund26 microbe_abund27 microbe_abund28 microbe_abund29
[1,]       80.439077       37.035707        32.14921        40.21971
[2,]       75.869681       78.101754        15.48316        84.95521
[3,]       95.724989        1.114951        13.22282        75.66449
[4,]       99.391388       94.030871        22.13059        53.26012
[5,]       60.644099       99.374923        22.63808        87.41497
[6,]        2.937717       35.740575        13.14165        46.71151
     microbe_abund30
[1,]        24.15792
[2,]        64.41072
[3,]        28.07502
[4,]        95.76365
[5,]        15.83960
[6,]        41.83390
#multi regression model
mod.microbe <- lm(df2$wbd_incidence ~ microbe_matrix)
summary(mod.microbe)

Call:
lm(formula = df2$wbd_incidence ~ microbe_matrix)

Residuals:
ALL 20 residuals are 0: no residual degrees of freedom!

Coefficients: (11 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                   -93.65355        NaN     NaN      NaN
microbe_matrixmicrobe_abund1   -0.08226        NaN     NaN      NaN
microbe_matrixmicrobe_abund2    1.69854        NaN     NaN      NaN
microbe_matrixmicrobe_abund3    0.25740        NaN     NaN      NaN
microbe_matrixmicrobe_abund4    0.25370        NaN     NaN      NaN
microbe_matrixmicrobe_abund5  -19.24956        NaN     NaN      NaN
microbe_matrixmicrobe_abund6    1.31754        NaN     NaN      NaN
microbe_matrixmicrobe_abund7    1.32377        NaN     NaN      NaN
microbe_matrixmicrobe_abund8   19.91991        NaN     NaN      NaN
microbe_matrixmicrobe_abund9    0.66329        NaN     NaN      NaN
microbe_matrixmicrobe_abund10   0.30254        NaN     NaN      NaN
microbe_matrixmicrobe_abund11  -1.43707        NaN     NaN      NaN
microbe_matrixmicrobe_abund12   1.16239        NaN     NaN      NaN
microbe_matrixmicrobe_abund13  -1.11403        NaN     NaN      NaN
microbe_matrixmicrobe_abund14  -0.98747        NaN     NaN      NaN
microbe_matrixmicrobe_abund15  -0.28218        NaN     NaN      NaN
microbe_matrixmicrobe_abund16  -0.57312        NaN     NaN      NaN
microbe_matrixmicrobe_abund17  -0.44631        NaN     NaN      NaN
microbe_matrixmicrobe_abund18  -0.18795        NaN     NaN      NaN
microbe_matrixmicrobe_abund19   0.19997        NaN     NaN      NaN
microbe_matrixmicrobe_abund20        NA         NA      NA       NA
microbe_matrixmicrobe_abund21        NA         NA      NA       NA
microbe_matrixmicrobe_abund22        NA         NA      NA       NA
microbe_matrixmicrobe_abund23        NA         NA      NA       NA
microbe_matrixmicrobe_abund24        NA         NA      NA       NA
microbe_matrixmicrobe_abund25        NA         NA      NA       NA
microbe_matrixmicrobe_abund26        NA         NA      NA       NA
microbe_matrixmicrobe_abund27        NA         NA      NA       NA
microbe_matrixmicrobe_abund28        NA         NA      NA       NA
microbe_matrixmicrobe_abund29        NA         NA      NA       NA
microbe_matrixmicrobe_abund30        NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 19 and 0 DF,  p-value: NA
anova(mod.microbe)
Warning in anova.lm(mod.microbe): ANOVA F-tests on an essentially perfect fit
are unreliable
Analysis of Variance Table

Response: df2$wbd_incidence
               Df Sum Sq Mean Sq F value Pr(>F)
microbe_matrix 19 8667.6  456.19     NaN    NaN
Residuals       0    0.0     NaN               

Interpretation 2.1 This regression doesn’t seem to fit the model super well. After the 20th microbe abundance, the values go to NA.

Question 2.2

options(repos = c(CRAN = "https://cran.rstudio.com/"))  # Set CRAN mirror
install.packages("glmnet")  # Now install glmnet

The downloaded binary packages are in
    /var/folders/h4/wb2_p5v15cv6lbm_fkdwbghw0000gn/T//Rtmp54OERM/downloaded_packages
install.packages("glmnet")

The downloaded binary packages are in
    /var/folders/h4/wb2_p5v15cv6lbm_fkdwbghw0000gn/T//Rtmp54OERM/downloaded_packages
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
y <- df2$wbd_incidence
x <- as.matrix(df2[,-1])

#lasso regression
set.seed(123)
cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 5)
plot(cv_lasso)

#optimal lambda
optimal_lambda <- cv_lasso$lambda.min
cat("Optimal lambda:", optimal_lambda, "\n")
Optimal lambda: 0.2390713 
#fit lasso model using optimal lambda
lasso_model <- glmnet(x, y, alpha = 1, lambda = optimal_lambda)

#extract coefficients of selected features
lasso_coef <- coef(lasso_model)
print(lasso_coef)
31 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)      0.9802153348
microbe_abund1   .           
microbe_abund2   .           
microbe_abund3   .           
microbe_abund4   .           
microbe_abund5  -0.0004071347
microbe_abund6   0.0022111154
microbe_abund7  -0.0182062029
microbe_abund8   .           
microbe_abund9   .           
microbe_abund10  .           
microbe_abund11  .           
microbe_abund12  .           
microbe_abund13  .           
microbe_abund14  .           
microbe_abund15  .           
microbe_abund16  .           
microbe_abund17  0.0053385338
microbe_abund18  .           
microbe_abund19  .           
microbe_abund20  .           
microbe_abund21  .           
microbe_abund22 -0.0007220583
microbe_abund23  0.9918812638
microbe_abund24  .           
microbe_abund25  .           
microbe_abund26  .           
microbe_abund27  .           
microbe_abund28  .           
microbe_abund29  .           
microbe_abund30  .           

Interpretation 2.2

Microbes with coefficients greater than 0 (abundance 5,6,7,17,22,23) are significantly associated with wbd incidence. So the model tell us that it is likely that only these particular microbes drive wbd incidence.

Question 2.3

install.packages("glmnet")

The downloaded binary packages are in
    /var/folders/h4/wb2_p5v15cv6lbm_fkdwbghw0000gn/T//Rtmp54OERM/downloaded_packages
library(glmnet)
y <- df2$wbd_incidence
X <- as.matrix(df2[, -1])
set.seed(123) 
cv_ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 5)
plot(cv_ridge)

optimal_lambda_ridge <- cv_ridge$lambda.min

cat("Optimal lambda (Ridge):", optimal_lambda_ridge, "\n")
Optimal lambda (Ridge): 207.932 
#optimal lambda = 207.932
ridge_model <- glmnet(X, y, alpha = 0, lambda = optimal_lambda_ridge)
#ridge coeff

ridge_coef <- coef(ridge_model)
print(ridge_coef)
31 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)     47.9933345932
microbe_abund1  -0.0037843589
microbe_abund2   0.0061934171
microbe_abund3  -0.0004809675
microbe_abund4  -0.0347716592
microbe_abund5  -0.0251179459
microbe_abund6  -0.0035850316
microbe_abund7  -0.0096659402
microbe_abund8  -0.0250068513
microbe_abund9   0.0211743146
microbe_abund10  0.0184975777
microbe_abund11 -0.0045118604
microbe_abund12  0.0156166038
microbe_abund13 -0.0009161306
microbe_abund14 -0.0083162012
microbe_abund15  0.0071138116
microbe_abund16 -0.0105996120
microbe_abund17  0.0019877231
microbe_abund18 -0.0194079910
microbe_abund19  0.0180571614
microbe_abund20  0.0090032279
microbe_abund21  0.0008472632
microbe_abund22 -0.0141578699
microbe_abund23  0.0777230387
microbe_abund24 -0.0158123049
microbe_abund25 -0.0341505601
microbe_abund26  0.0146782717
microbe_abund27  0.0227066494
microbe_abund28 -0.0059243288
microbe_abund29 -0.0095567147
microbe_abund30  0.0216232516

Interpretation 2.3 LASSO - shrinks some coefficients to zero, but ridge is not setting coefficients to zero, it is shrinking them all. In the LASSO model, it narrowed down the microbes that seemed to be most important to WBD incidence, but the ridge model shows the coefficients are all below zero and many of them are negative, which indicates that an abundance of these particular microbes decreases WBD incidence. For the coefficients with positive values, it indicates that that microbe’s abundance may contribute to WBD incidence. In the LASSO model, we saw microbe abundances 5,6,7,17,22,23 were linked to WBD incidence, whereas in the Ridge model, we see abundances 2,9,10,12,15,17,19,20,21,23,26,27, and 30 are the microbes that were positively correlated with WBD incidence. So, there is a clear difference in which microbes were identified by each model, only converging on abundances 17 and 23.

Question 2.4

my_ridge_fun <- function(x, y, lambda = 0) {
  K <- ncol(x)
  N <- nrow(x)
  x <- as.matrix(scale(x))
  y <- as.matrix(y)
  I_K <- diag(1, K, K)
  betas <- c(mean(y), solve(t(x) %*% x + lambda * I_K) %*% t(x) %*% y)
  betas <- as.numeric(betas)
  names(betas) <- c("Intercept", colnames(x))
  return(list(beta = betas))
}

kfold_ridge <- function(y, x, k = 5, lambda = seq(from = 0.1, to = 100, len = 20)) {
  # Ensure x is a matrix and y is a numeric vector
  x <- as.matrix(x)
  y <- as.numeric(y)
  
  # Number of observations
  N <- length(y)
  
  # Create k random folds
  set.seed(123)  # For reproducibility
  folds <- sample(rep(1:k, length.out = N))
  
  # Store MSE values for each lambda
  mse <- numeric(length(lambda))
  
  # Loop over lambda values
  for (i in seq_along(lambda)) {
    lambda_val <- lambda[i]
    mse_folds <- numeric(k)  # Store MSE for each fold
    
    # Perform k-fold cross-validation
    for (fold in 1:k) {
      # Training and testing indices
      test_idx <- which(folds == fold)
      train_idx <- setdiff(1:N, test_idx)
      
      # Training and test sets
      x_train <- x[train_idx, , drop = FALSE]
      y_train <- y[train_idx]
      x_test <- x[test_idx, , drop = FALSE]
      y_test <- y[test_idx]
      
      # Fit ridge regression model on training data
      ridge_model <- my_ridge_fun(x_train, y_train, lambda = lambda_val)
      
      # Extract coefficients (excluding intercept)
      beta <- ridge_model$beta[-1]  # Remove intercept
      
      # Make predictions on test data
      y_pred <- x_test %*% beta + ridge_model$beta[1]  # Apply intercept
      
      # Compute Mean Squared Error (MSE)
      mse_folds[fold] <- mean((y_test - y_pred)^2)
    }
    
    # Average MSE across k folds for current lambda
    mse[i] <- mean(mse_folds)
  }
  
  # Find the optimal lambda (corresponding to the lowest MSE)
  min_lambda <- lambda[which.min(mse)]
  
  return(list(mse = mse, lambda = lambda, min_lambda = min_lambda))

}

# Load dataset
url <- "https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn1_prob2.csv"
df2 <- read.csv(url)

# Extract response variable (WBD incidence)
y <- df2$wbd_incidence  

# Extract predictor matrix (microbial abundance)
X <- as.matrix(df2[, -1])  # Convert explanatory variables to matrix

# Run k-fold Ridge cross-validation
ridge_results <- kfold_ridge(y, X)

# Print results
print(paste("Optimal Lambda:", ridge_results$min_lambda))
[1] "Optimal Lambda: 100"
#list of mse, lambda, and min lambda
print(ridge_results)
$mse
 [1] 191936.46 118573.67  90391.91  71863.80  58777.00  49121.38  41760.86
 [8]  36002.23  31399.87  27655.48  24562.70  21974.71  19784.54  17912.65
[15]  16298.77  14896.51  13669.66  12589.56  11633.30  10782.32

$lambda
 [1]   0.100000   5.357895  10.615789  15.873684  21.131579  26.389474
 [7]  31.647368  36.905263  42.163158  47.421053  52.678947  57.936842
[13]  63.194737  68.452632  73.710526  78.968421  84.226316  89.484211
[19]  94.742105 100.000000

$min_lambda
[1] 100
# Plot Lambda vs. MSE
plot(ridge_results$lambda, ridge_results$mse, type = "b", 
     xlab = "Lambda", ylab = "Mean Squared Error", main = "Lambda Selection via k-Fold CV")

#optimal lambda = 100