Please respond to the questions by inserting your code and any required interpretation into this assignment template file. Submit the source Quarto and rendered HTML versions of your assignment via Canvas.
Task 1
Question 1.1
df1 <-read.csv("https://raw.githubusercontent.com/tgouhier/adv_biostats/refs/heads/main/assn2_prob1.csv")#create a numerical version of the response called alzheimers_num by converting#the original categorical variable alzheimers into a (discrete) numerical variablestr(df1)
# plot this new variable (y axis) as a function of time (x axis) using different symbols and colours for patients with vs. without amyloid plaques and briefly describe these patterns.library(ggplot2)df1$plaque <-as.factor(df1$plaque)# Scatter plot with jitter to prevent overlaplibrary(ggplot2)# Ensure plaque is a factordf1$plaque <-as.factor(df1$plaque)# Create the jittered scatter plotggplot(df1, aes(x = years, y =jitter(alzheimers_num), colour = plaque, shape = plaque)) +geom_point(size =3, alpha =0.7) +labs(x ="Years", y ="Alzheimer's Diagnosis (0 = No, 1 = Yes)",title ="Alzheimer's Diagnosis Over Time by Plaque Presence",colour ="Plaque Presence", shape ="Plaque Presence") +scale_colour_manual(values =c("absent"="blue", "present"="red")) +# Corrected mappingtheme_minimal()
Question 1.3
#load the car packagelibrary(car)
Loading required package: carData
# logistic regression modelmod.glm <-glm(alzheimers_num ~ plaque * years, data = df1, family = binomial)# summary of modelsummary(mod.glm)
Call:
glm(formula = alzheimers_num ~ plaque * years, family = binomial,
data = df1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.28438 0.41876 -5.455 4.89e-08 ***
plaquepresent 1.14475 0.53283 2.148 0.03168 *
years 0.09246 0.03142 2.943 0.00325 **
plaquepresent:years 0.08998 0.04496 2.001 0.04536 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 548.25 on 399 degrees of freedom
Residual deviance: 422.78 on 396 degrees of freedom
AIC: 430.78
Number of Fisher Scoring iterations: 4
#run anova table for the glm model,anodev tableAnova(mod.glm, test ="LR")
Describe the exact nature of the model and interpret each row of the ANOVA table.
In this question, I fit a logistic regression model with a binomial response variable. The model will predict the probability that an individual has Alzheimer’s based on the presence or absense of plaque and the years they were observed during the study.
The plaque row of the ANODEV table indicates the df column, which = 1 so there is just 1 predictor variable, plaque with 2 levels: present or absent. The LR Chisq column of this row shows how much the fit of the model improves when we include the plaque variable, which is 85 in this case, so the model is improved when plaque is included. The p-value in this row is < 0.05, which tells us that the presence of plaque has a significant effect on the presence/likelihood of Alzheimer’s.
The years row of the table shows the same df column, which is also = 1. The next column indicates how well the model fits the data, which shows that the model’s fit improves when years is included, but not as much as it improves when plaque is included in the model. The p-value column also indicates that the p value is < 0.5, so the number of years signficantly affects the likelihood of Alzheimer’s.
Question 1.4
plot(mod.glm)
acf(residuals(mod.glm, type ="pearson"))
# pearson residualspearson_residuals <-residuals(mod.glm, type ="pearson")df_residuals <- mod.glm$df.residualdispersion_parameter <-sum(pearson_residuals^2)/df_residuals dispersion_parameter
[1] 1.028507
Interpretation 1.4
The Q-Q residuals plot indicates that the residuals are somewhat normally distributed until the upper right side in which the points diverge from the line of normality.
The ACF plot seems to suggest no autocorrelation because the residuals seem to fluctuate above and below the zero line randomly, while mostly remaining within the 95% confidence interval (the blue lines) other than slightly exceeding the upper limit at a lag = about 14. Since there is no pattern amongst the residuals, this indicates that the model is showing time-dependent effects.
The dispersion parameter = 1.04, which is very close to 1, suggesting that there is no overdispersion, which means the model fits the data pretty well.
Question 1.5
# the odds ratio is comparing the odds of developing Alzheimer's after 20 years with vs. without plaques, using the logistic regression model. #odds ratio = Odds_plaques/O_no plaques = [exp (beta0 + beta1 _ 20 beta2)]/exp (beta0 + 20beta2)#simplified, this is odds ratio = exp(beta1) which represents the increased or decreased odds of getting Alzheimer's for patients with plaque vs. without plaque. mod.glm
Call: glm(formula = alzheimers_num ~ plaque * years, family = binomial,
data = df1)
Coefficients:
(Intercept) plaquepresent years
-2.28438 1.14475 0.09246
plaquepresent:years
0.08998
Degrees of Freedom: 399 Total (i.e. Null); 396 Residual
Null Deviance: 548.3
Residual Deviance: 422.8 AIC: 430.8
The downloaded binary packages are in
/var/folders/h4/wb2_p5v15cv6lbm_fkdwbghw0000gn/T//RtmpkVIUsv/downloaded_packages
library(minpack.lm)# Initial parameter valuesstart_vals <-list(a =0.01, b =0.1)# Linear model: y = a + b * timemod_linear <-lm(fitness ~ time, data = df2)# Power-law model: y = a * time^bmod_power <-nlsLM(fitness ~ a * time^b, data = df2, start = start_vals)# Hyperbolic model: y = (b * time) / (a + time)mod_hyperbolic <-nlsLM(fitness ~ (b * time) / (a + time), data = df2, start = start_vals)
Question 2.2
# setting up data for plottingdf2$fit_linear <-predict(mod_linear)df2$fit_power <-predict(mod_power)df2$fit_hyperbolic <-predict(mod_hyperbolic)# plotting the dataplot(df2$time, df2$fitness, pch =16, col ="black", xlab ="Time (generations)",ylab ="Relative Fitness", main ="Model Fits to Lenski's E. coli Data")lines(df2$time, df2$fit_linear, col ="blue", lwd =2)lines(df2$time, df2$fit_power, col ="red", lwd =2)lines(df2$time, df2$fit_hyperbolic, col ="green", lwd =2)legend("bottomright", legend =c("Linear", "Power", "Hyperbolic"),col =c("blue", "red", "green"), lwd =2, bty ="n")
#compare the modelsAIC(mod_linear, mod_power, mod_hyperbolic)
Interpretation 2.2 It looks like the power law model fits the data best since it has the lowest AIC value among the models. Since this model has a low AIC value, it indicates that the power law model fits the data well without overfitting. In terms of the biological explanation, it allows for fitness to increase more rapidly when adaptation is easier (early in the model), and allows fitness to keep increasing, just at a slower rate.
Question 2.3
# getting R^2 values to compare the models# store models in a listmodels <-list(linear = mod_linear,power = mod_power,hyperbolic = mod_hyperbolic)# get R^2 valuer_squared <-lapply(models, function(model) { pred <-predict(model)cor(df2$fitness, pred)^2})r_squared
The R^2 values align with the previous AIC results because the power law model has the R^2 value closest to 1, at 0.96. The other models produced slightly lower R^2 values ( linear = 0.77, and hyperbolic = 0.94). This indicates that the power model has a strong correlation between predicted and actual fitness values. The other models, mainly the hyperbolic model still had a decent fit but was not as strong of a correlation as the power law model. The linear model had the lowest R^2 value and the lowest correlation between the predicted actual fitness values.
Question 2.4
my_aicc <-function(mod_list, model_names, n, k) {#compute aic values from model list aic_vals <-sapply(mod_list, AIC)#compute AICc: AIC + (2k(k+1))/(n-k-1) aicc <- aic_vals + (2*k*(k+1))/ (n - k -1)#compute delta AICs: difference from best model delta_aicc <- aicc -min(aicc)#compute aic weights rel_likelihoods <-exp(-0.5* delta_aicc) weights_aicc <- rel_likelihoods /sum(rel_likelihoods)#return data as data framereturn(data.frame(Model = model_names,AICc =round(aicc, 4),Delta_AICc =round(delta_aicc, 4),AICc_weights =round(weights_aicc, 4)))}mod_list <-list(mod_linear, mod_power, mod_hyperbolic)model_names <-c("Linear", "Power Law", "Hyperbolic")n <-nrow(df2)k <-c(2,2,2) #number of parameters in each modelmy_aicc(mod_list, model_names, n, k)
Model AICc Delta_AICc AICc_weights
1 Linear -19.8211 70.9760 0
2 Power Law -90.7971 0.0000 1
3 Hyperbolic -69.3277 21.4694 0
Interpretation 2.4 The power law remains the best fit based on it having the lowest value for AICc. Moving to the next column, the delta AICc indicates how much worse the other models are, compared to the best, the power law model, which tells us that the linear model is a poor fit compared to the other models, and the hyperbolic model is not as poor as the linear model but is still not as a strong as the power law model. The AICc weight for the power law is at 1 (100%) which means it has the highest chance of being the best model for these data. The hyperbolic and linear models are less than 1, hovering around 0 indicating much less support for these models being the best fit for these data.