library(mgcv)
## Loading required package: nlme
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
Caffeine.df <- read.csv("Caffeine.csv")
## null model (order 0)
mod.0=glm(cbind(Agrade,n-Agrade)~1, family=binomial,
data =Caffeine.df)
## linear (order 1)
mod.1=glm(cbind(Agrade,n-Agrade)~caffeine,
family=binomial, data =Caffeine.df)
## quadratic (order 2)
mod.2=glm(cbind(Agrade,n-Agrade)~caffeine+I(caffeine^2),
family=binomial, data =Caffeine.df)
## cubic (order 2)
mod.3=glm(cbind(Agrade,n-Agrade)~caffeine +I(caffeine^2)+I(caffeine^3),
family=binomial, data =Caffeine.df)
mod.gam=gam(cbind(Agrade,n-Agrade)~s(caffeine),
family=binomial, data =Caffeine.df)
# look at null, order 1 and GAM fits (adapt this below )
plot(I(Agrade/n)~caffeine, ylim=c(0,.6),
main ="Proportion of A grades vs Caffeine", data=Caffeine.df)
# add lines
caffs=seq(0, 500, by=1)
new.df=data.frame(caffeine=caffs)
p.gam=predict(mod.gam, newdata=new.df,type="response")
lines(caffs, p.gam, col="blue")
p0=predict(mod.0, newdata=new.df, type="response")
lines(caffs, p0, col="red")
p1=predict(mod.1, newdata=new.df, type="response")
lines(caffs, p1, col="orange")
legend('topright', lty=1,lwd=3, col=c("blue", "red","orange") ,
legend=c("gam", "null","linear"))
The model above shows the relationship between coffee consumption and students test grade with three models including : null model, a linear model, and a generalized additive model (GAM) shown in blue. The GAM model seems to fit the model the best as it fits better to the non linear trend and does a good job of going through most the lines. GAM is a tighter fit compared compared to linear model and provides much more information that model null. From the output above we can conclude that the relationship between caffeine intake and test performance is not a linear relationship and moderate levels of caffeine intake will enhance the performance while high doses of caffeine would cause a negative effect on students performance.
anova(mod.0)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: cbind(Agrade, n - Agrade)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 10 69.358
anova(mod.0, mod.1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: cbind(Agrade, n - Agrade) ~ 1
## Model 2: cbind(Agrade, n - Agrade) ~ caffeine
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 10 69.358
## 2 9 18.625 1 50.733 1.058e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod.1, mod.2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: cbind(Agrade, n - Agrade) ~ caffeine
## Model 2: cbind(Agrade, n - Agrade) ~ caffeine + I(caffeine^2)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9 18.6245
## 2 8 7.6639 1 10.961 0.0009307 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod.2, mod.3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: cbind(Agrade, n - Agrade) ~ caffeine + I(caffeine^2)
## Model 2: cbind(Agrade, n - Agrade) ~ caffeine + I(caffeine^2) + I(caffeine^3)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 8 7.6639
## 2 7 5.1452 1 2.5186 0.1125
anova(mod.gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## cbind(Agrade, n - Agrade) ~ s(caffeine)
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(caffeine) 3.268 4.079 41.21 <2e-16
The null model is the simplest model with no subset, setting the base for the other models, Model 1 is a direct subset of null model plus an additional linear caffeine term, model 2 is the subset of model 1 and null with a quadratic term and lastly model 3 is the subset of model 1, 2 and null as it includes the linear, null, quadratic plus a cubic term, the GAM model does not include any subsets as it uses smoothing terms making it a non linear terms making it more complex compared to other models.
From the ANOVA table above we can analyse the results and and discuss which model is the best fit using Occam’s Razor. There is a significant improvement seen in from Null to linear model (model 0 and 1) by adding the caffeine term with a significant p value (p < 0.0001) and residual deviance reduction from 69.358 to 18.625. Linear to Quadratic model (model 1 and 2) has also a statically significant improvement after adding the quadratic term (caffeine^2) (p < 0.001), making the residual deviance to decrease from 18.625 to 7.6639. The Quadratic to cubic model (model 2 and 3) causes a reduction in the deviance from 7.6639 to 5.1452 by adding the cubic term (caffeine^3) however the P value (0.1125) indicates that the cubic term does not significantly improve the cubic model over quadratic. Given the results and the comments, by applying the Occam’s Razor method the Quadratic Model (model 2) seems to be the best model out of the ones tested as it provides reasonable complexity making it better than the linear and null model in statistically significant way without unnecessary complexity in the cubic model which shows no statistically significant improvement.
The GAM model shows strong significance for the smooth terms, showing a good fit. however due to the nature of GAM model and it complexity it would be hard to compare and interpret it compared to polynomial or linear models based on the ANOVA output as they have other terms that will make the model smooth.
aic_values <- list(
Null = AIC(mod.0),
Linear = AIC(mod.1),
Quadratic = AIC(mod.2),
Cubic = AIC(mod.3),
GAM = AIC(mod.gam)
)
print(aic_values)
## $Null
## [1] 104.6032
##
## $Linear
## [1] 55.87004
##
## $Quadratic
## [1] 46.90939
##
## $Cubic
## [1] 46.39077
##
## $GAM
## [1] 45.03682
The linear model shows improvements upon the Null model as it has a much lower AIC value. the Quadratic model has a lower AIC value compared to linear model suggesting a improvement over the linear model. The cubic model has a very slight decrease over the quadratic model indicating a marginal improvement over the quadratic model, however the effect is so marginal that it does not justify adding the cubic term and adding more complexity to the data. The GAM model has the lowest AIC compared to all the models indicating that it is the best choice in terms of complexity and fit among the models and capturing the underlying structure of the data making it the GAM the best model out of all the models above suggesting the most efficient and better explanation of the variability in the data while using less parameters. In terms of accuracy and statistical significance the GAM model would be the best however if the aim is the simplicity the quadratic model might be the choice.
library(MuMIn)
options(na.action = "na.fail")
msubset <- expression(dc(caffeine, `I(caffeine^2)`, `I(caffeine^3)`))
all.fits <- dredge(mod.3, subset=msubset)
## Fixed term is "(Intercept)"
print(all.fits
)
## Global model call: glm(formula = cbind(Agrade, n - Agrade) ~ caffeine + I(caffeine^2) +
## I(caffeine^3), family = binomial, data = Caffeine.df)
## ---
## Model selection table
## (Intrc) caffn caffn^2 caffn^3 df logLik AICc delta weight
## 4 -0.3974 0.004600 -2.762e-05 3 -20.455 50.3 0.00 0.777
## 8 -0.6506 0.014510 -8.991e-05 9.714e-08 4 -19.195 53.1 2.72 0.200
## 2 0.2385 -0.006442 2 -25.935 57.4 7.03 0.023
## 1 -1.1230 1 -51.302 105.0 54.71 0.000
## Models ranked by AICc(x)
The output above uses the dredge output from the “MuMIN” Package and shows a few models, each with different combinations of predictors such as caffeine (caffn), caffeine squared (caffn2), and caffeine cubed (caffn3). Model 4 seems to be the best model as it has the lowest AIC value of 50.3 among all the models; it includes caffn and caffn2 with the highest weight (0.777), meaning it has a 77.7% chance of being the best model. The caffeine variable is a linear term, and its presence (especially in model 4) indicts that there is a direct relation between the two caffeine intakes and test performance. Caffeine squared is another variable presented in model 4, which is a quadratic term, meaning the suggestion that there may be a non-linear relationship and the effects of caffeine might cause an increase and decrease in test performance at some points. However, there is an absence of the cubic term (caffeine^3) in model 4, suggesting that the term does not significantly improve in terms of the variability of the model, as we have seen in the previous questions too. In conclusion, model 4 has both linear and quadratic terms, and it is doing a good job capturing the relationship between caffeine intake and test performance without over fitting the data. it implies that caffeine effects grades however it is not as straightforward as a linear model and there is a certain threshold before caffeine can diminish students performance.
We assessed the relationship between caffeine and the students test performance using a few different models, we performed numerous tests, plots and functions and now I can conclude that the quadratic model is the best model as it does a good job covering the variability and going through most the points in the model, part B showed that the quadratic model is the most statistically significant compared to other models with the lowest residual deviance, part C showed that the quadratic is the best model to be selected from all the models even though the cubic and GAM model have lower AICs, this is because the improvement in model cubic is so minor that it does not justify the addition of a new cubic term and the GAM model is a complex model and cannot be used as the best model. Part D also suggested that the model with the quadratic term in it captures the essential dynamics between the relationship between caffeine intake and test performance with sufficient detail. The quadratic model in Part D has the lowest AIC, the highest weight (meaning the highest probability of it being the best model), and the lowest log likelihood value, all of which indicate that the quadratic is the best model possible. As mentioned before, the cubic model and the GAM model may have had a lower AIC value compared to the quadratic term, but the cubic model was not significant and showed very little improvement to the overall model, which did not justify the addition of the new term, and the GAM model can have risks such as overfitting and a lack of count for the number of model parameters due to the nature of the model’s complexity.
library(s20x)
model.matrix(mod.3)
## (Intercept) caffeine I(caffeine^2) I(caffeine^3)
## 1 1 0 0 0
## 2 1 50 2500 125000
## 3 1 100 10000 1000000
## 4 1 150 22500 3375000
## 5 1 200 40000 8000000
## 6 1 250 62500 15625000
## 7 1 300 90000 27000000
## 8 1 350 122500 42875000
## 9 1 400 160000 64000000
## 10 1 450 202500 91125000
## 11 1 500 250000 125000000
## attr(,"assign")
## [1] 0 1 2 3
pairs20x(model.matrix(mod.3)[,-1])
round(cor(model.matrix(mod.3)[,-1]),3)
## caffeine I(caffeine^2) I(caffeine^3)
## caffeine 1.000 0.963 0.909
## I(caffeine^2) 0.963 1.000 0.986
## I(caffeine^3) 0.909 0.986 1.000
mod.3a=glm(cbind(Agrade,n-Agrade)~poly(caffeine,3), family=binomial, data=Caffeine.df)
carelation_poly <- cor(model.matrix(mod.3a)[, -1])
pairs20x(model.matrix(mod.3a)[,-1])
print(carelation_poly)
## poly(caffeine, 3)1 poly(caffeine, 3)2 poly(caffeine, 3)3
## poly(caffeine, 3)1 1.000000e+00 -1.334517e-16 -7.684283e-18
## poly(caffeine, 3)2 -1.334517e-16 1.000000e+00 2.519415e-17
## poly(caffeine, 3)3 -7.684283e-18 2.519415e-17 1.000000e+00
predictions_mod3 <- predict(mod.3, type = "response")
predictions_mod3a <- predict(mod.3a, type = "response")
# Comparing predictions
identical(round(predictions_mod3, 10), round(predictions_mod3a, 10))
## [1] TRUE
diffrences <- (predictions_mod3 - predictions_mod3a)
max_diff <- max(abs(predictions_mod3 - predictions_mod3a))
print(diffrences)
## 1 2 3 4 5
## -5.551115e-17 -1.665335e-16 -4.440892e-16 -2.220446e-16 1.665335e-16
## 6 7 8 9 10
## 4.163336e-16 6.383782e-16 4.440892e-16 1.595946e-16 0.000000e+00
## 11
## -4.510281e-17
print(max_diff)
## [1] 6.383782e-16
There summary output gives us the values related to caffeine, caffeine^2, caffeine^3 and their coefficients. The values (0.96, 0.91, 0.99) suggest high levels of multidisciplinary between these predictors. The second output shows a histogram of the three models showing and comparing their distributions and relationships, which in this case they all seem similar. The last output is orthogonal polynomial generated with poly() function. The poly() function above transforms mod.3a to orthogonal polynomial making all the models independent and non similar suggesting zero multicollinearity unlike the second output.
caffeines <- c(0, 50, 100, 150, 200)
students_n <- rep(300, 5)
a_grades <- c(109, 155, 175, 158, 103)
caff_2 <- data.frame(caffeine = caffeines,n = students_n, Agrade = a_grades)
mod.quad <- glm(cbind(Agrade,n-Agrade)~caffeine +I(caffeine^2),
family=binomial, data =caff_2)
summary(mod.quad)
##
## Call:
## glm(formula = cbind(Agrade, n - Agrade) ~ caffeine + I(caffeine^2),
## family = binomial, data = caff_2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.805e-01 1.128e-01 -5.146 2.66e-07 ***
## caffeine 1.840e-02 2.651e-03 6.940 3.93e-12 ***
## I(caffeine^2) -9.327e-05 1.272e-05 -7.331 2.28e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 55.54424 on 4 degrees of freedom
## Residual deviance: 0.22086 on 2 degrees of freedom
## AIC: 36.793
##
## Number of Fisher Scoring iterations: 3
p.quad=predict(mod.quad, newdata=new.df,type="response")
plot(I(Agrade/n)~caffeine, ylim=c(0,.6), data=caff_2)
caffs=seq(0, 500, by=1)
new.df=data.frame(caffeine=caffs)
p.quad=predict(mod.quad, newdata=new.df,type="response")
lines(caffs, p.quad, col="blue")
vcov(mod.quad)
## (Intercept) caffeine I(caffeine^2)
## (Intercept) 1.272482e-02 -2.226622e-04 8.287286e-07
## caffeine -2.226622e-04 7.027689e-06 -3.232232e-08
## I(caffeine^2) 8.287286e-07 -3.232232e-08 1.618650e-10
# Calculate deviance residuals and plot them
residuals_dev <- residuals(mod.quad, type = "deviance")
plot(caff_2$caffeine, residuals_dev, xlab = "Caffeine Levels (mg)", ylab = "Deviance Residuals",
main = "Residuals vs. Caffeine Levels")
abline(h = 0, col = "red", lty = 1)
The output above is showing a quadratic logistic regression model of the
A grade performance students based on different on amounts of different
caffeine dosages. the results above show that the quadratic model is
performing well and we see the peak of the performance at around 100 mgs
of coffee, this means that this is the optimal amount of caffeine for
getting an A grade and anything higher or lower will not have the same
efficiency on students performance. There is a residual plot added as
well to compare the residuals, the residual plot does not indicate any
major outlines and concerns as most of the residuals are evenly
distributed fairly close to zero along the caffeine levels. The summary
output indicates that all coefficient of the model are significantly
significant (p < 0.05), while the linear and intercept are negative
the quadratic term is positive this means that the quadratic model is
the right fit. The residual deviance is also low compared to the null
model as well as the AIC levels which are relatively low meaning the
model is adequate in terms of explaining the variability.
beta <- coef(mod.quad)
beta1 <- beta["caffeine"]
beta2 <- beta["I(caffeine^2)"]
x_peak <- -beta1 / (2 * beta2)
delta_g <- matrix(c(0, -1/(2*beta2), beta1/(2*beta2^2)), nrow = 3)
var_beta <- vcov(mod.quad)
var_x_peak <- t(delta_g) %*% var_beta %*% delta_g
se_x_peak <- sqrt(var_x_peak)
print(paste("Estimated x_peak:", x_peak))
## [1] "Estimated x_peak: 98.6170572087436"
print(paste("Variance of x_peak:", var_x_peak))
## [1] "Variance of x_peak: 16.5036225318431"
print(paste("Standard Error of x_peak:", se_x_peak))
## [1] "Standard Error of x_peak: 4.06246508069215"
The output above is the calculation for caffeine peak levels and its properties in terms of how and what precise levels of caffeine can enhance students performance in terms of achieving A grades. The estimate of caffeine peak is the estimated caffeine levels (Mgs) that maximise the probability of a student achieving an A grade; it is calculated by dividing beta 1 by 2 multiplied by beta 2. The standard error is the square root of the variance, and it is an indication of the caffeine peak level accuracy. The results above show that approximately 98.63 mg is the peak of caffeine levels, meaning that students have the best performance at that level. This is consistent with what was seen in the last part of the plot. The standard error suggests that the precision of the x caffeine peak is around 4.06 mg, showing high precision, and it will be close to the estimate in most cases. However, a 95% confidence interval would be a good addition for more accuracy as each student can have a different tolerance level and reaction to different caffeine levels.
beta <- coef(mod.quad)
beta1 <- beta["caffeine"]
beta2 <- beta["I(caffeine^2)"]
Delta.g <- c(0, -1/(2 * beta2), beta1 / (2 * beta2^2))
print(Delta.g)
## I(caffeine^2) caffeine
## 0.000 5360.597 1057292.592
The output above shows the values from a matrix with two main columns of caffeine and caffeine squared terms. The 0.00 under I(caffeine^2) indicates that there is no contribution of from the intercept term to the variance calculation of caffeine peak in this case of matrix meaning that the intercept does not have any contribution. The numbers (large) in I(caffeine^2) and caffeine are large enough to show that there is considerable variability between the coefficients that contribute to the x peak of caffeine meaning that any small or big changes in both these terms can lead to large changes to the estimated peak caffeine level.
var_peak <- t(Delta.g) %*% vcov(mod.quad) %*% Delta.g
print(var_peak)
## [,1]
## [1,] 16.50362
The output above is the calculated variance of x peak of caffeine. This is estimated by using the delta method, an approach that will allow us to estimate the variance function of multiple variables. 16.5 is simply the measure of how far the values are from the mean of 98.6. This also shows us the uncertainty and the accuracy of the model as it shows us how spread the values can be.
ci_lower <- x_peak - 1.96 * se_x_peak
ci_upper <- x_peak + 1.96 * se_x_peak
print(paste(ci_lower, ci_upper))
## [1] "90.654625650587 106.5794887669"
The output above provides us with the 95% confidence interval as the question asked us to. The confidence interval suggests that 95% of our values should fall between 90.65 and 106.57 mgs of caffeine for the optimal result of a A grade score, this includes our x peak caffeine level of 98.6 that falls right in our 95% confidence interval. Our confidence interval will also measures levels of uncertainty as unusually when the confidence interval values are too far apart it may mean that the model is not precise which is not the case in here.
ns <- caff_2$n
xs <- caff_2$caffeine
preds <- predict(mod.quad, type = "response")
ys <- rbinom(length(ns), size = ns, prob = preds)
print(ys)
## [1] 100 163 178 157 89
The output above shows the simulated counts of students achieving A grades under the influence of caffeine at different levels. The counts are achieved using a GLM, specifically a quadratic model and a nonlinear relationship, as the counts are different depending on the level. The numbers above represent the simulated counts of grades using a binomial distribution. There is also a normal distribution as there is a gradual increase and decrease in the values, possibly due to the nature of the model and the use of binomial or logistic regression.
library(MASS)
x <- seq(from = min(xs), to = max(xs), length.out = length(xs))
n <- rep(100, length(x))
b0 <- -5.805e-01
b1 <- 1.840e-02
b2 <- -9.327e-05
probs <- plogis(b0 + b1 * x + b2 * x^2)
n.sims <- 1000
xpeaks <- numeric(n.sims)
devs <- numeric(n.sims)
for (i in 1:n.sims) {
ys <- rbinom(length(x), size = n, prob = probs)
fit <- glm(cbind(ys, n - ys) ~ x + I(x^2), family = binomial)
coef_fit <- coef(fit)
x_peak <- -coef_fit[2] / (2 * coef_fit[3])
xpeaks[i] <- x_peak
devs[i] <- deviance(fit)
}
print(summary(xpeaks))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.22 93.95 98.46 98.34 102.66 125.72
print(summary(devs))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003411 0.673409 1.470815 2.052441 2.803747 11.919967
hist(xpeaks, breaks = 30, freq = FALSE, main = "Histogram of xpeaks with Normal Curve Overlay", xlab = "xpeaks", col = "light blue")
curve(dnorm(x, mean = mean(xpeaks, na.rm = TRUE), sd = sd(xpeaks, na.rm = TRUE)), add = TRUE, col = "red", lwd = 2)
quantiles_xpeaks <- quantile(xpeaks, probs = c(0.025, 0.975), na.rm = TRUE)
print(quantiles_xpeaks)
## 2.5% 97.5%
## 84.05128 113.50258
qqnorm(residuals(fit, type = "deviance"), main = "Q-Q Plot of Model Residuals")
qqline(residuals(fit, type = "deviance"), col = "red")
The r chuck above produces a histogram with normal curve showing the
distribution of peak values. The histogram shows that the xpeak values
are normally distributed and there is nothing unusual. The qq-plot shows
that the deviance residuals from our logistic model match with the
normal distribution and most points follow the pattern of the line
meaning that the residuals do not differ significantly from the
normality. The models above similarly to question 2 showed that there is
normal distribution and the 95% confidence interval [82.80374,
114.30148] is very similar to the 95% confidence interval range in
question 2 as well (“90.654625650587 106.5794887669”)
hist(devs, breaks = 100, prob = TRUE, main = "Histogram of Deviances with Chi-squared Fit", xlab = "Deviances")
n <- length(ns)
p <- 3
df <- n - p
dvs <- sort(devs)
lines(dvs, dchisq(dvs, df), col = "blue")
The histogram above shows the chi-square distribution and it indicates a
good fit aligned with observations distributions and deviance. The
deviance are more scattered and centered around lower values and they
decline rapidly, the data however may not meet the assumptions
underlying the chi square test of deviance including independence of
observations and the high volume of low deviance values suggests that
the model sits close to the observed data which may be caused by over
fitting and the long tail of the model suggests that the there may be
some overdispertion presented in the outlines affecting the higher
deviance values.