This data set was obtained from [Data World](https://data.world.com). It consists of information gathered from two schools in Portugal about students' habits and lives outside of school, collected in order to see what impact these external factors might have on their final grade in mathematics. The data was collected through school surveys and questionnaires.
The question this analysis tries to answer is: what is the association between students' study habits, previous grades, and lives outside of school and their final math grade? Another question is how well each of the selected variables predicts a student's final grade in math, and whether one of them stands out as the best predictor. Finally, which model is best at predicting which students will have the highest math grades?
First, the data is loaded. The raw file is semicolon-delimited, so it is read with sep = ";".
students <- read.table("https://raw.githubusercontent.com/AvaDeSt/STA-321/refs/heads/main/student-mat.csv", sep = ";", header = TRUE)
# Fit an initial model regressing the final math grade (G3) on all of the candidate numeric predictors.
model = lm(G3 ~ Medu + Fedu + traveltime + failures + freetime + goout + Walc + Dalc + famrel + absences + health + studytime + G1 + G2, data = students)
kable(summary(model)$coef, caption ="Statistics of Regression Coefficients")
|             |   Estimate | Std. Error |    t value | Pr(>\|t\|) |
|-------------|-----------:|-----------:|-----------:|-----------:|
| (Intercept) | -3.5970277 |  0.8245461 | -4.3624334 |  0.0000166 |
| Medu        |  0.1294986 |  0.1154386 |  1.1217965 |  0.2626573 |
| Fedu        | -0.1267697 |  0.1148130 | -1.1041406 |  0.2702308 |
| traveltime  |  0.1270425 |  0.1427922 |  0.8897017 |  0.3741890 |
| failures    | -0.2918970 |  0.1440120 | -2.0268930 |  0.0433711 |
| freetime    |  0.0629164 |  0.1042155 |  0.6037150 |  0.5463933 |
| goout       | -0.0164534 |  0.1006105 | -0.1635356 |  0.8701837 |
| Walc        |  0.1526548 |  0.1073597 |  1.4219008 |  0.1558753 |
| Dalc        | -0.1465988 |  0.1434807 | -1.0217314 |  0.3075581 |
| famrel      |  0.3325243 |  0.1099238 |  3.0250450 |  0.0026550 |
| absences    |  0.0367829 |  0.0122202 |  3.0099975 |  0.0027869 |
| health      |  0.0702189 |  0.0701544 |  1.0009194 |  0.3175027 |
| studytime   | -0.1498508 |  0.1213265 | -1.2351029 |  0.2175553 |
| G1          |  0.1395635 |  0.0562891 |  2.4794077 |  0.0135929 |
| G2          |  0.9908354 |  0.0499823 | 19.8237135 |  0.0000000 |
# Create GA, the average of the first- and second-period grades, and keep only the variables retained for the reduced model.
Students <- students %>%
  mutate(GA = (G1 + G2) / 2)
Students.num <- select(Students, "absences", "famrel", "failures", "GA", "G3")
As part of my data preparation, I looked over all of the numerical variables in the data set and kept only those with significant p-values: the number of classes failed previously, the quality of the student's family relationships, the number of absences, and the first- and second-period grades. To simplify the model, I combined the first- and second-period grades into a new variable, GA, equal to their average. I tried combining several other variables into new ones, but none of them held any statistical significance.
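As a hedged illustration of that screening step (not the exact code used above), the coefficient table from the initial fit can be filtered by p-value directly; the 0.05 cutoff is my assumption.
# A minimal sketch of the p-value screening, assuming the initial fit "model" from above;
# the 0.05 threshold is an assumed cutoff.
coef.tbl <- summary(model)$coef
rownames(coef.tbl)[coef.tbl[, "Pr(>|t|)"] < 0.05]
# returns the intercept plus failures, famrel, absences, G1, and G2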
full.model = lm(G3 ~ absences + famrel + GA + failures, data = Students.num)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
|             |   Estimate | Std. Error |   t value | Pr(>\|t\|) |
|-------------|-----------:|-----------:|----------:|-----------:|
| (Intercept) | -3.5358762 |  0.6143636 | -5.755348 |  0.0000000 |
| absences    |  0.0387264 |  0.0130061 |  2.977546 |  0.0030870 |
| famrel      |  0.2659179 |  0.1159748 |  2.292894 |  0.0223857 |
| GA          |  1.1802399 |  0.0327789 | 36.006051 |  0.0000000 |
| failures    | -0.2361464 |  0.1505524 | -1.568533 |  0.1175677 |
par(mfrow=c(2,2))
plot(full.model)
The Q-Q plot indicates that the residuals are not quite normally distributed. The variance is also not constant, with the points clustered more tightly toward the right side of the residuals-vs-fitted plot. There also appears to be one outlier at the far right of the lower-right (residuals vs. leverage) plot.
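To supplement the visual diagnostics, formal tests of these assumptions could also be run. This is only a sketch; the Breusch-Pagan test assumes the lmtest package, which is not otherwise used in this report.
# Sketch of formal checks of the assumptions noted above.
shapiro.test(residuals(full.model))   # Shapiro-Wilk test of residual normality
lmtest::bptest(full.model)            # Breusch-Pagan test of constant variance (requires lmtest)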
vif(full.model)
absences famrel GA failures
1.005934 1.004011 1.158279 1.163780
barplot(vif(full.model), main = "VIF Values", horiz = FALSE, col = "steelblue")
Since all of the VIF values are close to 1 and do not exceed 4,
multicollinearity is not an issue.
To help correct the non-constant variance, I am going to perform a Box-Cox transformation.
par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
Students.num$G3_adjusted <- Students.num$G3 + 3
boxcox(G3_adjusted ~ absences + famrel + GA + log(failures +1)
, data = Students.num, lambda = seq(0, 1, length = 10),
xlab=expression(paste(lambda, ": log-failures")))
boxcox(G3_adjusted ~ absences+ famrel + GA + failures
, data = Students.num, lambda = seq(0, 1, length = 10),
xlab=expression(paste(lambda, ": failures")))
boxcox(G3_adjusted ~ absences + log(famrel) + GA + failures
, data = Students.num, lambda = seq(0, 1, length = 10),
xlab=expression(paste(lambda, ": log-famrel")))
boxcox(G3_adjusted ~ absences+ famrel + GA + failures
, data = Students.num, lambda = seq(0, 1, length = 10),
xlab=expression(paste(lambda, ": famrel")))
In order to run the Box-Cox procedure, I had to shift my response variable "G3" so that it is strictly positive. Likewise, when taking the log of "failures", I had to add 1 so that its values stayed positive.
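The transformation power itself can be read off the Box-Cox output: boxcox() returns the lambda grid and the corresponding log-likelihoods, so the maximizing value can be extracted directly. The sketch below (my own addition, using the same shifted response) shows how; the square-root model fitted next suggests the estimate lands near 0.5.
# Sketch: extract the likelihood-maximizing lambda instead of reading it off the plot.
bc <- boxcox(G3_adjusted ~ absences + famrel + GA + failures,
             data = Students.num, lambda = seq(0, 1, length = 100), plotit = FALSE)
bc$x[which.max(bc$y)]   # estimated lambda; a value near 0.5 supports a square-root transformation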
sqrt.G3.log.fa = lm((G3_adjusted)^0.5 ~ absences + log(famrel) + GA + failures, data = Students.num)
kable(summary(sqrt.G3.log.fa)$coef, caption = "Square-root-transformed model with log(famrel)")
|             |   Estimate | Std. Error |   t value | Pr(>\|t\|) |
|-------------|-----------:|-----------:|----------:|-----------:|
| (Intercept) |  1.5080307 |  0.1202400 | 12.541834 |  0.0000000 |
| absences    |  0.0099199 |  0.0024872 |  3.988330 |  0.0000795 |
| log(famrel) |  0.1182455 |  0.0677417 |  1.745534 |  0.0816795 |
| GA          |  0.1745119 |  0.0062707 | 27.829736 |  0.0000000 |
| failures    | -0.0599612 |  0.0288019 | -2.081847 |  0.0380080 |
par(mfrow = c(2,2))
plot(sqrt.G3.log.fa)
As with the Box-Cox step, I had to use the shifted, positive version of G3 before taking the log.
log.G3 = lm(log(G3_adjusted) ~ absences + failures + famrel + G3 , data = Students.num)
kable(summary(log.G3)$coef, caption = "log-transformed model")
|             |   Estimate | Std. Error |    t value | Pr(>\|t\|) |
|-------------|-----------:|-----------:|-----------:|-----------:|
| (Intercept) |  1.3702189 |  0.0379432 | 36.1123735 |  0.0000000 |
| absences    |  0.0054941 |  0.0009126 |  6.0203578 |  0.0000000 |
| failures    | -0.0054064 |  0.0105155 | -0.5141379 |  0.6074468 |
| famrel      |  0.0016374 |  0.0081272 |  0.2014675 |  0.8404381 |
| G3          |  0.1051142 |  0.0017053 | 61.6397733 |  0.0000000 |
par(mfrow = c(2,2))
plot(log.G3)
## Goodness-of-Fit Measures
# Computes goodness-of-fit measures (SSE, R-squared, adjusted R-squared, Cp,
# AIC, SBC, and PRESS) for a fitted lm object. Note that this function shares
# its name with dplyr::select(), so it is defined after the data-preparation step above.
select = function(m){
  e = m$resid
  n0 = length(e)
  SSE = (m$df)*(summary(m)$sigma)^2
  R.sq = summary(m)$r.squared
  R.adj = summary(m)$adj.r.squared
  MSE = (summary(m)$sigma)^2
  Cp = (SSE/MSE) - (n0 - 2*(n0 - m$df))
  AIC = n0*log(SSE) - n0*log(n0) + 2*(n0 - m$df)
  SBC = n0*log(SSE) - n0*log(n0) + (log(n0))*(n0 - m$df)
  X = model.matrix(m)
  H = X%*%solve(t(X)%*%X)%*%t(X)   # hat matrix
  d = e/(1 - diag(H))              # leave-one-out (PRESS) residuals
  PRESS = t(d)%*%d
  tbl = as.data.frame(cbind(SSE = SSE, R.sq = R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
  names(tbl) = c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
  tbl
}
output.sum = rbind(select(full.model), select(sqrt.G3.log.fa), select(log.G3))
row.names(output.sum) = c("full.model", "sqrt.G3.log.fa", "log.G3")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
|                |         SSE |      R.sq |     R.adj | Cp |        AIC |        SBC |       PRESS |
|----------------|------------:|----------:|----------:|---:|-----------:|-----------:|------------:|
| full.model     | 1655.021182 | 0.7998743 | 0.7978217 |  5 |   575.9099 |   595.8043 | 1706.276101 |
| sqrt.G3.log.fa |   60.549357 | 0.7114041 | 0.7084442 |  5 |  -730.7936 |  -710.8992 |   62.617761 |
| log.G3         |    8.116601 | 0.9199537 | 0.9191327 |  5 | -1524.5648 | -1504.6704 |    8.442339 |
The best model to use for this data set is the full model, even though it only has the second-highest R-squared value. I originally planned to select the log-transformed model as my final model because it has the highest R-squared value, but after looking at its residual plot and other diagnostics I believe it has too many violations: the residuals follow a strange V-shaped pattern, and the data appear neither normal nor linear. The log-transformed model also has two non-significant p-values that do not appear in the other two models. So until these problems can be corrected, the full model will be used as the final model.
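One caveat is that these R-squared values are computed on different response scales (raw G3, its square root, and its log), so they are not directly comparable. A hedged way to compare the models on a common scale is to back-transform each model's fitted values to the original G3 scale first; the helper function below is my own sketch and assumes the three fitted objects above are still in memory.
# Sketch: compare the models on the original G3 scale by back-transforming their fits.
r2.orig <- function(fit.g3, g3) 1 - sum((g3 - fit.g3)^2) / sum((g3 - mean(g3))^2)
r2.orig(fitted(full.model), Students.num$G3)             # already on the G3 scale
r2.orig(fitted(sqrt.G3.log.fa)^2 - 3, Students.num$G3)   # undo the square root, then the +3 shift
r2.orig(exp(fitted(log.G3)) - 3, Students.num$G3)        # undo the log, then the +3 shift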
kable(summary(full.model)$coef, caption = "Inferential Statistics of Final Model")
|             |   Estimate | Std. Error |   t value | Pr(>\|t\|) |
|-------------|-----------:|-----------:|----------:|-----------:|
| (Intercept) | -3.5358762 |  0.6143636 | -5.755348 |  0.0000000 |
| absences    |  0.0387264 |  0.0130061 |  2.977546 |  0.0030870 |
| famrel      |  0.2659179 |  0.1159748 |  2.292894 |  0.0223857 |
| GA          |  1.1802399 |  0.0327789 | 36.006051 |  0.0000000 |
| failures    | -0.2361464 |  0.1505524 | -1.568533 |  0.1175677 |
The model can be written as the following:
G3 = -3.54 + 0.04(absences) + 0.27(famrel) + 1.18(GA) - 0.24(failures)
Because the data violate the normality assumption, I will next construct 95% bootstrap confidence intervals for the coefficients of the full model to check the reliability of the parametric results.
# Case bootstrap: resample observations (rows) with replacement B times and
# refit the full model on each bootstrap sample, storing the coefficients.
B = 1000
num.p = dim(model.frame(full.model))[2]
smpl.n = dim(model.frame(full.model))[1]
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)
for (i in 1:B){
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE)
  full.model.btc = lm(G3 ~ absences + famrel + failures + GA, data = Students.num[bootc.id,])
  coef.mtrx[i,] = coef(full.model.btc)
}
# 95% bootstrap confidence intervals taken from the empirical 2.5% and 97.5%
# quantiles of the bootstrap coefficient estimates.
cmtrx <- summary(full.model)$coef
num.p = dim(coef.mtrx)[2]
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2), 4)
  uci.975 = round(quantile(coef.mtrx[, i], 0.975, type = 2), 4)
  btc.wd[i] = uci.975 - lci.025
  btc.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci)),
caption = "Regression Coefficient Matrix")
|             | Estimate | Std. Error | t value | Pr(>\|t\|) | btc.ci.95 |
|-------------|---------:|-----------:|--------:|-----------:|-----------------------|
| (Intercept) |  -3.5359 |     0.6144 | -5.7553 |     0.0000 | [ -4.9546 , -2.2621 ] |
| absences    |   0.0387 |     0.0130 |  2.9775 |     0.0031 | [ 0.0177 , 0.0776 ]   |
| famrel      |   0.2659 |     0.1160 |  2.2929 |     0.0224 | [ 0.0378 , 0.5129 ]   |
| GA          |   1.1802 |     0.0328 | 36.0061 |     0.0000 | [ -0.6599 , 0.1241 ]  |
| failures    |  -0.2361 |     0.1506 | -1.5685 |     0.1176 | [ 1.1214 , 1.2376 ]   |
The bootstrap confidence intervals look mostly reasonable, but the one for "GA" contains 0, which might indicate that the model is not a good predictor of the final grade. To help visualize the issue, I am going to look at histograms of the bootstrap coefficient estimates.
# Histogram of the bootstrap estimates for one coefficient, overlaid with a
# normal density (red) and a kernel density estimate (blue).
boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm){
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length = 300)
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
  ylimit <- max(c(y1.1, highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab = "",
       col = "azure1", ylim = c(0, ylimit), border = "lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust = 2), col = "blue")
}
par(mfrow = c(2, 3))
# GA is the fifth coefficient in the bootstrap model (intercept, absences, famrel, failures, GA).
boot.hist(bt.coef.mtrx = coef.mtrx, var.id = 5, var.nm = "GA")
Nothing seems off about the histogram or the blue density curve of the bootstrap estimates. I tried increasing the number of bootstrap replicates but still got the same results.
# Repeat the case bootstrap with B = 2000 replicates to check whether the
# intervals are sensitive to the number of bootstrap samples.
B = 2000
num.p = dim(model.frame(full.model))[2]
smpl.n = dim(model.frame(full.model))[1]
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)
for (i in 1:B){
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE)
  full.model.btc = lm(G3 ~ absences + famrel + failures + GA, data = Students.num[bootc.id,])
  coef.mtrx[i,] = coef(full.model.btc)
}
cmtrx <- summary(full.model)$coef
num.p = dim(coef.mtrx)[2]
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2), 4)
  uci.975 = round(quantile(coef.mtrx[, i], 0.975, type = 2), 4)
  btc.wd[i] = uci.975 - lci.025
  btc.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci)),
caption = "Regression Coefficient Matrix")
|             | Estimate | Std. Error | t value | Pr(>\|t\|) | btc.ci.95 |
|-------------|---------:|-----------:|--------:|-----------:|-----------------------|
| (Intercept) |  -3.5359 |     0.6144 | -5.7553 |     0.0000 | [ -4.9538 , -2.2056 ] |
| absences    |   0.0387 |     0.0130 |  2.9775 |     0.0031 | [ 0.0173 , 0.0755 ]   |
| famrel      |   0.2659 |     0.1160 |  2.2929 |     0.0224 | [ 0.0395 , 0.5074 ]   |
| GA          |   1.1802 |     0.0328 | 36.0061 |     0.0000 | [ -0.6385 , 0.1252 ]  |
| failures    |  -0.2361 |     0.1506 | -1.5685 |     0.1176 | [ 1.1218 , 1.2365 ]   |
model.resid = full.model$residuals
# Residual bootstrap: resample the model residuals, add them to the fitted
# values to form new responses, and refit the model B times.
B = 1000
num.p = dim(model.matrix(full.model))[2]
samp.n = dim(model.matrix(full.model))[1]
btr.mtrx = matrix(rep(0, B*num.p), ncol = num.p)
for (i in 1:B){
  bt.lg.fm = full.model$fitted.values +
    sample(full.model$residuals, samp.n, replace = TRUE)
  Students.num$bt.lg.fm = bt.lg.fm
  btr.model = lm(bt.lg.fm ~ absences + famrel + GA + failures, data = Students.num)
  btr.mtrx[i,] = btr.model$coefficients
}
num.p = dim(coef.mtrx)[2]
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2), 4)
  uci.975 = round(quantile(btr.mtrx[, i], 0.975, type = 2), 4)
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btr.ci.95=btr.ci)),
caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")
|             | Estimate | Std. Error | t value | Pr(>\|t\|) | btr.ci.95 |
|-------------|---------:|-----------:|--------:|-----------:|-----------------------|
| (Intercept) |  -3.5359 |     0.6144 | -5.7553 |     0.0000 | [ -4.6452 , 0 ]       |
| absences    |   0.0387 |     0.0130 |  2.9775 |     0.0031 | [ 0 , 0.0601 ]        |
| famrel      |   0.2659 |     0.1160 |  2.2929 |     0.0224 | [ 0 , 0.47 ]          |
| GA          |   1.1802 |     0.0328 | 36.0061 |     0.0000 | [ 0 , 1.2384 ]        |
| failures    |  -0.2361 |     0.1506 | -1.5685 |     0.1176 | [ -0.5352 , 0.0506 ]  |
Even though residual bootstrap confidence intervals are not always recommended, I wanted to compare them to the case bootstrap above. We see from the table above that the residual bootstrap confidence intervals all contain 0, which can indicate that there is not a lot of bias in the model.
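The interval widths btc.wd and btr.wd are computed in the loops above but never displayed; a quick side-by-side comparison of the two bootstrap methods could be tabulated as in the sketch below (my addition, assuming both loops have been run).
# Sketch: compare the widths of the case-bootstrap and residual-bootstrap intervals.
width.tbl <- data.frame(coefficient = rownames(cmtrx),
                        case.bootstrap.width = btc.wd,
                        residual.bootstrap.width = btr.wd)
kable(width.tbl, caption = "Widths of the 95% Bootstrap Confidence Intervals")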
kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"), btc.ci.95=btc.ci,btr.ci.95=btr.ci)),
caption="Final Combined Inferential Statistics: p-values and Bootstrap CIs")
|             | Estimate | Std. Error | Pr(>\|t\|) | btc.ci.95 | btr.ci.95 |
|-------------|---------:|-----------:|-----------:|-----------------------|----------------------|
| (Intercept) |  -3.5359 |     0.6144 |     0.0000 | [ -4.9538 , -2.2056 ] | [ -4.6452 , 0 ]      |
| absences    |   0.0387 |     0.0130 |     0.0031 | [ 0.0173 , 0.0755 ]   | [ 0 , 0.0601 ]       |
| famrel      |   0.2659 |     0.1160 |     0.0224 | [ 0.0395 , 0.5074 ]   | [ 0 , 0.47 ]         |
| GA          |   1.1802 |     0.0328 |     0.0000 | [ -0.6385 , 0.1252 ]  | [ 0 , 1.2384 ]       |
| failures    |  -0.2361 |     0.1506 |     0.1176 | [ 1.1218 , 1.2365 ]   | [ -0.5352 , 0.0506 ] |
kable(cmtrx, caption = "Inferential Statistics of Final Model")
|             |   Estimate | Std. Error |   t value | Pr(>\|t\|) |
|-------------|-----------:|-----------:|----------:|-----------:|
| (Intercept) | -3.5358762 |  0.6143636 | -5.755348 |  0.0000000 |
| absences    |  0.0387264 |  0.0130061 |  2.977546 |  0.0030870 |
| famrel      |  0.2659179 |  0.1159748 |  2.292894 |  0.0223857 |
| GA          |  1.1802399 |  0.0327789 | 36.006051 |  0.0000000 |
| failures    | -0.2361464 |  0.1505524 | -1.568533 |  0.1175677 |
The bootstrap and parametric approaches gave similar results, but I am choosing the bootstrap method to report the final model because my data are not normally distributed and the sample contains enough observations for the bootstrap to be applied. The final model can be described as:
G3 = -3.536 + 0.039(absences) + 0.266(famrel) + 1.18(GA) - 0.236(failures)
From this model, we can see that the average of the first- and second-period grades, the number of absences, and the family relationship quality are positively associated with the final math grade, while the number of past class failures tends to lower it. When the number of failures increases by one, the predicted final math grade decreases by 0.236 points; when the average of the first- and second-period grades (GA) increases by one point, the predicted final grade increases by 1.18 points.
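As a worked example with made-up values, consider a hypothetical student with 4 absences, a family relationship score of 4, an average prior grade of 12, and no past failures:
# Hypothetical student used only for illustration.
new.student <- data.frame(absences = 4, famrel = 4, GA = 12, failures = 0)
predict(full.model, newdata = new.student)
# By hand: -3.536 + 0.039*4 + 0.266*4 + 1.180*12 - 0.236*0 is roughly 11.8 points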
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci)),
caption = "Inferential Statistics of Final Model")
|             | Estimate | Std. Error | t value | Pr(>\|t\|) | btc.ci.95 |
|-------------|---------:|-----------:|--------:|-----------:|-----------------------|
| (Intercept) |  -3.5359 |     0.6144 | -5.7553 |     0.0000 | [ -4.9538 , -2.2056 ] |
| absences    |   0.0387 |     0.0130 |  2.9775 |     0.0031 | [ 0.0173 , 0.0755 ]   |
| famrel      |   0.2659 |     0.1160 |  2.2929 |     0.0224 | [ 0.0395 , 0.5074 ]   |
| GA          |   1.1802 |     0.0328 | 36.0061 |     0.0000 | [ -0.6385 , 0.1252 ]  |
| failures    |  -0.2361 |     0.1506 | -1.5685 |     0.1176 | [ 1.1218 , 1.2365 ]   |
To conclude, I only considered explanatory variables with significant p-values. I created a total of three models: the full model, the square-root-transformed model, and the log-transformed model, each containing the same number of variables. Even though it ended up being the best choice, the full model had several violations: its variance was not constant, which the Box-Cox transformation was meant to correct, and the data were not normal, a violation that remains uncorrected. Also, the variable “failures” is not statistically significant like the others. The full model had an R-squared value of about 0.80. Even though the log-transformed model had a higher R-squared value, I selected the full model for the second part of the assignment.
For the second part of the assignment, I constructed 95% bootstrap confidence intervals for the coefficients and for the residuals. I concluded that the case bootstrap of the coefficients would be the best way to evaluate the full model, despite the fact that the bootstrap confidence interval for the variable “GA” contains 0. Recommendations for pursuing this project further include working out how the log-transformed model could be used as the final model, since it has a higher R-squared value. I would also either use different variables or transform the existing ones so that the bootstrap confidence interval for GA no longer contains 0. Finally, I would bring in some of the variables from another data set on data.world.com, produced by the same author from the same schools and surveys, that records the same students' final English grades; a student's final grade in English is likely a good predictor of their final math grade. A rough sketch of that merge is shown below.
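The sketch only illustrates the idea; the file name and the identifying columns used for the join are assumptions, since the companion data set would need to be inspected first.
# Sketch of the proposed merge; the file name and join keys are hypothetical placeholders.
students.eng <- read.table("student-eng.csv", sep = ";", header = TRUE)
merged <- merge(students, students.eng,
                by = c("school", "sex", "age", "address", "Medu", "Fedu"),
                suffixes = c(".math", ".eng"))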