This data set was obtained from [Data World](https://data.world.com). It contains information on students from two schools in Portugal, covering their study habits and lives outside of school, and was gathered to see what impact these external factors might have on their final grade in mathematics. The data was collected through school surveys and questionnaires.
The question this analysis tries to answer is: what is the association between students' study habits and lives outside of school and their final math grade?
First, the data is loaded.
# The file is semicolon-delimited, so it is read with sep = ";" (a comma-separated read would collapse every field into one column).
students <- read.table("https://raw.githubusercontent.com/AvaDeSt/STA-321/refs/heads/main/student-mat.csv", sep = ";", header = TRUE)
model = lm(G3 ~ Medu + Fedu + traveltime + failures + freetime + famsize + goout + Walc + Dalc + famrel + absences + health + studytime , data = students)
kable(summary(model)$coef, caption ="Statistics of Regression Coefficients")
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 9.0797575 | 1.6801061 | 5.4042762 | 0.0000001 |
Medu | 0.6241196 | 0.2549488 | 2.4480195 | 0.0148147 |
Fedu | -0.0559475 | 0.2543342 | -0.2199761 | 0.8260075 |
traveltime | -0.4317401 | 0.3140260 | -1.3748548 | 0.1699845 |
failures | -1.8833355 | 0.3065013 | -6.1446257 | 0.0000000 |
freetime | 0.3498435 | 0.2304954 | 1.5177890 | 0.1298969 |
famsizeLE3 | 0.8561716 | 0.4729041 | 1.8104548 | 0.0710129 |
goout | -0.6475661 | 0.2207275 | -2.9337814 | 0.0035512 |
Walc | 0.3367182 | 0.2381571 | 1.4138492 | 0.1582227 |
Dalc | -0.1982684 | 0.3192112 | -0.6211197 | 0.5348923 |
famrel | 0.2536250 | 0.2441219 | 1.0389275 | 0.2994973 |
absences | 0.0240459 | 0.0271578 | 0.8854119 | 0.3764930 |
health | -0.1444087 | 0.1554869 | -0.9287521 | 0.3536056 |
studytime | 0.2688117 | 0.2685105 | 1.0011220 | 0.3174032 |
stepwise_model <- step(model, direction = "both")
Start: AIC=1148.3
G3 ~ Medu + Fedu + traveltime + failures + freetime + famsize +
goout + Walc + Dalc + famrel + absences + health + studytime
Df Sum of Sq RSS AIC
- Fedu 1 0.86 6736.0 1146.3
- Dalc 1 6.82 6742.0 1146.7
- absences 1 13.86 6749.0 1147.1
- health 1 15.25 6750.4 1147.2
- studytime 1 17.72 6752.9 1147.3
- famrel 1 19.08 6754.3 1147.4
- traveltime 1 33.41 6768.6 1148.3
<none> 6735.2 1148.3
- Walc 1 35.34 6770.5 1148.4
- freetime 1 40.72 6775.9 1148.7
- famsize 1 57.94 6793.1 1149.7
- Medu 1 105.94 6841.1 1152.5
- goout 1 152.15 6887.3 1155.1
- failures 1 667.44 7402.6 1183.6
Step: AIC=1146.35
G3 ~ Medu + traveltime + failures + freetime + famsize + goout +
Walc + Dalc + famrel + absences + health + studytime
Df Sum of Sq RSS AIC
- Dalc 1 6.76 6742.8 1144.8
- absences 1 14.22 6750.2 1145.2
- health 1 15.72 6751.8 1145.3
- studytime 1 18.61 6754.6 1145.4
- famrel 1 19.09 6755.1 1145.5
- traveltime 1 32.84 6768.9 1146.3
<none> 6736.0 1146.3
- Walc 1 35.06 6771.1 1146.4
- freetime 1 41.49 6777.5 1146.8
- famsize 1 58.73 6794.8 1147.8
+ Fedu 1 0.86 6735.2 1148.3
- Medu 1 146.08 6882.1 1152.8
- goout 1 152.85 6888.9 1153.2
- failures 1 674.80 7410.8 1182.1
Step: AIC=1144.75
G3 ~ Medu + traveltime + failures + freetime + famsize + goout +
Walc + famrel + absences + health + studytime
Df Sum of Sq RSS AIC
- absences 1 13.60 6756.4 1143.5
- health 1 16.06 6758.9 1143.7
- studytime 1 18.80 6761.6 1143.8
- famrel 1 19.65 6762.4 1143.9
- Walc 1 29.32 6772.1 1144.5
<none> 6742.8 1144.8
- traveltime 1 35.55 6778.3 1144.8
- freetime 1 37.35 6780.1 1144.9
- famsize 1 57.05 6799.9 1146.1
+ Dalc 1 6.76 6736.0 1146.3
+ Fedu 1 0.80 6742.0 1146.7
- Medu 1 141.66 6884.5 1151.0
- goout 1 149.96 6892.8 1151.4
- failures 1 685.66 7428.5 1181.0
Step: AIC=1143.55
G3 ~ Medu + traveltime + failures + freetime + famsize + goout +
Walc + famrel + health + studytime
Df Sum of Sq RSS AIC
- health 1 17.20 6773.6 1142.5
- studytime 1 17.68 6774.1 1142.6
- famrel 1 19.50 6775.9 1142.7
- freetime 1 33.81 6790.2 1143.5
<none> 6756.4 1143.5
- Walc 1 35.00 6791.4 1143.6
- traveltime 1 36.64 6793.0 1143.7
+ absences 1 13.60 6742.8 1144.8
- famsize 1 58.65 6815.0 1145.0
+ Dalc 1 6.15 6750.2 1145.2
+ Fedu 1 1.14 6755.3 1145.5
- goout 1 150.81 6907.2 1150.3
- Medu 1 155.07 6911.5 1150.5
- failures 1 674.71 7431.1 1179.2
Step: AIC=1142.55
G3 ~ Medu + traveltime + failures + freetime + famsize + goout +
Walc + famrel + studytime
Df Sum of Sq RSS AIC
- famrel 1 16.00 6789.6 1141.5
- studytime 1 19.12 6792.7 1141.7
- Walc 1 30.35 6804.0 1142.3
- freetime 1 31.28 6804.9 1142.4
<none> 6773.6 1142.5
- traveltime 1 35.96 6809.6 1142.6
+ health 1 17.20 6756.4 1143.5
+ absences 1 14.75 6758.9 1143.7
- famsize 1 61.38 6835.0 1144.1
+ Dalc 1 6.46 6767.1 1144.2
+ Fedu 1 1.72 6771.9 1144.5
- goout 1 143.80 6917.4 1148.8
- Medu 1 157.93 6931.5 1149.7
- failures 1 685.65 7459.3 1178.6
Step: AIC=1141.48
G3 ~ Medu + traveltime + failures + freetime + famsize + goout +
Walc + studytime
Df Sum of Sq RSS AIC
- studytime 1 19.77 6809.4 1140.6
- Walc 1 24.72 6814.3 1140.9
<none> 6789.6 1141.5
- traveltime 1 35.63 6825.2 1141.5
- freetime 1 39.24 6828.8 1141.8
+ famrel 1 16.00 6773.6 1142.5
+ absences 1 14.49 6775.1 1142.6
+ health 1 13.70 6775.9 1142.7
- famsize 1 60.63 6850.2 1143.0
+ Dalc 1 6.92 6782.7 1143.1
+ Fedu 1 1.66 6787.9 1143.4
- goout 1 136.40 6926.0 1147.3
- Medu 1 154.57 6944.2 1148.4
- failures 1 698.49 7488.1 1178.2
Step: AIC=1140.63
G3 ~ Medu + traveltime + failures + freetime + famsize + goout +
Walc
Df Sum of Sq RSS AIC
- Walc 1 16.68 6826.1 1139.6
- freetime 1 33.24 6842.6 1140.6
<none> 6809.4 1140.6
- traveltime 1 38.91 6848.3 1140.9
+ studytime 1 19.77 6789.6 1141.5
+ famrel 1 16.65 6792.7 1141.7
+ health 1 14.94 6794.4 1141.8
+ absences 1 13.32 6796.1 1141.9
- famsize 1 57.44 6866.8 1142.0
+ Dalc 1 7.17 6802.2 1142.2
+ Fedu 1 2.88 6806.5 1142.5
- goout 1 128.72 6938.1 1146.0
- Medu 1 155.50 6964.9 1147.5
- failures 1 743.24 7552.6 1179.5
Step: AIC=1139.6
G3 ~ Medu + traveltime + failures + freetime + famsize + goout
Df Sum of Sq RSS AIC
- traveltime 1 33.72 6859.8 1139.5
<none> 6826.1 1139.6
- freetime 1 34.72 6860.8 1139.6
+ absences 1 17.32 6808.7 1140.6
+ Walc 1 16.68 6809.4 1140.6
+ health 1 11.78 6814.3 1140.9
+ studytime 1 11.73 6814.3 1140.9
+ famrel 1 11.52 6814.5 1140.9
- famsize 1 64.17 6890.2 1141.3
+ Fedu 1 2.27 6823.8 1141.5
+ Dalc 1 0.11 6825.9 1141.6
- goout 1 112.20 6938.3 1144.0
- Medu 1 151.95 6978.0 1146.3
- failures 1 730.38 7556.4 1177.8
Step: AIC=1139.54
G3 ~ Medu + failures + freetime + famsize + goout
Df Sum of Sq RSS AIC
<none> 6859.8 1139.5
+ traveltime 1 33.72 6826.1 1139.6
- freetime 1 36.72 6896.5 1139.7
+ absences 1 17.62 6842.2 1140.5
+ studytime 1 15.15 6844.6 1140.7
+ famrel 1 12.01 6847.8 1140.8
+ health 1 11.81 6848.0 1140.9
+ Walc 1 11.49 6848.3 1140.9
- famsize 1 59.08 6918.9 1140.9
+ Fedu 1 1.34 6858.4 1141.5
+ Dalc 1 0.21 6859.6 1141.5
- goout 1 117.06 6976.8 1144.2
- Medu 1 178.98 7038.8 1147.7
- failures 1 748.93 7608.7 1178.5
summary(stepwise_model)
Call:
lm(formula = G3 ~ Medu + failures + freetime + famsize + goout,
data = students)
Residuals:
Min 1Q Median 3Q Max
-12.5232 -2.0881 0.4697 2.7259 9.1755
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.6298 0.9771 9.855 < 2e-16 ***
Medu 0.6378 0.2002 3.186 0.00156 **
failures -1.9333 0.2967 -6.517 2.23e-10 ***
freetime 0.3196 0.2215 1.443 0.14983
famsizeLE3 0.8551 0.4672 1.830 0.06796 .
goout -0.5156 0.2001 -2.576 0.01035 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.199 on 389 degrees of freedom
Multiple R-squared: 0.1705, Adjusted R-squared: 0.1599
F-statistic: 15.99 on 5 and 389 DF, p-value: 2.432e-14
Students.num <- select(students,"Medu", "failures", "freetime", "goout", "G3", "famsize")
I dropped the variable “school” because the data comes from only two schools in a similar location, so which school a student attends is not really taken into account. I dropped the separate period grades because this study focuses on the final grade, and I dropped age because it is just a demographic. Because several variables in the initial model had fairly high p-values, I then ran stepwise selection (in both directions, using AIC). This left me with the variables “Medu”, “failures”, “freetime”, “goout”, and “famsize”.
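The AIC values printed by `step()` can be reproduced by hand, which makes the selection criterion easier to follow. The sketch below assumes the form R uses for linear models in `step()`, n·log(RSS/n) + 2p, and plugs in the residual sum of squares of the final model. The names `n`, `p`, `rss`, and `aic_step` are illustrative; n = 395 is inferred from the 389 residual degrees of freedom plus 6 estimated coefficients.

```r
n   <- 395         # observations: 389 residual df + 6 estimated coefficients
p   <- 6           # intercept, Medu, failures, freetime, famsizeLE3, goout
rss <- 6859.77432  # residual sum of squares of the final stepwise model

# AIC as reported by step() for a linear model (up to an additive constant)
aic_step <- n * log(rss / n) + 2 * p
round(aic_step, 2)  # about 1139.54, in line with the last "Step: AIC" line
```

A variable is dropped whenever removing it lowers this value, which is why a term with a modest p-value (such as freetime) can still remain in the final model.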
full.model = lm(G3 ~ Medu + failures + freetime + famsize + goout, data = Students.num)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 9.6298289 | 0.9771261 | 9.855258 | 0.0000000 |
Medu | 0.6377844 | 0.2001929 | 3.185849 | 0.0015598 |
failures | -1.9333273 | 0.2966648 | -6.516875 | 0.0000000 |
freetime | 0.3195798 | 0.2214706 | 1.442989 | 0.1498282 |
famsizeLE3 | 0.8551003 | 0.4671671 | 1.830395 | 0.0679558 |
goout | -0.5155712 | 0.2001114 | -2.576420 | 0.0103508 |
par(mfrow=c(2,2))
plot(full.model)
The Q-Q plot indicates that the residuals are not quite normally distributed.
The variance of the residuals is also not constant, and the points are clumped
more closely together at the right side of the plot. There also appears to be
one outlier to the far right in the lower-right (residuals vs. leverage) plot.
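The visual impressions above can be backed up with formal tests. The following is a minimal sketch on simulated data, not the student data: the variables `x`, `y`, and `fit` are hypothetical, and the constant-variance check is a hand-rolled Koenker-style Breusch-Pagan statistic (n times the R-squared from regressing squared residuals on the predictor), swapped in here so that only base R is needed.

```r
set.seed(321)
n <- 200
x <- runif(n, 1, 5)
y <- 10 + 0.5 * x + rnorm(n, sd = 0.5 * x)  # error spread grows with x
fit <- lm(y ~ x)

# Normality of the residuals (small p-value suggests non-normality)
sw <- shapiro.test(resid(fit))

# Constant variance: regress squared residuals on the predictor;
# under homoscedasticity, n * R^2 is approximately chi-squared with 1 df
aux     <- lm(resid(fit)^2 ~ x)
lm_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_p = bp_p)
```

Applied to `full.model`, the same two checks would use `resid(full.model)` and all five predictors in the auxiliary regression, with the degrees of freedom set to the number of predictors.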
vif(full.model)
Medu failures freetime famsize goout
1.073128 1.087442 1.093401 1.003687 1.108887
barplot(vif(full.model), main = "VIF Values", horiz = FALSE, col = "steelblue")
Since all of the VIF values are close to 1 and do not exceed 4,
multicollinearity is not an issue.
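For a predictor with one degree of freedom, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below illustrates this on the built-in `mtcars` data, which is only a stand-in for the student data; `r2_wt` and `vif_wt` are illustrative names.

```r
# mtcars is a stand-in data set, not the student data.
# Model of interest: mpg ~ wt + hp + disp. VIF for wt:
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt
```

A VIF near 1, as seen for every term in `full.model`, means the predictor is almost uncorrelated with the others, so its standard error is barely inflated.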
To help correct the non-constant variance of the residuals, I am going to perform a Box-Cox transformation.
par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
Students.num$G3_adjusted <- Students.num$G3 + 3
boxcox(G3_adjusted ~ Medu + freetime + famsize + goout + log(failures +1)
, data = Students.num, lambda = seq(0, 1, length = 10),
xlab=expression(paste(lambda, ": log-failures")))
boxcox(G3_adjusted ~ Medu + freetime + famsize + goout + failures
, data = Students.num, lambda = seq(0, 1, length = 10),
xlab=expression(paste(lambda, ": failures")))
boxcox(G3_adjusted ~ Medu + log(freetime) + famsize + goout + failures
, data = Students.num, lambda = seq(0, 1, length = 10),
xlab=expression(paste(lambda, ": log-freetime")))
boxcox(G3_adjusted ~ Medu + freetime + famsize + goout + failures
, data = Students.num, lambda = seq(0, 1, length = 10),
xlab=expression(paste(lambda, ": freetime")))
The Box-Cox procedure requires a strictly positive response, so I shifted
my response variable “G3” by 3 (stored as “G3_adjusted”). Likewise, when taking
the log of “failures”, I added 1 so that the log is defined when a student has no failures.
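The value of λ suggested by a Box-Cox plot can also be read off numerically. The sketch below uses simulated data (the data frame `toy` and its columns are hypothetical) with a response that can contain zeros, shifts it by 3 as in the report, and picks the λ on the grid with the highest profile log-likelihood.

```r
library(MASS)  # boxcox()

set.seed(1)
toy <- data.frame(x = runif(100, 1, 5))
toy$y <- pmax(0, round(8 + 1.5 * toy$x + rnorm(100, sd = 3)))  # may contain zeros
toy$y_shifted <- toy$y + 3  # same shift the report applies to G3

# plotit = FALSE returns the lambda grid (x) and the log-likelihood (y)
bc <- boxcox(y_shifted ~ x, data = toy,
             lambda = seq(-1, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

In practice λ is usually rounded to a convenient value inside the confidence interval shown on the plot, such as 0.5 (square root) or 0 (log), which are the two transformations tried next.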
sqrt.G3.log.fa = lm((G3_adjusted)^0.5 ~ Medu + log(failures +1) + freetime + famsize + goout, data = Students.num)
kable(summary(sqrt.G3.log.fa)$coef, caption = "Square-root G3, log-failures model")
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 3.4895320 | 0.1562355 | 22.335082 | 0.0000000 |
Medu | 0.0884132 | 0.0319745 | 2.765116 | 0.0059612 |
log(failures + 1) | -0.6048948 | 0.0895496 | -6.754855 | 0.0000000 |
freetime | 0.0492631 | 0.0353184 | 1.394829 | 0.1638632 |
famsizeLE3 | 0.1408791 | 0.0745153 | 1.890607 | 0.0594198 |
goout | -0.0732050 | 0.0319050 | -2.294469 | 0.0222956 |
par(mfrow = c(2,2))
plot(sqrt.G3.log.fa)
As in the Box-Cox step, the transformed models use the shifted response
“G3_adjusted”, and “failures” is shifted by 1 wherever its log is taken.
log.G3 = lm(log(G3_adjusted) ~ Medu + failures + freetime + famsize + goout , data = Students.num)
kable(summary(log.G3)$coef, caption = "log-transformed model")
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 2.4138563 | 0.1097202 | 22.000113 | 0.0000000 |
Medu | 0.0556047 | 0.0224794 | 2.473584 | 0.0138027 |
failures | -0.2110588 | 0.0333121 | -6.335803 | 0.0000000 |
freetime | 0.0338659 | 0.0248686 | 1.361792 | 0.1740516 |
famsizeLE3 | 0.0962690 | 0.0524576 | 1.835179 | 0.0672425 |
goout | -0.0426323 | 0.0224702 | -1.897276 | 0.0585318 |
par(mfrow = c(2,2))
plot(log.G3)
## Goodness of Fit measures
# Goodness-of-fit summary for a fitted lm object.
# Note: this definition masks dplyr::select() for the rest of the session.
select = function(m){
  e = resid(m)                       # residuals
  n0 = length(e)                     # sample size
  p = n0 - m$df.residual             # number of estimated coefficients
  MSE = (summary(m)$sigma)^2
  SSE = m$df.residual * MSE
  R.sq = summary(m)$r.squared
  R.adj = summary(m)$adj.r.squared
  # Cp is computed with the model's own MSE, so it always equals p here
  Cp = (SSE/MSE) - (n0 - 2*p)
  AIC = n0*log(SSE) - n0*log(n0) + 2*p
  SBC = n0*log(SSE) - n0*log(n0) + log(n0)*p
  # PRESS: sum of squared leave-one-out (deleted) residuals
  d = e/(1 - hatvalues(m))
  PRESS = sum(d^2)
  tbl = data.frame(SSE = SSE, R.sq = R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRESS = PRESS)
  tbl
}
output.sum = rbind(select(full.model), select(sqrt.G3.log.fa), select(log.G3))
row.names(output.sum) = c("full.model", "sqrt.G3.log.fa", "log.G3")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
Model | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
---|---|---|---|---|---|---|---|
full.model | 6859.77432 | 0.1705139 | 0.1598521 | 6 | 1139.5449 | 1163.4182 | 7080.63373 |
sqrt.G3.log.fa | 174.57636 | 0.1679182 | 0.1572231 | 6 | -310.5268 | -286.6535 | 180.35971 |
log.G3 | 86.49313 | 0.1470004 | 0.1360364 | 6 | -587.9342 | -564.0609 | 89.50556 |
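The PRESS column is the sum of squared leave-one-out prediction errors, obtained from a single fit by dividing each residual by one minus its leverage. The sketch below checks that shortcut against an explicit leave-one-out loop, using the built-in `mtcars` data as a stand-in for the student data; all object names are illustrative.

```r
# mtcars is a stand-in data set, not the student data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shortcut: deleted residual = residual / (1 - leverage)
press_shortcut <- sum((resid(fit) / (1 - hatvalues(fit)))^2)

# Explicit leave-one-out: refit without row i, then predict row i
loo_err <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
})
press_loo <- sum(loo_err^2)

all.equal(press_shortcut, press_loo)
```

Because PRESS is measured in squared units of the response, its value depends on the scale of the response each model is fit to.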
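The best model for this data set is the full model because it has the highest R^2 value (0.1705, compared with 0.1679 and 0.1470 for the transformed models). It therefore explains the largest share of the variance in the students' final math grade based on the explanatory variables. The SSE, AIC, SBC, and PRESS values are not used for this choice because the transformed models are fit to a transformed response, which puts those measures on different scales.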
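One way to put candidate models with different response transformations on a common footing is to back-transform their fitted values and compute the error on the original scale. This is a hedged sketch on the built-in `mtcars` data, not the student data, and it uses a naive back-transformation that ignores retransformation bias; for the models in this report the shift of 3 would also have to be subtracted after back-transforming.

```r
# mtcars is a stand-in data set, not the student data.
fit_raw  <- lm(mpg ~ wt + hp, data = mtcars)
fit_sqrt <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)
fit_log  <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Back-transform fitted values to the original mpg scale
pred <- list(raw  = fitted(fit_raw),
             sqrt = fitted(fit_sqrt)^2,
             log  = exp(fitted(fit_log)))

# Sum of squared errors on the original scale, now comparable across models
sse_original_scale <- sapply(pred, function(p) sum((mtcars$mpg - p)^2))
sse_original_scale
```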
kable(summary(full.model)$coef, caption = "Inferential Statistics of Final Model")
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 9.6298289 | 0.9771261 | 9.855258 | 0.0000000 |
Medu | 0.6377844 | 0.2001929 | 3.185849 | 0.0015598 |
failures | -1.9333273 | 0.2966648 | -6.516875 | 0.0000000 |
freetime | 0.3195798 | 0.2214706 | 1.442989 | 0.1498282 |
famsizeLE3 | 0.8551003 | 0.4671671 | 1.830395 | 0.0679558 |
goout | -0.5155712 | 0.2001114 | -2.576420 | 0.0103508 |
The model can be written as the following:
Predicted G3 = 9.63 + (0.638)Medu - (1.933)failures + (0.320)freetime + (0.855)famsizeLE3 - (0.516)goout, where famsizeLE3 equals 1 for a family of three or fewer and 0 otherwise.
From this model, we can see that mother's education, amount of free time, and coming from a smaller family (three or fewer members) are positively associated with the final grade a student receives in math, although only mother's education is significant at the 0.05 level. The number of past class failures and the amount of time spent going out are associated with a lower final math grade. For example, when the going-out rating increases by one point and the other variables are held fixed, the predicted final math grade decreases by about 0.52.
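As a worked example of reading the fitted equation, the sketch below computes the predicted grade for a hypothetical student and shows the effect of a one-point increase in going out. The coefficients are copied from the final model table; the student profile and the helper `predict_g3` are made up for illustration.

```r
# Coefficients of the final model, copied from the table above
b <- c(intercept = 9.6298289, Medu = 0.6377844, failures = -1.9333273,
       freetime = 0.3195798, famsizeLE3 = 0.8551003, goout = -0.5155712)

# Hypothetical helper: predicted G3 for one student
predict_g3 <- function(Medu, failures, freetime, famsizeLE3, goout) {
  unname(b["intercept"] + b["Medu"] * Medu + b["failures"] * failures +
           b["freetime"] * freetime + b["famsizeLE3"] * famsizeLE3 +
           b["goout"] * goout)
}

# Hypothetical student: Medu = 4, no failures, freetime = 3, family > 3, goout = 3
base     <- predict_g3(Medu = 4, failures = 0, freetime = 3, famsizeLE3 = 0, goout = 3)
more_out <- predict_g3(Medu = 4, failures = 0, freetime = 3, famsizeLE3 = 0, goout = 4)

base             # about 11.59
more_out - base  # equals the goout coefficient, about -0.52
```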
To conclude, I mainly took into account explanatory variables that were numeric, and I used stepwise regression (based on AIC) to decide which of them to keep. This left five explanatory variables: four numeric ones and one two-level categorical variable (famsize). Each candidate model contains the same number of variables. Even though it ended up being the best model, the full model had several violations. The variance of the residuals was not constant, and the Box-Cox transformation was done to try to correct this. The residuals were also not normally distributed, and this violation remains uncorrected. There also appears to be an outlier in the data. Although the full model has the highest R-squared value, it is still fairly low at only 0.17. One explanation is that a model better suited to this data was not included in this project. Other explanatory variables might also better explain why a student receives a certain math grade.