Loading in the libraries
library(dplyr)
library(ggplot2)
library(tidyr)
library(car)
library(stats)
I upload the raw data, not filtered or cleaned yet
setwd("C:/Users/josep/Joseph Schoolwork/R class/data")
#because [why], i used read.csv2 instead of read.csv
data <- read.csv2("student-por.csv")
head(data)
I now select 14 variables which I deem most important, out of the 33 total seen above. I also changed the variable “sex” into a categorical variable. Then, I look at the summary statistics.
edu <- data %>%
dplyr::select(G1, G2, G3, Medu, Fedu, studytime, failures, absences, sex, age, goout, Dalc, Walc, health) %>%
tidyr::drop_na() %>%
dplyr::mutate(sex = factor(sex))
head(edu)
summary(edu)
G1 G2 G3 Medu Fedu studytime failures absences
Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. :0.000 Min. :0.000 Min. :1.000 Min. :0.0000 Min. : 0.000
1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 0.000
Median :11.0 Median :11.00 Median :12.00 Median :2.000 Median :2.000 Median :2.000 Median :0.0000 Median : 2.000
Mean :11.4 Mean :11.57 Mean :11.91 Mean :2.515 Mean :2.307 Mean :1.931 Mean :0.2219 Mean : 3.659
3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.: 6.000
Max. :19.0 Max. :19.00 Max. :19.00 Max. :4.000 Max. :4.000 Max. :4.000 Max. :3.0000 Max. :32.000
sex age goout Dalc Walc health
F:383 Min. :15.00 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
M:266 1st Qu.:16.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:2.000
Median :17.00 Median :3.000 Median :1.000 Median :2.00 Median :4.000
Mean :16.74 Mean :3.185 Mean :1.502 Mean :2.28 Mean :3.536
3rd Qu.:18.00 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
Max. :22.00 Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
Now I look at the distribution of final grades (G3). It showed that it is a bell shaped distribution centered around the middle, which means that most students perform near average. There is also a small amount of zeros on the left.
ggplot(edu, aes(x = G3)) +
geom_histogram(binwidth = 1) +
labs(title = "Final Grade (G3) Distribution", x = "G3", y = "Count")
Next, I check the relationship between the middle of year grades (G2) and final grades (G3)
ggplot(edu, aes(x = G2, y = G3)) +
geom_point() +
geom_smooth(method = "lm", se = T) +
labs(title = "Mid year (G2) vs Final (G3)", x = "G2", y = "G3")
Now I build the regression model using all 14 variables. G3 is the dependent variable, and the other 13 are predictors. From here, I can see which ones hold significance, while controlling for others.
mdl_full <- lm(G3 ~ Medu + Fedu + studytime + failures + absences +
G1 + G2 + sex + age + goout + Dalc + Walc + health, data = edu)
summary(mdl_full)
Call:
lm(formula = G3 ~ Medu + Fedu + studytime + failures + absences +
G1 + G2 + sex + age + goout + Dalc + Walc + health, data = edu)
Residuals:
Min 1Q Median 3Q Max
-8.8959 -0.4682 -0.0404 0.6328 6.1408
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.089909 0.792775 -0.113 0.909741
Medu -0.036002 0.058436 -0.616 0.538061
Fedu 0.031191 0.059497 0.524 0.600297
studytime 0.069455 0.063643 1.091 0.275548
failures -0.230665 0.095247 -2.422 0.015724 *
absences 0.024421 0.011029 2.214 0.027164 *
G1 0.143081 0.036682 3.901 0.000106 ***
G2 0.876506 0.034485 25.417 < 2e-16 ***
sexM -0.109568 0.110391 -0.993 0.321308
age 0.029547 0.044111 0.670 0.503218
goout -0.028043 0.045870 -0.611 0.541180
Dalc -0.078583 0.069255 -1.135 0.256936
Walc -0.004022 0.053302 -0.075 0.939870
health -0.046183 0.034749 -1.329 0.184301
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.253 on 635 degrees of freedom
Multiple R-squared: 0.8525, Adjusted R-squared: 0.8495
F-statistic: 282.3 on 13 and 635 DF, p-value: < 2.2e-16
Now I simplify the model using stepwise selection, to remove variables that don’t improve the model.
mdl_step <- step(mdl_full, trace = 0)
summary(mdl_step)
Call:
lm(formula = G3 ~ failures + absences + G1 + G2 + sex + Dalc,
data = edu)
Residuals:
Min 1Q Median 3Q Max
-8.9645 -0.4869 -0.0648 0.6214 5.9651
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.23657 0.27359 0.865 0.387534
failures -0.21692 0.09061 -2.394 0.016955 *
absences 0.02466 0.01086 2.271 0.023496 *
G1 0.14066 0.03602 3.905 0.000104 ***
G2 0.88356 0.03387 26.091 < 2e-16 ***
sexM -0.16005 0.10434 -1.534 0.125524
Dalc -0.08878 0.05695 -1.559 0.119511
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.251 on 642 degrees of freedom
Multiple R-squared: 0.8515, Adjusted R-squared: 0.8501
F-statistic: 613.6 on 6 and 642 DF, p-value: < 2.2e-16
Now I do a diagnostic check of the model. Shapiro-Wilk test shows that the residuals were normal, the histogram shows residuals centered around zero, and the residuals vs fitted plot showed no pattern.
shapiro.test(mdl_step$residuals)
Shapiro-Wilk normality test
data: mdl_step$residuals
W = 0.75943, p-value < 2.2e-16
ggplot(data.frame(resid = mdl_step$residuals), aes(x = resid)) +
geom_histogram(binwidth = 0.5) + labs(title = "Residuals", x = "Residual", y = "Count")
ggplot(data.frame(fitted = mdl_step$fitted.values,
resid = mdl_step$residuals),
aes(x = fitted, y = resid)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
labs(title = "Residuals vs Fitted", x = "Fitted", y = "Residual")