Relevant summary statistics
student_mat_csv <- "https://raw.githubusercontent.com/moiyajosephs/Data606-Final/main/student-mat.csv"
student_mat <- read_delim(curl(student_mat_csv),delim = ";")
student_por_csv <- "https://raw.githubusercontent.com/moiyajosephs/Data606-Final/main/student-por.csv"
student_por <- read_delim(curl(student_por_csv),delim = ";")
## Rows: 649 Columns: 33
## -- Column specification --------------------------------------------------------
## Delimiter: ";"
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (16): age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(student_mat)
## school sex age address
## Length:395 Length:395 Min. :15.0 Length:395
## Class :character Class :character 1st Qu.:16.0 Class :character
## Mode :character Mode :character Median :17.0 Mode :character
## Mean :16.7
## 3rd Qu.:18.0
## Max. :22.0
## famsize Pstatus Medu Fedu
## Length:395 Length:395 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :2.000
## Mean :2.749 Mean :2.522
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Length:395
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.000 Median :2.000 Median :0.0000 Mode :character
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:395 Length:395 Length:395 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.944
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.000
## Mean :3.235 Mean :3.109 Mean :1.481 Mean :2.291
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## health absences G1 G2
## Min. :1.000 Min. : 0.000 Min. : 3.00 Min. : 0.00
## 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00
## Median :4.000 Median : 4.000 Median :11.00 Median :11.00
## Mean :3.554 Mean : 5.709 Mean :10.91 Mean :10.71
## 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :5.000 Max. :75.000 Max. :19.00 Max. :19.00
## G3
## Min. : 0.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :10.42
## 3rd Qu.:14.00
## Max. :20.00
Attributes for student-mat.csv (Math course): 1 school - student’s
school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira) 2
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male) 3 age -
student’s age (numeric: from 15 to 22) 4 address - student’s home
address type (binary: ‘U’ - urban or ‘R’ - rural) 5 famsize - family
size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3) 6
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or
‘A’ - apart) 7 Medu - mother’s education (numeric: 0 - none, 1 - primary
education (4th grade), 2 – 5th to 9th grade, 3 – secondary education
or 4 – higher education) 8 Fedu - father’s education (numeric: 0 -
none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 –
secondary education or 4 – higher education) 9 Mjob - mother’s job
(nominal: ‘teacher’, ‘health’ care related, civil ‘services’
(e.g. administrative or police), ‘at_home’ or ‘other’) 10 Fjob -
father’s job (nominal: ‘teacher’, ‘health’ care related, civil
‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’) 11
reason - reason to choose this school (nominal: close to ‘home’, school
‘reputation’, ‘course’ preference or ‘other’) 12 guardian - student’s
guardian (nominal: ‘mother’, ‘father’ or ‘other’) 13 traveltime - home
to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 -
30 min. to 1 hour, or 4 - >1 hour) 14 studytime - weekly study time
(numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 -
>10 hours) 15 failures - number of past class failures (numeric: n if
1<=n<3, else 4) 16 schoolsup - extra educational support (binary:
yes or no) 17 famsup - family educational support (binary: yes or no) 18
paid - extra paid classes within the course subject (Math or Portuguese)
(binary: yes or no) 19 activities - extra-curricular activities (binary:
yes or no) 20 nursery - attended nursery school (binary: yes or no) 21
higher - wants to take higher education (binary: yes or no) 22 internet
- Internet access at home (binary: yes or no) 23 romantic - with a
romantic relationship (binary: yes or no) 24 famrel - quality of family
relationships (numeric: from 1 - very bad to 5 - excellent) 25 freetime
- free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 -
very high) 27 Dalc - workday alcohol consumption (numeric: from 1 - very
low to 5 - very high) 28 Walc - weekend alcohol consumption (numeric:
from 1 - very low to 5 - very high) 29 health - current health status
(numeric: from 1 - very bad to 5 - very good) 30 absences - number of
school absences (numeric: from 0 to 93)
these grades are related with the course subject, Math or
Portuguese:
31 G1 - first period grade (numeric: from 0 to 20) 31 G2 - second
period grade (numeric: from 0 to 20) 32 G3 - final grade (numeric: from
0 to 20, output target)
ggplot(student_mat, aes(sex)) + geom_bar()

ggplot(student_mat, aes(G3)) + geom_bar() + facet_wrap(vars(sex))

ggplot(student_mat, aes(G3)) + geom_bar() + facet_grid(vars(school), vars(sex))

The Romance Model
Romantic
Students reply either yes or no if they are in a romantic
relationship, indicated by the romantic variable.
ggplot(student_mat, aes(romantic)) + geom_bar()

romantic.model <- lm(G3 ~ romantic + sex + age, data = student_mat)
summary(romantic.model)
##
## Call:
## lm(formula = G3 ~ romantic + sex + age, data = student_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0331 -1.9794 0.3268 2.9669 9.3268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.9147 3.0017 6.301 7.96e-10 ***
## romanticyes -0.9438 0.4884 -1.932 0.05403 .
## sexM 0.8196 0.4553 1.800 0.07261 .
## age -0.5134 0.1799 -2.854 0.00455 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.494 on 391 degrees of freedom
## Multiple R-squared: 0.045, Adjusted R-squared: 0.03768
## F-statistic: 6.142 on 3 and 391 DF, p-value: 0.0004341
ggplot(student_mat, aes(x= G3)) +
geom_histogram(bins = 10) +
facet_grid(vars(romantic), vars(sex))

Romantic is statistically significant according to the anova
test.
summary_stats <- student_mat %>%
group_by(romantic) %>%
summarise(mean_by_group = mean(G3))
summary_stats
## # A tibble: 2 x 2
## romantic mean_by_group
## <chr> <dbl>
## 1 no 10.8
## 2 yes 9.58
describeBy(student_mat$G3,
group = student_mat$romantic, mat=TRUE)
## item group1 vars n mean sd median trimmed mad min max
## X11 1 no 1 263 10.836502 4.385946 11 11.113744 4.4478 0 20
## X12 2 yes 1 132 9.575758 4.856916 11 9.971698 3.7065 0 18
## range skew kurtosis se
## X11 20 -0.6186634 0.5452026 0.2704490
## X12 18 -0.8212904 -0.2272406 0.4227403
Analyzing the r^2
ggplot(data = romantic.model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_jitter()+
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")

The Activities Model
Students reply yes or no if they have extra curricular activities,
indicated by the activities variables.
ggplot(student_mat, aes(x= G3)) +
geom_histogram(bins = 10) +
facet_wrap(vars(activities))

activities.model <- lm(G3 ~ activities + sex + age, data = student_mat)
summary(activities.model)
##
## Call:
## lm(formula = G3 ~ activities + sex + age, data = student_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9186 -1.8871 0.2918 3.0814 9.7181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.60836 3.04399 6.442 3.48e-10 ***
## activitiesyes -0.09467 0.45909 -0.206 0.83673
## sexM 0.91567 0.45740 2.002 0.04599 *
## age -0.57369 0.17926 -3.200 0.00148 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.515 on 391 degrees of freedom
## Multiple R-squared: 0.03599, Adjusted R-squared: 0.02859
## F-statistic: 4.866 on 3 and 391 DF, p-value: 0.002466
By looking at the summary of the activities.model we can
see that activites estimate is not statistically significant as
indicated by p-value which is greater than 0.05. As well as the
F-statistic and p value given, which is also greater than 0.05.
ggplot(data = activities.model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_jitter() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")

Freetime
Freetime after school rated from 1 very low to 5 very high.
ggplot(student_mat, aes(freetime)) + geom_bar()

ggplot(student_mat, aes(x= G3)) +
geom_histogram(bins = 10) +
facet_wrap(vars(activities))

freetime.model <- lm(G3 ~ freetime + sex + age, data = student_mat)
summary(freetime.model)
##
## Call:
## lm(formula = G3 ~ freetime + sex + age, data = student_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9281 -1.8236 0.2774 3.0719 9.8027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.6300 3.0666 6.401 4.42e-10 ***
## freetime -0.0472 0.2346 -0.201 0.84064
## sexM 0.9291 0.4688 1.982 0.04823 *
## age -0.5691 0.1784 -3.190 0.00154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.515 on 391 degrees of freedom
## Multiple R-squared: 0.03598, Adjusted R-squared: 0.02859
## F-statistic: 4.865 on 3 and 391 DF, p-value: 0.002469
Not significant!
Family Relationship
famrel indicated the quality of family relationships on
a scale from (numeric: from 1 - very bad to 5 - excellent)
ggplot(student_mat, aes(famrel)) + geom_bar()

famrel.model <- lm(G3 ~ famrel+ sex + age, data = student_mat)
summary(famrel.model)
##
## Call:
## lm(formula = G3 ~ famrel + sex + age, data = student_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1556 -1.7315 0.2844 2.8738 9.7415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.6010 3.1070 5.987 4.85e-09 ***
## famrel 0.2782 0.2542 1.095 0.27436
## sexM 0.8763 0.4554 1.924 0.05506 .
## age -0.5809 0.1784 -3.257 0.00123 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.509 on 391 degrees of freedom
## Multiple R-squared: 0.03883, Adjusted R-squared: 0.03145
## F-statistic: 5.265 on 3 and 391 DF, p-value: 0.001432
No significance
Going Out
goout indicates if students go out with friends rated
from (numeric: from 1 - very low to 5 - very high)
ggplot(student_mat, aes(goout)) + geom_bar()

goout.model <- lm(G3 ~ goout+ sex + age, data = student_mat)
summary(goout.model)
##
## Call:
## lm(formula = G3 ~ goout + sex + age, data = student_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3706 -1.8464 0.1759 2.9101 10.1759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0801 2.9866 6.723 6.31e-11 ***
## goout -0.5058 0.2051 -2.466 0.01409 *
## sexM 0.9961 0.4532 2.198 0.02854 *
## age -0.5129 0.1785 -2.874 0.00428 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.481 on 391 degrees of freedom
## Multiple R-squared: 0.05065, Adjusted R-squared: 0.04337
## F-statistic: 6.954 on 3 and 391 DF, p-value: 0.0001437
Significance!!
Using Multiple Regression
Upon further implementation, I realized there may be more than one
explanatory variable for the output. To consider the variable in
conjunction with the G3 variable, I will be using multiple
linear regression, and establishing my best naively implemented model to
predict G3.
When using many variables in a linear regression, there may also
exist a collinearity between them. To determine if there are any such
variables I used vif to check for collinearity.
Linear Regression for full model of social activitiess
model.full <- lm(G3 ~ romantic + freetime + goout + Walc + famrel + activities+ sex + age + Dalc, data= student_mat)
summary(model.full)
##
## Call:
## lm(formula = G3 ~ romantic + freetime + goout + Walc + famrel +
## activities + sex + age + Dalc, data = student_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9606 -1.8145 0.4119 2.9367 9.3786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.39330 3.17914 5.786 1.5e-08 ***
## romanticyes -0.90796 0.48930 -1.856 0.0643 .
## freetime 0.11460 0.24727 0.463 0.6433
## goout -0.53579 0.23438 -2.286 0.0228 *
## Walc 0.06730 0.25043 0.269 0.7883
## famrel 0.25665 0.26037 0.986 0.3249
## activitiesyes -0.04855 0.46062 -0.105 0.9161
## sexM 0.89641 0.49223 1.821 0.0694 .
## age -0.45677 0.18371 -2.486 0.0133 *
## Dalc -0.21582 0.34085 -0.633 0.5270
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.483 on 385 degrees of freedom
## Multiple R-squared: 0.06422, Adjusted R-squared: 0.04234
## F-statistic: 2.936 on 9 and 385 DF, p-value: 0.002216
vif(model.full)
## romantic freetime goout Walc famrel activities sex
## 1.046795 1.195705 1.334533 2.038981 1.068367 1.042021 1.186951
## age Dalc
## 1.077172 1.806766
Through backwards selection I removed activities,
Walc, freetime and Dalc. The
model with the highest R adjusted is shown below.
Model reduced linear model equation is:
y = 18.6197 -0.9137 * romanticyes -0.5201 * goout + 0.2802 * famrel +
0.8842 * sexM - 0.4675 * age
The model has an adjusted R squared of 0.05082.
model.reduced <- lm(G3 ~ romantic + famrel + goout + sex + age, data= student_mat)
summary(model.reduced)
##
## Call:
## lm(formula = G3 ~ romantic + famrel + goout + sex + age, data = student_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0110 -1.7786 0.3513 2.8961 9.4753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.6197 3.0914 6.023 3.97e-09 ***
## romanticyes -0.9137 0.4862 -1.879 0.06097 .
## famrel 0.2802 0.2526 1.110 0.26786
## goout -0.5201 0.2046 -2.542 0.01141 *
## sexM 0.8842 0.4542 1.947 0.05229 .
## age -0.4675 0.1805 -2.590 0.00995 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.464 on 389 degrees of freedom
## Multiple R-squared: 0.06286, Adjusted R-squared: 0.05082
## F-statistic: 5.219 on 5 and 389 DF, p-value: 0.0001191
vif(model.reduced)
## romantic famrel goout sex age
## 1.042849 1.014230 1.025925 1.019550 1.048983
When removing the other variables the most ideal variables are
whether the student goes out or if they are in a romantic relationship.
According to this model, as the grade increases, the
romanticyes and goout variable will decrease
by the given estimates. Indicating that the less the student does/has
these things the higher their grade will be.
| model.full |
0.04234 |
| model.reduced |
0.05082 |
Model reduced has a better adjusted R but the increase is not by a
significant amount.
ggplot(data = model.reduced, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_jitter() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")

ggplot(model.reduced, aes(x=.resid)) +
geom_histogram(binwidth = 1) +
xlab("resid")

ggplot(model.reduced, aes(sample = .resid)) +
stat_qq_line() +
stat_qq() +
xlab("resid")
Plotting the residuals show that the residuals are not normally
distributed. Likely due to the outliers of 0 values in the data as
well.
Using robust linear regression
In trying to find the best model to fit this data, I researched
models that perform better than lm when there are outliers. In my search
I discovered the lmrob function from the library MASS.
By using the lmrob model the error is lower than the
lm.
Robust Full
library(robustbase)
## Warning: package 'robustbase' was built under R version 4.1.3
res <- lmrob(G3 ~ romantic + freetime + goout + Walc + Dalc + famrel + activities + sex + age,
data=student_mat)
summary(res)
##
## Call:
## lmrob(formula = G3 ~ romantic + freetime + goout + Walc + Dalc + famrel +
## activities + sex + age, data = student_mat)
## \--> method = "MM"
## Residuals:
## Min 1Q Median 3Q Max
## -12.8795 -2.2319 -0.1279 2.3922 9.0019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.367015 3.056071 6.337 6.54e-10 ***
## romanticyes -0.315221 0.503447 -0.626 0.5316
## freetime 0.083992 0.239092 0.351 0.7256
## goout -0.507998 0.250583 -2.027 0.0433 *
## Walc -0.177547 0.235605 -0.754 0.4516
## Dalc -0.125698 0.246057 -0.511 0.6097
## famrel 0.116678 0.265095 0.440 0.6601
## activitiesyes -0.008661 0.422479 -0.021 0.9837
## sexM 1.057875 0.442807 2.389 0.0174 *
## age -0.441836 0.174074 -2.538 0.0115 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Robust residual standard error: 3.611
## Multiple R-squared: 0.07457, Adjusted R-squared: 0.05294
## Convergence in 14 IRWLS iterations
##
## Robustness weights:
## 28 weights are ~= 1. The remaining 367 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1767 0.8479 0.9556 0.8699 0.9868 0.9989
## Algorithmic parameters:
## tuning.chi bb tuning.psi refine.tol
## 1.548e+00 5.000e-01 4.685e+00 1.000e-07
## rel.tol scale.tol solve.tol eps.outlier
## 1.000e-07 1.000e-10 1.000e-07 2.532e-04
## eps.x warn.limit.reject warn.limit.meanrw
## 4.002e-11 5.000e-01 5.000e-01
## nResample max.it best.r.s k.fast.s k.max
## 500 50 2 1 200
## maxit.scale trace.lev mts compute.rd fast.s.large.n
## 200 0 1000 0 2000
## psi subsampling cov
## "bisquare" "nonsingular" ".vcov.avar1"
## compute.outlier.stats
## "SM"
## seed : int(0)
Through backwards selection I removed famrel,
Dalc, freetime and `romantic``. The model with
the highest R adjusted is shown below. famrel
robust.model.reduced <- lmrob(G3 ~ goout + Walc + sex + age,
data=student_mat)
summary(robust.model.reduced)
##
## Call:
## lmrob(formula = G3 ~ goout + Walc + sex + age, data = student_mat)
## \--> method = "MM"
## Residuals:
## Min 1Q Median 3Q Max
## -13.20825 -2.18920 -0.05832 2.33241 9.28603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.3540 2.6585 7.656 1.52e-13 ***
## goout -0.4857 0.2358 -2.060 0.04007 *
## Walc -0.2576 0.1896 -1.359 0.17497
## sexM 1.1163 0.4253 2.625 0.00901 **
## age -0.4689 0.1632 -2.874 0.00428 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Robust residual standard error: 3.605
## Multiple R-squared: 0.07392, Adjusted R-squared: 0.06443
## Convergence in 13 IRWLS iterations
##
## Robustness weights:
## 31 weights are ~= 1. The remaining 364 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1508 0.8518 0.9533 0.8677 0.9870 0.9990
## Algorithmic parameters:
## tuning.chi bb tuning.psi refine.tol
## 1.548e+00 5.000e-01 4.685e+00 1.000e-07
## rel.tol scale.tol solve.tol eps.outlier
## 1.000e-07 1.000e-10 1.000e-07 2.532e-04
## eps.x warn.limit.reject warn.limit.meanrw
## 4.002e-11 5.000e-01 5.000e-01
## nResample max.it best.r.s k.fast.s k.max
## 500 50 2 1 200
## maxit.scale trace.lev mts compute.rd fast.s.large.n
## 200 0 1000 0 2000
## psi subsampling cov
## "bisquare" "nonsingular" ".vcov.avar1"
## compute.outlier.stats
## "SM"
## seed : int(0)
plot(robust.model.reduced$residuals)

The drawback of this robust function is there is no way to plot the
residuals in a QQ without turning them into a dataframe.
df <- data.frame(matrix(unlist(robust.model.reduced$residuals), nrow=length(robust.model.reduced$residuals), byrow=TRUE))
names(df)[1] <- ".resid"
head(df)
## .resid
## 1 -3.713970
## 2 -4.668544
## 3 -1.576744
## 4 2.908016
## 5 -1.365499
## 6 2.518236
ggplot(df, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
xlab("resid")

| ## Removing the Outliers |
| Next I tried to remove the outliers to see if they were having a
significant affect on the models. |
r student.pass <- student_mat %>% filter(G3 != 0) student.pass |
## # A tibble: 357 x 33 ## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason ## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 GP F 18 U GT3 A 4 4 at_home teach~ course ## 2 GP F 17 U GT3 T 1 1 at_home other course ## 3 GP F 15 U LE3 T 1 1 at_home other other ## 4 GP F 15 U GT3 T 4 2 health servi~ home ## 5 GP F 16 U GT3 T 3 3 other other home ## 6 GP M 16 U LE3 T 4 3 services other reput~ ## 7 GP M 16 U LE3 T 2 2 other other home ## 8 GP F 17 U GT3 A 4 4 other teach~ home ## 9 GP M 15 U LE3 A 3 2 services other home ## 10 GP M 15 U GT3 T 3 4 other other home ## # ... with 347 more rows, and 22 more variables: guardian <chr>, ## # traveltime <dbl>, studytime <dbl>, failures <dbl>, schoolsup <chr>, ## # famsup <chr>, paid <chr>, activities <chr>, nursery <chr>, higher <chr>, ## # internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>, ## # Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, ## # G3 <dbl>
### The New Full Model |
r model.full <- lm(G3 ~ romantic + freetime + goout + famrel + activities + sex + age +Walc + Dalc, data= student.pass) summary(model.full) |
## ## Call: ## lm(formula = G3 ~ romantic + freetime + goout + famrel + activities + ## sex + age + Walc + Dalc, data = student.pass) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6.9412 -2.1242 -0.2452 2.2083 8.0987 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 16.89594 2.36438 7.146 5.3e-12 *** ## romanticyes -0.08557 0.36671 -0.233 0.81563 ## freetime -0.01091 0.17931 -0.061 0.95150 ## goout -0.29155 0.17844 -1.634 0.10320 ## famrel 0.05066 0.19376 0.261 0.79392 ## activitiesyes 0.15759 0.34197 0.461 0.64520 ## sexM 0.96277 0.36097 2.667 0.00801 ** ## age -0.24615 0.13736 -1.792 0.07400 . ## Walc -0.35484 0.18434 -1.925 0.05506 . ## Dalc -0.14943 0.24291 -0.615 0.53886 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.132 on 347 degrees of freedom ## Multiple R-squared: 0.08204, Adjusted R-squared: 0.05823 ## F-statistic: 3.446 on 9 and 347 DF, p-value: 0.0004389 |
r vif(model.full) |
## romantic freetime goout famrel activities sex age ## 1.053421 1.193749 1.374550 1.068645 1.063611 1.183653 1.101094 ## Walc Dalc ## 2.067608 1.811555
### Model reduced |
Using backwards elimination I removed freetime,
romantic, famrel, activities,
dalc from the model and arrive at the final model
below. |
r model.reduced <- lm(G3 ~ goout + age + sex + Walc, data= student.pass) summary(model.reduced) |
## ## Call: ## lm(formula = G3 ~ goout + age + sex + Walc, data = student.pass) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.0856 -2.1074 -0.2585 2.1972 8.1284 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 17.3500 2.1932 7.911 3.34e-14 *** ## goout -0.2828 0.1698 -1.666 0.09668 . ## age -0.2647 0.1317 -2.010 0.04519 * ## sexM 0.9692 0.3424 2.831 0.00491 ** ## Walc -0.4326 0.1477 -2.928 0.00363 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.114 on 352 degrees of freedom ## Multiple R-squared: 0.07993, Adjusted R-squared: 0.06947 ## F-statistic: 7.645 on 4 and 352 DF, p-value: 6.467e-06 |
r vif(model.reduced) |
## goout age sex Walc ## 1.259462 1.024419 1.077554 1.344071 |
r ggplot(data = model.reduced, aes(x = .fitted, y = .resid)) + geom_point() + geom_jitter() + geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") + ylab("Residuals") |
 |
r ggplot(model.reduced, aes(sample = .resid)) + stat_qq() + stat_qq_line() + xlab("resid") |
 |
r ggplot(model.reduced, aes(x=.resid)) + geom_histogram(binwidth = 1) + xlab("resid") |
 |
Logistic
Change to a logistic model
student.mat.binary <- student_mat %>%
mutate(PassFail = ifelse(G3 >= 12,1,0))
student.mat.binary
## # A tibble: 395 x 34
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_home teach~ course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health servi~ home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reput~
## 7 GP M 16 U LE3 T 2 2 other other home
## 8 GP F 17 U GT3 A 4 4 other teach~ home
## 9 GP M 15 U LE3 A 3 2 services other home
## 10 GP M 15 U GT3 T 3 4 other other home
## # ... with 385 more rows, and 23 more variables: guardian <chr>,
## # traveltime <dbl>, studytime <dbl>, failures <dbl>, schoolsup <chr>,
## # famsup <chr>, paid <chr>, activities <chr>, nursery <chr>, higher <chr>,
## # internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>,
## # Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>,
## # G3 <dbl>, PassFail <dbl>
set.seed(811)
train.rows <- sample(nrow(student.mat.binary), nrow(student.mat.binary) * .7)
student_train <- student.mat.binary[train.rows,]
student_test <- student.mat.binary[-train.rows,]
student_fail <- table(student_train$PassFail) %>% prop.table
student_fail
##
## 0 1
## 0.5869565 0.4130435
student.model.training <- glm(PassFail ~ romantic + freetime + goout + Dalc + Walc + famrel + activities + sex + age, data = student_train, family = binomial(link = "logit"))
summary(student.model.training)
##
## Call:
## glm(formula = PassFail ~ romantic + freetime + goout + Dalc +
## Walc + famrel + activities + sex + age, family = binomial(link = "logit"),
## data = student_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4637 -1.0004 -0.8004 1.1898 1.8702
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.85351 1.88298 1.515 0.1297
## romanticyes 0.20676 0.27906 0.741 0.4587
## freetime 0.09797 0.14334 0.684 0.4943
## goout -0.26952 0.13777 -1.956 0.0504 .
## Dalc 0.01355 0.19429 0.070 0.9444
## Walc -0.16103 0.13873 -1.161 0.2458
## famrel 0.12801 0.14595 0.877 0.3804
## activitiesyes -0.03781 0.25872 -0.146 0.8838
## sexM 0.56205 0.27167 2.069 0.0386 *
## age -0.19001 0.10678 -1.779 0.0752 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 374.23 on 275 degrees of freedom
## Residual deviance: 355.43 on 266 degrees of freedom
## AIC: 375.43
##
## Number of Fisher Scoring iterations: 4
student_train$prediction <- predict(student.model.training, type = 'response', newdata = student_train)
ggplot(student_train, aes(x = prediction, color = PassFail == 1)) + geom_density()

student_train$prediction_class <- student_train$prediction > 0.5
tab <- table(student_train$prediction_class,
student_train$PassFail) %>% prop.table() %>% print()
##
## 0 1
## FALSE 0.4710145 0.2536232
## TRUE 0.1159420 0.1594203
accuracy <- sum(tab[1], tab[4]) / sum(tab[1:4])
precision <- tab[4] / sum(tab[4], tab[2])
sensitivity <- tab[4] / sum(tab[4], tab[3])
fscore <- (2 * (sensitivity * precision))/(sensitivity + precision)
specificity <- tab[1] / sum(tab[1], tab[2])
accuracy
## [1] 0.6304348
The overall accuracy of this model is 63.40%. Considering that
58.69565 % of the data set population has failed their final exam and
would be used for guessing.
#The true positive rate: the percentage of individuals the model correctly predicted would pass
sensitivity
## [1] 0.3859649
#the true negative: percentage the model correctly predicted would fail
specificity
## [1] 0.8024691
library(pROC)
## Warning: package 'pROC' was built under R version 4.1.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc.info <- roc(student_train$PassFail, student.model.training$fitted.values, plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

roc.df <- data.frame(
tpp=roc.info$sensitivities*100,
fpp=(1 - roc.info$specificities) * 100,
thresholds=roc.info$thresholds
)
head(roc.df)
## tpp fpp thresholds
## 1 100.00000 100.00000 -Inf
## 2 100.00000 99.38272 0.1368562
## 3 100.00000 98.76543 0.1444745
## 4 100.00000 98.14815 0.1551943
## 5 100.00000 97.53086 0.1699440
## 6 99.12281 97.53086 0.1808670
tail(roc.df)
## tpp fpp thresholds
## 262 4.385965 0 0.6765606
## 263 3.508772 0 0.7126668
## 264 2.631579 0 0.7409729
## 265 1.754386 0 0.7445702
## 266 0.877193 0 0.7617851
## 267 0.000000 0 Inf
Based on the ROC curve being close to the diagonal, above I think the
model is a poor classifier. Implying lower TPR rates (more false
negatives) and higher FPR rates (more false positives). This leads us to
the concept of AUC or Area Under the Curve discussed in point c.
For passing
student_pass <- table(student_test$PassFail) %>% prop.table()
student_pass
##
## 0 1
## 0.5966387 0.4033613
student_test$prediction <- predict(student.model.training, newdata = student_test, type = 'response')
student_test$prediciton_class <- student_test$prediction > 0.5
tab_test <- table(student_test$prediciton_class, student_test$PassFail) %>%
prop.table() %>% print()
##
## 0 1
## FALSE 0.4621849 0.2521008
## TRUE 0.1344538 0.1512605
accuracy <- sum(tab_test[1], tab_test[4]) / sum(tab_test[1:4])
precision <- tab_test[4] / sum(tab_test[4], tab_test[2])
sensitivity <- tab_test[4] / sum(tab_test[4], tab_test[3])
fscore <- (2 * (sensitivity * precision))/(sensitivity + precision)
specificity <- tab_test[1] / sum(tab_test[1], tab_test[2])
accuracy
## [1] 0.6134454
41% passed their final exam
sensitivity
## [1] 0.375
specificity
## [1] 0.7746479
Conclusion
This analysis shows that there is some relationship between the
social lives of these students in the study and to their final grades.
These models however, could be better tuned with the other variables
collected in the study.
Limitations of this analysis are that it is one sample size of only
two schools in one country. There also could have been missing data,
human error that can account for a lot of the values (the abundant 0s).
Having that outlier can skew the model and not be a clear indicator.
Another consideration to make is there are many other variables that
were not included that could have helped with the model tuning.
Based on the residuals QQ plot I would say that a linear regression
model is not the best fit for evaluating social life to the students
grades when there are grades of 0. However, if there are no failing
grades, based on the model without 0, I would say that a linear model is
a reasonablee fit.
Logistic regression is also reasonable to use, since it does not have
the restrictions of needing a normally distributed residuals, however
the models AUD showed to not be a strong classifier.
Resources
Frost, Jim. “Multicollinearity in Regression Analysis: Problems,
Detection, and Solutions.” Statistics By Jim, 22 July 2022, https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/#:~:text=The%20potential%20solutions%20include%20the,or%20partial%20least%20squares%20regression.
“Sensitivity vs Specificity.” Analysis & Separations from Technology
Networks, https://www.technologynetworks.com/analysis/articles/sensitivity-vs-specificity-318222.
Sigrist, Fabio. “Demystifying Roc and Precision-Recall Curves.” Medium,
Towards Data Science, 4 Mar. 2022, https://towardsdatascience.com/demystifying-roc-and-precision-recall-curves-d30f3fad2cbf.
Zach. “How to Perform Robust Regression in R (Step-by-Step).” Statology,
10 Apr. 2021, https://www.statology.org/robust-regression-in-r/.