Data Preparation

# load data
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggplot2)
library(curl)
library(psych)
library(car)

Abstract

Two portuguese schools recorded the demographics, grades, social and homelife of a sample of students. I wanted to see if I could find a model that could predict if having a social life relates to a students grade. Throughout this project, I used multiple forms of regression, linear, logistic, and robust to try and find the best model.

Research question

Does having a social life predict final grade?

Cases

What are the cases, and how many are there?

There are 649 cases that represent students and their achievements in secondary education in two Portuguese schools.

Data collection

Describe the method of data collection.

The data was collected using school reports and questionnaires.

Type of study

What type of study is this (observational/experiment)?

This is observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data was collected by University of Minho and the dataset can be found here: Source: https://archive.ics.uci.edu/ml/datasets/Student+Performance

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is the final grade of each student. It is quantitative.

Independent Variable(s)

The independent variable is the variables that describe the students social life. To answer this questions the variables I classify as describing a students social life is their activities, romantic, family relationship, free time, going out and alcohol consumption. They are all qualitative.

Relevant summary statistics

student_mat_csv <- "https://raw.githubusercontent.com/moiyajosephs/Data606-Final/main/student-mat.csv"
student_mat <- read_delim(curl(student_mat_csv),delim = ";")

student_por_csv <- "https://raw.githubusercontent.com/moiyajosephs/Data606-Final/main/student-por.csv"
student_por <- read_delim(curl(student_por_csv),delim = ";")

## Rows: 649 Columns: 33
## -- Column specification --------------------------------------------------------
## Delimiter: ";"
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (16): age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(student_mat)

##     school              sex                 age         address         
##  Length:395         Length:395         Min.   :15.0   Length:395        
##  Class :character   Class :character   1st Qu.:16.0   Class :character  
##  Mode  :character   Mode  :character   Median :17.0   Mode  :character  
##                                        Mean   :16.7                     
##                                        3rd Qu.:18.0                     
##                                        Max.   :22.0                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:395         Length:395         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :2.000  
##                                        Mean   :2.749   Mean   :2.522  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:395        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.448   Mean   :2.035   Mean   :0.3342                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:395         Length:395         Length:395         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.944  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      health         absences            G1              G2       
##  Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
##  1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
##  Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
##  Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
##  3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
##        G3       
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.42  
##  3rd Qu.:14.00  
##  Max.   :20.00

Attributes for student-mat.csv (Math course): 1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira) 2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male) 3 age - student’s age (numeric: from 15 to 22) 4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural) 5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3) 6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart) 7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 â€“ 5th to 9th grade, 3 â€“ secondary education or 4 â€“ higher education) 8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 â€“ 5th to 9th grade, 3 â€“ secondary education or 4 â€“ higher education) 9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’) 10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’) 11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’) 12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’) 13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour) 14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) 15 failures - number of past class failures (numeric: n if 1<=n<3, else 4) 16 schoolsup - extra educational support (binary: yes or no) 17 famsup - family educational support (binary: yes or no) 18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no) 19 activities - extra-curricular activities (binary: yes or no) 20 nursery - attended nursery school (binary: yes or no) 21 higher - wants to take higher education (binary: yes or no) 22 internet - Internet access at home (binary: yes or no) 23 romantic - with a romantic relationship (binary: yes or no) 24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent) 25 freetime - free time after school (numeric: from 1 - very low to 5 - very high) 26 goout - going out with friends (numeric: from 1 - very low to 5 - very high) 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high) 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) 29 health - current health status (numeric: from 1 - very bad to 5 - very good) 30 absences - number of school absences (numeric: from 0 to 93)

these grades are related with the course subject, Math or Portuguese:

31 G1 - first period grade (numeric: from 0 to 20) 31 G2 - second period grade (numeric: from 0 to 20) 32 G3 - final grade (numeric: from 0 to 20, output target)

ggplot(student_mat, aes(sex)) + geom_bar()

ggplot(student_mat, aes(G3)) + geom_bar() + facet_wrap(vars(sex))

ggplot(student_mat, aes(G3)) + geom_bar() + facet_grid(vars(school), vars(sex))

The Romance Model

Romantic

Students reply either yes or no if they are in a romantic relationship, indicated by the romantic variable.

ggplot(student_mat, aes(romantic)) + geom_bar()

romantic.model <- lm(G3 ~ romantic + sex + age, data = student_mat)
summary(romantic.model)

## 
## Call:
## lm(formula = G3 ~ romantic + sex + age, data = student_mat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0331  -1.9794   0.3268   2.9669   9.3268 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.9147     3.0017   6.301 7.96e-10 ***
## romanticyes  -0.9438     0.4884  -1.932  0.05403 .  
## sexM          0.8196     0.4553   1.800  0.07261 .  
## age          -0.5134     0.1799  -2.854  0.00455 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.494 on 391 degrees of freedom
## Multiple R-squared:  0.045,  Adjusted R-squared:  0.03768 
## F-statistic: 6.142 on 3 and 391 DF,  p-value: 0.0004341

ggplot(student_mat, aes(x= G3)) +
  geom_histogram(bins = 10) +
  facet_grid(vars(romantic), vars(sex))

Romantic is statistically significant according to the anova test.

summary_stats <- student_mat %>%
  group_by(romantic) %>%
  summarise(mean_by_group = mean(G3))
summary_stats

## # A tibble: 2 x 2
##   romantic mean_by_group
##   <chr>            <dbl>
## 1 no               10.8 
## 2 yes               9.58

describeBy(student_mat$G3, 
           group = student_mat$romantic, mat=TRUE)

##     item group1 vars   n      mean       sd median   trimmed    mad min max
## X11    1     no    1 263 10.836502 4.385946     11 11.113744 4.4478   0  20
## X12    2    yes    1 132  9.575758 4.856916     11  9.971698 3.7065   0  18
##     range       skew   kurtosis        se
## X11    20 -0.6186634  0.5452026 0.2704490
## X12    18 -0.8212904 -0.2272406 0.4227403

Analyzing the r^2

ggplot(data = romantic.model, aes(x = .fitted, y = .resid)) +
  geom_point() + 
  geom_jitter()+
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

The Activities Model

Students reply yes or no if they have extra curricular activities, indicated by the activities variables.

ggplot(student_mat, aes(x= G3)) +
  geom_histogram(bins = 10) +
  facet_wrap(vars(activities))

activities.model <- lm(G3 ~ activities + sex + age, data = student_mat)
summary(activities.model)

## 
## Call:
## lm(formula = G3 ~ activities + sex + age, data = student_mat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9186  -1.8871   0.2918   3.0814   9.7181 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.60836    3.04399   6.442 3.48e-10 ***
## activitiesyes -0.09467    0.45909  -0.206  0.83673    
## sexM           0.91567    0.45740   2.002  0.04599 *  
## age           -0.57369    0.17926  -3.200  0.00148 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.515 on 391 degrees of freedom
## Multiple R-squared:  0.03599,    Adjusted R-squared:  0.02859 
## F-statistic: 4.866 on 3 and 391 DF,  p-value: 0.002466

By looking at the summary of the activities.model we can see that activites estimate is not statistically significant as indicated by p-value which is greater than 0.05. As well as the F-statistic and p value given, which is also greater than 0.05.

ggplot(data = activities.model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

Freetime

Freetime after school rated from 1 very low to 5 very high.

ggplot(student_mat, aes(freetime)) + geom_bar()

ggplot(student_mat, aes(x= G3)) +
  geom_histogram(bins = 10) +
  facet_wrap(vars(activities))

freetime.model <- lm(G3 ~ freetime + sex + age, data = student_mat)
summary(freetime.model)

## 
## Call:
## lm(formula = G3 ~ freetime + sex + age, data = student_mat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9281  -1.8236   0.2774   3.0719   9.8027 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.6300     3.0666   6.401 4.42e-10 ***
## freetime     -0.0472     0.2346  -0.201  0.84064    
## sexM          0.9291     0.4688   1.982  0.04823 *  
## age          -0.5691     0.1784  -3.190  0.00154 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.515 on 391 degrees of freedom
## Multiple R-squared:  0.03598,    Adjusted R-squared:  0.02859 
## F-statistic: 4.865 on 3 and 391 DF,  p-value: 0.002469

Not significant!

Family Relationship

famrel indicated the quality of family relationships on a scale from (numeric: from 1 - very bad to 5 - excellent)

ggplot(student_mat, aes(famrel)) + geom_bar()

famrel.model <- lm(G3 ~ famrel+ sex + age, data = student_mat)
summary(famrel.model)

## 
## Call:
## lm(formula = G3 ~ famrel + sex + age, data = student_mat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1556  -1.7315   0.2844   2.8738   9.7415 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.6010     3.1070   5.987 4.85e-09 ***
## famrel        0.2782     0.2542   1.095  0.27436    
## sexM          0.8763     0.4554   1.924  0.05506 .  
## age          -0.5809     0.1784  -3.257  0.00123 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.509 on 391 degrees of freedom
## Multiple R-squared:  0.03883,    Adjusted R-squared:  0.03145 
## F-statistic: 5.265 on 3 and 391 DF,  p-value: 0.001432

No significance

Going Out

goout indicates if students go out with friends rated from (numeric: from 1 - very low to 5 - very high)

ggplot(student_mat, aes(goout)) + geom_bar()

goout.model <- lm(G3 ~ goout+ sex + age, data = student_mat)
summary(goout.model)

## 
## Call:
## lm(formula = G3 ~ goout + sex + age, data = student_mat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3706  -1.8464   0.1759   2.9101  10.1759 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.0801     2.9866   6.723 6.31e-11 ***
## goout        -0.5058     0.2051  -2.466  0.01409 *  
## sexM          0.9961     0.4532   2.198  0.02854 *  
## age          -0.5129     0.1785  -2.874  0.00428 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.481 on 391 degrees of freedom
## Multiple R-squared:  0.05065,    Adjusted R-squared:  0.04337 
## F-statistic: 6.954 on 3 and 391 DF,  p-value: 0.0001437

Significance!!

Using Multiple Regression

Upon further implementation, I realized there may be more than one explanatory variable for the output. To consider the variable in conjunction with the G3 variable, I will be using multiple linear regression, and establishing my best naively implemented model to predict G3.

When using many variables in a linear regression, there may also exist a collinearity between them. To determine if there are any such variables I used vif to check for collinearity.

Linear Regression for full model of social activitiess

model.full <- lm(G3 ~ romantic + freetime + goout + Walc + famrel + activities+ sex + age + Dalc, data= student_mat)
summary(model.full)

## 
## Call:
## lm(formula = G3 ~ romantic + freetime + goout + Walc + famrel + 
##     activities + sex + age + Dalc, data = student_mat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9606  -1.8145   0.4119   2.9367   9.3786 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.39330    3.17914   5.786  1.5e-08 ***
## romanticyes   -0.90796    0.48930  -1.856   0.0643 .  
## freetime       0.11460    0.24727   0.463   0.6433    
## goout         -0.53579    0.23438  -2.286   0.0228 *  
## Walc           0.06730    0.25043   0.269   0.7883    
## famrel         0.25665    0.26037   0.986   0.3249    
## activitiesyes -0.04855    0.46062  -0.105   0.9161    
## sexM           0.89641    0.49223   1.821   0.0694 .  
## age           -0.45677    0.18371  -2.486   0.0133 *  
## Dalc          -0.21582    0.34085  -0.633   0.5270    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.483 on 385 degrees of freedom
## Multiple R-squared:  0.06422,    Adjusted R-squared:  0.04234 
## F-statistic: 2.936 on 9 and 385 DF,  p-value: 0.002216

vif(model.full)

##   romantic   freetime      goout       Walc     famrel activities        sex 
##   1.046795   1.195705   1.334533   2.038981   1.068367   1.042021   1.186951 
##        age       Dalc 
##   1.077172   1.806766

Through backwards selection I removed activities, Walc, freetime and Dalc. The model with the highest R adjusted is shown below.

Model reduced linear model equation is:

y = 18.6197 -0.9137 * romanticyes -0.5201 * goout + 0.2802 * famrel + 0.8842 * sexM - 0.4675 * age

The model has an adjusted R squared of 0.05082.

model.reduced <- lm(G3 ~ romantic + famrel + goout + sex + age, data= student_mat)
summary(model.reduced)

## 
## Call:
## lm(formula = G3 ~ romantic + famrel + goout + sex + age, data = student_mat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0110  -1.7786   0.3513   2.8961   9.4753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.6197     3.0914   6.023 3.97e-09 ***
## romanticyes  -0.9137     0.4862  -1.879  0.06097 .  
## famrel        0.2802     0.2526   1.110  0.26786    
## goout        -0.5201     0.2046  -2.542  0.01141 *  
## sexM          0.8842     0.4542   1.947  0.05229 .  
## age          -0.4675     0.1805  -2.590  0.00995 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.464 on 389 degrees of freedom
## Multiple R-squared:  0.06286,    Adjusted R-squared:  0.05082 
## F-statistic: 5.219 on 5 and 389 DF,  p-value: 0.0001191

vif(model.reduced)

## romantic   famrel    goout      sex      age 
## 1.042849 1.014230 1.025925 1.019550 1.048983

When removing the other variables the most ideal variables are whether the student goes out or if they are in a romantic relationship. According to this model, as the grade increases, the romanticyes and goout variable will decrease by the given estimates. Indicating that the less the student does/has these things the higher their grade will be.

Model	Adjusted R
model.full	0.04234
model.reduced	0.05082

Model reduced has a better adjusted R but the increase is not by a significant amount.

ggplot(data = model.reduced, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(model.reduced, aes(x=.resid)) +
  geom_histogram(binwidth = 1) +
  xlab("resid")

ggplot(model.reduced, aes(sample = .resid)) +
    stat_qq_line() +
 stat_qq() +
  xlab("resid")

Plotting the residuals show that the residuals are not normally distributed. Likely due to the outliers of 0 values in the data as well.

Using robust linear regression

In trying to find the best model to fit this data, I researched models that perform better than lm when there are outliers. In my search I discovered the lmrob function from the library MASS.

By using the lmrob model the error is lower than the lm.

Robust Full

library(robustbase)

## Warning: package 'robustbase' was built under R version 4.1.3

res <- lmrob(G3 ~ romantic + freetime + goout + Walc + Dalc + famrel + activities + sex + age,
  data=student_mat)
summary(res)

## 
## Call:
## lmrob(formula = G3 ~ romantic + freetime + goout + Walc + Dalc + famrel + 
##     activities + sex + age, data = student_mat)
##  \--> method = "MM"
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8795  -2.2319  -0.1279   2.3922   9.0019 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.367015   3.056071   6.337 6.54e-10 ***
## romanticyes   -0.315221   0.503447  -0.626   0.5316    
## freetime       0.083992   0.239092   0.351   0.7256    
## goout         -0.507998   0.250583  -2.027   0.0433 *  
## Walc          -0.177547   0.235605  -0.754   0.4516    
## Dalc          -0.125698   0.246057  -0.511   0.6097    
## famrel         0.116678   0.265095   0.440   0.6601    
## activitiesyes -0.008661   0.422479  -0.021   0.9837    
## sexM           1.057875   0.442807   2.389   0.0174 *  
## age           -0.441836   0.174074  -2.538   0.0115 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Robust residual standard error: 3.611 
## Multiple R-squared:  0.07457,    Adjusted R-squared:  0.05294 
## Convergence in 14 IRWLS iterations
## 
## Robustness weights: 
##  28 weights are ~= 1. The remaining 367 ones are summarized as
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1767  0.8479  0.9556  0.8699  0.9868  0.9989 
## Algorithmic parameters: 
##        tuning.chi                bb        tuning.psi        refine.tol 
##         1.548e+00         5.000e-01         4.685e+00         1.000e-07 
##           rel.tol         scale.tol         solve.tol       eps.outlier 
##         1.000e-07         1.000e-10         1.000e-07         2.532e-04 
##             eps.x warn.limit.reject warn.limit.meanrw 
##         4.002e-11         5.000e-01         5.000e-01 
##      nResample         max.it       best.r.s       k.fast.s          k.max 
##            500             50              2              1            200 
##    maxit.scale      trace.lev            mts     compute.rd fast.s.large.n 
##            200              0           1000              0           2000 
##                   psi           subsampling                   cov 
##            "bisquare"         "nonsingular"         ".vcov.avar1" 
## compute.outlier.stats 
##                  "SM" 
## seed : int(0)

Through backwards selection I removed famrel, Dalc, freetime and `romantic``. The model with the highest R adjusted is shown below. famrel

robust.model.reduced <- lmrob(G3 ~   goout + Walc   + sex + age,
  data=student_mat)
summary(robust.model.reduced)

## 
## Call:
## lmrob(formula = G3 ~ goout + Walc + sex + age, data = student_mat)
##  \--> method = "MM"
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -13.20825  -2.18920  -0.05832   2.33241   9.28603 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.3540     2.6585   7.656 1.52e-13 ***
## goout        -0.4857     0.2358  -2.060  0.04007 *  
## Walc         -0.2576     0.1896  -1.359  0.17497    
## sexM          1.1163     0.4253   2.625  0.00901 ** 
## age          -0.4689     0.1632  -2.874  0.00428 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Robust residual standard error: 3.605 
## Multiple R-squared:  0.07392,    Adjusted R-squared:  0.06443 
## Convergence in 13 IRWLS iterations
## 
## Robustness weights: 
##  31 weights are ~= 1. The remaining 364 ones are summarized as
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1508  0.8518  0.9533  0.8677  0.9870  0.9990 
## Algorithmic parameters: 
##        tuning.chi                bb        tuning.psi        refine.tol 
##         1.548e+00         5.000e-01         4.685e+00         1.000e-07 
##           rel.tol         scale.tol         solve.tol       eps.outlier 
##         1.000e-07         1.000e-10         1.000e-07         2.532e-04 
##             eps.x warn.limit.reject warn.limit.meanrw 
##         4.002e-11         5.000e-01         5.000e-01 
##      nResample         max.it       best.r.s       k.fast.s          k.max 
##            500             50              2              1            200 
##    maxit.scale      trace.lev            mts     compute.rd fast.s.large.n 
##            200              0           1000              0           2000 
##                   psi           subsampling                   cov 
##            "bisquare"         "nonsingular"         ".vcov.avar1" 
## compute.outlier.stats 
##                  "SM" 
## seed : int(0)

plot(robust.model.reduced$residuals)

The drawback of this robust function is there is no way to plot the residuals in a QQ without turning them into a dataframe.

df <- data.frame(matrix(unlist(robust.model.reduced$residuals), nrow=length(robust.model.reduced$residuals), byrow=TRUE))
names(df)[1] <- ".resid" 
head(df)

##      .resid
## 1 -3.713970
## 2 -4.668544
## 3 -1.576744
## 4  2.908016
## 5 -1.365499
## 6  2.518236

ggplot(df, aes(sample = .resid)) +
 stat_qq() +
    stat_qq_line() +
  xlab("resid")

## Removing the Outliers

Next I tried to remove the outliers to see if they were having a significant affect on the models.

r student.pass <- student_mat %>% filter(G3 != 0) student.pass

## # A tibble: 357 x 33 ##    school sex     age address famsize Pstatus  Medu  Fedu Mjob     Fjob   reason ##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr>    <chr>  <chr> ##  1 GP     F        18 U       GT3     A           4     4 at_home  teach~ course ##  2 GP     F        17 U       GT3     T           1     1 at_home  other  course ##  3 GP     F        15 U       LE3     T           1     1 at_home  other  other ##  4 GP     F        15 U       GT3     T           4     2 health   servi~ home ##  5 GP     F        16 U       GT3     T           3     3 other    other  home ##  6 GP     M        16 U       LE3     T           4     3 services other  reput~ ##  7 GP     M        16 U       LE3     T           2     2 other    other  home ##  8 GP     F        17 U       GT3     A           4     4 other    teach~ home ##  9 GP     M        15 U       LE3     A           3     2 services other  home ## 10 GP     M        15 U       GT3     T           3     4 other    other  home ## # ... with 347 more rows, and 22 more variables: guardian <chr>, ## #   traveltime <dbl>, studytime <dbl>, failures <dbl>, schoolsup <chr>, ## #   famsup <chr>, paid <chr>, activities <chr>, nursery <chr>, higher <chr>, ## #   internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>, ## #   Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, ## #   G3 <dbl>

### The New Full Model

r model.full <- lm(G3 ~ romantic + freetime + goout + famrel + activities + sex + age +Walc + Dalc, data= student.pass) summary(model.full)

## ## Call: ## lm(formula = G3 ~ romantic + freetime + goout + famrel + activities + ##     sex + age + Walc + Dalc, data = student.pass) ## ## Residuals: ##     Min      1Q  Median      3Q     Max ## -6.9412 -2.1242 -0.2452  2.2083  8.0987 ## ## Coefficients: ##               Estimate Std. Error t value Pr(>|t|) ## (Intercept)   16.89594    2.36438   7.146  5.3e-12 *** ## romanticyes   -0.08557    0.36671  -0.233  0.81563 ## freetime      -0.01091    0.17931  -0.061  0.95150 ## goout         -0.29155    0.17844  -1.634  0.10320 ## famrel         0.05066    0.19376   0.261  0.79392 ## activitiesyes  0.15759    0.34197   0.461  0.64520 ## sexM           0.96277    0.36097   2.667  0.00801 ** ## age           -0.24615    0.13736  -1.792  0.07400 . ## Walc          -0.35484    0.18434  -1.925  0.05506 . ## Dalc          -0.14943    0.24291  -0.615  0.53886 ## --- ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.132 on 347 degrees of freedom ## Multiple R-squared:  0.08204,    Adjusted R-squared:  0.05823 ## F-statistic: 3.446 on 9 and 347 DF,  p-value: 0.0004389

r vif(model.full)

##   romantic   freetime      goout     famrel activities        sex        age ##   1.053421   1.193749   1.374550   1.068645   1.063611   1.183653   1.101094 ##       Walc       Dalc ##   2.067608   1.811555

### Model reduced

Using backwards elimination I removed freetime, romantic, famrel, activities, dalc from the model and arrive at the final model below.

r model.reduced <- lm(G3 ~ goout + age + sex + Walc, data= student.pass) summary(model.reduced)

## ## Call: ## lm(formula = G3 ~ goout + age + sex + Walc, data = student.pass) ## ## Residuals: ##     Min      1Q  Median      3Q     Max ## -7.0856 -2.1074 -0.2585  2.1972  8.1284 ## ## Coefficients: ##             Estimate Std. Error t value Pr(>|t|) ## (Intercept)  17.3500     2.1932   7.911 3.34e-14 *** ## goout        -0.2828     0.1698  -1.666  0.09668 . ## age          -0.2647     0.1317  -2.010  0.04519 * ## sexM          0.9692     0.3424   2.831  0.00491 ** ## Walc         -0.4326     0.1477  -2.928  0.00363 ** ## --- ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.114 on 352 degrees of freedom ## Multiple R-squared:  0.07993,    Adjusted R-squared:  0.06947 ## F-statistic: 7.645 on 4 and 352 DF,  p-value: 6.467e-06

r vif(model.reduced)

## goout age sex Walc ## 1.259462 1.024419 1.077554 1.344071

r ggplot(data = model.reduced, aes(x = .fitted, y = .resid)) + geom_point() + geom_jitter() + geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") + ylab("Residuals")

r ggplot(model.reduced, aes(sample = .resid)) + stat_qq() + stat_qq_line() + xlab("resid")

r ggplot(model.reduced, aes(x=.resid)) + geom_histogram(binwidth = 1) + xlab("resid")

Logistic

Change to a logistic model

student.mat.binary <- student_mat %>%
      mutate(PassFail = ifelse(G3 >= 12,1,0))
student.mat.binary

## # A tibble: 395 x 34
##    school sex     age address famsize Pstatus  Medu  Fedu Mjob     Fjob   reason
##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr>    <chr>  <chr> 
##  1 GP     F        18 U       GT3     A           4     4 at_home  teach~ course
##  2 GP     F        17 U       GT3     T           1     1 at_home  other  course
##  3 GP     F        15 U       LE3     T           1     1 at_home  other  other 
##  4 GP     F        15 U       GT3     T           4     2 health   servi~ home  
##  5 GP     F        16 U       GT3     T           3     3 other    other  home  
##  6 GP     M        16 U       LE3     T           4     3 services other  reput~
##  7 GP     M        16 U       LE3     T           2     2 other    other  home  
##  8 GP     F        17 U       GT3     A           4     4 other    teach~ home  
##  9 GP     M        15 U       LE3     A           3     2 services other  home  
## 10 GP     M        15 U       GT3     T           3     4 other    other  home  
## # ... with 385 more rows, and 23 more variables: guardian <chr>,
## #   traveltime <dbl>, studytime <dbl>, failures <dbl>, schoolsup <chr>,
## #   famsup <chr>, paid <chr>, activities <chr>, nursery <chr>, higher <chr>,
## #   internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>,
## #   Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>,
## #   G3 <dbl>, PassFail <dbl>

set.seed(811)
train.rows <- sample(nrow(student.mat.binary), nrow(student.mat.binary) * .7)
student_train <- student.mat.binary[train.rows,]
student_test <- student.mat.binary[-train.rows,]

student_fail <- table(student_train$PassFail) %>% prop.table
student_fail

## 
##         0         1 
## 0.5869565 0.4130435

student.model.training <- glm(PassFail ~ romantic + freetime + goout + Dalc + Walc + famrel + activities + sex + age, data = student_train, family = binomial(link = "logit"))
summary(student.model.training)

## 
## Call:
## glm(formula = PassFail ~ romantic + freetime + goout + Dalc + 
##     Walc + famrel + activities + sex + age, family = binomial(link = "logit"), 
##     data = student_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4637  -1.0004  -0.8004   1.1898   1.8702  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    2.85351    1.88298   1.515   0.1297  
## romanticyes    0.20676    0.27906   0.741   0.4587  
## freetime       0.09797    0.14334   0.684   0.4943  
## goout         -0.26952    0.13777  -1.956   0.0504 .
## Dalc           0.01355    0.19429   0.070   0.9444  
## Walc          -0.16103    0.13873  -1.161   0.2458  
## famrel         0.12801    0.14595   0.877   0.3804  
## activitiesyes -0.03781    0.25872  -0.146   0.8838  
## sexM           0.56205    0.27167   2.069   0.0386 *
## age           -0.19001    0.10678  -1.779   0.0752 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 374.23  on 275  degrees of freedom
## Residual deviance: 355.43  on 266  degrees of freedom
## AIC: 375.43
## 
## Number of Fisher Scoring iterations: 4

student_train$prediction <- predict(student.model.training, type = 'response', newdata = student_train)
ggplot(student_train, aes(x = prediction, color = PassFail == 1)) + geom_density()

student_train$prediction_class <- student_train$prediction > 0.5
tab <- table(student_train$prediction_class, 
             student_train$PassFail) %>% prop.table() %>% print()

##        
##                 0         1
##   FALSE 0.4710145 0.2536232
##   TRUE  0.1159420 0.1594203

accuracy <- sum(tab[1], tab[4]) / sum(tab[1:4])
precision <- tab[4] / sum(tab[4], tab[2])
sensitivity <- tab[4] / sum(tab[4], tab[3])
fscore <- (2 * (sensitivity * precision))/(sensitivity + precision)
specificity <- tab[1] / sum(tab[1], tab[2])

accuracy

## [1] 0.6304348

The overall accuracy of this model is 63.40%. Considering that 58.69565 % of the data set population has failed their final exam and would be used for guessing.

#The true positive rate: the percentage of individuals the model correctly predicted would pass 

sensitivity

## [1] 0.3859649

#the true negative: percentage the model correctly predicted would fail
specificity

## [1] 0.8024691

library(pROC)

## Warning: package 'pROC' was built under R version 4.1.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

roc.info <- roc(student_train$PassFail, student.model.training$fitted.values, plot = TRUE, print.auc = TRUE)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

roc.df <- data.frame(
  tpp=roc.info$sensitivities*100,
  fpp=(1 - roc.info$specificities) * 100,
  thresholds=roc.info$thresholds
)

head(roc.df)

##         tpp       fpp thresholds
## 1 100.00000 100.00000       -Inf
## 2 100.00000  99.38272  0.1368562
## 3 100.00000  98.76543  0.1444745
## 4 100.00000  98.14815  0.1551943
## 5 100.00000  97.53086  0.1699440
## 6  99.12281  97.53086  0.1808670

tail(roc.df)

##          tpp fpp thresholds
## 262 4.385965   0  0.6765606
## 263 3.508772   0  0.7126668
## 264 2.631579   0  0.7409729
## 265 1.754386   0  0.7445702
## 266 0.877193   0  0.7617851
## 267 0.000000   0        Inf

Based on the ROC curve being close to the diagonal, above I think the model is a poor classifier. Implying lower TPR rates (more false negatives) and higher FPR rates (more false positives). This leads us to the concept of AUC or Area Under the Curve discussed in point c.

For passing

student_pass <- table(student_test$PassFail) %>% prop.table()
student_pass

## 
##         0         1 
## 0.5966387 0.4033613

student_test$prediction <- predict(student.model.training, newdata = student_test, type = 'response')
student_test$prediciton_class <- student_test$prediction > 0.5
tab_test <- table(student_test$prediciton_class, student_test$PassFail) %>%
prop.table() %>% print()

##        
##                 0         1
##   FALSE 0.4621849 0.2521008
##   TRUE  0.1344538 0.1512605

accuracy <- sum(tab_test[1], tab_test[4]) / sum(tab_test[1:4])
precision <- tab_test[4] / sum(tab_test[4], tab_test[2])
sensitivity <- tab_test[4] / sum(tab_test[4], tab_test[3])
fscore <- (2 * (sensitivity * precision))/(sensitivity + precision)
specificity <- tab_test[1] / sum(tab_test[1], tab_test[2])

accuracy

## [1] 0.6134454

41% passed their final exam

sensitivity

## [1] 0.375

specificity

## [1] 0.7746479

Conclusion

This analysis shows that there is some relationship between the social lives of these students in the study and to their final grades. These models however, could be better tuned with the other variables collected in the study.

Limitations of this analysis are that it is one sample size of only two schools in one country. There also could have been missing data, human error that can account for a lot of the values (the abundant 0s). Having that outlier can skew the model and not be a clear indicator. Another consideration to make is there are many other variables that were not included that could have helped with the model tuning.

Based on the residuals QQ plot I would say that a linear regression model is not the best fit for evaluating social life to the students grades when there are grades of 0. However, if there are no failing grades, based on the model without 0, I would say that a linear model is a reasonablee fit.

Logistic regression is also reasonable to use, since it does not have the restrictions of needing a normally distributed residuals, however the models AUD showed to not be a strong classifier.

Resources

Frost, Jim. “Multicollinearity in Regression Analysis: Problems, Detection, and Solutions.” Statistics By Jim, 22 July 2022, https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/#:~:text=The%20potential%20solutions%20include%20the,or%20partial%20least%20squares%20regression. “Sensitivity vs Specificity.” Analysis & Separations from Technology Networks, https://www.technologynetworks.com/analysis/articles/sensitivity-vs-specificity-318222. Sigrist, Fabio. “Demystifying Roc and Precision-Recall Curves.” Medium, Towards Data Science, 4 Mar. 2022, https://towardsdatascience.com/demystifying-roc-and-precision-recall-curves-d30f3fad2cbf. Zach. “How to Perform Robust Regression in R (Step-by-Step).” Statology, 10 Apr. 2021, https://www.statology.org/robust-regression-in-r/.

Predicting Student Grades Based on Social Life

Moiya Josephs

2022-10-18