Linear Regression on Student Grade Prediction
Overview
In this project, we will predict students’ final grades based on several parameters. Using linear regression, we want to understand the relationships among the variables, especially between the final score (the G3 variable in this dataset) and the other variables, and to predict the score based on historical data. The dataset we will use, taken from Kaggle, contains 33 attributes for 395 entries. If you are interested, the dataset for this project can be accessed here.
Data Preparation
Load required packages.
library(car)
library(GGally)
library(MLmetrics)
library(tidyverse)
Load the dataset.
grade <- read.csv("data_input/student-mat.csv")
glimpse(grade)
## Rows: 395
## Columns: 33
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason <chr> "course", "course", "other", "home", "home", "reputation...
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age (numeric: from 15 to 22)
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime - home to school travel time (numeric: 1: < 15 minutes, 2: 15-30 minutes, 3: 30 minutes - 1 hour, 4: > 1 hour)
studytime - weekly study time (numeric: 1: < 2 hours, 2: 2-5 hours, 3: 5-10 hours, 4: > 10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)
Data Wrangling
Our target in this case is G3, the final grade. From the description of each variable above, two predictors have a strong and direct correlation with G3, namely G1 and G2. Because these two variables are essentially earlier measurements of the same outcome, keeping them would bias our model, so we need to remove G1 and G2.
ggpairs(grade %>%
select(c(G1,G2,G3)))
grade <- grade %>%
mutate_at(vars(-age,-G3,-absences),.funs = as.factor) %>%
select(-c(G1,G2))
str(grade)
## 'data.frame': 395 obs. of 31 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 5 4 5 3 5 4 4 ...
## $ Fedu : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 3 4 4 3 5 3 5 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: Factor w/ 4 levels "1","2","3","4": 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : Factor w/ 4 levels "1","2","3","4": 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : Factor w/ 4 levels "0","1","2","3": 1 1 4 1 1 1 1 1 1 1 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : Factor w/ 5 levels "1","2","3","4",..: 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : Factor w/ 5 levels "1","2","3","4",..: 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : Factor w/ 5 levels "1","2","3","4",..: 1 1 3 1 2 2 1 1 1 1 ...
## $ health : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
Checking for missing values.
colSums(is.na(grade))
## school sex age address famsize Pstatus Medu
## 0 0 0 0 0 0 0
## Fedu Mjob Fjob reason guardian traveltime studytime
## 0 0 0 0 0 0 0
## failures schoolsup famsup paid activities nursery higher
## 0 0 0 0 0 0 0
## internet romantic famrel freetime goout Dalc Walc
## 0 0 0 0 0 0 0
## health absences G3
## 0 0 0
Checking for duplicated data.
sum(duplicated(grade))
## [1] 0
Checking the distribution of the target variable.
plot(as.factor(grade$G3),col = "lightBlue",xlab="Final Score (G3)",ylab="Quantity")
From the distribution plot above, something seems off with our target data: apart from the unusually high number of students scoring 0, the distribution looks roughly normal, as expected. Perhaps the value 0 was used in place of null, or perhaps students who did not appear for the exam, or were not allowed to sit for it, were marked as 0; we cannot be sure. Since the missing-value check above found no null values, a grade of 0 probably does not stand for null after all. Alternatively, the zeros might be explained by the other available variables, so let’s check that by comparing zero and non-zero scores against several predictors.
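The comparison plots below rely on a helper data frame, grade_compare, which is not shown in the original code. Here is a minimal sketch of how it could be built, assuming the type column simply labels zero versus non-zero final scores (both the column name and its labels are assumptions):
grade_compare <- grade %>%
mutate(type = ifelse(G3 == 0, "zero", "non-zero")) # assumed flag separating zero and non-zero G3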
ggplot(data = grade_compare,aes(x=G3,y=failures))+
geom_jitter(aes(col=type))+
labs(x="Final Score (G3)",y="Failures")ggplot(data = grade_compare,aes(x=G3,y=studytime))+
geom_jitter(aes(col=type))+
labs(x="Final Score (G3)",y="Study Time")ggplot(data = grade_compare,aes(x=G3,y=Walc))+
geom_jitter(aes(col=type))+
labs(x="Final Score (G3)",y="Weekend Alcohol Consumption")ggplot(data = grade_compare,aes(x=G3,y=health))+
geom_jitter(aes(col=type))+
labs(x="Final Score (G3)",y="Health")After checking the correlation between 0 and non-zero values with several variables, it turns out that there is no difference, which means that the value of 0 has no explanation for the reasons for the available variables. Because of this, we need to remove zero values so that our model has good performance later.
grade <- grade %>%
filter(G3!=0)
Exploratory Data Analysis
Because there are many variables, we will explore the relationship between the target variable and several predictors, so that we know which variables have a noticeable effect on the target.
ggplot(data = grade,aes(x=age,y=G3))+
geom_violin(aes(fill=sex),alpha=.9)+
labs(x="Age",y="Final Score (G3)")From the diagram of the relationship between age, gender and the final score, it can be seen that the highest and lowest scores of male students are higher than female, and the median value of the age of male students is greater than female. However, the median value of the final grades of both male and female students was almost the same.
ggplot(data = grade,aes(x = studytime,y=G3,fill=studytime))+
geom_boxplot(show.legend = F)+
labs(x="Study Time",y="Final Score (G3)")The studytime predictor consists of 4 classes classified based on the length of study time in a week, and from the diagram above, it can be seen that there is a slight tendency for an increase in the mean final grade along with the increase in weekly study hours.
ggplot(data = grade,aes(x=G3))+
geom_bar(aes(fill=romantic),alpha=.9)+
labs(y="Amount",x="Final Score (G3)")The number of students who are in love relationships is less than students who are not. From the diagram above, the distribution of the final grades of students, whether in love relationships or not, is somewhat similar, but for high final grades, it tends to be owned by many students who are not in a loving relationship.
ggplot(data = grade,aes(x = failures,y=G3,fill=failures))+
geom_boxplot(show.legend = F)+
labs(x="Failures",y="Final Score (G3)")For the correlation between the number of failures in the previous class and the final grade, there is a clear relationship, where the more students failed in the previous class, the smaller the final score obtained.
ggplot(data = grade,aes(x = schoolsup,y=G3,fill=schoolsup))+
geom_boxplot(show.legend = F)+
labs(x="Extra Educational Support",y="Final Score (G3)")There is something unique from the diagram above. groups of students who received extra educational support, it turned out that most had lower final scores than most students who did not receive extra educational support.
ggplot(data = grade,aes(x=G3))+
geom_density(aes(fill=internet),alpha=.9)+
labs(y="Frequency",x="Final Score (G3)")Internet access at home has a slight role in increasing students’ final grades, as seen in the diagram above, even though their median scores are almost the same.
ggplot(data = grade,aes(x = freetime,y=G3,fill=freetime))+
geom_boxplot(show.legend = F)+
labs(x="Free Time",y="Final Score (G3)")The amount of free time after school does not guarantee that students will get a high final score, seen from the diagram above, there is no visible relationship between the addition of free time to the final score.
ggplot(data = grade,aes(x=G3))+
geom_density(aes(fill=higher),alpha=.9)+
labs(y="Frequency",x="Final Score (G3)")The desire to continue to higher education can actually trigger students to be more active in studying so that the final grades obtained are also good, as seen from the diagram above, students who want to continue to higher education tend to have better final grades than those who don’t want to continue to higher education.
Cross Validation
Before we build the model, we need to split the data into a training set and a test set. We will use the training set to fit the linear regression model, and the test set as a comparison to assess the model’s performance. We will use 80% of the data for training and the rest for testing.
RNGkind(sample.kind = "Rounding")
set.seed(23)
intrain <- sample(nrow(grade),nrow(grade)*.8)
grade_train <- grade[intrain,]
grade_test <- grade[-intrain,]
Modelling
We will create several linear regression models with G3 as the target variable. The models come from several approaches: some from my own understanding or estimation, and some from stepwise selection.
model_grade_all <- lm(formula = G3~.,data = grade_train)
model_grade_none <- lm(formula = G3~1,data = grade_train)
model_grade_selected <- lm(formula = G3~failures+studytime+higher+schoolsup+internet+goout+romantic,data = grade_train)
model_grade_selected2 <- lm(formula = G3~Mjob+Fjob+studytime+failures+schoolsup+paid+health+absences,data = grade_train)
model_grade_backward <- step(object = model_grade_all,direction = "backward",trace = F)
model_both_forward <- step(object = model_grade_none,direction = "both",scope = list(lower=model_grade_none,upper=model_grade_all),trace = F)
After making several models, let’s compare them.
performance::compare_performance(model_grade_all,model_grade_backward,model_both_forward,model_grade_selected,model_grade_selected2)
## # Comparison of Model Performance Indices
##
## Model | Type | AIC | BIC | R2 | R2 (adj.) | RMSE | Sigma | BF | p
## ----------------------------------------------------------------------------------------------------
## model_grade_all | lm | 1460.05 | 1719.38 | 0.43 | 0.25 | 2.44 | 2.81 | 1.00 |
## model_grade_backward | lm | 1411.83 | 1517.75 | 0.35 | 0.28 | 2.60 | 2.74 | > 1000 | 0.940
## model_both_forward | lm | 1411.83 | 1517.75 | 0.35 | 0.28 | 2.60 | 2.74 | > 1000 |
## model_grade_selected | lm | 1444.91 | 1503.35 | 0.20 | 0.16 | 2.89 | 2.97 | > 1000 | < .001
## model_grade_selected2 | lm | 1419.68 | 1503.69 | 0.31 | 0.25 | 2.69 | 2.80 | > 1000 | < .001
From the results above, model_grade_backward and model_both_forward are the best models, because they have the smallest AIC and BIC values and a competitive RMSE. Therefore we can choose either of them for prediction and evaluation; this time, I will use model_grade_backward.
summary(model_grade_backward)
##
## Call:
## lm(formula = G3 ~ sex + address + Mjob + Fjob + studytime + failures +
## schoolsup + paid + goout + health + absences, data = grade_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6749 -1.7892 -0.1182 1.8939 7.1404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.924394 1.309520 9.870 < 2e-16 ***
## sexM 0.546507 0.366872 1.490 0.137545
## addressU 0.772521 0.426038 1.813 0.070957 .
## Mjobhealth 1.906672 0.750381 2.541 0.011644 *
## Mjobother 0.270847 0.567665 0.477 0.633678
## Mjobservices 1.826645 0.603482 3.027 0.002722 **
## Mjobteacher 0.260919 0.679106 0.384 0.701141
## Fjobhealth -0.043794 1.107022 -0.040 0.968475
## Fjobother -0.171854 0.831469 -0.207 0.836418
## Fjobservices -0.216455 0.857599 -0.252 0.800936
## Fjobteacher 1.869268 1.056484 1.769 0.078025 .
## studytime2 -0.008232 0.422526 -0.019 0.984471
## studytime3 1.466283 0.572756 2.560 0.011038 *
## studytime4 1.678683 0.701930 2.392 0.017498 *
## failures1 -0.839125 0.569938 -1.472 0.142161
## failures2 -2.670172 1.030598 -2.591 0.010120 *
## failures3 -2.793761 0.977727 -2.857 0.004621 **
## schoolsupyes -2.782472 0.504188 -5.519 8.33e-08 ***
## paidyes -0.595432 0.356032 -1.672 0.095660 .
## goout2 -0.019520 0.886741 -0.022 0.982454
## goout3 -1.238722 0.880826 -1.406 0.160837
## goout4 -1.298300 0.900937 -1.441 0.150787
## goout5 -1.617293 0.925396 -1.748 0.081714 .
## health2 -0.480456 0.694683 -0.692 0.489800
## health3 -1.653920 0.590961 -2.799 0.005520 **
## health4 -1.041166 0.626183 -1.663 0.097588 .
## health5 -1.432170 0.542646 -2.639 0.008817 **
## absences -0.079947 0.021947 -3.643 0.000326 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.74 on 257 degrees of freedom
## Multiple R-squared: 0.3524, Adjusted R-squared: 0.2844
## F-statistic: 5.18 on 27 and 257 DF, p-value: 3.786e-13
The summary of model_grade_backward above contains a lot of information: the predictors used to build the model, the five-number summary of the residuals, the significance level of each predictor (Pr(>|t|)), and the R-squared. From Pr(>|t|) we can see which predictors have a significant influence on the target: if the value is below 0.05 (alpha), we assume the variable has a significant effect in the model, and the smaller the Pr(>|t|) value, the more significant the predictor. To make this easier to read, the star symbols indicate significance, with more stars meaning a more significant influence on the target.
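As an illustration (not part of the original workflow), the significant coefficients can also be pulled out of the summary programmatically; a minimal sketch using the 0.05 cutoff from above:
coefs <- summary(model_grade_backward)$coefficients
# keep rows whose Pr(>|t|) is below alpha = 0.05
coefs[coefs[,"Pr(>|t|)"] < 0.05,]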
Prediction
After choosing the best model for our dataset, we need to test its performance using the testing dataset that we split off above.
grade_test$G3_predicted <- round(predict(object = model_grade_backward,newdata = grade_test),2)
Evaluation
Model Performance
From the predictions on the testing dataset above, we can evaluate our model using RMSE. The Root Mean Square Error (RMSE) is the square root of the mean of the squared residuals. It indicates the absolute fit of the model to the data: how close the observed data points are to the model’s predicted values.
RMSE(y_pred = predict(model_grade_backward,grade_test),y_true = grade_test$G3)
## [1] 2.893352
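As a sanity check, RMSE can also be computed by hand from its definition; this minimal sketch should reproduce the MLmetrics::RMSE() value above:
pred <- predict(model_grade_backward, grade_test)
sqrt(mean((grade_test$G3 - pred)^2)) # square root of the mean of squared errors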
Assumptions
Assumptions are essentially conditions that should be met before we draw inferences from the model estimates. Assumption tests are needed to show that the resulting model is not misleading and does not have biased estimators.
Normality Test
Linear regression assumes that the residuals of the model are normally distributed. This assumption can be checked visually with a histogram or a Q-Q plot, and formally with a goodness-of-fit test; this time we will use the Shapiro-Wilk normality test.
hist(model_grade_backward$residuals,xlab = "Residuals",col = "lightBlue",main = "Residual Distribution Plot")
qqnorm(model_grade_backward$residuals)
qqline(model_grade_backward$residuals,col="red")
Shapiro-Wilk normality test.
shapiro.test(model_grade_backward$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_grade_backward$residuals
## W = 0.99164, p-value = 0.1067
The null hypothesis is that the residuals follow a normal distribution. With a p-value of 0.1067, which is greater than 0.05, we fail to reject the null hypothesis, so we can treat the residuals as normally distributed.
Homoscedasticity Test
Homoscedasticity refers to whether the residuals are equally spread across the fitted values, or whether they bunch together at some values and spread far apart at others.
plot(model_grade_backward$fitted.values,model_grade_backward$residuals,pch=16,col = "black",xlab = "Fitted Values", ylab="Residuals",main = "Residual Plot")
abline(h=0,col="red")Studentized Breusch-Pagan test
library(lmtest)
bptest(model_grade_backward)
##
## studentized Breusch-Pagan test
##
## data: model_grade_backward
## BP = 25.315, df = 27, p-value = 0.5568
The null hypothesis is that the residuals are homoscedastic. With a p-value of 0.5568, which is greater than 0.05, we fail to reject the null hypothesis, so we can treat the residuals as equally spread.
Multicollinearity Test
Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.
This time, multicollinearity will be tested with the Variance Inflation Factor (VIF). The VIF of a predictor in a linear regression is defined as VIF = 1/Tolerance, where the tolerance is 1 - R² from regressing that predictor on all the others. With VIF > 10, there is multicollinearity among the variables.
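To make the definition concrete, here is a minimal sketch computing the VIF of the absences predictor by hand, regressing it on the other predictors in the model. It should be close to the value reported by car::vif() below (note that vif() reports generalized VIFs for factor predictors):
r2_absences <- summary(lm(absences ~ sex + address + Mjob + Fjob + studytime +
failures + schoolsup + paid + goout + health,
data = grade_train))$r.squared
1 / (1 - r2_absences) # VIF = 1 / tolerance = 1 / (1 - R^2)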
vif(mod = model_grade_backward)
## GVIF Df GVIF^(1/(2*Df))
## sex 1.275598 1 1.129424
## address 1.117002 1 1.056883
## Mjob 1.839265 4 1.079147
## Fjob 1.654193 4 1.064935
## studytime 1.473062 3 1.066687
## failures 1.412332 3 1.059228
## schoolsup 1.139954 1 1.067686
## paid 1.202989 1 1.096808
## goout 1.541449 4 1.055580
## health 1.327582 4 1.036055
## absences 1.162231 1 1.078068
All of the predictors used to build the model have VIF < 10, which means multicollinearity is not present in our model.
Conclusion
After evaluating with RMSE, our model (model_grade_backward) performs reasonably well at predicting student grades, and it also passes the assumption tests. One thing to note, however, is that our model has a low R-squared value. R-squared is a measure of explanatory power: with a value of about 0.28, our model explains only about 28% of the variance in the final grade. This situation is not unique to our chosen model; even when all predictors are used, the R-squared stays in the same range, probably because the predictors in our dataset are predominantly dummy variables. That is not entirely bad, though, because our model uses highly significant predictors, so it still explains how changes in the response are associated with changes in the predictor values.