Introduction

The aim of this paper is to predict the final grade of Portuguese students using machine learning methods. Like several other countries (e.g. France or Venezuela), Portugal uses a 20-point grading scale, where 0 is the lowest grade and 20 is the perfect score. During the school year, students are evaluated in three periods, and the last evaluation, G3, corresponds to the final grade.

Dataset

Data were collected during the 2005-2006 school year from two public schools in the Alentejo region of Portugal. The data are available at http://archive.ics.uci.edu/ml/datasets/Student+Performance

The dataset consists of 649 observations with 33 variables. The variable G3 is the grade awarded after the third period and sums up the whole learning cycle.

# Attributes:
# 1 school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
# 2 sex - student's sex (binary: "F" - female or "M" - male)
# 3 age - student's age (numeric: from 15 to 22)
# 4 address - student's home address type (binary: "U" - urban or "R" - rural)
# 5 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
# 6 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
# 7 Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
# 8 Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
# 9 Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
# 10 Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
# 11 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
# 12 guardian - student's guardian (nominal: "mother", "father" or "other")
# 13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
# 14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
# 15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
# 16 schoolsup - extra educational support (binary: yes or no)
# 17 famsup - family educational support (binary: yes or no)
# 18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
# 19 activities - extra-curricular activities (binary: yes or no)
# 20 nursery - attended nursery school (binary: yes or no)
# 21 higher - wants to take higher education (binary: yes or no)
# 22 internet - Internet access at home (binary: yes or no)
# 23 romantic - with a romantic relationship (binary: yes or no)
# 24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
# 25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
# 26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
# 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
# 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
# 29 health - current health status (numeric: from 1 - very bad to 5 - very good)
# 30 absences - number of school absences (numeric: from 0 to 93)
# 
# *these grades are related to the course subject:*
# 31 G1 - first period grade (numeric: from 0 to 20)
# 32 G2 - second period grade (numeric: from 0 to 20)
# 33 G3 - final grade (numeric: from 0 to 20, output target)
library(MASS)
library(tidyverse)
library(tree)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(dplyr)
library(randomForest)
library(ranger)
library(gbm)
library(xgboost)
library(fastAdaboost)
library(neuralnet)
library(plyr)
library(psych)
library(boot)
library(Hmisc)
library(corrplot)
library(fastDummies)
#read data
por=read.table("student-por.csv",sep=";",header=TRUE)
#str(por)

EDA

Let's check what the distribution of our dependent variable G3 looks like. We examine the histograms and densities of all numeric variables:

num<-dplyr::select(por, age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout,Dalc, Walc,health,absences, G1,G2,G3)

multi.hist(num, freq=F, dcol = "red", dlty=c("dotted", "solid")) 

Correlation matrix

Let's check which variables are correlated.

res<-Hmisc::rcorr(as.matrix(num))

corrplot(res$r, type="upper", order="hclust", # insignificant correlations are left blank
         p.mat = res$P, sig.level = 0.01, insig = "blank")

Because G1 and G2 are highly correlated with G3 (they are the grades students received in the two earlier periods of the same school year), and because the authors of the original article propose predicting G3 without them, we decided to remove G1 and G2 from our model.

####remove G1, G2, train_and_test_set####
por_all<-por
por<-dplyr::select(por, -c(G1, G2))

Dealing with nominal variables

The nominal variables have been converted to binary (dummy) variables. We first collect the nominal (factor) columns in fac:

fac <- dplyr::select_if(por, is.factor) # nominal columns (use is.character if strings were not read as factors)
por<-fastDummies::dummy_cols(por, select_columns = names(fac), remove_most_frequent_dummy = T, remove_selected_columns = T)

#res<-rcorr(as.matrix(por))
#corrplot(res$r, type="upper", order="hclust", # insignificant correlations are left blank
#        p.mat = res$P, sig.level = 0.01, insig = "blank")

Splitting the data into train and test sets

#create train and test set
set.seed(333)

train_obs <- createDataPartition(por$G3, 
                                 p = 0.7, 
                                 list = FALSE) 
data_train <- por[train_obs,]
data_test  <- por[-train_obs,]

# all predictors
modelformula <- G3 ~ .

Model 1 - Decision Tree

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (or, in a regression tree such as ours, a predicted value).

data.tree <-  rpart(modelformula,
                    data = data_train)

fancyRpartPlot(data.tree, main="Model 1 - Decision Tree", palettes=c("Greys", "Oranges"))

We started with cross-validation of the model.

Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.

#Cross validation of the model
set.seed(333)
tc <- trainControl(method = "cv",
                   number = 10)
cp.grid <- expand.grid(cp = seq(0, 0.03, 0.001))


data.tree.cv <- train(modelformula,
                      data = data_train, 
                      method = "rpart", 
                      trControl = tc,
                      tuneGrid = cp.grid)
data.tree.cv
## CART 
## 
## 456 samples
##  39 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 411, 410, 410, 410, 409, 411, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE      Rsquared   MAE     
##   0.000  2.976771  0.2493323  2.188503
##   0.001  2.975406  0.2497983  2.187770
##   0.002  2.977101  0.2486477  2.188191
##   0.003  2.938812  0.2613787  2.159255
##   0.004  2.913891  0.2662663  2.133407
##   0.005  2.913946  0.2666047  2.167727
##   0.006  2.910764  0.2675785  2.160847
##   0.007  2.912100  0.2577801  2.173196
##   0.008  2.913111  0.2534939  2.184956
##   0.009  2.907692  0.2550133  2.176833
##   0.010  2.907150  0.2546758  2.178109
##   0.011  2.905580  0.2535629  2.175929
##   0.012  2.889333  0.2620623  2.184924
##   0.013  2.862822  0.2698991  2.159307
##   0.014  2.894269  0.2539937  2.196610
##   0.015  2.886070  0.2575813  2.199146
##   0.016  2.873889  0.2545001  2.199571
##   0.017  2.872953  0.2516029  2.205524
##   0.018  2.859587  0.2524523  2.180272
##   0.019  2.861293  0.2509174  2.183712
##   0.020  2.898270  0.2490460  2.200453
##   0.021  2.899300  0.2478623  2.202283
##   0.022  2.885145  0.2526354  2.189346
##   0.023  2.895625  0.2447668  2.191686
##   0.024  2.901478  0.2427628  2.200445
##   0.025  2.903645  0.2421612  2.199182
##   0.026  2.903645  0.2421612  2.199182
##   0.027  2.902240  0.2412934  2.197181
##   0.028  2.882771  0.2458482  2.184281
##   0.029  2.842470  0.2559861  2.145881
##   0.030  2.826371  0.2598272  2.136170
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.03.

RMSE was used to select the optimal model, and the smallest value was obtained for cp = 0.03, so we refit the model with this value held constant.

set.seed(333)
cp.grid <- expand.grid(cp = 0.03)

data.tree.cv <- train(modelformula,
                      data = data_train, 
                      method = "rpart", 
                      trControl = tc,
                      tuneGrid = cp.grid)
data.tree.cv
## CART 
## 
## 456 samples
##  39 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 411, 410, 410, 410, 409, 411, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   2.826371  0.2598272  2.13617
## 
## Tuning parameter 'cp' was held constant at a value of 0.03

Let's compare the results on the training and test sets using the regressionMetrics function defined below.

regressionMetrics <- function(real, predicted) {

  # simple function that summarizes popular ex-post error measures

  # Mean Square Error
  MSE <- mean((real - predicted)^2)
  
  # Root Mean Square Error
  RMSE <- sqrt(MSE)
  
  # Mean Absolute Error
  MAE <- mean(abs(real - predicted))
  
  # Median Absolute Error
  MedAE <- median(abs(real - predicted))
  
  # Mean Squared Logarithmic Error
  MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
  
  # Total Sum of Squares
  TSS <- sum((real - mean(real))^2)
  
  # Residual Sum of Squares
  RSS <- sum((predicted - real)^2)
  
  # R2
  R2 <- 1 - RSS/TSS
  
  result <- data.frame(MSE, RMSE, MAE, MedAE, MSLE, R2)
  return(result)
}
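
The table below was obtained by applying this function to the cross-validated tree's predictions on both sets; a minimal sketch (the object name a is an assumption, while b is reused later in the model comparisons):

# metrics of the cross-validated tree on the train and test sets
a <- regressionMetrics(real = data_train$G3,
                       predicted = predict(data.tree.cv, newdata = data_train))
b <- regressionMetrics(real = data_test$G3,
                       predicted = predict(data.tree.cv, newdata = data_test))
rbind(round(a, 3) %>% mutate(model = "on train data"),
      round(b, 3) %>% mutate(model = "on test data"))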
##     MSE  RMSE   MAE MedAE  MSLE    R2         model
## 1 7.925 2.815 2.107 1.693 0.186 0.271 on train data
## 2 8.418 2.901 2.188 1.935 0.117 0.097  on test data

The results are worse for the test sample: it has higher error values than the training one, which is the expected behavior. We compared these test-set results with the models created next.

In order to decrease the variance of the predictions, we used bagging.

Model 2 - Bagging

Bagging (bootstrap aggregating) is designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting.

We fitted 1000 regressions on different bootstrap samples and then took the mean of their predictions as the final forecast.

n <- nrow(data_train)

results_regression <- list()

set.seed(333)
for (i in 1:1000) {
  # draw a bootstrap sample of the training set
  boot_sample <- data_train[sample(1:n,
                                   size = n,
                                   replace = TRUE), ]
  # fit a linear regression on the bootstrap sample
  results_regression[[i]] <- lm(modelformula,
                                data = boot_sample)
}

# predict on the test set with every model and average the forecasts
forecasts_bag <- sapply(results_regression,
                        function(m) predict(m, newdata = data_test))
forecasts_bag = forecasts_bag %>% rowMeans()
# test dataset
c= regressionMetrics(real = data_test$G3,
                  predicted = forecasts_bag)

b=round(b,3)
c=round(c,3)
b = b %>% mutate(model= "CrossValidation")
c = c %>% mutate(model= "Bagging")

t = rbind(b,c)
t %>% arrange(MSE)
##     MSE  RMSE   MAE MedAE  MSLE    R2           model
## 1 7.455 2.730 2.007 1.668 0.114 0.201         Bagging
## 2 8.418 2.901 2.188 1.935 0.117 0.097 CrossValidation

We compared the results of Model 1 (cross-validation) with Model 2 (bagging). Bagging decreased the final error values of the model.

Model 3 - Random Forest

A random forest builds an ensemble of decision trees on bootstrap samples and averages their predictions; in addition, only a random subset of predictors is considered at each split, which decorrelates the individual trees.

set.seed(333)
data.random.forest <- randomForest(modelformula,
                                    data = data_train)

print(data.random.forest)
## 
## Call:
##  randomForest(formula = modelformula, data = data_train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 7.320225
##                     % Var explained: 32.69

What is the optimal number of trees? Let's plot the error against the number of trees to see where it stabilizes.

plot(data.random.forest)

Let's change the number of trees to 60 and check the results.

set.seed(333)
data.random.forest2 <- randomForest(modelformula,
                                     data = data_train,
                                     ntree = 60,
                                     importance = TRUE)

# mtry: the number of variables randomly sampled as candidates at each split.
# Note that the default values differ for classification (sqrt(p), where p is
# the number of variables in x) and regression (p/3); here 39/3 = 13.

plot(data.random.forest2)

print(data.random.forest2)
## 
## Call:
##  randomForest(formula = modelformula, data = data_train, ntree = 60,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 60
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 7.571156
##                     % Var explained: 30.38

The % var explained has decreased and the mean of squared residuals has increased. Let's try to find the optimal value of the mtry parameter (the number of variables randomly sampled as candidates at each split).

parameters_rf <- expand.grid(mtry = 1:20)

set.seed(333)
  data.random.forest3_training <- 
    train(modelformula, 
          data = data_train, 
          method = "rf", 
          ntree = 60,
          tuneGrid = parameters_rf, 
          trControl = tc,
          importance = TRUE)

plot(data.random.forest3_training)

print(data.random.forest3_training)
## Random Forest 
## 
## 456 samples
##  39 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 411, 410, 410, 410, 409, 411, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    1    2.896091  0.3049615  2.091615
##    2    2.743380  0.3410184  1.960771
##    3    2.699925  0.3402982  1.955689
##    4    2.722786  0.3138669  1.964194
##    5    2.680430  0.3383260  1.930988
##    6    2.656462  0.3518727  1.925076
##    7    2.692042  0.3299081  1.962533
##    8    2.681286  0.3358943  1.938161
##    9    2.659533  0.3386445  1.932939
##   10    2.694445  0.3250489  1.960403
##   11    2.674775  0.3329844  1.961611
##   12    2.680800  0.3281031  1.970356
##   13    2.719868  0.3132709  1.985726
##   14    2.712958  0.3184563  1.978898
##   15    2.733323  0.3101584  1.995726
##   16    2.658505  0.3442494  1.935771
##   17    2.694266  0.3311205  1.984853
##   18    2.714821  0.3205838  1.987447
##   19    2.686482  0.3377981  1.963816
##   20    2.740012  0.3149341  2.017510
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.

From the plot above and from the results, we can conclude that the optimal number of variables is mtry = 6.

set.seed(333)
data.random.forest3 <- randomForest(modelformula,
                                     data = data_train,
                                     ntree = 60,
                                     mtry = 6,
                                     importance = TRUE)

Let's compare the three different random forest approaches on the test set.
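
The comparison table below can be reproduced by applying regressionMetrics to each forest's test-set predictions; a minimal sketch (the helper objects rf_list and rf_results are assumptions):

# test-set metrics for the three fitted forests
rf_list <- list("random forest" = data.random.forest,
                "limited ntree" = data.random.forest2,
                "limited ntree with optimal mtry value" = data.random.forest3)
rf_results <- do.call(rbind, lapply(names(rf_list), function(nm) {
  m <- round(regressionMetrics(real = data_test$G3,
                               predicted = predict(rf_list[[nm]], data_test)), 3)
  m$random_forest <- nm
  m
}))
rf_results %>% arrange(MSE)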

##     MSE  RMSE   MAE MedAE  MSLE    R2                         random_forest
## 1 7.030 2.651 1.962 1.450 0.110 0.246 limited ntree with optimal mtry value
## 2 7.106 2.666 1.990 1.539 0.111 0.238                         random forest
## 3 7.250 2.693 2.029 1.598 0.112 0.223                         limited ntree

The forest with 60 trees and the cross-validated mtry achieved the lowest errors, slightly ahead of the basic random forest, while limiting the number of trees alone gave the worst results. Cross-validating the mtry parameter clearly helped to limit the errors.

Comparing the 3 main models
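
For the overall comparison we kept the basic random forest from Model 3; a minimal sketch of how its metrics could be appended to the running table t (the object name d is an assumption):

# add the basic random forest's test-set metrics to the comparison table
d <- round(regressionMetrics(real = data_test$G3,
                             predicted = predict(data.random.forest, data_test)), 3)
d <- d %>% mutate(model = "Random Forest")
t <- rbind(t, d)
t %>% arrange(MSE)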

##     MSE  RMSE   MAE MedAE  MSLE    R2           model
## 1 7.106 2.666 1.990 1.539 0.111 0.238   Random Forest
## 2 7.455 2.730 2.007 1.668 0.114 0.201         Bagging
## 3 8.418 2.901 2.188 1.935 0.117 0.097 CrossValidation

Random forest looks the best when compared to the other models.

Model 4 - Neural Networks

Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input.

First, we need to scale the data. Note that the train and test sets are standardized independently below; strictly speaking, the test set should be scaled with the training set's means and standard deviations.

# we scale the data
test<-scale(data_test)
test<-as.data.frame(test)

train<-scale(data_train)
train<-as.data.frame(train)

The basic approach of the neuralnet package

net <- neuralnet(modelformula, # the model
                  train)

plot(net, rep = 'best')

e1= regressionMetrics(real = test$G3,
                     predicted = compute(net, test)$net.result)
e1 =round(e1,3)
e1 = e1 %>% mutate(model= "NN 1")
e1
##     MSE  RMSE   MAE MedAE MSLE    R2 model
## 1 0.869 0.932 0.704 0.506  NaN 0.127  NN 1

The MSLE is NaN because the standardized grades include values below -1, for which log(1 + x) is undefined. Let's modify the model and try different settings.

net2 <- neuralnet(modelformula, 
                  data=train,
                  act.fct = "logistic")

net3 <- neuralnet(modelformula, 
                  data=train,
                  hidden = c(2,1),
                  act.fct = "logistic")
e2= regressionMetrics(real = test$G3,
                      predicted = compute(net2, test)$net.result)

e3= regressionMetrics(real = test$G3,
                      predicted = compute(net3, test)$net.result)

e2=round(e2,3)
e3=round(e3,3)

e2 <- e2 %>% mutate(model="NN 2")
e3<- e3 %>% mutate(model="NN 3")

ef= rbind(e1,e2, e3)
ef %>% arrange(MSE)
##     MSE  RMSE   MAE MedAE MSLE     R2 model
## 1 0.869 0.932 0.704 0.506  NaN  0.127  NN 1
## 2 0.928 0.963 0.694 0.540  NaN  0.067  NN 2
## 3 1.249 1.118 0.876 0.798  NaN -0.256  NN 3

The best-looking model is NN 1.

Let's check how the neural network compares with the other models. Note, however, that the NN errors are computed on the standardized G3, while the other models are evaluated on the original 0-20 scale, so the rows are not directly comparable.

e<- e1
t = rbind(t,e)
t %>% arrange(MSE)
##     MSE  RMSE   MAE MedAE  MSLE    R2           model
## 1 0.869 0.932 0.704 0.506   NaN 0.127            NN 1
## 2 7.106 2.666 1.990 1.539 0.111 0.238   Random Forest
## 3 7.455 2.730 2.007 1.668 0.114 0.201         Bagging
## 4 8.418 2.901 2.188 1.935 0.117 0.097 CrossValidation

Summary

We created 4 models. Among the most common methods, the random forest looks best, as indicated by the smallest error values. The neural network's errors appear far smaller, but they are measured on the standardized G3 scale, so they are not directly comparable with the other models: a fair comparison requires transforming the network's predictions back to the original 0-20 grade scale.
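
A minimal sketch of such a back-transformation, using the mean and standard deviation of G3 in the test set (which is how scale() standardized the test columns):

# rescale the NN predictions to the original 0-20 grade scale before comparing
g3_mean <- mean(data_test$G3)
g3_sd   <- sd(data_test$G3)
pred_original <- compute(net, test)$net.result * g3_sd + g3_mean
round(regressionMetrics(real = data_test$G3,
                        predicted = pred_original), 3)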