The aim of this paper is to predict the final score of Portuguese students using Machine Learning methods. As in several other countries (e.g. France or Venezuela), a 20-point grading scale is used, where 0 is the lowest grade and 20 is the perfect score. During the school year, students are evaluated in three periods, and the last evaluation, G3, corresponds to the final grade.
Data were collected during the 2005-2006 school year from two public schools in the Alentejo region of Portugal. The data are available at http://archive.ics.uci.edu/ml/datasets/Student+Performance
The dataset consists of 649 observations of 33 variables. The variable G3 is the grade awarded after the third evaluation period and sums up the whole learning cycle.
# Attributes:
# 1 school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
# 2 sex - student's sex (binary: "F" - female or "M" - male)
# 3 age - student's age (numeric: from 15 to 22)
# 4 address - student's home address type (binary: "U" - urban or "R" - rural)
# 5 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
# 6 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
# 7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
# 8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
# 9 Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
# 10 Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
# 11 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
# 12 guardian - student's guardian (nominal: "mother", "father" or "other")
# 13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
# 14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
# 15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
# 16 schoolsup - extra educational support (binary: yes or no)
# 17 famsup - family educational support (binary: yes or no)
# 18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
# 19 activities - extra-curricular activities (binary: yes or no)
# 20 nursery - attended nursery school (binary: yes or no)
# 21 higher - wants to take higher education (binary: yes or no)
# 22 internet - Internet access at home (binary: yes or no)
# 23 romantic - with a romantic relationship (binary: yes or no)
# 24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
# 25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
# 26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
# 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
# 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
# 29 health - current health status (numeric: from 1 - very bad to 5 - very good)
# 30 absences - number of school absences (numeric: from 0 to 93)
#
# *these grades are related to the course subject:*
# 31 G1 - first period grade (numeric: from 0 to 20)
# 32 G2 - second period grade (numeric: from 0 to 20)
# 33 G3 - final grade (numeric: from 0 to 20, output target)
library(MASS)
library(tidyverse)
library(tree)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(dplyr)
library(randomForest)
library(ranger)
library(gbm)
library(xgboost)
library(fastAdaboost)
library(neuralnet)
library(plyr)
library(psych)
library(boot)
library(Hmisc)
library(corrplot)
library(fastDummies)
#read data
por=read.table("student-por.csv",sep=";",header=TRUE)
#str(por)
#EDA: let's check what the distribution of our dependent variable G3 looks like
num<-dplyr::select(por, age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout,Dalc, Walc,health,absences, G1,G2,G3)
multi.hist(num, freq=F, dcol = "red", dlty=c("dotted", "solid"))
Let's check which variables are correlated.
res<-Hmisc::rcorr(as.matrix(num))
corrplot(res$r, type="upper", order="hclust", # insignificant correlations are left blank
         p.mat = res$P, sig.level = 0.01, insig = "blank")
G1 and G2 are highly correlated with G3, as they are the grades students received in the two previous evaluation periods of the same school year. Because of this, and because the authors of the original article propose predicting G3 without them, we decided to remove G1 and G2 from our model.
####remove G1, G2, train_and_test_set####
por_all<-por
por<-dplyr::select(por, -c(G1, G2))
The nominal variables have been converted to binary (dummy) variables.
# fac is not defined in the visible code; we assume it holds the nominal (factor/character) columns
fac <- por[, sapply(por, function(x) is.factor(x) || is.character(x))]
por <- fastDummies::dummy_cols(por, select_columns = names(fac), remove_most_frequent_dummy = TRUE, remove_selected_columns = TRUE)
#res<-rcorr(as.matrix(por))
#corrplot(res$r, type="upper", order="hclust", # insignificant correlations are left blank
# p.mat = res$P, sig.level = 0.01, insig = "blank")
#create train and test set
set.seed(333)
train_obs <- createDataPartition(por$G3,
p = 0.7,
list = FALSE)
data_train <- por[train_obs,]
data_test <- por[-train_obs,]
# all predictors
modelformula <- G3 ~ .
A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label (or, in regression trees such as ours, a predicted numeric value).
data.tree <- rpart(modelformula,
data = data_train)
fancyRpartPlot(data.tree, main="Model 1 - Decision Tree", palettes=c("Greys", "Oranges"))
We started by cross-validating the model. Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
#Cross validation of the model
set.seed(333)
tc <- trainControl(method = "cv",
number = 10)
cp.grid <- expand.grid(cp = seq(0, 0.03, 0.001))
data.tree.cv <- train(modelformula,
data = data_train,
method = "rpart",
trControl = tc,
tuneGrid = cp.grid)
data.tree.cv
## CART
##
## 456 samples
## 39 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 411, 410, 410, 410, 409, 411, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.000 2.976771 0.2493323 2.188503
## 0.001 2.975406 0.2497983 2.187770
## 0.002 2.977101 0.2486477 2.188191
## 0.003 2.938812 0.2613787 2.159255
## 0.004 2.913891 0.2662663 2.133407
## 0.005 2.913946 0.2666047 2.167727
## 0.006 2.910764 0.2675785 2.160847
## 0.007 2.912100 0.2577801 2.173196
## 0.008 2.913111 0.2534939 2.184956
## 0.009 2.907692 0.2550133 2.176833
## 0.010 2.907150 0.2546758 2.178109
## 0.011 2.905580 0.2535629 2.175929
## 0.012 2.889333 0.2620623 2.184924
## 0.013 2.862822 0.2698991 2.159307
## 0.014 2.894269 0.2539937 2.196610
## 0.015 2.886070 0.2575813 2.199146
## 0.016 2.873889 0.2545001 2.199571
## 0.017 2.872953 0.2516029 2.205524
## 0.018 2.859587 0.2524523 2.180272
## 0.019 2.861293 0.2509174 2.183712
## 0.020 2.898270 0.2490460 2.200453
## 0.021 2.899300 0.2478623 2.202283
## 0.022 2.885145 0.2526354 2.189346
## 0.023 2.895625 0.2447668 2.191686
## 0.024 2.901478 0.2427628 2.200445
## 0.025 2.903645 0.2421612 2.199182
## 0.026 2.903645 0.2421612 2.199182
## 0.027 2.902240 0.2412934 2.197181
## 0.028 2.882771 0.2458482 2.184281
## 0.029 2.842470 0.2559861 2.145881
## 0.030 2.826371 0.2598272 2.136170
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.03.
The smallest RMSE was obtained at cp = 0.03, the upper end of our grid, so we refit the model holding the complexity parameter constant at that value.
set.seed(333)
cp.grid <- expand.grid(cp = 0.03)
data.tree.cv <- train(modelformula,
data = data_train,
method = "rpart",
trControl = tc,
tuneGrid = cp.grid)
data.tree.cv
## CART
##
## 456 samples
## 39 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 411, 410, 410, 410, 409, 411, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.826371 0.2598272 2.13617
##
## Tuning parameter 'cp' was held constant at a value of 0.03
Let's compare the results on the training and test sets using the regressionMetrics function defined below.
regressionMetrics <- function(real, predicted) {
# simple function that summarizes popular ex-post error measures
# Mean Square Error
MSE <- mean((real - predicted)^2)
# Root Mean Squared Error
RMSE <- sqrt(MSE)
# Mean Absolute Error
MAE <- mean(abs(real - predicted))
# Median Absolute Error
MedAE <- median(abs(real - predicted))
# Mean Squared Logarithmic Error
MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
# Total Sum of Squares
TSS <- sum((real - mean(real))^2)
# Residual Sum of Squares
RSS <- sum((predicted - real)^2)
# R2
R2 <- 1 - RSS/TSS
result <- data.frame(MSE, RMSE, MAE, MedAE, MSLE, R2)
return(result)
}
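The chunk that produced the table below is not shown in the source, so the following is a minimal sketch of how it could have been computed; the object names m_train and m_test are illustrative assumptions.
# metrics of the cross-validated tree on the training data
m_train <- regressionMetrics(real = data_train$G3,
                             predicted = predict(data.tree.cv, newdata = data_train))
# metrics of the cross-validated tree on the test data
m_test <- regressionMetrics(real = data_test$G3,
                            predicted = predict(data.tree.cv, newdata = data_test))
round(rbind(m_train, m_test), 3) %>%
  mutate(model = c("on train data", "on test data"))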
## MSE RMSE MAE MedAE MSLE R2 model
## 1 7.925 2.815 2.107 1.693 0.186 0.271 on train data
## 2 8.418 2.901 2.188 1.935 0.117 0.097 on test data
The errors are higher on the test sample than on the training sample, which is the expected behaviour. We use the test-data results as the baseline for comparison with the models that follow.
In order to decrease the variance of the prediction we used bagging.
Bagging is designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It reduces variance and helps to avoid overfitting.
We fitted 1000 regressions on different bootstrap samples and then took the mean of their predictions as the final forecast.
n <- nrow(data_train)
results_regression <- list()
data_samples <- list()
set.seed(333)
for (sample in 1:1000) {
  # draw a bootstrap sample of the training set
  boot_sample <- data_train[sample(1:n,
                                   size = n,
                                   replace = TRUE), ]
  # fit a linear regression on the bootstrap sample
  results_regression[[sample]] <- lm(modelformula, data = boot_sample)
  data_samples[[sample]] <- boot_sample
}
# predict G3 on the test set with each bootstrapped model
# (this step was not shown in the original chunk; applying every fitted model to the test set is one way to obtain it)
forecasts_bag <- sapply(results_regression, function(m) predict(m, newdata = data_test))
# average the 1000 forecasts row-wise to obtain the bagged prediction
forecasts_bag <- forecasts_bag %>% rowMeans()
# test dataset
# b (the cross-validated tree's test-set metrics) is assumed to come from the earlier comparison
b <- regressionMetrics(real = data_test$G3,
                       predicted = predict(data.tree.cv, newdata = data_test))
c <- regressionMetrics(real = data_test$G3,
                       predicted = forecasts_bag)
b <- round(b, 3)
c <- round(c, 3)
b <- b %>% mutate(model = "CrossValidation")
c <- c %>% mutate(model = "Bagging")
t <- rbind(b, c)
t %>% arrange(MSE)
## MSE RMSE MAE MedAE MSLE R2 model
## 1 7.455 2.730 2.007 1.668 0.114 0.201 Bagging
## 2 8.418 2.901 2.188 1.935 0.117 0.097 CrossValidation
We compared the results of Model 1 (the cross-validated tree) with Model 2 (bagging). Bagging decreased the test-set errors.
set.seed(333)
data.random.forest <- randomForest(modelformula,
data = data_train)
print(data.random.forest)
##
## Call:
## randomForest(formula = modelformula, data = data_train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 7.320225
## % Var explained: 32.69
How many trees do we actually need? Let's plot the forest's error against the number of trees.
plot(data.random.forest)
Let's change the number of trees to 60 and check the results.
set.seed(333)
data.random.forest2 <- randomForest(modelformula,
data = data_train,
ntree = 60,
importance = TRUE)
# mtry - number of variables randomly sampled as candidates at each split.
# Note that the default values differ for classification (sqrt(p), where p is the
# number of variables in x) and regression (p/3); here 39/3 = 13.
plot(data.random.forest2)
print(data.random.forest2)
##
## Call:
## randomForest(formula = modelformula, data = data_train, ntree = 60, importance = TRUE)
## Type of random forest: regression
## Number of trees: 60
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 7.571156
## % Var explained: 30.38
With only 60 trees, the % Var explained decreased and the mean of squared residuals increased. Let's try to find the optimal value of the mtry parameter (the number of variables randomly sampled as candidates at each split).
parameters_rf <- expand.grid(mtry = 1:20)
set.seed(333)
data.random.forest3_training <-
train(modelformula,
data = data_train,
method = "rf",
ntree = 60,
tuneGrid = parameters_rf,
trControl = tc,
importance = TRUE)
plot(data.random.forest3_training)
print(data.random.forest3_training)
## Random Forest
##
## 456 samples
## 39 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 411, 410, 410, 410, 409, 411, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 2.896091 0.3049615 2.091615
## 2 2.743380 0.3410184 1.960771
## 3 2.699925 0.3402982 1.955689
## 4 2.722786 0.3138669 1.964194
## 5 2.680430 0.3383260 1.930988
## 6 2.656462 0.3518727 1.925076
## 7 2.692042 0.3299081 1.962533
## 8 2.681286 0.3358943 1.938161
## 9 2.659533 0.3386445 1.932939
## 10 2.694445 0.3250489 1.960403
## 11 2.674775 0.3329844 1.961611
## 12 2.680800 0.3281031 1.970356
## 13 2.719868 0.3132709 1.985726
## 14 2.712958 0.3184563 1.978898
## 15 2.733323 0.3101584 1.995726
## 16 2.658505 0.3442494 1.935771
## 17 2.694266 0.3311205 1.984853
## 18 2.714821 0.3205838 1.987447
## 19 2.686482 0.3377981 1.963816
## 20 2.740012 0.3149341 2.017510
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
From the graph above and from the resampling results, we can conclude that the optimal number of variables sampled at each split is mtry = 6.
set.seed(333)
data.random.forest3 <- randomForest(modelformula,
data = data_train,
ntree = 60,
mtry = 6,
importance = TRUE)
Let's compare the three random forest approaches on the test set.
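The chunk behind the comparison below is not shown either; the sketch that follows is one plausible reconstruction, assuming the three fitted forests and the test set. The intermediate names (rf_tuned, rf_basic, rf_ntree60) are illustrative.
# test-set metrics for each of the three random forests
rf_tuned <- regressionMetrics(real = data_test$G3,
                              predicted = predict(data.random.forest3, newdata = data_test))
rf_basic <- regressionMetrics(real = data_test$G3,
                              predicted = predict(data.random.forest, newdata = data_test))
rf_ntree60 <- regressionMetrics(real = data_test$G3,
                                predicted = predict(data.random.forest2, newdata = data_test))
rf_all <- round(rbind(rf_tuned, rf_basic, rf_ntree60), 3) %>%
  mutate(random_forest = c("limited ntree with optimal mtry value",
                           "random forest",
                           "limited ntree"))
rf_all %>% arrange(MSE)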
## MSE RMSE MAE MedAE MSLE R2 random_forest
## 1 7.030 2.651 1.962 1.450 0.110 0.246 limited ntree with optimal mtry value
## 2 7.106 2.666 1.990 1.539 0.111 0.238 random forest
## 3 7.250 2.693 2.029 1.598 0.112 0.223 limited ntree
Comparing the three forests, the version with 60 trees and the cross-validated mtry achieved the lowest test errors, closely followed by the basic random forest, while the untuned 60-tree forest performed worst. The difference between the last two is probably due to the number of trees, and cross-validating mtry helped to limit the errors.
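The updated comparison table below also comes from a hidden chunk; it can be reconstructed roughly as follows, assuming the basic random forest is the one appended to the running table t (the name d is illustrative).
# test-set metrics of the basic random forest, appended to the running comparison
d <- regressionMetrics(real = data_test$G3,
                       predicted = predict(data.random.forest, newdata = data_test))
d <- round(d, 3) %>% mutate(model = "Random Forest")
t <- rbind(t, d)
t %>% arrange(MSE)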
## MSE RMSE MAE MedAE MSLE R2 model
## 1 7.106 2.666 1.990 1.539 0.111 0.238 Random Forest
## 2 7.455 2.730 2.007 1.668 0.114 0.201 Bagging
## 3 8.418 2.901 2.188 1.935 0.117 0.097 CrossValidation
The random forest looks the best when compared with the other models.
Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input.
First, we need to scale the data.
# we scale the data
test<-scale(data_test)
test<-as.data.frame(test)
train<-scale(data_train)
train<-as.data.frame(train)
The basic approach of the neuralnet package:
net <- neuralnet(modelformula, # the model
train)
plot(net, rep = 'best')
e1= regressionMetrics(real = test$G3,
predicted = compute(net, test)$net.result)
e1 =round(e1,3)
e1 = e1 %>% mutate(model= "NN 1")
e1
## MSE RMSE MAE MedAE MSLE R2 model
## 1 0.869 0.932 0.704 0.506 NaN 0.127 NN 1
Let's modify the model and try different settings.
net2 <- neuralnet(modelformula,
data=train,
act.fct = "logistic")
net3 <- neuralnet(modelformula,
data=train,
hidden = c(2,1),
act.fct = "logistic")
e2= regressionMetrics(real = test$G3,
predicted = compute(net2, test)$net.result)
e3= regressionMetrics(real = test$G3,
predicted = compute(net3, test)$net.result)
e2=round(e2,3)
e3=round(e3,3)
e2 <- e2 %>% mutate(model="NN 2")
e3<- e3 %>% mutate(model="NN 3")
ef= rbind(e1,e2, e3)
ef %>% arrange(MSE)
## MSE RMSE MAE MedAE MSLE R2 model
## 1 0.869 0.932 0.704 0.506 NaN 0.127 NN 1
## 2 0.928 0.963 0.694 0.540 NaN 0.067 NN 2
## 3 1.249 1.118 0.876 0.798 NaN -0.256 NN 3
The best-looking model is NN 1.
Let's check how the neural networks compare with the other models.
e<- e1
t = rbind(t,e)
t %>% arrange(MSE)
## MSE RMSE MAE MedAE MSLE R2 model
## 1 0.869 0.932 0.704 0.506 NaN 0.127 NN 1
## 2 7.106 2.666 1.990 1.539 0.111 0.238 Random Forest
## 3 7.455 2.730 2.007 1.668 0.114 0.201 Bagging
## 4 8.418 2.901 2.188 1.935 0.117 0.097 CrossValidation
We created four models. Among the more common methods, the random forest looks best, as indicated by the smallest error values. The neural network appears far more precise at first glance, but note that its metrics are computed on the standardized (scaled) target, so its errors are not directly comparable with those of the other models, which are expressed on the original 0-20 grade scale.
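For a fair comparison, the NN 1 predictions could be transformed back to the original 0-20 scale before computing the metrics. A minimal sketch, assuming the test set was standardized with its own column means and standard deviations (as scale() does above):
# back-transform the NN 1 predictions from the standardized scale to the 0-20 grade scale
g3_center <- mean(data_test$G3)
g3_scale <- sd(data_test$G3)
nn_pred_original <- as.vector(compute(net, test)$net.result) * g3_scale + g3_center
regressionMetrics(real = data_test$G3, predicted = nn_pred_original) %>% round(3)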