Introduction

In this group assignment for WQD7004: Programming for Data Science, the task is to use R to build both a classification model and a regression model for prediction from a single dataset. After discussion, we chose the student performance data collected by Cortez and Silva (2008).

For context, Cortez and Silva (2008) studied student achievement using data from two Portuguese schools. Their concern was that statistics showed Portugal to have one of the highest failure rates in Europe, despite continuous improvement over the preceding decade. In 2006, the early school leaving rate in Portugal was 40% for those aged 18-24, while the European Union average was only 15% (Eurostat, 2007). Mathematics and Portuguese are the two most important subjects because they are core classes that provide the fundamental knowledge needed for success in other school subjects, such as Chemistry and History.

Modeling student performance is therefore an important tool for educators, students, and researchers alike, and it supports further research into the factors that affect student achievement. The researchers in the previous study predicted secondary school student grades from past school performance together with demographic, social, and other school-related data. They tested three data mining goals (binary classification, five-level classification, and regression) and four data mining methods (decision trees, random forests, neural networks, and support vector machines).

The previous study concluded that the best predictive accuracy for the final grade is achieved when the past period grades are available, and that predictive performance drops as each past grade is removed from the inputs. The researchers also suggested that other variables, such as the number of school absences, parents’ jobs and education, and alcohol consumption, influence the prediction of a student’s final grade, although less than past performance does.

For this group assignment, the goal is to improve on that predictive performance. We focus only on the Portuguese language dataset because it contains more records than the Mathematics dataset.

Problem Statement

Education plays a vital role in the growth and development of a country. It is therefore valuable to predict student performance so that education practitioners can formulate plans to improve the education system, and to identify the factors that affect student performance. Hence, we aim to explore the correlation between the student’s final grade and the earlier period grades and the demographic, social, and school-related features. We also develop regression and classification models to predict student performance.

Research Questions

  1. Which factors have the greatest impact on student performance?
  2. How can prediction models be designed to predict student performance?
  3. Which model predicts student performance best?

Research Objectives

  1. To identify the factors that are highly correlated with student performance
  2. To create regression and classification models to predict student performance
  3. To evaluate the performance of the models

Dataset

Student Performance Data Set

This dataset describes student performance in secondary education at two Portuguese schools in 2005-2006. It was collected from school reports and questionnaires.

In this project, we are interested in student performance in the Portuguese subject. This subset consists of 649 observations and 33 variables.

Metadata

Variable    Description
school      student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex         student’s sex (binary: ‘F’ - female or ‘M’ - male)
age         student’s age (numeric: from 15 to 22)
address     student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize     family size (binary: ‘LE3’ - less than or equal to 3 or ‘GT3’ - greater than 3)
Pstatus     parents’ cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu        mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Fedu        father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Mjob        mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob        father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason      reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian    student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime  home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime   weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures    number of past class failures (numeric: n if 0<=n<=2, else 3)
schoolsup   extra educational support (binary: yes or no)
famsup      family educational support (binary: yes or no)
paid        extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities  extra-curricular activities (binary: yes or no)
nursery     attended nursery school (binary: yes or no)
higher      wants to take higher education (binary: yes or no)
internet    Internet access at home (binary: yes or no)
romantic    in a romantic relationship (binary: yes or no)
famrel      quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime    free time after school (numeric: from 1 - very low to 5 - very high)
goout       going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc        workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc        weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health      current health status (numeric: from 1 - very bad to 5 - very good)
absences    number of school absences (numeric: from 0 to 93; the maximum observed in this dataset is 32)
G1          first period grade (numeric: from 0 to 20)
G2          second period grade (numeric: from 0 to 20)
G3          final grade (numeric: from 0 to 20, output target)

References

  1. Cortez, P., & Silva, A. M. G. (2008). Using data mining to predict secondary school student performance. In A. Brito & J. Teixeira (Eds.), Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) (pp. 5-12). Porto, Portugal, April 2008: EUROSIS. ISBN 978-9077381-39-7.
  2. Eurostat. (2007). Early school-leavers. http://epp.eurostat.ec.europa.eu/.

Imports

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(caret)
## Loading required package: lattice
library(reshape)
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
## 
##     rename
df <- read.csv("../data/student-por.csv", sep = ";", stringsAsFactors = TRUE)
head(df)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        0       yes     no   no         no
## 4   mother          1         3        0        no    yes   no        yes
## 5   father          1         2        0        no    yes   no         no
## 6   mother          1         2        0        no    yes   no        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3
## 1        4  0 11 11
## 2        2  9 11 11
## 3        6 12 13 12
## 4        0 14 14 14
## 5        0 11 13 13
## 6        6 12 12 13

Preprocessing

str(df)
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...
summary(df)
##  school   sex          age        address famsize   Pstatus      Medu      
##  GP:423   F:383   Min.   :15.00   R:197   GT3:457   A: 80   Min.   :0.000  
##  MS:226   M:266   1st Qu.:16.00   U:452   LE3:192   T:569   1st Qu.:2.000  
##                   Median :17.00                             Median :2.000  
##                   Mean   :16.74                             Mean   :2.515  
##                   3rd Qu.:18.00                             3rd Qu.:4.000  
##                   Max.   :22.00                             Max.   :4.000  
##       Fedu             Mjob           Fjob            reason      guardian  
##  Min.   :0.000   at_home :135   at_home : 42   course    :285   father:153  
##  1st Qu.:1.000   health  : 48   health  : 23   home      :149   mother:455  
##  Median :2.000   other   :258   other   :367   other     : 72   other : 41  
##  Mean   :2.307   services:136   services:181   reputation:143               
##  3rd Qu.:3.000   teacher : 72   teacher : 36                                
##  Max.   :4.000                                                              
##    traveltime      studytime        failures      schoolsup famsup     paid    
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   no :581   no :251   no :610  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   yes: 68   yes:398   yes: 39  
##  Median :1.000   Median :2.000   Median :0.0000                                
##  Mean   :1.569   Mean   :1.931   Mean   :0.2219                                
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                                
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                                
##  activities nursery   higher    internet  romantic      famrel     
##  no :334    no :128   no : 69   no :151   no :410   Min.   :1.000  
##  yes:315    yes:521   yes:580   yes:498   yes:239   1st Qu.:4.000  
##                                                     Median :4.000  
##                                                     Mean   :3.931  
##                                                     3rd Qu.:5.000  
##                                                     Max.   :5.000  
##     freetime        goout            Dalc            Walc          health     
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:2.000  
##  Median :3.00   Median :3.000   Median :1.000   Median :2.00   Median :4.000  
##  Mean   :3.18   Mean   :3.185   Mean   :1.502   Mean   :2.28   Mean   :3.536  
##  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000  
##  Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##     absences            G1             G2              G3       
##  Min.   : 0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00  
##  Median : 2.000   Median :11.0   Median :11.00   Median :12.00  
##  Mean   : 3.659   Mean   :11.4   Mean   :11.57   Mean   :11.91  
##  3rd Qu.: 6.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :32.000   Max.   :19.0   Max.   :19.00   Max.   :19.00

Convert ordinal variables into ordered factors.

df$famsize <- factor(df$famsize, ordered = TRUE, levels = c("LE3", "GT3"))
df$Medu <- factor(df$Medu, ordered = TRUE, levels = 0:4)
df$Fedu <- factor(df$Fedu, ordered = TRUE, levels = 0:4)
df$traveltime <- factor(df$traveltime, ordered = TRUE, levels = 1:4)
df$studytime <- factor(df$studytime, ordered = TRUE, levels = 1:4)
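# failures only takes the values 0-3 in this data, so level "4" is never
# observed and stays empty (it appears with a count of 0 in the summary below)
df$failures <- factor(df$failures, ordered = TRUE, levels = 0:4)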
df$famrel <- factor(df$famrel, ordered = TRUE, levels = 1:5)
df$freetime <- factor(df$freetime, ordered = TRUE, levels = 1:5)
df$goout <- factor(df$goout, ordered = TRUE, levels = 1:5)
df$Dalc <- factor(df$Dalc, ordered = TRUE, levels = 1:5)
df$Walc <- factor(df$Walc, ordered = TRUE, levels = 1:5)
df$health <- factor(df$health, ordered = TRUE, levels = 1:5)
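
The conversions above all follow one pattern, so they could also be written as a loop over a lookup of column names and level ranges. The following is a minimal sketch on a made-up toy data frame; `toy` and `ordinal_levels` are hypothetical names that do not appear in the original analysis, and only three of the twelve columns are shown.

```r
# Made-up toy data standing in for a few columns of df
toy <- data.frame(
  Medu = c(4, 1, 0, 3),
  studytime = c(2, 2, 3, 1),
  famrel = c(4, 5, 3, 1)
)

# Hypothetical lookup: column name -> full range of allowed levels
ordinal_levels <- list(Medu = 0:4, studytime = 1:4, famrel = 1:5)

# Convert each listed column into an ordered factor with its own levels
for (col in names(ordinal_levels)) {
  toy[[col]] <- factor(toy[[col]], levels = ordinal_levels[[col]], ordered = TRUE)
}

str(toy)
```

Keeping the level ranges in one list may make it easier to spot a range that does not match the metadata.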

Create the binary pass/fail label from G3 (a final grade of 10 or above is a pass).

df$result <- ifelse(df$G3 >= 10, "pass", "fail")
df$result <- as.factor(df$result)

Descriptive Analysis

str(df)
## 'data.frame':    649 obs. of  34 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Ord.factor w/ 2 levels "LE3"<"GT3": 2 2 1 2 2 1 1 2 1 2 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 2 2 5 4 5 3 5 4 4 ...
##  $ Fedu      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 2 2 3 4 4 3 5 3 5 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...
##  $ result    : Factor w/ 2 levels "fail","pass": 2 2 2 2 2 2 2 2 2 2 ...
summary(df)
##  school   sex          age        address famsize   Pstatus Medu    Fedu   
##  GP:423   F:383   Min.   :15.00   R:197   LE3:192   A: 80   0:  6   0:  7  
##  MS:226   M:266   1st Qu.:16.00   U:452   GT3:457   T:569   1:143   1:174  
##                   Median :17.00                             2:186   2:209  
##                   Mean   :16.74                             3:139   3:131  
##                   3rd Qu.:18.00                             4:175   4:128  
##                   Max.   :22.00                                            
##        Mjob           Fjob            reason      guardian   traveltime
##  at_home :135   at_home : 42   course    :285   father:153   1:366     
##  health  : 48   health  : 23   home      :149   mother:455   2:213     
##  other   :258   other   :367   other     : 72   other : 41   3: 54     
##  services:136   services:181   reputation:143                4: 16     
##  teacher : 72   teacher : 36                                           
##                                                                        
##  studytime failures schoolsup famsup     paid     activities nursery  
##  1:212     0:549    no :581   no :251   no :610   no :334    no :128  
##  2:305     1: 70    yes: 68   yes:398   yes: 39   yes:315    yes:521  
##  3: 97     2: 16                                                      
##  4: 35     3: 14                                                      
##            4:  0                                                      
##                                                                       
##  higher    internet  romantic  famrel  freetime goout   Dalc    Walc    health 
##  no : 69   no :151   no :410   1: 22   1: 45    1: 48   1:451   1:247   1: 90  
##  yes:580   yes:498   yes:239   2: 29   2:107    2:145   2:121   2:150   2: 78  
##                                3:101   3:251    3:205   3: 43   3:120   3:124  
##                                4:317   4:178    4:141   4: 17   4: 87   4:108  
##                                5:180   5: 68    5:110   5: 17   5: 45   5:249  
##                                                                                
##     absences            G1             G2              G3         result   
##  Min.   : 0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00   fail:100  
##  1st Qu.: 0.000   1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00   pass:549  
##  Median : 2.000   Median :11.0   Median :11.00   Median :12.00             
##  Mean   : 3.659   Mean   :11.4   Mean   :11.57   Mean   :11.91             
##  3rd Qu.: 6.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00             
##  Max.   :32.000   Max.   :19.0   Max.   :19.00   Max.   :19.00
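
The summary reports 100 students labelled fail and 549 labelled pass, so the two classes are far from balanced. As a small worked example, the class shares can be computed directly from those two counts; the vector names `counts` and `shares` are only illustrative.

```r
# Class counts taken from the summary(df) output for `result`
counts <- c(fail = 100, pass = 549)

# Share of each class in percent, rounded to one decimal place
shares <- round(100 * prop.table(counts), 1)
print(shares)
```

Roughly 15% of the students fail. This suggests that plain accuracy could look flattering for a classifier that mostly predicts pass, so measures such as sensitivity and specificity may be worth checking alongside it.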

Exploratory Data Analysis

Univariate

ggplot(df, aes(x = school)) +
  geom_bar(fill = "blue") +
  labs(
    x = "School",
    y = "No. of students",
    title = "1. Student's school"
  )

ggplot(df, aes(x = sex)) +
  geom_bar(fill = "blue") +
  labs(
    x = "Sex",
    y = "No. of students",
    title = "2. Student's sex"
  )

ggplot(df, aes(x = age)) +
  geom_bar(
    fill = "green",
    color = "black"
  ) +
  labs(
    x = "Age",
    y = "Frequency",
    title = "3. Student's age"
  )

ggplot(df, aes(x = address)) +
  geom_bar(fill = "purple") +
  labs(
    x = "Address",
    y = "Frequency",
    title = "4. Student's home address type"
  )

ggplot(df, aes(x = famsize)) +
  geom_bar(fill = "purple") +
  labs(
    x = "Family size",
    y = "Frequency",
    title = "5. Family size"
  )

ggplot(df, aes(x = Pstatus)) +
  geom_bar(fill = "purple") +
  labs(
    x = "Parent's Status",
    y = "Frequency",
    title = "6. Parent's cohabitation status"
  )

ggplot(df, aes(x = Medu)) +
  geom_bar(
    fill = "white",
    color = "black"
  ) +
  labs(
    x = "Medu",
    y = "Frequency",
    title = "7. Mother's education level"
  )

ggplot(df, aes(x = Fedu)) +
  geom_bar(
    fill = "white",
    color = "black"
  ) +
  labs(
    x = "Fedu",
    y = "Frequency",
    title = "8. Father's education level"
  )

ggplot(df, aes(x = Mjob)) +
  geom_bar(fill = "purple") +
  labs(
    x = "Mjob",
    y = "Frequency",
    title = "9. Mother's job"
  )

ggplot(df, aes(x = Fjob)) +
  geom_bar(fill = "purple") +
  labs(
    x = "Fjob",
    y = "Frequency",
    title = "10. Father's job"
  )

ggplot(df, aes(x = reason)) +
  geom_bar(fill = "purple") +
  labs(
    x = "Reason",
    y = "Frequency",
    title = "11. Reason for selecting school"
  )

ggplot(df, aes(x = guardian)) +
  geom_bar(fill = "purple") +
  labs(
    x = "Guardian",
    y = "Frequency",
    title = "12. Student's guardian"
  )

ggplot(df, aes(x = traveltime)) +
  geom_bar(
    fill = "white",
    color = "black"
  ) +
  labs(
    x = "Traveltime",
    y = "Frequency",
    title = "13. Home to school travel time"
  )

ggplot(df, aes(x = studytime)) +
  geom_bar(
    fill = "white",
    color = "black"
  ) +
  labs(
    x = "Studytime",
    y = "Frequency",
    title = "14. Weekly study time"
  )

ggplot(df, aes(x = failures)) +
  geom_bar(
    fill = "green",
    color = "black"
  ) +
  labs(
    x = "Failures",
    y = "Frequency",
    title = "15. Number of past class failures"
  )

ggplot(df, aes(x = schoolsup)) +
  geom_bar(fill = "blue") +
  labs(
    x = "Schoolsup",
    y = "Frequency",
    title = "16. Extra educational support"
  )

ggplot(df, aes(x = famsup)) +
  geom_bar(fill = "blue") +
  labs(
    x = "Famsup",
    y = "Frequency",
    title = "17. Family educational support"
  )

ggplot(df, aes(x = paid)) +
  geom_bar(fill = "blue") +
  labs(
    x = "Paid",
    y = "Frequency",
    title = "18. Extra paid Portuguese classes"
  )

ggplot(df, aes(x = activities)) +
  geom_bar(fill = "blue") +
  labs(
    x = "Activities",
    y = "Frequency",
    title = "19. Extra-curricular activities"
  )

ggplot(df, aes(x = nursery)) +
  geom_bar(fill = "blue") +
  labs(
    x = "Nursery",
    y = "Frequency",
    title = "20. Attended nursery school"
  )

ggplot(df, aes(x = higher)) +
  geom_bar(fill = "blue") +
  labs(
    x = "Higher",
    y = "Frequency",
    title = "21. Wants to take higher education"
  )

ggplot(df, aes(x = internet)) +
  geom_bar(fill = "blue") +
  labs(
    x = "Internet",
    y = "Frequency",
    title = "22. Internet access at home"
  )

ggplot(df, aes(x = romantic)) +
  geom_bar(fill = "blue") +
  labs(
    x = "Romantic",
    y = "Frequency",
    title = "23. In a romantic relationship"
  )

ggplot(df, aes(x = famrel)) +
  geom_bar(
    fill = "white",
    color = "black"
  ) +
  labs(
    x = "Famrel",
    y = "Frequency",
    title = "24. Quality of family relationship"
  )

ggplot(df, aes(x = freetime)) +
  geom_bar(
    fill = "white",
    color = "black"
  ) +
  labs(
    x = "Freetime",
    y = "Frequency",
    title = "25. Free time after school"
  )

ggplot(df, aes(x = goout)) +
  geom_bar(
    fill = "white",
    color = "black"
  ) +
  labs(
    x = "Goout",
    y = "Frequency",
    title = "26. Going out with friends"
  )

ggplot(df, aes(x = Dalc)) +
  geom_bar(
    fill = "white",
    color = "black"
  ) +
  labs(
    x = "Dalc",
    y = "Frequency",
    title = "27. Workday alcohol consumption"
  )

ggplot(df, aes(x = Walc)) +
  geom_bar(
    fill = "white",
    color = "black"
  ) +
  labs(
    x = "Walc",
    y = "Frequency",
    title = "28. Weekend alcohol consumption"
  )

ggplot(df, aes(x = health)) +
  geom_bar(
    fill = "white",
    color = "black"
  ) +
  labs(
    x = "Health",
    y = "Frequency",
    title = "29. Current health status"
  )

ggplot(df, aes(x = absences)) +
  geom_bar(
    fill = "green",
    color = "black"
  ) +
  labs(
    x = "Absences",
    y = "Frequency",
    title = "30. Number of school absences"
  )

ggplot(df, aes(x = G1)) +
  geom_bar(
    fill = "pink",
    color = "black"
  ) +
  labs(
    x = "G1",
    y = "Frequency",
    title = "31. First period grade"
  )

ggplot(df, aes(x = G2)) +
  geom_bar(
    fill = "pink",
    color = "black"
  ) +
  labs(
    x = "G2",
    y = "Frequency",
    title = "32. Second period grade"
  )

ggplot(df, aes(x = G3)) +
  geom_bar(
    fill = "orange",
    color = "black"
  ) +
  labs(
    x = "G3",
    y = "Frequency",
    title = "33. Final grade-output target"
  )

ggplot(df, aes(x = result)) +
  geom_bar(
    fill = "blue",
    color = "black"
  ) +
  labs(
    x = "Result",
    y = "Frequency",
    title = "34. Final grade-result"
  )

Bivariate

General Function for Bar Graph Plotting

# Dodged (unstacked) bar graph for bivariate analysis
Unstacked_bi_bar_graph <- function(`x.axis` = "", Result = "") {
  # x.axis: independent variable, drawn on the x-axis
  # Result: dependent variable, used as the fill colour of the bars
  # position = "dodge" places the bars side by side instead of stacking them

  # 1. Base plot with the independent variable on the x-axis
  bar <- ggplot(df, aes(`x.axis`))

  # 2. Count the observations in each class and draw dodged bars
  graph_output <- bar + geom_bar(aes(fill = `Result`), position = "dodge") +
    labs(y = "Frequency")

  # 3. Extract the computed counts and bar positions from the plot
  graph_label <- layer_data(graph_output)

  # 4. Annotate each bar with its count at a fixed height of y = 15
  graph_output <- graph_output + annotate(
    geom = "text", label = graph_label$count,
    x = graph_label$x, y = 15
  )

  return(graph_output)
}
bar.bivar <- Unstacked_bi_bar_graph(df$school, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "1. School vs Final Grade Result") +
  labs(x = "Student's School")
bar.bivar
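
The helper above receives whole vectors such as df$school, so the data frame df is fixed inside the function and the count labels sit at a constant height. A possible alternative, sketched here under the assumption that ggplot2 3.3.0 or later is installed, takes the data frame and column names as arguments and lets the count statistic position the labels. The function name `dodged_bar_graph`, its parameters, and the toy data are hypothetical and not part of the original analysis.

```r
library(ggplot2)

# Hypothetical alternative helper: columns are passed by name as strings
dodged_bar_graph <- function(data, x_col, fill_col) {
  ggplot(data, aes(x = .data[[x_col]], fill = .data[[fill_col]])) +
    # Side-by-side bars, one per observed combination of x value and fill class
    geom_bar(position = position_dodge(width = 0.9)) +
    # Print each bar's count just above the bar
    geom_text(
      aes(label = after_stat(count), group = .data[[fill_col]]),
      stat = "count",
      position = position_dodge(width = 0.9),
      vjust = -0.5
    ) +
    labs(x = x_col, y = "Frequency", fill = fill_col)
}

# Made-up toy data, only to show the call
toy <- data.frame(
  school = c("GP", "GP", "MS", "MS", "MS"),
  result = c("pass", "fail", "pass", "pass", "pass")
)

p <- dodged_bar_graph(toy, "school", "result")
p
```

With the real data, a call such as dodged_bar_graph(df, "school", "result") would be the counterpart of the first plot in this section.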

bar.bivar <- Unstacked_bi_bar_graph(df$sex, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "2. Gender vs Final Grade Result") +
  labs(x = "Student's Sex")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$age, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "3. Age vs Final Grade Result") +
  labs(x = "Student's Age")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$address, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "4. Student's Home Address Type vs Final Grade Result") +
  labs(x = "Address Type")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$famsize, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "5. Student's Family Size vs Final Grade Result") +
  labs(x = "Family Size")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$Pstatus, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "6. Parent's Cohabitation Status vs Final Grade Result") +
  labs(x = "Parent's Cohabitation Status")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$Medu, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "7. Mother's Education vs Final Grade Result") +
  labs(x = "Mother's Education Level")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$Fedu, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "8. Father's Education vs Final Grade Result") +
  labs(x = "Father's Education Level")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$Mjob, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "9. Mother's Job vs Final Grade Result") +
  labs(x = "Mother's Job")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$Fjob, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "10. Father's Job vs Final Grade Result") +
  labs(x = "Father's Job")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$reason, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "11. Reason to Choose Selected School vs Final Grade Result") +
  labs(x = "Reason Types")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$guardian, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "12. Student's Guardian vs Final Grade Result") +
  labs(x = "Guardian")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$traveltime, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "13. Home to School Travel Time vs Final Grade Result") +
  labs(x = "Travel Time")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$studytime, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "14. Weekly Study Time vs Final Grade Result") +
  labs(x = "Weekly Study Time")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$failures, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "15. Number of Past Class Failures vs Final Grade Result") +
  labs(x = "Number of Past Class Failures")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$schoolsup, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "16. Extra Educational Support vs Final Grade Result") +
  labs(x = "Extra Educational Support")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$famsup, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "17. Family Educational Support vs Final Grade Result") +
  labs(x = "Family Educational Support")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$paid, df$result)
bar.bivar <- bar.bivar +
  ggtitle(label = "18. Extra Paid Classes for Portuguese Subject vs Final Grade Result") +
  labs(x = "Joined Extra Paid Classes?")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$activities, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "19. Extra-Curricular Activities vs Final Grade Result") +
  labs(x = "Joined Extra-Curricular Activities?")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$nursery, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "20. Attended Nursery School vs Final Grade Result") +
  labs(x = "Attended Nursery School?")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$internet, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "21. Internet Access at Home vs Final Grade Result") +
  labs(x = "Internet Access at Home")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$romantic, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "22. With a Romantic Relationship vs Final Grade Result") +
  labs(x = "With a Romantic Relationship?")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$famrel, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "23. Quality of Family Relationships vs Final Grade Result") +
  labs(x = "Quality of Family Relationships")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$freetime, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "24. Free Time After School vs Final Grade Result") +
  labs(x = "Free Time After School")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$goout, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "25. Going Out with Friends vs Final Grade Result") +
  labs(x = "Going Out with Friends")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$Dalc, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "26. Workday Alcohol Consumption vs Final Grade Result") +
  labs(x = "Workday Alcohol Consumption")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$Walc, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "27. Weekend Alcohol Consumption vs Final Grade Result") +
  labs(x = "Weekend Alcohol Consumption")
bar.bivar

bar.bivar <- Unstacked_bi_bar_graph(df$health, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "28. Current Health Status vs Final Grade Result") +
  labs(x = "Current Health Status")
bar.bivar

ggplot(df, aes(y = absences, x = result)) +
  geom_boxplot() +
  ggtitle(label = "29. Number of School Absences vs Final Grade Result") +
  labs(y = "Number of School Absences", x = "Result") +
  scale_y_continuous(breaks = seq(0, 36, by = 2))

ggplot(df, aes(y = G1, x = result)) +
  geom_boxplot() +
  labs(
    title = "30. First Period Grade vs Final Grade Result",
    y = "First Period Grade",
    x = "Result"
  ) +
  scale_y_continuous(breaks = seq(0, 22, by = 2))

ggplot(df, aes(y = G2, x = result)) +
  geom_boxplot() +
  labs(
    title = "31. Second Period Grade vs Final Grade Result",
    y = "Second Period Grade",
    x = "Result"
  ) +
  scale_y_continuous(breaks = seq(0, 22, by = 2))

Correlation Analysis

Drop the first column and convert the processed dataset to a numeric matrix named ‘heat_map_data’.

heat_map_data <- df %>% select(-1)
heat_map_data <- data.matrix(heat_map_data)
str(heat_map_data)
##  int [1:649, 1:33] 1 1 1 1 1 2 2 1 2 2 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:33] "sex" "age" "address" "famsize" ...
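
One point worth keeping in mind: `data.matrix()` replaces each factor with its internal integer codes, so for a nominal attribute with more than two levels the Pearson correlation computed next depends on the order of the factor levels. A minimal sketch on a made-up data frame, assuming such columns are stored as factors (the values are illustrative, not taken from the dataset):

```r
# Toy data frame; values are illustrative only
toy <- data.frame(
  Mjob = factor(c("teacher", "at_home", "services", "at_home")),
  G3   = c(15L, 9L, 12L, 10L)
)

# Default levels are sorted alphabetically:
# at_home = 1, services = 2, teacher = 3
levels(toy$Mjob)

# data.matrix() keeps only the integer codes of the factor
toy_matrix <- data.matrix(toy)
toy_matrix[, "Mjob"]  # 3 1 2 1
```

Reordering the levels would change the codes, and therefore the correlation, even though the underlying data stay the same.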

Calculate the pairwise Pearson correlation between all attributes, rounded to two decimal places.

heat_map_data <- round(cor(heat_map_data, method = "pearson"), 2)
head(heat_map_data)
##           sex   age address famsize Pstatus  Medu  Fedu  Mjob  Fjob reason
## sex      1.00 -0.04    0.03   -0.10    0.06  0.12  0.08  0.15  0.08   0.01
## age     -0.04  1.00   -0.03    0.00   -0.01 -0.11 -0.12 -0.07 -0.05  -0.03
## address  0.03 -0.03    1.00   -0.05   -0.09  0.19  0.14  0.16 -0.01   0.00
## famsize -0.10  0.00   -0.05    1.00    0.24  0.01  0.04 -0.02  0.06  -0.03
## Pstatus  0.06 -0.01   -0.09    0.24    1.00 -0.06 -0.03 -0.03  0.05  -0.03
## Medu     0.12 -0.11    0.19    0.01   -0.06  1.00  0.65  0.46  0.15   0.13
##         guardian traveltime studytime failures schoolsup famsup  paid
## sex        -0.04       0.04     -0.21     0.07     -0.11  -0.13  0.08
## age         0.27       0.03     -0.01     0.32     -0.17  -0.10 -0.01
## address    -0.02      -0.34      0.06    -0.06      0.02   0.01 -0.03
## famsize     0.00      -0.01      0.01     0.07      0.06   0.04  0.05
## Pstatus    -0.17       0.04     -0.01    -0.01     -0.01   0.01  0.02
## Medu       -0.01      -0.27      0.10    -0.17     -0.02   0.12  0.11
##         activities nursery higher internet romantic famrel freetime goout  Dalc
## sex           0.12   -0.04  -0.06     0.07    -0.11   0.08     0.15  0.06  0.28
## age          -0.05   -0.02  -0.27     0.01     0.18  -0.02     0.00  0.11  0.13
## address      -0.01    0.02   0.08     0.18    -0.03  -0.03    -0.04  0.02 -0.05
## famsize       0.01   -0.10   0.00    -0.01     0.03   0.00     0.02  0.00 -0.06
## Pstatus       0.10   -0.03   0.02     0.06    -0.05   0.05     0.04  0.03  0.04
## Medu          0.12    0.13   0.21     0.27    -0.03   0.02    -0.02  0.01 -0.01
##          Walc health absences    G1    G2    G3 result
## sex      0.32   0.14     0.02 -0.10 -0.10 -0.13  -0.08
## age      0.09  -0.01     0.15 -0.17 -0.11 -0.11  -0.11
## address -0.01   0.00     0.07  0.16  0.15  0.17   0.13
## famsize -0.08   0.00     0.00 -0.05 -0.04 -0.05  -0.05
## Pstatus  0.07   0.01    -0.12  0.02  0.02  0.00   0.00
## Medu    -0.02   0.00    -0.01  0.26  0.26  0.24   0.14

Reshape the correlation matrix ‘heat_map_data’ from wide format to long format.

heat_map_data <- melt(heat_map_data)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
head(heat_map_data)
##        X1  X2 value
## 1     sex sex  1.00
## 2     age sex -0.04
## 3 address sex  0.03
## 4 famsize sex -0.10
## 5 Pstatus sex  0.06
## 6    Medu sex  0.12
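
The same wide-to-long reshape can be reproduced in base R, which may help to see what `melt()` does to a matrix. This is only a sketch on a toy 2 x 2 matrix. `as.data.frame(as.table())` names its columns Var1, Var2 and Freq by default, so they are renamed here to match the output above.

```r
# Toy correlation matrix with row and column names
cor_toy <- matrix(
  c(1.00, -0.04,
    -0.04, 1.00),
  nrow = 2,
  dimnames = list(c("sex", "age"), c("sex", "age"))
)

# One row per (row name, column name) pair, in column-major order
long_toy <- as.data.frame(as.table(cor_toy))
names(long_toy) <- c("X1", "X2", "value")
long_toy
```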

Plot the heat map.

heatmap <- ggplot(data = heat_map_data, aes(x = X1, y = X2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "red", high = "darkblue") +
  theme(
    axis.text.x = element_text(angle = 90, size = 8),
    axis.title.x = element_text(angle = 0, color = "red"),
    axis.title.y = element_text(angle = 360, color = "blue")
  ) +
  coord_equal()
heatmap

Show only the absolute correlation of each attribute with G3, in descending order.

G3 <- subset(heat_map_data, X1 == "G3")
G3[3] <- abs(G3[3])
G3 <- G3[order(G3[, 3], decreasing = TRUE), ]
G3
##      X1         X2 value
## 1055 G3         G3  1.00
## 1022 G3         G2  0.92
## 989  G3         G1  0.83
## 1088 G3     result  0.66
## 461  G3   failures  0.39
## 659  G3     higher  0.33
## 428  G3  studytime  0.25
## 197  G3       Medu  0.24
## 230  G3       Fedu  0.21
## 857  G3       Dalc  0.20
## 890  G3       Walc  0.18
## 98   G3    address  0.17
## 263  G3       Mjob  0.15
## 692  G3   internet  0.15
## 32   G3        sex  0.13
## 395  G3 traveltime  0.13
## 329  G3     reason  0.12
## 791  G3   freetime  0.12
## 65   G3        age  0.11
## 923  G3     health  0.10
## 725  G3   romantic  0.09
## 824  G3      goout  0.09
## 956  G3   absences  0.09
## 362  G3   guardian  0.08
## 494  G3  schoolsup  0.07
## 527  G3     famsup  0.06
## 593  G3 activities  0.06
## 758  G3     famrel  0.06
## 131  G3    famsize  0.05
## 296  G3       Fjob  0.05
## 560  G3       paid  0.05
## 626  G3    nursery  0.03
## 164  G3    Pstatus  0.00
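
The same ranking can also be read straight from a correlation matrix, without reshaping it to long format first. A minimal sketch on a toy 3 x 3 matrix (the values are illustrative, not results from the dataset):

```r
vars <- c("G3", "G2", "failures")

# Toy symmetric correlation matrix; values are illustrative only
cor_toy <- matrix(
  c( 1.00,  0.90, -0.40,
     0.90,  1.00, -0.35,
    -0.40, -0.35,  1.00),
  nrow = 3,
  dimnames = list(vars, vars)
)

# Take the G3 column, drop the sign, and sort from strongest to weakest
g3_rank <- sort(abs(cor_toy[, "G3"]), decreasing = TRUE)
g3_rank
```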

Train-Test Split

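Create separate training (80%) and test (20%) sets for regression and classification. The regression set drops result and the classification set drops G3, so neither task can use the other task's target as a predictor.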

Regression

df_reg <- df

df_reg$result <- NULL

set.seed(42)

train_index_reg <- createDataPartition(
  df_reg$G3,
  p = 0.8,
  list = FALSE,
  times = 1
)

train_set_reg <- df_reg[train_index_reg, ]
test_set_reg <- df_reg[-train_index_reg, ]

Classification

df_clf <- df

df_clf$G3 <- NULL

set.seed(42)

train_index_clf <- createDataPartition(
  df_clf$result,
  p = 0.8,
  list = FALSE,
  times = 1
)

train_set_clf <- df_clf[train_index_clf, ]
test_set_clf <- df_clf[-train_index_clf, ]
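
For a factor outcome, `createDataPartition()` samples within each class so that the split preserves the class proportions. The idea can be sketched in base R as follows. The toy labels and the helper name `stratified_index` are hypothetical, and this is not caret's actual implementation.

```r
# Hypothetical helper: draw a fraction p of the row indices from each class
stratified_index <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
y_toy <- factor(rep(c("fail", "pass"), times = c(20, 80)))
train_idx <- stratified_index(y_toy, p = 0.8)

# Class proportions are the same in both partitions
prop.table(table(y_toy[train_idx]))
prop.table(table(y_toy[-train_idx]))
```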

Subsampling

Since we are dealing with imbalanced classes, we explore subsampling techniques for classification.

Original Sample

table(train_set_clf$result)
## 
## fail pass 
##   80  440

Down-Sampling

set.seed(42)

train_set_clf_down <- downSample(
  x = train_set_clf[, -ncol(train_set_clf)],
  y = train_set_clf$result,
  yname = "result"
)

table(train_set_clf_down$result)
## 
## fail pass 
##   80   80

Up-Sampling

set.seed(42)

train_set_clf_up <- upSample(
  x = train_set_clf[, -ncol(train_set_clf)],
  y = train_set_clf$result,
  yname = "result"
)

table(train_set_clf_up$result)
## 
## fail pass 
##  440  440
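
Conceptually, down-sampling randomly discards rows from the larger class until every class matches the smallest one, while up-sampling keeps all rows and draws extra rows with replacement from the smaller class until every class matches the largest one. A base-R sketch on toy labels with the same class counts as above (the helper `resample_to` is hypothetical and only mirrors the idea, not caret's implementation):

```r
# Hypothetical helper: return row indices so that each class has n rows
resample_to <- function(y, n) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    if (length(idx) >= n) {
      # Down-sample: keep a random subset without replacement
      idx[sample.int(length(idx), size = n)]
    } else {
      # Up-sample: keep every row, then add draws with replacement
      extra <- idx[sample.int(length(idx), size = n - length(idx), replace = TRUE)]
      c(idx, extra)
    }
  })
  unlist(picked, use.names = FALSE)
}

set.seed(42)
y_toy <- factor(rep(c("fail", "pass"), times = c(80, 440)))
class_sizes <- table(y_toy)

down_idx <- resample_to(y_toy, n = min(class_sizes))
up_idx   <- resample_to(y_toy, n = max(class_sizes))

table(y_toy[down_idx])  # 80 fail, 80 pass
table(y_toy[up_idx])    # 440 fail, 440 pass
```

Down-sampling leaves only 160 training rows, whereas up-sampling repeats each failing student several times, which is the trade-off between the two approaches.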

Modeling

Define 10-fold cross-validation repeated 10 times.

fit_control <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10
)
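
With 520 training rows, each of the 10 folds holds out about 52 rows, so every candidate model is fit on roughly 468 rows, and 10 repeats give 100 resampled performance estimates per tuning candidate. A small base-R sketch of that arithmetic (the fold assignment here is plain random with equal fold sizes, so it only approximates the sizes caret reports):

```r
n_train   <- 520  # training rows in this report
n_folds   <- 10
n_repeats <- 10

set.seed(42)
# Assign each row to one of 10 equally sized folds, in random order
fold_id <- sample(rep(seq_len(n_folds), length.out = n_train))

held_out    <- as.vector(table(fold_id))  # rows held out per fold
fit_sizes   <- n_train - held_out         # rows used for fitting per fold
n_resamples <- n_folds * n_repeats        # estimates per tuning candidate

held_out
fit_sizes
n_resamples
```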

Regression

Helper function for regression models

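# Fit a caret model on the training set with repeated cross-validation,
# then score its predictions on the held-out regression test set.
train_evaluate_reg <- function(method = "",
                               data = train_set_reg,
                               tuneGrid = NULL,
                               tuneLength = NULL,
                               name = "") {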
  set.seed(42)

  fit <- train(
    G3 ~ .,
    data = data,
    method = method,
    trControl = fit_control,
    preProcess = c("center", "scale"),
    tuneGrid = tuneGrid,
    tuneLength = tuneLength
  )
  print(fit)

  pred <- predict(fit, test_set_reg)
  result <- postResample(pred = pred, obs = test_set_reg$G3)

  metrics <- data.frame(
    Model = name,
    RMSE = result[["RMSE"]],
    Rsquared = result[["Rsquared"]],
    MAE = result[["MAE"]]
  )

  return(
    list(
      model = fit,
      metrics = metrics
    )
  )
}

# Return the row of cross-validation results that matches the best tuning parameters.
get_best_result <- function(caret_fit) {
  best <- which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result <- caret_fit$results[best, ]
  rownames(best_result) <- NULL
  best_result
}
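
For reference, the three metrics that `postResample()` returns for a numeric outcome can be written out by hand. The sketch below uses made-up predictions. Note that caret reports Rsquared as the squared Pearson correlation between predictions and observations, which can differ from 1 - SSE/SST.

```r
obs  <- c(10, 12, 14, 8, 16)  # illustrative observed grades
pred <- c(11, 12, 13, 9, 15)  # illustrative predictions

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
mae  <- mean(abs(pred - obs))       # mean absolute error
rsq  <- cor(pred, obs)^2            # squared Pearson correlation

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```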

1. Decision Tree

CART

  • method = ‘rpart’
  • Type: Regression, Classification
  • Tuning parameters:
    • cp (Complexity Parameter)
  • Required packages: rpart
  • A model-specific variable importance metric is available.

Training model with auto tuning:

dt_reg <- train_evaluate_reg("rpart", name = "Decision Tree", tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## CART 
## 
## 520 samples
##  32 predictor
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE      Rsquared   MAE      
##   0.002557381  1.511393  0.7924958  0.9036204
##   0.002711357  1.510197  0.7929802  0.9011272
##   0.005577788  1.480733  0.8000066  0.8637666
##   0.013277533  1.527893  0.7858778  0.9042992
##   0.013807340  1.534900  0.7842550  0.9149780
##   0.022095544  1.594516  0.7664695  1.0153027
##   0.053870748  1.745563  0.7258779  1.1300192
##   0.091924350  1.911984  0.6642470  1.3105574
##   0.127582534  2.194760  0.5633255  1.5491983
##   0.522696343  2.771181  0.4965485  2.0307630
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.005577788.

Best parameters from tuning:

get_best_result(dt_reg$model)
##            cp     RMSE  Rsquared       MAE    RMSESD RsquaredSD     MAESD
## 1 0.005577788 1.480733 0.8000066 0.8637666 0.3764418 0.08400127 0.1518122

Plot the trained model:

plot(dt_reg$model)

  • cp = 0.005577788 is the optimal complexity parameter, giving the lowest cross-validated RMSE.

Evaluation of model:

dt_reg$metrics
##           Model     RMSE  Rsquared      MAE
## 1 Decision Tree 1.340746 0.8114244 0.861462

2. Random Forest

Random Forest

  • method = ‘ranger’
  • Type: Classification, Regression
  • Tuning parameters:
    • mtry (#Randomly Selected Predictors): number of variables randomly sampled as candidates at each split
    • splitrule (Splitting Rule)
    • min.node.size (Minimal Node Size)
  • Required packages: e1071, ranger, dplyr
  • A model-specific variable importance metric is available.

Training model with auto tuning:

rf_reg <- train_evaluate_reg("ranger", name = "Random Forest", tuneLength = 10)
## Random Forest 
## 
## 520 samples
##  32 predictor
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE      Rsquared   MAE      
##    2    variance    2.361082  0.5965337  1.6083866
##    2    extratrees  2.510659  0.5538075  1.7610333
##    9    variance    1.671533  0.7692800  1.0813811
##    9    extratrees  1.844648  0.7384919  1.2565375
##   17    variance    1.458251  0.8115059  0.9221814
##   17    extratrees  1.571260  0.7925544  1.0201987
##   25    variance    1.382811  0.8255016  0.8781849
##   25    extratrees  1.452196  0.8132295  0.9186745
##   33    variance    1.349965  0.8314621  0.8630600
##   33    extratrees  1.395685  0.8225973  0.8814960
##   40    variance    1.337174  0.8335206  0.8572666
##   40    extratrees  1.364251  0.8279827  0.8666393
##   48    variance    1.332195  0.8340520  0.8557826
##   48    extratrees  1.347049  0.8312339  0.8624743
##   56    variance    1.334484  0.8331868  0.8596320
##   56    extratrees  1.334864  0.8334659  0.8613067
##   64    variance    1.341194  0.8311826  0.8630732
##   64    extratrees  1.327816  0.8346144  0.8608048
##   72    variance    1.356029  0.8277105  0.8705985
##   72    extratrees  1.328030  0.8342539  0.8635932
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 64, splitrule = extratrees
##  and min.node.size = 5.

Best parameters from tuning:

get_best_result(rf_reg$model)
##   mtry min.node.size  splitrule     RMSE  Rsquared       MAE   RMSESD
## 1   64             5 extratrees 1.327816 0.8346144 0.8608048 0.355189
##   RsquaredSD     MAESD
## 1 0.07199443 0.1303264

Plot the trained model:

plot(rf_reg$model)

  • For splitrule = variance, RMSE is lowest at mtry = 48 and rises for larger values.
  • For splitrule = extratrees, RMSE is lowest at mtry = 64 and almost unchanged at mtry = 72 (the maximum), so values between 64 and 72 are worth tuning.
  • min.node.size was held constant at 5, so other values are worth trying.

Training model with manual tuning:

  • splitrule = variance or extratrees
  • mtry from 48 to 72, in steps of 2
  • min.node.size from 1 to 10

tuneGrid_rf_reg <- expand.grid(
  splitrule = c("variance", "extratrees"),
  mtry = seq(48, 72, by = 2),
  min.node.size = seq(1, 10)
)
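
This grid is fairly large: 2 split rules x 13 values of mtry x 10 node sizes give 260 candidates, and with 10-fold cross-validation repeated 10 times each candidate is fit 100 times. A quick check of that count, rebuilding the same grid under a different name:

```r
grid_check <- expand.grid(
  splitrule = c("variance", "extratrees"),
  mtry = seq(48, 72, by = 2),
  min.node.size = seq(1, 10)
)

n_candidates <- nrow(grid_check)   # 2 * 13 * 10 candidates
n_fits <- n_candidates * 10 * 10   # resampled fits across all candidates

n_candidates
n_fits
```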

rf_reg_manual_tune <- train_evaluate_reg("ranger", name = "Random Forest", tuneGrid = tuneGrid_rf_reg)
## Random Forest 
## 
## 520 samples
##  32 predictor
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ... 
## Resampling results across tuning parameters:
## 
##   splitrule   mtry  min.node.size  RMSE      Rsquared   MAE      
##   variance    48     1             1.339687  0.8321916  0.8600324
##   variance    48     2             1.336134  0.8332092  0.8592320
##   variance    48     3             1.333500  0.8336435  0.8559939
##   variance    48     4             1.334916  0.8332117  0.8589217
##   variance    48     5             1.332246  0.8340008  0.8572885
##   variance    48     6             1.331247  0.8343653  0.8556232
##   variance    48     7             1.328390  0.8350604  0.8553474
##   variance    48     8             1.328514  0.8350560  0.8533839
##   variance    48     9             1.326800  0.8354832  0.8535098
##   variance    48    10             1.324920  0.8360505  0.8524417
##   variance    50     1             1.335152  0.8331818  0.8593207
##   variance    50     2             1.333297  0.8335790  0.8567243
##   variance    50     3             1.334035  0.8335643  0.8578439
##   variance    50     4             1.332929  0.8338782  0.8566703
##   variance    50     5             1.327261  0.8351136  0.8554863
##   variance    50     6             1.332834  0.8337942  0.8569689
##   variance    50     7             1.328753  0.8348733  0.8543881
##   variance    50     8             1.328750  0.8347862  0.8552245
##   variance    50     9             1.327696  0.8351239  0.8535439
##   variance    50    10             1.326696  0.8352006  0.8531400
##   variance    52     1             1.339058  0.8321510  0.8613817
##   variance    52     2             1.335897  0.8328892  0.8592826
##   variance    52     3             1.336815  0.8327205  0.8586926
##   variance    52     4             1.334774  0.8333031  0.8586056
##   variance    52     5             1.330411  0.8342853  0.8565100
##   variance    52     6             1.332062  0.8338640  0.8567949
##   variance    52     7             1.329409  0.8346121  0.8545169
##   variance    52     8             1.326546  0.8353887  0.8544571
##   variance    52     9             1.326604  0.8352086  0.8547701
##   variance    52    10             1.324858  0.8358415  0.8515817
##   variance    54     1             1.339438  0.8320941  0.8608356
##   variance    54     2             1.336583  0.8326678  0.8588230
##   variance    54     3             1.334571  0.8333159  0.8586398
##   variance    54     4             1.333612  0.8333486  0.8575114
##   variance    54     5             1.331628  0.8340053  0.8574301
##   variance    54     6             1.331798  0.8338328  0.8569963
##   variance    54     7             1.331468  0.8341805  0.8558717
##   variance    54     8             1.328779  0.8346501  0.8552630
##   variance    54     9             1.329067  0.8346248  0.8547711
##   variance    54    10             1.325108  0.8356046  0.8532881
##   variance    56     1             1.336098  0.8328061  0.8591514
##   variance    56     2             1.336694  0.8327624  0.8602211
##   variance    56     3             1.332034  0.8336632  0.8571458
##   variance    56     4             1.335753  0.8327686  0.8595249
##   variance    56     5             1.334269  0.8334682  0.8587135
##   variance    56     6             1.332736  0.8334263  0.8575950
##   variance    56     7             1.329434  0.8344379  0.8563923
##   variance    56     8             1.325990  0.8351338  0.8543020
##   variance    56     9             1.326542  0.8351796  0.8535194
##   variance    56    10             1.331704  0.8342096  0.8568014
##   variance    58     1             1.341212  0.8315597  0.8617310
##   variance    58     2             1.338155  0.8323164  0.8598174
##   variance    58     3             1.337651  0.8322878  0.8601191
##   variance    58     4             1.335649  0.8330410  0.8588664
##   variance    58     5             1.333211  0.8333436  0.8577182
##   variance    58     6             1.334425  0.8331087  0.8588736
##   variance    58     7             1.331933  0.8338809  0.8576046
##   variance    58     8             1.330944  0.8338760  0.8571087
##   variance    58     9             1.326875  0.8349690  0.8542378
##   variance    58    10             1.326915  0.8349575  0.8531167
##   variance    60     1             1.339907  0.8316996  0.8597623
##   variance    60     2             1.339078  0.8319042  0.8614850
##   variance    60     3             1.337377  0.8324505  0.8609218
##   variance    60     4             1.335012  0.8328886  0.8585221
##   variance    60     5             1.335956  0.8326814  0.8602238
##   variance    60     6             1.336813  0.8324576  0.8610144
##   variance    60     7             1.333635  0.8333510  0.8587829
##   variance    60     8             1.333400  0.8333189  0.8577025
##   variance    60     9             1.327190  0.8349132  0.8557101
##   variance    60    10             1.329782  0.8342808  0.8562978
##   variance    62     1             1.342856  0.8309539  0.8635399
##   variance    62     2             1.342705  0.8313093  0.8636480
##   variance    62     3             1.341050  0.8315549  0.8620549
##   variance    62     4             1.341316  0.8315526  0.8625675
##   variance    62     5             1.339484  0.8320184  0.8620530
##   variance    62     6             1.335107  0.8329207  0.8597387
##   variance    62     7             1.335599  0.8326076  0.8603705
##   variance    62     8             1.334449  0.8328604  0.8586490
##   variance    62     9             1.334823  0.8329697  0.8586450
##   variance    62    10             1.334007  0.8333797  0.8577082
##   variance    64     1             1.346323  0.8300364  0.8648164
##   variance    64     2             1.345276  0.8303370  0.8643561
##   variance    64     3             1.343563  0.8307721  0.8629250
##   variance    64     4             1.342799  0.8310154  0.8635823
##   variance    64     5             1.343023  0.8309613  0.8627392
##   variance    64     6             1.339169  0.8317798  0.8604201
##   variance    64     7             1.338549  0.8319755  0.8606543
##   variance    64     8             1.335124  0.8327636  0.8587239
##   variance    64     9             1.335856  0.8327751  0.8585104
##   variance    64    10             1.335678  0.8325813  0.8591046
##   variance    66     1             1.349577  0.8295539  0.8670316
##   variance    66     2             1.345125  0.8302964  0.8635574
##   variance    66     3             1.344393  0.8307291  0.8645434
##   variance    66     4             1.342984  0.8308814  0.8631294
##   variance    66     5             1.342412  0.8309094  0.8636108
##   variance    66     6             1.342985  0.8308151  0.8638655
##   variance    66     7             1.342860  0.8309417  0.8628196
##   variance    66     8             1.339961  0.8316644  0.8617989
##   variance    66     9             1.338182  0.8320436  0.8609783
##   variance    66    10             1.337023  0.8325464  0.8606262
##   variance    68     1             1.348511  0.8292667  0.8666414
##   variance    68     2             1.350095  0.8293947  0.8677363
##   variance    68     3             1.350463  0.8291745  0.8675020
##   variance    68     4             1.348276  0.8296947  0.8671250
##   variance    68     5             1.347295  0.8300122  0.8662821
##   variance    68     6             1.347365  0.8300294  0.8663603
##   variance    68     7             1.344357  0.8304300  0.8644353
##   variance    68     8             1.342599  0.8309355  0.8631230
##   variance    68     9             1.342045  0.8312192  0.8625727
##   variance    68    10             1.340448  0.8314650  0.8621441
##   variance    70     1             1.352269  0.8284792  0.8687205
##   variance    70     2             1.353498  0.8285757  0.8686645
##   variance    70     3             1.351162  0.8290253  0.8694881
##   variance    70     4             1.351317  0.8289070  0.8685161
##   variance    70     5             1.351815  0.8284775  0.8697243
##   variance    70     6             1.350016  0.8293402  0.8673371
##   variance    70     7             1.347022  0.8300723  0.8654462
##   variance    70     8             1.347095  0.8295813  0.8657571
##   variance    70     9             1.345928  0.8301734  0.8654948
##   variance    70    10             1.344120  0.8306820  0.8638135
##   variance    72     1             1.356962  0.8274546  0.8717786
##   variance    72     2             1.354968  0.8278927  0.8699518
##   variance    72     3             1.357489  0.8274426  0.8714654
##   variance    72     4             1.356925  0.8276539  0.8724947
##   variance    72     5             1.355856  0.8277021  0.8712722
##   variance    72     6             1.352897  0.8285674  0.8705194
##   variance    72     7             1.352086  0.8287302  0.8688157
##   variance    72     8             1.351827  0.8288505  0.8677244
##   variance    72     9             1.347522  0.8297915  0.8670259
##   variance    72    10             1.347822  0.8296083  0.8667080
##   extratrees  48     1             1.347521  0.8308507  0.8649293
##   extratrees  48     2             1.348622  0.8306127  0.8655997
##   extratrees  48     3             1.348978  0.8305022  0.8641479
##   extratrees  48     4             1.346157  0.8314651  0.8626972
##   extratrees  48     5             1.346153  0.8314617  0.8619114
##   extratrees  48     6             1.345810  0.8316171  0.8636060
##   extratrees  48     7             1.346626  0.8313734  0.8613714
##   extratrees  48     8             1.343627  0.8324255  0.8609683
##   extratrees  48     9             1.344505  0.8320092  0.8590062
##   extratrees  48    10             1.344030  0.8322274  0.8587626
##   extratrees  50     1             1.345761  0.8310370  0.8648987
##   extratrees  50     2             1.346102  0.8309838  0.8647358
##   extratrees  50     3             1.343078  0.8318616  0.8619370
##   extratrees  50     4             1.344699  0.8317251  0.8635136
##   extratrees  50     5             1.342836  0.8319075  0.8620110
##   extratrees  50     6             1.341633  0.8324208  0.8613458
##   extratrees  50     7             1.338845  0.8330308  0.8592019
##   extratrees  50     8             1.340494  0.8328322  0.8592757
##   extratrees  50     9             1.339838  0.8330136  0.8583695
##   extratrees  50    10             1.339943  0.8331555  0.8587458
##   extratrees  52     1             1.342341  0.8318224  0.8635294
##   extratrees  52     2             1.342861  0.8314324  0.8644988
##   extratrees  52     3             1.341512  0.8319194  0.8627107
##   extratrees  52     4             1.341623  0.8319885  0.8621626
##   extratrees  52     5             1.343451  0.8316812  0.8639717
##   extratrees  52     6             1.339699  0.8326377  0.8620033
##   extratrees  52     7             1.339884  0.8327289  0.8602154
##   extratrees  52     8             1.338498  0.8329775  0.8595290
##   extratrees  52     9             1.337864  0.8332135  0.8579209
##   extratrees  52    10             1.338310  0.8331532  0.8587504
##   extratrees  54     1             1.341816  0.8318146  0.8638307
##   extratrees  54     2             1.340339  0.8320702  0.8652936
##   extratrees  54     3             1.339778  0.8323366  0.8634080
##   extratrees  54     4             1.338756  0.8324720  0.8627294
##   extratrees  54     5             1.337887  0.8329298  0.8629117
##   extratrees  54     6             1.337586  0.8331037  0.8603782
##   extratrees  54     7             1.337037  0.8328204  0.8615658
##   extratrees  54     8             1.336425  0.8333047  0.8592163
##   extratrees  54     9             1.335152  0.8337030  0.8584354
##   extratrees  54    10             1.334316  0.8338116  0.8578567
##   extratrees  56     1             1.337571  0.8323544  0.8653848
##   extratrees  56     2             1.335584  0.8331499  0.8624342
##   extratrees  56     3             1.340494  0.8317463  0.8645034
##   extratrees  56     4             1.336849  0.8328491  0.8621175
##   extratrees  56     5             1.336295  0.8330677  0.8623037
##   extratrees  56     6             1.335107  0.8332495  0.8619530
##   extratrees  56     7             1.334273  0.8335571  0.8603396
##   extratrees  56     8             1.334725  0.8336201  0.8595429
##   extratrees  56     9             1.330419  0.8345859  0.8578281
##   extratrees  56    10             1.329578  0.8347221  0.8573440
##   extratrees  58     1             1.335413  0.8327573  0.8634593
##   extratrees  58     2             1.339026  0.8322598  0.8646372
##   extratrees  58     3             1.335810  0.8330549  0.8623792
##   extratrees  58     4             1.335018  0.8329572  0.8626948
##   extratrees  58     5             1.332236  0.8337868  0.8612051
##   extratrees  58     6             1.334153  0.8334467  0.8612894
##   extratrees  58     7             1.333057  0.8336303  0.8612747
##   extratrees  58     8             1.331538  0.8339040  0.8585468
##   extratrees  58     9             1.328958  0.8345418  0.8580625
##   extratrees  58    10             1.330802  0.8344161  0.8585206
##   extratrees  60     1             1.334224  0.8329314  0.8635392
##   extratrees  60     2             1.334193  0.8331909  0.8635151
##   extratrees  60     3             1.336306  0.8327998  0.8643734
##   extratrees  60     4             1.332906  0.8336170  0.8626422
##   extratrees  60     5             1.331913  0.8337660  0.8606709
##   extratrees  60     6             1.333312  0.8335311  0.8605103
##   extratrees  60     7             1.330675  0.8340662  0.8585362
##   extratrees  60     8             1.328732  0.8346548  0.8589186
##   extratrees  60     9             1.328413  0.8349366  0.8573767
##   extratrees  60    10             1.325265  0.8355190  0.8565007
##   extratrees  62     1             1.334612  0.8328889  0.8635929
##   extratrees  62     2             1.335642  0.8326700  0.8639810
##   extratrees  62     3             1.332889  0.8333563  0.8623418
##   extratrees  62     4             1.335857  0.8327011  0.8640703
##   extratrees  62     5             1.331430  0.8339716  0.8618315
##   extratrees  62     6             1.329899  0.8341603  0.8606858
##   extratrees  62     7             1.328153  0.8347101  0.8584517
##   extratrees  62     8             1.328464  0.8344641  0.8596425
##   extratrees  62     9             1.326372  0.8351879  0.8570727
##   extratrees  62    10             1.324148  0.8357049  0.8556846
##   extratrees  64     1             1.332992  0.8331434  0.8635162
##   extratrees  64     2             1.331516  0.8333913  0.8625641
##   extratrees  64     3             1.334705  0.8329079  0.8643609
##   extratrees  64     4             1.331793  0.8333615  0.8627229
##   extratrees  64     5             1.328893  0.8341418  0.8611065
##   extratrees  64     6             1.326888  0.8347065  0.8609889
##   extratrees  64     7             1.326945  0.8348037  0.8584164
##   extratrees  64     8             1.328275  0.8345450  0.8590783
##   extratrees  64     9             1.323051  0.8357564  0.8562132
##   extratrees  64    10             1.323540  0.8357246  0.8573136
##   extratrees  66     1             1.331054  0.8332347  0.8631637
##   extratrees  66     2             1.330549  0.8337267  0.8643860
##   extratrees  66     3             1.334523  0.8327725  0.8641653
##   extratrees  66     4             1.330251  0.8340934  0.8625296
##   extratrees  66     5             1.327571  0.8345165  0.8616170
##   extratrees  66     6             1.326612  0.8346667  0.8606729
##   extratrees  66     7             1.327485  0.8348269  0.8588624
##   extratrees  66     8             1.322938  0.8357517  0.8565319
##   extratrees  66     9             1.323364  0.8357619  0.8571829
##   extratrees  66    10             1.320224  0.8363500  0.8553562
##   extratrees  68     1             1.333478  0.8329734  0.8658531
##   extratrees  68     2             1.332544  0.8332008  0.8624033
##   extratrees  68     3             1.331038  0.8334506  0.8628630
##   extratrees  68     4             1.329276  0.8339825  0.8626479
##   extratrees  68     5             1.327515  0.8341302  0.8616624
##   extratrees  68     6             1.328497  0.8342479  0.8616911
##   extratrees  68     7             1.324771  0.8351084  0.8583103
##   extratrees  68     8             1.325588  0.8349738  0.8595417
##   extratrees  68     9             1.322489  0.8357658  0.8567311
##   extratrees  68    10             1.320615  0.8362122  0.8551947
##   extratrees  70     1             1.332192  0.8330204  0.8652999
##   extratrees  70     2             1.332710  0.8331645  0.8658322
##   extratrees  70     3             1.332521  0.8329066  0.8641441
##   extratrees  70     4             1.328974  0.8338091  0.8625399
##   extratrees  70     5             1.325444  0.8347988  0.8609953
##   extratrees  70     6             1.326100  0.8346392  0.8601731
##   extratrees  70     7             1.325402  0.8348417  0.8592183
##   extratrees  70     8             1.322734  0.8354970  0.8581778
##   extratrees  70     9             1.320029  0.8363663  0.8556533
##   extratrees  70    10             1.319497  0.8363125  0.8562400
##   extratrees  72     1             1.330268  0.8334010  0.8628994
##   extratrees  72     2             1.329614  0.8335267  0.8637409
##   extratrees  72     3             1.328619  0.8338012  0.8621581
##   extratrees  72     4             1.325283  0.8348138  0.8597814
##   extratrees  72     5             1.327494  0.8342769  0.8617857
##   extratrees  72     6             1.327317  0.8342458  0.8615627
##   extratrees  72     7             1.323704  0.8352427  0.8596404
##   extratrees  72     8             1.323539  0.8352105  0.8584455
##   extratrees  72     9             1.317547  0.8365602  0.8551783
##   extratrees  72    10             1.319702  0.8362721  0.8554894
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 72, splitrule = extratrees
##  and min.node.size = 9.

Best parameters from tuning:

get_best_result(rf_reg_manual_tune$model)
##    splitrule mtry min.node.size     RMSE  Rsquared       MAE    RMSESD
## 1 extratrees   72             9 1.317547 0.8365602 0.8551783 0.3567946
##   RsquaredSD     MAESD
## 1 0.07302355 0.1312441

Plot the trained model:

plot(rf_reg_manual_tune$model)

Optimal value for each parameter:

  • mtry = 72
  • splitrule = extratrees
  • min.node.size = 9

Evaluation of model:

rf_reg_manual_tune$metrics
##           Model     RMSE  Rsquared      MAE
## 1 Random Forest 1.106174 0.8666637 0.771789

3. SVM with Polynomial Kernel

Support Vector Machines with Polynomial Kernel

  • method = ‘svmPoly’
  • Type: Regression, Classification
  • Tuning parameters:
    • degree (Polynomial Degree)
    • scale (Scale)
    • C (Cost)
  • Required packages: kernlab
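
To make the tuning parameters concrete, the sketch below evaluates a polynomial kernel of the form (scale * <x, y> + offset)^degree in base R. This is the form documented for kernlab's polydot kernel. The helper name poly_kernel and the offset of 1 are assumptions made for illustration and are not code from this report. The cost C does not appear in the kernel itself; it controls how strongly training errors are penalised.

```r
# Polynomial kernel: k(x, y) = (scale * <x, y> + offset)^degree
# poly_kernel is a hypothetical helper; offset = 1 is an assumed default.
poly_kernel <- function(x, y, degree = 1, scale = 1, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}

x <- c(1, 2)
y <- c(3, 4)  # inner product <x, y> = 1 * 3 + 2 * 4 = 11

# degree = 1 gives a kernel that is linear in the inner product
k_linear <- poly_kernel(x, y, degree = 1, scale = 0.1)

# degree = 2 squares the same shifted and scaled inner product
k_quad <- poly_kernel(x, y, degree = 2, scale = 0.1)

print(c(linear = k_linear, quadratic = k_quad))
```

With degree = 1 the kernel reduces to a scaled and shifted inner product, so the model behaves like a linear SVM.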

Training model with auto tuning:

  • We first tried tuneLength = 10, but training took too long to finish, so we used tuneLength = 5 for SVM.
svm_reg <- train_evaluate_reg("svmPoly", name = "Support Vector Machine", tuneLength = 5)
## Support Vector Machines with Polynomial Kernel 
## 
## 520 samples
##  32 predictor
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     RMSE      Rsquared   MAE      
##   1       1e-03  0.25  2.549829  0.6456328  1.7925306
##   1       1e-03  0.50  2.211695  0.7116538  1.5148416
##   1       1e-03  1.00  1.850422  0.7721718  1.2223146
##   1       1e-03  2.00  1.576803  0.8051574  0.9973955
##   1       1e-03  4.00  1.463469  0.8158565  0.9114372
##   1       1e-02  0.25  1.531635  0.8096725  0.9628491
##   1       1e-02  0.50  1.436807  0.8189442  0.8921404
##   1       1e-02  1.00  1.375945  0.8266689  0.8589232
##   1       1e-02  2.00  1.340942  0.8320214  0.8390972
##   1       1e-02  4.00  1.329444  0.8330228  0.8348422
##   1       1e-01  0.25  1.336747  0.8324161  0.8375515
##   1       1e-01  0.50  1.328011  0.8330293  0.8360650
##   1       1e-01  1.00  1.325745  0.8328028  0.8388928
##   1       1e-01  2.00  1.324562  0.8327915  0.8404564
##   1       1e-01  4.00  1.324348  0.8326577  0.8413139
##   1       1e+00  0.25  1.324293  0.8327982  0.8403833
##   1       1e+00  0.50  1.324081  0.8327034  0.8414354
##   1       1e+00  1.00  1.324012  0.8326233  0.8416821
##   1       1e+00  2.00  1.324193  0.8325544  0.8418343
##   1       1e+00  4.00  1.324190  0.8325819  0.8421801
##   1       1e+01  0.25  1.323996  0.8326103  0.8417843
##   1       1e+01  0.50  1.324120  0.8325511  0.8420901
##   1       1e+01  1.00  1.324485  0.8324661  0.8425282
##   1       1e+01  2.00  1.324095  0.8325926  0.8430595
##   1       1e+01  4.00  1.332690  0.8302568  0.8495567
##   2       1e-03  0.25  2.211160  0.7113696  1.5143892
##   2       1e-03  0.50  1.850651  0.7717793  1.2224658
##   2       1e-03  1.00  1.577994  0.8046823  0.9982192
##   2       1e-03  2.00  1.463282  0.8159675  0.9108371
##   2       1e-03  4.00  1.389879  0.8247852  0.8645875
##   2       1e-02  0.25  1.447221  0.8147122  0.9076882
##   2       1e-02  0.50  1.386995  0.8222164  0.8797818
##   2       1e-02  1.00  1.373413  0.8216364  0.8873369
##   2       1e-02  2.00  1.428855  0.8068908  0.9522472
##   2       1e-02  4.00  1.509366  0.7881941  1.0228407
##   2       1e-01  0.25  1.689545  0.7337991  1.1839978
##   2       1e-01  0.50  1.689545  0.7337991  1.1839978
##   2       1e-01  1.00  1.689545  0.7337991  1.1839978
##   2       1e-01  2.00  1.689545  0.7337991  1.1839978
##   2       1e-01  4.00  1.689545  0.7337991  1.1839978
##   2       1e+00  0.25  2.424154  0.4529740  1.8054147
##   2       1e+00  0.50  2.424154  0.4529740  1.8054147
##   2       1e+00  1.00  2.424154  0.4529740  1.8054147
##   2       1e+00  2.00  2.424154  0.4529740  1.8054147
##   2       1e+00  4.00  2.424154  0.4529740  1.8054147
##   2       1e+01  0.25  2.883325  0.2572559  2.1827675
##   2       1e+01  0.50  2.883325  0.2572559  2.1827675
##   2       1e+01  1.00  2.883325  0.2572559  2.1827675
##   2       1e+01  2.00  2.883325  0.2572559  2.1827675
##   2       1e+01  4.00  2.883325  0.2572559  2.1827675
##   3       1e-03  0.25  1.992879  0.7507860  1.3394679
##   3       1e-03  0.50  1.673788  0.7936734  1.0793318
##   3       1e-03  1.00  1.503178  0.8121150  0.9412749
##   3       1e-03  2.00  1.416378  0.8211781  0.8799348
##   3       1e-03  4.00  1.355783  0.8297120  0.8470772
##   3       1e-02  0.25  1.426639  0.8153584  0.9143771
##   3       1e-02  0.50  1.436434  0.8069174  0.9435222
##   3       1e-02  1.00  1.479660  0.7942536  0.9926754
##   3       1e-02  2.00  1.502743  0.7879333  1.0192699
##   3       1e-02  4.00  1.504598  0.7874536  1.0209965
##   3       1e-01  0.25  1.914433  0.7043706  1.3246757
##   3       1e-01  0.50  1.914433  0.7043706  1.3246757
##   3       1e-01  1.00  1.914433  0.7043706  1.3246757
##   3       1e-01  2.00  1.914433  0.7043706  1.3246757
##   3       1e-01  4.00  1.914433  0.7043706  1.3246757
##   3       1e+00  0.25  2.054610  0.6713533  1.4354513
##   3       1e+00  0.50  2.054610  0.6713533  1.4354513
##   3       1e+00  1.00  2.054610  0.6713533  1.4354513
##   3       1e+00  2.00  2.054610  0.6713533  1.4354513
##   3       1e+00  4.00  2.054610  0.6713533  1.4354513
##   3       1e+01  0.25  2.049759  0.6723365  1.4325597
##   3       1e+01  0.50  2.049759  0.6723365  1.4325597
##   3       1e+01  1.00  2.049759  0.6723365  1.4325597
##   3       1e+01  2.00  2.049759  0.6723365  1.4325597
##   3       1e+01  4.00  2.049759  0.6723365  1.4325597
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 1, scale = 10 and C = 0.25.
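
The grid in the output above has 3 degrees, 5 scales, and 5 costs. The sketch below rebuilds it from the values visible in the table and counts the model fits implied by 10-fold cross-validation repeated 10 times. The helper make_svm_poly_grid is hypothetical, and the grid for tuneLength = 10 assumes the same value pattern simply extends, which we have not checked against caret itself.

```r
# Rebuild the tuning grid from the values printed in the table above.
# make_svm_poly_grid is a hypothetical helper, not a caret function.
make_svm_poly_grid <- function(len) {
  expand.grid(
    degree = seq_len(min(len, 3)),   # 1, 2, 3
    scale  = 10^(seq_len(len) - 4),  # 1e-03 ... 1e+01 when len = 5
    C      = 2^(seq_len(len) - 3)    # 0.25 ... 4 when len = 5
  )
}

grid_5  <- make_svm_poly_grid(5)
grid_10 <- make_svm_poly_grid(10)  # assumed extrapolation of the pattern

# 10-fold cross-validation repeated 10 times gives 100 resamples per candidate
resamples <- 10 * 10
fits_5  <- nrow(grid_5) * resamples
fits_10 <- nrow(grid_10) * resamples

print(c(candidates_5 = nrow(grid_5), fits_5 = fits_5, fits_10 = fits_10))
```

Under these assumptions, doubling tuneLength roughly quadruples the number of fits, which is consistent with the long running time we observed.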

Best parameters from tuning:

get_best_result(svm_reg$model)
##   degree scale    C     RMSE  Rsquared       MAE    RMSESD RsquaredSD     MAESD
## 1      1    10 0.25 1.323996 0.8326103 0.8417843 0.3969325 0.08127431 0.1390743

Plot the trained model:

plot(svm_reg$model)

  • The selected values are degree (Polynomial Degree) = 1, scale = 10, and C (Cost) = 0.25. Note that scale = 10 is the largest value in the grid.

Evaluation of model:

svm_reg$metrics
##                    Model     RMSE  Rsquared       MAE
## 1 Support Vector Machine 1.177426 0.8521626 0.7618882

4. XGBoost

  • method = ‘xgbLinear’
  • Type: Regression, Classification
  • Tuning parameters:
    • nrounds (#Boosting Iterations)
    • lambda (L2 Regularization)
    • alpha (L1 Regularization)
    • eta (Learning Rate)
  • Required packages: xgboost
  • A model-specific variable importance metric is available.

Training model with auto tuning:

xgb_reg <- train_evaluate_reg("xgbLinear", name = "XGBoost", tuneLength = 3)
## eXtreme Gradient Boosting 
## 
## 520 samples
##  32 predictor
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ... 
## Resampling results across tuning parameters:
## 
##   lambda  alpha  nrounds  RMSE      Rsquared   MAE      
##   0e+00   0e+00   50      1.511613  0.7918378  0.9393092
##   0e+00   0e+00  100      1.512005  0.7917418  0.9397081
##   0e+00   0e+00  150      1.512005  0.7917418  0.9397081
##   0e+00   1e-04   50      1.505989  0.7930916  0.9383068
##   0e+00   1e-04  100      1.506339  0.7930240  0.9387463
##   0e+00   1e-04  150      1.506339  0.7930240  0.9387465
##   0e+00   1e-01   50      1.510758  0.7910783  0.9447155
##   0e+00   1e-01  100      1.510970  0.7910251  0.9449051
##   0e+00   1e-01  150      1.510970  0.7910251  0.9449051
##   1e-04   0e+00   50      1.506888  0.7926763  0.9373235
##   1e-04   0e+00  100      1.507267  0.7925891  0.9377648
##   1e-04   0e+00  150      1.507267  0.7925891  0.9377648
##   1e-04   1e-04   50      1.508816  0.7921554  0.9391927
##   1e-04   1e-04  100      1.509250  0.7920720  0.9397714
##   1e-04   1e-04  150      1.509250  0.7920720  0.9397714
##   1e-04   1e-01   50      1.507130  0.7922016  0.9429704
##   1e-04   1e-01  100      1.507336  0.7921494  0.9432016
##   1e-04   1e-01  150      1.507336  0.7921494  0.9432016
##   1e-01   0e+00   50      1.499107  0.7932489  0.9410474
##   1e-01   0e+00  100      1.499690  0.7931013  0.9416606
##   1e-01   0e+00  150      1.499690  0.7931014  0.9416605
##   1e-01   1e-04   50      1.499257  0.7930752  0.9417873
##   1e-01   1e-04  100      1.500019  0.7928812  0.9425275
##   1e-01   1e-04  150      1.500019  0.7928811  0.9425286
##   1e-01   1e-01   50      1.491578  0.7960916  0.9394836
##   1e-01   1e-01  100      1.492072  0.7959774  0.9399642
##   1e-01   1e-01  150      1.492072  0.7959774  0.9399642
## 
## Tuning parameter 'eta' was held constant at a value of 0.3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 50, lambda = 0.1, alpha
##  = 0.1 and eta = 0.3.

Best parameters from tuning:

get_best_result(xgb_reg$model)
##   lambda alpha nrounds eta     RMSE  Rsquared       MAE    RMSESD RsquaredSD
## 1    0.1   0.1      50 0.3 1.491578 0.7960916 0.9394836 0.3821073 0.09009631
##       MAESD
## 1 0.1550427

Plot the trained model:

plot(xgb_reg$model)

Training model with manual tuning:

tuneGrid_xgb_reg <- expand.grid(
  eta = 0.3,
  lambda = 0.1,
  alpha = c(0.001, 0.01, 0.1),
  nrounds = seq(10, 70, 10)
)

xgb_reg_manual_tune <- train_evaluate_reg("xgbLinear", name = "XGBoost", tuneGrid = tuneGrid_xgb_reg)
## eXtreme Gradient Boosting 
## 
## 520 samples
##  32 predictor
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ... 
## Resampling results across tuning parameters:
## 
##   alpha  nrounds  RMSE      Rsquared   MAE      
##   0.001  10       1.518110  0.7981275  0.9601702
##   0.001  20       1.490676  0.7950565  0.9326202
##   0.001  30       1.496577  0.7936059  0.9404913
##   0.001  40       1.498049  0.7932848  0.9418572
##   0.001  50       1.499322  0.7929895  0.9431174
##   0.001  60       1.499623  0.7929325  0.9435549
##   0.001  70       1.499770  0.7929032  0.9436339
##   0.010  10       1.518822  0.7984699  0.9556418
##   0.010  20       1.487909  0.7964229  0.9250986
##   0.010  30       1.494958  0.7947124  0.9316516
##   0.010  40       1.497355  0.7941440  0.9342274
##   0.010  50       1.498089  0.7939856  0.9350578
##   0.010  60       1.498446  0.7938916  0.9354407
##   0.010  70       1.498518  0.7938762  0.9355422
##   0.100  10       1.515027  0.8002019  0.9603495
##   0.100  20       1.484927  0.7975879  0.9298917
##   0.100  30       1.489417  0.7965265  0.9365004
##   0.100  40       1.490933  0.7962359  0.9383227
##   0.100  50       1.491578  0.7960916  0.9394836
##   0.100  60       1.492002  0.7959988  0.9399262
##   0.100  70       1.492072  0.7959768  0.9399641
## 
## Tuning parameter 'lambda' was held constant at a value of 0.1
## Tuning
##  parameter 'eta' was held constant at a value of 0.3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 20, lambda = 0.1, alpha
##  = 0.1 and eta = 0.3.

Best parameters from tuning:

get_best_result(xgb_reg_manual_tune$model)
##   eta lambda alpha nrounds     RMSE  Rsquared       MAE    RMSESD RsquaredSD
## 1 0.3    0.1   0.1      20 1.484927 0.7975879 0.9298917 0.3860466 0.09051595
##       MAESD
## 1 0.1561429

Plot the trained model:

plot(xgb_reg_manual_tune$model)

Evaluation of model:

xgb_reg_manual_tune$metrics
##     Model     RMSE  Rsquared       MAE
## 1 XGBoost 1.250256 0.8338254 0.8697652

5. Linear Regression with Regularization

  • method = ‘glmnet’
  • Type: Regression, Classification
  • Tuning parameters:
    • alpha (Mixing Percentage)
    • lambda (Regularization Parameter)
  • Required packages: glmnet, Matrix
  • A model-specific variable importance metric is available.
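
As a rough illustration of how the two tuning parameters interact, the sketch below computes the elastic-net penalty in the form given in the glmnet documentation: lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))). The function name elastic_net_penalty and the coefficient vector are made up for illustration.

```r
# Elastic-net penalty; elastic_net_penalty is a hypothetical helper.
# alpha = 0 gives a pure ridge (L2) penalty, alpha = 1 a pure lasso (L1) penalty.
elastic_net_penalty <- function(beta, alpha, lambda) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -1.0, 0.0)  # illustrative coefficients, not from our model

ridge <- elastic_net_penalty(beta, alpha = 0, lambda = 0.15)
lasso <- elastic_net_penalty(beta, alpha = 1, lambda = 0.15)

print(c(ridge = ridge, lasso = lasso))
```

A larger lambda increases the penalty for any alpha, while alpha only shifts the balance between the L1 and L2 terms.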

Training model with auto tuning:

lr_reg <- train_evaluate_reg("glmnet", name = "Linear Regression", tuneLength = 10)
## glmnet 
## 
## 520 samples
##  32 predictor
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda       RMSE      Rsquared   MAE      
##   0.1    0.001388258  1.395236  0.8165417  0.9295931
##   0.1    0.003207056  1.395236  0.8165417  0.9295931
##   0.1    0.007408715  1.395236  0.8165417  0.9295931
##   0.1    0.017115092  1.391001  0.8174735  0.9254500
##   0.1    0.039538082  1.378930  0.8201344  0.9135626
##   0.1    0.091338097  1.359258  0.8246227  0.8937805
##   0.1    0.211002851  1.336974  0.8305641  0.8640433
##   0.1    0.487443953  1.334994  0.8354640  0.8454409
##   0.1    1.126058753  1.416341  0.8334863  0.9057614
##   0.1    2.601341770  1.661772  0.8275410  1.0899052
##   0.2    0.001388258  1.394987  0.8166643  0.9296277
##   0.2    0.003207056  1.394987  0.8166643  0.9296277
##   0.2    0.007408715  1.393455  0.8170050  0.9281990
##   0.2    0.017115092  1.383181  0.8193159  0.9185476
##   0.2    0.039538082  1.364119  0.8236290  0.9006899
##   0.2    0.091338097  1.333282  0.8307732  0.8684140
##   0.2    0.211002851  1.303239  0.8388678  0.8293382
##   0.2    0.487443953  1.317026  0.8419026  0.8219884
##   0.2    1.126058753  1.430309  0.8390100  0.8952581
##   0.2    2.601341770  1.753044  0.8379850  1.1504334
##   0.3    0.001388258  1.394869  0.8167031  0.9296248
##   0.3    0.003207056  1.394869  0.8167031  0.9296248
##   0.3    0.007408715  1.389865  0.8178422  0.9250100
##   0.3    0.017115092  1.376191  0.8209509  0.9124625
##   0.3    0.039538082  1.350549  0.8267956  0.8883604
##   0.3    0.091338097  1.313632  0.8353737  0.8489229
##   0.3    0.211002851  1.285210  0.8434454  0.8122952
##   0.3    0.487443953  1.315619  0.8441240  0.8123446
##   0.3    1.126058753  1.447786  0.8432459  0.9018030
##   0.3    2.601341770  1.857464  0.8404136  1.2383812
##   0.4    0.001388258  1.394780  0.8167352  0.9296183
##   0.4    0.003207056  1.394619  0.8167700  0.9294568
##   0.4    0.007408715  1.386468  0.8186380  0.9220031
##   0.4    0.017115092  1.369658  0.8224754  0.9067854
##   0.4    0.039538082  1.338745  0.8295142  0.8769667
##   0.4    0.091338097  1.298348  0.8389306  0.8344893
##   0.4    0.211002851  1.276474  0.8458366  0.8036659
##   0.4    0.487443953  1.310285  0.8472172  0.8041133
##   0.4    1.126058753  1.473854  0.8455473  0.9206746
##   0.4    2.601341770  1.969191  0.8433381  1.3368708
##   0.5    0.001388258  1.394752  0.8167549  0.9295630
##   0.5    0.003207056  1.393504  0.8170373  0.9284576
##   0.5    0.007408715  1.383264  0.8193875  0.9191885
##   0.5    0.017115092  1.363393  0.8239354  0.9013340
##   0.5    0.039538082  1.328501  0.8318604  0.8668626
##   0.5    0.091338097  1.286959  0.8415481  0.8244149
##   0.5    0.211002851  1.272713  0.8470640  0.7997977
##   0.5    0.487443953  1.309547  0.8491135  0.8013188
##   0.5    1.126058753  1.501877  0.8478301  0.9448245
##   0.5    2.601341770  2.095044  0.8468928  1.4479330
##   0.6    0.001388258  1.394769  0.8167525  0.9296063
##   0.6    0.003207056  1.391976  0.8173928  0.9271044
##   0.6    0.007408715  1.380195  0.8201043  0.9165016
##   0.6    0.017115092  1.357236  0.8253618  0.8958307
##   0.6    0.039538082  1.319490  0.8339241  0.8581293
##   0.6    0.091338097  1.278487  0.8434941  0.8170438
##   0.6    0.211002851  1.269765  0.8480685  0.7968888
##   0.6    0.487443953  1.310810  0.8505788  0.8024391
##   0.6    1.126058753  1.534193  0.8497732  0.9727712
##   0.6    2.601341770  2.235487  0.8506684  1.5736649
##   0.7    0.001388258  1.394891  0.8167302  0.9297422
##   0.7    0.003207056  1.390445  0.8177502  0.9257484
##   0.7    0.007408715  1.377189  0.8208053  0.9139016
##   0.7    0.017115092  1.351399  0.8267086  0.8903188
##   0.7    0.039538082  1.311385  0.8357719  0.8505177
##   0.7    0.091338097  1.272503  0.8449121  0.8120996
##   0.7    0.211002851  1.265901  0.8492333  0.7939290
##   0.7    0.487443953  1.312701  0.8518868  0.8058587
##   0.7    1.126058753  1.569746  0.8514890  1.0052265
##   0.7    2.601341770  2.391487  0.8517634  1.7114318
##   0.8    0.001388258  1.394901  0.8167291  0.9297347
##   0.8    0.003207056  1.388944  0.8181009  0.9244193
##   0.8    0.007408715  1.374259  0.8214892  0.9113534
##   0.8    0.017115092  1.345954  0.8279555  0.8850432
##   0.8    0.039538082  1.304099  0.8374251  0.8439686
##   0.8    0.091338097  1.268101  0.8459609  0.8084763
##   0.8    0.211002851  1.262232  0.8503046  0.7919023
##   0.8    0.487443953  1.317599  0.8523296  0.8114253
##   0.8    1.126058753  1.608952  0.8522654  1.0414983
##   0.8    2.601341770  2.544513  0.8492426  1.8444158
##   0.9    0.001388258  1.394882  0.8167345  0.9297440
##   0.9    0.003207056  1.387486  0.8184429  0.9231338
##   0.9    0.007408715  1.371428  0.8221452  0.9088785
##   0.9    0.017115092  1.340883  0.8291087  0.8800074
##   0.9    0.039538082  1.297787  0.8388634  0.8384294
##   0.9    0.091338097  1.265039  0.8467000  0.8061509
##   0.9    0.211002851  1.260593  0.8508655  0.7918476
##   0.9    0.487443953  1.324513  0.8520985  0.8183692
##   0.9    1.126058753  1.652316  0.8503275  1.0865619
##   0.9    2.601341770  2.704508  0.8492426  1.9770810
##   1.0    0.001388258  1.394612  0.8167989  0.9295231
##   1.0    0.003207056  1.386058  0.8187784  0.9218740
##   1.0    0.007408715  1.368660  0.8227886  0.9064881
##   1.0    0.017115092  1.336081  0.8301996  0.8752796
##   1.0    0.039538082  1.292377  0.8400863  0.8339090
##   1.0    0.091338097  1.263038  0.8471983  0.8052305
##   1.0    0.211002851  1.260452  0.8510329  0.7928667
##   1.0    0.487443953  1.333989  0.8508628  0.8276583
##   1.0    1.126058753  1.686923  0.8492426  1.1216198
##   1.0    2.601341770  2.893369  0.8492426  2.1324632
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2110029.

Best parameters from tuning:

get_best_result(lr_reg$model)
##   alpha    lambda     RMSE  Rsquared       MAE    RMSESD RsquaredSD     MAESD
## 1     1 0.2110029 1.260452 0.8510329 0.7928667 0.4019447 0.07640349 0.1336843

Plot the trained model:

plot(lr_reg$model)

  • alpha (Mixing Percentage) = 1 is the selected value, while a lambda (Regularization Parameter) below 0.2110029 may give a better result.

Training model with manual tuning:

  • alpha = 1
  • lambda from 0.00 to 0.30 in steps of 0.05
tuneGrid_lr_reg <- expand.grid(
  alpha = c(1),
  lambda = seq(0.00, 0.30, by = 0.05)
)

lr_reg_manual_tune <- train_evaluate_reg("glmnet", name = "Linear Regression", tuneGrid = tuneGrid_lr_reg)
## glmnet 
## 
## 520 samples
##  32 predictor
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.00    1.394938  0.8167240  0.9298049
##   0.05    1.281021  0.8426330  0.8235131
##   0.10    1.261976  0.8476099  0.8039718
##   0.15    1.258282  0.8496377  0.7969222
##   0.20    1.259368  0.8508915  0.7932318
##   0.25    1.265253  0.8514903  0.7922721
##   0.30    1.274531  0.8517463  0.7938045
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.15.

Best parameters from tuning:

get_best_result(lr_reg_manual_tune$model)
##   alpha lambda     RMSE  Rsquared       MAE   RMSESD RsquaredSD     MAESD
## 1     1   0.15 1.258282 0.8496377 0.7969222 0.398786 0.07721303 0.1329245

Plot the trained model:

plot(lr_reg_manual_tune$model)

  • The selected values are alpha (Mixing Percentage) = 1 and lambda (Regularization Parameter) = 0.15.

Evaluation of model:

lr_reg_manual_tune$metrics
##               Model     RMSE  Rsquared       MAE
## 1 Linear Regression 1.117378 0.8643585 0.7226734
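
The three test-set metrics reported for each regression model can be sketched in base R as below. We assume here that Rsquared is the squared correlation between observed and predicted values, which is how caret's postResample() reports it by default. The function regression_metrics and the toy vectors are illustrative only.

```r
# regression_metrics is a hypothetical helper mirroring the three reported metrics.
regression_metrics <- function(obs, pred) {
  c(
    RMSE     = sqrt(mean((obs - pred)^2)),  # root mean squared error
    Rsquared = cor(obs, pred)^2,            # squared correlation (assumed definition)
    MAE      = mean(abs(obs - pred))        # mean absolute error
  )
}

# Toy grades: every prediction is off by exactly one point
obs  <- c(10, 12, 8, 15)
pred <- c(11, 11, 9, 14)

m <- regression_metrics(obs, pred)
print(m)
```

RMSE penalises large errors more heavily than MAE, so the two are equal only when all errors have the same magnitude, as in this toy example.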

Classification

Define a helper function to train model.

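train_model_clf <- function(method = "",
                            data = train_set_clf,
                            tuneGrid = NULL,
                            tuneLength = 3) {
  # Fix the seed so the resampling folds are reproducible across models
  set.seed(42)

  # Fit `result` on all predictors; tuneLength only applies when tuneGrid is NULL
  fit <- train(
    result ~ .,
    data = data,
    method = method,
    trControl = fit_control,
    preProcess = c("center", "scale"),
    tuneGrid = tuneGrid,
    tuneLength = tuneLength
  )

  # Show the resampling results for each candidate parameter set
  print(fit)

  return(fit)
}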

Define a helper function to evaluate model.

evaluate_model_clf <- function(model = NULL, name = "") {
  # Predict classes for the held-out test set
  pred <- predict(model, test_set_clf)

  # Confusion matrix with the default positive class (the first level, "fail")
  cm_fail <- confusionMatrix(pred, test_set_clf$result, mode = "prec_recall")
  print(cm_fail)

  # Same predictions, scored with "pass" as the positive class
  cm_pass <- confusionMatrix(
    pred,
    test_set_clf$result,
    mode = "prec_recall",
    positive = "pass"
  )

  # One-row summary with accuracy and per-class precision, recall, and F1
  return(data.frame(
    Model = name,
    Accuracy = round(cm_fail$overall[["Accuracy"]], 3),
    Precision.Fail = round(cm_fail$byClass[["Precision"]], 3),
    Recall.Fail = round(cm_fail$byClass[["Recall"]], 3),
    F1.Fail = round(cm_fail$byClass[["F1"]], 3),
    Precision.Pass = round(cm_pass$byClass[["Precision"]], 3),
    Recall.Pass = round(cm_pass$byClass[["Recall"]], 3),
    F1.Pass = round(cm_pass$byClass[["F1"]], 3)
  ))
}

1. Decision Tree

  • method = ‘rpart2’
  • Type: Regression, Classification
  • Tuning parameters:
    • maxdepth (Max Tree Depth)
  • Required packages: rpart
  • A model-specific variable importance metric is available.
Original Train Set
dt_clf <- train_model_clf("rpart2")
## note: only 2 possible values of the max tree depth from the initial fit.
##  Truncating the grid to 2 .
## 
## CART 
## 
## 520 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##   1         0.9365385  0.7199437
##   4         0.9286538  0.7122307
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 1.
plot(dt_clf)

No further tuning required.

dt_metrics_clf <- evaluate_model_clf(dt_clf, "Decision Tree")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   13    1
##       pass    7  108
##                                           
##                Accuracy : 0.938           
##                  95% CI : (0.8815, 0.9728)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.001075        
##                                           
##                   Kappa : 0.7303          
##                                           
##  Mcnemar's Test P-Value : 0.077100        
##                                           
##               Precision : 0.9286          
##                  Recall : 0.6500          
##                      F1 : 0.7647          
##              Prevalence : 0.1550          
##          Detection Rate : 0.1008          
##    Detection Prevalence : 0.1085          
##       Balanced Accuracy : 0.8204          
##                                           
##        'Positive' Class : fail            
## 
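
The statistics in the output above follow directly from the four confusion matrix counts. The base R sketch below recomputes accuracy, kappa, precision, recall, and F1 for the "fail" class from those counts. The variable names are our own, and the formulas are the standard definitions rather than caret's internal code.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(
  c(13, 7, 1, 108),
  nrow = 2,
  dimnames = list(Prediction = c("fail", "pass"), Reference = c("fail", "pass"))
)

tp <- cm["fail", "fail"]  # predicted fail, actually fail
fp <- cm["fail", "pass"]  # predicted fail, actually pass
fn <- cm["pass", "fail"]  # predicted pass, actually fail

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Cohen's kappa compares observed accuracy with the agreement expected by chance
accuracy <- sum(diag(cm)) / sum(cm)
expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa    <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Kappa = kappa,
              Precision = precision, Recall = recall, F1 = f1), 4))
```

Recall for "fail" is much lower than precision here because 7 of the 20 failing students in the test set were predicted as "pass".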
Down-Sampled Train Set
down_dt_clf <- train_model_clf("rpart2", train_set_clf_down)
## note: only 2 possible values of the max tree depth from the initial fit.
##  Truncating the grid to 2 .
## 
## CART 
## 
## 160 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy  Kappa  
##   1         0.838125  0.67625
##   2         0.823750  0.64750
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 1.
plot(down_dt_clf)

No further tuning required.

down_dt_metrics_clf <- evaluate_model_clf(down_dt_clf, "Decision Tree")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   17   15
##       pass    3   94
##                                           
##                Accuracy : 0.8605          
##                  95% CI : (0.7885, 0.9152)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.366920        
##                                           
##                   Kappa : 0.5722          
##                                           
##  Mcnemar's Test P-Value : 0.009522        
##                                           
##               Precision : 0.5312          
##                  Recall : 0.8500          
##                      F1 : 0.6538          
##              Prevalence : 0.1550          
##          Detection Rate : 0.1318          
##    Detection Prevalence : 0.2481          
##       Balanced Accuracy : 0.8562          
##                                           
##        'Positive' Class : fail            
## 
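
The sample sizes of the three train sets (520, 160, and 880) are consistent with 80 "fail" and 440 "pass" rows in the original train set. The sketch below shows, under that assumption and on a toy data frame, how down-sampling and up-sampling balance the classes. It is written in base R for illustration and is not the code used to build train_set_clf_down or train_set_clf_up.

```r
set.seed(42)

# Toy stand-in for the train set: 80 "fail" and 440 "pass" rows (assumed split)
toy <- data.frame(
  id = 1:520,
  result = factor(c(rep("fail", 80), rep("pass", 440)), levels = c("fail", "pass"))
)

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$result)
n_min <- min(lengths(idx_by_class))
n_max <- max(lengths(idx_by_class))

# Down-sampling: keep a random subset of each class, sized to the smallest class
down_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), n_min)]
}))

# Up-sampling: resample smaller classes with replacement up to the largest class
up_idx <- unlist(lapply(idx_by_class, function(i) {
  if (length(i) == n_max) i else i[sample.int(length(i), n_max, replace = TRUE)]
}))

toy_down <- toy[down_idx, ]
toy_up   <- toy[up_idx, ]

print(table(toy_down$result))
print(table(toy_up$result))
```

Down-sampling discards most of the "pass" rows, while up-sampling repeats "fail" rows, so the up-sampled set contains duplicated observations.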
Up-Sampled Train Set
up_dt_clf <- train_model_clf("rpart2", train_set_clf_up)
## CART 
## 
## 880 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##   1         0.8892045  0.7784091
##   3         0.9173864  0.8347727
##   4         0.9296591  0.8593182
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 4.
plot(up_dt_clf)

Perform hyperparameter tuning using tuneLength.

up_dt_clf <- train_model_clf("rpart2", train_set_clf_up, tuneLength = 10)
## note: only 7 possible values of the max tree depth from the initial fit.
##  Truncating the grid to 7 .
## 
## CART 
## 
## 880 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##    1        0.8892045  0.7784091
##    3        0.9173864  0.8347727
##    4        0.9238636  0.8477273
##    7        0.9294318  0.8588636
##    8        0.9294318  0.8588636
##   11        0.9294318  0.8588636
##   12        0.9294318  0.8588636
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 7.
plot(up_dt_clf)

Tuning done.

up_dt_metrics_clf <- evaluate_model_clf(up_dt_clf, "Decision Tree")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   16   11
##       pass    4   98
##                                           
##                Accuracy : 0.8837          
##                  95% CI : (0.8155, 0.9334)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.1352          
##                                           
##                   Kappa : 0.6117          
##                                           
##  Mcnemar's Test P-Value : 0.1213          
##                                           
##               Precision : 0.5926          
##                  Recall : 0.8000          
##                      F1 : 0.6809          
##              Prevalence : 0.1550          
##          Detection Rate : 0.1240          
##    Detection Prevalence : 0.2093          
##       Balanced Accuracy : 0.8495          
##                                           
##        'Positive' Class : fail            
## 

2. Random Forest

  • method = ‘ranger’
  • Type: Regression, Classification
  • Tuning parameters:
    • mtry (#Randomly Selected Predictors)
    • splitrule (Splitting Rule)
    • min.node.size (Minimal Node Size)
  • Required packages: e1071, ranger, dplyr
  • A model-specific variable importance metric is available.
Original Train Set
rf_clf <- train_model_clf("ranger")
## Random Forest 
## 
## 520 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##    2    gini        0.8744231  0.2633357
##    2    extratrees  0.8594231  0.1295123
##   37    gini        0.9307692  0.7051503
##   37    extratrees  0.9275000  0.6794896
##   72    gini        0.9313462  0.7069938
##   72    extratrees  0.9350000  0.7230234
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 72, splitrule = extratrees
##  and min.node.size = 1.
plot(rf_clf)

Perform hyperparameter tuning using tuneLength.

rf_clf <- train_model_clf("ranger", tuneLength = 7)
## Random Forest 
## 
## 520 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##    2    gini        0.8753846  0.2723578
##    2    extratrees  0.8582692  0.1202263
##   13    gini        0.9336538  0.7118338
##   13    extratrees  0.9017308  0.5067021
##   25    gini        0.9311538  0.7082097
##   25    extratrees  0.9217308  0.6373321
##   37    gini        0.9313462  0.7085045
##   37    extratrees  0.9257692  0.6709289
##   48    gini        0.9296154  0.7010125
##   48    extratrees  0.9300000  0.6973189
##   60    gini        0.9315385  0.7095738
##   60    extratrees  0.9351923  0.7227439
##   72    gini        0.9311538  0.7077921
##   72    extratrees  0.9344231  0.7236122
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 60, splitrule = extratrees
##  and min.node.size = 1.
plot(rf_clf)

Tuning done.

rf_metrics_clf <- evaluate_model_clf(rf_clf, "Random Forest")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   14    4
##       pass    6  105
##                                           
##                Accuracy : 0.9225          
##                  95% CI : (0.8621, 0.9622)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.006716        
##                                           
##                   Kappa : 0.6915          
##                                           
##  Mcnemar's Test P-Value : 0.751830        
##                                           
##               Precision : 0.7778          
##                  Recall : 0.7000          
##                      F1 : 0.7368          
##              Prevalence : 0.1550          
##          Detection Rate : 0.1085          
##    Detection Prevalence : 0.1395          
##       Balanced Accuracy : 0.8317          
##                                           
##        'Positive' Class : fail            
## 
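
The summary statistics above follow directly from the four cells of the confusion matrix, with 'fail' as the positive class. As a minimal sketch in base R, the counts can be entered by hand and the metrics recomputed. The variable names are illustrative and not part of the original analysis.

```r
# Confusion matrix from the output above: rows = prediction, columns = reference.
cm <- matrix(c(14, 6, 4, 105), nrow = 2,
             dimnames = list(Prediction = c("fail", "pass"),
                             Reference  = c("fail", "pass")))

tp <- cm["fail", "fail"]  # failing students correctly flagged
fp <- cm["fail", "pass"]  # passing students wrongly flagged as fail
fn <- cm["pass", "fail"]  # failing students missed
tn <- cm["pass", "pass"]  # passing students correctly identified

accuracy    <- (tp + tn) / sum(cm)
precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1          <- 2 * precision * recall / (precision + recall)
balanced    <- (recall + specificity) / 2

round(c(accuracy = accuracy, precision = precision, recall = recall,
        f1 = f1, balanced_accuracy = balanced), 4)
```

The rounded values match the reported Accuracy (0.9225), Precision (0.7778), Recall (0.7000), F1 (0.7368) and Balanced Accuracy (0.8317).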
Down-Sampled Train Set
down_rf_clf <- train_model_clf("ranger", train_set_clf_down)
## Random Forest 
## 
## 160 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy  Kappa  
##    2    gini        0.795625  0.59125
##    2    extratrees  0.757500  0.51500
##   37    gini        0.879375  0.75875
##   37    extratrees  0.846875  0.69375
##   72    gini        0.880000  0.76000
##   72    extratrees  0.868125  0.73625
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 72, splitrule = gini
##  and min.node.size = 1.
plot(down_rf_clf)

Perform hyperparameter tuning using tuneLength.

down_rf_clf <- train_model_clf("ranger", train_set_clf_down, tuneLength = 7)
## Random Forest 
## 
## 160 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy  Kappa  
##    2    gini        0.801875  0.60375
##    2    extratrees  0.746875  0.49375
##   13    gini        0.876250  0.75250
##   13    extratrees  0.814375  0.62875
##   25    gini        0.891250  0.78250
##   25    extratrees  0.843750  0.68750
##   37    gini        0.886250  0.77250
##   37    extratrees  0.853750  0.70750
##   48    gini        0.880000  0.76000
##   48    extratrees  0.854375  0.70875
##   60    gini        0.881875  0.76375
##   60    extratrees  0.869375  0.73875
##   72    gini        0.879375  0.75875
##   72    extratrees  0.873125  0.74625
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 25, splitrule = gini
##  and min.node.size = 1.
plot(down_rf_clf)

Tuning done.

down_rf_metrics_clf <- evaluate_model_clf(down_rf_clf, "Random Forest")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   17   13
##       pass    3   96
##                                           
##                Accuracy : 0.876           
##                  95% CI : (0.8064, 0.9274)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.19935         
##                                           
##                   Kappa : 0.6069          
##                                           
##  Mcnemar's Test P-Value : 0.02445         
##                                           
##               Precision : 0.5667          
##                  Recall : 0.8500          
##                      F1 : 0.6800          
##              Prevalence : 0.1550          
##          Detection Rate : 0.1318          
##    Detection Prevalence : 0.2326          
##       Balanced Accuracy : 0.8654          
##                                           
##        'Positive' Class : fail            
## 
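
The McNemar's Test P-Value reported above compares the two kinds of error: 13 passing students predicted as fail against 3 failing students predicted as pass. A small P-value suggests the errors are not symmetric. The sketch below re-derives it with `mcnemar.test` from base R's stats package, using the counts typed in from the output; the object names are illustrative.

```r
# Down-sampled random forest confusion matrix: rows = prediction, columns = reference.
cm_down <- matrix(c(17, 3, 13, 96), nrow = 2,
                  dimnames = list(Prediction = c("fail", "pass"),
                                  Reference  = c("fail", "pass")))

# Continuity correction is applied by default.
res <- mcnemar.test(cm_down)
res$statistic  # (|13 - 3| - 1)^2 / (13 + 3)
res$p.value
```

The resulting P-value agrees with the 0.02445 shown above. This is consistent with down-sampling having shifted the model toward predicting 'fail' more often, which raises recall at the cost of precision.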
Up-Sampled Train Set
up_rf_clf <- train_model_clf("ranger", train_set_clf_up)
## Random Forest 
## 
## 880 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##    2    gini        0.9803409  0.9606818
##    2    extratrees  0.9795455  0.9590909
##   37    gini        0.9789773  0.9579545
##   37    extratrees  0.9820455  0.9640909
##   72    gini        0.9786364  0.9572727
##   72    extratrees  0.9786364  0.9572727
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 37, splitrule = extratrees
##  and min.node.size = 1.
plot(up_rf_clf)

Perform hyperparameter tuning using tuneLength.

up_rf_clf <- train_model_clf("ranger", train_set_clf_up, tuneLength = 7)
## Random Forest 
## 
## 880 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##    2    gini        0.9812500  0.9625000
##    2    extratrees  0.9801136  0.9602273
##   13    gini        0.9807955  0.9615909
##   13    extratrees  0.9819318  0.9638636
##   25    gini        0.9794318  0.9588636
##   25    extratrees  0.9802273  0.9604545
##   37    gini        0.9785227  0.9570455
##   37    extratrees  0.9815909  0.9631818
##   48    gini        0.9786364  0.9572727
##   48    extratrees  0.9817045  0.9634091
##   60    gini        0.9787500  0.9575000
##   60    extratrees  0.9803409  0.9606818
##   72    gini        0.9781818  0.9563636
##   72    extratrees  0.9780682  0.9561364
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 13, splitrule = extratrees
##  and min.node.size = 1.
plot(up_rf_clf)

Tuning done.

up_rf_metrics_clf <- evaluate_model_clf(up_rf_clf, "Random Forest")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   12    3
##       pass    8  106
##                                           
##                Accuracy : 0.9147          
##                  95% CI : (0.8525, 0.9567)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.01442         
##                                           
##                   Kappa : 0.6375          
##                                           
##  Mcnemar's Test P-Value : 0.22780         
##                                           
##               Precision : 0.80000         
##                  Recall : 0.60000         
##                      F1 : 0.68571         
##              Prevalence : 0.15504         
##          Detection Rate : 0.09302         
##    Detection Prevalence : 0.11628         
##       Balanced Accuracy : 0.78624         
##                                           
##        'Positive' Class : fail            
## 
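
Kappa adjusts the observed accuracy for the agreement expected by chance given the row and column totals. This matters here because about 84.5% of the test set is 'pass'. As a minimal sketch, it can be recomputed in base R from the confusion matrix above; the variable names are illustrative.

```r
# Up-sampled random forest confusion matrix: rows = prediction, columns = reference.
cm_up <- matrix(c(12, 8, 3, 106), nrow = 2,
                dimnames = list(Prediction = c("fail", "pass"),
                                Reference  = c("fail", "pass")))

n          <- sum(cm_up)
p_observed <- sum(diag(cm_up)) / n                        # plain accuracy
p_expected <- sum(rowSums(cm_up) * colSums(cm_up)) / n^2  # chance agreement
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

round(cohen_kappa, 4)
```

The result matches the Kappa of 0.6375 reported above.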

3. SVM with Polynomial Kernel

  • method = ‘svmPoly’
  • Type: Regression, Classification
  • Tuning parameters:
    • degree (Polynomial Degree)
    • scale (Scale)
    • C (Cost)
  • Required packages: kernlab
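
The degree and scale parameters describe a polynomial kernel of the form (scale * <x, y> + offset)^degree, while C is the cost of margin violations and is not part of the kernel. The sketch below evaluates the kernel in base R on two toy vectors. The function name `poly_kernel` is hypothetical, and the offset of 1 is an assumption about the fixed value used, not something stated in the output.

```r
# Polynomial kernel (illustrative): offset = 1 is an assumed default.
poly_kernel <- function(x, y, degree, scale, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}

x <- c(1, 2)
y <- c(3, 4)

poly_kernel(x, y, degree = 1, scale = 0.1)  # degree 1: linear in the inner product
poly_kernel(x, y, degree = 3, scale = 0.1)  # degree 3: cubic in the inner product
```

With degree = 1 the kernel is a shifted, rescaled inner product, so the model behaves like a linear SVM.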
Original Train Set
svm_poly_clf <- train_model_clf("svmPoly")
## Support Vector Machines with Polynomial Kernel 
## 
## 520 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     Accuracy   Kappa      
##   1       0.001  0.25  0.8461538  0.000000000
##   1       0.001  0.50  0.8461538  0.000000000
##   1       0.001  1.00  0.8461538  0.000000000
##   1       0.010  0.25  0.8546154  0.083940229
##   1       0.010  0.50  0.8732692  0.260599844
##   1       0.010  1.00  0.8817308  0.394972860
##   1       0.100  0.25  0.9042308  0.574306440
##   1       0.100  0.50  0.9138462  0.636752983
##   1       0.100  1.00  0.9167308  0.660537348
##   2       0.001  0.25  0.8461538  0.000000000
##   2       0.001  0.50  0.8461538  0.000000000
##   2       0.001  1.00  0.8492308  0.030863195
##   2       0.010  0.25  0.8748077  0.276356280
##   2       0.010  0.50  0.8878846  0.454151837
##   2       0.010  1.00  0.9117308  0.611795906
##   2       0.100  0.25  0.8886538  0.501281678
##   2       0.100  0.50  0.8886538  0.501281678
##   2       0.100  1.00  0.8886538  0.501281678
##   3       0.001  0.25  0.8461538  0.000000000
##   3       0.001  0.50  0.8465385  0.003893805
##   3       0.001  1.00  0.8598077  0.132831582
##   3       0.010  0.25  0.8892308  0.447824384
##   3       0.010  0.50  0.9067308  0.579955776
##   3       0.010  1.00  0.9036538  0.576579426
##   3       0.100  0.25  0.8775000  0.358828319
##   3       0.100  0.50  0.8775000  0.358828319
##   3       0.100  1.00  0.8775000  0.358828319
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1, scale = 0.1 and C = 1.
plot(svm_poly_clf)

Perform hyperparameter tuning using tuneGrid.

svm_poly_tuneGrid_clf <- expand.grid(
  degree = 1,
  scale = c(0.01, 0.1),
  C = seq(0.1, 1.5, 0.05)
)

svm_poly_clf <- train_model_clf("svmPoly", tuneGrid = svm_poly_tuneGrid_clf)
## Support Vector Machines with Polynomial Kernel 
## 
## 520 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ... 
## Resampling results across tuning parameters:
## 
##   scale  C     Accuracy   Kappa     
##   0.01   0.10  0.8461538  0.00000000
##   0.01   0.15  0.8461538  0.00000000
##   0.01   0.20  0.8492308  0.03086319
##   0.01   0.25  0.8546154  0.08394023
##   0.01   0.30  0.8576923  0.11365443
##   0.01   0.35  0.8638462  0.17052312
##   0.01   0.40  0.8682692  0.21034175
##   0.01   0.45  0.8713462  0.24060396
##   0.01   0.50  0.8732692  0.26059984
##   0.01   0.55  0.8755769  0.28188147
##   0.01   0.60  0.8753846  0.28132437
##   0.01   0.65  0.8773077  0.29947793
##   0.01   0.70  0.8782692  0.31299249
##   0.01   0.75  0.8803846  0.33723638
##   0.01   0.80  0.8819231  0.35690555
##   0.01   0.85  0.8821154  0.36999562
##   0.01   0.90  0.8826923  0.38047267
##   0.01   0.95  0.8825000  0.39115441
##   0.01   1.00  0.8817308  0.39497286
##   0.01   1.05  0.8821154  0.40146122
##   0.01   1.10  0.8838462  0.41961460
##   0.01   1.15  0.8842308  0.42965830
##   0.01   1.20  0.8853846  0.43706948
##   0.01   1.25  0.8878846  0.45460578
##   0.01   1.30  0.8900000  0.47114159
##   0.01   1.35  0.8917308  0.48367835
##   0.01   1.40  0.8919231  0.48581998
##   0.01   1.45  0.8938462  0.50044125
##   0.01   1.50  0.8953846  0.51256552
##   0.10   0.10  0.8817308  0.39497286
##   0.10   0.15  0.8953846  0.51256552
##   0.10   0.20  0.9011538  0.55073283
##   0.10   0.25  0.9042308  0.57430644
##   0.10   0.30  0.9073077  0.59887202
##   0.10   0.35  0.9082692  0.60766579
##   0.10   0.40  0.9094231  0.61349688
##   0.10   0.45  0.9109615  0.62042831
##   0.10   0.50  0.9138462  0.63675298
##   0.10   0.55  0.9163462  0.64742641
##   0.10   0.60  0.9182692  0.65957162
##   0.10   0.65  0.9200000  0.66690415
##   0.10   0.70  0.9201923  0.66990832
##   0.10   0.75  0.9200000  0.66753386
##   0.10   0.80  0.9196154  0.66678110
##   0.10   0.85  0.9188462  0.66656228
##   0.10   0.90  0.9186538  0.66640525
##   0.10   0.95  0.9173077  0.66144950
##   0.10   1.00  0.9167308  0.66053735
##   0.10   1.05  0.9163462  0.65870244
##   0.10   1.10  0.9155769  0.65680728
##   0.10   1.15  0.9161538  0.65811114
##   0.10   1.20  0.9163462  0.65900978
##   0.10   1.25  0.9167308  0.66071945
##   0.10   1.30  0.9161538  0.65865021
##   0.10   1.35  0.9155769  0.65711508
##   0.10   1.40  0.9148077  0.65527095
##   0.10   1.45  0.9140385  0.65206420
##   0.10   1.50  0.9140385  0.65320292
## 
## Tuning parameter 'degree' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1, scale = 0.1 and C = 0.7.
plot(svm_poly_clf)
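
The size of a tuning grid built with `expand.grid` is the product of the number of values supplied for each parameter. The sketch below rebuilds the same grid on its own to show how many candidate models it contains; the object name `grid` is illustrative.

```r
# Same grid definition as above, inspected on its own.
grid <- expand.grid(
  degree = 1,
  scale = c(0.01, 0.1),
  C = seq(0.1, 1.5, 0.05)
)

length(unique(grid$C))  # 29 cost values
nrow(grid)              # 1 degree x 2 scales x 29 costs = 58 candidates
head(grid, 3)
```

With 10-fold cross-validation repeated 10 times, each of the 58 candidates is fitted on 100 resamples, which is why a finer grid quickly becomes expensive.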

Tuning done.

svm_poly_metrics_clf <- evaluate_model_clf(svm_poly_clf, "Support Vector Machine")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   14    5
##       pass    6  104
##                                           
##                Accuracy : 0.9147          
##                  95% CI : (0.8525, 0.9567)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.01442         
##                                           
##                   Kappa : 0.6678          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##               Precision : 0.7368          
##                  Recall : 0.7000          
##                      F1 : 0.7179          
##              Prevalence : 0.1550          
##          Detection Rate : 0.1085          
##    Detection Prevalence : 0.1473          
##       Balanced Accuracy : 0.8271          
##                                           
##        'Positive' Class : fail            
## 
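
The No Information Rate is the accuracy obtained by always predicting the majority class, and P-Value [Acc > NIR] tests whether the model's accuracy exceeds it. As a hedged sketch, a one-sided exact binomial test in base R can be applied to the counts above. Treating this as the same procedure behind the reported figure is an assumption, and the object names are illustrative.

```r
# SVM (original train set) confusion matrix: rows = prediction, columns = reference.
cm_svm <- matrix(c(14, 6, 5, 104), nrow = 2,
                 dimnames = list(Prediction = c("fail", "pass"),
                                 Reference  = c("fail", "pass")))

n_total   <- sum(cm_svm)
n_correct <- sum(diag(cm_svm))
nir       <- max(colSums(cm_svm)) / n_total  # share of the larger reference class

# One-sided exact binomial test: is accuracy greater than the NIR?
nir_test <- binom.test(n_correct, n_total, p = nir, alternative = "greater")

round(nir, 3)
nir_test$p.value
```

The NIR of 0.845 corresponds to 109 'pass' students out of 129, so an accuracy of 0.9147 is only a moderate improvement over the trivial classifier.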
Down-Sampled Train Set
down_svm_poly_clf <- train_model_clf("svmPoly", train_set_clf_down)
## Support Vector Machines with Polynomial Kernel 
## 
## 160 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     Accuracy  Kappa  
##   1       0.001  0.25  0.768750  0.53750
##   1       0.001  0.50  0.768750  0.53750
##   1       0.001  1.00  0.770000  0.54000
##   1       0.010  0.25  0.758750  0.51750
##   1       0.010  0.50  0.774375  0.54875
##   1       0.010  1.00  0.798750  0.59750
##   1       0.100  0.25  0.770000  0.54000
##   1       0.100  0.50  0.765625  0.53125
##   1       0.100  1.00  0.766250  0.53250
##   2       0.001  0.25  0.769375  0.53875
##   2       0.001  0.50  0.768750  0.53750
##   2       0.001  1.00  0.765625  0.53125
##   2       0.010  0.25  0.779375  0.55875
##   2       0.010  0.50  0.796875  0.59375
##   2       0.010  1.00  0.780000  0.56000
##   2       0.100  0.25  0.763125  0.52625
##   2       0.100  0.50  0.763125  0.52625
##   2       0.100  1.00  0.763125  0.52625
##   3       0.001  0.25  0.768750  0.53750
##   3       0.001  0.50  0.762500  0.52500
##   3       0.001  1.00  0.761250  0.52250
##   3       0.010  0.25  0.794375  0.58875
##   3       0.010  0.50  0.804375  0.60875
##   3       0.010  1.00  0.781250  0.56250
##   3       0.100  0.25  0.748750  0.49750
##   3       0.100  0.50  0.748750  0.49750
##   3       0.100  1.00  0.748750  0.49750
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.01 and C = 0.5.
plot(down_svm_poly_clf)

Perform hyperparameter tuning using tuneGrid.

down_svm_poly_tuneGrid_clf <- expand.grid(
  degree = 3,
  scale = 0.01,
  C = seq(0.25, 1, 0.05)
)

down_svm_poly_clf <- train_model_clf(
  "svmPoly",
  train_set_clf_down,
  tuneGrid = down_svm_poly_tuneGrid_clf
)
## Support Vector Machines with Polynomial Kernel 
## 
## 160 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy  Kappa  
##   0.25  0.794375  0.58875
##   0.30  0.800000  0.60000
##   0.35  0.801875  0.60375
##   0.40  0.800625  0.60125
##   0.45  0.802500  0.60500
##   0.50  0.804375  0.60875
##   0.55  0.801875  0.60375
##   0.60  0.793125  0.58625
##   0.65  0.795000  0.59000
##   0.70  0.790625  0.58125
##   0.75  0.788750  0.57750
##   0.80  0.785000  0.57000
##   0.85  0.784375  0.56875
##   0.90  0.784375  0.56875
##   0.95  0.781250  0.56250
##   1.00  0.781250  0.56250
## 
## Tuning parameter 'degree' was held constant at a value of 3
## Tuning
##  parameter 'scale' was held constant at a value of 0.01
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.01 and C = 0.5.
plot(down_svm_poly_clf)

Tuning done.

down_svm_poly_metrics_clf <- evaluate_model_clf(down_svm_poly_clf, "Support Vector Machine")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   18   16
##       pass    2   93
##                                           
##                Accuracy : 0.8605          
##                  95% CI : (0.7885, 0.9152)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.366920        
##                                           
##                   Kappa : 0.5858          
##                                           
##  Mcnemar's Test P-Value : 0.002183        
##                                           
##               Precision : 0.5294          
##                  Recall : 0.9000          
##                      F1 : 0.6667          
##              Prevalence : 0.1550          
##          Detection Rate : 0.1395          
##    Detection Prevalence : 0.2636          
##       Balanced Accuracy : 0.8766          
##                                           
##        'Positive' Class : fail            
## 
Up-Sampled Train Set
up_svm_poly_clf <- train_model_clf("svmPoly", train_set_clf_up)
## Support Vector Machines with Polynomial Kernel 
## 
## 880 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     Accuracy   Kappa    
##   1       0.001  0.25  0.8207955  0.6415909
##   1       0.001  0.50  0.8480682  0.6961364
##   1       0.001  1.00  0.8709091  0.7418182
##   1       0.010  0.25  0.8906818  0.7813636
##   1       0.010  0.50  0.8967045  0.7934091
##   1       0.010  1.00  0.9063636  0.8127273
##   1       0.100  0.25  0.9256818  0.8513636
##   1       0.100  0.50  0.9351136  0.8702273
##   1       0.100  1.00  0.9400000  0.8800000
##   2       0.001  0.25  0.8502273  0.7004545
##   2       0.001  0.50  0.8738636  0.7477273
##   2       0.001  1.00  0.8902273  0.7804545
##   2       0.010  0.25  0.9232955  0.8465909
##   2       0.010  0.50  0.9387500  0.8775000
##   2       0.010  1.00  0.9576136  0.9152273
##   2       0.100  0.25  0.9738636  0.9477273
##   2       0.100  0.50  0.9738636  0.9477273
##   2       0.100  1.00  0.9738636  0.9477273
##   3       0.001  0.25  0.8648864  0.7297727
##   3       0.001  0.50  0.8860227  0.7720455
##   3       0.001  1.00  0.8954545  0.7909091
##   3       0.010  0.25  0.9564773  0.9129545
##   3       0.010  0.50  0.9711364  0.9422727
##   3       0.010  1.00  0.9750000  0.9500000
##   3       0.100  0.25  0.9913636  0.9827273
##   3       0.100  0.50  0.9913636  0.9827273
##   3       0.100  1.00  0.9913636  0.9827273
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 0.25.
plot(up_svm_poly_clf)

Perform hyperparameter tuning using tuneGrid.

up_svm_poly_tuneGrid_clf <- expand.grid(
  degree = 3,
  scale = 0.1,
  C = seq(0.05, 0.25, 0.05)
)

up_svm_poly_clf <- train_model_clf(
  "svmPoly",
  train_set_clf_up,
  tuneGrid = up_svm_poly_tuneGrid_clf
)
## Support Vector Machines with Polynomial Kernel 
## 
## 880 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.05  0.9913636  0.9827273
##   0.10  0.9913636  0.9827273
##   0.15  0.9913636  0.9827273
##   0.20  0.9913636  0.9827273
##   0.25  0.9913636  0.9827273
## 
## Tuning parameter 'degree' was held constant at a value of 3
## Tuning
##  parameter 'scale' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 0.05.
plot(up_svm_poly_clf)

Tuning done.

up_svm_poly_metrics_clf <- evaluate_model_clf(up_svm_poly_clf, "Support Vector Machine")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail    7    1
##       pass   13  108
##                                           
##                Accuracy : 0.8915          
##                  95% CI : (0.8246, 0.9394)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.086135        
##                                           
##                   Kappa : 0.4514          
##                                           
##  Mcnemar's Test P-Value : 0.003283        
##                                           
##               Precision : 0.87500         
##                  Recall : 0.35000         
##                      F1 : 0.50000         
##              Prevalence : 0.15504         
##          Detection Rate : 0.05426         
##    Detection Prevalence : 0.06202         
##       Balanced Accuracy : 0.67041         
##                                           
##        'Positive' Class : fail            
## 

4. K-Nearest Neighbors

  • method = ‘knn’
  • Type: Classification, Regression
  • Tuning parameters:
    • k (#Neighbors)
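
The single tuning parameter k is the number of closest training rows that vote on the class of a new observation. The base R sketch below shows the idea on toy data. The function `knn_predict` and the data are hypothetical and are not the implementation used by the 'knn' method.

```r
# Minimal k-nearest-neighbours classifier (illustrative only).
knn_predict <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(d)[seq_len(k)]
  # Majority vote among the k nearest labels
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- c("fail", "fail", "fail", "pass", "pass", "pass")

knn_predict(train_x, train_y, new_x = c(0.5, 0.5), k = 3)
knn_predict(train_x, train_y, new_x = c(5.5, 5.5), k = 3)
```

Because the vote depends on distances, predictors on larger scales would dominate without the centering and scaling step shown in the pre-processing line of the outputs.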
Original Train Set
knn_clf <- train_model_clf("knn")
## k-Nearest Neighbors 
## 
## 520 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8580769  0.2849825
##   7  0.8525000  0.2177127
##   9  0.8575000  0.2242462
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(knn_clf)

Perform hyperparameter tuning using tuneGrid.

knn_clf <- train_model_clf("knn", tuneGrid = expand.grid(k = c(3, 5, 7, 11)))
## k-Nearest Neighbors 
## 
## 520 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    3  0.8526923  0.3232641
##    5  0.8580769  0.2849825
##    7  0.8525000  0.2177127
##   11  0.8551923  0.1845602
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(knn_clf)

No further tuning necessary.

knn_metrics_clf <- evaluate_model_clf(knn_clf, "K-Nearest Neighbors")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail    8    5
##       pass   12  104
##                                           
##                Accuracy : 0.8682          
##                  95% CI : (0.7974, 0.9213)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.2776          
##                                           
##                   Kappa : 0.4132          
##                                           
##  Mcnemar's Test P-Value : 0.1456          
##                                           
##               Precision : 0.61538         
##                  Recall : 0.40000         
##                      F1 : 0.48485         
##              Prevalence : 0.15504         
##          Detection Rate : 0.06202         
##    Detection Prevalence : 0.10078         
##       Balanced Accuracy : 0.67706         
##                                           
##        'Positive' Class : fail            
## 
Down-Sampled Train Set
down_knn_clf <- train_model_clf("knn", train_set_clf_down)
## k-Nearest Neighbors 
## 
## 160 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy  Kappa  
##   5  0.714375  0.42875
##   7  0.715625  0.43125
##   9  0.727500  0.45500
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot(down_knn_clf)

Perform hyperparameter tuning using tuneGrid.

down_knn_clf <- train_model_clf(
  "knn",
  train_set_clf_down,
  tuneGrid = expand.grid(k = c(9, 11, 13, 15))
)
## k-Nearest Neighbors 
## 
## 160 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy  Kappa  
##    9  0.726875  0.45375
##   11  0.729375  0.45875
##   13  0.736875  0.47375
##   15  0.727500  0.45500
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 13.
plot(down_knn_clf)

No further tuning required.

down_knn_metrics_clf <- evaluate_model_clf(down_knn_clf, "K-Nearest Neighbors")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   16   15
##       pass    4   94
##                                           
##                Accuracy : 0.8527          
##                  95% CI : (0.7796, 0.9089)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.46267         
##                                           
##                   Kappa : 0.5409          
##                                           
##  Mcnemar's Test P-Value : 0.02178         
##                                           
##               Precision : 0.5161          
##                  Recall : 0.8000          
##                      F1 : 0.6275          
##              Prevalence : 0.1550          
##          Detection Rate : 0.1240          
##    Detection Prevalence : 0.2403          
##       Balanced Accuracy : 0.8312          
##                                           
##        'Positive' Class : fail            
## 
Up-Sampled Train Set
up_knn_clf <- train_model_clf("knn", train_set_clf_up)
## k-Nearest Neighbors 
## 
## 880 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8503409  0.7006818
##   7  0.8335227  0.6670455
##   9  0.8397727  0.6795455
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(up_knn_clf)

Perform hyperparameter tuning using tuneGrid.

up_knn_clf <- train_model_clf(
  "knn",
  train_set_clf_up,
  tuneGrid = expand.grid(k = c(1, 3, 5))
)
## k-Nearest Neighbors 
## 
## 880 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   1  0.9648864  0.9297727
##   3  0.8918182  0.7836364
##   5  0.8507955  0.7015909
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
plot(up_knn_clf)

Tuning done.
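
The jump in cross-validated accuracy at k = 1 (0.965 against 0.851 at k = 5) should be read with care. If the up-sampled train set was built by resampling minority rows with replacement, copies of the same student can fall in both the training and held-out folds, and a 1-nearest-neighbour model then finds an exact match. The sketch below simulates this under stated assumptions: the sample counts suggest 80 'fail' rows replicated to 440 (inferred from the 160-row and 880-row sets), and the fold assignment here is plain random rather than the exact resampling used above.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Assumed: 80 distinct minority rows up-sampled to 440 with replacement.
minority_id <- sample(seq_len(80), size = 440, replace = TRUE)

# Assign the 440 up-sampled rows to 10 folds at random.
fold <- sample(rep(1:10, length.out = 440))

# Per fold: share of held-out rows whose original row also appears
# in the training folds (an exact duplicate at distance zero).
leak <- sapply(1:10, function(f) {
  held_out <- minority_id[fold == f]
  training <- minority_id[fold != f]
  mean(held_out %in% training)
})

round(mean(leak), 3)
```

Under these assumptions nearly every held-out minority row has an identical copy in the training folds, which would explain why the resampled accuracy is far higher than the 0.8915 obtained on the untouched test set.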

up_knn_metrics_clf <- evaluate_model_clf(up_knn_clf, "K-Nearest Neighbors")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   10    4
##       pass   10  105
##                                           
##                Accuracy : 0.8915          
##                  95% CI : (0.8246, 0.9394)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.08614         
##                                           
##                   Kappa : 0.528           
##                                           
##  Mcnemar's Test P-Value : 0.18145         
##                                           
##               Precision : 0.71429         
##                  Recall : 0.50000         
##                      F1 : 0.58824         
##              Prevalence : 0.15504         
##          Detection Rate : 0.07752         
##    Detection Prevalence : 0.10853         
##       Balanced Accuracy : 0.73165         
##                                           
##        'Positive' Class : fail            
## 

5. Logistic Regression (Regularized, via glmnet)

  • method = ‘glmnet’
  • Type: Regression, Classification
  • Tuning parameters:
    • alpha (Mixing Percentage)
    • lambda (Regularization Parameter)
  • Required packages: glmnet, Matrix
  • A model-specific variable importance metric is available.
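
For a two-class outcome, the two tuning parameters enter through the elastic-net penalty added to the logistic log-likelihood. In the notation below (introduced here for illustration), $y_i \in \{0, 1\}$ is the class indicator, $x_i$ the predictor vector, and $N$ the number of training rows:

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
+ \lambda\Big[\, \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
  + \alpha\,\lVert\beta\rVert_1 \Big]
```

Setting alpha = 1 gives the lasso penalty, alpha = 0 gives ridge, and lambda controls the overall strength of the shrinkage.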
Original Train Set
lr_clf <- train_model_clf("glmnet")
## glmnet 
## 
## 520 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy   Kappa    
##   0.10   0.0004322165  0.9017308  0.6256426
##   0.10   0.0043221649  0.9159615  0.6555571
##   0.10   0.0432216490  0.9109615  0.5802518
##   0.55   0.0004322165  0.9030769  0.6312819
##   0.55   0.0043221649  0.9261538  0.6950640
##   0.55   0.0432216490  0.9128846  0.5716085
##   1.00   0.0004322165  0.9015385  0.6298148
##   1.00   0.0043221649  0.9284615  0.7077239
##   1.00   0.0432216490  0.9203846  0.6030193
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.004322165.
plot(lr_clf)

Perform hyperparameter tuning using tuneLength.

lr_clf <- train_model_clf("glmnet", tuneLength = 10)
## glmnet 
## 
## 520 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy   Kappa     
##   0.1    9.984762e-05  0.8967308  0.60779310
##   0.1    2.306609e-04  0.8969231  0.60876498
##   0.1    5.328567e-04  0.9040385  0.63297908
##   0.1    1.230968e-03  0.9090385  0.64326112
##   0.1    2.843696e-03  0.9151923  0.65804902
##   0.1    6.569306e-03  0.9184615  0.65913957
##   0.1    1.517595e-02  0.9198077  0.65281748
##   0.1    3.505841e-02  0.9130769  0.60057173
##   0.1    8.098948e-02  0.8980769  0.48352832
##   0.1    1.870962e-01  0.8776923  0.31219660
##   0.2    9.984762e-05  0.8913462  0.58816286
##   0.2    2.306609e-04  0.8967308  0.60780293
##   0.2    5.328567e-04  0.9044231  0.63496540
##   0.2    1.230968e-03  0.9092308  0.64428003
##   0.2    2.843696e-03  0.9159615  0.66058001
##   0.2    6.569306e-03  0.9211538  0.66934419
##   0.2    1.517595e-02  0.9203846  0.65342998
##   0.2    3.505841e-02  0.9157692  0.61160217
##   0.2    8.098948e-02  0.8973077  0.47412016
##   0.2    1.870962e-01  0.8661538  0.20053350
##   0.3    9.984762e-05  0.8901923  0.58434279
##   0.3    2.306609e-04  0.8971154  0.60961696
##   0.3    5.328567e-04  0.9048077  0.63646919
##   0.3    1.230968e-03  0.9115385  0.65316063
##   0.3    2.843696e-03  0.9192308  0.67163219
##   0.3    6.569306e-03  0.9228846  0.67443623
##   0.3    1.517595e-02  0.9228846  0.66222271
##   0.3    3.505841e-02  0.9163462  0.61071768
##   0.3    8.098948e-02  0.8998077  0.48691947
##   0.3    1.870962e-01  0.8623077  0.15480630
##   0.4    9.984762e-05  0.8892308  0.58177551
##   0.4    2.306609e-04  0.8965385  0.60750958
##   0.4    5.328567e-04  0.9048077  0.63681858
##   0.4    1.230968e-03  0.9134615  0.66207686
##   0.4    2.843696e-03  0.9203846  0.67517958
##   0.4    6.569306e-03  0.9269231  0.69280361
##   0.4    1.517595e-02  0.9240385  0.66710357
##   0.4    3.505841e-02  0.9163462  0.60437666
##   0.4    8.098948e-02  0.8976923  0.46924372
##   0.4    1.870962e-01  0.8576923  0.11227740
##   0.5    9.984762e-05  0.8892308  0.58247849
##   0.5    2.306609e-04  0.8963462  0.60773161
##   0.5    5.328567e-04  0.9055769  0.64021281
##   0.5    1.230968e-03  0.9150000  0.66832647
##   0.5    2.843696e-03  0.9225000  0.68493490
##   0.5    6.569306e-03  0.9273077  0.69380308
##   0.5    1.517595e-02  0.9236538  0.66472678
##   0.5    3.505841e-02  0.9159615  0.59925649
##   0.5    8.098948e-02  0.8946154  0.43913684
##   0.5    1.870962e-01  0.8575000  0.11061774
##   0.6    9.984762e-05  0.8888462  0.58048769
##   0.6    2.306609e-04  0.8951923  0.60338744
##   0.6    5.328567e-04  0.9063462  0.64280205
##   0.6    1.230968e-03  0.9159615  0.67148649
##   0.6    2.843696e-03  0.9238462  0.69090400
##   0.6    6.569306e-03  0.9276923  0.69504169
##   0.6    1.517595e-02  0.9251923  0.67162014
##   0.6    3.505841e-02  0.9171154  0.60269059
##   0.6    8.098948e-02  0.8905769  0.40008936
##   0.6    1.870962e-01  0.8555769  0.09172321
##   0.7    9.984762e-05  0.8880769  0.57807889
##   0.7    2.306609e-04  0.8953846  0.60424625
##   0.7    5.328567e-04  0.9063462  0.64383917
##   0.7    1.230968e-03  0.9171154  0.67646171
##   0.7    2.843696e-03  0.9234615  0.69064605
##   0.7    6.569306e-03  0.9301923  0.70510589
##   0.7    1.517595e-02  0.9261538  0.67808667
##   0.7    3.505841e-02  0.9188462  0.61092352
##   0.7    8.098948e-02  0.8836538  0.34632410
##   0.7    1.870962e-01  0.8515385  0.05342349
##   0.8    9.984762e-05  0.8873077  0.57666165
##   0.8    2.306609e-04  0.8942308  0.60060697
##   0.8    5.328567e-04  0.9053846  0.64064662
##   0.8    1.230968e-03  0.9184615  0.68234425
##   0.8    2.843696e-03  0.9244231  0.69508433
##   0.8    6.569306e-03  0.9298077  0.70629359
##   0.8    1.517595e-02  0.9284615  0.69149841
##   0.8    3.505841e-02  0.9205769  0.62215185
##   0.8    8.098948e-02  0.8794231  0.31326341
##   0.8    1.870962e-01  0.8461538  0.00000000
##   0.9    9.984762e-05  0.8880769  0.58075317
##   0.9    2.306609e-04  0.8938462  0.60166628
##   0.9    5.328567e-04  0.9063462  0.64605661
##   0.9    1.230968e-03  0.9188462  0.68554659
##   0.9    2.843696e-03  0.9242308  0.69488854
##   0.9    6.569306e-03  0.9311538  0.71251098
##   0.9    1.517595e-02  0.9290385  0.69747879
##   0.9    3.505841e-02  0.9251923  0.64333826
##   0.9    8.098948e-02  0.8773077  0.29712216
##   0.9    1.870962e-01  0.8461538  0.00000000
##   1.0    9.984762e-05  0.8896154  0.58740072
##   1.0    2.306609e-04  0.8930769  0.59910168
##   1.0    5.328567e-04  0.9051923  0.64221568
##   1.0    1.230968e-03  0.9180769  0.68455273
##   1.0    2.843696e-03  0.9257692  0.70368923
##   1.0    6.569306e-03  0.9305769  0.71375472
##   1.0    1.517595e-02  0.9296154  0.70099041
##   1.0    3.505841e-02  0.9275000  0.65591012
##   1.0    8.098948e-02  0.8740385  0.26875886
##   1.0    1.870962e-01  0.8461538  0.00000000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.9 and lambda = 0.006569306.
plot(lr_clf)

The finer grid raises the best cross-validated accuracy only marginally (0.9285 to 0.9312), so no further tuning is necessary.
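
With tuneLength = 10, the output above covers 10 values of alpha and 10 values of lambda, i.e. 100 combinations. If tighter control over the search range is wanted, an explicit grid can be built instead. The sketch below uses only base R; the range endpoints and the name glmnet_grid are illustrative assumptions, and passing the grid on through the tuneGrid argument of caret::train assumes that train_model_clf (not shown in this section) forwards extra arguments.

```r
# Hypothetical explicit grid: 10 mixing values x 10 log-spaced penalties.
# alpha = 1 is the lasso penalty; smaller values move towards ridge.
alpha_values <- seq(0.1, 1, length.out = 10)

# Log-spaced lambda values between 1e-4 and 0.2 (assumed range, chosen to
# roughly match the magnitudes seen in the tuning output).
lambda_values <- 10^seq(-4, log10(0.2), length.out = 10)

# Every combination of alpha and lambda, one row per candidate model.
glmnet_grid <- expand.grid(alpha = alpha_values, lambda = lambda_values)

nrow(glmnet_grid)
head(glmnet_grid, 3)
```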

lr_metrics_clf <- evaluate_model_clf(lr_clf, "Logistic Regression")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   14    5
##       pass    6  104
##                                           
##                Accuracy : 0.9147          
##                  95% CI : (0.8525, 0.9567)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.01442         
##                                           
##                   Kappa : 0.6678          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##               Precision : 0.7368          
##                  Recall : 0.7000          
##                      F1 : 0.7179          
##              Prevalence : 0.1550          
##          Detection Rate : 0.1085          
##    Detection Prevalence : 0.1473          
##       Balanced Accuracy : 0.8271          
##                                           
##        'Positive' Class : fail            
## 
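
The statistics reported for this confusion matrix can be reproduced by hand. The following is a minimal sketch in base R, assuming the counts shown above with fail as the positive class. The variable names are ours for illustration and are not part of evaluate_model_clf.

```r
# Confusion matrix from the output above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(
  c(14, 6, 5, 104),
  nrow = 2,
  dimnames = list(
    Prediction = c("fail", "pass"),
    Reference = c("fail", "pass")
  )
)

tp <- cm["fail", "fail"] # failing students predicted as fail
fp <- cm["fail", "pass"] # passing students predicted as fail
fn <- cm["pass", "fail"] # failing students predicted as pass
tn <- cm["pass", "pass"] # passing students predicted as pass
n <- sum(cm)

accuracy <- (tp + tn) / n
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1 <- 2 * precision * recall / (precision + recall)
balanced_accuracy <- (recall + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(
  Accuracy = accuracy, Kappa = cohen_kappa, Precision = precision,
  Recall = recall, F1 = f1, BalancedAccuracy = balanced_accuracy
), 4)
```

Rounded to four decimals, these values should agree with the Accuracy, Kappa, Precision, Recall, F1 and Balanced Accuracy lines printed above.
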
Down-Sampled Train Set
down_lr_clf <- train_model_clf("glmnet", train_set_clf_down)
## glmnet 
## 
## 160 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy  Kappa  
##   0.10   0.0007071498  0.775000  0.55000
##   0.10   0.0070714983  0.792500  0.58500
##   0.10   0.0707149828  0.821875  0.64375
##   0.55   0.0007071498  0.791875  0.58375
##   0.55   0.0070714983  0.833125  0.66625
##   0.55   0.0707149828  0.858125  0.71625
##   1.00   0.0007071498  0.809375  0.61875
##   1.00   0.0070714983  0.839375  0.67875
##   1.00   0.0707149828  0.875000  0.75000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.07071498.
plot(down_lr_clf)

Perform hyperparameter tuning using tuneLength.

down_lr_clf <- train_model_clf("glmnet", train_set_clf_down, tuneLength = 10)
## glmnet 
## 
## 160 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy  Kappa  
##   0.1    0.0001633608  0.772500  0.54500
##   0.1    0.0003773846  0.773125  0.54625
##   0.1    0.0008718074  0.775625  0.55125
##   0.1    0.0020139881  0.775000  0.55000
##   0.1    0.0046525737  0.781875  0.56375
##   0.1    0.0107480486  0.798125  0.59625
##   0.1    0.0248293863  0.808750  0.61750
##   0.1    0.0573591027  0.823750  0.64750
##   0.1    0.1325069668  0.821875  0.64375
##   0.1    0.3061082795  0.829375  0.65875
##   0.2    0.0001633608  0.770000  0.54000
##   0.2    0.0003773846  0.773125  0.54625
##   0.2    0.0008718074  0.777500  0.55500
##   0.2    0.0020139881  0.783125  0.56625
##   0.2    0.0046525737  0.800000  0.60000
##   0.2    0.0107480486  0.811875  0.62375
##   0.2    0.0248293863  0.820000  0.64000
##   0.2    0.0573591027  0.831875  0.66375
##   0.2    0.1325069668  0.833750  0.66750
##   0.2    0.3061082795  0.856250  0.71250
##   0.3    0.0001633608  0.770625  0.54125
##   0.3    0.0003773846  0.777500  0.55500
##   0.3    0.0008718074  0.783125  0.56625
##   0.3    0.0020139881  0.793125  0.58625
##   0.3    0.0046525737  0.809375  0.61875
##   0.3    0.0107480486  0.820625  0.64125
##   0.3    0.0248293863  0.831875  0.66375
##   0.3    0.0573591027  0.843125  0.68625
##   0.3    0.1325069668  0.836875  0.67375
##   0.3    0.3061082795  0.867500  0.73500
##   0.4    0.0001633608  0.772500  0.54500
##   0.4    0.0003773846  0.779375  0.55875
##   0.4    0.0008718074  0.786250  0.57250
##   0.4    0.0020139881  0.801250  0.60250
##   0.4    0.0046525737  0.816250  0.63250
##   0.4    0.0107480486  0.828750  0.65750
##   0.4    0.0248293863  0.839375  0.67875
##   0.4    0.0573591027  0.850625  0.70125
##   0.4    0.1325069668  0.870625  0.74125
##   0.4    0.3061082795  0.873750  0.74750
##   0.5    0.0001633608  0.780000  0.56000
##   0.5    0.0003773846  0.783750  0.56750
##   0.5    0.0008718074  0.791250  0.58250
##   0.5    0.0020139881  0.803125  0.60625
##   0.5    0.0046525737  0.820625  0.64125
##   0.5    0.0107480486  0.834375  0.66875
##   0.5    0.0248293863  0.845625  0.69125
##   0.5    0.0573591027  0.851875  0.70375
##   0.5    0.1325069668  0.871875  0.74375
##   0.5    0.3061082795  0.875000  0.75000
##   0.6    0.0001633608  0.783125  0.56625
##   0.6    0.0003773846  0.788750  0.57750
##   0.6    0.0008718074  0.797500  0.59500
##   0.6    0.0020139881  0.812500  0.62500
##   0.6    0.0046525737  0.826875  0.65375
##   0.6    0.0107480486  0.841250  0.68250
##   0.6    0.0248293863  0.846250  0.69250
##   0.6    0.0573591027  0.861875  0.72375
##   0.6    0.1325069668  0.871875  0.74375
##   0.6    0.3061082795  0.875000  0.75000
##   0.7    0.0001633608  0.788125  0.57625
##   0.7    0.0003773846  0.796250  0.59250
##   0.7    0.0008718074  0.808125  0.61625
##   0.7    0.0020139881  0.814375  0.62875
##   0.7    0.0046525737  0.831875  0.66375
##   0.7    0.0107480486  0.846250  0.69250
##   0.7    0.0248293863  0.851250  0.70250
##   0.7    0.0573591027  0.863125  0.72625
##   0.7    0.1325069668  0.875000  0.75000
##   0.7    0.3061082795  0.875000  0.75000
##   0.8    0.0001633608  0.798750  0.59750
##   0.8    0.0003773846  0.805000  0.61000
##   0.8    0.0008718074  0.815000  0.63000
##   0.8    0.0020139881  0.821875  0.64375
##   0.8    0.0046525737  0.833750  0.66750
##   0.8    0.0107480486  0.846875  0.69375
##   0.8    0.0248293863  0.855625  0.71125
##   0.8    0.0573591027  0.862500  0.72500
##   0.8    0.1325069668  0.875000  0.75000
##   0.8    0.3061082795  0.874375  0.74875
##   0.9    0.0001633608  0.809375  0.61875
##   0.9    0.0003773846  0.812500  0.62500
##   0.9    0.0008718074  0.814375  0.62875
##   0.9    0.0020139881  0.820000  0.64000
##   0.9    0.0046525737  0.831875  0.66375
##   0.9    0.0107480486  0.845625  0.69125
##   0.9    0.0248293863  0.861875  0.72375
##   0.9    0.0573591027  0.866250  0.73250
##   0.9    0.1325069668  0.874375  0.74875
##   0.9    0.3061082795  0.871250  0.74250
##   1.0    0.0001633608  0.801875  0.60375
##   1.0    0.0003773846  0.808750  0.61750
##   1.0    0.0008718074  0.808125  0.61625
##   1.0    0.0020139881  0.815625  0.63125
##   1.0    0.0046525737  0.826875  0.65375
##   1.0    0.0107480486  0.846875  0.69375
##   1.0    0.0248293863  0.863750  0.72750
##   1.0    0.0573591027  0.866875  0.73375
##   1.0    0.1325069668  0.874375  0.74875
##   1.0    0.3061082795  0.860000  0.72000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.3061083.
plot(down_lr_clf)

The finer grid gives the same best cross-validated accuracy (0.875) as the default grid, so no further tuning is required.

down_lr_metrics_clf <- evaluate_model_clf(down_lr_clf, "Logistic Regression")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   18   14
##       pass    2   95
##                                           
##                Accuracy : 0.876           
##                  95% CI : (0.8064, 0.9274)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.19935         
##                                           
##                   Kappa : 0.6197          
##                                           
##  Mcnemar's Test P-Value : 0.00596         
##                                           
##               Precision : 0.5625          
##                  Recall : 0.9000          
##                      F1 : 0.6923          
##              Prevalence : 0.1550          
##          Detection Rate : 0.1395          
##    Detection Prevalence : 0.2481          
##       Balanced Accuracy : 0.8858          
##                                           
##        'Positive' Class : fail            
## 
Up-Sampled Train Set
up_lr_clf <- train_model_clf("glmnet", train_set_clf_up)
## glmnet 
## 
## 880 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy   Kappa    
##   0.10   0.0007324895  0.9494318  0.8988636
##   0.10   0.0073248947  0.9317045  0.8634091
##   0.10   0.0732489472  0.9007955  0.8015909
##   0.55   0.0007324895  0.9505682  0.9011364
##   0.55   0.0073248947  0.9282955  0.8565909
##   0.55   0.0732489472  0.9080682  0.8161364
##   1.00   0.0007324895  0.9540909  0.9081818
##   1.00   0.0073248947  0.9336364  0.8672727
##   1.00   0.0732489472  0.9146591  0.8293182
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.0007324895.
plot(up_lr_clf)

Perform hyperparameter tuning using tuneLength.

up_lr_clf <- train_model_clf("glmnet", train_set_clf_up, tuneLength = 10)
## glmnet 
## 
## 880 samples
##  32 predictor
##   2 classes: 'fail', 'pass' 
## 
## Pre-processing: centered (72), scaled (72) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy   Kappa    
##   0.1    0.0001692146  0.9523864  0.9047727
##   0.1    0.0003909076  0.9522727  0.9045455
##   0.1    0.0009030473  0.9486364  0.8972727
##   0.1    0.0020861563  0.9423864  0.8847727
##   0.1    0.0048192916  0.9367045  0.8734091
##   0.1    0.0111331887  0.9256818  0.8513636
##   0.1    0.0257191098  0.9120455  0.8240909
##   0.1    0.0594144794  0.9030682  0.8061364
##   0.1    0.1372551534  0.8931818  0.7863636
##   0.1    0.3170772064  0.8892045  0.7784091
##   0.2    0.0001692146  0.9532955  0.9065909
##   0.2    0.0003909076  0.9525000  0.9050000
##   0.2    0.0009030473  0.9484091  0.8968182
##   0.2    0.0020861563  0.9431818  0.8863636
##   0.2    0.0048192916  0.9364773  0.8729545
##   0.2    0.0111331887  0.9221591  0.8443182
##   0.2    0.0257191098  0.9101136  0.8202273
##   0.2    0.0594144794  0.9002273  0.8004545
##   0.2    0.1372551534  0.8981818  0.7963636
##   0.2    0.3170772064  0.8769318  0.7538636
##   0.3    0.0001692146  0.9540909  0.9081818
##   0.3    0.0003909076  0.9527273  0.9054545
##   0.3    0.0009030473  0.9484091  0.8968182
##   0.3    0.0020861563  0.9444318  0.8888636
##   0.3    0.0048192916  0.9356818  0.8713636
##   0.3    0.0111331887  0.9207955  0.8415909
##   0.3    0.0257191098  0.9115909  0.8231818
##   0.3    0.0594144794  0.9045455  0.8090909
##   0.3    0.1372551534  0.8906818  0.7813636
##   0.3    0.3170772064  0.8995455  0.7990909
##   0.4    0.0001692146  0.9543182  0.9086364
##   0.4    0.0003909076  0.9527273  0.9054545
##   0.4    0.0009030473  0.9492045  0.8984091
##   0.4    0.0020861563  0.9452273  0.8904545
##   0.4    0.0048192916  0.9354545  0.8709091
##   0.4    0.0111331887  0.9213636  0.8427273
##   0.4    0.0257191098  0.9188636  0.8377273
##   0.4    0.0594144794  0.9055682  0.8111364
##   0.4    0.1372551534  0.9021591  0.8043182
##   0.4    0.3170772064  0.9069318  0.8138636
##   0.5    0.0001692146  0.9547727  0.9095455
##   0.5    0.0003909076  0.9532955  0.9065909
##   0.5    0.0009030473  0.9496591  0.8993182
##   0.5    0.0020861563  0.9452273  0.8904545
##   0.5    0.0048192916  0.9348864  0.8697727
##   0.5    0.0111331887  0.9238636  0.8477273
##   0.5    0.0257191098  0.9237500  0.8475000
##   0.5    0.0594144794  0.9053409  0.8106818
##   0.5    0.1372551534  0.9079545  0.8159091
##   0.5    0.3170772064  0.9060227  0.8120455
##   0.6    0.0001692146  0.9553409  0.9106818
##   0.6    0.0003909076  0.9537500  0.9075000
##   0.6    0.0009030473  0.9506818  0.9013636
##   0.6    0.0020861563  0.9456818  0.8913636
##   0.6    0.0048192916  0.9340909  0.8681818
##   0.6    0.0111331887  0.9264773  0.8529545
##   0.6    0.0257191098  0.9260227  0.8520455
##   0.6    0.0594144794  0.9092045  0.8184091
##   0.6    0.1372551534  0.9076136  0.8152273
##   0.6    0.3170772064  0.9055682  0.8111364
##   0.7    0.0001692146  0.9551136  0.9102273
##   0.7    0.0003909076  0.9544318  0.9088636
##   0.7    0.0009030473  0.9514773  0.9029545
##   0.7    0.0020861563  0.9461364  0.8922727
##   0.7    0.0048192916  0.9331818  0.8663636
##   0.7    0.0111331887  0.9311364  0.8622727
##   0.7    0.0257191098  0.9279545  0.8559091
##   0.7    0.0594144794  0.9130682  0.8261364
##   0.7    0.1372551534  0.9128409  0.8256818
##   0.7    0.3170772064  0.9045455  0.8090909
##   0.8    0.0001692146  0.9553409  0.9106818
##   0.8    0.0003909076  0.9552273  0.9104545
##   0.8    0.0009030473  0.9521591  0.9043182
##   0.8    0.0020861563  0.9453409  0.8906818
##   0.8    0.0048192916  0.9337500  0.8675000
##   0.8    0.0111331887  0.9338636  0.8677273
##   0.8    0.0257191098  0.9297727  0.8595455
##   0.8    0.0594144794  0.9135227  0.8270455
##   0.8    0.1372551534  0.9112500  0.8225000
##   0.8    0.3170772064  0.9040909  0.8081818
##   0.9    0.0001692146  0.9554545  0.9109091
##   0.9    0.0003909076  0.9555682  0.9111364
##   0.9    0.0009030473  0.9525000  0.9050000
##   0.9    0.0020861563  0.9463636  0.8927273
##   0.9    0.0048192916  0.9353409  0.8706818
##   0.9    0.0111331887  0.9346591  0.8693182
##   0.9    0.0257191098  0.9310227  0.8620455
##   0.9    0.0594144794  0.9143182  0.8286364
##   0.9    0.1372551534  0.9071591  0.8143182
##   0.9    0.3170772064  0.9013636  0.8027273
##   1.0    0.0001692146  0.9561364  0.9122727
##   1.0    0.0003909076  0.9563636  0.9127273
##   1.0    0.0009030473  0.9543182  0.9086364
##   1.0    0.0020861563  0.9460227  0.8920455
##   1.0    0.0048192916  0.9359091  0.8718182
##   1.0    0.0111331887  0.9344318  0.8688636
##   1.0    0.0257191098  0.9306818  0.8613636
##   1.0    0.0594144794  0.9147727  0.8295455
##   1.0    0.1372551534  0.9043182  0.8086364
##   1.0    0.3170772064  0.8839773  0.7679545
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.0003909076.
plot(up_lr_clf)

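The finer grid improves the best cross-validated accuracy only slightly (0.9541 to 0.9564), so tuning stops here.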

up_lr_metrics_clf <- evaluate_model_clf(up_lr_clf, "Logistic Regression")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail   12    7
##       pass    8  102
##                                           
##                Accuracy : 0.8837          
##                  95% CI : (0.8155, 0.9334)
##     No Information Rate : 0.845           
##     P-Value [Acc > NIR] : 0.1352          
##                                           
##                   Kappa : 0.5469          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##               Precision : 0.63158         
##                  Recall : 0.60000         
##                      F1 : 0.61538         
##              Prevalence : 0.15504         
##          Detection Rate : 0.09302         
##    Detection Prevalence : 0.14729         
##       Balanced Accuracy : 0.76789         
##                                           
##        'Positive' Class : fail            
## 

Evaluation

Regression

By RMSE
df_eval_reg <- rbind(
  dt_reg$metrics,
  rf_reg_manual_tune$metrics,
  svm_reg$metrics,
  xgb_reg_manual_tune$metrics,
  lr_reg_manual_tune$metrics
)

df_eval_reg <- df_eval_reg[order(df_eval_reg$RMSE), ]
row.names(df_eval_reg) <- NULL
df_eval_reg
##                    Model     RMSE  Rsquared       MAE
## 1          Random Forest 1.106174 0.8666637 0.7717890
## 2      Linear Regression 1.117378 0.8643585 0.7226734
## 3 Support Vector Machine 1.177426 0.8521626 0.7618882
## 4                XGBoost 1.250256 0.8338254 0.8697652
## 5          Decision Tree 1.340746 0.8114244 0.8614620
df_eval_reg["RMSE"] <- round(df_eval_reg["RMSE"], 4)
df_eval_reg %>%
  ggplot(aes(x = reorder(Model, RMSE), y = RMSE, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = RMSE), vjust = -0.5) +
  ggtitle("RMSE by Model") +
  xlab("Model") +
  ylim(0, 1.5) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
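
For reference, the three columns in the table above can be computed from observed and predicted grades as sketched below. This is a minimal base R illustration, not the helper used in this report: the function name regression_metrics and the toy vectors are hypothetical, and Rsquared is assumed to be the squared correlation between predictions and observations.

```r
# Hypothetical helper: RMSE, R-squared and MAE for a set of predictions.
regression_metrics <- function(obs, pred) {
  err <- pred - obs
  c(
    RMSE = sqrt(mean(err^2)),     # penalises large errors more heavily
    Rsquared = cor(pred, obs)^2,  # assumed definition: squared correlation
    MAE = mean(abs(err))          # average absolute error in grade points
  )
}

# Toy grades, made up purely for illustration.
obs <- c(10, 12, 8, 15, 11)
pred <- c(11, 12, 9, 13, 11)

m <- regression_metrics(obs, pred)
m
```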

Classification

Define a function to plot accuracy.

show_accuracy <- function(df) {
  df_accuracy <- df %>% arrange(desc(Accuracy))
  print(df_accuracy)

  plot_accuracy <- df_accuracy %>%
    ggplot(aes(x = reorder(Model, -Accuracy), y = Accuracy, fill = Model)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = Accuracy), vjust = -0.5) +
    ggtitle("Accuracy by Model") +
    xlab("Model") +
    ylim(0, 1) +
    labs(fill = "Model") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

  print(plot_accuracy)
}

Define a function to plot recall for fail class.

show_recall_fail <- function(df) {
  df_recall_fail <- df %>% arrange(desc(Recall.Fail))
  print(df_recall_fail)

  plot_recall_fail <- df_recall_fail %>%
    ggplot(
      aes(x = reorder(Model, -Recall.Fail), y = Recall.Fail, fill = Model)
    ) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = Recall.Fail), vjust = -0.5) +
    ggtitle("Recall (Fail Class) by Model") +
    xlab("Model") +
    ylab("Recall") +
    ylim(0, 1) +
    labs(fill = "Model") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

  print(plot_recall_fail)
}

1. Original Train Set

df_eval_clf <- rbind(
  dt_metrics_clf,
  rf_metrics_clf,
  svm_poly_metrics_clf,
  knn_metrics_clf,
  lr_metrics_clf
)

df_eval_clf
##                    Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1          Decision Tree    0.938          0.929        0.65   0.765
## 2          Random Forest    0.922          0.778        0.70   0.737
## 3 Support Vector Machine    0.915          0.737        0.70   0.718
## 4    K-Nearest Neighbors    0.868          0.615        0.40   0.485
## 5    Logistic Regression    0.915          0.737        0.70   0.718
##   Precision.Pass Recall.Pass F1.Pass
## 1          0.939       0.991   0.964
## 2          0.946       0.963   0.955
## 3          0.945       0.954   0.950
## 4          0.897       0.954   0.924
## 5          0.945       0.954   0.950
By Accuracy
show_accuracy(df_eval_clf)
##                    Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1          Decision Tree    0.938          0.929        0.65   0.765
## 2          Random Forest    0.922          0.778        0.70   0.737
## 3 Support Vector Machine    0.915          0.737        0.70   0.718
## 4    Logistic Regression    0.915          0.737        0.70   0.718
## 5    K-Nearest Neighbors    0.868          0.615        0.40   0.485
##   Precision.Pass Recall.Pass F1.Pass
## 1          0.939       0.991   0.964
## 2          0.946       0.963   0.955
## 3          0.945       0.954   0.950
## 4          0.945       0.954   0.950
## 5          0.897       0.954   0.924

By Recall (Fail Class)
show_recall_fail(df_eval_clf)
##                    Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1          Random Forest    0.922          0.778        0.70   0.737
## 2 Support Vector Machine    0.915          0.737        0.70   0.718
## 3    Logistic Regression    0.915          0.737        0.70   0.718
## 4          Decision Tree    0.938          0.929        0.65   0.765
## 5    K-Nearest Neighbors    0.868          0.615        0.40   0.485
##   Precision.Pass Recall.Pass F1.Pass
## 1          0.946       0.963   0.955
## 2          0.945       0.954   0.950
## 3          0.945       0.954   0.950
## 4          0.939       0.991   0.964
## 5          0.897       0.954   0.924

2. Down-Sampled Train Set

down_df_eval_clf <- rbind(
  down_dt_metrics_clf,
  down_rf_metrics_clf,
  down_svm_poly_metrics_clf,
  down_knn_metrics_clf,
  down_lr_metrics_clf
)

down_df_eval_clf
##                    Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1          Decision Tree    0.860          0.531        0.85   0.654
## 2          Random Forest    0.876          0.567        0.85   0.680
## 3 Support Vector Machine    0.860          0.529        0.90   0.667
## 4    K-Nearest Neighbors    0.853          0.516        0.80   0.627
## 5    Logistic Regression    0.876          0.562        0.90   0.692
##   Precision.Pass Recall.Pass F1.Pass
## 1          0.969       0.862   0.913
## 2          0.970       0.881   0.923
## 3          0.979       0.853   0.912
## 4          0.959       0.862   0.908
## 5          0.979       0.872   0.922
By Accuracy
show_accuracy(down_df_eval_clf)
##                    Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1          Random Forest    0.876          0.567        0.85   0.680
## 2    Logistic Regression    0.876          0.562        0.90   0.692
## 3          Decision Tree    0.860          0.531        0.85   0.654
## 4 Support Vector Machine    0.860          0.529        0.90   0.667
## 5    K-Nearest Neighbors    0.853          0.516        0.80   0.627
##   Precision.Pass Recall.Pass F1.Pass
## 1          0.970       0.881   0.923
## 2          0.979       0.872   0.922
## 3          0.969       0.862   0.913
## 4          0.979       0.853   0.912
## 5          0.959       0.862   0.908

By Recall (Fail Class)
show_recall_fail(down_df_eval_clf)
##                    Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1 Support Vector Machine    0.860          0.529        0.90   0.667
## 2    Logistic Regression    0.876          0.562        0.90   0.692
## 3          Decision Tree    0.860          0.531        0.85   0.654
## 4          Random Forest    0.876          0.567        0.85   0.680
## 5    K-Nearest Neighbors    0.853          0.516        0.80   0.627
##   Precision.Pass Recall.Pass F1.Pass
## 1          0.979       0.853   0.912
## 2          0.979       0.872   0.922
## 3          0.969       0.862   0.913
## 4          0.970       0.881   0.923
## 5          0.959       0.862   0.908

3. Up-Sampled Train Set

up_df_eval_clf <- rbind(
  up_dt_metrics_clf,
  up_rf_metrics_clf,
  up_svm_poly_metrics_clf,
  up_knn_metrics_clf,
  up_lr_metrics_clf
)

up_df_eval_clf
##                    Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1          Decision Tree    0.884          0.593        0.80   0.681
## 2          Random Forest    0.915          0.800        0.60   0.686
## 3 Support Vector Machine    0.891          0.875        0.35   0.500
## 4    K-Nearest Neighbors    0.891          0.714        0.50   0.588
## 5    Logistic Regression    0.884          0.632        0.60   0.615
##   Precision.Pass Recall.Pass F1.Pass
## 1          0.961       0.899   0.929
## 2          0.930       0.972   0.951
## 3          0.893       0.991   0.939
## 4          0.913       0.963   0.938
## 5          0.927       0.936   0.932
By Accuracy
show_accuracy(up_df_eval_clf)
##                    Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1          Random Forest    0.915          0.800        0.60   0.686
## 2 Support Vector Machine    0.891          0.875        0.35   0.500
## 3    K-Nearest Neighbors    0.891          0.714        0.50   0.588
## 4          Decision Tree    0.884          0.593        0.80   0.681
## 5    Logistic Regression    0.884          0.632        0.60   0.615
##   Precision.Pass Recall.Pass F1.Pass
## 1          0.930       0.972   0.951
## 2          0.893       0.991   0.939
## 3          0.913       0.963   0.938
## 4          0.961       0.899   0.929
## 5          0.927       0.936   0.932

By Recall (Fail Class)
show_recall_fail(up_df_eval_clf)
##                    Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1          Decision Tree    0.884          0.593        0.80   0.681
## 2          Random Forest    0.915          0.800        0.60   0.686
## 3    Logistic Regression    0.884          0.632        0.60   0.615
## 4    K-Nearest Neighbors    0.891          0.714        0.50   0.588
## 5 Support Vector Machine    0.891          0.875        0.35   0.500
##   Precision.Pass Recall.Pass F1.Pass
## 1          0.961       0.899   0.929
## 2          0.930       0.972   0.951
## 3          0.927       0.936   0.932
## 4          0.913       0.963   0.938
## 5          0.893       0.991   0.939

Conclusion

In this study, we developed and evaluated different regression and binary classification models to predict student performance in secondary education at two Portuguese schools. In the regression models, the target attribute is G3, the final grade. In the binary classification models, the target attribute “result” is created by splitting G3 into the binary labels “pass” and “fail”. All models were trained with hyperparameter tuning to optimize their performance.

From the EDA, we discovered that the attributes G1 and G2 are highly correlated with the target attribute G3: students who achieved higher G1 and G2 results are most likely to achieve a higher G3 result.

We aimed to improve on the results of the previous study by Cortez and Silva (2008), whose best regression model achieved an RMSE of 1.32 and whose best binary classification model achieved an accuracy of 93%. Our best models slightly outperformed these results: Random Forest achieved an RMSE of 1.11 for regression, and Decision Tree achieved an accuracy of 93.8% for binary classification.

In addition, we noticed that the training dataset is imbalanced, with 440 pass labels and 80 fail labels. We therefore also trained the binary classification models on down-sampled and up-sampled data, and compared the results of the classification models across the original, down-sampled, and up-sampled training sets.
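
The two resampling strategies can be sketched in base R as follows. This is an illustration under stated assumptions rather than the code used to build train_set_clf_down and train_set_clf_up: the toy data frame, the seed, and all variable names are hypothetical. Only the training data is resampled; the test set keeps its original class balance.

```r
set.seed(42)

# Toy stand-in for the training labels: 440 "pass" and 80 "fail".
train <- data.frame(
  id = seq_len(520),
  result = factor(rep(c("pass", "fail"), times = c(440, 80)))
)

# Row indices grouped by class.
idx_by_class <- split(seq_len(nrow(train)), train$result)
n_min <- min(lengths(idx_by_class))
n_max <- max(lengths(idx_by_class))

# Down-sampling: keep n_min rows per class, drawn without replacement.
down_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), n_min)]
}))
train_down <- train[down_idx, ]

# Up-sampling: keep every row and draw extra rows with replacement
# until each class has n_max rows.
up_idx <- unlist(lapply(idx_by_class, function(i) {
  c(i, i[sample.int(length(i), n_max - length(i), replace = TRUE)])
}))
train_up <- train[up_idx, ]

table(train_down$result)
table(train_up$result)
```

The resulting sizes, 160 rows after down-sampling and 880 rows after up-sampling, correspond to the sample counts reported in the training output for the two resampled sets.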

Because the classes in the original sample data are imbalanced, accuracy is not the best metric for evaluating model performance. Furthermore, we want to prioritize correctly identifying students in the fail class. Hence, we decided to use recall for the fail class as the main metric.
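
To illustrate why accuracy can mislead here, consider a trivial classifier that predicts “pass” for every student. The sketch below assumes the test-set class counts implied by the confusion matrices (20 fail, 109 pass); the variable names are ours for illustration.

```r
# Test-set reference labels implied by the confusion matrices: 20 fail, 109 pass.
reference <- factor(
  rep(c("fail", "pass"), times = c(20, 109)),
  levels = c("fail", "pass")
)

# Trivial baseline: predict "pass" for everyone.
always_pass <- factor(rep("pass", length(reference)), levels = levels(reference))

baseline_accuracy <- mean(always_pass == reference)
baseline_recall_fail <- sum(always_pass == "fail" & reference == "fail") /
  sum(reference == "fail")

round(c(Accuracy = baseline_accuracy, Recall.Fail = baseline_recall_fail), 3)
```

This baseline reaches the No Information Rate of about 0.845 while identifying none of the failing students, which is why recall for the fail class is the more informative metric for this problem.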

For classification using the original data, Random Forest, Support Vector Machine, and Logistic Regression performed equally well in terms of recall for the fail class (0.70). Across all three training sets, Logistic Regression and Support Vector Machine trained on the down-sampled data produced the best results, with 90% recall.

In future work, the G1 and G2 attributes can be removed from the features so that the impact of the other attributes on student performance can be studied and more specific problems addressed. For instance, future work could examine whether family background affects students’ academic results by predicting student performance from family ecological factors.