In this group assignment for WQD7004: Programming for Data Science, the task is to use R to build both a classification model and a regression model for prediction from a single dataset. After discussion, we chose the student data collected by Cortez and Silva (2008) in their research.
For context, Cortez and Silva (2008) studied student achievement in secondary education at two Portuguese schools. Their concern was that Portugal had one of the highest student failure rates in Europe despite continuous improvement over the preceding decade. In 2006, the early school leaving rate in Portugal was 40% for 18 to 24 year olds, while the European Union average was only 15% (Eurostat, 2007). Mathematics and Portuguese are treated as the two most important subjects because they are core classes that provide the fundamental knowledge needed for success in other school subjects, such as Chemistry and History.
Modeling student performance is therefore an important tool for educators, students, and researchers alike. It can also support further research into the factors that affect student achievement. The researchers in the previous study addressed the prediction of secondary student grades using past school performance together with demographic, social, and other school-related data. They tested three data mining goals (binary classification, five-level classification, and regression) and four data mining methods (decision trees, random forests, neural networks, and support vector machines).
The previous study concluded that the best predictive accuracy for the final grade is achieved when past performance grades are included, and that predictive performance decreases as each past grade is removed from the inputs. However, the researchers also noted that other variables, such as the number of school absences, the parents’ jobs and education, and alcohol consumption, influence the prediction of a student’s final grade, although not as strongly as past performance.
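The three goals differ only in how the final grade is encoded as a target. A minimal sketch on toy grades follows. The pass mark of 10 matches the label created later in this report; the five-level cut points (0-9, 10-11, 12-13, 14-15, 16-20) are an assumption about the binning used in the previous study and should be checked against the paper before reuse.

```r
# Hypothetical toy grades on the 0-20 scale (not taken from the dataset)
g3 <- c(0, 9, 10, 11, 12, 14, 16, 20)

# Binary goal: pass if the final grade is at least 10
binary <- factor(ifelse(g3 >= 10, "pass", "fail"), levels = c("fail", "pass"))

# Five-level goal: ASSUMED cut points 0-9, 10-11, 12-13, 14-15, 16-20
five_level <- cut(
  g3,
  breaks = c(-Inf, 9, 11, 13, 15, 20),
  labels = c("V", "IV", "III", "II", "I"),
  ordered_result = TRUE
)

# Regression goal: the numeric grade is used as is
data.frame(g3, binary, five_level)
```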
For this group assignment, the goal is to improve on that predictive performance. We focus only on the Portuguese language dataset because it has more records than the Mathematics dataset.
Education plays a vital role in the growth and development of a country. It is therefore valuable to predict student performance so that education practitioners can formulate plans to improve the education system, and to identify the factors that affect student performance. Hence, we aim to explore how past grades and demographic, social, and school-related features correlate with a student’s final grade. We also develop regression and classification models to predict student performance.
This dataset describes student performance in secondary education at two Portuguese schools during the 2005-2006 school year. It was collected through school reports and questionnaires.
In this project, we are interested in student performance in the Portuguese subject. The Portuguese dataset consists of 649 observations and 33 variables, described in the table below.
| Variable | Description |
|---|---|
| school | student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira) |
| sex | student’s sex (binary: ‘F’ - female or ‘M’ - male) |
| age | student’s age (numeric: from 15 to 22) |
| address | student’s home address type (binary: ‘U’ - urban or ‘R’ - rural) |
| famsize | family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3) |
| Pstatus | parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart) |
| Medu | mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) |
| Fedu | father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) |
| Mjob | mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’) |
| Fjob | father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’) |
| reason | reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’) |
| guardian | student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’) |
| traveltime | home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour) |
| studytime | weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) |
| failures | number of past class failures (numeric: n if 0<=n<=2, else 3) |
| schoolsup | extra educational support (binary: yes or no) |
| famsup | family educational support (binary: yes or no) |
| paid | extra paid classes within the course subject (Math or Portuguese) (binary: yes or no) |
| activities | extra-curricular activities (binary: yes or no) |
| nursery | attended nursery school (binary: yes or no) |
| higher | wants to take higher education (binary: yes or no) |
| internet | Internet access at home (binary: yes or no) |
| romantic | with a romantic relationship (binary: yes or no) |
| famrel | quality of family relationships (numeric: from 1 - very bad to 5 - excellent) |
| freetime | free time after school (numeric: from 1 - very low to 5 - very high) |
| goout | going out with friends (numeric: from 1 - very low to 5 - very high) |
| Dalc | workday alcohol consumption (numeric: from 1 - very low to 5 - very high) |
| Walc | weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) |
| health | current health status (numeric: from 1 - very bad to 5 - very good) |
| absences | number of school absences (numeric: from 0 to 93 in the dataset description; the maximum observed in the Portuguese data is 32) |
| G1 | first period grade (numeric: from 0 to 20) |
| G2 | second period grade (numeric: from 0 to 20) |
| G3 | final grade (numeric: from 0 to 20, output target) |
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(caret)
## Loading required package: lattice
library(reshape)
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
df <- read.csv("../data/student-por.csv", sep = ";", stringsAsFactors = TRUE)
head(df)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 yes no no no
## 2 father 1 2 0 no yes no no
## 3 mother 1 2 0 yes no no no
## 4 mother 1 3 0 no yes no yes
## 5 father 1 2 0 no yes no no
## 6 mother 1 2 0 no yes no yes
## nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1 yes yes no no 4 3 4 1 1 3
## 2 no yes yes no 5 3 3 1 1 3
## 3 yes yes yes no 4 3 2 2 3 3
## 4 yes yes yes yes 3 2 2 1 1 5
## 5 yes yes no no 4 3 2 1 2 5
## 6 yes yes yes no 5 4 2 1 2 5
## absences G1 G2 G3
## 1 4 0 11 11
## 2 2 9 11 11
## 3 6 12 13 12
## 4 0 14 14 14
## 5 0 11 13 13
## 6 6 12 12 13
str(df)
## 'data.frame': 649 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
summary(df)
## school sex age address famsize Pstatus Medu
## GP:423 F:383 Min. :15.00 R:197 GT3:457 A: 80 Min. :0.000
## MS:226 M:266 1st Qu.:16.00 U:452 LE3:192 T:569 1st Qu.:2.000
## Median :17.00 Median :2.000
## Mean :16.74 Mean :2.515
## 3rd Qu.:18.00 3rd Qu.:4.000
## Max. :22.00 Max. :4.000
## Fedu Mjob Fjob reason guardian
## Min. :0.000 at_home :135 at_home : 42 course :285 father:153
## 1st Qu.:1.000 health : 48 health : 23 home :149 mother:455
## Median :2.000 other :258 other :367 other : 72 other : 41
## Mean :2.307 services:136 services:181 reputation:143
## 3rd Qu.:3.000 teacher : 72 teacher : 36
## Max. :4.000
## traveltime studytime failures schoolsup famsup paid
## Min. :1.000 Min. :1.000 Min. :0.0000 no :581 no :251 no :610
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 yes: 68 yes:398 yes: 39
## Median :1.000 Median :2.000 Median :0.0000
## Mean :1.569 Mean :1.931 Mean :0.2219
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## activities nursery higher internet romantic famrel
## no :334 no :128 no : 69 no :151 no :410 Min. :1.000
## yes:315 yes:521 yes:580 yes:498 yes:239 1st Qu.:4.000
## Median :4.000
## Mean :3.931
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc health
## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:2.000
## Median :3.00 Median :3.000 Median :1.000 Median :2.00 Median :4.000
## Mean :3.18 Mean :3.185 Mean :1.502 Mean :2.28 Mean :3.536
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
## absences G1 G2 G3
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00
## Median : 2.000 Median :11.0 Median :11.00 Median :12.00
## Mean : 3.659 Mean :11.4 Mean :11.57 Mean :11.91
## 3rd Qu.: 6.000 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :32.000 Max. :19.0 Max. :19.00 Max. :19.00
Convert ordinal variables into ordered factors.
df$famsize <- factor(df$famsize, ordered = TRUE, levels = c("LE3", "GT3"))
df$Medu <- factor(df$Medu, ordered = TRUE, levels = 0:4)
df$Fedu <- factor(df$Fedu, ordered = TRUE, levels = 0:4)
df$traveltime <- factor(df$traveltime, ordered = TRUE, levels = 1:4)
df$studytime <- factor(df$studytime, ordered = TRUE, levels = 1:4)
# Note: the data only contain 0-3, so level "4" stays empty (see the summary below)
df$failures <- factor(df$failures, ordered = TRUE, levels = 0:4)
df$famrel <- factor(df$famrel, ordered = TRUE, levels = 1:5)
df$freetime <- factor(df$freetime, ordered = TRUE, levels = 1:5)
df$goout <- factor(df$goout, ordered = TRUE, levels = 1:5)
df$Dalc <- factor(df$Dalc, ordered = TRUE, levels = 1:5)
df$Walc <- factor(df$Walc, ordered = TRUE, levels = 1:5)
df$health <- factor(df$health, ordered = TRUE, levels = 1:5)
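To make the effect of this conversion concrete, the sketch below uses hypothetical toy vectors (not rows from the dataset) to show what an ordered factor adds over a plain integer or unordered factor, and how several columns sharing the same scale could be converted in one pass. The `toy` data frame and `five_point` vector are illustrative names only.

```r
# Hypothetical study-time codes, using the 1-4 coding from the variable table
studytime_raw <- c(2, 4, 1, 3, 2)
studytime <- factor(studytime_raw, ordered = TRUE, levels = 1:4)

# Ordered factors support comparisons between levels
above_two <- studytime > "2"
above_two

# The stored integer codes follow the declared level order
as.integer(studytime)

# Several columns on the same 1-5 scale can be converted in one pass
toy <- data.frame(Dalc = c(1, 5, 2), Walc = c(3, 1, 4))
five_point <- c("Dalc", "Walc")
toy[five_point] <- lapply(toy[five_point], factor, ordered = TRUE, levels = 1:5)
str(toy)
```

Declaring the levels explicitly keeps categories that happen to be absent from a sample, which keeps the factor coding stable across train and test splits.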
Create a binary pass/fail label from G3 (pass if G3 is at least 10).
df$result <- ifelse(df$G3 >= 10, "pass", "fail")
df$result <- as.factor(df$result)
str(df)
## 'data.frame': 649 obs. of 34 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Ord.factor w/ 2 levels "LE3"<"GT3": 2 2 1 2 2 1 1 2 1 2 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 2 2 5 4 5 3 5 4 4 ...
## $ Fedu : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 2 2 3 4 4 3 5 3 5 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 3 1 2 2 1 1 1 1 ...
## $ health : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
## $ result : Factor w/ 2 levels "fail","pass": 2 2 2 2 2 2 2 2 2 2 ...
summary(df)
## school sex age address famsize Pstatus Medu Fedu
## GP:423 F:383 Min. :15.00 R:197 LE3:192 A: 80 0: 6 0: 7
## MS:226 M:266 1st Qu.:16.00 U:452 GT3:457 T:569 1:143 1:174
## Median :17.00 2:186 2:209
## Mean :16.74 3:139 3:131
## 3rd Qu.:18.00 4:175 4:128
## Max. :22.00
## Mjob Fjob reason guardian traveltime
## at_home :135 at_home : 42 course :285 father:153 1:366
## health : 48 health : 23 home :149 mother:455 2:213
## other :258 other :367 other : 72 other : 41 3: 54
## services:136 services:181 reputation:143 4: 16
## teacher : 72 teacher : 36
##
## studytime failures schoolsup famsup paid activities nursery
## 1:212 0:549 no :581 no :251 no :610 no :334 no :128
## 2:305 1: 70 yes: 68 yes:398 yes: 39 yes:315 yes:521
## 3: 97 2: 16
## 4: 35 3: 14
## 4: 0
##
## higher internet romantic famrel freetime goout Dalc Walc health
## no : 69 no :151 no :410 1: 22 1: 45 1: 48 1:451 1:247 1: 90
## yes:580 yes:498 yes:239 2: 29 2:107 2:145 2:121 2:150 2: 78
## 3:101 3:251 3:205 3: 43 3:120 3:124
## 4:317 4:178 4:141 4: 17 4: 87 4:108
## 5:180 5: 68 5:110 5: 17 5: 45 5:249
##
## absences G1 G2 G3 result
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00 fail:100
## 1st Qu.: 0.000 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00 pass:549
## Median : 2.000 Median :11.0 Median :11.00 Median :12.00
## Mean : 3.659 Mean :11.4 Mean :11.57 Mean :11.91
## 3rd Qu.: 6.000 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :32.000 Max. :19.0 Max. :19.00 Max. :19.00
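The class counts in the summary above (100 fail, 549 pass) already indicate how imbalanced the label is. The short calculation below, which only reuses those two counts, shows the class proportions and the accuracy a trivial majority-class predictor would reach; `baseline_accuracy` is an illustrative name.

```r
# Class counts as reported by summary(df) above
counts <- c(fail = 100, pass = 549)

# Share of each class
proportions <- round(counts / sum(counts), 3)
proportions

# Accuracy obtained by always predicting the majority class
baseline_accuracy <- max(counts) / sum(counts)
baseline_accuracy
```

Any classifier should be judged against this baseline rather than against 50%, which is one reason metrics other than plain accuracy are worth reporting for the minority class.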
ggplot(df, aes(x = school)) +
geom_bar(fill = "blue") +
labs(
x = "School",
y = "No. of students",
title = "1. Student's school"
)
ggplot(df, aes(x = sex)) +
geom_bar(fill = "blue") +
labs(
x = "Sex",
y = "No. of students",
title = "2. Student's sex"
)
ggplot(df, aes(x = age)) +
geom_bar(
fill = "green",
color = "black"
) +
labs(
x = "Age",
y = "Frequency",
title = "3. Student's age"
)
ggplot(df, aes(x = address)) +
geom_bar(fill = "purple") +
labs(
x = "Address",
y = "Frequency",
title = "4. Student's home address type"
)
ggplot(df, aes(x = famsize)) +
geom_bar(fill = "purple") +
labs(
x = "Family size",
y = "Frequency",
title = "5. Family size "
)
ggplot(df, aes(x = Pstatus)) +
geom_bar(fill = "purple") +
labs(
x = "Parent's Status",
y = "Frequency",
title = "6. Parent's cohabitation status "
)
ggplot(df, aes(x = Medu)) +
geom_bar(
fill = "white",
color = "black"
) +
labs(
x = "Medu",
y = "Frequency",
title = "7. Mother's education level"
)
ggplot(df, aes(x = Fedu)) +
geom_bar(
fill = "white",
color = "black"
) +
labs(
x = "Fedu",
y = "Frequency",
title = "8. Father's education level"
)
ggplot(df, aes(x = Mjob)) +
geom_bar(fill = "purple") +
labs(
x = "Mjob",
y = "Frequency",
title = "9. Mother's job"
)
ggplot(df, aes(x = Fjob)) +
geom_bar(fill = "purple") +
labs(
x = "Fjob",
y = "Frequency",
title = "10. Father's job"
)
ggplot(df, aes(x = reason)) +
geom_bar(fill = "purple") +
labs(
x = "Reason",
y = "Frequency",
title = "11. Reason for selecting school"
)
ggplot(df, aes(x = guardian)) +
geom_bar(fill = "purple") +
labs(
x = "Guardian",
y = "Frequency",
title = "12. Student's guardian"
)
ggplot(df, aes(x = traveltime)) +
geom_bar(
fill = "white",
color = "black"
) +
labs(
x = "Traveltime",
y = "Frequency",
title = "13. Home to school travel time"
)
ggplot(df, aes(x = studytime)) +
geom_bar(
fill = "white",
color = "black"
) +
labs(
x = "Studytime",
y = "Frequency",
title = "14. Weekly study time"
)
ggplot(df, aes(x = failures)) +
geom_bar(
fill = "green",
color = "black"
) +
labs(
x = "Failures",
y = "Frequency",
title = "15. Number of past class failures"
)
ggplot(df, aes(x = schoolsup)) +
geom_bar(fill = "blue") +
labs(
x = "Schoolsup",
y = "Frequency",
title = "16. Extra educational support"
)
ggplot(df, aes(x = famsup)) +
geom_bar(fill = "blue") +
labs(
x = "Famsup",
y = "Frequency",
title = "17. Family educational support"
)
ggplot(df, aes(x = paid)) +
geom_bar(fill = "blue") +
labs(
x = "Paid",
y = "Frequency",
title = "18. Extra paid Portuguese classes "
)
ggplot(df, aes(x = activities)) +
geom_bar(fill = "blue") +
labs(
x = "Activities",
y = "Frequency",
title = "19. Extra-curricular activities"
)
ggplot(df, aes(x = nursery)) +
geom_bar(fill = "blue") +
labs(
x = "Nursery",
y = "Frequency",
title = "20. Attended nursery school"
)
ggplot(df, aes(x = higher)) +
geom_bar(fill = "blue") +
labs(
x = "Higher",
y = "Frequency",
title = "21. Wants to take higher education"
)
ggplot(df, aes(x = internet)) +
geom_bar(fill = "blue") +
labs(
x = "Internet",
y = "Frequency",
title = "22. Internet access at home"
)
ggplot(df, aes(x = romantic)) +
geom_bar(fill = "blue") +
labs(
x = "Romantic",
y = "Frequency",
title = "23. In a romantic relationship"
)
ggplot(df, aes(x = famrel)) +
geom_bar(
fill = "white",
color = "black"
) +
labs(
x = "Famrel",
y = "Frequency",
title = "24. Quality of family relationship"
)
ggplot(df, aes(x = freetime)) +
geom_bar(
fill = "white",
color = "black"
) +
labs(
x = "Freetime",
y = "Frequency",
title = "25. Free time after school"
)
ggplot(df, aes(x = goout)) +
geom_bar(
fill = "white",
color = "black"
) +
labs(
x = "Goout",
y = "Frequency",
title = "26. Going out with friends"
)
ggplot(df, aes(x = Dalc)) +
geom_bar(
fill = "white",
color = "black"
) +
labs(
x = "Dalc",
y = "Frequency",
title = "27. Workday alcohol consumption"
)
ggplot(df, aes(x = Walc)) +
geom_bar(
fill = "white",
color = "black"
) +
labs(
x = "Walc",
y = "Frequency",
title = "28. Weekend alcohol consumption"
)
ggplot(df, aes(x = health)) +
geom_bar(
fill = "white",
color = "black"
) +
labs(
x = "health",
y = "Frequency",
title = "29. Current health status"
)
ggplot(df, aes(x = absences)) +
geom_bar(
fill = "green",
color = "black"
) +
labs(
x = "Absences",
y = "Frequency",
title = "30. Number of school absences"
)
ggplot(df, aes(x = G1)) +
geom_bar(
fill = "pink",
color = "black"
) +
labs(
x = "G1",
y = "Frequency",
title = "31. First period grade"
)
ggplot(df, aes(x = G2)) +
geom_bar(
fill = "pink",
color = "black"
) +
labs(
x = "G2",
y = "Frequency",
title = "32. Second period grade"
)
ggplot(df, aes(x = G3)) +
geom_bar(
fill = "orange",
color = "black"
) +
labs(
x = "G3",
y = "Frequency",
title = "33. Final grade-output target"
)
ggplot(df, aes(x = result)) +
geom_bar(
fill = "blue",
color = "black"
) +
labs(
x = "Result",
y = "Frequency",
title = "34. Final grade-result"
)
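The univariate plots above all follow the same pattern, so they could also be generated from a small helper. The sketch below is one possible way to do that, shown on a hypothetical stand-in data frame; `plot_count()`, `toy`, and `titles` are illustrative names and not part of the original analysis.

```r
library(ggplot2)

# Hypothetical stand-in for df with two of the categorical columns
toy <- data.frame(
  school = factor(c("GP", "GP", "MS", "GP")),
  sex = factor(c("F", "M", "F", "F"))
)

# Illustrative helper: one bar chart of counts for a column given by name
plot_count <- function(data, var, title, fill = "blue") {
  ggplot(data, aes(x = .data[[var]])) +
    geom_bar(fill = fill, color = "black") +
    labs(x = var, y = "Frequency", title = title)
}

# One title per column; the plots are built in a single pass
titles <- c(school = "1. Student's school", sex = "2. Student's sex")
plots <- lapply(names(titles), function(v) plot_count(toy, v, titles[[v]]))
plots[[1]]
```

Passing the column name as a string and looking it up with `.data[[var]]` keeps the axis label meaningful and avoids repeating the same block for every variable.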
General function for bar graph plotting
# Dodged (unstacked) bar graph for bivariate analysis
Unstacked_bi_bar_graph <- function(`x.axis` = "", Result = "") {
  # Result is the dependent variable, mapped to the fill colour
  # x.axis is the independent variable, drawn on the x-axis
  # position = "dodge" unstacks the bars so the classes sit side by side
  # 1. Base plot with the independent variable on the x-axis
  bar <- ggplot(df, aes(`x.axis`))
  # 2. Count the observations in each class
  graph_output <- bar + geom_bar(aes(fill = `Result`), position = "dodge") +
    labs(y = "Frequency")
  # 3. Extract the computed counts from the plot layer
  graph_label <- layer_data(graph_output)
  # 4. Annotate each bar with its count at a fixed height (y = 15)
  graph_output <- graph_output + annotate(
    geom = "text", label = graph_label$count,
    x = graph_label$x, y = 15
  )
  return(graph_output)
}
bar.bivar <- Unstacked_bi_bar_graph(df$school, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "1. School vs Final Grade Result") +
labs(x = "Student's School")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$sex, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "2. Gender vs Final Grade Result") +
labs(x = "Student's Sex")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$age, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "3. Age vs Final Grade Result") +
labs(x = "Student's Age")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$address, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "4. Student's Home Address Type vs Final Grade Result") +
labs(x = "Address Type")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$famsize, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "5. Student's Family Size vs Final Grade Result") +
labs(x = "Family Size")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$Pstatus, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "6. Parent's Cohabitation Status vs Final Grade Result") +
labs(x = "Parent's Cohabitation Status")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$Medu, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "7. Mother's Education vs Final Grade Result") +
labs(x = "Mother's Education Level")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$Fedu, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "8. Father's Education vs Final Grade Result") +
labs(x = "Father's Education Level")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$Mjob, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "9. Mother's Job vs Final Grade Result") +
labs(x = "Mother's Job")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$Fjob, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "10. Father's Job vs Final Grade Result") +
labs(x = "Father's Job")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$reason, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "11. Reason to Choose Selected School vs Final Grade Result") +
labs(x = "Reason Types")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$guardian, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "12. Student's Guardian vs Final Grade Result") +
labs(x = "Guardian")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$traveltime, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "13. Home to School Travel Time vs Final Grade Result") +
labs(x = "Travel Time")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$studytime, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "14. Weekly Study Time vs Final Grade Result") +
labs(x = "Weekly Study Time")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$failures, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "15. Number of Past Class Failures vs Final Grade Result") +
labs(x = "Number of Past Class Failures")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$schoolsup, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "16. Extra Educational Support vs Final Grade Result") +
labs(x = "Extra Educational Support")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$famsup, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "17. Family Educational Support vs Final Grade Result") +
labs(x = "Family Educational Support")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$paid, df$result)
bar.bivar <- bar.bivar +
ggtitle(label = "18. Extra Paid Classes for Portuguese Subject vs Final Grade Result") +
labs(x = "Joined Extra Paid Classes?")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$activities, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "19. Extra-Curricular Activities vs Final Grade Result") +
labs(x = "Joined Extra-Curricular Activities?")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$nursery, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "20. Attended Nursery School vs Final Grade Result") +
labs(x = "Attended Nursery School?")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$internet, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "21. Internet Access at Home vs Final Grade Result") +
labs(x = "Internet Access at Home")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$romantic, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "22. With a Romantic Relationship vs Final Grade Result") +
labs(x = "With a Romantic Relationship?")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$famrel, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "23. Quality of Family Relationships vs Final Grade Result") +
labs(x = "Quality of Family Relationships")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$freetime, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "24. Free Time After School vs Final Grade Result") +
labs(x = "Free Time After School")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$goout, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "25. Going Out with Friends vs Final Grade Result") +
labs(x = "Going Out with Friends")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$Dalc, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "26. Workday Alcohol Consumption vs Final Grade Result") +
labs(x = "Workday Alcohol Consumption")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$Walc, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "27. Weekend Alcohol Consumption vs Final Grade Result") +
labs(x = "Weekend Alcohol Consumption")
bar.bivar
bar.bivar <- Unstacked_bi_bar_graph(df$health, df$result)
bar.bivar <- bar.bivar + ggtitle(label = "28. Current Health Status vs Final Grade Result") +
labs(x = "Current Health Status")
bar.bivar
ggplot(df, aes(y = absences, x = result)) +
geom_boxplot() +
ggtitle(label = "29. Number of School Absences vs Final Grade Result") +
labs(y = "Number of School Absences", x = "Result") +
scale_y_continuous(breaks = seq(0, 36, by = 2))
ggplot(df, aes(y = G1, x = result)) +
geom_boxplot() +
labs(
title = "30. First Period Grade vs Final Grade Result",
y = "First Period Grade",
x = "Result"
) +
scale_y_continuous(breaks = seq(0, 22, by = 2))
ggplot(df, aes(y = G2, x = result)) +
geom_boxplot() +
labs(
title = "31. Second Period Grade vs Final Grade Result",
y = "Second Period Grade",
x = "Result"
) +
scale_y_continuous(breaks = seq(0, 22, by = 2))
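Raw counts in dodged bar charts are hard to compare when the groups differ in size (for example, 423 students in GP versus 226 in MS). A complementary view is the pass rate within each category. The sketch below computes row-wise proportions on a hypothetical toy data frame; the values are illustrative and not taken from the dataset.

```r
# Hypothetical toy data: school and pass/fail result for six students
toy <- data.frame(
  school = c("GP", "GP", "GP", "GP", "MS", "MS"),
  result = c("pass", "pass", "pass", "fail", "pass", "fail")
)

# Cross-tabulate, then convert each row to proportions (margin = 1)
counts <- table(toy$school, toy$result)
rates <- prop.table(counts, margin = 1)
round(rates, 2)
```

Applied to the real data, the same two calls would give the pass rate per school, sex, or any other categorical feature.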
Convert the processed dataset to a numeric matrix named ‘heat_map_data’, dropping the first column (school). Factors are replaced by their integer codes.
heat_map_data <- df %>% select(-1)
heat_map_data <- data.matrix(heat_map_data)
str(heat_map_data)
## int [1:649, 1:33] 1 1 1 1 1 2 2 1 2 2 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:33] "sex" "age" "address" "famsize" ...
Calculate the pairwise Pearson correlation between all attributes.
heat_map_data <- round(cor(heat_map_data, method = "pearson"), 2)
head(heat_map_data)
## sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## sex 1.00 -0.04 0.03 -0.10 0.06 0.12 0.08 0.15 0.08 0.01
## age -0.04 1.00 -0.03 0.00 -0.01 -0.11 -0.12 -0.07 -0.05 -0.03
## address 0.03 -0.03 1.00 -0.05 -0.09 0.19 0.14 0.16 -0.01 0.00
## famsize -0.10 0.00 -0.05 1.00 0.24 0.01 0.04 -0.02 0.06 -0.03
## Pstatus 0.06 -0.01 -0.09 0.24 1.00 -0.06 -0.03 -0.03 0.05 -0.03
## Medu 0.12 -0.11 0.19 0.01 -0.06 1.00 0.65 0.46 0.15 0.13
## guardian traveltime studytime failures schoolsup famsup paid
## sex -0.04 0.04 -0.21 0.07 -0.11 -0.13 0.08
## age 0.27 0.03 -0.01 0.32 -0.17 -0.10 -0.01
## address -0.02 -0.34 0.06 -0.06 0.02 0.01 -0.03
## famsize 0.00 -0.01 0.01 0.07 0.06 0.04 0.05
## Pstatus -0.17 0.04 -0.01 -0.01 -0.01 0.01 0.02
## Medu -0.01 -0.27 0.10 -0.17 -0.02 0.12 0.11
## activities nursery higher internet romantic famrel freetime goout Dalc
## sex 0.12 -0.04 -0.06 0.07 -0.11 0.08 0.15 0.06 0.28
## age -0.05 -0.02 -0.27 0.01 0.18 -0.02 0.00 0.11 0.13
## address -0.01 0.02 0.08 0.18 -0.03 -0.03 -0.04 0.02 -0.05
## famsize 0.01 -0.10 0.00 -0.01 0.03 0.00 0.02 0.00 -0.06
## Pstatus 0.10 -0.03 0.02 0.06 -0.05 0.05 0.04 0.03 0.04
## Medu 0.12 0.13 0.21 0.27 -0.03 0.02 -0.02 0.01 -0.01
## Walc health absences G1 G2 G3 result
## sex 0.32 0.14 0.02 -0.10 -0.10 -0.13 -0.08
## age 0.09 -0.01 0.15 -0.17 -0.11 -0.11 -0.11
## address -0.01 0.00 0.07 0.16 0.15 0.17 0.13
## famsize -0.08 0.00 0.00 -0.05 -0.04 -0.05 -0.05
## Pstatus 0.07 0.01 -0.12 0.02 0.02 0.00 0.00
## Medu -0.02 0.00 -0.01 0.26 0.26 0.24 0.14
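Because `data.matrix()` replaces each factor with its integer codes, the correlations above depend on how the levels are ordered. The sketch below illustrates this on a hypothetical toy data frame and also computes a rank-based (Spearman) correlation for comparison. The values are made up for illustration.

```r
# Hypothetical toy data with one binary factor, one nominal factor, and a grade
toy <- data.frame(
  higher = factor(c("no", "yes", "yes", "no", "yes")),
  Mjob = factor(c("teacher", "other", "health", "other", "at_home")),
  G3 = c(8, 14, 15, 9, 12)
)

# Factors become integer codes: higher no = 1, yes = 2;
# Mjob levels are alphabetical: at_home = 1, health = 2, other = 3, teacher = 4
m <- data.matrix(toy)
m

# Pearson treats the codes as numbers; Spearman only uses their ranks
cor(m, method = "pearson")["G3", ]
cor(m, method = "spearman")["G3", ]
```

For binary and ordinal variables the integer codes carry a meaningful direction. For nominal variables such as Mjob, Fjob, reason, and guardian the alphabetical code order is arbitrary, so their correlation values should be read with caution.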
Reshape the correlation matrix heat_map_data from wide format to long format.
heat_map_data <- melt(heat_map_data)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
head(heat_map_data)
## X1 X2 value
## 1 sex sex 1.00
## 2 age sex -0.04
## 3 address sex 0.03
## 4 famsize sex -0.10
## 5 Pstatus sex 0.06
## 6 Medu sex 0.12
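The wide-to-long reshaping can also be sketched in base R, which avoids the `type.convert` warnings shown above. The example below uses a small illustrative 3 x 3 matrix (the G1-G2 value is a placeholder, not a result from this report) and reproduces the same three-column layout.

```r
# Illustrative correlation matrix; the 0.86 entry is a placeholder value
m <- matrix(
  c(1.00, 0.92, 0.83,
    0.92, 1.00, 0.86,
    0.83, 0.86, 1.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("G3", "G2", "G1"), c("G3", "G2", "G1"))
)

# as.table() keeps the dimnames; as.data.frame() then yields one row per cell
long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(long) <- c("X1", "X2", "value")
head(long)

# Absolute correlations with G3, largest first
g3_row <- sort(abs(m["G3", ]), decreasing = TRUE)
g3_row
```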
Plot the heat map.
heatmap <- ggplot(data = heat_map_data, aes(x = X1, y = X2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "red", high = "darkblue") +
theme(
axis.text.x = element_text(angle = 90, size = 8),
axis.title.x = element_text(angle = 0, color = "red"),
axis.title.y = element_text(angle = 360, color = "blue")
) +
coord_equal()
heatmap
Show only the absolute correlations with G3, in descending order.
G3 <- subset(heat_map_data, X1 == "G3")
G3[3] <- abs(G3[3])
G3 <- G3[order(G3[, 3], decreasing = TRUE), ]
G3
## X1 X2 value
## 1055 G3 G3 1.00
## 1022 G3 G2 0.92
## 989 G3 G1 0.83
## 1088 G3 result 0.66
## 461 G3 failures 0.39
## 659 G3 higher 0.33
## 428 G3 studytime 0.25
## 197 G3 Medu 0.24
## 230 G3 Fedu 0.21
## 857 G3 Dalc 0.20
## 890 G3 Walc 0.18
## 98 G3 address 0.17
## 263 G3 Mjob 0.15
## 692 G3 internet 0.15
## 32 G3 sex 0.13
## 395 G3 traveltime 0.13
## 329 G3 reason 0.12
## 791 G3 freetime 0.12
## 65 G3 age 0.11
## 923 G3 health 0.10
## 725 G3 romantic 0.09
## 824 G3 goout 0.09
## 956 G3 absences 0.09
## 362 G3 guardian 0.08
## 494 G3 schoolsup 0.07
## 527 G3 famsup 0.06
## 593 G3 activities 0.06
## 758 G3 famrel 0.06
## 131 G3 famsize 0.05
## 296 G3 Fjob 0.05
## 560 G3 paid 0.05
## 626 G3 nursery 0.03
## 164 G3 Pstatus 0.00
Create train set and test set for regression and classification respectively.
df_reg <- df
df_reg$result <- NULL
set.seed(42)
train_index_reg <- createDataPartition(
df_reg$G3,
p = 0.8,
list = FALSE,
times = 1
)
train_set_reg <- df_reg[train_index_reg, ]
test_set_reg <- df_reg[-train_index_reg, ]
df_clf <- df
df_clf$G3 <- NULL
set.seed(42)
train_index_clf <- createDataPartition(
df_clf$result,
p = 0.8,
list = FALSE,
times = 1
)
train_set_clf <- df_clf[train_index_clf, ]
test_set_clf <- df_clf[-train_index_clf, ]
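For a factor outcome, createDataPartition draws the sample within each class so that the class proportions are preserved in both sets. A minimal base R sketch of that idea is shown below; the helper stratified_index and the toy labels are hypothetical, and this is not caret's actual implementation.

```r
set.seed(42)
# Toy labels with an imbalance similar to the real data (hypothetical counts)
y <- factor(c(rep("fail", 100), rep("pass", 550)))

# Sample a proportion p of the row indices within each class separately
stratified_index <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(y)
print(table(y[train_idx]))
print(table(y[-train_idx]))
```

Both tables keep roughly the same fail-to-pass ratio as the full label vector, which is the point of stratifying.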
Since we are dealing with imbalanced classes, we explore subsampling techniques for classification.
table(train_set_clf$result)
##
## fail pass
## 80 440
set.seed(42)
train_set_clf_down <- downSample(
x = train_set_clf[, -ncol(train_set_clf)],
y = train_set_clf$result,
yname = "result"
)
table(train_set_clf_down$result)
##
## fail pass
## 80 80
set.seed(42)
train_set_clf_up <- upSample(
x = train_set_clf[, -ncol(train_set_clf)],
y = train_set_clf$result,
yname = "result"
)
table(train_set_clf_up$result)
##
## fail pass
## 440 440
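The effect of the two subsampling calls above can be sketched with base R on the labels alone. This is an illustrative approximation under our own assumptions (the names labels, down_idx and up_idx are hypothetical), not the code caret runs internally.

```r
set.seed(42)
# Toy training labels mirroring the 80 / 440 split above
labels <- factor(c(rep("fail", 80), rep("pass", 440)))
rows_by_class <- split(seq_along(labels), labels)
class_sizes <- lengths(rows_by_class)

# Down-sampling: keep min(class size) rows per class, without replacement
down_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), min(class_sizes))]
}), use.names = FALSE)

# Up-sampling: every class grows to max(class size); the extra rows are
# drawn with replacement from that class
up_idx <- unlist(lapply(rows_by_class, function(idx) {
  extra <- max(class_sizes) - length(idx)
  c(idx, idx[sample.int(length(idx), extra, replace = TRUE)])
}), use.names = FALSE)

print(table(labels[down_idx]))
print(table(labels[up_idx]))
```

Down-sampling discards most of the majority class, while up-sampling repeats minority rows, so the two options trade information loss against duplicated observations.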
Define repeated 10-fold cross-validation (10 folds, repeated 10 times).
fit_control <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 10
)
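To make the resampling scheme concrete, the sketch below builds the fold assignments by hand in base R. It is a simplified illustration: the folds here are purely random, whereas caret also balances the folds on the outcome, so its fold sizes vary slightly (for example 467 to 469 instead of a constant 468).

```r
set.seed(42)
n <- 520 # training rows, as reported in the model output
k <- 10 # folds per repeat
repeats <- 10

# For each repeat, shuffle the rows and deal them into k folds
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Each (repeat, fold) pair gives one hold-out set, so k * repeats fits
n_resamples <- k * repeats
print(n_resamples)

# Rows left for fitting when each fold of the first repeat is held out
holdout_sizes <- as.integer(table(fold_ids[[1]]))
print(n - holdout_sizes)
```

Each candidate set of tuning parameters is therefore scored on 100 hold-out sets, and the reported RMSE, Rsquared and MAE are averages over them.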
Helper function for regression models
train_evaluate_reg <- function(method = "", data = train_set_reg, tuneGrid = NULL, tuneLength = NULL, name = "") {
set.seed(42)
fit <- train(
G3 ~ .,
data = data,
method = method,
trControl = fit_control,
preProcess = c("center", "scale"),
tuneGrid = tuneGrid,
tuneLength = tuneLength
)
print(fit)
pred <- predict(fit, test_set_reg)
result <- postResample(pred = pred, obs = test_set_reg$G3)
metrics <- data.frame(
Model = name,
RMSE = result[["RMSE"]],
Rsquared = result[["Rsquared"]],
MAE = result[["MAE"]]
)
return(
list(
model = fit,
metrics = metrics
)
)
}
get_best_result <- function(caret_fit) {
best <- which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result <- caret_fit$results[best, ]
rownames(best_result) <- NULL
best_result
}
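The three metrics collected by the helper above can be reproduced by hand. The sketch below uses made-up grades and assumes that Rsquared is the squared Pearson correlation between predictions and observations, which is our understanding of the default in postResample.

```r
# Hypothetical observed and predicted grades (not from the student data)
obs <- c(10, 12, 8, 15, 11)
pred <- c(11, 12, 9, 13, 10)

# Root mean squared error: penalises large errors more heavily
rmse <- sqrt(mean((pred - obs)^2))
# Mean absolute error: average size of the error in grade points
mae <- mean(abs(pred - obs))
# Squared Pearson correlation between predictions and observations
rsq <- cor(pred, obs)^2

print(c(RMSE = rmse, Rsquared = rsq, MAE = mae))
```

Because G3 is on a 0 to 20 scale, RMSE and MAE can be read directly as grade points.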
CART
Training model with auto tuning:
dt_reg <- train_evaluate_reg("rpart", name = "Decision Tree", tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## CART
##
## 520 samples
## 32 predictor
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.002557381 1.511393 0.7924958 0.9036204
## 0.002711357 1.510197 0.7929802 0.9011272
## 0.005577788 1.480733 0.8000066 0.8637666
## 0.013277533 1.527893 0.7858778 0.9042992
## 0.013807340 1.534900 0.7842550 0.9149780
## 0.022095544 1.594516 0.7664695 1.0153027
## 0.053870748 1.745563 0.7258779 1.1300192
## 0.091924350 1.911984 0.6642470 1.3105574
## 0.127582534 2.194760 0.5633255 1.5491983
## 0.522696343 2.771181 0.4965485 2.0307630
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.005577788.
Best parameters from tuning:
get_best_result(dt_reg$model)
## cp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.005577788 1.480733 0.8000066 0.8637666 0.3764418 0.08400127 0.1518122
Plot the trained model:
plot(dt_reg$model)
Evaluation of model:
dt_reg$metrics
## Model RMSE Rsquared MAE
## 1 Decision Tree 1.340746 0.8114244 0.861462
Random Forest
Training model with auto tuning:
rf_reg <- train_evaluate_reg("ranger", name = "Random Forest", tuneLength = 10)
## Random Forest
##
## 520 samples
## 32 predictor
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ...
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 2.361082 0.5965337 1.6083866
## 2 extratrees 2.510659 0.5538075 1.7610333
## 9 variance 1.671533 0.7692800 1.0813811
## 9 extratrees 1.844648 0.7384919 1.2565375
## 17 variance 1.458251 0.8115059 0.9221814
## 17 extratrees 1.571260 0.7925544 1.0201987
## 25 variance 1.382811 0.8255016 0.8781849
## 25 extratrees 1.452196 0.8132295 0.9186745
## 33 variance 1.349965 0.8314621 0.8630600
## 33 extratrees 1.395685 0.8225973 0.8814960
## 40 variance 1.337174 0.8335206 0.8572666
## 40 extratrees 1.364251 0.8279827 0.8666393
## 48 variance 1.332195 0.8340520 0.8557826
## 48 extratrees 1.347049 0.8312339 0.8624743
## 56 variance 1.334484 0.8331868 0.8596320
## 56 extratrees 1.334864 0.8334659 0.8613067
## 64 variance 1.341194 0.8311826 0.8630732
## 64 extratrees 1.327816 0.8346144 0.8608048
## 72 variance 1.356029 0.8277105 0.8705985
## 72 extratrees 1.328030 0.8342539 0.8635932
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 64, splitrule = extratrees
## and min.node.size = 5.
Best parameters from tuning:
get_best_result(rf_reg$model)
## mtry min.node.size splitrule RMSE Rsquared MAE RMSESD
## 1 64 5 extratrees 1.327816 0.8346144 0.8608048 0.355189
## RsquaredSD MAESD
## 1 0.07199443 0.1303264
Plot the trained model:
plot(rf_reg$model)
Training model with manual tuning:
tuneGrid_rf_reg <- expand.grid(
splitrule = c("variance", "extratrees"),
mtry = seq(48, 72, by = 2),
min.node.size = seq(1, 10)
)
rf_reg_manual_tune <- train_evaluate_reg("ranger", name = "Random Forest", tuneGrid = tuneGrid_rf_reg)
## Random Forest
##
## 520 samples
## 32 predictor
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ...
## Resampling results across tuning parameters:
##
## splitrule mtry min.node.size RMSE Rsquared MAE
## variance 48 1 1.339687 0.8321916 0.8600324
## variance 48 2 1.336134 0.8332092 0.8592320
## variance 48 3 1.333500 0.8336435 0.8559939
## variance 48 4 1.334916 0.8332117 0.8589217
## variance 48 5 1.332246 0.8340008 0.8572885
## variance 48 6 1.331247 0.8343653 0.8556232
## variance 48 7 1.328390 0.8350604 0.8553474
## variance 48 8 1.328514 0.8350560 0.8533839
## variance 48 9 1.326800 0.8354832 0.8535098
## variance 48 10 1.324920 0.8360505 0.8524417
## variance 50 1 1.335152 0.8331818 0.8593207
## variance 50 2 1.333297 0.8335790 0.8567243
## variance 50 3 1.334035 0.8335643 0.8578439
## variance 50 4 1.332929 0.8338782 0.8566703
## variance 50 5 1.327261 0.8351136 0.8554863
## variance 50 6 1.332834 0.8337942 0.8569689
## variance 50 7 1.328753 0.8348733 0.8543881
## variance 50 8 1.328750 0.8347862 0.8552245
## variance 50 9 1.327696 0.8351239 0.8535439
## variance 50 10 1.326696 0.8352006 0.8531400
## variance 52 1 1.339058 0.8321510 0.8613817
## variance 52 2 1.335897 0.8328892 0.8592826
## variance 52 3 1.336815 0.8327205 0.8586926
## variance 52 4 1.334774 0.8333031 0.8586056
## variance 52 5 1.330411 0.8342853 0.8565100
## variance 52 6 1.332062 0.8338640 0.8567949
## variance 52 7 1.329409 0.8346121 0.8545169
## variance 52 8 1.326546 0.8353887 0.8544571
## variance 52 9 1.326604 0.8352086 0.8547701
## variance 52 10 1.324858 0.8358415 0.8515817
## variance 54 1 1.339438 0.8320941 0.8608356
## variance 54 2 1.336583 0.8326678 0.8588230
## variance 54 3 1.334571 0.8333159 0.8586398
## variance 54 4 1.333612 0.8333486 0.8575114
## variance 54 5 1.331628 0.8340053 0.8574301
## variance 54 6 1.331798 0.8338328 0.8569963
## variance 54 7 1.331468 0.8341805 0.8558717
## variance 54 8 1.328779 0.8346501 0.8552630
## variance 54 9 1.329067 0.8346248 0.8547711
## variance 54 10 1.325108 0.8356046 0.8532881
## variance 56 1 1.336098 0.8328061 0.8591514
## variance 56 2 1.336694 0.8327624 0.8602211
## variance 56 3 1.332034 0.8336632 0.8571458
## variance 56 4 1.335753 0.8327686 0.8595249
## variance 56 5 1.334269 0.8334682 0.8587135
## variance 56 6 1.332736 0.8334263 0.8575950
## variance 56 7 1.329434 0.8344379 0.8563923
## variance 56 8 1.325990 0.8351338 0.8543020
## variance 56 9 1.326542 0.8351796 0.8535194
## variance 56 10 1.331704 0.8342096 0.8568014
## variance 58 1 1.341212 0.8315597 0.8617310
## variance 58 2 1.338155 0.8323164 0.8598174
## variance 58 3 1.337651 0.8322878 0.8601191
## variance 58 4 1.335649 0.8330410 0.8588664
## variance 58 5 1.333211 0.8333436 0.8577182
## variance 58 6 1.334425 0.8331087 0.8588736
## variance 58 7 1.331933 0.8338809 0.8576046
## variance 58 8 1.330944 0.8338760 0.8571087
## variance 58 9 1.326875 0.8349690 0.8542378
## variance 58 10 1.326915 0.8349575 0.8531167
## variance 60 1 1.339907 0.8316996 0.8597623
## variance 60 2 1.339078 0.8319042 0.8614850
## variance 60 3 1.337377 0.8324505 0.8609218
## variance 60 4 1.335012 0.8328886 0.8585221
## variance 60 5 1.335956 0.8326814 0.8602238
## variance 60 6 1.336813 0.8324576 0.8610144
## variance 60 7 1.333635 0.8333510 0.8587829
## variance 60 8 1.333400 0.8333189 0.8577025
## variance 60 9 1.327190 0.8349132 0.8557101
## variance 60 10 1.329782 0.8342808 0.8562978
## variance 62 1 1.342856 0.8309539 0.8635399
## variance 62 2 1.342705 0.8313093 0.8636480
## variance 62 3 1.341050 0.8315549 0.8620549
## variance 62 4 1.341316 0.8315526 0.8625675
## variance 62 5 1.339484 0.8320184 0.8620530
## variance 62 6 1.335107 0.8329207 0.8597387
## variance 62 7 1.335599 0.8326076 0.8603705
## variance 62 8 1.334449 0.8328604 0.8586490
## variance 62 9 1.334823 0.8329697 0.8586450
## variance 62 10 1.334007 0.8333797 0.8577082
## variance 64 1 1.346323 0.8300364 0.8648164
## variance 64 2 1.345276 0.8303370 0.8643561
## variance 64 3 1.343563 0.8307721 0.8629250
## variance 64 4 1.342799 0.8310154 0.8635823
## variance 64 5 1.343023 0.8309613 0.8627392
## variance 64 6 1.339169 0.8317798 0.8604201
## variance 64 7 1.338549 0.8319755 0.8606543
## variance 64 8 1.335124 0.8327636 0.8587239
## variance 64 9 1.335856 0.8327751 0.8585104
## variance 64 10 1.335678 0.8325813 0.8591046
## variance 66 1 1.349577 0.8295539 0.8670316
## variance 66 2 1.345125 0.8302964 0.8635574
## variance 66 3 1.344393 0.8307291 0.8645434
## variance 66 4 1.342984 0.8308814 0.8631294
## variance 66 5 1.342412 0.8309094 0.8636108
## variance 66 6 1.342985 0.8308151 0.8638655
## variance 66 7 1.342860 0.8309417 0.8628196
## variance 66 8 1.339961 0.8316644 0.8617989
## variance 66 9 1.338182 0.8320436 0.8609783
## variance 66 10 1.337023 0.8325464 0.8606262
## variance 68 1 1.348511 0.8292667 0.8666414
## variance 68 2 1.350095 0.8293947 0.8677363
## variance 68 3 1.350463 0.8291745 0.8675020
## variance 68 4 1.348276 0.8296947 0.8671250
## variance 68 5 1.347295 0.8300122 0.8662821
## variance 68 6 1.347365 0.8300294 0.8663603
## variance 68 7 1.344357 0.8304300 0.8644353
## variance 68 8 1.342599 0.8309355 0.8631230
## variance 68 9 1.342045 0.8312192 0.8625727
## variance 68 10 1.340448 0.8314650 0.8621441
## variance 70 1 1.352269 0.8284792 0.8687205
## variance 70 2 1.353498 0.8285757 0.8686645
## variance 70 3 1.351162 0.8290253 0.8694881
## variance 70 4 1.351317 0.8289070 0.8685161
## variance 70 5 1.351815 0.8284775 0.8697243
## variance 70 6 1.350016 0.8293402 0.8673371
## variance 70 7 1.347022 0.8300723 0.8654462
## variance 70 8 1.347095 0.8295813 0.8657571
## variance 70 9 1.345928 0.8301734 0.8654948
## variance 70 10 1.344120 0.8306820 0.8638135
## variance 72 1 1.356962 0.8274546 0.8717786
## variance 72 2 1.354968 0.8278927 0.8699518
## variance 72 3 1.357489 0.8274426 0.8714654
## variance 72 4 1.356925 0.8276539 0.8724947
## variance 72 5 1.355856 0.8277021 0.8712722
## variance 72 6 1.352897 0.8285674 0.8705194
## variance 72 7 1.352086 0.8287302 0.8688157
## variance 72 8 1.351827 0.8288505 0.8677244
## variance 72 9 1.347522 0.8297915 0.8670259
## variance 72 10 1.347822 0.8296083 0.8667080
## extratrees 48 1 1.347521 0.8308507 0.8649293
## extratrees 48 2 1.348622 0.8306127 0.8655997
## extratrees 48 3 1.348978 0.8305022 0.8641479
## extratrees 48 4 1.346157 0.8314651 0.8626972
## extratrees 48 5 1.346153 0.8314617 0.8619114
## extratrees 48 6 1.345810 0.8316171 0.8636060
## extratrees 48 7 1.346626 0.8313734 0.8613714
## extratrees 48 8 1.343627 0.8324255 0.8609683
## extratrees 48 9 1.344505 0.8320092 0.8590062
## extratrees 48 10 1.344030 0.8322274 0.8587626
## extratrees 50 1 1.345761 0.8310370 0.8648987
## extratrees 50 2 1.346102 0.8309838 0.8647358
## extratrees 50 3 1.343078 0.8318616 0.8619370
## extratrees 50 4 1.344699 0.8317251 0.8635136
## extratrees 50 5 1.342836 0.8319075 0.8620110
## extratrees 50 6 1.341633 0.8324208 0.8613458
## extratrees 50 7 1.338845 0.8330308 0.8592019
## extratrees 50 8 1.340494 0.8328322 0.8592757
## extratrees 50 9 1.339838 0.8330136 0.8583695
## extratrees 50 10 1.339943 0.8331555 0.8587458
## extratrees 52 1 1.342341 0.8318224 0.8635294
## extratrees 52 2 1.342861 0.8314324 0.8644988
## extratrees 52 3 1.341512 0.8319194 0.8627107
## extratrees 52 4 1.341623 0.8319885 0.8621626
## extratrees 52 5 1.343451 0.8316812 0.8639717
## extratrees 52 6 1.339699 0.8326377 0.8620033
## extratrees 52 7 1.339884 0.8327289 0.8602154
## extratrees 52 8 1.338498 0.8329775 0.8595290
## extratrees 52 9 1.337864 0.8332135 0.8579209
## extratrees 52 10 1.338310 0.8331532 0.8587504
## extratrees 54 1 1.341816 0.8318146 0.8638307
## extratrees 54 2 1.340339 0.8320702 0.8652936
## extratrees 54 3 1.339778 0.8323366 0.8634080
## extratrees 54 4 1.338756 0.8324720 0.8627294
## extratrees 54 5 1.337887 0.8329298 0.8629117
## extratrees 54 6 1.337586 0.8331037 0.8603782
## extratrees 54 7 1.337037 0.8328204 0.8615658
## extratrees 54 8 1.336425 0.8333047 0.8592163
## extratrees 54 9 1.335152 0.8337030 0.8584354
## extratrees 54 10 1.334316 0.8338116 0.8578567
## extratrees 56 1 1.337571 0.8323544 0.8653848
## extratrees 56 2 1.335584 0.8331499 0.8624342
## extratrees 56 3 1.340494 0.8317463 0.8645034
## extratrees 56 4 1.336849 0.8328491 0.8621175
## extratrees 56 5 1.336295 0.8330677 0.8623037
## extratrees 56 6 1.335107 0.8332495 0.8619530
## extratrees 56 7 1.334273 0.8335571 0.8603396
## extratrees 56 8 1.334725 0.8336201 0.8595429
## extratrees 56 9 1.330419 0.8345859 0.8578281
## extratrees 56 10 1.329578 0.8347221 0.8573440
## extratrees 58 1 1.335413 0.8327573 0.8634593
## extratrees 58 2 1.339026 0.8322598 0.8646372
## extratrees 58 3 1.335810 0.8330549 0.8623792
## extratrees 58 4 1.335018 0.8329572 0.8626948
## extratrees 58 5 1.332236 0.8337868 0.8612051
## extratrees 58 6 1.334153 0.8334467 0.8612894
## extratrees 58 7 1.333057 0.8336303 0.8612747
## extratrees 58 8 1.331538 0.8339040 0.8585468
## extratrees 58 9 1.328958 0.8345418 0.8580625
## extratrees 58 10 1.330802 0.8344161 0.8585206
## extratrees 60 1 1.334224 0.8329314 0.8635392
## extratrees 60 2 1.334193 0.8331909 0.8635151
## extratrees 60 3 1.336306 0.8327998 0.8643734
## extratrees 60 4 1.332906 0.8336170 0.8626422
## extratrees 60 5 1.331913 0.8337660 0.8606709
## extratrees 60 6 1.333312 0.8335311 0.8605103
## extratrees 60 7 1.330675 0.8340662 0.8585362
## extratrees 60 8 1.328732 0.8346548 0.8589186
## extratrees 60 9 1.328413 0.8349366 0.8573767
## extratrees 60 10 1.325265 0.8355190 0.8565007
## extratrees 62 1 1.334612 0.8328889 0.8635929
## extratrees 62 2 1.335642 0.8326700 0.8639810
## extratrees 62 3 1.332889 0.8333563 0.8623418
## extratrees 62 4 1.335857 0.8327011 0.8640703
## extratrees 62 5 1.331430 0.8339716 0.8618315
## extratrees 62 6 1.329899 0.8341603 0.8606858
## extratrees 62 7 1.328153 0.8347101 0.8584517
## extratrees 62 8 1.328464 0.8344641 0.8596425
## extratrees 62 9 1.326372 0.8351879 0.8570727
## extratrees 62 10 1.324148 0.8357049 0.8556846
## extratrees 64 1 1.332992 0.8331434 0.8635162
## extratrees 64 2 1.331516 0.8333913 0.8625641
## extratrees 64 3 1.334705 0.8329079 0.8643609
## extratrees 64 4 1.331793 0.8333615 0.8627229
## extratrees 64 5 1.328893 0.8341418 0.8611065
## extratrees 64 6 1.326888 0.8347065 0.8609889
## extratrees 64 7 1.326945 0.8348037 0.8584164
## extratrees 64 8 1.328275 0.8345450 0.8590783
## extratrees 64 9 1.323051 0.8357564 0.8562132
## extratrees 64 10 1.323540 0.8357246 0.8573136
## extratrees 66 1 1.331054 0.8332347 0.8631637
## extratrees 66 2 1.330549 0.8337267 0.8643860
## extratrees 66 3 1.334523 0.8327725 0.8641653
## extratrees 66 4 1.330251 0.8340934 0.8625296
## extratrees 66 5 1.327571 0.8345165 0.8616170
## extratrees 66 6 1.326612 0.8346667 0.8606729
## extratrees 66 7 1.327485 0.8348269 0.8588624
## extratrees 66 8 1.322938 0.8357517 0.8565319
## extratrees 66 9 1.323364 0.8357619 0.8571829
## extratrees 66 10 1.320224 0.8363500 0.8553562
## extratrees 68 1 1.333478 0.8329734 0.8658531
## extratrees 68 2 1.332544 0.8332008 0.8624033
## extratrees 68 3 1.331038 0.8334506 0.8628630
## extratrees 68 4 1.329276 0.8339825 0.8626479
## extratrees 68 5 1.327515 0.8341302 0.8616624
## extratrees 68 6 1.328497 0.8342479 0.8616911
## extratrees 68 7 1.324771 0.8351084 0.8583103
## extratrees 68 8 1.325588 0.8349738 0.8595417
## extratrees 68 9 1.322489 0.8357658 0.8567311
## extratrees 68 10 1.320615 0.8362122 0.8551947
## extratrees 70 1 1.332192 0.8330204 0.8652999
## extratrees 70 2 1.332710 0.8331645 0.8658322
## extratrees 70 3 1.332521 0.8329066 0.8641441
## extratrees 70 4 1.328974 0.8338091 0.8625399
## extratrees 70 5 1.325444 0.8347988 0.8609953
## extratrees 70 6 1.326100 0.8346392 0.8601731
## extratrees 70 7 1.325402 0.8348417 0.8592183
## extratrees 70 8 1.322734 0.8354970 0.8581778
## extratrees 70 9 1.320029 0.8363663 0.8556533
## extratrees 70 10 1.319497 0.8363125 0.8562400
## extratrees 72 1 1.330268 0.8334010 0.8628994
## extratrees 72 2 1.329614 0.8335267 0.8637409
## extratrees 72 3 1.328619 0.8338012 0.8621581
## extratrees 72 4 1.325283 0.8348138 0.8597814
## extratrees 72 5 1.327494 0.8342769 0.8617857
## extratrees 72 6 1.327317 0.8342458 0.8615627
## extratrees 72 7 1.323704 0.8352427 0.8596404
## extratrees 72 8 1.323539 0.8352105 0.8584455
## extratrees 72 9 1.317547 0.8365602 0.8551783
## extratrees 72 10 1.319702 0.8362721 0.8554894
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 72, splitrule = extratrees
## and min.node.size = 9.
Best parameters from tuning:
get_best_result(rf_reg_manual_tune$model)
## splitrule mtry min.node.size RMSE Rsquared MAE RMSESD
## 1 extratrees 72 9 1.317547 0.8365602 0.8551783 0.3567946
## RsquaredSD MAESD
## 1 0.07302355 0.1312441
Plot the trained model:
plot(rf_reg_manual_tune$model)
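The cost of the manual grid above follows from simple arithmetic, sketched here in base R. The grid definition is copied from the tuning call; the variable names grid, n_candidates and n_resamples are ours.

```r
grid <- expand.grid(
  splitrule = c("variance", "extratrees"),
  mtry = seq(48, 72, by = 2),
  min.node.size = seq(1, 10)
)

n_candidates <- nrow(grid) # 2 split rules * 13 mtry values * 10 node sizes
n_resamples <- 10 * 10 # 10 folds x 10 repeats
print(n_candidates)
print(n_candidates * n_resamples) # forests fitted during resampling
```

The differences in RMSE across this grid are small compared with the resampling standard deviation (RMSESD of about 0.36), so the selected combination should be read as one of many near-equivalent settings.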
Optimal value for each parameter: mtry = 72, splitrule = extratrees, min.node.size = 9.
Evaluation of model:
rf_reg_manual_tune$metrics
## Model RMSE Rsquared MAE
## 1 Random Forest 1.106174 0.8666637 0.771789
Support Vector Machines with Polynomial Kernel
Training model with auto tuning:
svm_reg <- train_evaluate_reg("svmPoly", name = "Support Vector Machine", tuneLength = 5)
## Support Vector Machines with Polynomial Kernel
##
## 520 samples
## 32 predictor
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ...
## Resampling results across tuning parameters:
##
## degree scale C RMSE Rsquared MAE
## 1 1e-03 0.25 2.549829 0.6456328 1.7925306
## 1 1e-03 0.50 2.211695 0.7116538 1.5148416
## 1 1e-03 1.00 1.850422 0.7721718 1.2223146
## 1 1e-03 2.00 1.576803 0.8051574 0.9973955
## 1 1e-03 4.00 1.463469 0.8158565 0.9114372
## 1 1e-02 0.25 1.531635 0.8096725 0.9628491
## 1 1e-02 0.50 1.436807 0.8189442 0.8921404
## 1 1e-02 1.00 1.375945 0.8266689 0.8589232
## 1 1e-02 2.00 1.340942 0.8320214 0.8390972
## 1 1e-02 4.00 1.329444 0.8330228 0.8348422
## 1 1e-01 0.25 1.336747 0.8324161 0.8375515
## 1 1e-01 0.50 1.328011 0.8330293 0.8360650
## 1 1e-01 1.00 1.325745 0.8328028 0.8388928
## 1 1e-01 2.00 1.324562 0.8327915 0.8404564
## 1 1e-01 4.00 1.324348 0.8326577 0.8413139
## 1 1e+00 0.25 1.324293 0.8327982 0.8403833
## 1 1e+00 0.50 1.324081 0.8327034 0.8414354
## 1 1e+00 1.00 1.324012 0.8326233 0.8416821
## 1 1e+00 2.00 1.324193 0.8325544 0.8418343
## 1 1e+00 4.00 1.324190 0.8325819 0.8421801
## 1 1e+01 0.25 1.323996 0.8326103 0.8417843
## 1 1e+01 0.50 1.324120 0.8325511 0.8420901
## 1 1e+01 1.00 1.324485 0.8324661 0.8425282
## 1 1e+01 2.00 1.324095 0.8325926 0.8430595
## 1 1e+01 4.00 1.332690 0.8302568 0.8495567
## 2 1e-03 0.25 2.211160 0.7113696 1.5143892
## 2 1e-03 0.50 1.850651 0.7717793 1.2224658
## 2 1e-03 1.00 1.577994 0.8046823 0.9982192
## 2 1e-03 2.00 1.463282 0.8159675 0.9108371
## 2 1e-03 4.00 1.389879 0.8247852 0.8645875
## 2 1e-02 0.25 1.447221 0.8147122 0.9076882
## 2 1e-02 0.50 1.386995 0.8222164 0.8797818
## 2 1e-02 1.00 1.373413 0.8216364 0.8873369
## 2 1e-02 2.00 1.428855 0.8068908 0.9522472
## 2 1e-02 4.00 1.509366 0.7881941 1.0228407
## 2 1e-01 0.25 1.689545 0.7337991 1.1839978
## 2 1e-01 0.50 1.689545 0.7337991 1.1839978
## 2 1e-01 1.00 1.689545 0.7337991 1.1839978
## 2 1e-01 2.00 1.689545 0.7337991 1.1839978
## 2 1e-01 4.00 1.689545 0.7337991 1.1839978
## 2 1e+00 0.25 2.424154 0.4529740 1.8054147
## 2 1e+00 0.50 2.424154 0.4529740 1.8054147
## 2 1e+00 1.00 2.424154 0.4529740 1.8054147
## 2 1e+00 2.00 2.424154 0.4529740 1.8054147
## 2 1e+00 4.00 2.424154 0.4529740 1.8054147
## 2 1e+01 0.25 2.883325 0.2572559 2.1827675
## 2 1e+01 0.50 2.883325 0.2572559 2.1827675
## 2 1e+01 1.00 2.883325 0.2572559 2.1827675
## 2 1e+01 2.00 2.883325 0.2572559 2.1827675
## 2 1e+01 4.00 2.883325 0.2572559 2.1827675
## 3 1e-03 0.25 1.992879 0.7507860 1.3394679
## 3 1e-03 0.50 1.673788 0.7936734 1.0793318
## 3 1e-03 1.00 1.503178 0.8121150 0.9412749
## 3 1e-03 2.00 1.416378 0.8211781 0.8799348
## 3 1e-03 4.00 1.355783 0.8297120 0.8470772
## 3 1e-02 0.25 1.426639 0.8153584 0.9143771
## 3 1e-02 0.50 1.436434 0.8069174 0.9435222
## 3 1e-02 1.00 1.479660 0.7942536 0.9926754
## 3 1e-02 2.00 1.502743 0.7879333 1.0192699
## 3 1e-02 4.00 1.504598 0.7874536 1.0209965
## 3 1e-01 0.25 1.914433 0.7043706 1.3246757
## 3 1e-01 0.50 1.914433 0.7043706 1.3246757
## 3 1e-01 1.00 1.914433 0.7043706 1.3246757
## 3 1e-01 2.00 1.914433 0.7043706 1.3246757
## 3 1e-01 4.00 1.914433 0.7043706 1.3246757
## 3 1e+00 0.25 2.054610 0.6713533 1.4354513
## 3 1e+00 0.50 2.054610 0.6713533 1.4354513
## 3 1e+00 1.00 2.054610 0.6713533 1.4354513
## 3 1e+00 2.00 2.054610 0.6713533 1.4354513
## 3 1e+00 4.00 2.054610 0.6713533 1.4354513
## 3 1e+01 0.25 2.049759 0.6723365 1.4325597
## 3 1e+01 0.50 2.049759 0.6723365 1.4325597
## 3 1e+01 1.00 2.049759 0.6723365 1.4325597
## 3 1e+01 2.00 2.049759 0.6723365 1.4325597
## 3 1e+01 4.00 2.049759 0.6723365 1.4325597
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 1, scale = 10 and C = 0.25.
Best parameters from tuning:
get_best_result(svm_reg$model)
## degree scale C RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 10 0.25 1.323996 0.8326103 0.8417843 0.3969325 0.08127431 0.1390743
Plot the trained model:
plot(svm_reg$model)
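The tuning result selects degree = 1, which makes the polynomial kernel linear in the inner product. The sketch below assumes the kernel has the form (scale * <x, y> + offset)^degree with offset = 1, which is our understanding of the kernlab polynomial kernel; the function poly_kernel and the input vectors are hypothetical.

```r
# Polynomial kernel of the form (scale * <x, y> + offset)^degree
poly_kernel <- function(x, y, degree = 1, scale = 1, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}

x <- c(1, 2)
y <- c(3, 4)

# degree = 1: the kernel is a linear function of the inner product
print(poly_kernel(x, y, degree = 1, scale = 10))
# degree = 2: the kernel adds squared and interaction terms
print(poly_kernel(x, y, degree = 2, scale = 0.1))
```

With degree = 1 the model behaves like a linear SVM, which is consistent with the many degree = 1 rows that reach almost identical RMSE.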
Evaluation of model:
svm_reg$metrics
## Model RMSE Rsquared MAE
## 1 Support Vector Machine 1.177426 0.8521626 0.7618882
eXtreme Gradient Boosting
Training model with auto tuning:
xgb_reg <- train_evaluate_reg("xgbLinear", name = "XGBoost", tuneLength = 3)
## eXtreme Gradient Boosting
##
## 520 samples
## 32 predictor
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ...
## Resampling results across tuning parameters:
##
## lambda alpha nrounds RMSE Rsquared MAE
## 0e+00 0e+00 50 1.511613 0.7918378 0.9393092
## 0e+00 0e+00 100 1.512005 0.7917418 0.9397081
## 0e+00 0e+00 150 1.512005 0.7917418 0.9397081
## 0e+00 1e-04 50 1.505989 0.7930916 0.9383068
## 0e+00 1e-04 100 1.506339 0.7930240 0.9387463
## 0e+00 1e-04 150 1.506339 0.7930240 0.9387465
## 0e+00 1e-01 50 1.510758 0.7910783 0.9447155
## 0e+00 1e-01 100 1.510970 0.7910251 0.9449051
## 0e+00 1e-01 150 1.510970 0.7910251 0.9449051
## 1e-04 0e+00 50 1.506888 0.7926763 0.9373235
## 1e-04 0e+00 100 1.507267 0.7925891 0.9377648
## 1e-04 0e+00 150 1.507267 0.7925891 0.9377648
## 1e-04 1e-04 50 1.508816 0.7921554 0.9391927
## 1e-04 1e-04 100 1.509250 0.7920720 0.9397714
## 1e-04 1e-04 150 1.509250 0.7920720 0.9397714
## 1e-04 1e-01 50 1.507130 0.7922016 0.9429704
## 1e-04 1e-01 100 1.507336 0.7921494 0.9432016
## 1e-04 1e-01 150 1.507336 0.7921494 0.9432016
## 1e-01 0e+00 50 1.499107 0.7932489 0.9410474
## 1e-01 0e+00 100 1.499690 0.7931013 0.9416606
## 1e-01 0e+00 150 1.499690 0.7931014 0.9416605
## 1e-01 1e-04 50 1.499257 0.7930752 0.9417873
## 1e-01 1e-04 100 1.500019 0.7928812 0.9425275
## 1e-01 1e-04 150 1.500019 0.7928811 0.9425286
## 1e-01 1e-01 50 1.491578 0.7960916 0.9394836
## 1e-01 1e-01 100 1.492072 0.7959774 0.9399642
## 1e-01 1e-01 150 1.492072 0.7959774 0.9399642
##
## Tuning parameter 'eta' was held constant at a value of 0.3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 50, lambda = 0.1, alpha
## = 0.1 and eta = 0.3.
Best parameters from tuning:
get_best_result(xgb_reg$model)
## lambda alpha nrounds eta RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0.1 0.1 50 0.3 1.491578 0.7960916 0.9394836 0.3821073 0.09009631
## MAESD
## 1 0.1550427
Plot the trained model:
plot(xgb_reg$model)
Training model with manual tuning:
tuneGrid_xgb_reg <- expand.grid(
eta = 0.3,
lambda = 0.1,
alpha = c(0.001, 0.01, 0.1),
nrounds = seq(10, 70, 10)
)
xgb_reg_manual_tune <- train_evaluate_reg("xgbLinear", name = "XGBoost", tuneGrid = tuneGrid_xgb_reg)
## eXtreme Gradient Boosting
##
## 520 samples
## 32 predictor
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ...
## Resampling results across tuning parameters:
##
## alpha nrounds RMSE Rsquared MAE
## 0.001 10 1.518110 0.7981275 0.9601702
## 0.001 20 1.490676 0.7950565 0.9326202
## 0.001 30 1.496577 0.7936059 0.9404913
## 0.001 40 1.498049 0.7932848 0.9418572
## 0.001 50 1.499322 0.7929895 0.9431174
## 0.001 60 1.499623 0.7929325 0.9435549
## 0.001 70 1.499770 0.7929032 0.9436339
## 0.010 10 1.518822 0.7984699 0.9556418
## 0.010 20 1.487909 0.7964229 0.9250986
## 0.010 30 1.494958 0.7947124 0.9316516
## 0.010 40 1.497355 0.7941440 0.9342274
## 0.010 50 1.498089 0.7939856 0.9350578
## 0.010 60 1.498446 0.7938916 0.9354407
## 0.010 70 1.498518 0.7938762 0.9355422
## 0.100 10 1.515027 0.8002019 0.9603495
## 0.100 20 1.484927 0.7975879 0.9298917
## 0.100 30 1.489417 0.7965265 0.9365004
## 0.100 40 1.490933 0.7962359 0.9383227
## 0.100 50 1.491578 0.7960916 0.9394836
## 0.100 60 1.492002 0.7959988 0.9399262
## 0.100 70 1.492072 0.7959768 0.9399641
##
## Tuning parameter 'lambda' was held constant at a value of 0.1
## Tuning
## parameter 'eta' was held constant at a value of 0.3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 20, lambda = 0.1, alpha
## = 0.1 and eta = 0.3.
Best parameters from tuning:
get_best_result(xgb_reg_manual_tune$model)
## eta lambda alpha nrounds RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0.3 0.1 0.1 20 1.484927 0.7975879 0.9298917 0.3860466 0.09051595
## MAESD
## 1 0.1561429
Plot the trained model:
plot(xgb_reg_manual_tune$model)
Evaluation of model:
xgb_reg_manual_tune$metrics
## Model RMSE Rsquared MAE
## 1 XGBoost 1.250256 0.8338254 0.8697652
glmnet
Training model with auto tuning:
lr_reg <- train_evaluate_reg("glmnet", name = "Linear Regression", tuneLength = 10)
## glmnet
##
## 520 samples
## 32 predictor
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.1 0.001388258 1.395236 0.8165417 0.9295931
## 0.1 0.003207056 1.395236 0.8165417 0.9295931
## 0.1 0.007408715 1.395236 0.8165417 0.9295931
## 0.1 0.017115092 1.391001 0.8174735 0.9254500
## 0.1 0.039538082 1.378930 0.8201344 0.9135626
## 0.1 0.091338097 1.359258 0.8246227 0.8937805
## 0.1 0.211002851 1.336974 0.8305641 0.8640433
## 0.1 0.487443953 1.334994 0.8354640 0.8454409
## 0.1 1.126058753 1.416341 0.8334863 0.9057614
## 0.1 2.601341770 1.661772 0.8275410 1.0899052
## 0.2 0.001388258 1.394987 0.8166643 0.9296277
## 0.2 0.003207056 1.394987 0.8166643 0.9296277
## 0.2 0.007408715 1.393455 0.8170050 0.9281990
## 0.2 0.017115092 1.383181 0.8193159 0.9185476
## 0.2 0.039538082 1.364119 0.8236290 0.9006899
## 0.2 0.091338097 1.333282 0.8307732 0.8684140
## 0.2 0.211002851 1.303239 0.8388678 0.8293382
## 0.2 0.487443953 1.317026 0.8419026 0.8219884
## 0.2 1.126058753 1.430309 0.8390100 0.8952581
## 0.2 2.601341770 1.753044 0.8379850 1.1504334
## 0.3 0.001388258 1.394869 0.8167031 0.9296248
## 0.3 0.003207056 1.394869 0.8167031 0.9296248
## 0.3 0.007408715 1.389865 0.8178422 0.9250100
## 0.3 0.017115092 1.376191 0.8209509 0.9124625
## 0.3 0.039538082 1.350549 0.8267956 0.8883604
## 0.3 0.091338097 1.313632 0.8353737 0.8489229
## 0.3 0.211002851 1.285210 0.8434454 0.8122952
## 0.3 0.487443953 1.315619 0.8441240 0.8123446
## 0.3 1.126058753 1.447786 0.8432459 0.9018030
## 0.3 2.601341770 1.857464 0.8404136 1.2383812
## 0.4 0.001388258 1.394780 0.8167352 0.9296183
## 0.4 0.003207056 1.394619 0.8167700 0.9294568
## 0.4 0.007408715 1.386468 0.8186380 0.9220031
## 0.4 0.017115092 1.369658 0.8224754 0.9067854
## 0.4 0.039538082 1.338745 0.8295142 0.8769667
## 0.4 0.091338097 1.298348 0.8389306 0.8344893
## 0.4 0.211002851 1.276474 0.8458366 0.8036659
## 0.4 0.487443953 1.310285 0.8472172 0.8041133
## 0.4 1.126058753 1.473854 0.8455473 0.9206746
## 0.4 2.601341770 1.969191 0.8433381 1.3368708
## 0.5 0.001388258 1.394752 0.8167549 0.9295630
## 0.5 0.003207056 1.393504 0.8170373 0.9284576
## 0.5 0.007408715 1.383264 0.8193875 0.9191885
## 0.5 0.017115092 1.363393 0.8239354 0.9013340
## 0.5 0.039538082 1.328501 0.8318604 0.8668626
## 0.5 0.091338097 1.286959 0.8415481 0.8244149
## 0.5 0.211002851 1.272713 0.8470640 0.7997977
## 0.5 0.487443953 1.309547 0.8491135 0.8013188
## 0.5 1.126058753 1.501877 0.8478301 0.9448245
## 0.5 2.601341770 2.095044 0.8468928 1.4479330
## 0.6 0.001388258 1.394769 0.8167525 0.9296063
## 0.6 0.003207056 1.391976 0.8173928 0.9271044
## 0.6 0.007408715 1.380195 0.8201043 0.9165016
## 0.6 0.017115092 1.357236 0.8253618 0.8958307
## 0.6 0.039538082 1.319490 0.8339241 0.8581293
## 0.6 0.091338097 1.278487 0.8434941 0.8170438
## 0.6 0.211002851 1.269765 0.8480685 0.7968888
## 0.6 0.487443953 1.310810 0.8505788 0.8024391
## 0.6 1.126058753 1.534193 0.8497732 0.9727712
## 0.6 2.601341770 2.235487 0.8506684 1.5736649
## 0.7 0.001388258 1.394891 0.8167302 0.9297422
## 0.7 0.003207056 1.390445 0.8177502 0.9257484
## 0.7 0.007408715 1.377189 0.8208053 0.9139016
## 0.7 0.017115092 1.351399 0.8267086 0.8903188
## 0.7 0.039538082 1.311385 0.8357719 0.8505177
## 0.7 0.091338097 1.272503 0.8449121 0.8120996
## 0.7 0.211002851 1.265901 0.8492333 0.7939290
## 0.7 0.487443953 1.312701 0.8518868 0.8058587
## 0.7 1.126058753 1.569746 0.8514890 1.0052265
## 0.7 2.601341770 2.391487 0.8517634 1.7114318
## 0.8 0.001388258 1.394901 0.8167291 0.9297347
## 0.8 0.003207056 1.388944 0.8181009 0.9244193
## 0.8 0.007408715 1.374259 0.8214892 0.9113534
## 0.8 0.017115092 1.345954 0.8279555 0.8850432
## 0.8 0.039538082 1.304099 0.8374251 0.8439686
## 0.8 0.091338097 1.268101 0.8459609 0.8084763
## 0.8 0.211002851 1.262232 0.8503046 0.7919023
## 0.8 0.487443953 1.317599 0.8523296 0.8114253
## 0.8 1.126058753 1.608952 0.8522654 1.0414983
## 0.8 2.601341770 2.544513 0.8492426 1.8444158
## 0.9 0.001388258 1.394882 0.8167345 0.9297440
## 0.9 0.003207056 1.387486 0.8184429 0.9231338
## 0.9 0.007408715 1.371428 0.8221452 0.9088785
## 0.9 0.017115092 1.340883 0.8291087 0.8800074
## 0.9 0.039538082 1.297787 0.8388634 0.8384294
## 0.9 0.091338097 1.265039 0.8467000 0.8061509
## 0.9 0.211002851 1.260593 0.8508655 0.7918476
## 0.9 0.487443953 1.324513 0.8520985 0.8183692
## 0.9 1.126058753 1.652316 0.8503275 1.0865619
## 0.9 2.601341770 2.704508 0.8492426 1.9770810
## 1.0 0.001388258 1.394612 0.8167989 0.9295231
## 1.0 0.003207056 1.386058 0.8187784 0.9218740
## 1.0 0.007408715 1.368660 0.8227886 0.9064881
## 1.0 0.017115092 1.336081 0.8301996 0.8752796
## 1.0 0.039538082 1.292377 0.8400863 0.8339090
## 1.0 0.091338097 1.263038 0.8471983 0.8052305
## 1.0 0.211002851 1.260452 0.8510329 0.7928667
## 1.0 0.487443953 1.333989 0.8508628 0.8276583
## 1.0 1.126058753 1.686923 0.8492426 1.1216198
## 1.0 2.601341770 2.893369 0.8492426 2.1324632
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2110029.
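For reference, glmnet's `alpha` mixes the ridge and lasso penalties and `lambda` scales their total, so the selected `alpha = 1` corresponds to a pure lasso fit. The sketch below only evaluates the penalty term for a made-up coefficient vector; `elastic_net_penalty` and `beta` are illustrative names and are not objects from this report.

```r
# Elastic-net penalty in the form used by glmnet:
#   lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
elastic_net_penalty <- function(beta, alpha, lambda) {
  ridge_part <- (1 - alpha) / 2 * sum(beta^2)
  lasso_part <- alpha * sum(abs(beta))
  lambda * (ridge_part + lasso_part)
}

beta <- c(0.5, -1.2, 0)  # hypothetical coefficients, for illustration only

# alpha = 1 keeps only the absolute-value (lasso) term.
lasso_pen <- elastic_net_penalty(beta, alpha = 1, lambda = 0.2110029)
# alpha = 0 keeps only the squared (ridge) term.
ridge_pen <- elastic_net_penalty(beta, alpha = 0, lambda = 0.2110029)

print(c(lasso = lasso_pen, ridge = ridge_pen))
```

Because the lasso term can shrink coefficients exactly to zero, a fit with `alpha = 1` also acts as a form of variable selection.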
Best parameters from tuning:
get_best_result(lr_reg$model)
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.2110029 1.260452 0.8510329 0.7928667 0.4019447 0.07640349 0.1336843
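The helper `get_best_result()` is defined outside this section. A minimal sketch of what such a helper might do is shown below, assuming it looks up the row of the resampling results that matches the selected tuning parameters. `get_best_result_sketch` and `mock_fit` are hypothetical names; the mock object only imitates the `results` and `bestTune` fields of a caret `train` object, with values copied from the `alpha = 1` rows of the table above.

```r
# Hypothetical stand-in for get_best_result(): return the resampling row
# whose row name matches the selected tuning parameters.
get_best_result_sketch <- function(fit) {
  best <- which(rownames(fit$results) == rownames(fit$bestTune))
  out <- fit$results[best, ]
  rownames(out) <- NULL
  out
}

# Mock object with the two fields used above (not a real caret fit).
mock_fit <- list(
  results = data.frame(
    alpha  = c(1, 1, 1),
    lambda = c(0.091338097, 0.211002851, 0.487443953),
    RMSE   = c(1.263038, 1.260452, 1.333989)
  ),
  bestTune = data.frame(alpha = 1, lambda = 0.211002851, row.names = "2")
)

best_row <- get_best_result_sketch(mock_fit)
print(best_row)
```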
Plot the trained model:
plot(lr_reg$model)
Train the model with a manually specified tuning grid, holding alpha at 1 (lasso) and varying lambda around the value selected above:
tuneGrid_lr_reg <- expand.grid(
  alpha = c(1),
  lambda = seq(0.00, 0.30, by = 0.05)
)
lr_reg_manual_tune <- train_evaluate_reg("glmnet", name = "Linear Regression", tuneGrid = tuneGrid_lr_reg)
## glmnet
##
## 520 samples
## 32 predictor
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 467, 469, 467, 468, 469, 469, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00 1.394938 0.8167240 0.9298049
## 0.05 1.281021 0.8426330 0.8235131
## 0.10 1.261976 0.8476099 0.8039718
## 0.15 1.258282 0.8496377 0.7969222
## 0.20 1.259368 0.8508915 0.7932318
## 0.25 1.265253 0.8514903 0.7922721
## 0.30 1.274531 0.8517463 0.7938045
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.15.
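The seven rows in the table above follow directly from the grid definition: `expand.grid()` returns one row per combination of the supplied values. The sketch below rebuilds the same grid under a different, illustrative name (`grid_sketch`) and counts the resampled evaluations, assuming every candidate is scored on each of the 10 x 10 resamples reported in the output.

```r
# One row per (alpha, lambda) combination.
grid_sketch <- expand.grid(
  alpha  = 1,
  lambda = seq(0.00, 0.30, by = 0.05)
)
print(grid_sketch)

n_candidates <- nrow(grid_sketch)

# Assumption: each candidate is evaluated on 10 folds x 10 repeats.
n_resampled_evaluations <- n_candidates * 10 * 10
print(n_resampled_evaluations)
```

Note that `lambda = 0` corresponds to an unpenalized fit, which is why its RMSE sits close to the smallest-lambda rows of the automatic grid.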
Best parameters from tuning:
get_best_result(lr_reg_manual_tune$model)
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.15 1.258282 0.8496377 0.7969222 0.398786 0.07721303 0.1329245
Plot the trained model:
plot(lr_reg_manual_tune$model)
Evaluate the manually tuned model:
lr_reg_manual_tune$metrics
## Model RMSE Rsquared MAE
## 1 Linear Regression 1.117378 0.8643585 0.7226734
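The three reported metrics can be computed by hand as follows. This is a minimal base R sketch on made-up observed and predicted grades; `regression_metrics`, `obs`, and `pred` are illustrative names. `Rsquared` is taken as the squared correlation between observations and predictions, which is how caret's `postResample()` reports it.

```r
regression_metrics <- function(obs, pred) {
  c(
    RMSE     = sqrt(mean((obs - pred)^2)),  # root mean squared error
    Rsquared = cor(obs, pred)^2,            # squared Pearson correlation
    MAE      = mean(abs(obs - pred))        # mean absolute error
  )
}

obs  <- c(10, 12, 8, 15, 11)  # hypothetical final grades
pred <- c(11, 12, 9, 13, 11)  # hypothetical predictions

m <- regression_metrics(obs, pred)
print(round(m, 4))
```

RMSE penalizes large errors more heavily than MAE, so a gap between the two (as in the table above) suggests that a few students are predicted much less accurately than the rest.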
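Define a helper function to train a classification model.
train_model_clf <- function(method = "",
                            data = train_set_clf,
                            tuneGrid = NULL,
                            tuneLength = 3) {
  # Fix the seed so that the resampling splits are reproducible.
  set.seed(42)
  fit <- train(
    result ~ .,
    data = data,
    method = method,
    trControl = fit_control,
    # Center and scale the predictors before fitting.
    preProcess = c("center", "scale"),
    # When tuneGrid is NULL, caret builds a default grid from tuneLength.
    tuneGrid = tuneGrid,
    tuneLength = tuneLength
  )
  print(fit)
  return(fit)
}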
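The resampling scheme comes from `fit_control`, which is defined outside this section; the printed output reports it as "Cross-Validated (10 fold, repeated 10 times)", which in caret would typically correspond to `trainControl(method = "repeatedcv", number = 10, repeats = 10)`. The base R sketch below illustrates the idea with plain (unstratified) fold assignment, so it is a simplification of what caret does; `make_folds`, `fold_ids`, and `train_sizes` are illustrative names.

```r
set.seed(42)

n       <- 520  # training rows reported in the output
k       <- 10   # folds per repeat
repeats <- 10   # number of repeats

# Randomly assign each row to one of k folds of (near-)equal size.
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# One column of fold labels per repeat (n x repeats matrix).
fold_ids <- replicate(repeats, make_folds(n, k))

# Training-set size for each of the k * repeats resamples.
train_sizes <- apply(fold_ids, 2, function(f) n - tabulate(f, nbins = k))

print(dim(train_sizes))    # k rows, one column per repeat
print(range(train_sizes))
```

Each model candidate is therefore fitted and scored 100 times, and the accuracy and Kappa columns in the tables are averages over those resamples.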
Define a helper function to evaluate a classification model on the test set.
evaluate_model_clf <- function(model = NULL, name = "") {
  pred <- predict(model, test_set_clf)
  # The default positive class is the first factor level ("fail").
  cm_fail <- confusionMatrix(pred, test_set_clf$result, mode = "prec_recall")
  print(cm_fail)
  # Recompute with "pass" as the positive class for the pass-side metrics.
  cm_pass <- confusionMatrix(
    pred,
    test_set_clf$result,
    mode = "prec_recall",
    positive = "pass"
  )
  return(data.frame(
    Model = name,
    Accuracy = round(cm_fail$overall[["Accuracy"]], 3),
    Precision.Fail = round(cm_fail$byClass[["Precision"]], 3),
    Recall.Fail = round(cm_fail$byClass[["Recall"]], 3),
    F1.Fail = round(cm_fail$byClass[["F1"]], 3),
    Precision.Pass = round(cm_pass$byClass[["Precision"]], 3),
    Recall.Pass = round(cm_pass$byClass[["Recall"]], 3),
    F1.Pass = round(cm_pass$byClass[["F1"]], 3)
  ))
}
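The per-class columns returned by this helper can be checked by hand. The sketch below recomputes precision, recall, and F1 from raw counts, using the test-set counts of the baseline decision tree reported later in this section (13 fails predicted as fail, 1 pass predicted as fail, 7 fails predicted as pass, 108 passes predicted as pass). `prf_from_counts` is an illustrative name.

```r
# Precision, recall and F1 from true positives, false positives and
# false negatives of the class treated as "positive".
prf_from_counts <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  round(c(Precision = precision, Recall = recall, F1 = f1), 3)
}

# "fail" as the positive class.
fail_view <- prf_from_counts(tp = 13, fp = 1, fn = 7)
# "pass" as the positive class: the roles of the error cells swap.
pass_view <- prf_from_counts(tp = 108, fp = 7, fn = 1)

print(rbind(fail = fail_view, pass = pass_view))
```

Reporting both views matters here because the classes are imbalanced: the pass-side metrics stay high even when a model misses a large share of the failing students.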
dt_clf <- train_model_clf("rpart2")
## note: only 2 possible values of the max tree depth from the initial fit.
## Truncating the grid to 2 .
##
## CART
##
## 520 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ...
## Resampling results across tuning parameters:
##
## maxdepth Accuracy Kappa
## 1 0.9365385 0.7199437
## 4 0.9286538 0.7122307
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 1.
plot(dt_clf)
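No further tuning is required: the grid was truncated to the two available depths, and maxdepth = 1 already gives the highest cross-validated accuracy.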
dt_metrics_clf <- evaluate_model_clf(dt_clf, "Decision Tree")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 13 1
## pass 7 108
##
## Accuracy : 0.938
## 95% CI : (0.8815, 0.9728)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.001075
##
## Kappa : 0.7303
##
## Mcnemar's Test P-Value : 0.077100
##
## Precision : 0.9286
## Recall : 0.6500
## F1 : 0.7647
## Prevalence : 0.1550
## Detection Rate : 0.1008
## Detection Prevalence : 0.1085
## Balanced Accuracy : 0.8204
##
## 'Positive' Class : fail
##
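The summary statistics in the output above can be reproduced from the four cells of the confusion matrix. The following base R sketch recomputes accuracy, the no-information rate, Cohen's Kappa, and balanced accuracy; the variable names are illustrative.

```r
# Confusion matrix from the output above (rows: prediction, cols: reference).
cm <- matrix(
  c(13, 7, 1, 108),
  nrow = 2,
  dimnames = list(Prediction = c("fail", "pass"),
                  Reference  = c("fail", "pass"))
)

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy of always predicting the majority class.
nir <- max(colSums(cm)) / n

# Cohen's Kappa: agreement beyond what the marginals would give by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Balanced accuracy: mean of the recall on each class.
recall_fail       <- cm["fail", "fail"] / sum(cm[, "fail"])
recall_pass       <- cm["pass", "pass"] / sum(cm[, "pass"])
balanced_accuracy <- (recall_fail + recall_pass) / 2

print(round(c(Accuracy = accuracy, NIR = nir, Kappa = cohen_kappa,
              BalancedAccuracy = balanced_accuracy), 4))
```

Since roughly 84.5% of the test students pass, accuracy alone is a weak indicator; Kappa and balanced accuracy are more informative for comparing the models in this section.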
down_dt_clf <- train_model_clf("rpart2", train_set_clf_down)
## note: only 2 possible values of the max tree depth from the initial fit.
## Truncating the grid to 2 .
##
## CART
##
## 160 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## maxdepth Accuracy Kappa
## 1 0.838125 0.67625
## 2 0.823750 0.64750
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 1.
plot(down_dt_clf)
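No further tuning is required: the grid was truncated to the two available depths, and maxdepth = 1 already gives the highest cross-validated accuracy.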
down_dt_metrics_clf <- evaluate_model_clf(down_dt_clf, "Decision Tree")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 17 15
## pass 3 94
##
## Accuracy : 0.8605
## 95% CI : (0.7885, 0.9152)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.366920
##
## Kappa : 0.5722
##
## Mcnemar's Test P-Value : 0.009522
##
## Precision : 0.5312
## Recall : 0.8500
## F1 : 0.6538
## Prevalence : 0.1550
## Detection Rate : 0.1318
## Detection Prevalence : 0.2481
## Balanced Accuracy : 0.8562
##
## 'Positive' Class : fail
##
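The balanced training sets `train_set_clf_down` (160 rows) and `train_set_clf_up` (880 rows) are created outside this section. Their sizes are consistent with a 520-row training set of 80 fails and 440 passes, which is the assumption used in the sketch below. caret offers `downSample()` and `upSample()` for this purpose; whether the report used them is not shown here, so the base R code only illustrates the idea.

```r
set.seed(42)

# Assumed class counts, inferred from the sample sizes in the output.
labels <- factor(c(rep("fail", 80), rep("pass", 440)))

idx_by_class <- split(seq_along(labels), labels)
n_min <- min(lengths(idx_by_class))
n_max <- max(lengths(idx_by_class))

# Down-sampling: keep a random subset of each class, sized like the minority.
down_idx <- unlist(lapply(idx_by_class, function(i) sample(i, n_min)))

# Up-sampling: keep every row and draw extra minority rows with replacement
# until each class is as large as the majority.
up_idx <- unlist(lapply(idx_by_class, function(i) {
  c(i, sample(i, n_max - length(i), replace = TRUE))
}))

print(table(labels[down_idx]))
print(table(labels[up_idx]))
```

Down-sampling discards most of the pass rows, while up-sampling repeats fail rows; both trade some precision on the fail class for higher recall, which matches the pattern in the confusion matrices of this section.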
up_dt_clf <- train_model_clf("rpart2", train_set_clf_up)
## CART
##
## 880 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
## Resampling results across tuning parameters:
##
## maxdepth Accuracy Kappa
## 1 0.8892045 0.7784091
## 3 0.9173864 0.8347727
## 4 0.9296591 0.8593182
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 4.
plot(up_dt_clf)
Perform hyperparameter tuning using tuneLength.
up_dt_clf <- train_model_clf("rpart2", train_set_clf_up, tuneLength = 10)
## note: only 7 possible values of the max tree depth from the initial fit.
## Truncating the grid to 7 .
##
## CART
##
## 880 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
## Resampling results across tuning parameters:
##
## maxdepth Accuracy Kappa
## 1 0.8892045 0.7784091
## 3 0.9173864 0.8347727
## 4 0.9238636 0.8477273
## 7 0.9294318 0.8588636
## 8 0.9294318 0.8588636
## 11 0.9294318 0.8588636
## 12 0.9294318 0.8588636
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 7.
plot(up_dt_clf)
Tuning done.
up_dt_metrics_clf <- evaluate_model_clf(up_dt_clf, "Decision Tree")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 16 11
## pass 4 98
##
## Accuracy : 0.8837
## 95% CI : (0.8155, 0.9334)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.1352
##
## Kappa : 0.6117
##
## Mcnemar's Test P-Value : 0.1213
##
## Precision : 0.5926
## Recall : 0.8000
## F1 : 0.6809
## Prevalence : 0.1550
## Detection Rate : 0.1240
## Detection Prevalence : 0.2093
## Balanced Accuracy : 0.8495
##
## 'Positive' Class : fail
##
rf_clf <- train_model_clf("ranger")
## Random Forest
##
## 520 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.8744231 0.2633357
## 2 extratrees 0.8594231 0.1295123
## 37 gini 0.9307692 0.7051503
## 37 extratrees 0.9275000 0.6794896
## 72 gini 0.9313462 0.7069938
## 72 extratrees 0.9350000 0.7230234
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 72, splitrule = extratrees
## and min.node.size = 1.
plot(rf_clf)
Perform hyperparameter tuning using tuneLength.
rf_clf <- train_model_clf("ranger", tuneLength = 7)
## Random Forest
##
## 520 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.8753846 0.2723578
## 2 extratrees 0.8582692 0.1202263
## 13 gini 0.9336538 0.7118338
## 13 extratrees 0.9017308 0.5067021
## 25 gini 0.9311538 0.7082097
## 25 extratrees 0.9217308 0.6373321
## 37 gini 0.9313462 0.7085045
## 37 extratrees 0.9257692 0.6709289
## 48 gini 0.9296154 0.7010125
## 48 extratrees 0.9300000 0.6973189
## 60 gini 0.9315385 0.7095738
## 60 extratrees 0.9351923 0.7227439
## 72 gini 0.9311538 0.7077921
## 72 extratrees 0.9344231 0.7236122
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 60, splitrule = extratrees
## and min.node.size = 1.
plot(rf_clf)
Tuning done.
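The `mtry` candidates in the two tables above (2, 37, 72 and 2, 13, 25, 37, 48, 60, 72) are consistent with an evenly spaced sequence from 2 to the 72 preprocessed predictors, rounded down. The sketch below reproduces those values in base R; that caret builds its default grid exactly this way is an assumption here rather than something the output states.

```r
p <- 72  # number of predictors after dummy encoding, as reported in the output

# Evenly spaced candidates between 2 and p, rounded down to integers.
mtry_len3 <- floor(seq(2, p, length.out = 3))
mtry_len7 <- floor(seq(2, p, length.out = 7))

print(mtry_len3)
print(mtry_len7)
```

A larger `tuneLength` therefore refines the same range rather than extending it, which explains why the best accuracy changes only slightly between the two runs.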
rf_metrics_clf <- evaluate_model_clf(rf_clf, "Random Forest")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 14 4
## pass 6 105
##
## Accuracy : 0.9225
## 95% CI : (0.8621, 0.9622)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.006716
##
## Kappa : 0.6915
##
## Mcnemar's Test P-Value : 0.751830
##
## Precision : 0.7778
## Recall : 0.7000
## F1 : 0.7368
## Prevalence : 0.1550
## Detection Rate : 0.1085
## Detection Prevalence : 0.1395
## Balanced Accuracy : 0.8317
##
## 'Positive' Class : fail
##
down_rf_clf <- train_model_clf("ranger", train_set_clf_down)
## Random Forest
##
## 160 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.795625 0.59125
## 2 extratrees 0.757500 0.51500
## 37 gini 0.879375 0.75875
## 37 extratrees 0.846875 0.69375
## 72 gini 0.880000 0.76000
## 72 extratrees 0.868125 0.73625
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 72, splitrule = gini
## and min.node.size = 1.
plot(down_rf_clf)
Perform hyperparameter tuning using tuneLength.
down_rf_clf <- train_model_clf("ranger", train_set_clf_down, tuneLength = 7)
## Random Forest
##
## 160 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.801875 0.60375
## 2 extratrees 0.746875 0.49375
## 13 gini 0.876250 0.75250
## 13 extratrees 0.814375 0.62875
## 25 gini 0.891250 0.78250
## 25 extratrees 0.843750 0.68750
## 37 gini 0.886250 0.77250
## 37 extratrees 0.853750 0.70750
## 48 gini 0.880000 0.76000
## 48 extratrees 0.854375 0.70875
## 60 gini 0.881875 0.76375
## 60 extratrees 0.869375 0.73875
## 72 gini 0.879375 0.75875
## 72 extratrees 0.873125 0.74625
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 25, splitrule = gini
## and min.node.size = 1.
plot(down_rf_clf)
Tuning done.
down_rf_metrics_clf <- evaluate_model_clf(down_rf_clf, "Random Forest")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 17 13
## pass 3 96
##
## Accuracy : 0.876
## 95% CI : (0.8064, 0.9274)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.19935
##
## Kappa : 0.6069
##
## Mcnemar's Test P-Value : 0.02445
##
## Precision : 0.5667
## Recall : 0.8500
## F1 : 0.6800
## Prevalence : 0.1550
## Detection Rate : 0.1318
## Detection Prevalence : 0.2326
## Balanced Accuracy : 0.8654
##
## 'Positive' Class : fail
##
up_rf_clf <- train_model_clf("ranger", train_set_clf_up)
## Random Forest
##
## 880 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.9803409 0.9606818
## 2 extratrees 0.9795455 0.9590909
## 37 gini 0.9789773 0.9579545
## 37 extratrees 0.9820455 0.9640909
## 72 gini 0.9786364 0.9572727
## 72 extratrees 0.9786364 0.9572727
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 37, splitrule = extratrees
## and min.node.size = 1.
plot(up_rf_clf)
Perform hyperparameter tuning using tuneLength.
up_rf_clf <- train_model_clf("ranger", train_set_clf_up, tuneLength = 7)
## Random Forest
##
## 880 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.9812500 0.9625000
## 2 extratrees 0.9801136 0.9602273
## 13 gini 0.9807955 0.9615909
## 13 extratrees 0.9819318 0.9638636
## 25 gini 0.9794318 0.9588636
## 25 extratrees 0.9802273 0.9604545
## 37 gini 0.9785227 0.9570455
## 37 extratrees 0.9815909 0.9631818
## 48 gini 0.9786364 0.9572727
## 48 extratrees 0.9817045 0.9634091
## 60 gini 0.9787500 0.9575000
## 60 extratrees 0.9803409 0.9606818
## 72 gini 0.9781818 0.9563636
## 72 extratrees 0.9780682 0.9561364
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 13, splitrule = extratrees
## and min.node.size = 1.
plot(up_rf_clf)
Tuning done.
up_rf_metrics_clf <- evaluate_model_clf(up_rf_clf, "Random Forest")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 12 3
## pass 8 106
##
## Accuracy : 0.9147
## 95% CI : (0.8525, 0.9567)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.01442
##
## Kappa : 0.6375
##
## Mcnemar's Test P-Value : 0.22780
##
## Precision : 0.80000
## Recall : 0.60000
## F1 : 0.68571
## Prevalence : 0.15504
## Detection Rate : 0.09302
## Detection Prevalence : 0.11628
## Balanced Accuracy : 0.78624
##
## 'Positive' Class : fail
##
svm_poly_clf <- train_model_clf("svmPoly")
## Support Vector Machines with Polynomial Kernel
##
## 520 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ...
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa
## 1 0.001 0.25 0.8461538 0.000000000
## 1 0.001 0.50 0.8461538 0.000000000
## 1 0.001 1.00 0.8461538 0.000000000
## 1 0.010 0.25 0.8546154 0.083940229
## 1 0.010 0.50 0.8732692 0.260599844
## 1 0.010 1.00 0.8817308 0.394972860
## 1 0.100 0.25 0.9042308 0.574306440
## 1 0.100 0.50 0.9138462 0.636752983
## 1 0.100 1.00 0.9167308 0.660537348
## 2 0.001 0.25 0.8461538 0.000000000
## 2 0.001 0.50 0.8461538 0.000000000
## 2 0.001 1.00 0.8492308 0.030863195
## 2 0.010 0.25 0.8748077 0.276356280
## 2 0.010 0.50 0.8878846 0.454151837
## 2 0.010 1.00 0.9117308 0.611795906
## 2 0.100 0.25 0.8886538 0.501281678
## 2 0.100 0.50 0.8886538 0.501281678
## 2 0.100 1.00 0.8886538 0.501281678
## 3 0.001 0.25 0.8461538 0.000000000
## 3 0.001 0.50 0.8465385 0.003893805
## 3 0.001 1.00 0.8598077 0.132831582
## 3 0.010 0.25 0.8892308 0.447824384
## 3 0.010 0.50 0.9067308 0.579955776
## 3 0.010 1.00 0.9036538 0.576579426
## 3 0.100 0.25 0.8775000 0.358828319
## 3 0.100 0.50 0.8775000 0.358828319
## 3 0.100 1.00 0.8775000 0.358828319
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1, scale = 0.1 and C = 1.
plot(svm_poly_clf)
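To make the three tuning parameters concrete: `degree` and `scale` define the polynomial kernel, while `C` is the cost of margin violations. The sketch below evaluates the kernel in the form used by kernlab's `polydot`, `(scale * <x, y> + offset)^degree`, assuming the offset is fixed at 1. `poly_kernel`, `x`, and `y` are illustrative names and values.

```r
# Polynomial kernel: (scale * <x, y> + offset)^degree
poly_kernel <- function(x, y, degree, scale, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}

x <- c(1, 2)   # hypothetical feature vectors
y <- c(3, -1)

# degree = 1 gives an affine function of the inner product (a linear SVM).
k_linear <- poly_kernel(x, y, degree = 1, scale = 0.1)
# degree = 3 adds interaction terms up to third order.
k_cubic  <- poly_kernel(x, y, degree = 3, scale = 0.1)

print(c(linear = k_linear, cubic = k_cubic))
```

With `degree = 1` selected on the original training set, the fitted model is effectively a linear SVM, so the remaining tuning effort concentrates on `scale` and `C`.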
Perform hyperparameter tuning using tuneGrid.
svm_poly_tuneGrid_clf <- expand.grid(
  degree = 1,
  scale = c(0.01, 0.1),
  C = seq(0.1, 1.5, 0.05)
)
svm_poly_clf <- train_model_clf("svmPoly", tuneGrid = svm_poly_tuneGrid_clf)
## Support Vector Machines with Polynomial Kernel
##
## 520 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ...
## Resampling results across tuning parameters:
##
## scale C Accuracy Kappa
## 0.01 0.10 0.8461538 0.00000000
## 0.01 0.15 0.8461538 0.00000000
## 0.01 0.20 0.8492308 0.03086319
## 0.01 0.25 0.8546154 0.08394023
## 0.01 0.30 0.8576923 0.11365443
## 0.01 0.35 0.8638462 0.17052312
## 0.01 0.40 0.8682692 0.21034175
## 0.01 0.45 0.8713462 0.24060396
## 0.01 0.50 0.8732692 0.26059984
## 0.01 0.55 0.8755769 0.28188147
## 0.01 0.60 0.8753846 0.28132437
## 0.01 0.65 0.8773077 0.29947793
## 0.01 0.70 0.8782692 0.31299249
## 0.01 0.75 0.8803846 0.33723638
## 0.01 0.80 0.8819231 0.35690555
## 0.01 0.85 0.8821154 0.36999562
## 0.01 0.90 0.8826923 0.38047267
## 0.01 0.95 0.8825000 0.39115441
## 0.01 1.00 0.8817308 0.39497286
## 0.01 1.05 0.8821154 0.40146122
## 0.01 1.10 0.8838462 0.41961460
## 0.01 1.15 0.8842308 0.42965830
## 0.01 1.20 0.8853846 0.43706948
## 0.01 1.25 0.8878846 0.45460578
## 0.01 1.30 0.8900000 0.47114159
## 0.01 1.35 0.8917308 0.48367835
## 0.01 1.40 0.8919231 0.48581998
## 0.01 1.45 0.8938462 0.50044125
## 0.01 1.50 0.8953846 0.51256552
## 0.10 0.10 0.8817308 0.39497286
## 0.10 0.15 0.8953846 0.51256552
## 0.10 0.20 0.9011538 0.55073283
## 0.10 0.25 0.9042308 0.57430644
## 0.10 0.30 0.9073077 0.59887202
## 0.10 0.35 0.9082692 0.60766579
## 0.10 0.40 0.9094231 0.61349688
## 0.10 0.45 0.9109615 0.62042831
## 0.10 0.50 0.9138462 0.63675298
## 0.10 0.55 0.9163462 0.64742641
## 0.10 0.60 0.9182692 0.65957162
## 0.10 0.65 0.9200000 0.66690415
## 0.10 0.70 0.9201923 0.66990832
## 0.10 0.75 0.9200000 0.66753386
## 0.10 0.80 0.9196154 0.66678110
## 0.10 0.85 0.9188462 0.66656228
## 0.10 0.90 0.9186538 0.66640525
## 0.10 0.95 0.9173077 0.66144950
## 0.10 1.00 0.9167308 0.66053735
## 0.10 1.05 0.9163462 0.65870244
## 0.10 1.10 0.9155769 0.65680728
## 0.10 1.15 0.9161538 0.65811114
## 0.10 1.20 0.9163462 0.65900978
## 0.10 1.25 0.9167308 0.66071945
## 0.10 1.30 0.9161538 0.65865021
## 0.10 1.35 0.9155769 0.65711508
## 0.10 1.40 0.9148077 0.65527095
## 0.10 1.45 0.9140385 0.65206420
## 0.10 1.50 0.9140385 0.65320292
##
## Tuning parameter 'degree' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1, scale = 0.1 and C = 0.7.
plot(svm_poly_clf)
Tuning done.
svm_poly_metrics_clf <- evaluate_model_clf(svm_poly_clf, "Support Vector Machine")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 14 5
## pass 6 104
##
## Accuracy : 0.9147
## 95% CI : (0.8525, 0.9567)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.01442
##
## Kappa : 0.6678
##
## Mcnemar's Test P-Value : 1.00000
##
## Precision : 0.7368
## Recall : 0.7000
## F1 : 0.7179
## Prevalence : 0.1550
## Detection Rate : 0.1085
## Detection Prevalence : 0.1473
## Balanced Accuracy : 0.8271
##
## 'Positive' Class : fail
##
down_svm_poly_clf <- train_model_clf("svmPoly", train_set_clf_down)
## Support Vector Machines with Polynomial Kernel
##
## 160 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa
## 1 0.001 0.25 0.768750 0.53750
## 1 0.001 0.50 0.768750 0.53750
## 1 0.001 1.00 0.770000 0.54000
## 1 0.010 0.25 0.758750 0.51750
## 1 0.010 0.50 0.774375 0.54875
## 1 0.010 1.00 0.798750 0.59750
## 1 0.100 0.25 0.770000 0.54000
## 1 0.100 0.50 0.765625 0.53125
## 1 0.100 1.00 0.766250 0.53250
## 2 0.001 0.25 0.769375 0.53875
## 2 0.001 0.50 0.768750 0.53750
## 2 0.001 1.00 0.765625 0.53125
## 2 0.010 0.25 0.779375 0.55875
## 2 0.010 0.50 0.796875 0.59375
## 2 0.010 1.00 0.780000 0.56000
## 2 0.100 0.25 0.763125 0.52625
## 2 0.100 0.50 0.763125 0.52625
## 2 0.100 1.00 0.763125 0.52625
## 3 0.001 0.25 0.768750 0.53750
## 3 0.001 0.50 0.762500 0.52500
## 3 0.001 1.00 0.761250 0.52250
## 3 0.010 0.25 0.794375 0.58875
## 3 0.010 0.50 0.804375 0.60875
## 3 0.010 1.00 0.781250 0.56250
## 3 0.100 0.25 0.748750 0.49750
## 3 0.100 0.50 0.748750 0.49750
## 3 0.100 1.00 0.748750 0.49750
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.01 and C = 0.5.
plot(down_svm_poly_clf)
Perform hyperparameter tuning using tuneGrid.
down_svm_poly_tuneGrid_clf <- expand.grid(
  degree = 3,
  scale = 0.01,
  C = seq(0.25, 1, 0.05)
)
down_svm_poly_clf <- train_model_clf(
  "svmPoly",
  train_set_clf_down,
  tuneGrid = down_svm_poly_tuneGrid_clf
)
## Support Vector Machines with Polynomial Kernel
##
## 160 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.794375 0.58875
## 0.30 0.800000 0.60000
## 0.35 0.801875 0.60375
## 0.40 0.800625 0.60125
## 0.45 0.802500 0.60500
## 0.50 0.804375 0.60875
## 0.55 0.801875 0.60375
## 0.60 0.793125 0.58625
## 0.65 0.795000 0.59000
## 0.70 0.790625 0.58125
## 0.75 0.788750 0.57750
## 0.80 0.785000 0.57000
## 0.85 0.784375 0.56875
## 0.90 0.784375 0.56875
## 0.95 0.781250 0.56250
## 1.00 0.781250 0.56250
##
## Tuning parameter 'degree' was held constant at a value of 3
## Tuning parameter 'scale' was held constant at a value of 0.01
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.01 and C = 0.5.
plot(down_svm_poly_clf)
Tuning done.
down_svm_poly_metrics_clf <- evaluate_model_clf(down_svm_poly_clf, "Support Vector Machine")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 18 16
## pass 2 93
##
## Accuracy : 0.8605
## 95% CI : (0.7885, 0.9152)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.366920
##
## Kappa : 0.5858
##
## Mcnemar's Test P-Value : 0.002183
##
## Precision : 0.5294
## Recall : 0.9000
## F1 : 0.6667
## Prevalence : 0.1550
## Detection Rate : 0.1395
## Detection Prevalence : 0.2636
## Balanced Accuracy : 0.8766
##
## 'Positive' Class : fail
##
up_svm_poly_clf <- train_model_clf("svmPoly", train_set_clf_up)
## Support Vector Machines with Polynomial Kernel
##
## 880 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa
## 1 0.001 0.25 0.8207955 0.6415909
## 1 0.001 0.50 0.8480682 0.6961364
## 1 0.001 1.00 0.8709091 0.7418182
## 1 0.010 0.25 0.8906818 0.7813636
## 1 0.010 0.50 0.8967045 0.7934091
## 1 0.010 1.00 0.9063636 0.8127273
## 1 0.100 0.25 0.9256818 0.8513636
## 1 0.100 0.50 0.9351136 0.8702273
## 1 0.100 1.00 0.9400000 0.8800000
## 2 0.001 0.25 0.8502273 0.7004545
## 2 0.001 0.50 0.8738636 0.7477273
## 2 0.001 1.00 0.8902273 0.7804545
## 2 0.010 0.25 0.9232955 0.8465909
## 2 0.010 0.50 0.9387500 0.8775000
## 2 0.010 1.00 0.9576136 0.9152273
## 2 0.100 0.25 0.9738636 0.9477273
## 2 0.100 0.50 0.9738636 0.9477273
## 2 0.100 1.00 0.9738636 0.9477273
## 3 0.001 0.25 0.8648864 0.7297727
## 3 0.001 0.50 0.8860227 0.7720455
## 3 0.001 1.00 0.8954545 0.7909091
## 3 0.010 0.25 0.9564773 0.9129545
## 3 0.010 0.50 0.9711364 0.9422727
## 3 0.010 1.00 0.9750000 0.9500000
## 3 0.100 0.25 0.9913636 0.9827273
## 3 0.100 0.50 0.9913636 0.9827273
## 3 0.100 1.00 0.9913636 0.9827273
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 0.25.
plot(up_svm_poly_clf)
Perform hyperparameter tuning using tuneGrid.
up_svm_poly_tuneGrid_clf <- expand.grid(
  degree = 3,
  scale = 0.1,
  C = seq(0.05, 0.25, 0.05)
)
up_svm_poly_clf <- train_model_clf(
  "svmPoly",
  train_set_clf_up,
  tuneGrid = up_svm_poly_tuneGrid_clf
)
## Support Vector Machines with Polynomial Kernel
##
## 880 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.05 0.9913636 0.9827273
## 0.10 0.9913636 0.9827273
## 0.15 0.9913636 0.9827273
## 0.20 0.9913636 0.9827273
## 0.25 0.9913636 0.9827273
##
## Tuning parameter 'degree' was held constant at a value of 3
## Tuning parameter 'scale' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 0.05.
plot(up_svm_poly_clf)
Tuning done.
up_svm_poly_metrics_clf <- evaluate_model_clf(up_svm_poly_clf, "Support Vector Machine")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 7 1
## pass 13 108
##
## Accuracy : 0.8915
## 95% CI : (0.8246, 0.9394)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.086135
##
## Kappa : 0.4514
##
## Mcnemar's Test P-Value : 0.003283
##
## Precision : 0.87500
## Recall : 0.35000
## F1 : 0.50000
## Prevalence : 0.15504
## Detection Rate : 0.05426
## Detection Prevalence : 0.06202
## Balanced Accuracy : 0.67041
##
## 'Positive' Class : fail
##
knn_clf <- train_model_clf("knn")
## k-Nearest Neighbors
##
## 520 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8580769 0.2849825
## 7 0.8525000 0.2177127
## 9 0.8575000 0.2242462
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(knn_clf)
Perform hyperparameter tuning using tuneGrid.
knn_clf <- train_model_clf("knn", tuneGrid = expand.grid(k = c(3, 5, 7, 11)))
## k-Nearest Neighbors
##
## 520 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 3 0.8526923 0.3232641
## 5 0.8580769 0.2849825
## 7 0.8525000 0.2177127
## 11 0.8551923 0.1845602
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(knn_clf)
No further tuning is necessary: k = 5 remains the most accurate value on the wider grid.
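As a reminder of what `k` controls, the sketch below implements a bare-bones k-nearest-neighbours classifier in base R: standardize the predictors, find the `k` closest training rows by Euclidean distance, and take a majority vote. It is a simplified illustration on made-up data, not the implementation used by caret (ties, for instance, are resolved here by taking the first class). `knn_predict` and the toy matrices are illustrative names.

```r
# Predict the class of each row in new_x by majority vote among the
# k nearest rows of train_x (Euclidean distance).
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(row) {
    d       <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    nearest <- order(d)[seq_len(k)]
    votes   <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

# Toy data: two well-separated groups (hypothetical values).
raw_train <- matrix(c(0, 0,  1, 0,  0, 1,
                      10, 10,  11, 10,  10, 11),
                    ncol = 2, byrow = TRUE)
train_y <- factor(c("fail", "fail", "fail", "pass", "pass", "pass"))

# Center and scale with training statistics, then reuse them for new rows.
train_x <- scale(raw_train)
raw_new <- matrix(c(0.5, 0.5,  10.5, 10.5), ncol = 2, byrow = TRUE)
new_x   <- scale(raw_new,
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred <- knn_predict(train_x, train_y, new_x, k = 3)
print(pred)
```

Because the vote is dominated by the majority class when neighbourhoods are large, larger `k` values tend to lower Kappa on imbalanced data even when accuracy barely moves, as the table above shows.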
knn_metrics_clf <- evaluate_model_clf(knn_clf, "K-Nearest Neighbors")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 8 5
## pass 12 104
##
## Accuracy : 0.8682
## 95% CI : (0.7974, 0.9213)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.2776
##
## Kappa : 0.4132
##
## Mcnemar's Test P-Value : 0.1456
##
## Precision : 0.61538
## Recall : 0.40000
## F1 : 0.48485
## Prevalence : 0.15504
## Detection Rate : 0.06202
## Detection Prevalence : 0.10078
## Balanced Accuracy : 0.67706
##
## 'Positive' Class : fail
##
down_knn_clf <- train_model_clf("knn", train_set_clf_down)
## k-Nearest Neighbors
##
## 160 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.714375 0.42875
## 7 0.715625 0.43125
## 9 0.727500 0.45500
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot(down_knn_clf)
Perform hyperparameter tuning using tuneGrid.
down_knn_clf <- train_model_clf(
  "knn",
  train_set_clf_down,
  tuneGrid = expand.grid(k = c(9, 11, 13, 15))
)
## k-Nearest Neighbors
##
## 160 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 9 0.726875 0.45375
## 11 0.729375 0.45875
## 13 0.736875 0.47375
## 15 0.727500 0.45500
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 13.
plot(down_knn_clf)
No further tuning required.
down_knn_metrics_clf <- evaluate_model_clf(down_knn_clf, "K-Nearest Neighbors")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 16 15
## pass 4 94
##
## Accuracy : 0.8527
## 95% CI : (0.7796, 0.9089)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.46267
##
## Kappa : 0.5409
##
## Mcnemar's Test P-Value : 0.02178
##
## Precision : 0.5161
## Recall : 0.8000
## F1 : 0.6275
## Prevalence : 0.1550
## Detection Rate : 0.1240
## Detection Prevalence : 0.2403
## Balanced Accuracy : 0.8312
##
## 'Positive' Class : fail
##
up_knn_clf <- train_model_clf("knn", train_set_clf_up)
## k-Nearest Neighbors
##
## 880 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8503409 0.7006818
## 7 0.8335227 0.6670455
## 9 0.8397727 0.6795455
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(up_knn_clf)
Perform hyperparameter tuning using tuneGrid.
up_knn_clf <- train_model_clf(
  "knn",
  train_set_clf_up,
  tuneGrid = expand.grid(k = c(1, 3, 5))
)
## k-Nearest Neighbors
##
## 880 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9648864 0.9297727
## 3 0.8918182 0.7836364
## 5 0.8507955 0.7015909
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
plot(up_knn_clf)
Tuning done.
up_knn_metrics_clf <- evaluate_model_clf(up_knn_clf, "K-Nearest Neighbors")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 10 4
## pass 10 105
##
## Accuracy : 0.8915
## 95% CI : (0.8246, 0.9394)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.08614
##
## Kappa : 0.528
##
## Mcnemar's Test P-Value : 0.18145
##
## Precision : 0.71429
## Recall : 0.50000
## F1 : 0.58824
## Prevalence : 0.15504
## Detection Rate : 0.07752
## Detection Prevalence : 0.10853
## Balanced Accuracy : 0.73165
##
## 'Positive' Class : fail
##
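The cross-validated accuracy at k = 1 (0.965) is much higher than the test-set results above (accuracy 0.8915, recall 0.50 for the fail class). One possible explanation, assuming train_set_clf_up was built by duplicating fail rows before cross-validation (its construction is not shown in this section), is that copies of the same row land in both the training and held-out folds, so a 1-nearest-neighbour model matches each held-out fail row to its own duplicate. The base R sketch below illustrates the effect on synthetic noise data; all names and sizes are hypothetical.

```r
set.seed(1)

# Synthetic data: five pure-noise features, so there is no real signal to learn
n_pass <- 40
n_fail <- 8
x <- matrix(rnorm((n_pass + n_fail) * 5), ncol = 5)
y <- c(rep("pass", n_pass), rep("fail", n_fail))

# Up-sample BEFORE validation by copying every fail row five times
fail_idx <- which(y == "fail")
up_idx   <- c(which(y == "pass"), rep(fail_idx, each = 5))
x_up <- x[up_idx, ]
y_up <- y[up_idx]

# Leave-one-out 1-nearest-neighbour prediction on the up-sampled data
d <- as.matrix(dist(x_up))
diag(d) <- Inf                 # a row may not be its own neighbour
nn   <- apply(d, 1, which.min) # index of the nearest other row
pred <- y_up[nn]

# Every held-out fail row still has an exact copy (distance 0) in the rest
loo_recall_fail <- mean(pred[y_up == "fail"] == "fail")
loo_recall_fail
```

The fail rows are recovered perfectly even though the features carry no information, which is the kind of optimism a k = 1 model can show on duplicated rows. If this applies to the report's data, caret's trainControl(sampling = "up") performs the up-sampling inside each resample instead, which should give a less optimistic estimate.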
lr_clf <- train_model_clf("glmnet")
## glmnet
##
## 520 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.10 0.0004322165 0.9017308 0.6256426
## 0.10 0.0043221649 0.9159615 0.6555571
## 0.10 0.0432216490 0.9109615 0.5802518
## 0.55 0.0004322165 0.9030769 0.6312819
## 0.55 0.0043221649 0.9261538 0.6950640
## 0.55 0.0432216490 0.9128846 0.5716085
## 1.00 0.0004322165 0.9015385 0.6298148
## 1.00 0.0043221649 0.9284615 0.7077239
## 1.00 0.0432216490 0.9203846 0.6030193
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.004322165.
plot(lr_clf)
Check whether a finer search grid improves the result, using tuneLength.
lr_clf <- train_model_clf("glmnet", tuneLength = 10)
## glmnet
##
## 520 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 468, 468, 468, 468, 468, 468, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.1 9.984762e-05 0.8967308 0.60779310
## 0.1 2.306609e-04 0.8969231 0.60876498
## 0.1 5.328567e-04 0.9040385 0.63297908
## 0.1 1.230968e-03 0.9090385 0.64326112
## 0.1 2.843696e-03 0.9151923 0.65804902
## 0.1 6.569306e-03 0.9184615 0.65913957
## 0.1 1.517595e-02 0.9198077 0.65281748
## 0.1 3.505841e-02 0.9130769 0.60057173
## 0.1 8.098948e-02 0.8980769 0.48352832
## 0.1 1.870962e-01 0.8776923 0.31219660
## 0.2 9.984762e-05 0.8913462 0.58816286
## 0.2 2.306609e-04 0.8967308 0.60780293
## 0.2 5.328567e-04 0.9044231 0.63496540
## 0.2 1.230968e-03 0.9092308 0.64428003
## 0.2 2.843696e-03 0.9159615 0.66058001
## 0.2 6.569306e-03 0.9211538 0.66934419
## 0.2 1.517595e-02 0.9203846 0.65342998
## 0.2 3.505841e-02 0.9157692 0.61160217
## 0.2 8.098948e-02 0.8973077 0.47412016
## 0.2 1.870962e-01 0.8661538 0.20053350
## 0.3 9.984762e-05 0.8901923 0.58434279
## 0.3 2.306609e-04 0.8971154 0.60961696
## 0.3 5.328567e-04 0.9048077 0.63646919
## 0.3 1.230968e-03 0.9115385 0.65316063
## 0.3 2.843696e-03 0.9192308 0.67163219
## 0.3 6.569306e-03 0.9228846 0.67443623
## 0.3 1.517595e-02 0.9228846 0.66222271
## 0.3 3.505841e-02 0.9163462 0.61071768
## 0.3 8.098948e-02 0.8998077 0.48691947
## 0.3 1.870962e-01 0.8623077 0.15480630
## 0.4 9.984762e-05 0.8892308 0.58177551
## 0.4 2.306609e-04 0.8965385 0.60750958
## 0.4 5.328567e-04 0.9048077 0.63681858
## 0.4 1.230968e-03 0.9134615 0.66207686
## 0.4 2.843696e-03 0.9203846 0.67517958
## 0.4 6.569306e-03 0.9269231 0.69280361
## 0.4 1.517595e-02 0.9240385 0.66710357
## 0.4 3.505841e-02 0.9163462 0.60437666
## 0.4 8.098948e-02 0.8976923 0.46924372
## 0.4 1.870962e-01 0.8576923 0.11227740
## 0.5 9.984762e-05 0.8892308 0.58247849
## 0.5 2.306609e-04 0.8963462 0.60773161
## 0.5 5.328567e-04 0.9055769 0.64021281
## 0.5 1.230968e-03 0.9150000 0.66832647
## 0.5 2.843696e-03 0.9225000 0.68493490
## 0.5 6.569306e-03 0.9273077 0.69380308
## 0.5 1.517595e-02 0.9236538 0.66472678
## 0.5 3.505841e-02 0.9159615 0.59925649
## 0.5 8.098948e-02 0.8946154 0.43913684
## 0.5 1.870962e-01 0.8575000 0.11061774
## 0.6 9.984762e-05 0.8888462 0.58048769
## 0.6 2.306609e-04 0.8951923 0.60338744
## 0.6 5.328567e-04 0.9063462 0.64280205
## 0.6 1.230968e-03 0.9159615 0.67148649
## 0.6 2.843696e-03 0.9238462 0.69090400
## 0.6 6.569306e-03 0.9276923 0.69504169
## 0.6 1.517595e-02 0.9251923 0.67162014
## 0.6 3.505841e-02 0.9171154 0.60269059
## 0.6 8.098948e-02 0.8905769 0.40008936
## 0.6 1.870962e-01 0.8555769 0.09172321
## 0.7 9.984762e-05 0.8880769 0.57807889
## 0.7 2.306609e-04 0.8953846 0.60424625
## 0.7 5.328567e-04 0.9063462 0.64383917
## 0.7 1.230968e-03 0.9171154 0.67646171
## 0.7 2.843696e-03 0.9234615 0.69064605
## 0.7 6.569306e-03 0.9301923 0.70510589
## 0.7 1.517595e-02 0.9261538 0.67808667
## 0.7 3.505841e-02 0.9188462 0.61092352
## 0.7 8.098948e-02 0.8836538 0.34632410
## 0.7 1.870962e-01 0.8515385 0.05342349
## 0.8 9.984762e-05 0.8873077 0.57666165
## 0.8 2.306609e-04 0.8942308 0.60060697
## 0.8 5.328567e-04 0.9053846 0.64064662
## 0.8 1.230968e-03 0.9184615 0.68234425
## 0.8 2.843696e-03 0.9244231 0.69508433
## 0.8 6.569306e-03 0.9298077 0.70629359
## 0.8 1.517595e-02 0.9284615 0.69149841
## 0.8 3.505841e-02 0.9205769 0.62215185
## 0.8 8.098948e-02 0.8794231 0.31326341
## 0.8 1.870962e-01 0.8461538 0.00000000
## 0.9 9.984762e-05 0.8880769 0.58075317
## 0.9 2.306609e-04 0.8938462 0.60166628
## 0.9 5.328567e-04 0.9063462 0.64605661
## 0.9 1.230968e-03 0.9188462 0.68554659
## 0.9 2.843696e-03 0.9242308 0.69488854
## 0.9 6.569306e-03 0.9311538 0.71251098
## 0.9 1.517595e-02 0.9290385 0.69747879
## 0.9 3.505841e-02 0.9251923 0.64333826
## 0.9 8.098948e-02 0.8773077 0.29712216
## 0.9 1.870962e-01 0.8461538 0.00000000
## 1.0 9.984762e-05 0.8896154 0.58740072
## 1.0 2.306609e-04 0.8930769 0.59910168
## 1.0 5.328567e-04 0.9051923 0.64221568
## 1.0 1.230968e-03 0.9180769 0.68455273
## 1.0 2.843696e-03 0.9257692 0.70368923
## 1.0 6.569306e-03 0.9305769 0.71375472
## 1.0 1.517595e-02 0.9296154 0.70099041
## 1.0 3.505841e-02 0.9275000 0.65591012
## 1.0 8.098948e-02 0.8740385 0.26875886
## 1.0 1.870962e-01 0.8461538 0.00000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.9 and lambda = 0.006569306.
plot(lr_clf)
No further tuning necessary.
lr_metrics_clf <- evaluate_model_clf(lr_clf, "Logistic Regression")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 14 5
## pass 6 104
##
## Accuracy : 0.9147
## 95% CI : (0.8525, 0.9567)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.01442
##
## Kappa : 0.6678
##
## Mcnemar's Test P-Value : 1.00000
##
## Precision : 0.7368
## Recall : 0.7000
## F1 : 0.7179
## Prevalence : 0.1550
## Detection Rate : 0.1085
## Detection Prevalence : 0.1473
## Balanced Accuracy : 0.8271
##
## 'Positive' Class : fail
##
down_lr_clf <- train_model_clf("glmnet", train_set_clf_down)
## glmnet
##
## 160 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.10 0.0007071498 0.775000 0.55000
## 0.10 0.0070714983 0.792500 0.58500
## 0.10 0.0707149828 0.821875 0.64375
## 0.55 0.0007071498 0.791875 0.58375
## 0.55 0.0070714983 0.833125 0.66625
## 0.55 0.0707149828 0.858125 0.71625
## 1.00 0.0007071498 0.809375 0.61875
## 1.00 0.0070714983 0.839375 0.67875
## 1.00 0.0707149828 0.875000 0.75000
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.07071498.
plot(down_lr_clf)
Perform hyperparameter tuning using tuneLength.
down_lr_clf <- train_model_clf("glmnet", train_set_clf_down, tuneLength = 10)
## glmnet
##
## 160 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.1 0.0001633608 0.772500 0.54500
## 0.1 0.0003773846 0.773125 0.54625
## 0.1 0.0008718074 0.775625 0.55125
## 0.1 0.0020139881 0.775000 0.55000
## 0.1 0.0046525737 0.781875 0.56375
## 0.1 0.0107480486 0.798125 0.59625
## 0.1 0.0248293863 0.808750 0.61750
## 0.1 0.0573591027 0.823750 0.64750
## 0.1 0.1325069668 0.821875 0.64375
## 0.1 0.3061082795 0.829375 0.65875
## 0.2 0.0001633608 0.770000 0.54000
## 0.2 0.0003773846 0.773125 0.54625
## 0.2 0.0008718074 0.777500 0.55500
## 0.2 0.0020139881 0.783125 0.56625
## 0.2 0.0046525737 0.800000 0.60000
## 0.2 0.0107480486 0.811875 0.62375
## 0.2 0.0248293863 0.820000 0.64000
## 0.2 0.0573591027 0.831875 0.66375
## 0.2 0.1325069668 0.833750 0.66750
## 0.2 0.3061082795 0.856250 0.71250
## 0.3 0.0001633608 0.770625 0.54125
## 0.3 0.0003773846 0.777500 0.55500
## 0.3 0.0008718074 0.783125 0.56625
## 0.3 0.0020139881 0.793125 0.58625
## 0.3 0.0046525737 0.809375 0.61875
## 0.3 0.0107480486 0.820625 0.64125
## 0.3 0.0248293863 0.831875 0.66375
## 0.3 0.0573591027 0.843125 0.68625
## 0.3 0.1325069668 0.836875 0.67375
## 0.3 0.3061082795 0.867500 0.73500
## 0.4 0.0001633608 0.772500 0.54500
## 0.4 0.0003773846 0.779375 0.55875
## 0.4 0.0008718074 0.786250 0.57250
## 0.4 0.0020139881 0.801250 0.60250
## 0.4 0.0046525737 0.816250 0.63250
## 0.4 0.0107480486 0.828750 0.65750
## 0.4 0.0248293863 0.839375 0.67875
## 0.4 0.0573591027 0.850625 0.70125
## 0.4 0.1325069668 0.870625 0.74125
## 0.4 0.3061082795 0.873750 0.74750
## 0.5 0.0001633608 0.780000 0.56000
## 0.5 0.0003773846 0.783750 0.56750
## 0.5 0.0008718074 0.791250 0.58250
## 0.5 0.0020139881 0.803125 0.60625
## 0.5 0.0046525737 0.820625 0.64125
## 0.5 0.0107480486 0.834375 0.66875
## 0.5 0.0248293863 0.845625 0.69125
## 0.5 0.0573591027 0.851875 0.70375
## 0.5 0.1325069668 0.871875 0.74375
## 0.5 0.3061082795 0.875000 0.75000
## 0.6 0.0001633608 0.783125 0.56625
## 0.6 0.0003773846 0.788750 0.57750
## 0.6 0.0008718074 0.797500 0.59500
## 0.6 0.0020139881 0.812500 0.62500
## 0.6 0.0046525737 0.826875 0.65375
## 0.6 0.0107480486 0.841250 0.68250
## 0.6 0.0248293863 0.846250 0.69250
## 0.6 0.0573591027 0.861875 0.72375
## 0.6 0.1325069668 0.871875 0.74375
## 0.6 0.3061082795 0.875000 0.75000
## 0.7 0.0001633608 0.788125 0.57625
## 0.7 0.0003773846 0.796250 0.59250
## 0.7 0.0008718074 0.808125 0.61625
## 0.7 0.0020139881 0.814375 0.62875
## 0.7 0.0046525737 0.831875 0.66375
## 0.7 0.0107480486 0.846250 0.69250
## 0.7 0.0248293863 0.851250 0.70250
## 0.7 0.0573591027 0.863125 0.72625
## 0.7 0.1325069668 0.875000 0.75000
## 0.7 0.3061082795 0.875000 0.75000
## 0.8 0.0001633608 0.798750 0.59750
## 0.8 0.0003773846 0.805000 0.61000
## 0.8 0.0008718074 0.815000 0.63000
## 0.8 0.0020139881 0.821875 0.64375
## 0.8 0.0046525737 0.833750 0.66750
## 0.8 0.0107480486 0.846875 0.69375
## 0.8 0.0248293863 0.855625 0.71125
## 0.8 0.0573591027 0.862500 0.72500
## 0.8 0.1325069668 0.875000 0.75000
## 0.8 0.3061082795 0.874375 0.74875
## 0.9 0.0001633608 0.809375 0.61875
## 0.9 0.0003773846 0.812500 0.62500
## 0.9 0.0008718074 0.814375 0.62875
## 0.9 0.0020139881 0.820000 0.64000
## 0.9 0.0046525737 0.831875 0.66375
## 0.9 0.0107480486 0.845625 0.69125
## 0.9 0.0248293863 0.861875 0.72375
## 0.9 0.0573591027 0.866250 0.73250
## 0.9 0.1325069668 0.874375 0.74875
## 0.9 0.3061082795 0.871250 0.74250
## 1.0 0.0001633608 0.801875 0.60375
## 1.0 0.0003773846 0.808750 0.61750
## 1.0 0.0008718074 0.808125 0.61625
## 1.0 0.0020139881 0.815625 0.63125
## 1.0 0.0046525737 0.826875 0.65375
## 1.0 0.0107480486 0.846875 0.69375
## 1.0 0.0248293863 0.863750 0.72750
## 1.0 0.0573591027 0.866875 0.73375
## 1.0 0.1325069668 0.874375 0.74875
## 1.0 0.3061082795 0.860000 0.72000
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.3061083.
plot(down_lr_clf)
No further tuning required.
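The selected lambda (0.3061) is the largest value in the tuneLength grid above, so the search may have stopped at the edge of the grid. One optional check is to pass an explicit grid that extends beyond that value. The sketch below builds such a grid in base R; it assumes train_model_clf forwards tuneGrid to caret::train, as in the KNN calls earlier, and the ranges are illustrative rather than recommended values.

```r
# Explicit glmnet grid: caret expects columns named alpha and lambda
glmnet_grid <- expand.grid(
  alpha  = seq(0.1, 1, by = 0.1),
  lambda = 10^seq(-2, 0, length.out = 10)  # log-spaced from 0.01 to 1
)

nrow(glmnet_grid)          # number of alpha/lambda combinations
range(glmnet_grid$lambda)  # lower and upper end of the lambda search

# Hypothetical usage with the project's helper (not run here):
# down_lr_clf <- train_model_clf("glmnet", train_set_clf_down, tuneGrid = glmnet_grid)
```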
down_lr_metrics_clf <- evaluate_model_clf(down_lr_clf, "Logistic Regression")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 18 14
## pass 2 95
##
## Accuracy : 0.876
## 95% CI : (0.8064, 0.9274)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.19935
##
## Kappa : 0.6197
##
## Mcnemar's Test P-Value : 0.00596
##
## Precision : 0.5625
## Recall : 0.9000
## F1 : 0.6923
## Prevalence : 0.1550
## Detection Rate : 0.1395
## Detection Prevalence : 0.2481
## Balanced Accuracy : 0.8858
##
## 'Positive' Class : fail
##
up_lr_clf <- train_model_clf("glmnet", train_set_clf_up)
## glmnet
##
## 880 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.10 0.0007324895 0.9494318 0.8988636
## 0.10 0.0073248947 0.9317045 0.8634091
## 0.10 0.0732489472 0.9007955 0.8015909
## 0.55 0.0007324895 0.9505682 0.9011364
## 0.55 0.0073248947 0.9282955 0.8565909
## 0.55 0.0732489472 0.9080682 0.8161364
## 1.00 0.0007324895 0.9540909 0.9081818
## 1.00 0.0073248947 0.9336364 0.8672727
## 1.00 0.0732489472 0.9146591 0.8293182
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.0007324895.
plot(up_lr_clf)
Perform hyperparameter tuning using tuneLength.
up_lr_clf <- train_model_clf("glmnet", train_set_clf_up, tuneLength = 10)
## glmnet
##
## 880 samples
## 32 predictor
## 2 classes: 'fail', 'pass'
##
## Pre-processing: centered (72), scaled (72)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.1 0.0001692146 0.9523864 0.9047727
## 0.1 0.0003909076 0.9522727 0.9045455
## 0.1 0.0009030473 0.9486364 0.8972727
## 0.1 0.0020861563 0.9423864 0.8847727
## 0.1 0.0048192916 0.9367045 0.8734091
## 0.1 0.0111331887 0.9256818 0.8513636
## 0.1 0.0257191098 0.9120455 0.8240909
## 0.1 0.0594144794 0.9030682 0.8061364
## 0.1 0.1372551534 0.8931818 0.7863636
## 0.1 0.3170772064 0.8892045 0.7784091
## 0.2 0.0001692146 0.9532955 0.9065909
## 0.2 0.0003909076 0.9525000 0.9050000
## 0.2 0.0009030473 0.9484091 0.8968182
## 0.2 0.0020861563 0.9431818 0.8863636
## 0.2 0.0048192916 0.9364773 0.8729545
## 0.2 0.0111331887 0.9221591 0.8443182
## 0.2 0.0257191098 0.9101136 0.8202273
## 0.2 0.0594144794 0.9002273 0.8004545
## 0.2 0.1372551534 0.8981818 0.7963636
## 0.2 0.3170772064 0.8769318 0.7538636
## 0.3 0.0001692146 0.9540909 0.9081818
## 0.3 0.0003909076 0.9527273 0.9054545
## 0.3 0.0009030473 0.9484091 0.8968182
## 0.3 0.0020861563 0.9444318 0.8888636
## 0.3 0.0048192916 0.9356818 0.8713636
## 0.3 0.0111331887 0.9207955 0.8415909
## 0.3 0.0257191098 0.9115909 0.8231818
## 0.3 0.0594144794 0.9045455 0.8090909
## 0.3 0.1372551534 0.8906818 0.7813636
## 0.3 0.3170772064 0.8995455 0.7990909
## 0.4 0.0001692146 0.9543182 0.9086364
## 0.4 0.0003909076 0.9527273 0.9054545
## 0.4 0.0009030473 0.9492045 0.8984091
## 0.4 0.0020861563 0.9452273 0.8904545
## 0.4 0.0048192916 0.9354545 0.8709091
## 0.4 0.0111331887 0.9213636 0.8427273
## 0.4 0.0257191098 0.9188636 0.8377273
## 0.4 0.0594144794 0.9055682 0.8111364
## 0.4 0.1372551534 0.9021591 0.8043182
## 0.4 0.3170772064 0.9069318 0.8138636
## 0.5 0.0001692146 0.9547727 0.9095455
## 0.5 0.0003909076 0.9532955 0.9065909
## 0.5 0.0009030473 0.9496591 0.8993182
## 0.5 0.0020861563 0.9452273 0.8904545
## 0.5 0.0048192916 0.9348864 0.8697727
## 0.5 0.0111331887 0.9238636 0.8477273
## 0.5 0.0257191098 0.9237500 0.8475000
## 0.5 0.0594144794 0.9053409 0.8106818
## 0.5 0.1372551534 0.9079545 0.8159091
## 0.5 0.3170772064 0.9060227 0.8120455
## 0.6 0.0001692146 0.9553409 0.9106818
## 0.6 0.0003909076 0.9537500 0.9075000
## 0.6 0.0009030473 0.9506818 0.9013636
## 0.6 0.0020861563 0.9456818 0.8913636
## 0.6 0.0048192916 0.9340909 0.8681818
## 0.6 0.0111331887 0.9264773 0.8529545
## 0.6 0.0257191098 0.9260227 0.8520455
## 0.6 0.0594144794 0.9092045 0.8184091
## 0.6 0.1372551534 0.9076136 0.8152273
## 0.6 0.3170772064 0.9055682 0.8111364
## 0.7 0.0001692146 0.9551136 0.9102273
## 0.7 0.0003909076 0.9544318 0.9088636
## 0.7 0.0009030473 0.9514773 0.9029545
## 0.7 0.0020861563 0.9461364 0.8922727
## 0.7 0.0048192916 0.9331818 0.8663636
## 0.7 0.0111331887 0.9311364 0.8622727
## 0.7 0.0257191098 0.9279545 0.8559091
## 0.7 0.0594144794 0.9130682 0.8261364
## 0.7 0.1372551534 0.9128409 0.8256818
## 0.7 0.3170772064 0.9045455 0.8090909
## 0.8 0.0001692146 0.9553409 0.9106818
## 0.8 0.0003909076 0.9552273 0.9104545
## 0.8 0.0009030473 0.9521591 0.9043182
## 0.8 0.0020861563 0.9453409 0.8906818
## 0.8 0.0048192916 0.9337500 0.8675000
## 0.8 0.0111331887 0.9338636 0.8677273
## 0.8 0.0257191098 0.9297727 0.8595455
## 0.8 0.0594144794 0.9135227 0.8270455
## 0.8 0.1372551534 0.9112500 0.8225000
## 0.8 0.3170772064 0.9040909 0.8081818
## 0.9 0.0001692146 0.9554545 0.9109091
## 0.9 0.0003909076 0.9555682 0.9111364
## 0.9 0.0009030473 0.9525000 0.9050000
## 0.9 0.0020861563 0.9463636 0.8927273
## 0.9 0.0048192916 0.9353409 0.8706818
## 0.9 0.0111331887 0.9346591 0.8693182
## 0.9 0.0257191098 0.9310227 0.8620455
## 0.9 0.0594144794 0.9143182 0.8286364
## 0.9 0.1372551534 0.9071591 0.8143182
## 0.9 0.3170772064 0.9013636 0.8027273
## 1.0 0.0001692146 0.9561364 0.9122727
## 1.0 0.0003909076 0.9563636 0.9127273
## 1.0 0.0009030473 0.9543182 0.9086364
## 1.0 0.0020861563 0.9460227 0.8920455
## 1.0 0.0048192916 0.9359091 0.8718182
## 1.0 0.0111331887 0.9344318 0.8688636
## 1.0 0.0257191098 0.9306818 0.8613636
## 1.0 0.0594144794 0.9147727 0.8295455
## 1.0 0.1372551534 0.9043182 0.8086364
## 1.0 0.3170772064 0.8839773 0.7679545
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.0003909076.
plot(up_lr_clf)
Tuning done.
up_lr_metrics_clf <- evaluate_model_clf(up_lr_clf, "Logistic Regression")
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 12 7
## pass 8 102
##
## Accuracy : 0.8837
## 95% CI : (0.8155, 0.9334)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 0.1352
##
## Kappa : 0.5469
##
## Mcnemar's Test P-Value : 1.0000
##
## Precision : 0.63158
## Recall : 0.60000
## F1 : 0.61538
## Prevalence : 0.15504
## Detection Rate : 0.09302
## Detection Prevalence : 0.14729
## Balanced Accuracy : 0.76789
##
## 'Positive' Class : fail
##
df_eval_reg <- rbind(
dt_reg$metrics,
rf_reg_manual_tune$metrics,
svm_reg$metrics,
xgb_reg_manual_tune$metrics,
lr_reg_manual_tune$metrics
)
df_eval_reg <- df_eval_reg[order(df_eval_reg$RMSE), ]
row.names(df_eval_reg) <- NULL
df_eval_reg
## Model RMSE Rsquared MAE
## 1 Random Forest 1.106174 0.8666637 0.7717890
## 2 Linear Regression 1.117378 0.8643585 0.7226734
## 3 Support Vector Machine 1.177426 0.8521626 0.7618882
## 4 XGBoost 1.250256 0.8338254 0.8697652
## 5 Decision Tree 1.340746 0.8114244 0.8614620
df_eval_reg["RMSE"] <- round(df_eval_reg["RMSE"], 4)
df_eval_reg %>%
ggplot(aes(x = reorder(Model, RMSE), y = RMSE, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = RMSE), vjust = -0.5) +
ggtitle("RMSE by Model") +
xlab("Model") +
ylim(0, 1.5) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
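For reference, the three metric columns in the regression table above can be computed as follows. This is a base R sketch on made-up grades, not the project's own metric code; Rsquared is taken here as the squared correlation between predictions and observations, which is how caret's postResample reports it, and is assumed to be what the $metrics objects contain.

```r
# Hypothetical observed and predicted final grades
obs  <- c(10, 12, 8, 15, 11)
pred <- c(11, 12, 9, 13, 11)

err  <- pred - obs
rmse <- sqrt(mean(err^2))  # penalises large errors more strongly
mae  <- mean(abs(err))     # average absolute error in grade points
rsq  <- cor(pred, obs)^2   # squared correlation (assumed caret convention)

data.frame(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, which is consistent with every row of the table above.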
Define a function to print the models sorted by accuracy and plot their accuracy.
show_accuracy <- function(df) {
df_accuracy <- df %>% arrange(desc(Accuracy))
print(df_accuracy)
plot_accuracy <- df_accuracy %>%
ggplot(aes(x = reorder(Model, -Accuracy), y = Accuracy, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = Accuracy), vjust = -0.5) +
ggtitle("Accuracy by Model") +
xlab("Model") +
ylim(0, 1) +
labs(fill = "Model") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
print(plot_accuracy)
}
Define a function to print the models sorted by recall for the fail class and plot that recall.
show_recall_fail <- function(df) {
df_recall_fail <- df %>% arrange(desc(Recall.Fail))
print(df_recall_fail)
plot_recall_fail <- df_recall_fail %>%
ggplot(
aes(x = reorder(Model, -Recall.Fail), y = Recall.Fail, fill = Model)
) +
geom_bar(stat = "identity") +
geom_text(aes(label = Recall.Fail), vjust = -0.5) +
ggtitle("Recall (Fail Class) by Model") +
xlab("Model") +
ylab("Recall") +
ylim(0, 1) +
labs(fill = "Model") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
print(plot_recall_fail)
}
df_eval_clf <- rbind(
dt_metrics_clf,
rf_metrics_clf,
svm_poly_metrics_clf,
knn_metrics_clf,
lr_metrics_clf
)
df_eval_clf
## Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1 Decision Tree 0.938 0.929 0.65 0.765
## 2 Random Forest 0.922 0.778 0.70 0.737
## 3 Support Vector Machine 0.915 0.737 0.70 0.718
## 4 K-Nearest Neighbors 0.868 0.615 0.40 0.485
## 5 Logistic Regression 0.915 0.737 0.70 0.718
## Precision.Pass Recall.Pass F1.Pass
## 1 0.939 0.991 0.964
## 2 0.946 0.963 0.955
## 3 0.945 0.954 0.950
## 4 0.897 0.954 0.924
## 5 0.945 0.954 0.950
show_accuracy(df_eval_clf)
## Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1 Decision Tree 0.938 0.929 0.65 0.765
## 2 Random Forest 0.922 0.778 0.70 0.737
## 3 Support Vector Machine 0.915 0.737 0.70 0.718
## 4 Logistic Regression 0.915 0.737 0.70 0.718
## 5 K-Nearest Neighbors 0.868 0.615 0.40 0.485
## Precision.Pass Recall.Pass F1.Pass
## 1 0.939 0.991 0.964
## 2 0.946 0.963 0.955
## 3 0.945 0.954 0.950
## 4 0.945 0.954 0.950
## 5 0.897 0.954 0.924
show_recall_fail(df_eval_clf)
## Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1 Random Forest 0.922 0.778 0.70 0.737
## 2 Support Vector Machine 0.915 0.737 0.70 0.718
## 3 Logistic Regression 0.915 0.737 0.70 0.718
## 4 Decision Tree 0.938 0.929 0.65 0.765
## 5 K-Nearest Neighbors 0.868 0.615 0.40 0.485
## Precision.Pass Recall.Pass F1.Pass
## 1 0.946 0.963 0.955
## 2 0.945 0.954 0.950
## 3 0.945 0.954 0.950
## 4 0.939 0.991 0.964
## 5 0.897 0.954 0.924
down_df_eval_clf <- rbind(
down_dt_metrics_clf,
down_rf_metrics_clf,
down_svm_poly_metrics_clf,
down_knn_metrics_clf,
down_lr_metrics_clf
)
down_df_eval_clf
## Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1 Decision Tree 0.860 0.531 0.85 0.654
## 2 Random Forest 0.876 0.567 0.85 0.680
## 3 Support Vector Machine 0.860 0.529 0.90 0.667
## 4 K-Nearest Neighbors 0.853 0.516 0.80 0.627
## 5 Logistic Regression 0.876 0.562 0.90 0.692
## Precision.Pass Recall.Pass F1.Pass
## 1 0.969 0.862 0.913
## 2 0.970 0.881 0.923
## 3 0.979 0.853 0.912
## 4 0.959 0.862 0.908
## 5 0.979 0.872 0.922
show_accuracy(down_df_eval_clf)
## Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1 Random Forest 0.876 0.567 0.85 0.680
## 2 Logistic Regression 0.876 0.562 0.90 0.692
## 3 Decision Tree 0.860 0.531 0.85 0.654
## 4 Support Vector Machine 0.860 0.529 0.90 0.667
## 5 K-Nearest Neighbors 0.853 0.516 0.80 0.627
## Precision.Pass Recall.Pass F1.Pass
## 1 0.970 0.881 0.923
## 2 0.979 0.872 0.922
## 3 0.969 0.862 0.913
## 4 0.979 0.853 0.912
## 5 0.959 0.862 0.908
show_recall_fail(down_df_eval_clf)
## Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1 Support Vector Machine 0.860 0.529 0.90 0.667
## 2 Logistic Regression 0.876 0.562 0.90 0.692
## 3 Decision Tree 0.860 0.531 0.85 0.654
## 4 Random Forest 0.876 0.567 0.85 0.680
## 5 K-Nearest Neighbors 0.853 0.516 0.80 0.627
## Precision.Pass Recall.Pass F1.Pass
## 1 0.979 0.853 0.912
## 2 0.979 0.872 0.922
## 3 0.969 0.862 0.913
## 4 0.970 0.881 0.923
## 5 0.959 0.862 0.908
up_df_eval_clf <- rbind(
up_dt_metrics_clf,
up_rf_metrics_clf,
up_svm_poly_metrics_clf,
up_knn_metrics_clf,
up_lr_metrics_clf
)
up_df_eval_clf
## Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1 Decision Tree 0.884 0.593 0.80 0.681
## 2 Random Forest 0.915 0.800 0.60 0.686
## 3 Support Vector Machine 0.891 0.875 0.35 0.500
## 4 K-Nearest Neighbors 0.891 0.714 0.50 0.588
## 5 Logistic Regression 0.884 0.632 0.60 0.615
## Precision.Pass Recall.Pass F1.Pass
## 1 0.961 0.899 0.929
## 2 0.930 0.972 0.951
## 3 0.893 0.991 0.939
## 4 0.913 0.963 0.938
## 5 0.927 0.936 0.932
show_accuracy(up_df_eval_clf)
## Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1 Random Forest 0.915 0.800 0.60 0.686
## 2 Support Vector Machine 0.891 0.875 0.35 0.500
## 3 K-Nearest Neighbors 0.891 0.714 0.50 0.588
## 4 Decision Tree 0.884 0.593 0.80 0.681
## 5 Logistic Regression 0.884 0.632 0.60 0.615
## Precision.Pass Recall.Pass F1.Pass
## 1 0.930 0.972 0.951
## 2 0.893 0.991 0.939
## 3 0.913 0.963 0.938
## 4 0.961 0.899 0.929
## 5 0.927 0.936 0.932
show_recall_fail(up_df_eval_clf)
## Model Accuracy Precision.Fail Recall.Fail F1.Fail
## 1 Decision Tree 0.884 0.593 0.80 0.681
## 2 Random Forest 0.915 0.800 0.60 0.686
## 3 Logistic Regression 0.884 0.632 0.60 0.615
## 4 K-Nearest Neighbors 0.891 0.714 0.50 0.588
## 5 Support Vector Machine 0.891 0.875 0.35 0.500
## Precision.Pass Recall.Pass F1.Pass
## 1 0.961 0.899 0.929
## 2 0.930 0.972 0.951
## 3 0.927 0.936 0.932
## 4 0.913 0.963 0.938
## 5 0.893 0.991 0.939
In this study, we developed and evaluated several regression and binary classification models to predict student performance in secondary education at two Portuguese schools. The final grade, G3, is the target attribute for the regression models. For the binary classification models, the target attribute “result” is created by splitting G3 into the binary labels “pass” and “fail”. All models were trained with hyperparameter tuning to optimize their performance.
From the EDA, we found that the attributes G1 and G2 are highly correlated with the target attribute G3. This shows that students who achieved higher G1 and G2 results are most likely to achieve a higher G3 result.
We aimed to achieve better prediction results than the previous study by Cortez and Silva (2008), which reported an RMSE of 1.32 and an accuracy of 93% for its best regression model and binary classification model respectively. Our best models slightly outperformed those results, with an RMSE of 1.11 from the Random Forest regression model and an accuracy of 93.8% from the Decision Tree binary classification model.
In addition, we noticed that the training dataset is imbalanced, with 440 pass labels and 80 fail labels. We therefore also trained the binary classification models on down-sampled and up-sampled data, and compared the results of the classification models across the original, down-sampled, and up-sampled data.
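The two resampling schemes can be sketched in base R as below. The label vector reproduces the 440/80 split described above; the index-based approach is illustrative and may differ from how train_set_clf_down and train_set_clf_up were actually built.

```r
set.seed(42)

# Training labels with the class counts stated in the text
labels <- factor(c(rep("pass", 440), rep("fail", 80)),
                 levels = c("fail", "pass"))
idx_pass <- which(labels == "pass")
idx_fail <- which(labels == "fail")

# Down-sampling: keep all fail rows, draw an equal number of pass rows
down_idx <- c(idx_fail, sample(idx_pass, length(idx_fail)))

# Up-sampling: keep all pass rows, draw fail rows with replacement
up_idx <- c(idx_pass, sample(idx_fail, length(idx_pass), replace = TRUE))

table(labels[down_idx])  # 80 rows per class, 160 in total
table(labels[up_idx])    # 440 rows per class, 880 in total
```

The resulting totals of 160 and 880 rows match the sample counts printed in the model summaries for the down-sampled and up-sampled training sets.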
Because the classes in the original sample data are imbalanced, accuracy is not the best metric for evaluating model performance. Furthermore, we want to emphasize correct prediction of the fail class. Hence, we decided to use recall on the fail class as the main metric.
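To see why accuracy alone can mislead here, consider a baseline that predicts “pass” for every student. The sketch below uses the test-set class counts implied by the confusion matrices in this report (20 fail, 109 pass); the variable names are illustrative.

```r
# Test-set labels reconstructed from the confusion-matrix column totals
actual <- factor(c(rep("fail", 20), rep("pass", 109)),
                 levels = c("fail", "pass"))

# Baseline that never predicts a failure
predicted <- factor(rep("pass", length(actual)), levels = c("fail", "pass"))

accuracy    <- mean(predicted == actual)
recall_fail <- sum(predicted == "fail" & actual == "fail") / sum(actual == "fail")

round(c(accuracy = accuracy, recall_fail = recall_fail), 3)
```

The baseline's accuracy equals the No Information Rate of 0.845 shown in the confusion matrix outputs, yet it identifies no failing student, which is why recall on the fail class is the more informative metric for this problem.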
For classification using the original data, Random Forest, Support Vector Machine, and Logistic Regression performed equally well in terms of recall on the fail class (0.70). Across the original, down-sampled, and up-sampled data, Logistic Regression and Support Vector Machine trained on the down-sampled data produced the best results, with 90% recall.
In future work, the G1 and G2 attributes could be removed from the features so that the impact of the other attributes on student performance can be studied to address more specific problems. For instance, future work could study whether family background affects a student’s academic results by predicting student performance from family ecological factors.