Alcohol consumption causes numerous problems for individuals, families and society, and many people begin drinking in adolescence. This study examines the drinking habits of secondary-school students enrolled in Mathematics and Portuguese language courses, using data on student achievement in secondary education at two Portuguese schools.
The attributes include student grades and demographic, social and school-related features, collected using school reports and questionnaires. Two datasets are provided, one for each subject: Mathematics (mat) and Portuguese language (por). They contain a rich set of social, gender and study information about the students, which we can use for exploratory data analysis (EDA) and to try to predict student alcohol consumption (specifically, problematic consumption). The aim of this report is to detect whether there is any correlation between high school students' environmental and personal factors and the likelihood that these students consume alcohol.
options(scipen = 999)
library(tidyverse)
library(gridExtra) #grid.arrange()
library(GGally)
library(e1071)
library(caret)
library(rsample) #splitting or resampling
library(tidymodels) #conf_mat
library(ROCR) #roc curve
library(rpart) #DT
library(rattle) #DT
library(rpart.plot) #DT
library(partykit) #DT
The dataset was obtained from Kaggle and contains a rich set of social, gender and study information about students in secondary education at two Portuguese schools.
sac_math <- read.csv("data_input/student-mat.csv")
sac_por <- read.csv("data_input/student-por.csv")
nrow(sac_math)
#> [1] 395
nrow(sac_por)
#> [1] 649
We can see how many students are in each course: 395 in the Math course and 649 in the Portuguese language course. Next, we combine the two datasets so the wrangling steps apply to both math and por at once.
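The chunk that combines the two datasets did not render here. A minimal sketch, assuming each data frame is first labeled with a subject column (visible in the glimpse() output below) before the rows are stacked:
sac_math$subject <- "math" # label each row with its course
sac_por$subject <- "por"
sac <- rbind(sac_math, sac_por) # stack both courses into a single data frame
glimpse(sac)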
#> Observations: 1,044
#> Variables: 34
#> $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, ...
#> $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M,...
#> $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
#> $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
#> $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, G...
#> $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T,...
#> $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
#> $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
#> $ Mjob <fct> at_home, at_home, at_home, health, other, services, othe...
#> $ Fjob <fct> teacher, other, other, services, other, other, other, te...
#> $ reason <fct> course, course, other, home, home, reputation, home, hom...
#> $ guardian <fct> mother, father, mother, mother, father, mother, mother, ...
#> $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
#> $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
#> $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
#> $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, n...
#> $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes,...
#> $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, y...
#> $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes,...
#> $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye...
#> $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
#> $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes,...
#> $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no,...
#> $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
#> $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
#> $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
#> $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
#> $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
#> $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
#> $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
#> $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
#> $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
#> $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
#> $ subject <chr> "math", "math", "math", "math", "math", "math", "math", ...
Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
school : student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex : student’s sex (binary: ‘F’ - female or ‘M’ - male)
age : student’s age (numeric: from 15 to 22)
address : student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize : family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus : parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu : mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu : father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob : mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob : father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason : reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian : student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime : home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime : weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures : number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup : extra educational support (binary: yes or no)
famsup : family educational support (binary: yes or no)
paid : extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities : extra-curricular activities (binary: yes or no)
nursery : attended nursery school (binary: yes or no)
higher : wants to take higher education (binary: yes or no)
internet : Internet access at home (binary: yes or no)
romantic : with a romantic relationship (binary: yes or no)
famrel : quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime : free time after school (numeric: from 1 - very low to 5 - very high)
goout : going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc : workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc : weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health : current health status (numeric: from 1 - very bad to 5 - very good)
absences : number of school absences (numeric: from 0 to 93)
These grades are related with the course subject, Math or Portuguese:
G1 : first period grade (numeric: from 0 to 20)
G2 : second period grade (numeric: from 0 to 20)
G3 : final grade (numeric: from 0 to 20)
We can change the data types of several variables to factors and calculate the average grade of each student.
#> [1] "school" "sex" "age" "address" "famsize"
#> [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
#> [11] "reason" "guardian" "traveltime" "studytime" "failures"
#> [16] "schoolsup" "famsup" "paid" "activities" "nursery"
#> [21] "higher" "internet" "romantic" "famrel" "freetime"
#> [26] "goout" "Dalc" "Walc" "health" "absences"
#> [31] "G1" "G2" "G3" "subject"
sac <- sac %>%
mutate(
Medu = as.factor(Medu),
Fedu = as.factor(Fedu),
traveltime = as.factor(traveltime),
studytime = as.factor(studytime),
failures = as.factor(failures),
famrel = as.factor(famrel),
freetime = as.factor(freetime),
goout = as.factor(goout),
Dalc = as.factor(Dalc),
Walc = as.factor(Walc),
health = as.factor(health),
subject = as.factor(subject),
average_grade = round((G1+G2+G3)/3,2)
) %>%
select(-c(G1, G2, G3))
head(sac)
Before we explore further, we should check whether the data contains any NA values.
anyNA(sac)
#> [1] FALSE
The first thing we explore is the variable we want to predict: the level of alcohol consumption on workdays and on weekends.
A <- ggplot(data = sac, aes(x = Dalc, y = average_grade))+
geom_boxplot(aes(colour= Dalc))+
scale_y_continuous()+
facet_grid(~subject)+
scale_x_discrete(name = "Workday Alcohol Consumption",
labels = c("Very Low", "Low", "Medium", "High", "Very High"))+
theme_bw(base_size = 9,
base_family = "",
base_line_size = 0.5,
base_rect_size = 0.5)+
labs(title = "Average Grades grouped by the levels of Workday Alcohol Consumption",
y = "Average Grades")+
theme(legend.position = "none")B <- ggplot(data = sac, aes(x = Walc, y = average_grade))+
geom_boxplot(aes(colour= Walc))+
scale_y_continuous()+
facet_grid(~subject)+
scale_x_discrete(name = "Weekend Alcohol Consumption",
labels = c("Very Low", "Low", "Medium", "High", "Very High"))+
theme_bw(base_size = 9,
base_family = "",
base_line_size = 0.5,
base_rect_size = 0.5)+
labs(title = "Average Grades grouped by the levels of Weekend Alcohol Consumption",
y = "Average Grades")+
theme(legend.position = "none")The higher level of alcohol consumption of students results in a relatively reduced of students average grade
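The plot objects A and B are assigned above but never drawn. A minimal display sketch using grid.arrange() from gridExtra, which was loaded at the top precisely for this:
grid.arrange(A, B, nrow = 2) # draw the workday and weekend panels stacked vertically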
Feature Engineering : Total Alcohol Consumption
prop.table(table(sac$Dalc)) # workday consumption proportions
#>
#> 1 2 3 4 5
#> 0.69636015 0.18773946 0.06609195 0.02490421 0.02490421
prop.table(table(sac$Walc)) # weekend consumption proportions
#>
#> 1 2 3 4 5
#> 0.38122605 0.22509579 0.19157088 0.13218391 0.06992337
table(sac$Dalc) # workday consumption counts
#>
#> 1 2 3 4 5
#> 727 196 69 26 26
table(sac$Walc) # weekend consumption counts
#>
#> 1 2 3 4 5
#> 398 235 200 138 73
# # Binary Model target variable
# sac$Talc <- (as.numeric(sac$Dalc) + as.numeric(sac$Walc))/2
# sac$Talc <- ifelse(sac$Talc <= mean(sac$Talc), "Low", "High")
# sac$Talc <- factor(sac$Talc, levels = c("Low", "High"))
# sac <- sac %>%
# select(-c("Dalc", "Walc"))
# head(sac)
# Multiclass Model target variable
# 1 : 1 proportion
# sac$Talc <- (as.numeric(sac$Dalc) + as.numeric(sac$Walc))/2
# 5 : 2 proportion
sac$Talc <- as.numeric(sac$Dalc)*(5/7) + as.numeric(sac$Walc)*(2/7)
sac$Talc <- ifelse(sac$Talc <=1, "Very Low",
ifelse(sac$Talc <=2, "Low",
ifelse(sac$Talc <=3, "Medium",
ifelse(sac$Talc <=4, "High", "Very High"))))
sac$Talc <- factor(sac$Talc, levels=c("Very Low", "Low", "Medium", "High", "Very High"))
head(sac)
The 5:2 weighting works better than 1:1 because it matches the five workdays and two weekend days, so the resulting Talc values are spread more informatively across the scale.
prop.table(table(sac$Talc))
#>
#> Very Low Low Medium High Very High
#> 0.37452107 0.36685824 0.16187739 0.06321839 0.03352490
table(sac$Talc)
#>
#> Very Low Low Medium High Very High
#> 391 383 169 66 35
Split Math & Por Data
After the wrangling process is complete, we can split the combined data back into the math and por datasets, as sketched below.
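The splitting chunk itself did not render. A minimal sketch, assuming the split simply filters on the subject column created earlier; the proportion tables below correspond to calls like the last two lines:
sac_math <- sac %>% filter(subject == "math") # rows from the Math course
sac_por <- sac %>% filter(subject == "por") # rows from the Portuguese course
prop.table(table(sac_math$Talc))
prop.table(table(sac_por$Talc))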
#>
#> Very Low Low Medium High Very High
#> 0.37974684 0.35696203 0.17468354 0.05569620 0.03291139
#>
#> Very Low Low Medium High Very High
#> 0.37134052 0.37288136 0.15408320 0.06779661 0.03389831
Near-zero variance columns
#> [1] 1
We can identify variables whose variance is close to zero with nearZeroVar(). We exclude the subject column during pre-processing later.
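The exact call behind the output above is not shown; a sketch of how the check might look with caret's nearZeroVar(), which by default returns the positions of near-zero-variance columns:
nzv_cols <- nearZeroVar(sac_math) # indices of near-zero-variance predictors
colnames(sac_math)[nzv_cols] # inspect which columns are flagged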
Cross Validation & Pre-Processing
MATH
set.seed(100)
idx <- initial_split(data = sac_math, prop = 0.8, strata = "Talc")
train <- training(idx)
test <- testing(idx)
#proportion checking
prop.table(table(train$Talc))
#>
#> Very Low Low Medium High Very High
#> 0.38050314 0.36163522 0.16981132 0.05345912 0.03459119
prop.table(table(test$Talc))
#>
#> Very Low Low Medium High Very High
#> 0.37662338 0.33766234 0.19480519 0.06493506 0.02597403
math_recipe <- recipe(Talc~., train) %>%
step_upsample(Talc, seed = 100) %>%
step_nzv(all_predictors()) %>% #removing `subject` column
prep()
math_train <- juice(math_recipe)
math_test <- bake(math_recipe, test)
prop.table(table(math_train$Talc)) #upsample
#>
#> Very Low Low Medium High Very High
#> 0.2 0.2 0.2 0.2 0.2
POR
set.seed(100)
idx2 <- initial_split(data = sac_por, prop = 0.8, strata = "Talc")
train <- training(idx2)
test <- testing(idx2)
#check the Talc proportions in train
prop.table(table(train$Talc))
#>
#> Very Low Low Medium High Very High
#> 0.37428023 0.37044146 0.15547025 0.06525912 0.03454894
prop.table(table(test$Talc))
#>
#> Very Low Low Medium High Very High
#> 0.3593750 0.3828125 0.1484375 0.0781250 0.0312500
por_recipe <- recipe(Talc~., train) %>%
step_upsample(Talc, seed = 100) %>%
step_nzv(all_predictors()) %>% #removing `subject` column
prep()
por_train <- juice(por_recipe)
por_test <- bake(por_recipe, test)
prop.table(table(por_train$Talc)) #upsample
#>
#> Very Low Low Medium High Very High
#> 0.2 0.2 0.2 0.2 0.2
Build Model
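The chunk that fits the Naive Bayes model did not render. A minimal sketch using naiveBayes() from the e1071 package loaded above; the object name model_NB_math matches the prediction code below:
model_NB_math <- naiveBayes(Talc ~ ., data = math_train) # fit NB on the upsampled math training set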
Prediction
# # Binary Model
# # model fitting
# NB_math_pred <- predict(model_NB_math, math_test, type = "class") # for the class prediction
# NB_math_prob <- predict(model_NB_math, math_test, type = "raw") # for the probability
#
# # result
# NB_table_math <- select(math_test, Talc) %>%
# bind_cols(Talc_pred = NB_math_pred) %>%
# bind_cols(Talc_Lprob = round(NB_math_prob[,1],4)) %>%
# bind_cols(Talc_Hprob = round(NB_math_prob[,2],4))
# Multiclass Model
# model fitting
NB_math_pred <- predict(model_NB_math, math_test, type = "class") # for the class prediction
NB_math_prob <- predict(model_NB_math, math_test, type = "raw") # for the probability
# result
NB_table_math <- select(math_test, Talc) %>%
bind_cols(Talc_pred = NB_math_pred) %>%
bind_cols(Talc_VLprob = round(NB_math_prob[,1],4)) %>%
bind_cols(Talc_Lprob = round(NB_math_prob[,2],4)) %>%
bind_cols(Talc_Mprob = round(NB_math_prob[,3],4)) %>%
bind_cols(Talc_Hprob = round(NB_math_prob[,4],4)) %>%
bind_cols(Talc_VHprob = round(NB_math_prob[,5],4))
Model Evaluation
# performance evaluation - confusion matrix
NB_table_math %>%
conf_mat(Talc, Talc_pred) %>%
autoplot(type = "heatmap")Confusion matrix
# # Binary Model
# confusionMatrix(data = NB_math_pred, reference = math_test$Talc, positive = "High")
# Multiclass Model
confusionMatrix(data = NB_math_pred, reference = math_test$Talc)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Very Low Low Medium High Very High
#> Very Low 29 0 1 0 0
#> Low 0 26 0 1 0
#> Medium 0 0 14 0 0
#> High 0 0 0 4 0
#> Very High 0 0 0 0 2
#>
#> Overall Statistics
#>
#> Accuracy : 0.974
#> 95% CI : (0.9093, 0.9968)
#> No Information Rate : 0.3766
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.9626
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity 1.0000 1.0000 0.9333 0.80000
#> Specificity 0.9792 0.9804 1.0000 1.00000
#> Pos Pred Value 0.9667 0.9630 1.0000 1.00000
#> Neg Pred Value 1.0000 1.0000 0.9841 0.98630
#> Prevalence 0.3766 0.3377 0.1948 0.06494
#> Detection Rate 0.3766 0.3377 0.1818 0.05195
#> Detection Prevalence 0.3896 0.3506 0.1818 0.05195
#> Balanced Accuracy 0.9896 0.9902 0.9667 0.90000
#> Class: Very High
#> Sensitivity 1.00000
#> Specificity 1.00000
#> Pos Pred Value 1.00000
#> Neg Pred Value 1.00000
#> Prevalence 0.02597
#> Detection Rate 0.02597
#> Detection Prevalence 0.02597
#> Balanced Accuracy 1.00000
ROC (Receiver Operating Characteristic)
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. The ROC is a probability curve, and the AUC represents the degree of separability between classes.
# # Binary Model
# head(NB_math_prob)
# NB_roc <- data.frame(prediction = NB_math_prob[,2],
# actual = as.numeric(NB_table_math$Talc == "High"))
head(NB_math_prob)
#> Very Low Low Medium High Very High
#> [1,] 0.997999884 0.001580914 0.0004139177 0.000003950184 0.0000013342613665
#> [2,] 0.998082362 0.001633682 0.0002764905 0.000007298232 0.0000001675807808
#> [3,] 0.003388426 0.996103024 0.0005007134 0.000007836138 0.0000000003096437
#> [4,] 0.845009905 0.005411335 0.0027735149 0.146786544997 0.0000187004431713
#> [5,] 0.997359856 0.001351401 0.0002025819 0.001084922958 0.0000012375322506
#> [6,] 0.106325221 0.789514446 0.0996893776 0.000189879775 0.0042810751495299
NB_roc <- data.frame(prediction1 = NB_math_prob[,1],
actual1 = as.numeric(NB_table_math$Talc == "Very Low"), # note: no leading space in the level name
prediction2 = NB_math_prob[,2],
actual2 = as.numeric(NB_table_math$Talc == "Low"),
prediction3 = NB_math_prob[,3],
actual3 = as.numeric(NB_table_math$Talc == "Medium"),
prediction4 = NB_math_prob[,4],
actual4 = as.numeric(NB_table_math$Talc == "High"),
prediction5 = NB_math_prob[,5],
actual5 = as.numeric(NB_table_math$Talc == "Very High"))
# # Binary Model
# head(NB_roc)
# pred_roc <- prediction(NB_roc$prediction, NB_roc$actual)
# perf <- performance(prediction.obj = pred_roc, measure = "tpr",x.measure = "fpr")
# plot(perf)
# abline(a = 0, b = 1)
# pred_roc1 <- prediction(NB_roc$prediction1, NB_roc$actual1)
pred_roc2 <- prediction(NB_roc$prediction2, NB_roc$actual2)
pred_roc3 <- prediction(NB_roc$prediction3, NB_roc$actual3)
pred_roc4 <- prediction(NB_roc$prediction4, NB_roc$actual4)
pred_roc5 <- prediction(NB_roc$prediction5, NB_roc$actual5)
# head(pred_roc1)
# perf1 <- performance(prediction.obj = pred_roc1, measure = "tpr",x.measure = "fpr")
perf2 <- performance(prediction.obj = pred_roc2, measure = "tpr",x.measure = "fpr")
perf3 <- performance(prediction.obj = pred_roc3, measure = "tpr",x.measure = "fpr")
perf4 <- performance(prediction.obj = pred_roc4, measure = "tpr",x.measure = "fpr")
perf5 <- performance(prediction.obj = pred_roc5, measure = "tpr",x.measure = "fpr")
# plot(perf1)
# abline(a = 0, b = 1)
plot(perf2)
abline(a = 0, b = 1)
AUC (Area Under The Curve)
# # Binary Model
# AUC_NB_math_binary <- performance(pred_roc, measure = "auc")
# AUC_NB_math_binary@y.values[[1]]
# Multiclass Model
library(pROC)
y_pred <- as.ordered(NB_math_pred)
AUC_NB_math_multiclass <- multiclass.roc(math_test$Talc, y_pred)
print(AUC_NB_math_multiclass)
#>
#> Call:
#> multiclass.roc.default(response = math_test$Talc, predictor = y_pred)
#>
#> Data: y_pred with 5 levels of math_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.9613
Build Model
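As with the math model, the fitting chunk did not render; a minimal sketch using e1071's naiveBayes(), with the object name model_NB_por used by the predictions below:
model_NB_por <- naiveBayes(Talc ~ ., data = por_train) # fit NB on the upsampled por training set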
Prediction
# # Binary Model
# # model fitting
# NB_por_pred <- predict(model_NB_por, por_test, type = "class") # for the class prediction
# NB_por_prob <- predict(model_NB_por, por_test, type = "raw") # for the probability
#
# # result
# NB_table_por <- select(por_test, Talc) %>%
# bind_cols(Talc_pred = NB_por_pred) %>%
# bind_cols(Talc_Lprob = round(NB_por_prob[,1],4)) %>%
# bind_cols(Talc_Hprob = round(NB_por_prob[,2],4))
# Multiclass Model
# model fitting
NB_por_pred <- predict(model_NB_por, por_test, type = "class") # for the class prediction
NB_por_prob <- predict(model_NB_por, por_test, type = "raw") # for the probability
# result
NB_table_por <- select(por_test, Talc) %>%
bind_cols(Talc_pred = NB_por_pred) %>%
bind_cols(Talc_VLprob = round(NB_por_prob[,1],4)) %>%
bind_cols(Talc_Lprob = round(NB_por_prob[,2],4)) %>%
bind_cols(Talc_Mprob = round(NB_por_prob[,3],4)) %>%
bind_cols(Talc_Hprob = round(NB_por_prob[,4],4)) %>%
bind_cols(Talc_VHprob = round(NB_por_prob[,5],4))
Model Evaluation
# performance evaluation - confusion matrix
NB_table_por %>%
conf_mat(Talc, Talc_pred) %>%
autoplot(type = "heatmap")Confusion matrix
# # Binary Model
# confusionMatrix(data = NB_por_pred, reference = por_test$Talc, positive = "High")
# Multiclass Model
confusionMatrix(data = NB_por_pred, reference = por_test$Talc)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Very Low Low Medium High Very High
#> Very Low 43 0 0 0 1
#> Low 0 48 0 0 1
#> Medium 1 0 19 0 0
#> High 1 0 0 10 0
#> Very High 1 1 0 0 2
#>
#> Overall Statistics
#>
#> Accuracy : 0.9531
#> 95% CI : (0.9008, 0.9826)
#> No Information Rate : 0.3828
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.9329
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity 0.9348 0.9796 1.0000 1.00000
#> Specificity 0.9878 0.9873 0.9908 0.99153
#> Pos Pred Value 0.9773 0.9796 0.9500 0.90909
#> Neg Pred Value 0.9643 0.9873 1.0000 1.00000
#> Prevalence 0.3594 0.3828 0.1484 0.07812
#> Detection Rate 0.3359 0.3750 0.1484 0.07812
#> Detection Prevalence 0.3438 0.3828 0.1562 0.08594
#> Balanced Accuracy 0.9613 0.9835 0.9954 0.99576
#> Class: Very High
#> Sensitivity 0.50000
#> Specificity 0.98387
#> Pos Pred Value 0.50000
#> Neg Pred Value 0.98387
#> Prevalence 0.03125
#> Detection Rate 0.01562
#> Detection Prevalence 0.03125
#> Balanced Accuracy 0.74194
ROC (Receiver Operating Characteristic)
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. The ROC is a probability curve, and the AUC represents the degree of separability between classes.
# # Binary Model
# head(NB_por_prob)
# NB_roc <- data.frame(prediction = NB_por_prob[,2],
# actual = as.numeric(NB_table_por$Talc == "High"))
head(NB_por_prob)
#> Very Low Low Medium High Very High
#> [1,] 0.9986728184 0.0008012609 0.00042334075 0.000102294063 0.00000028591796
#> [2,] 0.9987594918 0.0004767296 0.00066986002 0.000093882385 0.00000003618707
#> [3,] 0.0080327007 0.0134080249 0.97480815410 0.001500363288 0.00225075706231
#> [4,] 0.9996412280 0.0002615960 0.00009362201 0.000003538023 0.00000001597722
#> [5,] 0.0002625151 0.0005375040 0.99826238689 0.000936614627 0.00000097939785
#> [6,] 0.1049034925 0.2980870184 0.25529525769 0.136329451718 0.20538477965125
NB_roc <- data.frame(prediction1 = NB_por_prob[,1],
actual1 = as.numeric(NB_table_por$Talc == "Very Low"), # note: no leading space in the level name
prediction2 = NB_por_prob[,2],
actual2 = as.numeric(NB_table_por$Talc == "Low"),
prediction3 = NB_por_prob[,3],
actual3 = as.numeric(NB_table_por$Talc == "Medium"),
prediction4 = NB_por_prob[,4],
actual4 = as.numeric(NB_table_por$Talc == "High"),
prediction5 = NB_por_prob[,5],
actual5 = as.numeric(NB_table_por$Talc == "Very High"))
# # Binary Model
# head(NB_roc)
# pred_roc <- prediction(NB_roc$prediction, NB_roc$actual)
# perf <- performance(prediction.obj = pred_roc, measure = "tpr",x.measure = "fpr")
# plot(perf)
# abline(a = 0, b = 1)
# pred_roc1 <- prediction(NB_roc$prediction1, NB_roc$actual1)
pred_roc2 <- prediction(NB_roc$prediction2, NB_roc$actual2)
pred_roc3 <- prediction(NB_roc$prediction3, NB_roc$actual3)
pred_roc4 <- prediction(NB_roc$prediction4, NB_roc$actual4)
pred_roc5 <- prediction(NB_roc$prediction5, NB_roc$actual5)
# head(pred_roc1)
# perf1 <- performance(prediction.obj = pred_roc1, measure = "tpr",x.measure = "fpr")
perf2 <- performance(prediction.obj = pred_roc2, measure = "tpr",x.measure = "fpr")
perf3 <- performance(prediction.obj = pred_roc3, measure = "tpr",x.measure = "fpr")
perf4 <- performance(prediction.obj = pred_roc4, measure = "tpr",x.measure = "fpr")
perf5 <- performance(prediction.obj = pred_roc5, measure = "tpr",x.measure = "fpr")
# plot(perf1)
# abline(a = 0, b = 1)
plot(perf2)
abline(a = 0, b = 1)
AUC (Area Under The Curve)
# # Binary Model
# AUC_NB_por_binary <- performance(pred_roc, measure = "auc")
# AUC_NB_por_binary@y.values[[1]]
# Multiclass Model
y_pred <- as.ordered(NB_por_pred)
AUC_NB_por_multiclass <- multiclass.roc(por_test$Talc, y_pred)
print(AUC_NB_por_multiclass)
#>
#> Call:
#> multiclass.roc.default(response = por_test$Talc, predictor = y_pred)
#>
#> Data: y_pred with 5 levels of por_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.8271
Build Model
CTREE default
# model_DT_math <- ctree(formula = Talc~., data = math_train,
#                        control = ctree_control(mincriterion = 0.01))
RPART
# model_DT_math <- rpart(formula = Talc~., data = math_train, method = "class",
#                        control = rpart.control(cp = 0.1))
# model_DT_math <- rpart(formula = Talc~., data = math_train, method = "class",
#                        control = rpart.control(cp = 0.01))
CTREE
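The chunk that fits the final tree (and produces the class predictions used below) did not render. A minimal sketch, assuming the default-mincriterion ctree() that the conclusions later select:
model_DT_math <- ctree(formula = Talc ~ ., data = math_train) # default mincriterion = 0.95
DT_math_pred <- predict(model_DT_math, math_test, type = "response") # class predictions
plot(model_DT_math) # visualize the fitted conditional inference tree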
RPART
Prediction
CTREE
RPART
# # Binary Model
# # model fitting
# DT_math_pred <- predict(model_DT_math, math_test, type = "class") # for the class prediction
# DT_math_prob <- predict(model_DT_math, math_test, type = "prob") # for the probability
#
# # result
# DT_table_math <- select(math_test, Talc) %>%
# bind_cols(Talc_pred = DT_math_pred) %>%
# bind_cols(Talc_Lprob = round(DT_math_prob[,1],4)) %>%
# bind_cols(Talc_Hprob = round(DT_math_prob[,2],4))
# # Multiclass Model
# # model fitting
# DT_math_pred <- predict(model_DT_math, math_test, type = "class") # for the class prediction
DT_math_prob <- predict(model_DT_math, math_test, type = "prob") # for the probability
# result
DT_table_math <- select(math_test, Talc) %>%
bind_cols(Talc_pred = DT_math_pred) %>%
bind_cols(Talc_VLprob = round(DT_math_prob[,1],4)) %>%
bind_cols(Talc_Lprob = round(DT_math_prob[,2],4)) %>%
bind_cols(Talc_Mprob = round(DT_math_prob[,3],4)) %>%
bind_cols(Talc_Hprob = round(DT_math_prob[,4],4)) %>%
bind_cols(Talc_VHprob = round(DT_math_prob[,5],4))
Model Evaluation
# performance evaluation - confusion matrix
DT_table_math %>%
conf_mat(Talc, Talc_pred) %>%
autoplot(type = "heatmap")Confusion matrix
# # Binary Model
# confusionMatrix(data = DT_math_pred, reference = math_test$Talc, positive = "High")
# Multiclass Model
confusionMatrix(data = DT_math_pred, reference = math_test$Talc)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Very Low Low Medium High Very High
#> Very Low 4 6 5 0 0
#> Low 5 5 2 1 0
#> Medium 11 11 6 1 1
#> High 5 3 1 3 1
#> Very High 4 1 1 0 0
#>
#> Overall Statistics
#>
#> Accuracy : 0.2338
#> 95% CI : (0.1448, 0.3441)
#> No Information Rate : 0.3766
#> P-Value [Acc > NIR] : 0.99739
#>
#> Kappa : 0.0186
#>
#> Mcnemar's Test P-Value : 0.02429
#>
#> Statistics by Class:
#>
#> Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity 0.13793 0.19231 0.40000 0.60000
#> Specificity 0.77083 0.84314 0.61290 0.86111
#> Pos Pred Value 0.26667 0.38462 0.20000 0.23077
#> Neg Pred Value 0.59677 0.67188 0.80851 0.96875
#> Prevalence 0.37662 0.33766 0.19481 0.06494
#> Detection Rate 0.05195 0.06494 0.07792 0.03896
#> Detection Prevalence 0.19481 0.16883 0.38961 0.16883
#> Balanced Accuracy 0.45438 0.51772 0.50645 0.73056
#> Class: Very High
#> Sensitivity 0.00000
#> Specificity 0.92000
#> Pos Pred Value 0.00000
#> Neg Pred Value 0.97183
#> Prevalence 0.02597
#> Detection Rate 0.00000
#> Detection Prevalence 0.07792
#> Balanced Accuracy 0.46000
ROC (Receiver Operating Characteristic)
# # Binary Model
# options(scipen = 999)
# head(DT_math_prob)
# DT_roc <- data.frame(prediction = DT_math_prob[,2],
# actual = as.numeric(DT_table_math$Talc == "High"))
head(DT_math_prob)
#> Very Low Low Medium High Very High
#> 1 0.38888889 0.05555556 0.5555556 0.0000000 0
#> 2 0.38888889 0.05555556 0.5555556 0.0000000 0
#> 3 0.25000000 0.00000000 0.0000000 0.7500000 0
#> 4 0.05555556 0.00000000 0.1111111 0.8333333 0
#> 5 0.11111111 0.11111111 0.0000000 0.7777778 0
#> 6 0.30769231 0.69230769 0.0000000 0.0000000 0
DT_roc <- data.frame(prediction1 = DT_math_prob[,1],
actual1 = as.numeric(DT_table_math$Talc == "Very Low"), # note: no leading space in the level name
prediction2 = DT_math_prob[,2],
actual2 = as.numeric(DT_table_math$Talc == "Low"),
prediction3 = DT_math_prob[,3],
actual3 = as.numeric(DT_table_math$Talc == "Medium"),
prediction4 = DT_math_prob[,4],
actual4 = as.numeric(DT_table_math$Talc == "High"),
prediction5 = DT_math_prob[,5],
actual5 = as.numeric(DT_table_math$Talc == "Very High"))
# # Binary Model
# head(DT_roc)
# pred_roc <- prediction(DT_roc$prediction, DT_roc$actual)
# perf <- performance(prediction.obj = pred_roc, measure = "tpr",x.measure = "fpr")
# plot(perf)
# abline(a = 0, b = 1)
# pred_roc1 <- prediction(DT_roc$prediction1, DT_roc$actual1)
pred_roc2 <- prediction(DT_roc$prediction2, DT_roc$actual2)
pred_roc3 <- prediction(DT_roc$prediction3, DT_roc$actual3)
pred_roc4 <- prediction(DT_roc$prediction4, DT_roc$actual4)
pred_roc5 <- prediction(DT_roc$prediction5, DT_roc$actual5)
# head(pred_roc1)
# perf1 <- performance(prediction.obj = pred_roc1, measure = "tpr",x.measure = "fpr")
perf2 <- performance(prediction.obj = pred_roc2, measure = "tpr",x.measure = "fpr")
perf3 <- performance(prediction.obj = pred_roc3, measure = "tpr",x.measure = "fpr")
perf4 <- performance(prediction.obj = pred_roc4, measure = "tpr",x.measure = "fpr")
perf5 <- performance(prediction.obj = pred_roc5, measure = "tpr",x.measure = "fpr")
# plot(perf1)
# abline(a = 0, b = 1)
plot(perf2)
abline(a = 0, b = 1)
AUC (Area Under The Curve)
# # Binary Model
# AUC_DT_math_binary <- performance(pred_roc, measure = "auc")
# AUC_DT_math_binary@y.values[[1]]
# Multiclass Model
y_pred <- as.ordered(DT_math_pred) # use the Decision Tree predictions here
AUC_DT_math_multiclass <- multiclass.roc(math_test$Talc, y_pred)
print(AUC_DT_math_multiclass)
#>
#> Call:
#> multiclass.roc.default(response = math_test$Talc, predictor = y_pred)
#>
#> Data: y_pred with 5 levels of math_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.9613
Build Model
CTREE default
# model_DT_por <- ctree(formula = Talc~., data = por_train,
#                        control = ctree_control(mincriterion = 0.01))
RPART
# model_DT_por <- rpart(formula = Talc~., data = por_train, method = "class",
#                        control = rpart.control(cp = 0.1))
# model_DT_por <- rpart(formula = Talc~., data = por_train, method = "class",
#                        control = rpart.control(cp = 0.01))
CTREE
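As on the math side, the final fit did not render; a minimal sketch assuming the default-mincriterion ctree():
model_DT_por <- ctree(formula = Talc ~ ., data = por_train) # default mincriterion = 0.95
DT_por_pred <- predict(model_DT_por, por_test, type = "response") # class predictions
plot(model_DT_por) # visualize the fitted conditional inference tree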
RPART
Prediction
CTREE
RPART
# # Binary Model
# # model fitting
# DT_por_pred <- predict(model_DT_por, por_test, type = "class") # for the class prediction
# DT_por_prob <- predict(model_DT_por, por_test, type = "prob") # for the probability
#
# # result
# DT_table_por <- select(por_test, Talc) %>%
# bind_cols(Talc_pred = DT_por_pred) %>%
# bind_cols(Talc_Lprob = round(DT_por_prob[,1],4)) %>%
# bind_cols(Talc_Hprob = round(DT_por_prob[,2],4))
# # Multiclass Model
# # model fitting
# DT_por_pred <- predict(model_DT_por, por_test, type = "class") # for the class prediction
DT_por_prob <- predict(model_DT_por, por_test, type = "prob") # for the probability
# result
DT_table_por <- select(por_test, Talc) %>%
bind_cols(Talc_pred = DT_por_pred) %>%
bind_cols(Talc_VLprob = round(DT_por_prob[,1],4)) %>%
bind_cols(Talc_Lprob = round(DT_por_prob[,2],4)) %>%
bind_cols(Talc_Mprob = round(DT_por_prob[,3],4)) %>%
bind_cols(Talc_Hprob = round(DT_por_prob[,4],4)) %>%
bind_cols(Talc_VHprob = round(DT_por_prob[,5],4))
Model Evaluation
# performance evaluation - confusion matrix
DT_table_por %>%
conf_mat(Talc, Talc_pred) %>%
autoplot(type = "heatmap")Confusion matrix
# # Binary Model
# confusionMatrix(data = DT_por_pred, reference = por_test$Talc, positive = "High")
# Multiclass Model
confusionMatrix(data = DT_por_pred, reference = por_test$Talc)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Very Low Low Medium High Very High
#> Very Low 9 9 2 2 2
#> Low 18 23 5 1 1
#> Medium 9 11 8 2 1
#> High 7 5 3 3 0
#> Very High 3 1 1 2 0
#>
#> Overall Statistics
#>
#> Accuracy : 0.3359
#> 95% CI : (0.2549, 0.4248)
#> No Information Rate : 0.3828
#> P-Value [Acc > NIR] : 0.88204
#>
#> Kappa : 0.1031
#>
#> Mcnemar's Test P-Value : 0.06307
#>
#> Statistics by Class:
#>
#> Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity 0.19565 0.4694 0.4211 0.30000
#> Specificity 0.81707 0.6835 0.7890 0.87288
#> Pos Pred Value 0.37500 0.4792 0.2581 0.16667
#> Neg Pred Value 0.64423 0.6750 0.8866 0.93636
#> Prevalence 0.35938 0.3828 0.1484 0.07812
#> Detection Rate 0.07031 0.1797 0.0625 0.02344
#> Detection Prevalence 0.18750 0.3750 0.2422 0.14062
#> Balanced Accuracy 0.50636 0.5765 0.6050 0.58644
#> Class: Very High
#> Sensitivity 0.00000
#> Specificity 0.94355
#> Pos Pred Value 0.00000
#> Neg Pred Value 0.96694
#> Prevalence 0.03125
#> Detection Rate 0.00000
#> Detection Prevalence 0.05469
#> Balanced Accuracy 0.47177
ROC (Receiver Operating Characteristic)
# # Binary Model
# options(scipen = 999)
# head(DT_por_prob)
# DT_roc <- data.frame(prediction = DT_por_prob[,2],
# actual = as.numeric(DT_table_por$Talc == "High"))
head(DT_por_prob)
#> Very Low Low Medium High Very High
#> 1 0.3571429 0.6428571 0.0000000 0 0
#> 2 0.8333333 0.1666667 0.0000000 0 0
#> 3 0.0000000 0.1818182 0.8181818 0 0
#> 4 0.7777778 0.0000000 0.2222222 0 0
#> 5 0.2857143 0.7142857 0.0000000 0 0
#> 6 0.4000000 0.6000000 0.0000000 0 0
DT_roc <- data.frame(prediction1 = DT_por_prob[,1],
actual1 = as.numeric(DT_table_por$Talc == "Very Low"), # note: no leading space in the level name
prediction2 = DT_por_prob[,2],
actual2 = as.numeric(DT_table_por$Talc == "Low"),
prediction3 = DT_por_prob[,3],
actual3 = as.numeric(DT_table_por$Talc == "Medium"),
prediction4 = DT_por_prob[,4],
actual4 = as.numeric(DT_table_por$Talc == "High"),
prediction5 = DT_por_prob[,5],
actual5 = as.numeric(DT_table_por$Talc == "Very High"))
# # Binary Model
# head(DT_roc)
# pred_roc <- prediction(DT_roc$prediction, DT_roc$actual)
# perf <- performance(prediction.obj = pred_roc, measure = "tpr",x.measure = "fpr")
# plot(perf)
# abline(a = 0, b = 1)
# pred_roc1 <- prediction(DT_roc$prediction1, DT_roc$actual1)
pred_roc2 <- prediction(DT_roc$prediction2, DT_roc$actual2)
pred_roc3 <- prediction(DT_roc$prediction3, DT_roc$actual3)
pred_roc4 <- prediction(DT_roc$prediction4, DT_roc$actual4)
pred_roc5 <- prediction(DT_roc$prediction5, DT_roc$actual5)
# head(pred_roc1)
# perf1 <- performance(prediction.obj = pred_roc1, measure = "tpr",x.measure = "fpr")
perf2 <- performance(prediction.obj = pred_roc2, measure = "tpr",x.measure = "fpr")
perf3 <- performance(prediction.obj = pred_roc3, measure = "tpr",x.measure = "fpr")
perf4 <- performance(prediction.obj = pred_roc4, measure = "tpr",x.measure = "fpr")
perf5 <- performance(prediction.obj = pred_roc5, measure = "tpr",x.measure = "fpr")
# plot(perf1)
# abline(a = 0, b = 1)
plot(perf2)
abline(a = 0, b = 1)
AUC (Area Under The Curve)
# # Binary Model
# AUC_DT_por_binary <- performance(pred_roc, measure = "auc")
# AUC_DT_por_binary@y.values[[1]]
# Multiclass Model
y_pred <- as.ordered(DT_por_pred) # use the Decision Tree predictions here
AUC_DT_por_multiclass <- multiclass.roc(por_test$Talc, y_pred)
print(AUC_DT_por_multiclass)
#>
#> Call:
#> multiclass.roc.default(response = por_test$Talc, predictor = y_pred)
#>
#> Data: y_pred with 5 levels of por_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.8271
In this section I test whether the Decision Tree model is vulnerable to overfitting. This is checked by predicting with the model on both the test data and the training data that was used to build it.
#>
#> Very Low Low Medium High Very High
#> 29 26 15 5 2
#>
#> Very Low Low Medium High Very High
#> 46 49 19 10 4
# `model_dt` in the original chunk is assumed to be the math decision tree fitted above
pred_train_dt <- predict(object = model_DT_math, newdata = math_train, type = "response")
pred_test_dt <- predict(object = model_DT_math, newdata = math_test, type = "response")
# confusionMatrix(data = pred_train_dt, reference = math_train$Talc, positive = "High")
confusionMatrix(data = pred_train_dt, reference = math_train$Talc)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Very Low Low Medium High Very High
#> Very Low 74 19 16 0 0
#> Low 17 74 14 7 0
#> Medium 18 13 78 0 0
#> High 8 6 9 114 0
#> Very High 4 9 4 0 121
#>
#> Overall Statistics
#>
#> Accuracy : 0.762
#> 95% CI : (0.726, 0.7954)
#> No Information Rate : 0.2
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.7025
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity 0.6116 0.6116 0.6446 0.9421
#> Specificity 0.9277 0.9215 0.9360 0.9525
#> Pos Pred Value 0.6789 0.6607 0.7156 0.8321
#> Neg Pred Value 0.9052 0.9047 0.9133 0.9850
#> Prevalence 0.2000 0.2000 0.2000 0.2000
#> Detection Rate 0.1223 0.1223 0.1289 0.1884
#> Detection Prevalence 0.1802 0.1851 0.1802 0.2264
#> Balanced Accuracy 0.7696 0.7665 0.7903 0.9473
#> Class: Very High
#> Sensitivity 1.0000
#> Specificity 0.9649
#> Pos Pred Value 0.8768
#> Neg Pred Value 1.0000
#> Prevalence 0.2000
#> Detection Rate 0.2000
#> Detection Prevalence 0.2281
#> Balanced Accuracy 0.9824
# confusionMatrix(data = pred_test_dt, reference = math_test$Talc, positive = "High")
confusionMatrix(data = pred_test_dt, reference = math_test$Talc)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Very Low Low Medium High Very High
#> Very Low 10 8 8 0 1
#> Low 6 5 2 1 0
#> Medium 3 8 2 1 0
#> High 6 4 2 3 1
#> Very High 4 1 1 0 0
#>
#> Overall Statistics
#>
#> Accuracy : 0.2597
#> 95% CI : (0.1664, 0.3723)
#> No Information Rate : 0.3766
#> P-Value [Acc > NIR] : 0.98890
#>
#> Kappa : 0.0203
#>
#> Mcnemar's Test P-Value : 0.03911
#>
#> Statistics by Class:
#>
#> Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity 0.3448 0.19231 0.13333 0.60000
#> Specificity 0.6458 0.82353 0.80645 0.81944
#> Pos Pred Value 0.3704 0.35714 0.14286 0.18750
#> Neg Pred Value 0.6200 0.66667 0.79365 0.96721
#> Prevalence 0.3766 0.33766 0.19481 0.06494
#> Detection Rate 0.1299 0.06494 0.02597 0.03896
#> Detection Prevalence 0.3506 0.18182 0.18182 0.20779
#> Balanced Accuracy 0.4953 0.50792 0.46989 0.70972
#> Class: Very High
#> Sensitivity 0.00000
#> Specificity 0.92000
#> Pos Pred Value 0.00000
#> Neg Pred Value 0.97183
#> Prevalence 0.02597
#> Detection Rate 0.00000
#> Detection Prevalence 0.07792
#> Balanced Accuracy 0.46000
Based on the experiment above, the training accuracy (0.762) far exceeds the test accuracy (0.26), so we can conclude that our Decision Tree model is indeed vulnerable to overfitting.
Random forest is a classification method based on the ensemble approach. A random forest is built from several decision trees (default = 500) with different characteristics: each tree is trained on a different bootstrap sample of the observations and considers a random subset of the predictors at each split.
Ensemble learning is a machine learning paradigm where multiple models (often called “weak learners”) are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or robust models.
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.
The k value must be chosen carefully for your data sample. The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller.
To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
Build Model
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
# model_RF_math <- train(Talc~., data = math_train, method = "rf", trControl = ctrl)
# saveRDS(model_RF_math, "model/RF_math_mc.RDS")
# model_RF_math <- readRDS("model/RF_math_bin.RDS") #binary model
model_RF_math <- readRDS("model/RF_math_mc.RDS") #multiclass model
model_RF_math
#> Random Forest
#>
#> 605 samples
#> 29 predictor
#> 5 classes: 'Very Low', 'Low', 'Medium', 'High', 'Very High'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 5 times)
#> Summary of sample sizes: 485, 485, 483, 485, 482, 485, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.8515453 0.8144140
#> 32 0.8502179 0.8127533
#> 62 0.8273983 0.7842296
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
Final Model Inspection
# plot(model_RF_math$finalModel)
# legend("topright", colnames(model_RF_math$finalModel$err.rate),
# col=1:3, cex=0.8, fill=1:3) #binary
# legend("topright", colnames(model_RF_math$finalModel$err.rate),
# col=1:6, cex=0.8, fill=1:6) #multiclass
OOB (Out-Of-Bag) error estimates indicate the model's error on observations not seen by each tree, i.e. an accuracy estimate on unseen data.
Variable Importance
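The call producing the importance table below is not shown; it most likely comes from caret's varImp() applied to the trained model:
varImp(model_RF_math) # scaled variable importance from the caret rf model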
#> rf variable importance
#>
#> only 20 most important variables shown (out of 62)
#>
#> Overall
#> average_grade 100.00
#> absences 99.98
#> age 73.71
#> sexM 66.40
#> goout5 42.74
#> paidyes 34.67
#> freetime4 32.88
#> famsupyes 32.46
#> addressU 31.71
#> famsizeLE3 29.71
#> romanticyes 29.49
#> activitiesyes 29.23
#> studytime2 28.29
#> Fjobservices 28.17
#> famrel4 27.11
#> Medu2 26.76
#> Fjobother 26.36
#> Fedu2 26.19
#> nurseryyes 25.55
#> health5 25.01
Prediction
# # Binary Model
# # model fitting
# RF_math_pred <- predict(model_RF_math, math_test, type = "raw") # for the class prediction
# RF_math_prob <- predict(model_RF_math, math_test, type = "prob") # for the probability
#
# # result
# RF_table_math <- select(math_test, Talc) %>%
# bind_cols(Talc_pred = RF_math_pred) %>%
# bind_cols(Talc_Lprob = round(RF_math_prob[,1],4)) %>%
# bind_cols(Talc_Hprob = round(RF_math_prob[,2],4))
# Multiclass Model
# model fitting
RF_math_pred <- predict(model_RF_math, math_test, type = "raw") # for the class prediction
RF_math_prob <- predict(model_RF_math, math_test, type = "prob") # for the probability
# result
RF_table_math <- select(math_test, Talc) %>%
bind_cols(Talc_pred = RF_math_pred) %>%
bind_cols(Talc_VLprob = round(RF_math_prob[,1],4)) %>%
bind_cols(Talc_Lprob = round(RF_math_prob[,2],4)) %>%
bind_cols(Talc_Mprob = round(RF_math_prob[,3],4)) %>%
bind_cols(Talc_Hprob = round(RF_math_prob[,4],4)) %>%
bind_cols(Talc_VHprob = round(RF_math_prob[,5],4))
Model Evaluation
# performance evaluation - confusion matrix
RF_table_math %>%
conf_mat(Talc, Talc_pred) %>%
autoplot(type = "heatmap")Confusion matrix
# # Binary Model
# confusionMatrix(data = RF_math_pred, reference = math_test$Talc, positive = "High")
# Multiclass Model
confusionMatrix(data = RF_math_pred, reference = math_test$Talc)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Very Low Low Medium High Very High
#> Very Low 20 20 7 0 0
#> Low 5 3 3 2 0
#> Medium 0 3 4 3 2
#> High 3 0 0 0 0
#> Very High 1 0 1 0 0
#>
#> Overall Statistics
#>
#> Accuracy : 0.3506
#> 95% CI : (0.2453, 0.4678)
#> No Information Rate : 0.3766
#> P-Value [Acc > NIR] : 0.7194
#>
#> Kappa : 0.0444
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity 0.6897 0.11538 0.26667 0.00000
#> Specificity 0.4375 0.80392 0.87097 0.95833
#> Pos Pred Value 0.4255 0.23077 0.33333 0.00000
#> Neg Pred Value 0.7000 0.64062 0.83077 0.93243
#> Prevalence 0.3766 0.33766 0.19481 0.06494
#> Detection Rate 0.2597 0.03896 0.05195 0.00000
#> Detection Prevalence 0.6104 0.16883 0.15584 0.03896
#> Balanced Accuracy 0.5636 0.45965 0.56882 0.47917
#> Class: Very High
#> Sensitivity 0.00000
#> Specificity 0.97333
#> Pos Pred Value 0.00000
#> Neg Pred Value 0.97333
#> Prevalence 0.02597
#> Detection Rate 0.00000
#> Detection Prevalence 0.02597
#> Balanced Accuracy 0.48667
ROC (Receiver Operating Characteristic)
# # Binary Model
# options(scipen = 999)
# head(RF_math_prob)
# RF_roc <- data.frame(prediction = RF_math_prob[,2],
# actual = as.numeric(RF_table_math$Talc == "High"))
RF_roc <- data.frame(prediction1 = RF_math_prob[,1],
actual1 = as.numeric(RF_table_math$Talc == "Very Low"), # note: no leading space in the level name
prediction2 = RF_math_prob[,2],
actual2 = as.numeric(RF_table_math$Talc == "Low"),
prediction3 = RF_math_prob[,3],
actual3 = as.numeric(RF_table_math$Talc == "Medium"),
prediction4 = RF_math_prob[,4],
actual4 = as.numeric(RF_table_math$Talc == "High"),
prediction5 = RF_math_prob[,5],
actual5 = as.numeric(RF_table_math$Talc == "Very High"))
# # Binary Model
# head(RF_roc)
# pred_roc <- prediction(RF_roc$prediction, RF_roc$actual)
# perf <- performance(prediction.obj = pred_roc, measure = "tpr",x.measure = "fpr")
# plot(perf)
# abline(a = 0, b = 1)
# pred_roc1 <- prediction(RF_roc$prediction1, RF_roc$actual1)
pred_roc2 <- prediction(RF_roc$prediction2, RF_roc$actual2)
pred_roc3 <- prediction(RF_roc$prediction3, RF_roc$actual3)
pred_roc4 <- prediction(RF_roc$prediction4, RF_roc$actual4)
pred_roc5 <- prediction(RF_roc$prediction5, RF_roc$actual5)
# head(pred_roc1)
# perf1 <- performance(prediction.obj = pred_roc1, measure = "tpr",x.measure = "fpr")
perf2 <- performance(prediction.obj = pred_roc2, measure = "tpr",x.measure = "fpr")
perf3 <- performance(prediction.obj = pred_roc3, measure = "tpr",x.measure = "fpr")
perf4 <- performance(prediction.obj = pred_roc4, measure = "tpr",x.measure = "fpr")
perf5 <- performance(prediction.obj = pred_roc5, measure = "tpr",x.measure = "fpr")
# plot(perf1)
# abline(a = 0, b = 1)
plot(perf2)
abline(a = 0, b = 1)
AUC (Area Under The Curve)
# # Binary Model
# AUC_RF_math_binary <- performance(pred_roc, measure = "auc")
# AUC_RF_math_binary@y.values[[1]]
# Multiclass Model
y_pred <- as.ordered(RF_math_pred) # use the Random Forest predictions here
AUC_RF_math_multiclass <- multiclass.roc(math_test$Talc, y_pred)
print(AUC_RF_math_multiclass)
#>
#> Call:
#> multiclass.roc.default(response = math_test$Talc, predictor = y_pred)
#>
#> Data: y_pred with 5 levels of math_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.9613
Build Model
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
# model_RF_por <- train(Talc~., data = por_train, method = "rf", trControl = ctrl)
# saveRDS(model_RF_por, "model/RF_por_mc.RDS")
# model_RF_por <- readRDS("model/RF_por_bin.RDS") #binary model
model_RF_por <- readRDS("model/RF_por_mc.RDS") #multiclass model
model_RF_por
#> Random Forest
#>
#> 975 samples
#> 29 predictor
#> 5 classes: 'Very Low', 'Low', 'Medium', 'High', 'Very High'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 5 times)
#> Summary of sample sizes: 780, 780, 780, 780, 780, 780, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.8354872 0.7943590
#> 32 0.8334359 0.7917949
#> 62 0.8227692 0.7784615
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
Final Model Inspection
plot(model_RF_por$finalModel)
# legend("topright", colnames(model_RF_por$finalModel$err.rate),
# col=1:3, cex=0.8, fill=1:3) #binary
legend("topright", colnames(model_RF_por$finalModel$err.rate),
col=1:6, cex=0.8, fill=1:6) #multiclass
OOB (Out-Of-Bag) error estimates indicate the model's error on observations not seen by each tree, i.e. an accuracy estimate on unseen data.
Variable Importance
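As above, the importance table below most likely comes from caret's varImp():
varImp(model_RF_por) # scaled variable importance from the caret rf model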
#> rf variable importance
#>
#> only 20 most important variables shown (out of 62)
#>
#> Overall
#> average_grade 100.00
#> absences 91.49
#> sexM 73.77
#> age 65.25
#> famsupyes 38.19
#> goout5 36.82
#> health5 31.54
#> activitiesyes 30.41
#> studytime2 29.72
#> romanticyes 29.61
#> famrel5 28.97
#> famsizeLE3 28.16
#> health3 27.69
#> Fjobservices 27.66
#> schoolMS 27.44
#> freetime3 27.18
#> addressU 26.93
#> guardianmother 26.23
#> famrel4 25.95
#> Mjobother 25.33
Prediction
# # Binary Model
# # model fitting
# RF_por_pred <- predict(model_RF_por, por_test, type = "raw") # for the class prediction
# RF_por_prob <- predict(model_RF_por, por_test, type = "prob") # for the probability
#
# # result
# RF_table_por <- select(por_test, Talc) %>%
# bind_cols(Talc_pred = RF_por_pred) %>%
# bind_cols(Talc_Lprob = round(RF_por_prob[,1],4)) %>%
# bind_cols(Talc_Hprob = round(RF_por_prob[,2],4))
# Multiclass Model
# model fitting
RF_por_pred <- predict(model_RF_por, por_test, type = "raw") # for the class prediction
RF_por_prob <- predict(model_RF_por, por_test, type = "prob") # for the probability
# result
RF_table_por <- select(por_test, Talc) %>%
bind_cols(Talc_pred = RF_por_pred) %>%
bind_cols(Talc_VLprob = round(RF_por_prob[,1],4)) %>%
bind_cols(Talc_Lprob = round(RF_por_prob[,2],4)) %>%
bind_cols(Talc_Mprob = round(RF_por_prob[,3],4)) %>%
bind_cols(Talc_Hprob = round(RF_por_prob[,4],4)) %>%
bind_cols(Talc_VHprob = round(RF_por_prob[,5],4))
Model Evaluation
# performance evaluation - confusion matrix
RF_table_por %>%
conf_mat(Talc, Talc_pred) %>%
autoplot(type = "heatmap")Confusion matrix
# # Binary Model
# confusionMatrix(data = RF_por_pred, reference = por_test$Talc, positive = "High")
# Multiclass Model
confusionMatrix(data = RF_por_pred, reference = por_test$Talc)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Very Low Low Medium High Very High
#> Very Low 27 37 11 4 2
#> Low 11 5 0 1 1
#> Medium 6 6 3 2 1
#> High 0 1 4 0 0
#> Very High 2 0 1 3 0
#>
#> Overall Statistics
#>
#> Accuracy : 0.2734
#> 95% CI : (0.1984, 0.3592)
#> No Information Rate : 0.3828
#> P-Value [Acc > NIR] : 0.9964613
#>
#> Kappa : -0.0479
#>
#> Mcnemar's Test P-Value : 0.0007883
#>
#> Statistics by Class:
#>
#> Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity 0.5870 0.10204 0.15789 0.00000
#> Specificity 0.3415 0.83544 0.86239 0.95763
#> Pos Pred Value 0.3333 0.27778 0.16667 0.00000
#> Neg Pred Value 0.5957 0.60000 0.85455 0.91870
#> Prevalence 0.3594 0.38281 0.14844 0.07812
#> Detection Rate 0.2109 0.03906 0.02344 0.00000
#> Detection Prevalence 0.6328 0.14062 0.14062 0.03906
#> Balanced Accuracy 0.4642 0.46874 0.51014 0.47881
#> Class: Very High
#> Sensitivity 0.00000
#> Specificity 0.95161
#> Pos Pred Value 0.00000
#> Neg Pred Value 0.96721
#> Prevalence 0.03125
#> Detection Rate 0.00000
#> Detection Prevalence 0.04688
#> Balanced Accuracy 0.47581
ROC (Receiver Operating Characteristic)
# # Binary Model
# options(scipen = 999)
# head(RF_por_prob)
# RF_roc <- data.frame(prediction = RF_por_prob[,2],
# actual = as.numeric(RF_table_por$Talc == "High"))
RF_roc <- data.frame(prediction1 = RF_por_prob[,1],
actual1 = as.numeric(RF_table_por$Talc == "Very Low"), # note: no leading space in the level name
prediction2 = RF_por_prob[,2],
actual2 = as.numeric(RF_table_por$Talc == "Low"),
prediction3 = RF_por_prob[,3],
actual3 = as.numeric(RF_table_por$Talc == "Medium"),
prediction4 = RF_por_prob[,4],
actual4 = as.numeric(RF_table_por$Talc == "High"),
prediction5 = RF_por_prob[,5],
actual5 = as.numeric(RF_table_por$Talc == "Very High"))
# # Binary Model
# head(RF_roc)
# pred_roc <- prediction(RF_roc$prediction, RF_roc$actual)
# perf <- performance(prediction.obj = pred_roc, measure = "tpr",x.measure = "fpr")
# plot(perf)
# abline(a = 0, b = 1)
# pred_roc1 <- prediction(RF_roc$prediction1, RF_roc$actual1)
pred_roc2 <- prediction(RF_roc$prediction2, RF_roc$actual2)
pred_roc3 <- prediction(RF_roc$prediction3, RF_roc$actual3)
pred_roc4 <- prediction(RF_roc$prediction4, RF_roc$actual4)
pred_roc5 <- prediction(RF_roc$prediction5, RF_roc$actual5)
# head(pred_roc1)
# perf1 <- performance(prediction.obj = pred_roc1, measure = "tpr",x.measure = "fpr")
perf2 <- performance(prediction.obj = pred_roc2, measure = "tpr",x.measure = "fpr")
perf3 <- performance(prediction.obj = pred_roc3, measure = "tpr",x.measure = "fpr")
perf4 <- performance(prediction.obj = pred_roc4, measure = "tpr",x.measure = "fpr")
perf5 <- performance(prediction.obj = pred_roc5, measure = "tpr",x.measure = "fpr")
# plot(perf1)
# abline(a = 0, b = 1)
plot(perf2)
abline(a = 0, b = 1)
AUC (Area Under The Curve)
# # Binary Model
# AUC_RF_por_binary <- performance(pred_roc, measure = "auc")
# AUC_RF_por_binary@y.values[[1]]
# Multiclass Model
y_pred <- as.ordered(RF_por_pred) # use the Random Forest predictions here
AUC_RF_por_multiclass <- multiclass.roc(por_test$Talc, y_pred)
print(AUC_RF_por_multiclass)
#>
#> Call:
#> multiclass.roc.default(response = por_test$Talc, predictor = y_pred)
#>
#> Data: y_pred with 5 levels of por_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.8271
In this study case I built separate models for math and por, so we can observe each model's results individually or average the performance of the two.
Before proceeding to the cross-validation process, I tried two options for the formula that creates the Talc column. The first uses the plain average of Dalc and Walc; the second weights Dalc by 5/7 and Walc by 2/7, based on the number of workdays versus weekend days. After experimenting with both formulas, I prefer the second option, because the day-based weighting is more realistic and spreads the Talc values more informatively.
In addition, I ran a number of experiments on the Decision Tree model, combining several mincriterion values with the two model-building functions, ctree() and rpart().
The following table shows the mincriterion combinations I used on the DT math (multiclass target) model. We can analyze the model's ability (Sensitivity) to predict each target class.
"mcdefault" <- c(0.1034, 0.19231, 0.40000, 0.60000, 0)
"mc90%" <- c(0.3448, 0.19231, 0.20000, 0.60000, 0)
"mc95%" <- c(0.3448, 0.19231, 0.13333, 0.60000, 0)
"mc99%" <- c(0.3448, 0.19231, 0.20000, 0.60000, 0)
mincriterion <- rbind(mcdefault, `mc90%`, `mc95%`, `mc99%`)
mincriterion <- as.data.frame(mincriterion)
names(mincriterion) <- levels(sac$Talc)
mincriterion
Based on these experiments, we can conclude that none of the four models predicts the Very High target level well. Because the exploratory analysis showed that the alcohol consumption level strongly affects student performance, I want to prioritize Sensitivity for the High and Very High classes, so we can use the model with the default mincriterion.
The following table combines the mincriterion values and model functions I used on the DT math (multiclass target) model, so we can compare the evaluation metrics.
metrics <- c("Accuracy", "Sensitivity", "Specificity", "Precision", "AUC")
"rpart90%" <- c(0.2337662, 0.2172414, 0.8021667, 0.2083333, 0.9613333)
"rpart99%" <- c(0.2857143, 0.3068612, 0.8080253, 0.2387255, 0.9613333)
"ctree_default" <- c(0.2207792, 0.2591512, 0.7988192, 0.2026374, 0.9613333)
"ctree90%" <- c(0.2857143, 0.2743236, 0.8081593, 0.2336538, 0.9613333)
"ctree95%" <- c(0.2727273, 0.2674271, 0.8053815, 0.2258598, 0.9613333)
"ctree99%" <- c(0.2597403, 0.2540937, 0.8026037, 0.2121693, 0.9613333)
dtree <- as.data.frame(rbind(`rpart90%`, `rpart99%`, ctree_default, `ctree90%`, `ctree95%`, `ctree99%`))
names(dtree) <- metrics
dtree
Based on these experiments, we can conclude that ctree() with the default mincriterion mentioned above builds the better Decision Tree model.
The primary differences between conditional inference trees (ctree, from the party package in R) and more traditional decision tree algorithms (such as rpart in R) are:
Both rpart and ctree recursively perform univariate splits of the dependent variable based on values of a set of covariates. ctree comes with a number of possible transformations for both the DV and the covariates (see the help for Transformations in the party package).
So generally the main difference seems to be that ctree uses a covariate selection scheme that is based on statistical theory (i.e. selection by permutation-based significance tests) and thereby avoids a potential bias in rpart, otherwise they seem similar; e.g. conditional inference trees can be used as base learners for Random Forests.
For each of the three models (Naive Bayes, Decision Tree and Random Forest), in this study case I also compared the performance with a binary target variable and with a multiclass target variable. You can reproduce the binary model by uncommenting the marked lines in the chunks above.
a <- NB_table_math %>%
summarise(
Accuracy = accuracy_vec(Talc, Talc_pred),
Sensitivity = sens_vec(Talc, Talc_pred),
Specificity = spec_vec(Talc, Talc_pred),
Precision = precision_vec(Talc, Talc_pred)
)
# a$AUC <- AUC_NB_math_binary@y.values[[1]]
a$AUC <- AUC_NB_math_multiclass$auc[[1]]
a$Model <- "Naive Bayes Math"
a <- a[, c(6,1:5)]
b <- NB_table_por %>%
summarise(
Accuracy = accuracy_vec(Talc, Talc_pred),
Sensitivity = sens_vec(Talc, Talc_pred),
Specificity = spec_vec(Talc, Talc_pred),
Precision = precision_vec(Talc, Talc_pred)
)
# b$AUC <- AUC_NB_por_binary@y.values[[1]]
b$AUC <- AUC_NB_por_multiclass$auc[[1]]
b$Model <- "Naive Bayes Portugese"
b <- b[, c(6,1:5)]
g <- (a[,-1] + b[,-1])/2
g$Model <- "Naive Bayes Mean"
g <- g[, c(6,1:5)]
c <- DT_table_math %>%
summarise(
Accuracy = accuracy_vec(Talc, Talc_pred),
Sensitivity = sens_vec(Talc, Talc_pred),
Specificity = spec_vec(Talc, Talc_pred),
Precision = precision_vec(Talc, Talc_pred)
)
# c$AUC <- AUC_DT_math_binary@y.values[[1]]
c$AUC <- AUC_DT_math_multiclass$auc[[1]]
c$Model <- "Decision Tree Math"
c <- c[, c(6,1:5)]
d <- DT_table_por %>%
summarise(
Accuracy = accuracy_vec(Talc, Talc_pred),
Sensitivity = sens_vec(Talc, Talc_pred),
Specificity = spec_vec(Talc, Talc_pred),
Precision = precision_vec(Talc, Talc_pred)
)
# d$AUC <- AUC_DT_por_binary@y.values[[1]]
d$AUC <- AUC_DT_por_multiclass$auc[[1]]
d$Model <- "Decision Tree Portugese"
d <- d[, c(6,1:5)]
h <- (c[,-1] + d[,-1])/2
h$Model <- "Decision Tree Mean"
h <- h[, c(6,1:5)]
e <- RF_table_math %>%
summarise(
Accuracy = accuracy_vec(Talc, Talc_pred),
Sensitivity = sens_vec(Talc, Talc_pred),
Specificity = spec_vec(Talc, Talc_pred),
Precision = precision_vec(Talc, Talc_pred)
)
# e$AUC <- AUC_RF_math_binary@y.values[[1]]
e$AUC <- AUC_RF_math_multiclass$auc[[1]]
e$Model <- "Random Forest Math"
e <- e[, c(6,1:5)]
f <- RF_table_por %>%
summarise(
Accuracy = accuracy_vec(Talc, Talc_pred),
Sensitivity = sens_vec(Talc, Talc_pred),
Specificity = spec_vec(Talc, Talc_pred),
Precision = precision_vec(Talc, Talc_pred)
)
# f$AUC <- AUC_RF_por_binary@y.values[[1]]
f$AUC <- AUC_RF_por_multiclass$auc[[1]]
f$Model <- "Random Forest Portugese"
f <- f[, c(6,1:5)]
i <- (e[,-1] + f[,-1])/2
i$Model <- "Random Forest Mean"
i <- i[, c(6,1:5)]
Multiclass <- rbind(a, b, c, d, e, f, g, h, i)
Multiclass
# Binary <- rbind(a, b, c, d, e, f, g, h, i)
# Binary
# saveRDS(Multiclass, "table/multiclass.RDS")
# saveRDS(Binary, "table/binary.RDS")The following are the results of the performance of each model to the binary class target
The following are the performance results of each model for the multiclass target.
As the final conclusion, we should use the Naive Bayes model for this case. This may be because the variables used to predict student alcohol consumption levels are relatively independent of one another. Note also that when we change the target class from binary to multiclass, the performance of our tree-based models decreases drastically.