We will be using the Programme for International Student Assessment (PISA) 2015 data set for students in Turkey. The data come from a large-scale international assessment of students in various knowledge domains.
setwd("D:/Class Materials & Work/Summer 2020 practice/Educational Data Mining")
# Packages to be used
library("data.table")
library("tidyverse")
library("tidymodels")
library("e1071")
library("caret")
library("mlr")
library("modelr")
library("randomForest")
library("rpart")
library("rpart.plot")
library("GGally")
library("ggExtra")
After importing the data, it is recommended to check the properties of the data for good measure. fread is preferable when reading a big data set because it is faster and lets us control the number of threads used. The function also allows us to read missing values as NA with na.strings = "NA".
Another way we can deal with missing data is to identify incomplete cases with complete.cases() and remove them with na.omit().
# fread function from data.table (reading as f-read)
pisa <- fread("pisa_turkey.csv", na.strings = "NA")
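For instance, if we wanted to cap the number of threads fread uses, the nThread argument controls this (a sketch, not run here):
# Not run: read the same file using at most two threads
# pisa <- fread("pisa_turkey.csv", na.strings = "NA", nThread = 2)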
# Check the data class
class(pisa)
## [1] "data.table" "data.frame"
# See variable names
names(pisa)
## [1] "CNTSTUID" "gender" "female" "grade" "computer"
## [6] "software" "internet" "desk" "own.room" "quiet.study"
## [11] "lit" "poetry" "art" "book.sch" "tech.book"
## [16] "dict" "art.book" "ST011Q05TA" "ST071Q02NA" "ST071Q01NA"
## [21] "ST123Q02NA" "ST082Q01NA" "ST119Q01NA" "ST119Q05NA" "ANXTEST"
## [26] "COOPERATE" "BELONG" "EMOSUPS" "WEALTH" "PARED"
## [31] "TMINS" "ESCS" "TEACHSUP" "TDTEACH" "IBTEACH"
## [36] "SCIEEFF" "math" "reading" "science"
# Preview the data
head(pisa)
## CNTSTUID gender female grade computer software internet desk own.room
## 1: 79200939 Female 1 Grade 9 0 0 0 1 1
## 2: 79206774 Female 1 Grade 9 1 1 1 1 1
## 3: 79204670 Female 1 Grade 9 0 0 1 1 0
## 4: 79201647 Female 1 Grade 9 1 1 0 1 0
## 5: 79203718 Female 1 Grade 9 0 0 0 0 0
## 6: 79204968 Female 1 Grade 9 0 0 0 1 0
## quiet.study lit poetry art book.sch tech.book dict art.book ST011Q05TA
## 1: 1 1 1 0 1 0 1 0 No
## 2: 0 1 0 1 1 0 1 0 Yes
## 3: 0 1 0 0 0 0 1 0 No
## 4: 0 1 1 0 0 0 1 0 Yes
## 5: 0 0 0 0 0 0 1 0 No
## 6: 0 0 1 0 0 0 1 0 No
## ST071Q02NA ST071Q01NA ST123Q02NA ST082Q01NA ST119Q01NA
## 1: 11 7 Strongly agree Agree Strongly agree
## 2: NA 5 Strongly agree Agree Strongly agree
## 3: 6 4 Agree Agree Strongly agree
## 4: 6 3 Agree Disagree Strongly agree
## 5: 6 6 Strongly agree Disagree Agree
## 6: 12 6 Disagree Strongly disagree Strongly agree
## ST119Q05NA ANXTEST COOPERATE BELONG EMOSUPS WEALTH PARED TMINS ESCS
## 1: Strongly agree 1.4394 1.3264 -0.8482 0.5658 -1.7154 12 1560 -1.7902
## 2: Agree 0.6654 1.1640 -0.7657 1.0991 -1.2426 8 900 -1.8259
## 3: Strongly agree 1.2704 0.5759 -0.5064 -1.3298 -1.2256 12 1600 -1.2038
## 4: Strongly agree 2.5493 0.9675 0.3142 -0.3306 -1.8473 16 2280 -0.9068
## 5: Strongly agree 1.1311 -0.2882 -0.6925 -0.2495 -4.7851 4 1600 -4.1131
## 6: Strongly agree 1.1311 -1.0629 -0.9932 -1.8280 -2.8012 14 1495 -1.0631
## TEACHSUP TDTEACH IBTEACH SCIEEFF math reading science
## 1: 0.7900 0.9505 0.8519 1.8526 361.9234 408.0202 394.2937
## 2: 1.4475 0.6580 -1.0496 -0.2154 325.4144 375.4197 350.5305
## 3: -0.1164 0.4505 0.5453 0.5561 365.5649 381.6692 396.2027
## 4: -0.4527 -0.0057 1.4812 0.5037 333.6344 345.9447 336.3870
## 5: 0.2725 0.0615 0.9054 0.1091 381.7032 381.6998 403.6358
## 6: -1.0080 -0.8780 0.0441 -1.4295 352.0996 350.1369 393.2887
# Check data dimensions
dim(pisa)
## [1] 5895 39
# Check data structure
str(pisa)
## Classes 'data.table' and 'data.frame': 5895 obs. of 39 variables:
## $ CNTSTUID : int 79200939 79206774 79204670 79201647 79203718 79204968 79200200 79200563 79200129 79205586 ...
## $ gender : chr "Female" "Female" "Female" "Female" ...
## $ female : int 1 1 1 1 1 1 1 1 1 1 ...
## $ grade : chr "Grade 9" "Grade 9" "Grade 9" "Grade 9" ...
## $ computer : int 0 1 0 1 0 0 0 1 0 1 ...
## $ software : int 0 1 0 1 0 0 1 1 0 NA ...
## $ internet : int 0 1 1 0 0 0 0 0 0 1 ...
## $ desk : int 1 1 1 1 0 1 0 0 0 1 ...
## $ own.room : int 1 1 0 0 0 0 0 0 0 1 ...
## $ quiet.study: int 1 0 0 0 0 0 1 0 1 1 ...
## $ lit : int 1 1 1 1 0 0 0 0 0 1 ...
## $ poetry : int 1 0 0 1 0 1 0 0 0 NA ...
## $ art : int 0 1 0 0 0 0 0 0 0 NA ...
## $ book.sch : int 1 1 0 0 0 0 0 1 0 1 ...
## $ tech.book : int 0 0 0 0 0 0 0 0 NA NA ...
## $ dict : int 1 1 1 1 1 1 1 1 1 1 ...
## $ art.book : int 0 0 0 0 0 0 0 0 0 1 ...
## $ ST011Q05TA : chr "No" "Yes" "No" "Yes" ...
## $ ST071Q02NA : int 11 NA 6 6 6 12 6 6 NA 3 ...
## $ ST071Q01NA : int 7 5 4 3 6 6 7 4 NA 10 ...
## $ ST123Q02NA : chr "Strongly agree" "Strongly agree" "Agree" "Agree" ...
## $ ST082Q01NA : chr "Agree" "Agree" "Agree" "Disagree" ...
## $ ST119Q01NA : chr "Strongly agree" "Strongly agree" "Strongly agree" "Strongly agree" ...
## $ ST119Q05NA : chr "Strongly agree" "Agree" "Strongly agree" "Strongly agree" ...
## $ ANXTEST : num 1.439 0.665 1.27 2.549 1.131 ...
## $ COOPERATE : num 1.326 1.164 0.576 0.968 -0.288 ...
## $ BELONG : num -0.848 -0.766 -0.506 0.314 -0.692 ...
## $ EMOSUPS : num 0.566 1.099 -1.33 -0.331 -0.249 ...
## $ WEALTH : num -1.72 -1.24 -1.23 -1.85 -4.79 ...
## $ PARED : int 12 8 12 16 4 14 8 4 4 8 ...
## $ TMINS : int 1560 900 1600 2280 1600 1495 1600 1600 NA 1600 ...
## $ ESCS : num -1.79 -1.826 -1.204 -0.907 -4.113 ...
## $ TEACHSUP : num 0.79 1.448 -0.116 -0.453 0.273 ...
## $ TDTEACH : num 0.9505 0.658 0.4505 -0.0057 0.0615 ...
## $ IBTEACH : num 0.852 -1.05 0.545 1.481 0.905 ...
## $ SCIEEFF : num 1.853 -0.215 0.556 0.504 0.109 ...
## $ math : num 362 325 366 334 382 ...
## $ reading : num 408 375 382 346 382 ...
## $ science : num 394 351 396 336 404 ...
## - attr(*, ".internal.selfref")=<externalptr>
# Counting unique, missing, median, and mean values
pisa %>% summarise(n = n_distinct(TMINS),
na = sum(is.na(TMINS)),
med = median(TMINS, na.rm = TRUE),
mean = mean(TMINS, na.rm = TRUE))
## n na med mean
## 1 192 382 1600 1558.749
For missing values, we can either remove them as mentioned above, replace them with the mean or median (imputation), or perform listwise or pairwise deletion. See below for the code for mean imputation of the TMINS variable, although we will not run it in this practice.
# pisa <- pisa %>%
#   mutate(TMINS = replace(TMINS,
#                          is.na(TMINS),
#                          mean(TMINS, na.rm = TRUE)))
As for the other methods, listwise deletion removes a case entirely if it has one or more missing values on the variables of interest, while pairwise deletion, which is the default in many analyses, drops a case only from calculations that involve its missing variables; other analyses can still use the same case through its non-missing variables.
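As a quick sketch of the listwise-deletion option (we will actually apply it later, right before splitting the data):
# Number of rows with no missing values on any variable
sum(complete.cases(pisa))
# Listwise deletion across all variables (not run yet)
# pisa_complete <- na.omit(pisa)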
### Data Wrangling and Visualization
Next, we will define new variables from the existing ones and visualize them.
variable.names(pisa)
## [1] "CNTSTUID" "gender" "female" "grade" "computer"
## [6] "software" "internet" "desk" "own.room" "quiet.study"
## [11] "lit" "poetry" "art" "book.sch" "tech.book"
## [16] "dict" "art.book" "ST011Q05TA" "ST071Q02NA" "ST071Q01NA"
## [21] "ST123Q02NA" "ST082Q01NA" "ST119Q01NA" "ST119Q05NA" "ANXTEST"
## [26] "COOPERATE" "BELONG" "EMOSUPS" "WEALTH" "PARED"
## [31] "TMINS" "ESCS" "TEACHSUP" "TDTEACH" "IBTEACH"
## [36] "SCIEEFF" "math" "reading" "science"
# Define new variables
pisa <- mutate(pisa,
# Reorder the levels of grade
grade = factor(grade,
levels = c("Grade 7", "Grade 8", "Grade 9", "Grade 10",
"Grade 11", "Grade 12", "Grade 13",
"Ungraded")),
# Define a numerical grade variable
grade1 = (as.numeric(sapply(grade, function(x) {
if(x=="Grade 7") "7"
else if (x=="Grade 8") "8"
else if (x=="Grade 9") "9"
else if (x=="Grade 10") "10"
else if (x=="Grade 11") "11"
else if (x=="Grade 12") "12"
else if (x=="Grade 13") NA_character_
else if (x=="Ungraded") NA_character_}))),
# Total learning time as hours
learning = round(TMINS/60, 0),
# Science performance based on OECD average (predetermined value)
science_oecd = as.factor(ifelse(science >= 493, "High", "Low")),
# Science performance based on Turkey's average
science_tr = as.factor(ifelse(science >= 422, "High", "Low")))
We can summarize the target variable (science) by categorical independent variables and see if there is any relationship.
# By grade - summary
pisa %>%
group_by(grade) %>%
summarise(Count = n(),
science = mean(science, na.rm = TRUE),
computer = mean(computer, na.rm = TRUE),
software = mean(software, na.rm = TRUE),
internet = mean(internet, na.rm = TRUE),
own.room = mean(own.room, na.rm = TRUE)
)
## # A tibble: 6 x 7
## grade Count science computer software internet own.room
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Grade 7 16 328. 0.0625 0.312 0.0625 0.438
## 2 Grade 8 105 338. 0.303 0.385 0.26 0.324
## 3 Grade 9 1273 386. 0.609 0.427 0.571 0.657
## 4 Grade 10 4308 435. 0.710 0.422 0.666 0.740
## 5 Grade 11 186 418. 0.514 0.324 0.455 0.626
## 6 Grade 12 7 404. 0.571 0.429 0.286 0.714
# By grade and gender - summary
pisa %>%
group_by(grade, gender) %>%
summarise(Count = n(),
science = mean(science, na.rm = TRUE),
computer = mean(computer, na.rm = TRUE),
software = mean(software, na.rm = TRUE),
internet = mean(internet, na.rm = TRUE),
own.room = mean(own.room, na.rm = TRUE)
)
## # A tibble: 12 x 8
## # Groups: grade [6]
## grade gender Count science computer software internet own.room
## <fct> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Grade 7 Female 6 339. 0 0.333 0 0.333
## 2 Grade 7 Male 10 322. 0.1 0.3 0.1 0.5
## 3 Grade 8 Female 42 338. 0.275 0.436 0.220 0.268
## 4 Grade 8 Male 63 338. 0.322 0.351 0.288 0.361
## 5 Grade 9 Female 494 388. 0.602 0.435 0.551 0.651
## 6 Grade 9 Male 779 385. 0.613 0.422 0.584 0.660
## 7 Grade 10 Female 2272 434. 0.710 0.459 0.671 0.750
## 8 Grade 10 Male 2036 437. 0.711 0.382 0.660 0.729
## 9 Grade 11 Female 120 421. 0.504 0.342 0.439 0.633
## 10 Grade 11 Male 66 412. 0.532 0.288 0.484 0.613
## 11 Grade 12 Female 4 408. 0.75 0.5 0.5 1
## 12 Grade 12 Male 3 397. 0.333 0.333 0 0.333
Now we can visualize some of the variables in the data. Let’s view science performance by grade.
ggplot(data = pisa,
mapping = aes(x = grade, y = science)) +
geom_boxplot() +
labs(x=NULL, y="Science Scores") +
theme_bw()
Let’s add horizontal lines representing the OECD (red) and Turkey (blue) averages with geom_hline().
ggplot(data = pisa,
mapping = aes(x = grade, y = science)) +
geom_boxplot() +
labs(x=NULL, y="Science Scores") +
geom_hline(yintercept = 493, linetype="dashed", color = "red", size = 1) +
geom_hline(yintercept = 422, linetype="dashed", color = "blue", size = 1) +
theme_bw()
Add gender as a color layer using fill = gender, keeping the red line for the OECD average.
ggplot(data = pisa,
mapping = aes(x = grade, y = science, fill = gender)) +
geom_boxplot() +
labs(x=NULL, y="Science Scores") +
geom_hline(yintercept = 493, linetype="dashed", color = "red", size = 1) +
theme_bw()
We can also examine the science, reading, and math scores, and their pairwise relationships, by gender.
ggpairs(data = pisa,
mapping = aes(color = gender),
columns = c("reading", "science", "math"),
upper = list(continuous = wrap("cor", size = 4.5)))
Let's also examine the conditional relationship between science scores and weekly learning time.
p1 <- ggplot(data = pisa,
mapping = aes(x = learning, y = science)) +
geom_point() +
geom_smooth(method = "loess") +
labs(x = "Weekly Learning Time", y = "Science Scores") +
theme_bw()
# Replace "histogram" with "boxplot" or "density" for other types
ggMarginal(p1, type = "histogram")
Most importantly, correlations among the variables should be checked:
ggcorr(data = pisa[,c("science", "reading", "computer", "own.room", "quiet.study",
"ESCS", "SCIEEFF", "BELONG", "ANXTEST", "COOPERATE", "learning",
"EMOSUPS", "grade1")],
method = c("pairwise.complete.obs", "pearson"),
label = TRUE, label_size = 4,
low = "red3", high = "green3",
name = "Correlation Scale",
)
We will split our data set into training (70%) and testing (30%) sets. Before splitting, we use listwise deletion to remove cases with missing data.
pisa_nm <- pisa %>%
#Remove missing cases
na.omit() %>%
#Coding qualitative variables into factors
mutate_if(is.character, as.factor)
Now we calculate the proportion of students in each science performance group based on Turkey's average:
pisa_nm %>%
count(science_tr) %>%
mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## science_tr n prop
## <fct> <int> <dbl>
## 1 High 2254 0.533
## 2 Low 1976 0.467
The discrepancy between the proportions of the low and high groups is not extreme; therefore, there is no need to use the strata argument.
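If the two groups were heavily imbalanced, we could stratify the split instead; a sketch using the strata argument of initial_split():
# Not run: stratified split that preserves the High/Low proportions
# pisa_split <- initial_split(pisa_nm, prop = 0.7, strata = science_tr)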
set.seed(442019)
# Put 70% of the data into the training set
pisa_split <- initial_split(pisa_nm, prop = 0.7)
# Create data frames for the two sets:
train <- training(pisa_split) #70%
test <- testing(pisa_split) #30%
We will be using both the rpart and tidymodels approaches to build a decision tree for comparison. The rpart() function has several key arguments:
- formula: defines the outcome (DV; here, science_tr) and the predictors (IVs), separated by ~.
- method: method = "class" fits a classification tree, while method = "anova" fits a regression tree.
- control: set via rpart.control(); minsplit defines the minimum number of observations in a node (default = 20), cp is the complexity parameter used to prune the tree (default = 0.01, with 0 = no pruning), and xval is the number of cross-validations (default = 10, with 0 = no cross-validation).
- parms: the splitting index. It is not used for an anova tree; for a class tree, it can be either "gini" for the Gini index or "information" for the entropy index, the latter being slightly slower computationally.
First, we will build a decision tree dt_fit1 with cp = 0, xval = 0, and Gini splitting.
dt_fit1 <- rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE,
data = train,
method = "class",
control = rpart.control(minsplit = 20,
cp = 0,
xval = 0),
parms = list(split = "gini"))
rpart.plot(dt_fit1)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
As we can see, the plot is very complex with no pruning; that is, we overfit the model with excessive nodes. We will use cp = 0.006 to prune the model, generating a smaller tree with fewer predictors as a result.
dt_fit2 <- rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE,
data = train,
method = "class",
control = rpart.control(minsplit = 20,
cp = 0.006,
xval = 0),
parms = list(split = "gini"))
rpart.plot(dt_fit2)
The model becomes simpler, with each node showing the predicted class, the predicted probability of the second class, and the percentage of observations in the node.
We can make the plot more distinct by adding colors, adjusting the displayed values, and adding node shadows.
rpart.plot(dt_fit2, extra = 8, box.palette = "RdBu", shadow.col = "gray")
We can print the output of our model with printcp():
printcp(dt_fit2)
##
## Classification tree:
## rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
## EMOSUPS + COOPERATE, data = train, method = "class", parms = list(split = "gini"),
## control = rpart.control(minsplit = 20, cp = 0.006, xval = 0))
##
## Variables actually used in tree construction:
## [1] computer EMOSUPS ESCS grade1
##
## Root node error: 1415/2961 = 0.47788
##
## n= 2961
##
## CP nsplit rel error
## 1 0.2091873 0 1.00000
## 2 0.0572438 1 0.79081
## 3 0.0070671 2 0.73357
## 4 0.0067138 4 0.71943
## 5 0.0060000 6 0.70601
For more detail, we can use summary(dt_fit2). caret::varImp() also tells us which variables are more influential in the analysis; the higher the value, the more influential the variable:
varImp(dt_fit2)
## Overall
## computer 134.36152
## COOPERATE 77.78897
## EMOSUPS 77.68769
## ESCS 168.50372
## grade1 108.21253
## own.room 14.22999
Next, we need to apply our tree to the test data set. Below, we obtain the predicted class probabilities (high or low) for the test data with predict() and turn the results into a data frame called dt_pred.
dt_pred <- predict(dt_fit2, test) %>%
as.data.frame()
head(dt_pred)
## High Low
## 1 0.1940299 0.8059701
## 7 0.1940299 0.8059701
## 8 0.4394619 0.5605381
## 12 0.4394619 0.5605381
## 15 0.4394619 0.5605381
## 22 0.4394619 0.5605381
The result above shows each student's probability of falling into either group. We will turn these probabilities into binary classifications (depending on whether the probability of High is >= 0.5) and compare them with the actual classes in the test data to create a confusion matrix.
dt_pred <- mutate(dt_pred, science_tr = as.factor(ifelse(High >= 0.5, "High", "Low"))) %>%
select(science_tr)
confusionMatrix(dt_pred$science_tr, test$science_tr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 506 239
## Low 202 322
##
## Accuracy : 0.6525
## 95% CI : (0.6256, 0.6787)
## No Information Rate : 0.5579
## P-Value [Acc > NIR] : 4.388e-12
##
## Kappa : 0.2907
##
## Mcnemar's Test P-Value : 0.08648
##
## Sensitivity : 0.7147
## Specificity : 0.5740
## Pos Pred Value : 0.6792
## Neg Pred Value : 0.6145
## Prevalence : 0.5579
## Detection Rate : 0.3987
## Detection Prevalence : 0.5871
## Balanced Accuracy : 0.6443
##
## 'Positive' Class : High
##
The output shows that the overall accuracy is 65%, sensitivity is 71%, and specificity is 57%. Adding more variables that are correlated with the outcome may increase the accuracy and sensitivity of the model.
In our previous model, we did not run any cross-validation (xval = 0). K-fold cross-validation is useful when we do not have a test or validation data set, or when our data set is too small to split into training and test sets.
A typical way to use cross-validation with decision trees is to grow the full tree (cp = 0) and let cross-validation suggest how much to prune. We will assume that our data set is not too big and run a 10-fold cross-validation.
dt_fit3 <- rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE,
data = train,
method = "class",
control = rpart.control(minsplit = 20,
cp = 0,
xval = 10),
parms = list(split = "gini"))
We get a large tree again. In the results, we can evaluate the cross-validated error (xerror, shown as the X-val Relative Error in plotcp()) and choose the complexity parameter that gives an acceptable value. Then, we can use this cp value to prune the tree. We can use plotcp() to visualize the cross-validation results.
printcp(dt_fit3)
##
## Classification tree:
## rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
## EMOSUPS + COOPERATE, data = train, method = "class", parms = list(split = "gini"),
## control = rpart.control(minsplit = 20, cp = 0, xval = 10))
##
## Variables actually used in tree construction:
## [1] computer COOPERATE EMOSUPS ESCS grade1 own.room
##
## Root node error: 1415/2961 = 0.47788
##
## n= 2961
##
## CP nsplit rel error xerror xstd
## 1 0.20918728 0 1.00000 1.00000 0.019209
## 2 0.05724382 1 0.79081 0.79081 0.018646
## 3 0.00706714 2 0.73357 0.73357 0.018349
## 4 0.00671378 4 0.71943 0.73640 0.018365
## 5 0.00530035 6 0.70601 0.73428 0.018353
## 6 0.00459364 8 0.69541 0.73710 0.018369
## 7 0.00424028 18 0.64523 0.73569 0.018361
## 8 0.00353357 19 0.64099 0.73993 0.018385
## 9 0.00282686 25 0.61979 0.73922 0.018381
## 10 0.00247350 31 0.60283 0.71943 0.018266
## 11 0.00212014 33 0.59788 0.73004 0.018329
## 12 0.00176678 37 0.58940 0.73074 0.018333
## 13 0.00164900 39 0.58587 0.73852 0.018377
## 14 0.00159011 42 0.58092 0.73852 0.018377
## 15 0.00141343 47 0.57244 0.73922 0.018381
## 16 0.00113074 60 0.55406 0.74558 0.018417
## 17 0.00106007 65 0.54841 0.75548 0.018470
## 18 0.00098940 78 0.53145 0.76254 0.018507
## 19 0.00070671 85 0.52367 0.77173 0.018554
## 20 0.00053004 97 0.51449 0.78445 0.018616
## 21 0.00047114 106 0.50671 0.79293 0.018656
## 22 0.00035336 110 0.50459 0.80141 0.018694
## 23 0.00030288 120 0.50106 0.81343 0.018746
## 24 0.00023557 127 0.49894 0.81343 0.018746
## 25 0.00017668 130 0.49823 0.81908 0.018769
## 26 0.00000000 142 0.49611 0.81837 0.018766
plotcp(dt_fit3)
Based on the results above, we can specify a cp value near the lowest cross-validated error (here, cp = 0.0034) and re-run the model without cross-validation.
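Equivalently, we could pick the cp value programmatically from the cp table and prune the existing tree instead of refitting; a sketch:
# cp value with the lowest cross-validated error
best_cp <- dt_fit3$cptable[which.min(dt_fit3$cptable[, "xerror"]), "CP"]
# Prune the full tree at that cp (same idea as refitting with this cp)
dt_pruned <- prune(dt_fit3, cp = best_cp)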
dt_fit4 <- rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE,
data = train,
method = "class",
control = rpart.control(minsplit = 20,
cp = 0.0034,
xval = 0),
parms = list(split = "gini"))
For the final fit, we obtain the model predictions for the test set as a data frame and compare the predicted results with the actual results using a confusion matrix.
dt_pred_final <- predict(dt_fit4, test) %>%
as.data.frame()
head(dt_pred_final)
## High Low
## 1 0.1940299 0.8059701
## 7 0.1940299 0.8059701
## 8 0.2478632 0.7521368
## 12 0.2478632 0.7521368
## 15 0.2478632 0.7521368
## 22 0.3989637 0.6010363
dt_pred_final <- mutate(dt_pred_final, science_tr = as.factor(ifelse(High >= 0.5, "High", "Low"))) %>%
select(science_tr)
confusionMatrix(dt_pred_final$science_tr, test$science_tr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 502 224
## Low 206 337
##
## Accuracy : 0.6612
## 95% CI : (0.6344, 0.6872)
## No Information Rate : 0.5579
## P-Value [Acc > NIR] : 4.21e-14
##
## Kappa : 0.3108
##
## Mcnemar's Test P-Value : 0.4123
##
## Sensitivity : 0.7090
## Specificity : 0.6007
## Pos Pred Value : 0.6915
## Neg Pred Value : 0.6206
## Prevalence : 0.5579
## Detection Rate : 0.3956
## Detection Prevalence : 0.5721
## Balanced Accuracy : 0.6549
##
## 'Positive' Class : High
##
The final model yields an overall accuracy of 66%, sensitivity of 71%, and specificity of 60%, a modest improvement over the dt_fit2 model achieved by adjusting the complexity parameter.
### tidymodels
We will also try creating the decision tree model with the tidymodels framework for future reference. First, we will indicate the hyperparameters to tune, namely the cost complexity (cost_complexity) and the minimum node size (min_n).
tune_dt <-
decision_tree(
cost_complexity = tune(),
min_n = tune()
) %>%
set_engine("rpart") %>%
set_mode("classification")
tune_dt
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = tune()
## min_n = tune()
##
## Computational engine: rpart
To find the optimal values for our hyperparameters, we can train multiple models using resampled data. To get there, we create a grid of values for both hyperparameters with the dials package.
tree_grid <- grid_regular(cost_complexity(), #identifying the hyperparameters
min_n(),
levels = 5) #five possible values for each parameter
tree_grid
## # A tibble: 25 x 2
## cost_complexity min_n
## <dbl> <int>
## 1 0.0000000001 2
## 2 0.0000000178 2
## 3 0.00000316 2
## 4 0.000562 2
## 5 0.1 2
## 6 0.0000000001 11
## 7 0.0000000178 11
## 8 0.00000316 11
## 9 0.000562 11
## 10 0.1 11
## # ... with 15 more rows
#possible values for min_n
tree_grid %>%
count(min_n)
## # A tibble: 5 x 2
## min_n n
## <int> <int>
## 1 2 5
## 2 11 5
## 3 21 5
## 4 30 5
## 5 40 5
#setting a v-fold cross validation
set.seed(234)
pisa_folds <- vfold_cv(train) #default = 10
Next, we will fit the model at all the different values we chose for each tuned hyperparameter. We will tune a workflow() that bundles the model specification with the formula.
#workflow
set.seed(345)
tree_wf <- workflow() %>%
add_model(tune_dt) %>%
add_formula(science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE)
tree_wf
## == Workflow ===========================================================================================================
## Preprocessor: Formula
## Model: decision_tree()
##
## -- Preprocessor -------------------------------------------------------------------------------------------------------
## science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS +
## COOPERATE
##
## -- Model --------------------------------------------------------------------------------------------------------------
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = tune()
## min_n = tune()
##
## Computational engine: rpart
#resampling with the workflow
tree_res <-
tree_wf %>%
tune_grid(
resamples = pisa_folds, #the fold
grid = tree_grid #the hyperparameter grid
)
tree_res
## # 10-fold cross-validation
## # A tibble: 10 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [2.7K/297]> Fold01 <tibble [50 x 5]> <tibble [0 x 1]>
## 2 <split [2.7K/296]> Fold02 <tibble [50 x 5]> <tibble [0 x 1]>
## 3 <split [2.7K/296]> Fold03 <tibble [50 x 5]> <tibble [0 x 1]>
## 4 <split [2.7K/296]> Fold04 <tibble [50 x 5]> <tibble [0 x 1]>
## 5 <split [2.7K/296]> Fold05 <tibble [50 x 5]> <tibble [0 x 1]>
## 6 <split [2.7K/296]> Fold06 <tibble [50 x 5]> <tibble [0 x 1]>
## 7 <split [2.7K/296]> Fold07 <tibble [50 x 5]> <tibble [0 x 1]>
## 8 <split [2.7K/296]> Fold08 <tibble [50 x 5]> <tibble [0 x 1]>
## 9 <split [2.7K/296]> Fold09 <tibble [50 x 5]> <tibble [0 x 1]>
## 10 <split [2.7K/296]> Fold10 <tibble [50 x 5]> <tibble [0 x 1]>
Now we have our tuning results. Let’s collect and visualize them with collect_metrics().
tree_res %>%
collect_metrics()
## # A tibble: 50 x 7
## cost_complexity min_n .metric .estimator mean n std_err
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 2 accuracy binary 0.566 10 0.0125
## 2 0.0000000001 2 roc_auc binary 0.565 10 0.0126
## 3 0.0000000001 11 accuracy binary 0.591 10 0.00882
## 4 0.0000000001 11 roc_auc binary 0.633 10 0.0127
## 5 0.0000000001 21 accuracy binary 0.602 10 0.0115
## 6 0.0000000001 21 roc_auc binary 0.650 10 0.0112
## 7 0.0000000001 30 accuracy binary 0.614 10 0.0107
## 8 0.0000000001 30 roc_auc binary 0.665 10 0.0128
## 9 0.0000000001 40 accuracy binary 0.620 10 0.00954
## 10 0.0000000001 40 roc_auc binary 0.674 10 0.0118
## # ... with 40 more rows
#plot
tree_res %>%
collect_metrics() %>%
mutate(min_n = factor(min_n)) %>%
ggplot(aes(cost_complexity, mean, color = min_n)) +
geom_line(size = 1.5, alpha = 0.6) +
geom_point(size = 2) +
facet_wrap(~ .metric, scales = "free", nrow = 2) +
#Transform the x-axis into log10 scale
scale_x_log10(labels = scales::label_number()) +
scale_color_viridis_d(option = "plasma", begin = .9, end = 0)
From the plot above, min_n = 2 performs worst in terms of accuracy and AUC across all values of cost_complexity. The larger min_n is, the better the model performs, with 40 being the best candidate. Let us see the best candidates.
tree_res %>%
show_best("roc_auc")
## # A tibble: 5 x 7
## cost_complexity min_n .metric .estimator mean n std_err
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.000562 40 roc_auc binary 0.676 10 0.0110
## 2 0.0000000001 40 roc_auc binary 0.674 10 0.0118
## 3 0.0000000178 40 roc_auc binary 0.674 10 0.0118
## 4 0.00000316 40 roc_auc binary 0.674 10 0.0118
## 5 0.000562 30 roc_auc binary 0.671 10 0.0114
tree_res %>%
show_best("accuracy")
## # A tibble: 5 x 7
## cost_complexity min_n .metric .estimator mean n std_err
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.000562 40 accuracy binary 0.627 10 0.00926
## 2 0.000562 30 accuracy binary 0.622 10 0.0105
## 3 0.1 2 accuracy binary 0.622 10 0.00882
## 4 0.1 11 accuracy binary 0.622 10 0.00882
## 5 0.1 21 accuracy binary 0.622 10 0.00882
We will use select_best() to pull out the best hyperparameter combination.
best_tree <- tree_res %>%
select_best(metric = "roc_auc")
best_tree
## # A tibble: 1 x 2
## cost_complexity min_n
## <dbl> <int>
## 1 0.000562 40
The values that maximize AUC in this PISA data set are a cost complexity of 0.000562 and a minimum node size of 40. We can now finalize our workflow tree_wf with the values from select_best().
final_wf <-
tree_wf %>%
finalize_workflow(best_tree)
final_wf
## == Workflow ===========================================================================================================
## Preprocessor: Formula
## Model: decision_tree()
##
## -- Preprocessor -------------------------------------------------------------------------------------------------------
## science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS +
## COOPERATE
##
## -- Model --------------------------------------------------------------------------------------------------------------
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = 0.000562341325190349
## min_n = 40
##
## Computational engine: rpart
Our tuning is done!
Now, to fit the finalized workflow to the training data.
final_tree <-
final_wf %>%
fit(data = train)
final_tree
## == Workflow [trained] =================================================================================================
## Preprocessor: Formula
## Model: decision_tree()
##
## -- Preprocessor -------------------------------------------------------------------------------------------------------
## science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS +
## COOPERATE
##
## -- Model --------------------------------------------------------------------------------------------------------------
## n= 2961
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2961 1415 High (0.52212091 0.47787909)
## 2) grade1>=9.5 2309 941 High (0.59246427 0.40753573)
## 4) computer>=0.5 1640 566 High (0.65487805 0.34512195)
## 8) ESCS>=-0.5287 536 123 High (0.77052239 0.22947761)
## 16) COOPERATE>=-0.0322 265 43 High (0.83773585 0.16226415) *
## 17) COOPERATE< -0.0322 271 80 High (0.70479705 0.29520295)
## 34) ESCS>=0.16805 135 29 High (0.78518519 0.21481481) *
## 35) ESCS< 0.16805 136 51 High (0.62500000 0.37500000)
## 70) ESCS< -0.407 25 4 High (0.84000000 0.16000000) *
## 71) ESCS>=-0.407 111 47 High (0.57657658 0.42342342)
## 142) ESCS>=-0.17595 69 24 High (0.65217391 0.34782609) *
## 143) ESCS< -0.17595 42 19 Low (0.45238095 0.54761905)
## 286) COOPERATE>=-0.5287 20 8 High (0.60000000 0.40000000) *
## 287) COOPERATE< -0.5287 22 7 Low (0.31818182 0.68181818) *
## 9) ESCS< -0.5287 1104 443 High (0.59873188 0.40126812)
## 18) EMOSUPS>=-2.3605 1073 418 High (0.61043802 0.38956198)
## 36) COOPERATE>=-0.17565 512 167 High (0.67382812 0.32617188)
## 72) COOPERATE< 0.86235 261 72 High (0.72413793 0.27586207) *
## 73) COOPERATE>=0.86235 251 95 High (0.62151394 0.37848606)
## 146) ESCS>=-2.5894 223 80 High (0.64125561 0.35874439)
## 292) ESCS< -2.37835 21 4 High (0.80952381 0.19047619) *
## 293) ESCS>=-2.37835 202 76 High (0.62376238 0.37623762)
## 586) ESCS>=-0.70875 19 4 High (0.78947368 0.21052632) *
## 587) ESCS< -0.70875 183 72 High (0.60655738 0.39344262)
## 1174) ESCS< -1.0688 141 51 High (0.63829787 0.36170213) *
## 1175) ESCS>=-1.0688 42 21 High (0.50000000 0.50000000)
## 2350) ESCS>=-0.80835 16 6 High (0.62500000 0.37500000) *
## 2351) ESCS< -0.80835 26 11 Low (0.42307692 0.57692308) *
## 147) ESCS< -2.5894 28 13 Low (0.46428571 0.53571429) *
## 37) COOPERATE< -0.17565 561 251 High (0.55258467 0.44741533)
## 74) ESCS>=-2.73035 528 228 High (0.56818182 0.43181818)
## 148) EMOSUPS>=0.425 145 50 High (0.65517241 0.34482759)
## 296) ESCS>=-0.9385 36 8 High (0.77777778 0.22222222) *
## 297) ESCS< -0.9385 109 42 High (0.61467890 0.38532110)
## 594) ESCS< -1.185 93 32 High (0.65591398 0.34408602) *
## 595) ESCS>=-1.185 16 6 Low (0.37500000 0.62500000) *
## 149) EMOSUPS< 0.425 383 178 High (0.53524804 0.46475196)
## 298) COOPERATE< -0.45625 222 91 High (0.59009009 0.40990991)
## 596) COOPERATE>=-1.0846 146 49 High (0.66438356 0.33561644)
## 1192) ESCS< -0.75945 132 39 High (0.70454545 0.29545455)
## 2384) COOPERATE>=-0.63175 18 2 High (0.88888889 0.11111111) *
## 2385) COOPERATE< -0.63175 114 37 High (0.67543860 0.32456140)
## 4770) ESCS< -1.98885 34 7 High (0.79411765 0.20588235) *
## 4771) ESCS>=-1.98885 80 30 High (0.62500000 0.37500000)
## 9542) ESCS>=-1.613 47 12 High (0.74468085 0.25531915) *
##
## ...
## and 70 more lines.
Now that we have the final tree fitted within the model workflow, we can investigate variable importance with pull_workflow_fit() and vip().
final_tree %>%
pull_workflow_fit() %>%
vip::vip()
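We could also plot the fitted tree itself. As a sketch, pull_workflow_fit() returns a parsnip fit object whose fit element holds the underlying rpart model:
# Extract the rpart object from the fitted workflow and plot it
rpart.plot(pull_workflow_fit(final_tree)$fit, roundint = FALSE)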
Now we will fit the finalized model to the full training set and evaluate it on the test set with last_fit().
final_fit <-
final_wf %>%
last_fit(pisa_split)
final_fit %>%
collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.629
## 2 roc_auc binary 0.671
#ROC curves
final_fit %>%
collect_predictions() %>%
roc_curve(science_tr, .pred_High) %>%
autoplot()
Now we can compare the rpart and tidymodels approaches for future reference. The final model from the non-tidy method has an overall accuracy of 66%, while the tidy method yields an accuracy of about 63% and an ROC AUC of 0.67.
For Random Forest (RF), we use the randomForest and caret packages to apply the method to classification and regression problems. The usage is similar to rpart, with the essential arguments as follows:
- formula: as in rpart, defines the DV and the predictors (IVs), separated by ~.
- importance: if importance = TRUE, the importance of the predictors is assessed in the model.
- ntree: the number of trees to grow.
In this practice, we will work on the same problem as with the decision tree model. The initial ntree is 1000, but we will evaluate and adjust it as we proceed.
rf_fit1 <- randomForest(formula = science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS + COOPERATE,
data = train,
importance = TRUE,
ntree = 1000)
print(rf_fit1)
##
## Call:
## randomForest(formula = science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS + COOPERATE, data = train, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 34.28%
## Confusion matrix:
## High Low class.error
## High 1167 379 0.2451488
## Low 636 779 0.4494700
In the output, we see the confusion matrix along with the classification error and the out-of-bag (OOB) error estimate.
OOB error measures the prediction error of a random forest by predicting each training observation using only the trees that did not include that observation in their bootstrap sample, and averaging the resulting errors.
The result above shows an OOB error of about 34%, and a classification error of about 25% for the high group and 45% for the low group.
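These error rates are also stored in the fitted object; as a quick check (assuming the standard randomForest object structure), the last row of err.rate corresponds to the full 1,000-tree forest:
# OOB and per-class error rates after the final tree
tail(rf_fit1$err.rate, 1)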
Next, by checking the error level across the number of trees, we can determine a suitable number of trees for our model.
plot(rf_fit1)
The plot shows that the error level does not decrease much further after roughly 50 trees, so we can run our model again with ntree = 50 this time.
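For a rough numerical check of where the OOB error bottoms out (a sketch; the curve is noisy, so treat this as a guide rather than an exact answer):
# Number of trees at which the cumulative OOB error is lowest
which.min(rf_fit1$err.rate[, "OOB"])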
rf_fit2 <- randomForest(formula = science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS + COOPERATE,
data = train,
importance = TRUE,
ntree = 50)
print(rf_fit2)
##
## Call:
## randomForest(formula = science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS + COOPERATE, data = train, importance = TRUE, ntree = 50)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 35.7%
## Confusion matrix:
## High Low class.error
## High 1141 405 0.2619664
## Low 652 763 0.4607774
We can see the overall accuracy of the model as follows:
sum(diag(rf_fit2$confusion)) / nrow(train)
## [1] 0.643026
As we did for the decision tree, we can check the importance of the predictors in the model with importance() and varImpPlot(). With importance(), we first extract the importance measures (1), turn them into a data.frame (2), save the row names as predictor names (3), and finally sort the data by MeanDecreaseGini (4). (You can also see the basic output using importance(rf_fit2) alone.)
importance(rf_fit2) %>% #(1)
as.data.frame() %>% #(2)
mutate(Predictors = row.names(.)) %>% #(3)
arrange(desc(MeanDecreaseGini)) #(4)
## High Low MeanDecreaseAccuracy MeanDecreaseGini Predictors
## 1 6.6211406 10.13625762 12.6670933 321.93721 ESCS
## 2 1.6314780 1.92065466 2.5665387 188.97823 COOPERATE
## 3 2.1401708 0.91865485 2.3243344 171.39810 EMOSUPS
## 4 12.5223945 16.50006386 16.1166758 95.93737 grade1
## 5 5.4397193 5.51937606 8.0501893 48.04911 computer
## 6 0.8734394 0.05139286 0.7653936 26.80365 own.room
varImpPlot(rf_fit2,
main = "Importance of Variables for Science Performance")
The output shows different importance measures for the predictors used in the model. MeanDecreaseAccuracy represents the mean decrease in classification accuracy (or, for regression, the increase in mean squared error) when the variable is permuted, while MeanDecreaseGini represents the total decrease in node impurity from splitting on the variable, averaged over all trees.
In the output, ESCS and grade1 are the two predictors that influence model performance the most, whereas own.room and EMOSUPS are among the least important variables. varImpPlot() visualizes these results.
Next, we check the confusion matrix to see the accuracy, sensitivity, and specificity of our model to the test set.
rf_pred <- predict(rf_fit2, test) %>% #predict
as.data.frame() %>% #coerce into a data.frame
mutate(science_tr = as.factor(`.`)) %>% #transform the DV into factor
select(science_tr)
confusionMatrix(rf_pred$science_tr, test$science_tr) #compare with the actual result
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 544 267
## Low 164 294
##
## Accuracy : 0.6604
## 95% CI : (0.6336, 0.6864)
## No Information Rate : 0.5579
## P-Value [Acc > NIR] : 6.535e-14
##
## Kappa : 0.2981
##
## Mcnemar's Test P-Value : 8.962e-07
##
## Sensitivity : 0.7684
## Specificity : 0.5241
## Pos Pred Value : 0.6708
## Neg Pred Value : 0.6419
## Prevalence : 0.5579
## Detection Rate : 0.4287
## Detection Prevalence : 0.6391
## Balanced Accuracy : 0.6462
##
## 'Positive' Class : High
##
The RF model's overall accuracy is 66%, with 77% sensitivity and 52% specificity.
### tidymodels
We will use validation_split() to split the training set into a single resampled validation set (20%) to measure the model's performance and a smaller training set (80%) to train the model.
set.seed(678)
pisa_val <- validation_split(train, prop = 0.80) #indicating the proportion
pisa_val
## # Validation Set Split (0.8/0.2)
## # A tibble: 1 x 2
## splits id
## <list> <chr>
## 1 <split [2.4K/592]> validation
For this method, we will tune two hyperparameters: mtry, the number of predictors sampled at each split in a tree, and min_n, the minimum number of observations required to split a node.
cores <- parallel::detectCores()
cores
## [1] 8
#specify the model
rf_mod <-
rand_forest(mtry = tune(), min_n = tune(), trees = 50) %>%
set_engine("ranger", num.threads = cores) %>%
set_mode("classification")
rf_mod
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 50
## min_n = tune()
##
## Engine-Specific Arguments:
## num.threads = cores
##
## Computational engine: ranger
#show what will be tuned
rf_mod %>%
parameters()
## Collection of 2 parameters for tuning
##
## id parameter type object class
## mtry mtry nparam[?]
## min_n min_n nparam[+]
##
## Model parameters needing finalization:
## # Randomly Selected Predictors ('mtry')
##
## See `?dials::finalize` or `?dials::update.parameters` for more information.
#create a workflow
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_formula(science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE)
rf_workflow
## == Workflow ===========================================================================================================
## Preprocessor: Formula
## Model: rand_forest()
##
## -- Preprocessor -------------------------------------------------------------------------------------------------------
## science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS +
## COOPERATE
##
## -- Model --------------------------------------------------------------------------------------------------------------
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 50
## min_n = tune()
##
## Engine-Specific Arguments:
## num.threads = cores
##
## Computational engine: ranger
We will use a space-filling design to tune the hyperparameters, with 25 candidate models:
set.seed(102030) #For replicability
rf_res <-
rf_workflow %>%
tune_grid(pisa_val, #pisa_val contains the analysis and validation splits
grid = 25, #intended number of candidate models
control = control_grid(save_pred = TRUE), #retain this prediction value for diagnosis.
metrics = metric_set(roc_auc)) #ask for the AUC.
## i Creating pre-processing data to finalize unknown parameter: mtry
Here are our top 5 random forest models, out of the 25 candidates in terms of AUC:
rf_res %>%
show_best(metric = "roc_auc")
## # A tibble: 5 x 7
## mtry min_n .metric .estimator mean n std_err
## <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 1 11 roc_auc binary 0.704 1 NA
## 2 1 23 roc_auc binary 0.703 1 NA
## 3 2 34 roc_auc binary 0.702 1 NA
## 4 2 38 roc_auc binary 0.697 1 NA
## 5 2 31 roc_auc binary 0.695 1 NA
The top five models all perform better than the manually tuned model from the non-tidy RF above. Next, we plot both hyperparameters to see the performance of all model candidates:
autoplot(rf_res)
The y-axis range is 68% to 70%, which is better than the manually tuned model. Let’s select the best model according to the ROC AUC metric. Our final tuning parameter values are:
rf_best <-
rf_res %>% #The model
select_best(metric = "roc_auc")
rf_best
## # A tibble: 1 x 2
## mtry min_n
## <int> <int>
## 1 1 11
To calculate the data needed to plot the ROC curve, we use collect_predictions().
rf_res %>%
collect_predictions()
## # A tibble: 14,800 x 7
## id .pred_High .pred_Low .row mtry min_n science_tr
## <chr> <dbl> <dbl> <int> <int> <int> <fct>
## 1 validation 0.127 0.873 6 4 28 Low
## 2 validation 0.537 0.463 14 4 28 High
## 3 validation 0.230 0.770 19 4 28 High
## 4 validation 0.363 0.637 22 4 28 Low
## 5 validation 0.153 0.847 24 4 28 Low
## 6 validation 0.286 0.714 29 4 28 Low
## 7 validation 0.544 0.456 40 4 28 High
## 8 validation 0.256 0.744 50 4 28 High
## 9 validation 0.499 0.501 54 4 28 Low
## 10 validation 0.607 0.393 55 4 28 High
## # ... with 14,790 more rows
Now, to collect the predictions for the best model candidate and compute its ROC curve:
rf_auc <-
rf_res %>%
collect_predictions(parameters = rf_best) %>% #to collect only the best model.
roc_curve(science_tr, .pred_High) %>% #use only these two columns.
mutate(model = "Random Forest") #specify the model.
rf_auc
## # A tibble: 572 x 4
## .threshold specificity sensitivity model
## <dbl> <dbl> <dbl> <chr>
## 1 -Inf 0 1 Random Forest
## 2 0.157 0 1 Random Forest
## 3 0.163 0.00360 1 Random Forest
## 4 0.164 0.00719 1 Random Forest
## 5 0.165 0.0108 1 Random Forest
## 6 0.166 0.0144 1 Random Forest
## 7 0.171 0.0144 0.997 Random Forest
## 8 0.172 0.0180 0.997 Random Forest
## 9 0.178 0.0216 0.997 Random Forest
## 10 0.185 0.0252 0.997 Random Forest
## # ... with 562 more rows
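We do not plot this curve separately in the original workflow, but a quick ggplot2 sketch of it would be:
rf_auc %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(linetype = "dashed") +
  coord_equal() +
  theme_bw()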
Our last task is to fit the final model on all rows of the training data and evaluate its performance on the test set.
# the last model
last_rf_mod <-
rand_forest(mtry = 1, min_n = 11, trees = 50) %>% #the best candidate
set_engine("ranger", num.threads = cores, importance = "impurity") %>%
set_mode("classification")
# the last workflow
last_rf_workflow <-
rf_workflow %>% #random forest workflow
update_model(last_rf_mod) #update the workflow with the above model.
# the last fit
set.seed(6431)
last_rf_fit <-
last_rf_workflow %>%
last_fit(pisa_split) #Fit the final best model to the training set and evaluate the test set. "splits" contains both the training and the testing set.
last_rf_fit
## # Monte Carlo cross-validation (0.7/0.3) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [3K/1~ train/test ~ <tibble [2 ~ <tibble [0~ <tibble [1,269 ~ <workflo~
This fitted workflow contains everything, including our final metrics based on the test set. Let’s see how it performs.
last_rf_fit %>%
collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.677
## 2 roc_auc binary 0.724
The ROC AUC from the test set is not far from what we obtained when tuning the model on the validation set. We can access the variable importance scores via the .workflow column.
last_rf_fit %>%
pluck(".workflow", 1) %>% #take out the first element in the ".workflow" column.
pull_workflow_fit() %>% #Extract element of the fitted model.
vip::vip() #Display variable importance
The most important variable is ESCS, followed by grade1, COOPERATE, and EMOSUPS. The VIP plot is similar to the one from the non-tidy method above.
Let’s generate our last ROC curve to visualize.
last_rf_fit %>%
collect_predictions() %>% #Collect the prediction metrics
roc_curve(science_tr, .pred_High) %>% #constructs the full ROC curve and provide it with classification probability.
autoplot()
To fit support vector classifiers (and SVMs) in R, we will use the e1071 package. The svm() function in e1071 requires the outcome variable to be a factor, like the science_tr variable we created earlier. If the outcome is not a factor, svm() will perform regression.
We will use the svm() function with the kernel = "linear" argument. For the hyperparameter, we need to specify our tolerance, which is represented inversely by the cost argument: when cost is low, the tolerance is high (a wide margin and many support vectors), and vice versa.
cost is set to 1 by default, and we will tune this parameter via cross-validation. But for now, let's see what the model looks like without tuning.
svc_fit <- svm(formula = science_tr ~ reading + math + grade1 + computer + own.room +
ESCS + EMOSUPS + COOPERATE,
data = train,
kernel = "linear")
#see the result
summary(svc_fit)
##
## Call:
## svm(formula = science_tr ~ reading + math + grade1 + computer + own.room +
## ESCS + EMOSUPS + COOPERATE, data = train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 604
##
## ( 302 302 )
##
##
## Number of Classes: 2
##
## Levels:
## High Low
We can see that there are 604 support vectors: 302 in class “High” and 302 in class “Low”. We can also plot our model, but we need to specify the two features to plot (because our model has eight features). Let's look at the model with reading on the y-axis and math on the x-axis.
plot(svc_fit, data = train, reading ~ math)
In this figure, the Xs are the support vectors, while the Os are the non-support-vector observations; the upper triangle is for “High”, while the lower triangle is for “Low”. Although the decision boundary looks jagged, that is just an artifact of how this function draws it.
We can see that many observations are misclassified (i.e., some red points are in the higher triangle and some black points are in the lower triangle). However, there are a lot of observations shown in this figure and it is difficult to discern the nature of the misclassification. Therefore, let’s take a random sample of 500 observations to get a better sense of our classifier.
set.seed(99)
ran_obs <- sample(1:nrow(train), 500) #generate a subset of data
plot(svc_fit, data = train[ran_obs, ], reading ~ math)
To further improve the model, we can tune the cost hyperparameter via cross-validation for optimal classification. We can use e1071::tune(), which runs 10-fold CV by default. We need to provide the tune function with a range of cost values (which, again, corresponds to our tolerance for violations of the margin and hyperplane).
tune_svc <- e1071::tune(svm,
science_tr ~ reading + math + grade1 + computer + own.room +
ESCS + EMOSUPS + COOPERATE,
data = train,
kernel="linear",
ranges = list(cost = c(.01, .1, 1, 5, 10)))
summary(tune_svc)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.08510329
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.08510329 0.01393305
## 2 0.10 0.08780485 0.01508432
## 3 1.00 0.08949290 0.01567712
## 4 5.00 0.08881723 0.01629430
## 5 10.00 0.08881723 0.01590039
Now, to select the best model and view it.
best_svc <- tune_svc$best.model
summary(best_svc)
##
## Call:
## best.tune(method = svm, train.x = science_tr ~ reading + math + grade1 +
## computer + own.room + ESCS + EMOSUPS + COOPERATE, data = train,
## ranges = list(cost = c(0.01, 0.1, 1, 5, 10)), kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 945
##
## ( 474 471 )
##
##
## Number of Classes: 2
##
## Levels:
## High Low
Next, we write a function to evaluate our classifier; it takes a confusion matrix as its argument.
eval_classifier <- function(tab, print = F)
{
  # tab: a 2x2 confusion matrix with observed classes in rows and predicted
  # classes in columns, with the positive class ("High") listed first
  n <- sum(tab)
  TN <- tab[2,2] # observed negative, predicted negative
  FP <- tab[2,1] # observed negative, predicted positive
  FN <- tab[1,2] # observed positive, predicted negative
  TP <- tab[1,1] # observed positive, predicted positive
  classify.rate <- (TP + TN) / n # accuracy
  TP.rate <- TP / (TP + FN) # sensitivity
  TN.rate <- TN / (TN + FP) # specificity
  object <- data.frame(accuracy = classify.rate,
                       sensitivity = TP.rate,
                       specificity = TN.rate)
  return(object)
}
A confusion matrix for our best_svc can be created by:
svc_train <- table(train$science_tr, predict(best_svc)) #observed value, predicted value
eval_classifier(svc_train)
## accuracy sensitivity specificity
## 1 0.9145559 0.9075032 0.9222615
These statistics are overly optimistic as we are evaluating our model using the training data. Let’s see how it works with the test data.
svc_test <- table(test$science_tr, predict(best_svc, newdata = test))
eval_classifier(svc_test)
## accuracy sensitivity specificity
## 1 0.8967691 0.8983051 0.8948307
The statistics drop a bit but are still good. This is likely because math and reading are highly correlated with science scores.
Support vector classifiers are quite similar to logistic regression in terms of the loss functions used to estimate parameters. In situations where the classes are well separated, SVMs tend to do better; conversely, logistic regression tends to work better when the classes are not well separated.
lr_fit <- glm(science_tr ~ reading + math + grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE,
data = train,
family = "binomial")
#view the coefficient
coef(lr_fit)
## (Intercept) reading math grade1 computer own.room
## 36.46185453 -0.04268969 -0.04191258 -0.08310064 0.08228638 0.18142714
## ESCS EMOSUPS COOPERATE
## 0.02039443 -0.01612452 -0.01390496
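Note that with a factor outcome, glm() models the probability of the second factor level (here, "Low"), so the negative coefficients for reading and math mean that higher scores lower the odds of being in the Low group. As a small sketch, exponentiating the coefficients gives odds ratios:
# Odds ratios for membership in the "Low" group (the second factor level)
exp(coef(lr_fit))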
How does it do relative to our best support vector classifier on the training and testing data sets? Let's look at the training data first.
lr_train <- table(train$science_tr,
round(predict(lr_fit, type = "response")))
lr_train
##
## 0 1
## High 1407 139
## Low 117 1298
eval_classifier(lr_train)
## accuracy sensitivity specificity
## 1 0.9135427 0.9100906 0.9173145
and then for the test data set.
lr_test <- table(test$science_tr,
round(predict(lr_fit, newdata = test, type = "response")))
lr_test
##
## 0 1
## High 638 70
## Low 68 493
eval_classifier(lr_test)
## accuracy sensitivity specificity
## 1 0.891253 0.9011299 0.8787879
Reference: https://github.com/okanbulut/edm