We will be using the Programme for International Student Assessment (PISA) 2015 data set for students in Turkey. The data come from a large-scale international assessment of students in various knowledge domains.
setwd("D:/Class Materials & Work/Summer 2020 practice/Educational Data Mining")
# Packages to be used
library("data.table")
library("tidyverse")
library("tidymodels")
library("e1071")
library("caret")
library("mlr")
library("modelr")
library("randomForest")
library("rpart")
library("rpart.plot")
library("GGally")
library("ggExtra")
After importing the data, it is recommended to check the properties of the data for good measure. fread is preferable when reading a big data set because it is faster and lets us control the number of threads used. The function also allows us to read missing values as NA with na.strings = "NA".
Another way we can deal with missing data is to identify incomplete cases with complete.cases() and remove them with na.omit().
# fread function from data.table (reading as f-read)
pisa <- fread("pisa_turkey.csv", na.strings = "NA")
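For instance, if we wanted to cap the number of threads fread uses, the nThread argument controls this (a sketch, not run here):
# Not run: read the same file using at most two threads
# pisa <- fread("pisa_turkey.csv", na.strings = "NA", nThread = 2)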
# Check the data class
class(pisa)
## [1] "data.table" "data.frame"
# See variable names
names(pisa)
## [1] "CNTSTUID" "gender" "female" "grade" "computer"
## [6] "software" "internet" "desk" "own.room" "quiet.study"
## [11] "lit" "poetry" "art" "book.sch" "tech.book"
## [16] "dict" "art.book" "ST011Q05TA" "ST071Q02NA" "ST071Q01NA"
## [21] "ST123Q02NA" "ST082Q01NA" "ST119Q01NA" "ST119Q05NA" "ANXTEST"
## [26] "COOPERATE" "BELONG" "EMOSUPS" "WEALTH" "PARED"
## [31] "TMINS" "ESCS" "TEACHSUP" "TDTEACH" "IBTEACH"
## [36] "SCIEEFF" "math" "reading" "science"
# Preview the data
head(pisa)
## CNTSTUID gender female grade computer software internet desk own.room
## 1: 79200939 Female 1 Grade 9 0 0 0 1 1
## 2: 79206774 Female 1 Grade 9 1 1 1 1 1
## 3: 79204670 Female 1 Grade 9 0 0 1 1 0
## 4: 79201647 Female 1 Grade 9 1 1 0 1 0
## 5: 79203718 Female 1 Grade 9 0 0 0 0 0
## 6: 79204968 Female 1 Grade 9 0 0 0 1 0
## quiet.study lit poetry art book.sch tech.book dict art.book ST011Q05TA
## 1: 1 1 1 0 1 0 1 0 No
## 2: 0 1 0 1 1 0 1 0 Yes
## 3: 0 1 0 0 0 0 1 0 No
## 4: 0 1 1 0 0 0 1 0 Yes
## 5: 0 0 0 0 0 0 1 0 No
## 6: 0 0 1 0 0 0 1 0 No
## ST071Q02NA ST071Q01NA ST123Q02NA ST082Q01NA ST119Q01NA
## 1: 11 7 Strongly agree Agree Strongly agree
## 2: NA 5 Strongly agree Agree Strongly agree
## 3: 6 4 Agree Agree Strongly agree
## 4: 6 3 Agree Disagree Strongly agree
## 5: 6 6 Strongly agree Disagree Agree
## 6: 12 6 Disagree Strongly disagree Strongly agree
## ST119Q05NA ANXTEST COOPERATE BELONG EMOSUPS WEALTH PARED TMINS ESCS
## 1: Strongly agree 1.4394 1.3264 -0.8482 0.5658 -1.7154 12 1560 -1.7902
## 2: Agree 0.6654 1.1640 -0.7657 1.0991 -1.2426 8 900 -1.8259
## 3: Strongly agree 1.2704 0.5759 -0.5064 -1.3298 -1.2256 12 1600 -1.2038
## 4: Strongly agree 2.5493 0.9675 0.3142 -0.3306 -1.8473 16 2280 -0.9068
## 5: Strongly agree 1.1311 -0.2882 -0.6925 -0.2495 -4.7851 4 1600 -4.1131
## 6: Strongly agree 1.1311 -1.0629 -0.9932 -1.8280 -2.8012 14 1495 -1.0631
## TEACHSUP TDTEACH IBTEACH SCIEEFF math reading science
## 1: 0.7900 0.9505 0.8519 1.8526 361.9234 408.0202 394.2937
## 2: 1.4475 0.6580 -1.0496 -0.2154 325.4144 375.4197 350.5305
## 3: -0.1164 0.4505 0.5453 0.5561 365.5649 381.6692 396.2027
## 4: -0.4527 -0.0057 1.4812 0.5037 333.6344 345.9447 336.3870
## 5: 0.2725 0.0615 0.9054 0.1091 381.7032 381.6998 403.6358
## 6: -1.0080 -0.8780 0.0441 -1.4295 352.0996 350.1369 393.2887
# Check data dimensions
dim(pisa)
## [1] 5895 39
# Check data structure
str(pisa)
## Classes 'data.table' and 'data.frame': 5895 obs. of 39 variables:
## $ CNTSTUID : int 79200939 79206774 79204670 79201647 79203718 79204968 79200200 79200563 79200129 79205586 ...
## $ gender : chr "Female" "Female" "Female" "Female" ...
## $ female : int 1 1 1 1 1 1 1 1 1 1 ...
## $ grade : chr "Grade 9" "Grade 9" "Grade 9" "Grade 9" ...
## $ computer : int 0 1 0 1 0 0 0 1 0 1 ...
## $ software : int 0 1 0 1 0 0 1 1 0 NA ...
## $ internet : int 0 1 1 0 0 0 0 0 0 1 ...
## $ desk : int 1 1 1 1 0 1 0 0 0 1 ...
## $ own.room : int 1 1 0 0 0 0 0 0 0 1 ...
## $ quiet.study: int 1 0 0 0 0 0 1 0 1 1 ...
## $ lit : int 1 1 1 1 0 0 0 0 0 1 ...
## $ poetry : int 1 0 0 1 0 1 0 0 0 NA ...
## $ art : int 0 1 0 0 0 0 0 0 0 NA ...
## $ book.sch : int 1 1 0 0 0 0 0 1 0 1 ...
## $ tech.book : int 0 0 0 0 0 0 0 0 NA NA ...
## $ dict : int 1 1 1 1 1 1 1 1 1 1 ...
## $ art.book : int 0 0 0 0 0 0 0 0 0 1 ...
## $ ST011Q05TA : chr "No" "Yes" "No" "Yes" ...
## $ ST071Q02NA : int 11 NA 6 6 6 12 6 6 NA 3 ...
## $ ST071Q01NA : int 7 5 4 3 6 6 7 4 NA 10 ...
## $ ST123Q02NA : chr "Strongly agree" "Strongly agree" "Agree" "Agree" ...
## $ ST082Q01NA : chr "Agree" "Agree" "Agree" "Disagree" ...
## $ ST119Q01NA : chr "Strongly agree" "Strongly agree" "Strongly agree" "Strongly agree" ...
## $ ST119Q05NA : chr "Strongly agree" "Agree" "Strongly agree" "Strongly agree" ...
## $ ANXTEST : num 1.439 0.665 1.27 2.549 1.131 ...
## $ COOPERATE : num 1.326 1.164 0.576 0.968 -0.288 ...
## $ BELONG : num -0.848 -0.766 -0.506 0.314 -0.692 ...
## $ EMOSUPS : num 0.566 1.099 -1.33 -0.331 -0.249 ...
## $ WEALTH : num -1.72 -1.24 -1.23 -1.85 -4.79 ...
## $ PARED : int 12 8 12 16 4 14 8 4 4 8 ...
## $ TMINS : int 1560 900 1600 2280 1600 1495 1600 1600 NA 1600 ...
## $ ESCS : num -1.79 -1.826 -1.204 -0.907 -4.113 ...
## $ TEACHSUP : num 0.79 1.448 -0.116 -0.453 0.273 ...
## $ TDTEACH : num 0.9505 0.658 0.4505 -0.0057 0.0615 ...
## $ IBTEACH : num 0.852 -1.05 0.545 1.481 0.905 ...
## $ SCIEEFF : num 1.853 -0.215 0.556 0.504 0.109 ...
## $ math : num 362 325 366 334 382 ...
## $ reading : num 408 375 382 346 382 ...
## $ science : num 394 351 396 336 404 ...
## - attr(*, ".internal.selfref")=<externalptr>
# Counting unique, missing, median, and mean values
pisa %>% summarise(n = n_distinct(TMINS),
na = sum(is.na(TMINS)),
med = median(TMINS, na.rm = TRUE),
mean = mean(TMINS, na.rm = TRUE))
## n na med mean
## 1 192 382 1600 1558.749
For missing values, we can either remove them as mentioned above, replace them with the mean or median (imputation), or perform listwise or pairwise deletion. See below for the code for mean imputation of the TMINS variable, although we will not run it in this practice.
# pisa <- pisa %>%
#   mutate(TMINS = replace(TMINS,
#                          is.na(TMINS),
#                          mean(TMINS, na.rm = TRUE)))
As for the other methods, listwise deletion removes a case entirely if it has one or more missing values on the variables of interest, while pairwise deletion, which is the default in many analyses, drops a case only from calculations that involve its missing variables; other analyses can still use the same case through its non-missing variables.
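As a quick sketch of the listwise-deletion option (we will actually apply it later, right before splitting the data):
# Number of rows with no missing values on any variable
sum(complete.cases(pisa))
# Listwise deletion across all variables (not run yet)
# pisa_complete <- na.omit(pisa)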
### Data Wrangling and Visualization
Next, we will define new variables from the existing ones and visualize them.
variable.names(pisa)
## [1] "CNTSTUID" "gender" "female" "grade" "computer"
## [6] "software" "internet" "desk" "own.room" "quiet.study"
## [11] "lit" "poetry" "art" "book.sch" "tech.book"
## [16] "dict" "art.book" "ST011Q05TA" "ST071Q02NA" "ST071Q01NA"
## [21] "ST123Q02NA" "ST082Q01NA" "ST119Q01NA" "ST119Q05NA" "ANXTEST"
## [26] "COOPERATE" "BELONG" "EMOSUPS" "WEALTH" "PARED"
## [31] "TMINS" "ESCS" "TEACHSUP" "TDTEACH" "IBTEACH"
## [36] "SCIEEFF" "math" "reading" "science"
# Define new variables
pisa <- mutate(pisa,
# Reorder the levels of grade
grade = factor(grade,
levels = c("Grade 7", "Grade 8", "Grade 9", "Grade 10",
"Grade 11", "Grade 12", "Grade 13",
"Ungraded")),
# Define a numerical grade variable
grade1 = (as.numeric(sapply(grade, function(x) {
if(x=="Grade 7") "7"
else if (x=="Grade 8") "8"
else if (x=="Grade 9") "9"
else if (x=="Grade 10") "10"
else if (x=="Grade 11") "11"
else if (x=="Grade 12") "12"
else if (x=="Grade 13") NA_character_
else if (x=="Ungraded") NA_character_}))),
# Total learning time as hours
learning = round(TMINS/60, 0),
# Science performance based on OECD average (predetermined value)
science_oecd = as.factor(ifelse(science >= 493, "High", "Low")),
# Science performance based on Turkey's average
science_tr = as.factor(ifelse(science >= 422, "High", "Low")))
We can summarize the target variable (science) by categorical independent variables and see if there is any relationship.
# By grade - summary
pisa %>%
group_by(grade) %>%
summarise(Count = n(),
science = mean(science, na.rm = TRUE),
computer = mean(computer, na.rm = TRUE),
software = mean(software, na.rm = TRUE),
internet = mean(internet, na.rm = TRUE),
own.room = mean(own.room, na.rm = TRUE)
)
## # A tibble: 6 x 7
## grade Count science computer software internet own.room
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Grade 7 16 328. 0.0625 0.312 0.0625 0.438
## 2 Grade 8 105 338. 0.303 0.385 0.26 0.324
## 3 Grade 9 1273 386. 0.609 0.427 0.571 0.657
## 4 Grade 10 4308 435. 0.710 0.422 0.666 0.740
## 5 Grade 11 186 418. 0.514 0.324 0.455 0.626
## 6 Grade 12 7 404. 0.571 0.429 0.286 0.714
# By grade and gender - summary
pisa %>%
group_by(grade, gender) %>%
summarise(Count = n(),
science = mean(science, na.rm = TRUE),
computer = mean(computer, na.rm = TRUE),
software = mean(software, na.rm = TRUE),
internet = mean(internet, na.rm = TRUE),
own.room = mean(own.room, na.rm = TRUE)
)
## # A tibble: 12 x 8
## # Groups: grade [6]
## grade gender Count science computer software internet own.room
## <fct> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Grade 7 Female 6 339. 0 0.333 0 0.333
## 2 Grade 7 Male 10 322. 0.1 0.3 0.1 0.5
## 3 Grade 8 Female 42 338. 0.275 0.436 0.220 0.268
## 4 Grade 8 Male 63 338. 0.322 0.351 0.288 0.361
## 5 Grade 9 Female 494 388. 0.602 0.435 0.551 0.651
## 6 Grade 9 Male 779 385. 0.613 0.422 0.584 0.660
## 7 Grade 10 Female 2272 434. 0.710 0.459 0.671 0.750
## 8 Grade 10 Male 2036 437. 0.711 0.382 0.660 0.729
## 9 Grade 11 Female 120 421. 0.504 0.342 0.439 0.633
## 10 Grade 11 Male 66 412. 0.532 0.288 0.484 0.613
## 11 Grade 12 Female 4 408. 0.75 0.5 0.5 1
## 12 Grade 12 Male 3 397. 0.333 0.333 0 0.333
Now we can visualize some of the variables in the data. Let’s view science performance by grade.
ggplot(data = pisa,
mapping = aes(x = grade, y = science)) +
geom_boxplot() +
labs(x=NULL, y="Science Scores") +
theme_bw()
Let’s add horizontal lines representing the OECD (red) and Turkey (blue) averages with geom_hline().
ggplot(data = pisa,
mapping = aes(x = grade, y = science)) +
geom_boxplot() +
labs(x=NULL, y="Science Scores") +
geom_hline(yintercept = 493, linetype="dashed", color = "red", size = 1) +
geom_hline(yintercept = 422, linetype="dashed", color = "blue", size = 1) +
theme_bw()
Add gender as a color layer using fill = gender, keeping the red line for the OECD average.
ggplot(data = pisa,
mapping = aes(x = grade, y = science, fill = gender)) +
geom_boxplot() +
labs(x=NULL, y="Science Scores") +
geom_hline(yintercept = 493, linetype="dashed", color = "red", size = 1) +
theme_bw()
We can also examine the science, reading, and math scores, and their pairwise relationships, by gender.
ggpairs(data = pisa,
mapping = aes(color = gender),
columns = c("reading", "science", "math"),
upper = list(continuous = wrap("cor", size = 4.5)))
Let's also examine the conditional relationship between science scores and weekly learning time.
p1 <- ggplot(data = pisa,
mapping = aes(x = learning, y = science)) +
geom_point() +
geom_smooth(method = "loess") +
labs(x = "Weekly Learning Time", y = "Science Scores") +
theme_bw()
# Replace "histogram" with "boxplot" or "density" for other types
ggMarginal(p1, type = "histogram")
Most importantly, correlations among the variables should be checked:
ggcorr(data = pisa[,c("science", "reading", "computer", "own.room", "quiet.study",
"ESCS", "SCIEEFF", "BELONG", "ANXTEST", "COOPERATE", "learning",
"EMOSUPS", "grade1")],
method = c("pairwise.complete.obs", "pearson"),
label = TRUE, label_size = 4,
low = "red3", high = "green3",
name = "Correlation Scale",
)
We will split our data set into training (70%) and testing (30%) sets. Before splitting, we use listwise deletion to remove cases with missing data.
pisa_nm <- pisa %>%
#Remove missing cases
na.omit() %>%
#Coding qualitative variables into factors
mutate_if(is.character, as.factor)
Now we calculate the proportion of students in each science performance group based on Turkey's average:
pisa_nm %>%
count(science_tr) %>%
mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## science_tr n prop
## <fct> <int> <dbl>
## 1 High 2254 0.533
## 2 Low 1976 0.467
The discrepancy between the proportions of the low and high groups is not extreme; therefore, there is no need to use the strata argument.
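If the two groups were heavily imbalanced, we could stratify the split instead; a sketch using the strata argument of initial_split():
# Not run: stratified split that preserves the High/Low proportions
# pisa_split <- initial_split(pisa_nm, prop = 0.7, strata = science_tr)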
set.seed(442019)
# Put 70% of the data into the training set
pisa_split <- initial_split(pisa_nm, prop = 0.7)
# Create data frames for the two sets:
train <- training(pisa_split) #70%
test <- testing(pisa_split) #30%
We will be using both the rpart and tidymodels approaches to build a decision tree for comparison. The rpart() function has several key arguments:
- formula: defines the outcome (DV; here, science_tr) and the predictors (IVs), separated by ~.
- method: method = "class" fits a classification tree, while method = "anova" fits a regression tree.
- control: set via rpart.control(); minsplit defines the minimum number of observations in a node (default = 20), cp is the complexity parameter used to prune the tree (default = 0.01, with 0 = no pruning), and xval is the number of cross-validations (default = 10, with 0 = no cross-validation).
- parms: the splitting index. It is not used for an anova tree; for a class tree, it can be either "gini" for the Gini index or "information" for the entropy index, the latter being slightly slower computationally.
First, we will build a decision tree dt_fit1 with cp = 0, xval = 0, and Gini splitting.
dt_fit1 <- rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE,
data = train,
method = "class",
control = rpart.control(minsplit = 20,
cp = 0,
xval = 0),
parms = list(split = "gini"))
rpart.plot(dt_fit1)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
As we can see, the plot is very complex with no pruning; that is, we overfit the model with excessive nodes. We will use cp = 0.006 to prune the model, generating a smaller tree with fewer predictors as a result.
dt_fit2 <- rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE,
data = train,
method = "class",
control = rpart.control(minsplit = 20,
cp = 0.006,
xval = 0),
parms = list(split = "gini"))
rpart.plot(dt_fit2)
The model becomes simpler, with each node showing the predicted class, the predicted probability of the second class, and the percentage of observations in the node.
We can make the plot more distinct by adding colors, adjusting the displayed values, and adding node shadows.
rpart.plot(dt_fit2, extra = 8, box.palette = "RdBu", shadow.col = "gray")
We can print the output of our model with printcp():
printcp(dt_fit2)
##
## Classification tree:
## rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
## EMOSUPS + COOPERATE, data = train, method = "class", parms = list(split = "gini"),
## control = rpart.control(minsplit = 20, cp = 0.006, xval = 0))
##
## Variables actually used in tree construction:
## [1] computer EMOSUPS ESCS grade1
##
## Root node error: 1415/2961 = 0.47788
##
## n= 2961
##
## CP nsplit rel error
## 1 0.2091873 0 1.00000
## 2 0.0572438 1 0.79081
## 3 0.0070671 2 0.73357
## 4 0.0067138 4 0.71943
## 5 0.0060000 6 0.70601
For more detail, we can use summary(dt_fit2). caret::varImp() also tells us which variables are more influential in the analysis; the higher the value, the more influential the variable:
varImp(dt_fit2)
## Overall
## computer 134.36152
## COOPERATE 77.78897
## EMOSUPS 77.68769
## ESCS 168.50372
## grade1 108.21253
## own.room 14.22999
Next, we need to apply our tree to the test data set. Below, we obtain the predicted class probabilities (high or low) for the test data with predict() and turn the results into a data frame called dt_pred.
dt_pred <- predict(dt_fit2, test) %>%
as.data.frame()
head(dt_pred)
## High Low
## 1 0.1940299 0.8059701
## 7 0.1940299 0.8059701
## 8 0.4394619 0.5605381
## 12 0.4394619 0.5605381
## 15 0.4394619 0.5605381
## 22 0.4394619 0.5605381
The result above shows each student's probability of falling into either group. We will turn these probabilities into binary classifications (depending on whether the probability of High is >= 0.5) and compare them with the actual classes in the test data to create a confusion matrix.
dt_pred <- mutate(dt_pred, science_tr = as.factor(ifelse(High >= 0.5, "High", "Low"))) %>%
select(science_tr)
confusionMatrix(dt_pred$science_tr, test$science_tr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 506 239
## Low 202 322
##
## Accuracy : 0.6525
## 95% CI : (0.6256, 0.6787)
## No Information Rate : 0.5579
## P-Value [Acc > NIR] : 4.388e-12
##
## Kappa : 0.2907
##
## Mcnemar's Test P-Value : 0.08648
##
## Sensitivity : 0.7147
## Specificity : 0.5740
## Pos Pred Value : 0.6792
## Neg Pred Value : 0.6145
## Prevalence : 0.5579
## Detection Rate : 0.3987
## Detection Prevalence : 0.5871
## Balanced Accuracy : 0.6443
##
## 'Positive' Class : High
##
The output shows that the overall accuracy is 65%, sensitivity is 71%, and specificity is 57%. Adding more variables that are correlated with the outcome may increase the accuracy and sensitivity of the model.
In our previous model, we did not run any cross-validation (xval = 0). K-fold cross-validation is useful when we do not have a test or validation data set, or when our data set is too small to split into training and test sets.
A typical way to use cross-validation with decision trees is to grow the full tree (cp = 0) and let cross-validation suggest how much to prune. We will assume that our data set is not too big and run a 10-fold cross-validation.
dt_fit3 <- rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE,
data = train,
method = "class",
control = rpart.control(minsplit = 20,
cp = 0,
xval = 10),
parms = list(split = "gini"))
We get a large tree again. In the results, we can evaluate the cross-validated error (xerror, shown as the X-val Relative Error in plotcp()) and choose the complexity parameter that gives an acceptable value. Then, we can use this cp value to prune the tree. We can use plotcp() to visualize the cross-validation results.
printcp(dt_fit3)
##
## Classification tree:
## rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
## EMOSUPS + COOPERATE, data = train, method = "class", parms = list(split = "gini"),
## control = rpart.control(minsplit = 20, cp = 0, xval = 10))
##
## Variables actually used in tree construction:
## [1] computer COOPERATE EMOSUPS ESCS grade1 own.room
##
## Root node error: 1415/2961 = 0.47788
##
## n= 2961
##
## CP nsplit rel error xerror xstd
## 1 0.20918728 0 1.00000 1.00000 0.019209
## 2 0.05724382 1 0.79081 0.79081 0.018646
## 3 0.00706714 2 0.73357 0.73357 0.018349
## 4 0.00671378 4 0.71943 0.73640 0.018365
## 5 0.00530035 6 0.70601 0.73428 0.018353
## 6 0.00459364 8 0.69541 0.73710 0.018369
## 7 0.00424028 18 0.64523 0.73569 0.018361
## 8 0.00353357 19 0.64099 0.73993 0.018385
## 9 0.00282686 25 0.61979 0.73922 0.018381
## 10 0.00247350 31 0.60283 0.71943 0.018266
## 11 0.00212014 33 0.59788 0.73004 0.018329
## 12 0.00176678 37 0.58940 0.73074 0.018333
## 13 0.00164900 39 0.58587 0.73852 0.018377
## 14 0.00159011 42 0.58092 0.73852 0.018377
## 15 0.00141343 47 0.57244 0.73922 0.018381
## 16 0.00113074 60 0.55406 0.74558 0.018417
## 17 0.00106007 65 0.54841 0.75548 0.018470
## 18 0.00098940 78 0.53145 0.76254 0.018507
## 19 0.00070671 85 0.52367 0.77173 0.018554
## 20 0.00053004 97 0.51449 0.78445 0.018616
## 21 0.00047114 106 0.50671 0.79293 0.018656
## 22 0.00035336 110 0.50459 0.80141 0.018694
## 23 0.00030288 120 0.50106 0.81343 0.018746
## 24 0.00023557 127 0.49894 0.81343 0.018746
## 25 0.00017668 130 0.49823 0.81908 0.018769
## 26 0.00000000 142 0.49611 0.81837 0.018766
plotcp(dt_fit3)
Based on the results above, we can specify a cp value near the lowest cross-validated error (here, cp = 0.0034) and re-run the model without cross-validation.
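Equivalently, we could pick the cp value programmatically from the cp table and prune the existing tree instead of refitting; a sketch:
# cp value with the lowest cross-validated error
best_cp <- dt_fit3$cptable[which.min(dt_fit3$cptable[, "xerror"]), "CP"]
# Prune the full tree at that cp (same idea as refitting with this cp)
dt_pruned <- prune(dt_fit3, cp = best_cp)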
dt_fit4 <- rpart(formula = science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE,
data = train,
method = "class",
control = rpart.control(minsplit = 20,
cp = 0.0034,
xval = 0),
parms = list(split = "gini"))
For the final fit, we obtain the model predictions for the test set as a data frame and compare the predicted results with the actual results using a confusion matrix.
dt_pred_final <- predict(dt_fit4, test) %>%
as.data.frame()
head(dt_pred_final)
## High Low
## 1 0.1940299 0.8059701
## 7 0.1940299 0.8059701
## 8 0.2478632 0.7521368
## 12 0.2478632 0.7521368
## 15 0.2478632 0.7521368
## 22 0.3989637 0.6010363
dt_pred_final <- mutate(dt_pred_final, science_tr = as.factor(ifelse(High >= 0.5, "High", "Low"))) %>%
select(science_tr)
confusionMatrix(dt_pred_final$science_tr, test$science_tr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 502 224
## Low 206 337
##
## Accuracy : 0.6612
## 95% CI : (0.6344, 0.6872)
## No Information Rate : 0.5579
## P-Value [Acc > NIR] : 4.21e-14
##
## Kappa : 0.3108
##
## Mcnemar's Test P-Value : 0.4123
##
## Sensitivity : 0.7090
## Specificity : 0.6007
## Pos Pred Value : 0.6915
## Neg Pred Value : 0.6206
## Prevalence : 0.5579
## Detection Rate : 0.3956
## Detection Prevalence : 0.5721
## Balanced Accuracy : 0.6549
##
## 'Positive' Class : High
##
The final model yields an overall accuracy of 66%, sensitivity of 71%, and specificity of 60%, a modest improvement over the dt_fit2 model achieved by adjusting the complexity parameter.
### tidymodels
We will also try creating the decision tree model with the tidymodels framework for future reference. First, we will indicate the hyperparameters to tune, namely the cost complexity (cost_complexity) and the minimum node size (min_n).
tune_dt <-
decision_tree(
cost_complexity = tune(),
min_n = tune()
) %>%
set_engine("rpart") %>%
set_mode("classification")
tune_dt
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = tune()
## min_n = tune()
##
## Computational engine: rpart
To find the optimal values for our hyperparameters, we can train multiple models using resampled data. To get there, we create a grid of values for both hyperparameters with the dials package.
tree_grid <- grid_regular(cost_complexity(), #identifying the hyperparameters
min_n(),
levels = 5) #five possible values for each parameter
tree_grid
## # A tibble: 25 x 2
## cost_complexity min_n
## <dbl> <int>
## 1 0.0000000001 2
## 2 0.0000000178 2
## 3 0.00000316 2
## 4 0.000562 2
## 5 0.1 2
## 6 0.0000000001 11
## 7 0.0000000178 11
## 8 0.00000316 11
## 9 0.000562 11
## 10 0.1 11
## # ... with 15 more rows
#possible values for min_n
tree_grid %>%
count(min_n)
## # A tibble: 5 x 2
## min_n n
## <int> <int>
## 1 2 5
## 2 11 5
## 3 21 5
## 4 30 5
## 5 40 5
#setting a v-fold cross validation
set.seed(234)
pisa_folds <- vfold_cv(train) #default = 10
Next, we will fit the model at all the different values we chose for each tuned hyperparameter. We will tune a workflow() that bundles the model specification with the formula.
#workflow
set.seed(345)
tree_wf <- workflow() %>%
add_model(tune_dt) %>%
add_formula(science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE)
tree_wf
## == Workflow ===========================================================================================================
## Preprocessor: Formula
## Model: decision_tree()
##
## -- Preprocessor -------------------------------------------------------------------------------------------------------
## science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS +
## COOPERATE
##
## -- Model --------------------------------------------------------------------------------------------------------------
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = tune()
## min_n = tune()
##
## Computational engine: rpart
#resampling with the workflow
tree_res <-
tree_wf %>%
tune_grid(
resamples = pisa_folds, #the fold
grid = tree_grid #the hyperparameter grid
)
tree_res
## # 10-fold cross-validation
## # A tibble: 10 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [2.7K/297]> Fold01 <tibble [50 x 5]> <tibble [0 x 1]>
## 2 <split [2.7K/296]> Fold02 <tibble [50 x 5]> <tibble [0 x 1]>
## 3 <split [2.7K/296]> Fold03 <tibble [50 x 5]> <tibble [0 x 1]>
## 4 <split [2.7K/296]> Fold04 <tibble [50 x 5]> <tibble [0 x 1]>
## 5 <split [2.7K/296]> Fold05 <tibble [50 x 5]> <tibble [0 x 1]>
## 6 <split [2.7K/296]> Fold06 <tibble [50 x 5]> <tibble [0 x 1]>
## 7 <split [2.7K/296]> Fold07 <tibble [50 x 5]> <tibble [0 x 1]>
## 8 <split [2.7K/296]> Fold08 <tibble [50 x 5]> <tibble [0 x 1]>
## 9 <split [2.7K/296]> Fold09 <tibble [50 x 5]> <tibble [0 x 1]>
## 10 <split [2.7K/296]> Fold10 <tibble [50 x 5]> <tibble [0 x 1]>
Now we have our tuning results. Let’s collect and visualize them with collect_metrics().
tree_res %>%
collect_metrics()
## # A tibble: 50 x 7
## cost_complexity min_n .metric .estimator mean n std_err
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 2 accuracy binary 0.566 10 0.0125
## 2 0.0000000001 2 roc_auc binary 0.565 10 0.0126
## 3 0.0000000001 11 accuracy binary 0.591 10 0.00882
## 4 0.0000000001 11 roc_auc binary 0.633 10 0.0127
## 5 0.0000000001 21 accuracy binary 0.602 10 0.0115
## 6 0.0000000001 21 roc_auc binary 0.650 10 0.0112
## 7 0.0000000001 30 accuracy binary 0.614 10 0.0107
## 8 0.0000000001 30 roc_auc binary 0.665 10 0.0128
## 9 0.0000000001 40 accuracy binary 0.620 10 0.00954
## 10 0.0000000001 40 roc_auc binary 0.674 10 0.0118
## # ... with 40 more rows
#plot
tree_res %>%
collect_metrics() %>%
mutate(min_n = factor(min_n)) %>%
ggplot(aes(cost_complexity, mean, color = min_n)) +
geom_line(size = 1.5, alpha = 0.6) +
geom_point(size = 2) +
facet_wrap(~ .metric, scales = "free", nrow = 2) +
#Transform the x-axis into log10 scale
scale_x_log10(labels = scales::label_number()) +
scale_color_viridis_d(option = "plasma", begin = .9, end = 0)
From the plot above, min_n = 2 performs worst in terms of accuracy and AUC across all values of cost_complexity. The larger min_n is, the better the model performs, with 40 being the best candidate. Let us see the best candidates.
tree_res %>%
show_best("roc_auc")
## # A tibble: 5 x 7
## cost_complexity min_n .metric .estimator mean n std_err
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.000562 40 roc_auc binary 0.676 10 0.0110
## 2 0.0000000001 40 roc_auc binary 0.674 10 0.0118
## 3 0.0000000178 40 roc_auc binary 0.674 10 0.0118
## 4 0.00000316 40 roc_auc binary 0.674 10 0.0118
## 5 0.000562 30 roc_auc binary 0.671 10 0.0114
tree_res %>%
show_best("accuracy")
## # A tibble: 5 x 7
## cost_complexity min_n .metric .estimator mean n std_err
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.000562 40 accuracy binary 0.627 10 0.00926
## 2 0.000562 30 accuracy binary 0.622 10 0.0105
## 3 0.1 2 accuracy binary 0.622 10 0.00882
## 4 0.1 11 accuracy binary 0.622 10 0.00882
## 5 0.1 21 accuracy binary 0.622 10 0.00882
We will use select_best() to pull out the best hyperparameter combination.
best_tree <- tree_res %>%
select_best(metric = "roc_auc")
best_tree
## # A tibble: 1 x 2
## cost_complexity min_n
## <dbl> <int>
## 1 0.000562 40
The values that maximize AUC in this PISA data set are a cost complexity of 0.000562 and a minimum node size of 40. We can now finalize our workflow tree_wf with the values from select_best().
final_wf <-
tree_wf %>%
finalize_workflow(best_tree)
final_wf
## == Workflow ===========================================================================================================
## Preprocessor: Formula
## Model: decision_tree()
##
## -- Preprocessor -------------------------------------------------------------------------------------------------------
## science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS +
## COOPERATE
##
## -- Model --------------------------------------------------------------------------------------------------------------
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = 0.000562341325190349
## min_n = 40
##
## Computational engine: rpart
Our tuning is done!
Now, to fit the finalized workflow to the training data.
final_tree <-
final_wf %>%
fit(data = train)
final_tree
## == Workflow [trained] =================================================================================================
## Preprocessor: Formula
## Model: decision_tree()
##
## -- Preprocessor -------------------------------------------------------------------------------------------------------
## science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS +
## COOPERATE
##
## -- Model --------------------------------------------------------------------------------------------------------------
## n= 2961
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2961 1415 High (0.52212091 0.47787909)
## 2) grade1>=9.5 2309 941 High (0.59246427 0.40753573)
## 4) computer>=0.5 1640 566 High (0.65487805 0.34512195)
## 8) ESCS>=-0.5287 536 123 High (0.77052239 0.22947761)
## 16) COOPERATE>=-0.0322 265 43 High (0.83773585 0.16226415) *
## 17) COOPERATE< -0.0322 271 80 High (0.70479705 0.29520295)
## 34) ESCS>=0.16805 135 29 High (0.78518519 0.21481481) *
## 35) ESCS< 0.16805 136 51 High (0.62500000 0.37500000)
## 70) ESCS< -0.407 25 4 High (0.84000000 0.16000000) *
## 71) ESCS>=-0.407 111 47 High (0.57657658 0.42342342)
## 142) ESCS>=-0.17595 69 24 High (0.65217391 0.34782609) *
## 143) ESCS< -0.17595 42 19 Low (0.45238095 0.54761905)
## 286) COOPERATE>=-0.5287 20 8 High (0.60000000 0.40000000) *
## 287) COOPERATE< -0.5287 22 7 Low (0.31818182 0.68181818) *
## 9) ESCS< -0.5287 1104 443 High (0.59873188 0.40126812)
## 18) EMOSUPS>=-2.3605 1073 418 High (0.61043802 0.38956198)
## 36) COOPERATE>=-0.17565 512 167 High (0.67382812 0.32617188)
## 72) COOPERATE< 0.86235 261 72 High (0.72413793 0.27586207) *
## 73) COOPERATE>=0.86235 251 95 High (0.62151394 0.37848606)
## 146) ESCS>=-2.5894 223 80 High (0.64125561 0.35874439)
## 292) ESCS< -2.37835 21 4 High (0.80952381 0.19047619) *
## 293) ESCS>=-2.37835 202 76 High (0.62376238 0.37623762)
## 586) ESCS>=-0.70875 19 4 High (0.78947368 0.21052632) *
## 587) ESCS< -0.70875 183 72 High (0.60655738 0.39344262)
## 1174) ESCS< -1.0688 141 51 High (0.63829787 0.36170213) *
## 1175) ESCS>=-1.0688 42 21 High (0.50000000 0.50000000)
## 2350) ESCS>=-0.80835 16 6 High (0.62500000 0.37500000) *
## 2351) ESCS< -0.80835 26 11 Low (0.42307692 0.57692308) *
## 147) ESCS< -2.5894 28 13 Low (0.46428571 0.53571429) *
## 37) COOPERATE< -0.17565 561 251 High (0.55258467 0.44741533)
## 74) ESCS>=-2.73035 528 228 High (0.56818182 0.43181818)
## 148) EMOSUPS>=0.425 145 50 High (0.65517241 0.34482759)
## 296) ESCS>=-0.9385 36 8 High (0.77777778 0.22222222) *
## 297) ESCS< -0.9385 109 42 High (0.61467890 0.38532110)
## 594) ESCS< -1.185 93 32 High (0.65591398 0.34408602) *
## 595) ESCS>=-1.185 16 6 Low (0.37500000 0.62500000) *
## 149) EMOSUPS< 0.425 383 178 High (0.53524804 0.46475196)
## 298) COOPERATE< -0.45625 222 91 High (0.59009009 0.40990991)
## 596) COOPERATE>=-1.0846 146 49 High (0.66438356 0.33561644)
## 1192) ESCS< -0.75945 132 39 High (0.70454545 0.29545455)
## 2384) COOPERATE>=-0.63175 18 2 High (0.88888889 0.11111111) *
## 2385) COOPERATE< -0.63175 114 37 High (0.67543860 0.32456140)
## 4770) ESCS< -1.98885 34 7 High (0.79411765 0.20588235) *
## 4771) ESCS>=-1.98885 80 30 High (0.62500000 0.37500000)
## 9542) ESCS>=-1.613 47 12 High (0.74468085 0.25531915) *
##
## ...
## and 70 more lines.
Now that we have the final tree fitted within the model workflow, we can investigate variable importance with pull_workflow_fit() and vip().
final_tree %>%
pull_workflow_fit() %>%
vip::vip()
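We could also plot the fitted tree itself. As a sketch, pull_workflow_fit() returns a parsnip fit object whose fit element holds the underlying rpart model:
# Extract the rpart object from the fitted workflow and plot it
rpart.plot(pull_workflow_fit(final_tree)$fit, roundint = FALSE)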
Now we will fit the finalized model to the full training set and evaluate it on the test set with last_fit().
final_fit <-
final_wf %>%
last_fit(pisa_split)
final_fit %>%
collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.629
## 2 roc_auc binary 0.671
#ROC curves
final_fit %>%
collect_predictions() %>%
roc_curve(science_tr, .pred_High) %>%
autoplot()
Now we can compare the rpart and tidymodels approaches for future reference. The final model from the non-tidy method has an overall accuracy of 66%, while the tidy method yields an accuracy of about 63% and an ROC AUC of 0.67.
For Random Forest (RF), we use the randomForest and caret packages to apply the method to classification and regression problems. The usage is similar to rpart, with the essential arguments as follows:
- formula: as in rpart, defines the DV and the predictors (IVs), separated by ~.
- importance: if importance = TRUE, the importance of the predictors is assessed in the model.
- ntree: the number of trees to grow.
In this practice, we will work on the same problem as with the decision tree model. The initial ntree is 1000, but we will evaluate and adjust it as we proceed.
rf_fit1 <- randomForest(formula = science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS + COOPERATE,
data = train,
importance = TRUE,
ntree = 1000)
print(rf_fit1)
##
## Call:
## randomForest(formula = science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS + COOPERATE, data = train, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 34.28%
## Confusion matrix:
## High Low class.error
## High 1167 379 0.2451488
## Low 636 779 0.4494700
In the output, we see the confusion matrix along with the classification error and the out-of-bag (OOB) error estimate.
OOB error measures the prediction error of a random forest by predicting each training observation using only the trees that did not include that observation in their bootstrap sample, and averaging the resulting errors.
The result above shows an OOB error of about 34%, and a classification error of about 25% for the high group and 45% for the low group.
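These error rates are also stored in the fitted object; as a quick check (assuming the standard randomForest object structure), the last row of err.rate corresponds to the full 1,000-tree forest:
# OOB and per-class error rates after the final tree
tail(rf_fit1$err.rate, 1)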
Next, by checking the error level across the number of trees, we can determine a suitable number of trees for our model.
plot(rf_fit1)
The plot shows that the error level does not decrease much further after roughly 50 trees, so we can run our model again with ntree = 50 this time.
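For a rough numerical check of where the OOB error bottoms out (a sketch; the curve is noisy, so treat this as a guide rather than an exact answer):
# Number of trees at which the cumulative OOB error is lowest
which.min(rf_fit1$err.rate[, "OOB"])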
rf_fit2 <- randomForest(formula = science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS + COOPERATE,
data = train,
importance = TRUE,
ntree = 50)
print(rf_fit2)
##
## Call:
## randomForest(formula = science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS + COOPERATE, data = train, importance = TRUE, ntree = 50)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 35.7%
## Confusion matrix:
## High Low class.error
## High 1141 405 0.2619664
## Low 652 763 0.4607774
We can see the overall accuracy of the model as follows:
sum(diag(rf_fit2$confusion)) / nrow(train)
## [1] 0.643026
As we did for the decision tree, we can check the importance of the predictors in the model with importance() and varImpPlot(). With importance(), we first extract the importance measures (1), turn them into a data.frame (2), save the row names as predictor names (3), and finally sort the data by MeanDecreaseGini (4). (You can also see the basic output using importance(rf_fit2) alone.)
importance(rf_fit2) %>% #(1)
as.data.frame() %>% #(2)
mutate(Predictors = row.names(.)) %>% #(3)
arrange(desc(MeanDecreaseGini)) #(4)
## High Low MeanDecreaseAccuracy MeanDecreaseGini Predictors
## 1 6.6211406 10.13625762 12.6670933 321.93721 ESCS
## 2 1.6314780 1.92065466 2.5665387 188.97823 COOPERATE
## 3 2.1401708 0.91865485 2.3243344 171.39810 EMOSUPS
## 4 12.5223945 16.50006386 16.1166758 95.93737 grade1
## 5 5.4397193 5.51937606 8.0501893 48.04911 computer
## 6 0.8734394 0.05139286 0.7653936 26.80365 own.room
varImpPlot(rf_fit2,
main = "Importance of Variables for Science Performance")
The output shows different importance measures for the predictors used in the model. MeanDecreaseAccuracy represents the mean decrease in classification accuracy (or, for regression, the increase in mean squared error) when the variable is permuted, while MeanDecreaseGini represents the total decrease in node impurity from splitting on the variable, averaged over all trees.
In the output, ESCS and grade1 are the two predictors that influence model performance the most, whereas own.room and EMOSUPS are among the least important variables. varImpPlot() visualizes these results.
Next, we check the confusion matrix to see the accuracy, sensitivity, and specificity of our model to the test set.
rf_pred <- predict(rf_fit2, test) %>% #predict
as.data.frame() %>% #coerce into a data.frame
mutate(science_tr = as.factor(`.`)) %>% #transform the DV into factor
select(science_tr)
confusionMatrix(rf_pred$science_tr, test$science_tr) #compare with the actual result
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 544 267
## Low 164 294
##
## Accuracy : 0.6604
## 95% CI : (0.6336, 0.6864)
## No Information Rate : 0.5579
## P-Value [Acc > NIR] : 6.535e-14
##
## Kappa : 0.2981
##
## Mcnemar's Test P-Value : 8.962e-07
##
## Sensitivity : 0.7684
## Specificity : 0.5241
## Pos Pred Value : 0.6708
## Neg Pred Value : 0.6419
## Prevalence : 0.5579
## Detection Rate : 0.4287
## Detection Prevalence : 0.6391
## Balanced Accuracy : 0.6462
##
## 'Positive' Class : High
##
The RF model's overall accuracy is 66%, with 77% sensitivity and 52% specificity.
### tidymodels
We will use validation_split() to split the training set into a single resampled validation set (20%) to measure the model's performance and a smaller training set (80%) to train the model.
set.seed(678)
pisa_val <- validation_split(train, prop = 0.80) #indicating the proportion
pisa_val
## # Validation Set Split (0.8/0.2)
## # A tibble: 1 x 2
## splits id
## <list> <chr>
## 1 <split [2.4K/592]> validation
For this method, we will tune two hyperparameters: mtry, the number of predictors sampled at each split in a tree, and min_n, the minimum number of observations required to split a node.
cores <- parallel::detectCores()
cores
## [1] 8
#specify the model
rf_mod <-
rand_forest(mtry = tune(), min_n = tune(), trees = 50) %>%
set_engine("ranger", num.threads = cores) %>%
set_mode("classification")
rf_mod
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 50
## min_n = tune()
##
## Engine-Specific Arguments:
## num.threads = cores
##
## Computational engine: ranger
#show what will be tuned
rf_mod %>%
parameters()
## Collection of 2 parameters for tuning
##
## id parameter type object class
## mtry mtry nparam[?]
## min_n min_n nparam[+]
##
## Model parameters needing finalization:
## # Randomly Selected Predictors ('mtry')
##
## See `?dials::finalize` or `?dials::update.parameters` for more information.
#create a workflow
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_formula(science_tr ~ grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE)
rf_workflow
## == Workflow ===========================================================================================================
## Preprocessor: Formula
## Model: rand_forest()
##
## -- Preprocessor -------------------------------------------------------------------------------------------------------
## science_tr ~ grade1 + computer + own.room + ESCS + EMOSUPS +
## COOPERATE
##
## -- Model --------------------------------------------------------------------------------------------------------------
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 50
## min_n = tune()
##
## Engine-Specific Arguments:
## num.threads = cores
##
## Computational engine: ranger
We will use a space-filling design to tune the hyperparameters, with 25 candidate models:
set.seed(102030) #For replicability
rf_res <-
rf_workflow %>%
tune_grid(pisa_val, #pisa_val contains the analysis and validation splits
grid = 25, #intended number of candidate models
control = control_grid(save_pred = TRUE), #retain this prediction value for diagnosis.
metrics = metric_set(roc_auc)) #ask for the AUC.
## i Creating pre-processing data to finalize unknown parameter: mtry
Here are our top 5 random forest models, out of the 25 candidates in terms of AUC:
rf_res %>%
show_best(metric = "roc_auc")
## # A tibble: 5 x 7
## mtry min_n .metric .estimator mean n std_err
## <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 1 11 roc_auc binary 0.704 1 NA
## 2 1 23 roc_auc binary 0.703 1 NA
## 3 2 34 roc_auc binary 0.702 1 NA
## 4 2 38 roc_auc binary 0.697 1 NA
## 5 2 31 roc_auc binary 0.695 1 NA
The top five models all perform better than the manually tuned model from the non-tidy RF above. Next, we plot both hyperparameters to see the performance of all model candidates:
autoplot(rf_res)
The y-axis range is 68% to 70%, which is better than the manually tuned model. Let’s select the best model according to the ROC AUC metric. Our final tuning parameter values are:
rf_best <-
rf_res %>% #The model
select_best(metric = "roc_auc")
rf_best
## # A tibble: 1 x 2
## mtry min_n
## <int> <int>
## 1 1 11
To calculate the data needed to plot the ROC curve, we use collect_predictions().
rf_res %>%
collect_predictions()
## # A tibble: 14,800 x 7
## id .pred_High .pred_Low .row mtry min_n science_tr
## <chr> <dbl> <dbl> <int> <int> <int> <fct>
## 1 validation 0.127 0.873 6 4 28 Low
## 2 validation 0.537 0.463 14 4 28 High
## 3 validation 0.230 0.770 19 4 28 High
## 4 validation 0.363 0.637 22 4 28 Low
## 5 validation 0.153 0.847 24 4 28 Low
## 6 validation 0.286 0.714 29 4 28 Low
## 7 validation 0.544 0.456 40 4 28 High
## 8 validation 0.256 0.744 50 4 28 High
## 9 validation 0.499 0.501 54 4 28 Low
## 10 validation 0.607 0.393 55 4 28 High
## # ... with 14,790 more rows
Now, to collect the predictions for the best model candidate and compute its ROC curve:
rf_auc <-
rf_res %>%
collect_predictions(parameters = rf_best) %>% #to collect only the best model.
roc_curve(science_tr, .pred_High) %>% #use only these two columns.
mutate(model = "Random Forest") #specify the model.
rf_auc
## # A tibble: 572 x 4
## .threshold specificity sensitivity model
## <dbl> <dbl> <dbl> <chr>
## 1 -Inf 0 1 Random Forest
## 2 0.157 0 1 Random Forest
## 3 0.163 0.00360 1 Random Forest
## 4 0.164 0.00719 1 Random Forest
## 5 0.165 0.0108 1 Random Forest
## 6 0.166 0.0144 1 Random Forest
## 7 0.171 0.0144 0.997 Random Forest
## 8 0.172 0.0180 0.997 Random Forest
## 9 0.178 0.0216 0.997 Random Forest
## 10 0.185 0.0252 0.997 Random Forest
## # ... with 562 more rows
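We do not plot this curve separately in the original workflow, but a quick ggplot2 sketch of it would be:
rf_auc %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(linetype = "dashed") +
  coord_equal() +
  theme_bw()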
Our last task is to fit the final model on all rows of the training data and evaluate its performance on the test set.
# the last model
last_rf_mod <-
rand_forest(mtry = 1, min_n = 11, trees = 50) %>% #the best candidate
set_engine("ranger", num.threads = cores, importance = "impurity") %>%
set_mode("classification")
# the last workflow
last_rf_workflow <-
rf_workflow %>% #random forest workflow
update_model(last_rf_mod) #update the workflow with the above model.
# the last fit
set.seed(6431)
last_rf_fit <-
last_rf_workflow %>%
last_fit(pisa_split) #Fit the final best model to the training set and evaluate the test set. "splits" contains both the training and the testing set.
last_rf_fit
## # Monte Carlo cross-validation (0.7/0.3) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [3K/1~ train/test ~ <tibble [2 ~ <tibble [0~ <tibble [1,269 ~ <workflo~
This fitted workflow contains everything, including our final metrics based on the test set. Let’s see how it performs.
last_rf_fit %>%
collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.677
## 2 roc_auc binary 0.724
The ROC AUC from the test set is not far from what we obtained when tuning the model on the validation set. We can access the variable importance scores via the .workflow column.
last_rf_fit %>%
pluck(".workflow", 1) %>% #take out the first element in the ".workflow" column.
pull_workflow_fit() %>% #Extract element of the fitted model.
vip::vip() #Display variable importance
The most important variable is ESCS, followed by grade1, COOPERATE, and EMOSUPS. The VIP plot is similar to the one from the non-tidy method above.
Let’s generate our last ROC curve to visualize.
last_rf_fit %>%
collect_predictions() %>% #Collect the prediction metrics
roc_curve(science_tr, .pred_High) %>% #constructs the full ROC curve and provide it with classification probability.
autoplot()
To fit support vector classifiers (and SVMs) in R, we will use the e1071 package. The svm() function in e1071 requires the outcome variable to be a factor, like the science_tr variable we created earlier. If the outcome is not a factor, svm() will perform regression.
We will use the svm() function with the kernel = "linear" argument. For the hyperparameter, we need to specify our tolerance, which is represented inversely by the cost argument: when cost is low, the tolerance is high (a wide margin and many support vectors), and vice versa.
cost is set to 1 by default, and we will tune this parameter via cross-validation. But for now, let's see what the model looks like without tuning.
svc_fit <- svm(formula = science_tr ~ reading + math + grade1 + computer + own.room +
ESCS + EMOSUPS + COOPERATE,
data = train,
kernel = "linear")
#see the result
summary(svc_fit)
##
## Call:
## svm(formula = science_tr ~ reading + math + grade1 + computer + own.room +
## ESCS + EMOSUPS + COOPERATE, data = train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 604
##
## ( 302 302 )
##
##
## Number of Classes: 2
##
## Levels:
## High Low
We can see that there are 604 support vectors: 302 in class “High” and 302 in class “Low”. We can also plot our model, but we need to specify the two features to plot (because our model has eight features). Let's look at the model with reading on the y-axis and math on the x-axis.
plot(svc_fit, data = train, reading ~ math)
In this figure, the Xs are the support vectors, while the Os are the non-support-vector observations; the upper triangle is for “High”, while the lower triangle is for “Low”. Although the decision boundary looks jagged, that is just an artifact of how this function draws it.
We can see that many observations are misclassified (i.e., some red points are in the higher triangle and some black points are in the lower triangle). However, there are a lot of observations shown in this figure and it is difficult to discern the nature of the misclassification. Therefore, let’s take a random sample of 500 observations to get a better sense of our classifier.
set.seed(99)
ran_obs <- sample(1:nrow(train), 500) #generate a subset of data
plot(svc_fit, data = train[ran_obs, ], reading ~ math)
To further improve the model, we can tune the cost hyperparameter via cross-validation for optimal classification. We can use e1071::tune(), which runs 10-fold CV by default. We need to provide the tune function with a range of cost values (which, again, corresponds to our tolerance for violations of the margin and hyperplane).
tune_svc <- e1071::tune(svm,
science_tr ~ reading + math + grade1 + computer + own.room +
ESCS + EMOSUPS + COOPERATE,
data = train,
kernel="linear",
ranges = list(cost = c(.01, .1, 1, 5, 10)))
summary(tune_svc)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.08510329
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.08510329 0.01393305
## 2 0.10 0.08780485 0.01508432
## 3 1.00 0.08949290 0.01567712
## 4 5.00 0.08881723 0.01629430
## 5 10.00 0.08881723 0.01590039
Now, to select the best model and view it.
best_svc <- tune_svc$best.model
summary(best_svc)
##
## Call:
## best.tune(method = svm, train.x = science_tr ~ reading + math + grade1 +
## computer + own.room + ESCS + EMOSUPS + COOPERATE, data = train,
## ranges = list(cost = c(0.01, 0.1, 1, 5, 10)), kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 945
##
## ( 474 471 )
##
##
## Number of Classes: 2
##
## Levels:
## High Low
Next, we write a function to evaluate our classifier; it takes a confusion matrix as its argument.
eval_classifier <- function(tab, print = F)
{
  # tab: a 2x2 confusion matrix with observed classes in rows and predicted
  # classes in columns, with the positive class ("High") listed first
  n <- sum(tab)
  TN <- tab[2,2] # observed negative, predicted negative
  FP <- tab[2,1] # observed negative, predicted positive
  FN <- tab[1,2] # observed positive, predicted negative
  TP <- tab[1,1] # observed positive, predicted positive
  classify.rate <- (TP + TN) / n # accuracy
  TP.rate <- TP / (TP + FN) # sensitivity
  TN.rate <- TN / (TN + FP) # specificity
  object <- data.frame(accuracy = classify.rate,
                       sensitivity = TP.rate,
                       specificity = TN.rate)
  return(object)
}
A confusion matrix for our best_svc can be created by:
svc_train <- table(train$science_tr, predict(best_svc)) #observed value, predicted value
eval_classifier(svc_train)
## accuracy sensitivity specificity
## 1 0.9145559 0.9075032 0.9222615
These statistics are overly optimistic as we are evaluating our model using the training data. Let’s see how it works with the test data.
svc_test <- table(test$science_tr, predict(best_svc, newdata = test))
eval_classifier(svc_test)
## accuracy sensitivity specificity
## 1 0.8967691 0.8983051 0.8948307
The statistics drop a bit but are still good. This is likely because math and reading are highly correlated with science scores.
Support vector classifiers are quite similar to logistic regression in terms of the loss functions used to estimate parameters. In situations where the classes are well separated, SVMs tend to do better; conversely, logistic regression tends to work better when the classes are not well separated.
lr_fit <- glm(science_tr ~ reading + math + grade1 + computer + own.room + ESCS +
EMOSUPS + COOPERATE,
data = train,
family = "binomial")
#view the coefficient
coef(lr_fit)
## (Intercept) reading math grade1 computer own.room
## 36.46185453 -0.04268969 -0.04191258 -0.08310064 0.08228638 0.18142714
## ESCS EMOSUPS COOPERATE
## 0.02039443 -0.01612452 -0.01390496
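Note that with a factor outcome, glm() models the probability of the second factor level (here, "Low"), so the negative coefficients for reading and math mean that higher scores lower the odds of being in the Low group. As a small sketch, exponentiating the coefficients gives odds ratios:
# Odds ratios for membership in the "Low" group (the second factor level)
exp(coef(lr_fit))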
How does it do relative to our best support vector classifier on the training and testing data sets? Let's look at the training data first.
lr_train <- table(train$science_tr,
round(predict(lr_fit, type = "response")))
lr_train
##
## 0 1
## High 1407 139
## Low 117 1298
eval_classifier(lr_train)
## accuracy sensitivity specificity
## 1 0.9135427 0.9100906 0.9173145
and then for the test data set.
lr_test <- table(test$science_tr,
round(predict(lr_fit, newdata = test, type = "response")))
lr_test
##
## 0 1
## High 638 70
## Low 68 493
eval_classifier(lr_test)
## accuracy sensitivity specificity
## 1 0.891253 0.9011299 0.8787879
Reference: https://github.com/okanbulut/edm