Previously on STAT412:

  • Support Vector Machines

Naive Bayes Classifier

The Naive Bayes classifier is a supervised classification algorithm in machine learning. It is particularly popular in Natural Language Processing (NLP). The algorithm is based on Bayes' theorem, formulated by Thomas Bayes, so we must first understand Bayes' theorem before using the classifier.

The essence of Bayes' theorem is conditional probability: the probability that an event occurs given that another event has already occurred. Using conditional probability, we can compute the probability of an event given knowledge of a previous event.

\[ P(A|B)=\frac{P(B|A)P(A)}{P(B)} \]

Because the marginal probability P(B) in the denominator is the same for every class, it can be ignored when comparing classes in the Naive Bayes classifier. The classifier then assigns a data point to the class with the largest posterior probability.

The Naive Bayes classifier assumes that the features are conditionally independent given the class (this is what makes it "naive"). In addition, the likelihood P(x | y) is modeled with one of a few standard distributions: Gaussian, multinomial, or Bernoulli.
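Under the conditional-independence assumption, the posterior factorizes and the classifier picks the class with the largest value:

\[ \hat{y}=\arg\max_{y} P(y)\prod_{i=1}^{n}P(x_i|y) \]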

Manual Application:

Assume we have the following data and we want to classify new data with the following criteria:

Age = 21–30

Income = Medium

Status = Married

1. First, calculate the prior probabilities P(y):

  • P(Buy = yes) = 7/15 = 0.467

  • P(Buy = no) = 8/15 = 0.533

2. Second, calculate the likelihood of each feature, P(x | y):

  • P(Age = 21–30 | Buy = yes) = 2/7 = 0.285

  • P(Age = 21–30 | Buy = no) = 3/8 = 0.375

  • P(Income = Medium | Buy = yes) = 1/7 = 0.143

  • P(Income = Medium | Buy = no) = 5/8 = 0.625

  • P(Status = Married | Buy = yes) = 4/7 = 0.571

  • P(Status = Married | Buy = no) = 5/8 = 0.625

3. Third, calculate the total likelihood by multiplying the individual likelihoods (the "naive" independence assumption):

  • P(x | Buy = yes) = 0.285 * 0.143 * 0.571 = 0.0233

  • P(x | Buy = no) = 0.375 * 0.625 * 0.625 = 0.1464

4. Fourth, calculate the posterior probability P(y | x) up to the common factor P(x), i.e., P(x | y) P(y):

  • P(x | Buy = yes) * P(Buy = yes) = 0.0233 * 0.467 = 0.0108

  • P(x | Buy = no) * P(Buy = no) = 0.1464 * 0.533 = 0.0781

Since 0.0781 > 0.0108, the posterior is larger for the "no" class, so the new data point is classified as not buying.
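The same calculation can be reproduced in R. Below is a minimal sketch using only the probabilities derived above (small differences from the hand calculation are due to rounding):

# Priors and likelihoods taken from the manual example above
prior    <- c(yes = 7/15, no = 8/15)
lik_age  <- c(yes = 2/7,  no = 3/8)    # P(Age = 21-30 | Buy)
lik_inc  <- c(yes = 1/7,  no = 5/8)    # P(Income = Medium | Buy)
lik_stat <- c(yes = 4/7,  no = 5/8)    # P(Status = Married | Buy)

# Unnormalized posteriors: P(y) * P(x | y)
post <- prior * lik_age * lik_inc * lik_stat
round(post, 4)          # yes: 0.0109, no: 0.0781 (0.0109 vs 0.0108 above is only rounding)
names(which.max(post))  # "no" -> the new data point is classified as not buying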

Computational Application:

For this example, we will use the Pima Indians Diabetes Database from the mlbench package. The data set is about predicting the onset of diabetes in female Pima Indians from medical record data.

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.3
data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)
pregnant glucose pressure triceps insulin mass pedigree age diabetes
       6     148       72      35       0 33.6    0.627  50      pos
       1      85       66      29       0 26.6    0.351  31      neg
       8     183       64       0       0 23.3    0.672  32      pos
       1      89       66      23      94 28.1    0.167  21      neg
       0     137       40      35     168 43.1    2.288  33      pos
       5     116       74       0       0 25.6    0.201  30      neg

Data dictionary:

pregnant: Number of times pregnant

glucose: Plasma glucose concentration (glucose tolerance test)

pressure: Diastolic blood pressure (mm Hg)

triceps: Triceps skin fold thickness (mm)

insulin: 2-Hour serum insulin (mu U/ml)

mass: Body mass index

pedigree: Diabetes pedigree function

age: Age (years)

diabetes: Class variable (test for diabetes)

dim(PimaIndiansDiabetes)
## [1] 768   9

As you see, it has 9 variables and 768 observations.

str(PimaIndiansDiabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
##  $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
##  $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
##  $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
##  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

As you see, all features are numeric and our response variable, diabetes, is binary. In other words, it is a categorical variable with two levels.

To make the rest of the analysis simpler, we can recode our response variable.

PimaIndiansDiabetes$diabetes<-as.factor(ifelse(PimaIndiansDiabetes$diabetes=="neg",1,0))
head(PimaIndiansDiabetes$diabetes)
## [1] 0 1 0 1 0 1
## Levels: 0 1

The Naive Bayes classifier is a supervised learning algorithm, so we need training and test sets to build the model and to predict unseen observations.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(123)
training.samples <- PimaIndiansDiabetes$diabetes %>%createDataPartition(p = 0.8, list = FALSE) #createDataPartition helps you define train set index
train.data  <- PimaIndiansDiabetes[training.samples, ]
test.data <- PimaIndiansDiabetes[-training.samples, ]

Start by checking the correlations among the numerical variables, since the Naive Bayes classifier assumes that the features are independent.

library(ggcorrplot)
corr=cor(PimaIndiansDiabetes[,-9]) #extract the numerical variables
corr
##             pregnant    glucose   pressure     triceps     insulin       mass
## pregnant  1.00000000 0.12945867 0.14128198 -0.08167177 -0.07353461 0.01768309
## glucose   0.12945867 1.00000000 0.15258959  0.05732789  0.33135711 0.22107107
## pressure  0.14128198 0.15258959 1.00000000  0.20737054  0.08893338 0.28180529
## triceps  -0.08167177 0.05732789 0.20737054  1.00000000  0.43678257 0.39257320
## insulin  -0.07353461 0.33135711 0.08893338  0.43678257  1.00000000 0.19785906
## mass      0.01768309 0.22107107 0.28180529  0.39257320  0.19785906 1.00000000
## pedigree -0.03352267 0.13733730 0.04126495  0.18392757  0.18507093 0.14064695
## age       0.54434123 0.26351432 0.23952795 -0.11397026 -0.04216295 0.03624187
##             pedigree         age
## pregnant -0.03352267  0.54434123
## glucose   0.13733730  0.26351432
## pressure  0.04126495  0.23952795
## triceps   0.18392757 -0.11397026
## insulin   0.18507093 -0.04216295
## mass      0.14064695  0.03624187
## pedigree  1.00000000  0.03356131
## age       0.03356131  1.00000000
ggcorrplot(corr, method = "circle")

library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
NBclassfier=naiveBayes( diabetes ~ . , data=train.data)
print(NBclassfier)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.3495935 0.6504065 
## 
## Conditional probabilities:
##    pregnant
## Y      [,1]     [,2]
##   0 4.87907 3.865046
##   1 3.29750 3.035957
## 
##    glucose
## Y       [,1]     [,2]
##   0 143.3674 31.31122
##   1 110.6725 25.43783
## 
##    pressure
## Y       [,1]     [,2]
##   0 71.28372 21.10186
##   1 68.47000 16.87932
## 
##    triceps
## Y       [,1]     [,2]
##   0 22.04186 17.60318
##   1 20.11250 15.01210
## 
##    insulin
## Y       [,1]      [,2]
##   0 107.6465 147.88062
##   1  70.2750  97.13156
## 
##    mass
## Y       [,1]     [,2]
##   0 34.99674 7.183746
##   1 30.34975 7.638519
## 
##    pedigree
## Y        [,1]      [,2]
##   0 0.5508093 0.3667233
##   1 0.4269350 0.2811081
## 
##    age
## Y       [,1]     [,2]
##   0 37.50698 11.35118
##   1 31.31250 11.86621

The naiveBayes() function assumes Gaussian distributions for numeric variables. The priors are calculated from the class proportions in the training data and are shown when the object is printed. In the conditional probability tables, each row (Y) gives the mean ([,1]) and standard deviation ([,2]) of that predictor within the corresponding class.
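To see how these tables are used, here is a small sketch: for a numeric predictor such as glucose, the model evaluates a Gaussian density with the class-conditional mean and standard deviation. The numbers below are copied from the printed output above, and the same parameters are stored in the fitted object.

dnorm(120, mean = 143.3674, sd = 31.31122)  # likelihood of glucose = 120 in class 0
dnorm(120, mean = 110.6725, sd = 25.43783)  # likelihood of glucose = 120 in class 1

NBclassfier$tables$glucose                  # the class-conditional means and sds shown above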

Train Data Performance

train_pred1<-predict(NBclassfier, newdata = train.data, type = "class")

train_tab1 = table(predicted = train_pred1 , actual = train.data$diabetes)
library(caret)
train_con_mat1 = confusionMatrix(train_tab1)
c(train_con_mat1$overall["Accuracy"], 
  train_con_mat1$byClass["Sensitivity"], 
  train_con_mat1$byClass["Specificity"])
##    Accuracy Sensitivity Specificity 
##   0.7642276   0.6232558   0.8400000
Test Data Performance

test_pred1<-predict(NBclassfier, newdata = test.data, type = "class")

test_tab1 = table(predicted = test_pred1 , actual = test.data$diabetes)
library(caret)
test_con_mat1 = confusionMatrix(test_tab1)
c(test_con_mat1$overall["Accuracy"], 
  test_con_mat1$byClass["Sensitivity"], 
  test_con_mat1$byClass["Specificity"])
##    Accuracy Sensitivity Specificity 
##   0.7581699   0.5660377   0.8600000
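For reference, confusionMatrix() treats the first factor level (here 0) as the positive class, and the reported measures are defined as

\[ \text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}, \quad \text{Sensitivity}=\frac{TP}{TP+FN}, \quad \text{Specificity}=\frac{TN}{TN+FP} \]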

Decision Tree

The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, a decision tree can be used for solving both regression and classification problems.

The goal of using a decision tree is to create a model that predicts the class or value of the target variable by learning simple decision rules inferred from the training data.

In a decision tree, to predict the class label of a record we start from the root of the tree. We compare the value of the root attribute with the record's attribute and, based on the comparison, follow the corresponding branch and jump to the next node.

Important Terminology related to Decision Trees

  • Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.

  • Splitting: It is a process of dividing a node into two or more sub-nodes.

  • Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.

  • Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.

  • Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be thought of as the opposite of splitting.

  • Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.

  • Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.

Decision trees classify the examples by sorting them down the tree from the root to some leaf/terminal node, with the leaf/terminal node providing the classification of the example.

Each node in the tree acts as a test case for some attribute, and each edge descending from the node corresponds to the possible answers to the test case. This process is recursive in nature and is repeated for every subtree rooted at the new node.
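To make the split-selection idea concrete, here is a minimal hand-rolled sketch that scores candidate cut points of a single numeric predictor by the weighted Gini impurity of the two resulting child nodes (illustration only; rpart, used below, searches all predictors and is far more refined). The example call assumes the train.data object created in the Naive Bayes section.

gini <- function(y) {
  p <- table(y) / length(y)    # class proportions in the node
  1 - sum(p^2)                 # Gini impurity
}

best_split <- function(x, y) {
  cuts <- sort(unique(x))[-1]  # candidate cut points (skip the minimum so both children are non-empty)
  impurity <- sapply(cuts, function(thr) {
    left  <- y[x <  thr]
    right <- y[x >= thr]
    # size-weighted average impurity of the two child nodes
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  cuts[which.min(impurity)]    # cut point with the lowest weighted impurity
}

# Example: best single cut point of glucose for the recoded diabetes response
best_split(train.data$glucose, train.data$diabetes)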

Computational Application:

Classification Tree

Let’s consider the Pima Indians Diabetes Database from the mlbench package, the same data used in the Naive Bayes application. Remember that the data set is about predicting the onset of diabetes in female Pima Indians from medical record data.

Let’s see the first six rows:

head(PimaIndiansDiabetes)
pregnant glucose pressure triceps insulin mass pedigree age diabetes
       6     148       72      35       0 33.6    0.627  50        0
       1      85       66      29       0 26.6    0.351  31        1
       8     183       64       0       0 23.3    0.672  32        0
       1      89       66      23      94 28.1    0.167  21        1
       0     137       40      35     168 43.1    2.288  33        0
       5     116       74       0       0 25.6    0.201  30        1

Let’s check the proportions of the response levels in the training and test sets.

prop.table(table(train.data$diabetes))
## 
##         0         1 
## 0.3495935 0.6504065
prop.table(table(test.data$diabetes))
## 
##         0         1 
## 0.3464052 0.6535948

In order to grow our decision tree, we first load the rpart package. Then we can use the rpart() function, specifying the model formula, the data, and the method parameter. In this case we want to classify the response diabetes using the predictors, so our call to rpart() looks like this:

library(rpart)
library(rpart.plot)
fit <- rpart(diabetes~., data = train.data, method = 'class') #method = class is necessary for classification task
print(fit)
## n= 615 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 615 215 1 (0.34959350 0.65040650)  
##    2) glucose>=154.5 104  18 0 (0.82692308 0.17307692) *
##    3) glucose< 154.5 511 129 1 (0.25244618 0.74755382)  
##      6) age>=28.5 247  95 1 (0.38461538 0.61538462)  
##       12) mass>=26.7 202  92 1 (0.45544554 0.54455446)  
##         24) pedigree>=0.516 75  27 0 (0.64000000 0.36000000)  
##           48) glucose>=91.5 67  20 0 (0.70149254 0.29850746) *
##           49) glucose< 91.5 8   1 1 (0.12500000 0.87500000) *
##         25) pedigree< 0.516 127  44 1 (0.34645669 0.65354331) *
##       13) mass< 26.7 45   3 1 (0.06666667 0.93333333) *
##      7) age< 28.5 264  34 1 (0.12878788 0.87121212)  
##       14) glucose>=127.5 51  18 1 (0.35294118 0.64705882)  
##         28) pressure< 59 9   2 0 (0.77777778 0.22222222) *
##         29) pressure>=59 42  11 1 (0.26190476 0.73809524) *
##       15) glucose< 127.5 213  16 1 (0.07511737 0.92488263) *

Structure and Interpretation of the Tree:

Root Node (Node 1):
    n=615: Total of 615 observations.
    loss=215: If you predicted the majority class, you would misclassify 215 observations.
    yval=1: Majority class (1).
    yprob=(0.34959350 0.65040650): Probability of belonging to class 0 is 34.96%, and class 1 is 65.04%.

Decision 1 (Nodes 2 and 3):
    Node 2: glucose>=154.5
        There are 104 observations that meet this criterion, with 18 misclassifications.
        This node is terminal (leaf node) and predominantly represents class 0 (82.69%).
    Node 3: glucose<154.5
        There are 511 observations that do not meet the criterion, with 129 misclassifications.
        This node will be further split.

Further splits under Node 3 (Nodes 6 and 7):
    Node 6: age>=28.5 and Node 7: age<28.5
    Node 6 has further splits (e.g., Nodes 12 and 13).
    Node 7 shows that for observations where age<28.5, a large majority (87.12%) belong to class 1; this node is split further (Nodes 14 and 15) to refine the prediction.
rpart.plot(fit)

At the top of the plot, the root node shows the overall proportion of patients who do not have diabetes (about 65%; recall that level 1 corresponds to neg after the recoding).

The root asks whether a patient's glucose is greater than or equal to 155. If the answer is no, you go down to the root's right child node (depth 2): about 83 percent of the patients fall into this node, and they are predicted not to have diabetes with probability 0.75.

If the answer is yes, you go down to the root's left child node: about 17 percent of the patients fall into this node, and they are predicted to have diabetes (the 0.17 displayed in the node is the probability of the non-diabetic class).

You can continue the interpretation of the deeper nodes in the same way.

By default, the rpart() function uses the Gini impurity measure to split the nodes. The higher the Gini impurity, the more mixed the classes within the node.
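For a node with class proportions \(p_k\), the Gini impurity is

\[ G=\sum_{k}p_k(1-p_k)=1-\sum_{k}p_k^{2} \]

It equals 0 for a pure node and is largest when the classes are evenly mixed.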

Let’s calculate the classification error rate and other measures on both the training and test data to see how the model performs.

#CE function
calc_class_err = function(actual, predicted) {
  mean(actual != predicted)
}
## Train data performance.

#CE
train_pred <-predict(fit,train.data[,1:8],type = 'class')
round(calc_class_err(train_pred,train.data[,9]),3)
## [1] 0.187
# Accuracy, Sensitivity and Specificity
library(caret)
train_tab = table(predicted = train_pred , actual = train.data[,9])
library(caret)
train_con_mat = confusionMatrix(train_tab)
c(train_con_mat$overall["Accuracy"], 
  train_con_mat$byClass["Sensitivity"], 
  train_con_mat$byClass["Specificity"])
##    Accuracy Sensitivity Specificity 
##   0.8130081   0.6511628   0.9000000
# Test Data Performance
#CE
test_pred <-predict(fit,test.data[,1:8],type = 'class')
round(calc_class_err(test_pred,test.data[,9]),3)
## [1] 0.275
# Accuracy, Sensitivity and Specificity
test_tab = table(predicted = test_pred , actual = test.data[,9])
library(caret)
test_con_mat = confusionMatrix(test_tab)
c(test_con_mat$overall["Accuracy"], 
  test_con_mat$byClass["Sensitivity"], 
  test_con_mat$byClass["Specificity"])
##    Accuracy Sensitivity Specificity 
##   0.7254902   0.4150943   0.8900000

Improving the performance of a decision tree is not easy with the rpart() function alone. Therefore, we will use another package to fit and tune the decision tree: the caret package.

The caret package contains a set of functions to streamline model training for regression and classification:

  • Standard Interface for Modeling and Prediction

  • Simplify Model tuning

  • Data splitting

  • Feature selection

  • Evaluate models

  • Parallel processing

  • Models Supported

caret supports hundreds of predictive models, and provides facilities for adding your own models to take advantage of the caret infrastructure.

You can get a list of the models supported by caret:

library(caret) 
names(getModelInfo()) #names of the model you can fit with caret
##   [1] "ada"                 "AdaBag"              "AdaBoost.M1"        
##   [4] "adaboost"            "amdai"               "ANFIS"              
##   [7] "avNNet"              "awnb"                "awtan"              
##  [10] "bag"                 "bagEarth"            "bagEarthGCV"        
##  [13] "bagFDA"              "bagFDAGCV"           "bam"                
##  [16] "bartMachine"         "bayesglm"            "binda"              
##  [19] "blackboost"          "blasso"              "blassoAveraged"     
##  [22] "bridge"              "brnn"                "BstLm"              
##  [25] "bstSm"               "bstTree"             "C5.0"               
##  [28] "C5.0Cost"            "C5.0Rules"           "C5.0Tree"           
##  [31] "cforest"             "chaid"               "CSimca"             
##  [34] "ctree"               "ctree2"              "cubist"             
##  [37] "dda"                 "deepboost"           "DENFIS"             
##  [40] "dnn"                 "dwdLinear"           "dwdPoly"            
##  [43] "dwdRadial"           "earth"               "elm"                
##  [46] "enet"                "evtree"              "extraTrees"         
##  [49] "fda"                 "FH.GBML"             "FIR.DM"             
##  [52] "foba"                "FRBCS.CHI"           "FRBCS.W"            
##  [55] "FS.HGD"              "gam"                 "gamboost"           
##  [58] "gamLoess"            "gamSpline"           "gaussprLinear"      
##  [61] "gaussprPoly"         "gaussprRadial"       "gbm_h2o"            
##  [64] "gbm"                 "gcvEarth"            "GFS.FR.MOGUL"       
##  [67] "GFS.LT.RS"           "GFS.THRIFT"          "glm.nb"             
##  [70] "glm"                 "glmboost"            "glmnet_h2o"         
##  [73] "glmnet"              "glmStepAIC"          "gpls"               
##  [76] "hda"                 "hdda"                "hdrda"              
##  [79] "HYFIS"               "icr"                 "J48"                
##  [82] "JRip"                "kernelpls"           "kknn"               
##  [85] "knn"                 "krlsPoly"            "krlsRadial"         
##  [88] "lars"                "lars2"               "lasso"              
##  [91] "lda"                 "lda2"                "leapBackward"       
##  [94] "leapForward"         "leapSeq"             "Linda"              
##  [97] "lm"                  "lmStepAIC"           "LMT"                
## [100] "loclda"              "logicBag"            "LogitBoost"         
## [103] "logreg"              "lssvmLinear"         "lssvmPoly"          
## [106] "lssvmRadial"         "lvq"                 "M5"                 
## [109] "M5Rules"             "manb"                "mda"                
## [112] "Mlda"                "mlp"                 "mlpKerasDecay"      
## [115] "mlpKerasDecayCost"   "mlpKerasDropout"     "mlpKerasDropoutCost"
## [118] "mlpML"               "mlpSGD"              "mlpWeightDecay"     
## [121] "mlpWeightDecayML"    "monmlp"              "msaenet"            
## [124] "multinom"            "mxnet"               "mxnetAdam"          
## [127] "naive_bayes"         "nb"                  "nbDiscrete"         
## [130] "nbSearch"            "neuralnet"           "nnet"               
## [133] "nnls"                "nodeHarvest"         "null"               
## [136] "OneR"                "ordinalNet"          "ordinalRF"          
## [139] "ORFlog"              "ORFpls"              "ORFridge"           
## [142] "ORFsvm"              "ownn"                "pam"                
## [145] "parRF"               "PART"                "partDSA"            
## [148] "pcaNNet"             "pcr"                 "pda"                
## [151] "pda2"                "penalized"           "PenalizedLDA"       
## [154] "plr"                 "pls"                 "plsRglm"            
## [157] "polr"                "ppr"                 "pre"                
## [160] "PRIM"                "protoclass"          "qda"                
## [163] "QdaCov"              "qrf"                 "qrnn"               
## [166] "randomGLM"           "ranger"              "rbf"                
## [169] "rbfDDA"              "Rborist"             "rda"                
## [172] "regLogistic"         "relaxo"              "rf"                 
## [175] "rFerns"              "RFlda"               "rfRules"            
## [178] "ridge"               "rlda"                "rlm"                
## [181] "rmda"                "rocc"                "rotationForest"     
## [184] "rotationForestCp"    "rpart"               "rpart1SE"           
## [187] "rpart2"              "rpartCost"           "rpartScore"         
## [190] "rqlasso"             "rqnc"                "RRF"                
## [193] "RRFglobal"           "rrlda"               "RSimca"             
## [196] "rvmLinear"           "rvmPoly"             "rvmRadial"          
## [199] "SBC"                 "sda"                 "sdwd"               
## [202] "simpls"              "SLAVE"               "slda"               
## [205] "smda"                "snn"                 "sparseLDA"          
## [208] "spikeslab"           "spls"                "stepLDA"            
## [211] "stepQDA"             "superpc"             "svmBoundrangeString"
## [214] "svmExpoString"       "svmLinear"           "svmLinear2"         
## [217] "svmLinear3"          "svmLinearWeights"    "svmLinearWeights2"  
## [220] "svmPoly"             "svmRadial"           "svmRadialCost"      
## [223] "svmRadialSigma"      "svmRadialWeights"    "svmSpectrumString"  
## [226] "tan"                 "tanSearch"           "treebag"            
## [229] "vbmpRadial"          "vglmAdjCat"          "vglmContRatio"      
## [232] "vglmCumulative"      "widekernelpls"       "WM"                 
## [235] "wsrf"                "xgbDART"             "xgbLinear"          
## [238] "xgbTree"             "xyf"

rpart2 is the method name passed to train() for fitting a decision tree that is tuned over the maximum tree depth.

modelLookup("rpart2") #lets see tuneable parameters for decision tree
   model parameter          label forReg forClass probModel
  rpart2  maxdepth Max Tree Depth   TRUE     TRUE      TRUE

The trainControl() function generates parameters that further control how models are created. For example:

train.control <- trainControl(
                           method = "repeatedcv",
                           number = 10, ## 10-fold CV
                           repeats = 3,## repeated three times
                           # USE AUC
                           summaryFunction = twoClassSummary, 
                           classProbs = TRUE
                           )

#twoClassSummary, will compute the sensitivity, specificity and area under the ROC curve

Now we will go back to the original Pima Indians Diabetes data set, because with the earlier transformation of the response (to the numeric-looking levels 0 and 1) the levels are not assigned correctly for this analysis: caret requires factor levels that are valid R variable names when classProbs = TRUE.

data("PimaIndiansDiabetes")
set.seed(123)
training.samples <- PimaIndiansDiabetes$diabetes %>%createDataPartition(p = 0.8, list = FALSE) #createDataPartition helps you define train set index
train.data  <- PimaIndiansDiabetes[training.samples, ]
test.data <- PimaIndiansDiabetes[-training.samples, ]
rpartFit1 <- train(diabetes~., data=train.data, 
                 method = "rpart2", 
                 tuneLength = 6,
                 trControl = train.control,
                 metric = "ROC")
rpartFit1
## CART 
## 
## 615 samples
##   8 predictor
##   2 classes: 'neg', 'pos' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 554, 554, 553, 554, 554, 553, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  ROC        Sens       Spec     
##    1        0.6481674  0.8700000  0.4263348
##    5        0.7225613  0.8400000  0.4979076
##    7        0.7323214  0.8391667  0.5298701
##    9        0.7388555  0.8258333  0.5577201
##   11        0.7397376  0.8208333  0.5700577
##   12        0.7416126  0.8241667  0.5717172
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 12.
plot(rpartFit1)

rpartFit1$finalModel
## n= 615 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 615 215 neg (0.65040650 0.34959350)  
##     2) glucose< 154.5 518 136 neg (0.73745174 0.26254826)  
##       4) glucose< 111.5 260  37 neg (0.85769231 0.14230769)  
##         8) pedigree< 0.6605 211  20 neg (0.90521327 0.09478673) *
##         9) pedigree>=0.6605 49  17 neg (0.65306122 0.34693878)  
##          18) age< 40 38   8 neg (0.78947368 0.21052632) *
##          19) age>=40 11   2 pos (0.18181818 0.81818182) *
##       5) glucose>=111.5 258  99 neg (0.61627907 0.38372093)  
##        10) mass< 27.05 51   5 neg (0.90196078 0.09803922) *
##        11) mass>=27.05 207  94 neg (0.54589372 0.45410628)  
##          22) age< 30.5 108  36 neg (0.66666667 0.33333333)  
##            44) triceps>=14.5 80  19 neg (0.76250000 0.23750000) *
##            45) triceps< 14.5 28  11 pos (0.39285714 0.60714286) *
##          23) age>=30.5 99  41 pos (0.41414141 0.58585859)  
##            46) pedigree< 0.514 57  28 neg (0.50877193 0.49122807)  
##              92) pregnant>=1.5 49  20 neg (0.59183673 0.40816327)  
##               184) pressure>=84.5 12   1 neg (0.91666667 0.08333333) *
##               185) pressure< 84.5 37  18 pos (0.48648649 0.51351351)  
##                 370) glucose>=121 29  12 neg (0.58620690 0.41379310) *
##                 371) glucose< 121 8   1 pos (0.12500000 0.87500000) *
##              93) pregnant< 1.5 8   0 pos (0.00000000 1.00000000) *
##            47) pedigree>=0.514 42  12 pos (0.28571429 0.71428571) *
##     3) glucose>=154.5 97  18 pos (0.18556701 0.81443299) *
rpart.plot(rpartFit1$finalModel)

# Test Data Performance
#CE
test_pred <-predict(rpartFit1$finalModel,test.data[,1:8],type = 'class')
round(calc_class_err(test_pred,test.data[,9]),3)
## [1] 0.242
# Accuracy, Sensitivity and Specificity
test_tab = table(predicted = test_pred , actual = test.data[,9])
library(caret)
test_con_mat = confusionMatrix(test_tab)
c(test_con_mat$overall["Accuracy"], 
  test_con_mat$byClass["Sensitivity"], 
  test_con_mat$byClass["Specificity"])
##    Accuracy Sensitivity Specificity 
##   0.7581699   0.8600000   0.5660377

Random Forest

A single decision tree suffers from high variance. Although pruning is applied to decrease the variance, there are alternative methods that reduce the variance of a single tree and thereby improve the performance of the model. One of these approaches is bagging, short for Bootstrap Aggregating, proposed by Breiman in 1996.

Bagging

Bagging is a supervised ensemble method that combines and averages multiple models. In this way it has less variability than a single tree and reduces the risk of overfitting, which leads to more accurate results. The bagging algorithm is summarized below, followed by a small hand-rolled sketch.

  • Create m bootstrap samples from the training data. Bootstrapped samples allow us to create many slightly different data sets but with the same distribution as the overall training set.

  • For each bootstrap sample train a single, unpruned regression tree.

  • Average individual predictions from each tree to create an overall average predicted value.

  • Give the final prediction by averaging the predictions for a regression problem, or by majority vote for a classification problem.
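As mentioned above, here is a minimal hand-rolled sketch of bagging regression trees. It is an illustration only: it uses the built-in mtcars data and the rpart package, and it evaluates the ensemble in-sample just to keep the code short.

library(rpart)

set.seed(123)
m     <- 100                                    # number of bootstrap samples / trees
n     <- nrow(mtcars)
preds <- matrix(NA, nrow = n, ncol = m)

for (i in 1:m) {
  boot_idx <- sample(1:n, size = n, replace = TRUE)             # bootstrap resample
  tree <- rpart(mpg ~ ., data = mtcars[boot_idx, ],
                control = rpart.control(cp = 0, minsplit = 2))  # deep, unpruned tree
  preds[, i] <- predict(tree, newdata = mtcars)                 # predict on the original data
}

bagged_pred <- rowMeans(preds)                  # average the individual tree predictions
mean((mtcars$mpg - bagged_pred)^2)              # MSE of the bagged ensemble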

Random Forest

Random forest is a supervised learning technique that can be considered a modification of bagging, explained above. In a random forest, a large set of de-correlated trees is built by randomly selecting \(m\) of the \(p\) available variables at each split \((m<p)\).

Random forests are built on the same fundamental principles as decision trees and bagging. Bagging introduces a random component into the tree-building process that reduces the variance of a single tree's prediction and improves predictive performance. However, the trees in bagging are not completely independent of each other, since all of the original predictors are considered at every split of every tree. As a result, trees grown on different bootstrap samples typically have a similar structure (especially near the top of the tree) because of the underlying relationships in the data.

This situation is known as tree correlation, and it prevents bagging from reducing the variance of the predictions optimally. We therefore need an additional way to decrease the variance: reducing the correlation between trees. Random forests accomplish this in two ways.

1. Bootstrap: As in bagging, each tree is grown on a bootstrap resample of the data, which makes the trees different and somewhat decorrelates them.

2. Split-variable randomization: Each time a split is to be performed, the search for the split variable is limited to a random subset of \(m\) of the \(p\) variables. For regression trees, the typical default is \(m=p/3\), but this should be treated as a tuning parameter. When \(m=p\), the randomization amounts to using only step 1 and is the same as bagging. (A short numeric illustration follows.)
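In the randomForest package used later in this section, this subset size is the mtry argument, and its documented defaults follow the rule above. A quick numeric illustration (taking p = 5 to match the Prestige example below):

p <- 5                  # number of predictors
max(floor(p / 3), 1)    # regression default mtry: 1 (matches "No. of variables tried at each split: 1" below)
floor(sqrt(p))          # classification default mtry: 2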

The basic algorithm for a random forest is given below. Given a training data set:

  • Select the number of trees to build (ntrees)

  • for i = 1 to ntrees do

    • Generate a bootstrap sample of the original data.

    • Grow a regression tree on the bootstrapped data:

      • for each split do

        • Select m variables at random from all p variables

        • Pick the best variable/split-point among the m

        • Split the node into two child nodes

      • end

    • Use typical tree model stopping criteria to determine when a tree is complete (but do not prune)

  • end

At the end of this process, we have a model built from de-correlated trees, with lower variance than bagging.

Computational Application:

There are over 20 random forest packages in R. To demonstrate the basic implementation, we illustrate the use of the randomForest package, the oldest and best-known implementation of the random forest algorithm in R.

Please consider the “Prestige” data set from the car package.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
data(Prestige)

Let’s see first six observation of the dataset.

head(Prestige)
                    education income women prestige census type
gov.administrators      13.11  12351 11.16     68.8   1113 prof
general.managers        12.26  25879  4.02     69.1   1130 prof
accountants             12.77   9271 15.70     63.4   1171 prof
purchasing.officers     11.42   8865  9.11     56.8   1175 prof
chemists                14.62   8403 11.68     73.5   2111 prof
physicists              15.64  11030  5.13     77.6   2113 prof
dim(Prestige)
## [1] 102   6

The data set has 102 observations and 6 variables. You can see the names of the variables and their corresponding classes below.

Data dictionary:

education: average education of occupational incumbents, years, in 1971

income: average income of incumbents, dollars, in 1971

women: percentage of incumbents who are women

prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid- 1960s

census: Canadian Census occupational code

type: a factor for type of occupation with levels bc (Blue Collar), prof (Professional, Managerial and Technical) and wc (White Collar)

str(Prestige)
## 'data.frame':    102 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...

The output shows that only the type variable is a factor, with three levels; the remaining variables are numeric.

Let’s see the summary of the data.

summary(Prestige)
##    education          income          women           prestige    
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60  
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :44  
##  1st Qu.:3120   prof:31  
##  Median :5135   wc  :23  
##  Mean   :5402   NA's: 4  
##  3rd Qu.:8312            
##  Max.   :9517

As you can see, we have 4 missing values in the type variable. Since the percentage of missing observations (3.92%) is very low, we can remove them using the na.omit() function.

data<-na.omit(Prestige)
summary(data)
##    education          income          women           prestige    
##  Min.   : 6.380   Min.   : 1656   Min.   : 0.000   Min.   :17.30  
##  1st Qu.: 8.445   1st Qu.: 4250   1st Qu.: 3.268   1st Qu.:35.38  
##  Median :10.605   Median : 6036   Median :14.475   Median :43.60  
##  Mean   :10.795   Mean   : 6939   Mean   :28.986   Mean   :47.33  
##  3rd Qu.:12.755   3rd Qu.: 8226   3rd Qu.:52.203   3rd Qu.:59.90  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :44  
##  1st Qu.:3116   prof:31  
##  Median :5132   wc  :23  
##  Mean   :5400            
##  3rd Qu.:8328            
##  Max.   :9517

As you can see, there are no longer any missing values in the data set.

dim(data)
## [1] 98  6

In this example, we are trying to predict the prestige variable, which is numeric. Therefore, we have a regression problem.

Before building the model, divide the data set into two parts.

set.seed(123)
trainindex<-sample(1:98,round(0.8*98))
train<-data[trainindex,]
test<-data[-trainindex,]
dim(train)
## [1] 78  6

In order to build the tree, we can use randomForest function from randomForest package.

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
rf1<-randomForest(prestige~.,data=train)
rf1
## 
## Call:
##  randomForest(formula = prestige ~ ., data = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 51.04686
##                     % Var explained: 83.26

The Mean of squared residuals and % Var explained shown above are calculated using out-of-bag (OOB) error estimation: each tree is trained on a bootstrap sample containing roughly 2/3 of the training observations, and the remaining 1/3 (the OOB observations) are used to validate that tree. Also, the number of variables randomly selected at each split is 1.
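As a quick check (a sketch: for a regression forest, rf1$predicted holds the out-of-bag predictions):

oob_pred <- rf1$predicted                 # OOB prediction for each training observation
mean((train$prestige - oob_pred)^2)       # should reproduce the "Mean of squared residuals" above (about 51)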

You can use the plot() function to visualize the error against the number of trees, which helps you decide how many trees to grow.

plot(rf1)

This plot is generally used to see how the error drops as we keep adding more trees and averaging them. From this particular plot, however, it is hard to draw such a conclusion.

Let’s increase the number of trees to grow using the ntree argument and build the forest again.

library(randomForest)
rf2<-randomForest(prestige~.,data=train,ntree=600)
rf2
## 
## Call:
##  randomForest(formula = prestige ~ ., data = train, ntree = 600) 
##                Type of random forest: regression
##                      Number of trees: 600
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 49.31865
##                     % Var explained: 83.83
plot(rf2)

As we can see, the error seems to stabilize after about 500 trees. The plotted error is based on the OOB sample error and can be accessed directly via rf2$mse, so we can find the number of trees that gives the lowest error rate: 94 trees, with an RMSE of about 6.96 prestige points.

# number of trees with lowest MSE
which.min(rf2$mse)
## [1] 94
# RMSE of this optimal random forest
sqrt(rf2$mse[which.min(rf2$mse)])
## [1] 6.959482

Out of Bag Sample Errors and Error on Test set

The OOB error was used to set the number of trees above. It can also be used to set the mtry argument, the number of variables randomly sampled as candidates at each split. The random forest above randomly selected 1 variable at each split; we can now try every possible value of mtry from 1 to 5 predictors.

oob.err=double(5)
test.err=double(5)

#mtry is no of Variables randomly chosen at each split

for(mtry in 1:5) 
{
  rf=randomForest(prestige~ . , data = train ,mtry=mtry,ntree=520) 
  oob.err[mtry] = rf$mse[520] #Error of all Trees fitted
  
  pred<-predict(rf,test[-4,]) #Predictions on Test Set for each Tree
  test.err[mtry]= mean((test[,4] - pred)^2) #Mean Squared Test Error
  
  cat(mtry," ") #printing the output to the console
  
}
## Warning in test[, 4] - pred: longer object length is not a multiple of shorter
## object length
## 1
## Warning in test[, 4] - pred: longer object length is not a multiple of shorter
## object length
## 2
## Warning in test[, 4] - pred: longer object length is not a multiple of shorter
## object length
## 3
## Warning in test[, 4] - pred: longer object length is not a multiple of shorter
## object length
## 4
## Warning in test[, 4] - pred: longer object length is not a multiple of shorter
## object length
## 5
oob.err
## [1] 50.24549 45.56053 44.81708 43.51478 42.76154
test.err
## [1] 266.1793 253.2689 231.4938 212.2652 188.9154

Now we can visualize the OOB and test-set errors and then choose the mtry value.

matplot(1:mtry , cbind(oob.err,test.err), pch=20 , col=c("red","orange"),type="b",ylab="Mean Squared Error",xlab="Number of Predictors Considered at each Split")
legend("topright",legend=c("Out of Bag Error","Test Error"),pch=19, col=c("red","orange"))

The red line shows the OOB estimates and the orange line the error calculated on the test set. The two curves follow a similar pattern, and the error tends to be minimized around \(mtry=5\).

Therefore, we can build a random forest with ntree = 520 and mtry = 5.

rf3<-randomForest(prestige~.,data=train,ntree=520,mtry=5)
rf3
## 
## Call:
##  randomForest(formula = prestige ~ ., data = train, ntree = 520,      mtry = 5) 
##                Type of random forest: regression
##                      Number of trees: 520
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 42.41467
##                     % Var explained: 86.09

So far we have been tuning the parameters of the function manually. Luckily, we don't always have to do this by hand: we can use the tuneRF() function to choose the parameter values.

tuneRF(x, y, mtryStart, ntreeTry=50, stepFactor=2, improve=0.05,
       trace=TRUE, plot=TRUE, doBest=FALSE, ...)
parameters<-tuneRF(train[,-4],train[,4])
## mtry = 1  OOB error = 52.51773 
## Searching left ...
## Searching right ...
## mtry = 2     OOB error = 47.11179 
## 0.1029355 0.05 
## mtry = 4     OOB error = 42.12466 
## 0.1058574 0.05 
## mtry = 5     OOB error = 47.35043 
## -0.1240548 0.05

#tuneRF(input,response)

The function shows that the error tends to be minimized around mtry = 4, in contrast to our earlier choice. Therefore, our best random forest model is built with 520 trees and 4 randomly selected variables at each split.

rf4<-randomForest(prestige~.,data=train,ntree=520,mtry=4)
rf4
## 
## Call:
##  randomForest(formula = prestige ~ ., data = train, ntree = 520,      mtry = 4) 
##                Type of random forest: regression
##                      Number of trees: 520
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 42.37231
##                     % Var explained: 86.1

After training a random forest, it is natural to ask which variables have the most predictive power. Variables with high importance are drivers of the outcome and their values have a significant impact on the outcome values. By contrast, variables with low importance might be omitted from a model, making it simpler and faster to fit and predict.

In order to draw a variable importance plot, we can use the varImpPlot() function. The variables are sorted by their importance; for a regression forest such as this one the measure is the increase in node purity (the total decrease in residual sum of squares from splits on that variable), while for classification it is the mean decrease in Gini. The variable with the highest value can be considered the most important input for predicting the response.

varImpPlot(rf4)

The plot shows that census is the most important feature for predicting the prestige score.
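If needed, the numeric scores behind the plot can be printed directly (output not shown):

imp <- importance(rf4)                                   # importance scores used by varImpPlot()
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]  # sorted from most to least important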

Apply Linear Regression

As stated above, this is a regression problem, so let's fit a linear regression model for comparison.

lm.fit<-lm(prestige~.,data=train)
summary(lm.fit)
## 
## Call:
## lm(formula = prestige ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4658  -5.7871  -0.2579   5.3222  20.3642 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.640e+01  1.105e+01  -1.485   0.1421    
## education    3.722e+00  8.790e-01   4.235 6.75e-05 ***
## income       1.458e-03  4.710e-04   3.096   0.0028 ** 
## women        3.976e-02  4.019e-02   0.989   0.3259    
## census       1.445e-03  9.829e-04   1.470   0.1459    
## typeprof     1.233e+01  6.665e+00   1.849   0.0686 .  
## typewc       7.891e-01  4.563e+00   0.173   0.8632    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.456 on 71 degrees of freedom
## Multiple R-squared:  0.8341, Adjusted R-squared:   0.82 
## F-statistic: 59.48 on 6 and 71 DF,  p-value: < 2.2e-16

As you can see, the model has a very high \(R^{2}_{adj}\) value (0.82). Except for education and income, the remaining variables do not contribute to the model significantly.

Random Forest and Linear Model Comparison

Train Data

Random Forest

pred<-predict(rf4,data=train[,-4])
trainrf_mse<-mean((pred-train[,4])^2)
trainrf_mse
## [1] 42.37231

Linear Regression

pred1<-predict(lm.fit,data=train[,-4])
trainlm_mse<-mean((pred1-train[,4])^2)
trainlm_mse
## [1] 50.60217

Test Data

Random Forest

pred<-predict(rf4,data=test[,-4])
testrf_mse<-mean((pred-test[,4])^2)
## Warning in pred - test[, 4]: longer object length is not a multiple of shorter
## object length
testrf_mse
## [1] 468.7358

Linear Regression

pred1<-predict(lm.fit,data=test[,-4])
testlm_mse<-mean((pred1-test[,4])^2)
## Warning in pred1 - test[, 4]: longer object length is not a multiple of shorter
## object length
testlm_mse
## [1] 483.1183

Based on the test MSE values above, the random forest outperforms the linear regression model.