Previously on STAT412:
The Naive Bayes classifier is a supervised machine learning algorithm for classification. It is quite popular in Natural Language Processing (NLP). The algorithm is based on Bayes' theorem, formulated by Thomas Bayes, so we must first understand Bayes' theorem before using the Naive Bayes classifier.
The essence of Bayes' theorem is conditional probability: the probability that an event will happen given that another event has already occurred. Using conditional probability, we can find the probability of an event given knowledge of a previous event.
\[ P(A|B)=\frac{P(B|A)P(A)}{P(B)} \]
Because the marginal probability (the denominator) is the same for every class, we can ignore it in the Naive Bayes calculation. The Naive Bayes classifier assigns a data point to the class with the largest posterior probability.
The Naive Bayes classifier rests on two assumptions. First, the features are assumed to be mutually independent given the class (hence "naive"). Second, the likelihood P(x | y) is assumed to follow a particular statistical distribution, typically Gaussian, Multinomial, or Bernoulli.
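Combining Bayes' theorem with the independence assumption, and dropping the constant marginal probability, gives the decision rule used throughout this section: assign the class with the largest unnormalized posterior.
\[ P(y \mid x_1,\dots,x_n) \propto P(y)\prod_{i=1}^{n} P(x_i \mid y), \qquad \hat{y} = \underset{y}{\arg\max}\; P(y)\prod_{i=1}^{n} P(x_i \mid y) \]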
Manual Application:
Assume we have the following data and we want to classify new data with the following criteria:
Age = 21–30
Income = Medium
Status = Married
1. First, calculate the prior probability P(y):
P(Buy = yes) → 7/15 → 0.467
P(Buy = no) → 8/15 → 0.533
2. Second, calculate the likelihood of each feature, P(x | y):
P(Age = 21–30 | Buy = yes) → 2/7 → 0.285
P(Age = 21–30 | Buy = no) → 3/8 → 0.375
P(Income = Medium | Buy = yes) → 1/7 → 0.143
P(Income = Medium | Buy = no) → 5/8 → 0.625
P(Status = Married | Buy = yes) → 4/7 → 0.571
P(Status = Married | Buy = no) → 5/8 → 0.625
3. Calculate the total likelihood:
P(x | Buy = yes) → 0.285 * 0.143 * 0.571 = 0.0233
P(x | Buy = no) → 0.375 * 0.625 * 0.625 = 0.1464
4. Calculate the (unnormalized) posterior probability P(y | x):
P(x | Buy = yes) * P(Buy = yes) → 0.0233 * 0.467 = 0.0109
P(x | Buy = no) * P(Buy = no) → 0.1464 * 0.533 = 0.0781
Since 0.0781 > 0.0109, the new data point is assigned to the class that does not buy (Buy = no).
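The same arithmetic can be reproduced in R (a minimal sketch, using only the counts from the worked example above):
prior_yes <- 7/15
prior_no  <- 8/15
lik_yes <- (2/7) * (1/7) * (4/7)   # Age = 21–30, Income = Medium, Status = Married, given Buy = yes
lik_no  <- (3/8) * (5/8) * (5/8)   # the same feature values, given Buy = no
post <- c(yes = prior_yes * lik_yes, no = prior_no * lik_no)  # unnormalized posteriors
post
names(which.max(post))             # predicted class: "no"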
Computational Application:
For this example, we will use the Pima Indians Diabetes dataset from the mlbench package. The dataset concerns the prediction of diabetes onset in female Pima Indians from medical record data.
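The data can be loaded as follows (a sketch; the loading chunk is not echoed in the output, and the table below shows the first six rows):
library(mlbench)
data("PimaIndiansDiabetes")
head(PimaIndiansDiabetes)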
| pregnant | glucose | pressure | triceps | insulin | mass | pedigree | age | diabetes |
|---|---|---|---|---|---|---|---|---|
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | pos |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | neg |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | pos |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | neg |
| 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | pos |
| 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | neg |
Data dictionary:
pregnant: Number of times pregnant
glucose: Plasma glucose concentration (glucose tolerance test)
pressure: Diastolic blood pressure (mm Hg)
triceps: Triceps skin fold thickness (mm)
insulin: 2-Hour serum insulin (mu U/ml)
mass: Body mass index
pedigree: Diabetes pedigree function
age: Age (years)
diabetes: Class variable (test for diabetes)
## [1] 768 9
As you see, it has 9 variables and 768 observations.
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
As you see, all features are numeric and our response variable, diabetes, is binary. In other words, it is a categorical variable with two levels.
To make the rest of the analysis simpler, we can recode our response variable.
PimaIndiansDiabetes$diabetes <- as.factor(ifelse(PimaIndiansDiabetes$diabetes == "neg", 1, 0))
head(PimaIndiansDiabetes$diabetes)
## [1] 0 1 0 1 0 1
## Levels: 0 1
The Naive Bayes classifier is a supervised learning algorithm; therefore, we need training and test data to build the model and to predict unseen observations.
library(dplyr)  # for the %>% pipe
library(caret)  # for createDataPartition()
set.seed(123)
training.samples <- PimaIndiansDiabetes$diabetes %>% createDataPartition(p = 0.8, list = FALSE) # createDataPartition helps you define the train set index
train.data <- PimaIndiansDiabetes[training.samples, ]
test.data <- PimaIndiansDiabetes[-training.samples, ]
Start by checking the correlation of the numerical variables, since the Naive Bayes classifier relies on an independence assumption.
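The correlation matrix below appears to be computed on the full set of numeric predictors; a sketch of the call:
cor(PimaIndiansDiabetes[, 1:8])  # pairwise correlations among the numeric predictors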
## pregnant glucose pressure triceps insulin mass
## pregnant 1.00000000 0.12945867 0.14128198 -0.08167177 -0.07353461 0.01768309
## glucose 0.12945867 1.00000000 0.15258959 0.05732789 0.33135711 0.22107107
## pressure 0.14128198 0.15258959 1.00000000 0.20737054 0.08893338 0.28180529
## triceps -0.08167177 0.05732789 0.20737054 1.00000000 0.43678257 0.39257320
## insulin -0.07353461 0.33135711 0.08893338 0.43678257 1.00000000 0.19785906
## mass 0.01768309 0.22107107 0.28180529 0.39257320 0.19785906 1.00000000
## pedigree -0.03352267 0.13733730 0.04126495 0.18392757 0.18507093 0.14064695
## age 0.54434123 0.26351432 0.23952795 -0.11397026 -0.04216295 0.03624187
## pedigree age
## pregnant -0.03352267 0.54434123
## glucose 0.13733730 0.26351432
## pressure 0.04126495 0.23952795
## triceps 0.18392757 -0.11397026
## insulin 0.18507093 -0.04216295
## mass 0.14064695 0.03624187
## pedigree 1.00000000 0.03356131
## age 0.03356131 1.00000000
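The model printed below was fitted with the naiveBayes() function from the e1071 package. The fitting chunk is not echoed, but it was presumably of the following form (NBclassfier is the object name used later for prediction):
library(e1071)
NBclassfier <- naiveBayes(diabetes ~ ., data = train.data)
print(NBclassfier)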
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.3495935 0.6504065
##
## Conditional probabilities:
## pregnant
## Y [,1] [,2]
## 0 4.87907 3.865046
## 1 3.29750 3.035957
##
## glucose
## Y [,1] [,2]
## 0 143.3674 31.31122
## 1 110.6725 25.43783
##
## pressure
## Y [,1] [,2]
## 0 71.28372 21.10186
## 1 68.47000 16.87932
##
## triceps
## Y [,1] [,2]
## 0 22.04186 17.60318
## 1 20.11250 15.01210
##
## insulin
## Y [,1] [,2]
## 0 107.6465 147.88062
## 1 70.2750 97.13156
##
## mass
## Y [,1] [,2]
## 0 34.99674 7.183746
## 1 30.34975 7.638519
##
## pedigree
## Y [,1] [,2]
## 0 0.5508093 0.3667233
## 1 0.4269350 0.2811081
##
## age
## Y [,1] [,2]
## 0 37.50698 11.35118
## 1 31.31250 11.86621
The naiveBayes() function assumes a Gaussian distribution for numerical variables, and the priors are calculated from the class proportions in the training data; they are shown when the object is printed. In the conditional probability tables, for each level of Y the first column ([,1]) is the mean and the second column ([,2]) is the standard deviation of the predictor within that class.
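As a small illustration (a sketch, assuming the fitted object is called NBclassfier as above), the class counts behind the prior and the per-class mean/standard deviation table of a single predictor can be extracted directly from the fitted object:
NBclassfier$apriori          # class counts used to form the prior probabilities
NBclassfier$tables$glucose   # mean (column 1) and sd (column 2) of glucose within each class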
## Train
train_pred1 <- predict(NBclassfier, newdata = train.data, type = "class")
train_tab1 = table(predicted = train_pred1, actual = train.data$diabetes)
library(caret)
train_con_mat1 = confusionMatrix(train_tab1)
c(train_con_mat1$overall["Accuracy"],
  train_con_mat1$byClass["Sensitivity"],
  train_con_mat1$byClass["Specificity"])
## Accuracy Sensitivity Specificity
## 0.7642276 0.6232558 0.8400000
## Test
test_pred1 <- predict(NBclassfier, newdata = test.data, type = "class")
test_tab1 = table(predicted = test_pred1, actual = test.data$diabetes)
library(caret)
test_con_mat1 = confusionMatrix(test_tab1)
c(test_con_mat1$overall["Accuracy"],
  test_con_mat1$byClass["Sensitivity"],
  test_con_mat1$byClass["Specificity"])
## Accuracy Sensitivity Specificity
## 0.7581699 0.5660377 0.8600000
Decision Trees
The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, decision trees can be used to solve both regression and classification problems.
The goal of using a Decision Tree is to create a training model. The model can be used to predict the class or value of the target variable by learning simple decision rules inferred from the training data.
In Decision Trees, for predicting a class label for a record we start from the root of the tree. We compare the values of the root attribute with the record’s attribute. On the basis of comparison, we follow the branch corresponding to that value and jump to the next node.
Important Terminology related to Decision Trees
Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
Splitting: It is a process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be regarded as the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
Parent and Child Node: A node, which is divided into sub-nodes is called a parent node of sub-nodes whereas sub-nodes are the child of a parent node.
Decision trees classify the examples by sorting them down the tree from the root to some leaf/terminal node, with the leaf/terminal node providing the classification of the example.
Each node in the tree acts as a test case for some attribute, and each edge descending from the node corresponds to the possible answers to the test case. This process is recursive in nature and is repeated for every subtree rooted at the new node.
Computational Application:
Classification Tree
Let’s consider the Pima Indians Diabetes dataset from the mlbench package, the same data used in the Naive Bayes application. Recall that the data set concerns the prediction of diabetes onset in female Pima Indians from medical record data.
Let’s see the first six rows:
| pregnant | glucose | pressure | triceps | insulin | mass | pedigree | age | diabetes |
|---|---|---|---|---|---|---|---|---|
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 0 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 1 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 0 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 1 |
| 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 0 |
| 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 1 |
Let’s check the proportions of the levels of the response in the training and test sets (shown below in that order).
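A sketch of the calls that produce the proportions shown below:
prop.table(table(train.data$diabetes))  # training set
prop.table(table(test.data$diabetes))   # test set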
##
## 0 1
## 0.3495935 0.6504065
##
## 0 1
## 0.3464052 0.6535948
In order to grow our decision tree, we first have to load the rpart package. Then we can use the rpart() function, specifying the model formula, the data, and the method parameter. In this case, we want to classify the response diabetes using the predictors, so our call to rpart() should look like this:
library(rpart)
library(rpart.plot)
fit <- rpart(diabetes ~ ., data = train.data, method = 'class') # method = 'class' is necessary for a classification task
print(fit)
## n= 615
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 615 215 1 (0.34959350 0.65040650)
## 2) glucose>=154.5 104 18 0 (0.82692308 0.17307692) *
## 3) glucose< 154.5 511 129 1 (0.25244618 0.74755382)
## 6) age>=28.5 247 95 1 (0.38461538 0.61538462)
## 12) mass>=26.7 202 92 1 (0.45544554 0.54455446)
## 24) pedigree>=0.516 75 27 0 (0.64000000 0.36000000)
## 48) glucose>=91.5 67 20 0 (0.70149254 0.29850746) *
## 49) glucose< 91.5 8 1 1 (0.12500000 0.87500000) *
## 25) pedigree< 0.516 127 44 1 (0.34645669 0.65354331) *
## 13) mass< 26.7 45 3 1 (0.06666667 0.93333333) *
## 7) age< 28.5 264 34 1 (0.12878788 0.87121212)
## 14) glucose>=127.5 51 18 1 (0.35294118 0.64705882)
## 28) pressure< 59 9 2 0 (0.77777778 0.22222222) *
## 29) pressure>=59 42 11 1 (0.26190476 0.73809524) *
## 15) glucose< 127.5 213 16 1 (0.07511737 0.92488263) *
Structure and Interpretation of the Tree:
Root Node (Node 1):
n=615: Total of 615 observations.
loss=215: If you predicted the majority class, you would misclassify 215 observations.
yval=1: Majority class (1).
yprob=(0.34959350 0.65040650): Probability of belonging to class 0 is 34.96%, and class 1 is 65.04%.
Decision 1 (Nodes 2 and 3):
Node 2: glucose>=154.5
There are 104 observations that meet this criterion, with 18 misclassifications.
This node is terminal (leaf node) and predominantly represents class 0 (82.69%).
Node 3: glucose<154.5
There are 511 observations that do not meet the criterion, with 129 misclassifications.
This node will be further split.
Further splits under Node 3 (Nodes 6 and 7):
Node 6: age>=28.5 and Node 7: age<28.5
Node 6 has further splits (e.g., Nodes 12 and 13).
Node 7 shows that for observations where age<28.5, a large majority (87.12%) belong to class 1, suggesting potential further splits to refine the prediction.
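The interpretation below refers to the plot of the fitted tree, which can be drawn with the rpart.plot package loaded earlier (a sketch):
rpart.plot(fit)  # draws the classification tree with node probabilities and proportions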
At the top of the plot is the overall probability of not being diabetic; it shows the proportion of patients in the data who are not diabetic (about 65 percent).
The root node asks whether a patient's glucose is greater than or equal to 155. If the answer is no, you go down to the root's right child node (depth 2), which contains 83 percent of the patients; they are not diabetic with a probability of about 0.75.
If the answer is yes, you go down to the root's left child node, which contains 17 percent of the patients; they are diabetic with a probability of about 0.83.
You can continue the interpretation for the deeper nodes in the same way.
By default, the rpart() function uses the Gini impurity measure to split the nodes. The higher the Gini impurity, the more mixed the classes within the node.
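For reference, the Gini impurity of a node with class proportions \(p_k\) is
\[ G = 1 - \sum_{k=1}^{K} p_k^{2}, \]
which is 0 for a pure node and increases as the classes become more evenly mixed.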
Let’s calculate the classification error rate and other measures on both the training and the test data to assess the performance of the model.
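The helper calc_class_err() is used below, but its definition is not shown in the output; a minimal sketch consistent with how it is called here:
calc_class_err <- function(predicted, actual) {
  # proportion of observations whose predicted class differs from the actual class
  mean(predicted != actual)
}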
## Train data performance
# CE
train_pred <- predict(fit, train.data[, 1:8], type = 'class')
round(calc_class_err(train_pred, train.data[, 9]), 3)
## [1] 0.187
# Accuracy, Sensitivity and Specificity
library(caret)
train_tab = table(predicted = train_pred, actual = train.data[, 9])
train_con_mat = confusionMatrix(train_tab)
c(train_con_mat$overall["Accuracy"],
  train_con_mat$byClass["Sensitivity"],
  train_con_mat$byClass["Specificity"])
## Accuracy Sensitivity Specificity
## 0.8130081 0.6511628 0.9000000
# Test Data Performance
# CE
test_pred <- predict(fit, test.data[, 1:8], type = 'class')
round(calc_class_err(test_pred, test.data[, 9]), 3)
## [1] 0.275
# Accuracy, Sensitivity and Specificity
test_tab = table(predicted = test_pred, actual = test.data[, 9])
library(caret)
test_con_mat = confusionMatrix(test_tab)
c(test_con_mat$overall["Accuracy"],
  test_con_mat$byClass["Sensitivity"],
  test_con_mat$byClass["Specificity"])
## Accuracy Sensitivity Specificity
## 0.7254902 0.4150943 0.8900000
Improving the performance of a decision tree is not easy with the rpart function alone. Therefore, we use another package, caret, to fit and tune the decision tree.
The caret package contains a set of functions to streamline model training for regression and classification, including:
Standard Interface for Modeling and Prediction
Simplify Model tuning
Data splitting
Feature selection
Evaluate models
Parallel processing
Models Supported
caret supports hundreds of predictive models, and provides facilities for adding your own models to take advantage of the caret infrastructure. You can get a list of the models supported by caret.
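A sketch of the call that lists them:
names(getModelInfo())  # model codes that can be passed to the method argument of train()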
## [1] "ada" "AdaBag" "AdaBoost.M1"
## [4] "adaboost" "amdai" "ANFIS"
## [7] "avNNet" "awnb" "awtan"
## [10] "bag" "bagEarth" "bagEarthGCV"
## [13] "bagFDA" "bagFDAGCV" "bam"
## [16] "bartMachine" "bayesglm" "binda"
## [19] "blackboost" "blasso" "blassoAveraged"
## [22] "bridge" "brnn" "BstLm"
## [25] "bstSm" "bstTree" "C5.0"
## [28] "C5.0Cost" "C5.0Rules" "C5.0Tree"
## [31] "cforest" "chaid" "CSimca"
## [34] "ctree" "ctree2" "cubist"
## [37] "dda" "deepboost" "DENFIS"
## [40] "dnn" "dwdLinear" "dwdPoly"
## [43] "dwdRadial" "earth" "elm"
## [46] "enet" "evtree" "extraTrees"
## [49] "fda" "FH.GBML" "FIR.DM"
## [52] "foba" "FRBCS.CHI" "FRBCS.W"
## [55] "FS.HGD" "gam" "gamboost"
## [58] "gamLoess" "gamSpline" "gaussprLinear"
## [61] "gaussprPoly" "gaussprRadial" "gbm_h2o"
## [64] "gbm" "gcvEarth" "GFS.FR.MOGUL"
## [67] "GFS.LT.RS" "GFS.THRIFT" "glm.nb"
## [70] "glm" "glmboost" "glmnet_h2o"
## [73] "glmnet" "glmStepAIC" "gpls"
## [76] "hda" "hdda" "hdrda"
## [79] "HYFIS" "icr" "J48"
## [82] "JRip" "kernelpls" "kknn"
## [85] "knn" "krlsPoly" "krlsRadial"
## [88] "lars" "lars2" "lasso"
## [91] "lda" "lda2" "leapBackward"
## [94] "leapForward" "leapSeq" "Linda"
## [97] "lm" "lmStepAIC" "LMT"
## [100] "loclda" "logicBag" "LogitBoost"
## [103] "logreg" "lssvmLinear" "lssvmPoly"
## [106] "lssvmRadial" "lvq" "M5"
## [109] "M5Rules" "manb" "mda"
## [112] "Mlda" "mlp" "mlpKerasDecay"
## [115] "mlpKerasDecayCost" "mlpKerasDropout" "mlpKerasDropoutCost"
## [118] "mlpML" "mlpSGD" "mlpWeightDecay"
## [121] "mlpWeightDecayML" "monmlp" "msaenet"
## [124] "multinom" "mxnet" "mxnetAdam"
## [127] "naive_bayes" "nb" "nbDiscrete"
## [130] "nbSearch" "neuralnet" "nnet"
## [133] "nnls" "nodeHarvest" "null"
## [136] "OneR" "ordinalNet" "ordinalRF"
## [139] "ORFlog" "ORFpls" "ORFridge"
## [142] "ORFsvm" "ownn" "pam"
## [145] "parRF" "PART" "partDSA"
## [148] "pcaNNet" "pcr" "pda"
## [151] "pda2" "penalized" "PenalizedLDA"
## [154] "plr" "pls" "plsRglm"
## [157] "polr" "ppr" "pre"
## [160] "PRIM" "protoclass" "qda"
## [163] "QdaCov" "qrf" "qrnn"
## [166] "randomGLM" "ranger" "rbf"
## [169] "rbfDDA" "Rborist" "rda"
## [172] "regLogistic" "relaxo" "rf"
## [175] "rFerns" "RFlda" "rfRules"
## [178] "ridge" "rlda" "rlm"
## [181] "rmda" "rocc" "rotationForest"
## [184] "rotationForestCp" "rpart" "rpart1SE"
## [187] "rpart2" "rpartCost" "rpartScore"
## [190] "rqlasso" "rqnc" "RRF"
## [193] "RRFglobal" "rrlda" "RSimca"
## [196] "rvmLinear" "rvmPoly" "rvmRadial"
## [199] "SBC" "sda" "sdwd"
## [202] "simpls" "SLAVE" "slda"
## [205] "smda" "snn" "sparseLDA"
## [208] "spikeslab" "spls" "stepLDA"
## [211] "stepQDA" "superpc" "svmBoundrangeString"
## [214] "svmExpoString" "svmLinear" "svmLinear2"
## [217] "svmLinear3" "svmLinearWeights" "svmLinearWeights2"
## [220] "svmPoly" "svmRadial" "svmRadialCost"
## [223] "svmRadialSigma" "svmRadialWeights" "svmSpectrumString"
## [226] "tan" "tanSearch" "treebag"
## [229] "vbmpRadial" "vglmAdjCat" "vglmContRatio"
## [232] "vglmCumulative" "widekernelpls" "WM"
## [235] "wsrf" "xgbDART" "xgbLinear"
## [238] "xgbTree" "xyf"
rpart2 is the method value used to fit a decision tree with caret; its only tuning parameter is the maximum tree depth, as shown below.
| model | parameter | label | forReg | forClass | probModel |
|---|---|---|---|---|---|
| rpart2 | maxdepth | Max Tree Depth | TRUE | TRUE | TRUE |
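The table above can be reproduced with caret's modelLookup() function:
modelLookup("rpart2")  # shows that the only tuning parameter of rpart2 is maxdepth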
The trainControl() function generates parameters that further control how models are created, with possible values:
train.control <- trainControl(
method = "repeatedcv",
number = 10, ## 10-fold CV
repeats = 3,## repeated three times
# USE AUC
summaryFunction = twoClassSummary,
classProbs = TRUE
)
# twoClassSummary will compute the sensitivity, specificity and area under the ROC curve
Now, we will go back to the original Pima Indians Diabetes data set, because when we recoded the response from character to numeric, the levels of the response were not assigned correctly.
data("PimaIndiansDiabetes")
set.seed(123)
training.samples <- PimaIndiansDiabetes$diabetes %>%createDataPartition(p = 0.8, list = FALSE) #createDataPartition helps you define train set index
train.data <- PimaIndiansDiabetes[training.samples, ]
test.data <- PimaIndiansDiabetes[-training.samples, ]rpartFit1 <- train(diabetes~., data=train.data,
method = "rpart2",
tuneLength = 6,
trControl = train.control,
metric = "ROC")## CART
##
## 615 samples
## 8 predictor
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 554, 554, 553, 554, 554, 553, ...
## Resampling results across tuning parameters:
##
## maxdepth ROC Sens Spec
## 1 0.6481674 0.8700000 0.4263348
## 5 0.7225613 0.8400000 0.4979076
## 7 0.7323214 0.8391667 0.5298701
## 9 0.7388555 0.8258333 0.5577201
## 11 0.7397376 0.8208333 0.5700577
## 12 0.7416126 0.8241667 0.5717172
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 12.
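The selected tree is stored in the finalModel component of the caret object and is printed below; a sketch of the call:
rpartFit1$finalModel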
## n= 615
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 615 215 neg (0.65040650 0.34959350)
## 2) glucose< 154.5 518 136 neg (0.73745174 0.26254826)
## 4) glucose< 111.5 260 37 neg (0.85769231 0.14230769)
## 8) pedigree< 0.6605 211 20 neg (0.90521327 0.09478673) *
## 9) pedigree>=0.6605 49 17 neg (0.65306122 0.34693878)
## 18) age< 40 38 8 neg (0.78947368 0.21052632) *
## 19) age>=40 11 2 pos (0.18181818 0.81818182) *
## 5) glucose>=111.5 258 99 neg (0.61627907 0.38372093)
## 10) mass< 27.05 51 5 neg (0.90196078 0.09803922) *
## 11) mass>=27.05 207 94 neg (0.54589372 0.45410628)
## 22) age< 30.5 108 36 neg (0.66666667 0.33333333)
## 44) triceps>=14.5 80 19 neg (0.76250000 0.23750000) *
## 45) triceps< 14.5 28 11 pos (0.39285714 0.60714286) *
## 23) age>=30.5 99 41 pos (0.41414141 0.58585859)
## 46) pedigree< 0.514 57 28 neg (0.50877193 0.49122807)
## 92) pregnant>=1.5 49 20 neg (0.59183673 0.40816327)
## 184) pressure>=84.5 12 1 neg (0.91666667 0.08333333) *
## 185) pressure< 84.5 37 18 pos (0.48648649 0.51351351)
## 370) glucose>=121 29 12 neg (0.58620690 0.41379310) *
## 371) glucose< 121 8 1 pos (0.12500000 0.87500000) *
## 93) pregnant< 1.5 8 0 pos (0.00000000 1.00000000) *
## 47) pedigree>=0.514 42 12 pos (0.28571429 0.71428571) *
## 3) glucose>=154.5 97 18 pos (0.18556701 0.81443299) *
# Test Data Performance
# CE
test_pred <- predict(rpartFit1$finalModel, test.data[, 1:8], type = 'class')
round(calc_class_err(test_pred, test.data[, 9]), 3)
## [1] 0.242
# Accuracy, Sensitivity and Specificity
test_tab = table(predicted = test_pred, actual = test.data[, 9])
library(caret)
test_con_mat = confusionMatrix(test_tab)
c(test_con_mat$overall["Accuracy"],
  test_con_mat$byClass["Sensitivity"],
  test_con_mat$byClass["Specificity"])
## Accuracy Sensitivity Specificity
## 0.7581699 0.8600000 0.5660377
A single decision tree suffers from high variance. Although pruning is applied to decrease the variance, there are alternative methods that decrease the variance of a single tree further and thereby improve the performance of the model. One of these approaches is bagging, short for Bootstrap Aggregating, proposed by Breiman in 1996.
Bagging
Bagging is a supervised ensemble method that combines and averages multiple models. In this way, it has less variability than a single tree and reduces the risk of overfitting, resulting in more accurate predictions. The bagging algorithm is defined below, followed by a small illustration in R.
Create m bootstrap samples from the training data. Bootstrapped samples allow us to create many slightly different data sets but with the same distribution as the overall training set.
For each bootstrap sample train a single, unpruned regression tree.
Average individual predictions from each tree to create an overall average predicted value.
Gives the final rule by applying average rule for regression problem or majority rule for classification problem.
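As a small illustration (a sketch, reusing the Pima train.data split created above; bag_fit is a hypothetical object name): bagging is the special case of a random forest in which every predictor is considered at every split, so it can be fitted with randomForest by setting mtry to the number of predictors.
library(randomForest)
set.seed(123)
bag_fit <- randomForest(diabetes ~ ., data = train.data, mtry = 8, ntree = 500)  # mtry = 8 = all predictors, i.e. bagging
bag_fit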
Random Forest
Random forest is a supervised learning technique that can be considered a modification of the bagging approach explained above. In a random forest, a large set of de-correlated trees is built by randomly selecting \(m\) of the \(p\) available variables at each split \((m < p)\).
Random forests are built on the same fundamental principles as decision trees and bagging. Bagging introduces a random component into the tree-building process that reduces the variance of a single tree's prediction and improves predictive performance. However, the trees in bagging are not completely independent of each other, since all the original predictors are considered at every split of every tree. Rather, trees from different bootstrap samples typically have a similar structure to each other (especially at the top of the tree) due to the underlying relationships, as already stated above.
This situation is known as tree correlation, and it prevents bagging from reducing the variance of the predictions optimally. In this case, we need an additional way to decrease the variance, namely minimizing the correlation between the trees. Random forests accomplish this in two ways.
1. Bootstrap: As in bagging, each tree is grown on a bootstrap resample of the data, which makes the trees different and somewhat decorrelates them.
2. Split-variable randomization: Each time a split is to be performed, the search for the split variable is limited to a random subset of \(m\) of the \(p\) variables. For regression trees, the typical default is \(m = p/3\), but this should be treated as a tuning parameter. When \(m = p\), the randomization amounts to using only step 1 and is the same as bagging.
The basic algorithm for a random forest is given below. Given a training data set:
Select the number of trees to build (ntrees)
for i = 1 to ntrees do
  Generate a bootstrap sample of the original data
  Grow a regression tree on the bootstrapped data
  for each split do
    Select m variables at random from all p variables
    Pick the best variable/split-point among the m
    Split the node into two child nodes
  end
  Use typical tree-model stopping criteria to determine when a tree is complete (but do not prune)
end
At the end of this process, we obtain an ensemble with lower variance and less correlated trees compared to bagging.
Computational Application:
There are over 20 random forest packages in R. To demonstrate the basic implementation, we illustrate the use of the randomForest package, the oldest and most well-known implementation of the random forest algorithm in R.
Please consider the Prestige dataset from the car package.
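The data can be loaded as follows (a sketch; the loading chunk is not echoed):
library(car)
data("Prestige")
head(Prestige)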
Let’s see first six observation of the dataset.
| | education | income | women | prestige | census | type |
|---|---|---|---|---|---|---|
| gov.administrators | 13.11 | 12351 | 11.16 | 68.8 | 1113 | prof |
| general.managers | 12.26 | 25879 | 4.02 | 69.1 | 1130 | prof |
| accountants | 12.77 | 9271 | 15.70 | 63.4 | 1171 | prof |
| purchasing.officers | 11.42 | 8865 | 9.11 | 56.8 | 1175 | prof |
| chemists | 14.62 | 8403 | 11.68 | 73.5 | 2111 | prof |
| physicists | 15.64 | 11030 | 5.13 | 77.6 | 2113 | prof |
## [1] 102 6
The data set has 102 observations and 6 variables. You can see the name of the variables and their corresponding class below.
Data dictionary:
education: average education of occupational incumbents, years, in 1971
income: average income of incumbents, dollars, in 1971
women: percentage of incumbents who are women
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid- 1960s
census: Canadian Census occupational code
type: a factor for type of occupation with levels bc (Blue Collar), prof (Professional, Managerial and Technical) and wc (White Collar)
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
The output shows that only the type variable is a factor, with three levels; the remaining variables are numeric.
Let’s see the summary of the data.
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
As you see, we have 4 missing values in the type variable. Since the percentage of missing observations (3.92%) is very low, we can remove them using the na.omit() function.
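A sketch of this step (the object name data matches the one used for the train/test split below):
data <- na.omit(Prestige)  # drop the 4 rows with a missing type
summary(data)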
## education income women prestige
## Min. : 6.380 Min. : 1656 Min. : 0.000 Min. :17.30
## 1st Qu.: 8.445 1st Qu.: 4250 1st Qu.: 3.268 1st Qu.:35.38
## Median :10.605 Median : 6036 Median :14.475 Median :43.60
## Mean :10.795 Mean : 6939 Mean :28.986 Mean :47.33
## 3rd Qu.:12.755 3rd Qu.: 8226 3rd Qu.:52.203 3rd Qu.:59.90
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3116 prof:31
## Median :5132 wc :23
## Mean :5400
## 3rd Qu.:8328
## Max. :9517
As you see, there are no longer any missing values in the data set.
## [1] 98 6
In this example, we are trying to predict the prestige variable, which is numeric. Therefore, we have a regression problem.
Before building the model, divide the data set into two parts.
set.seed(123)
trainindex <- sample(1:98, round(0.8*98))
train <- data[trainindex, ]
test <- data[-trainindex, ]
dim(train)
## [1] 78 6
In order to build the random forest, we can use the randomForest() function from the randomForest package.
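The model printed below was presumably produced with a call like this (rf1 is a hypothetical object name, since the fitting chunk is not echoed):
library(randomForest)
rf1 <- randomForest(prestige ~ ., data = train)
rf1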
##
## Call:
## randomForest(formula = prestige ~ ., data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 51.04686
## % Var explained: 83.26
The mean of squared residuals and the percentage of variance explained shown above are calculated using out-of-bag (OOB) error estimation: each tree is trained on a bootstrap sample containing roughly 2/3 of the training data, and the remaining 1/3 (the out-of-bag observations) is used to validate that tree. Also, the number of variables randomly selected at each split is 1.
You can use the plot() function to visualize the error against the number of trees grown; this helps you decide how many trees to grow.
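The plot discussed below can be produced as follows (a sketch, assuming the forest above is stored in rf1 as in the earlier sketch):
plot(rf1)  # OOB mean squared error versus the number of trees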
This plot is generally used to see how the error drops as we keep adding more trees and averaging them. When we look at this plot, however, we cannot draw such a conclusion.
Let’s increase the number of trees to grow using the ntree argument and build the forest again.
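A sketch of the call (rf2 is the object name referenced below when we inspect the OOB errors):
rf2 <- randomForest(prestige ~ ., data = train, ntree = 600)
rf2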
##
## Call:
## randomForest(formula = prestige ~ ., data = train, ntree = 600)
## Type of random forest: regression
## Number of trees: 600
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 49.31865
## % Var explained: 83.83
As we can see, the error seems to have stabilized after about 500 trees. The plotted error rate above is based on the OOB samples and can be accessed directly via rf2$mse. Thus, we can find the number of trees giving the lowest OOB error: 94 trees, with an average prestige-score error of 6.959482 (the square root of the corresponding OOB MSE).
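These two numbers are presumably obtained as follows (a sketch):
which.min(rf2$mse)                 # number of trees with the lowest OOB MSE
sqrt(rf2$mse[which.min(rf2$mse)])  # corresponding root mean squared error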
## [1] 94
## [1] 6.959482
Out of Bag Sample Errors and Error on Test set
The OOB error was used above to choose the number of trees. It can also be used to tune the mtry argument, the number of variables randomly sampled as candidates at each split. The random forest above randomly chose 1 variable at each split; we can now try every possible value of mtry from 1 to 5 (the number of predictors).
oob.err <- double(5)
test.err <- double(5)
# mtry is the number of variables randomly chosen at each split
for (mtry in 1:5) {
  rf <- randomForest(prestige ~ ., data = train, mtry = mtry, ntree = 520)
  oob.err[mtry] <- rf$mse[520]                  # OOB error using all 520 trees
  pred <- predict(rf, newdata = test[, -4])     # predictions on the test set (drop the response column)
  test.err[mtry] <- mean((test[, 4] - pred)^2)  # mean squared test error
  cat(mtry, " ")                                # print progress to the console
}
## 1 2 3 4 5
## [1] 50.24549 45.56053 44.81708 43.51478 42.76154
## [1] 266.1793 253.2689 231.4938 212.2652 188.9154
Now, we can visualize OOB and Test set error, then we can find the mtry value.
matplot(1:mtry, cbind(oob.err, test.err), pch = 20, col = c("red", "orange"), type = "b", ylab = "Mean Squared Error", xlab = "Number of Predictors Considered at each Split")
legend("topright", legend = c("Out of Bag Error", "Test Error"), pch = 19, col = c("red", "orange"))
The red line shows the OOB estimates and the orange line shows the error calculated on the test set. The two curves follow a similar pattern. As we can see, the error tends to be minimized around \(mtry=5\).
Therefore, we can build a random forest with ntree = 520 and mtry = 5.
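A sketch of the corresponding call (rf3 is a hypothetical object name; its printed summary follows):
rf3 <- randomForest(prestige ~ ., data = train, ntree = 520, mtry = 5)
rf3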
##
## Call:
## randomForest(formula = prestige ~ ., data = train, ntree = 520, mtry = 5)
## Type of random forest: regression
## Number of trees: 520
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 42.41467
## % Var explained: 86.09
So far, we have actually been tuning the parameters of the function manually. Luckily, we do not always have to do this by hand; instead, we can use the tuneRF() function to set the parameter values. Its usage is:
tuneRF(x, y, mtryStart, ntreeTry=50, stepFactor=2, improve=0.05,
trace=TRUE, plot=TRUE, doBest=FALSE, ...)
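The output below comes from a call of roughly this form (the exact arguments used are not shown, so treat this as an assumption):
tuneRF(x = train[, -4], y = train$prestige, ntreeTry = 520, stepFactor = 2, improve = 0.05)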
## mtry = 1 OOB error = 52.51773
## Searching left ...
## Searching right ...
## mtry = 2 OOB error = 47.11179
## 0.1029355 0.05
## mtry = 4 OOB error = 42.12466
## 0.1058574 0.05
## mtry = 5 OOB error = 47.35043
## -0.1240548 0.05
The function shows that the error tends to be minimized around mtry = 4, contrary to our earlier choice. Therefore, our best random forest model is built with 520 trees and 4 randomly selected variables at each split.
##
## Call:
## randomForest(formula = prestige ~ ., data = train, ntree = 520, mtry = 4)
## Type of random forest: regression
## Number of trees: 520
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 42.37231
## % Var explained: 86.1
After training a random forest, it is natural to ask which variables have the most predictive power. Variables with high importance are drivers of the outcome and their values have a significant impact on the outcome values. By contrast, variables with low importance might be omitted from a model, making it simpler and faster to fit and predict.
In order to draw a variable importance plot, we can use the varImpPlot() function. In this plot, the variables are sorted by an importance measure: the mean decrease in Gini impurity for classification forests, or the increase in node purity for a regression forest such as this one. The variable with the highest value can be considered the most important input for predicting the response variable.
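A sketch of the final model and the importance plot (rf_final is a hypothetical object name; the call matches the summary printed above):
rf_final <- randomForest(prestige ~ ., data = train, ntree = 520, mtry = 4)
varImpPlot(rf_final)  # for a regression forest, importance is shown as the increase in node purity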
The plot shows that census is the most important and significant feature on the prediction of prestige score.
Apply Linear Regression
As stated above, this is a regression problem, so let’s fit a linear regression to make a comparison.
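A sketch of the fit that produced the summary below (lm_fit is a hypothetical object name):
lm_fit <- lm(prestige ~ ., data = train)
summary(lm_fit)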
##
## Call:
## lm(formula = prestige ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4658 -5.7871 -0.2579 5.3222 20.3642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.640e+01 1.105e+01 -1.485 0.1421
## education 3.722e+00 8.790e-01 4.235 6.75e-05 ***
## income 1.458e-03 4.710e-04 3.096 0.0028 **
## women 3.976e-02 4.019e-02 0.989 0.3259
## census 1.445e-03 9.829e-04 1.470 0.1459
## typeprof 1.233e+01 6.665e+00 1.849 0.0686 .
## typewc 7.891e-01 4.563e+00 0.173 0.8632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.456 on 71 degrees of freedom
## Multiple R-squared: 0.8341, Adjusted R-squared: 0.82
## F-statistic: 59.48 on 6 and 71 DF, p-value: < 2.2e-16
As you see, the model has a very high \(R^{2}_{adj}\) value (0.82). Apart from education and income, the remaining variables do not contribute to the model significantly.
Random Forest and Linear Model Comparison
Mean squared error on the training data (OOB error for the random forest):
Random Forest
## [1] 42.37231
Linear Regression
## [1] 50.60217
Mean squared error on the test data:
Random Forest
## [1] 468.7358
Linear Regression
## [1] 483.1183
Therefore, we can say that RF outperforms LR.