# turn warnings off for cleaner output
options(warn=-1)

default_par = par(no.readonly = TRUE) # save default graphics settings (restore later with par(default_par))

Naive Bayes is one of the simplest classification algorithms, but it can also be very accurate. At a high level, Naive Bayes classifies instances based on attribute probabilities estimated from previously seen instances, assuming the attributes are completely independent of each other given the class. Oddly enough, this naive assumption usually works out to give you a good classifier.
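
To make that independence assumption concrete, here is a minimal from-scratch sketch of the idea with Gaussian per-attribute densities (an illustrative helper, not how the packages below implement it): a class’s posterior score is its prior times the product of the individual attribute likelihoods.

naive_bayes_posterior = function(x_new, x, y) {
  scores = sapply(levels(y), function(cl) {
    rows = x[y == cl, , drop=FALSE] # training instances of this class
    prior = mean(y == cl) # P(class)
    # the "naive" step: multiply the per-attribute Gaussian likelihoods
    lik = prod(mapply(function(v, col) dnorm(v, mean(col), sd(col)), x_new, rows))
    prior * lik
  })
  scores / sum(scores) # normalize so the posteriors sum to 1
}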

“Iris” data set:

Now, I am going to show you how to do Naive Bayes classification in R. First, you need to install a few packages, so time to boot up your R instance!
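
If you do not have them yet, both packages can be installed from CRAN:

install.packages(c("klaR", "caret")) # one-time install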

library("klaR")
## Loading required package: MASS
library("caret")
## Loading required package: lattice
## Loading required package: ggplot2

Caret is a very nice data mining package for R; it has tons of awesome features.

The package klaR contains our Naive Bayes classifier.

Everyone does the iris dataset first, so I won’t break that trend. Later, I will show you a much more interesting dataset. Load up the iris dataset and separate the labels from the attributes.

data("iris")
x = iris[,-5]
y = iris$Species

Now x holds all the attributes and y holds all the labels, and we can train our model.

library(e1071)
model = train(x,y,'nb',trControl=trainControl(method='cv',number=10))

This one line will generate a Naive Bayes model, using 10-fold cross-validation. From above, x is the attributes and y is the labels. The ‘nb’ tells the trainer to use Naive Bayes. The trainControl part tells the trainer to use cross-validation (‘cv’) with 10 folds.
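
If the one-liner feels dense, the exact same call can be written with the control object pulled out into its own variable:

ctrl = trainControl(method='cv', number=10) # 10-fold cross-validation
model = train(x, y, 'nb', trControl=ctrl)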

You can then print out the model:

model
## Naive Bayes 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa
##   FALSE      0.9600000  0.94 
##    TRUE      0.9533333  0.93 
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
##  and adjust = 1.

Awesome! We have a kappa of 0.94; life is good! One of the really cool things about caret’s train function is that it will fine-tune your model’s parameters (to a certain extent).
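
If you want a say in that tuning, you can pass train an explicit grid of parameter values to try. A sketch (the values here are arbitrary illustrations, not recommendations):

# fL toggles Laplace smoothing, usekernel toggles kernel density estimates
grid = expand.grid(fL=c(0,1), usekernel=c(FALSE,TRUE), adjust=1)
model_tuned = train(x, y, 'nb', trControl=trainControl(method='cv',number=10), tuneGrid=grid)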

Now that we have generated a classification model, how can we use it for prediction? Easy!

predict(model$finalModel,x)
## $class
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor virginica  versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor virginica  versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor virginica 
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  versicolor virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  versicolor
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  versicolor virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
## 
## $posterior
##               setosa   versicolor    virginica
##   [1,]  1.000000e+00 2.981309e-18 2.152373e-25
##   [2,]  1.000000e+00 3.169312e-17 6.938030e-25
##   [3,]  1.000000e+00 2.367113e-18 7.240956e-26
##   [4,]  1.000000e+00 3.069606e-17 8.690636e-25
##   [5,]  1.000000e+00 1.017337e-18 8.885794e-26
##   [6,]  1.000000e+00 2.717732e-14 4.344285e-21
##   [7,]  1.000000e+00 2.321639e-17 7.988271e-25
##   [8,]  1.000000e+00 1.390751e-17 8.166995e-25
##   [9,]  1.000000e+00 1.990156e-17 3.606469e-25
##  [10,]  1.000000e+00 7.378931e-18 3.615492e-25
##  [11,]  1.000000e+00 9.396089e-18 1.474623e-24
##  [12,]  1.000000e+00 3.461964e-17 2.093627e-24
##  [13,]  1.000000e+00 2.804520e-18 1.010192e-25
##  [14,]  1.000000e+00 1.799033e-19 6.060578e-27
##  [15,]  1.000000e+00 5.533879e-19 2.485033e-25
##  [16,]  1.000000e+00 6.273863e-17 4.509864e-23
##  [17,]  1.000000e+00 1.106658e-16 1.282419e-23
##  [18,]  1.000000e+00 4.841773e-17 2.350011e-24
##  [19,]  1.000000e+00 1.126175e-14 2.567180e-21
##  [20,]  1.000000e+00 1.808513e-17 1.963924e-24
##  [21,]  1.000000e+00 2.178382e-15 2.013989e-22
##  [22,]  1.000000e+00 1.210057e-15 7.788592e-23
##  [23,]  1.000000e+00 4.535220e-20 3.130074e-27
##  [24,]  1.000000e+00 3.147327e-11 8.175305e-19
##  [25,]  1.000000e+00 1.838507e-14 1.553757e-21
##  [26,]  1.000000e+00 6.873990e-16 1.830374e-23
##  [27,]  1.000000e+00 3.192598e-14 1.045146e-21
##  [28,]  1.000000e+00 1.542562e-17 1.274394e-24
##  [29,]  1.000000e+00 8.833285e-18 5.368077e-25
##  [30,]  1.000000e+00 9.557935e-17 3.652571e-24
##  [31,]  1.000000e+00 2.166837e-16 6.730536e-24
##  [32,]  1.000000e+00 3.940500e-14 1.546678e-21
##  [33,]  1.000000e+00 1.609092e-20 1.013278e-26
##  [34,]  1.000000e+00 7.222217e-20 4.261853e-26
##  [35,]  1.000000e+00 6.289348e-17 1.831694e-24
##  [36,]  1.000000e+00 2.850926e-18 8.874002e-26
##  [37,]  1.000000e+00 7.746279e-18 7.235628e-25
##  [38,]  1.000000e+00 8.623934e-20 1.223633e-26
##  [39,]  1.000000e+00 4.612936e-18 9.655450e-26
##  [40,]  1.000000e+00 2.009325e-17 1.237755e-24
##  [41,]  1.000000e+00 1.300634e-17 5.657689e-25
##  [42,]  1.000000e+00 1.577617e-15 5.717219e-24
##  [43,]  1.000000e+00 1.494911e-18 4.800333e-26
##  [44,]  1.000000e+00 1.076475e-10 3.721344e-18
##  [45,]  1.000000e+00 1.357569e-12 1.708326e-19
##  [46,]  1.000000e+00 3.882113e-16 5.587814e-24
##  [47,]  1.000000e+00 5.086735e-18 8.960156e-25
##  [48,]  1.000000e+00 5.012793e-18 1.636566e-25
##  [49,]  1.000000e+00 5.717245e-18 8.231337e-25
##  [50,]  1.000000e+00 7.713456e-18 3.349997e-25
##  [51,] 4.893048e-107 8.018653e-01 1.981347e-01
##  [52,] 7.920550e-100 9.429283e-01 5.707168e-02
##  [53,] 5.494369e-121 4.606254e-01 5.393746e-01
##  [54,]  1.129435e-69 9.999621e-01 3.789964e-05
##  [55,] 1.473329e-105 9.503408e-01 4.965916e-02
##  [56,]  1.931184e-89 9.990013e-01 9.986538e-04
##  [57,] 4.539099e-113 6.592515e-01 3.407485e-01
##  [58,]  2.549753e-34 9.999997e-01 3.119517e-07
##  [59,]  6.562814e-97 9.895385e-01 1.046153e-02
##  [60,]  5.000210e-69 9.998928e-01 1.071638e-04
##  [61,]  7.354548e-41 9.999997e-01 3.143915e-07
##  [62,]  4.799134e-86 9.958564e-01 4.143617e-03
##  [63,]  4.631287e-60 9.999925e-01 7.541274e-06
##  [64,] 1.052252e-103 9.850868e-01 1.491324e-02
##  [65,]  4.789799e-55 9.999700e-01 2.999393e-05
##  [66,]  1.514706e-92 9.787587e-01 2.124125e-02
##  [67,]  1.338348e-97 9.899311e-01 1.006893e-02
##  [68,]  2.026115e-62 9.999799e-01 2.007314e-05
##  [69,] 6.547473e-101 9.941996e-01 5.800427e-03
##  [70,]  3.016276e-58 9.999913e-01 8.739959e-06
##  [71,] 1.053341e-127 1.609361e-01 8.390639e-01
##  [72,]  1.248202e-70 9.997743e-01 2.256698e-04
##  [73,] 3.294753e-119 9.245812e-01 7.541876e-02
##  [74,]  1.314175e-95 9.979398e-01 2.060233e-03
##  [75,]  3.003117e-83 9.982736e-01 1.726437e-03
##  [76,]  2.536747e-92 9.865372e-01 1.346281e-02
##  [77,] 1.558909e-111 9.102260e-01 8.977398e-02
##  [78,] 7.014282e-136 7.989607e-02 9.201039e-01
##  [79,]  5.034528e-99 9.854957e-01 1.450433e-02
##  [80,]  1.439052e-41 9.999984e-01 1.601574e-06
##  [81,]  1.251567e-54 9.999955e-01 4.500139e-06
##  [82,]  8.769539e-48 9.999983e-01 1.742560e-06
##  [83,]  3.447181e-62 9.999664e-01 3.361987e-05
##  [84,] 1.087302e-132 6.134355e-01 3.865645e-01
##  [85,]  4.119852e-97 9.918297e-01 8.170260e-03
##  [86,] 1.140835e-102 8.734107e-01 1.265893e-01
##  [87,] 2.247339e-110 7.971795e-01 2.028205e-01
##  [88,]  4.870630e-88 9.992978e-01 7.022084e-04
##  [89,]  2.028672e-72 9.997620e-01 2.379898e-04
##  [90,]  2.227900e-69 9.999461e-01 5.390514e-05
##  [91,]  5.110709e-81 9.998510e-01 1.489819e-04
##  [92,]  5.774841e-99 9.885399e-01 1.146006e-02
##  [93,]  5.146736e-66 9.999591e-01 4.089540e-05
##  [94,]  1.332816e-34 9.999997e-01 2.716264e-07
##  [95,]  6.094144e-77 9.998034e-01 1.966331e-04
##  [96,]  1.424276e-72 9.998236e-01 1.764463e-04
##  [97,]  8.302641e-77 9.996692e-01 3.307548e-04
##  [98,]  1.835520e-82 9.988601e-01 1.139915e-03
##  [99,]  5.710350e-30 9.999997e-01 3.094739e-07
## [100,]  3.996459e-73 9.998204e-01 1.795726e-04
## [101,] 3.993755e-249 1.031032e-10 1.000000e+00
## [102,] 1.228659e-149 2.724406e-02 9.727559e-01
## [103,] 2.460661e-216 2.327488e-07 9.999998e-01
## [104,] 2.864831e-173 2.290954e-03 9.977090e-01
## [105,] 8.299884e-214 3.175384e-07 9.999997e-01
## [106,] 1.371182e-267 3.807455e-10 1.000000e+00
## [107,] 3.444090e-107 9.719885e-01 2.801154e-02
## [108,] 3.741929e-224 1.782047e-06 9.999982e-01
## [109,] 5.564644e-188 5.823191e-04 9.994177e-01
## [110,] 2.052443e-260 2.461662e-12 1.000000e+00
## [111,] 8.669405e-159 4.895235e-04 9.995105e-01
## [112,] 4.220200e-163 3.168643e-03 9.968314e-01
## [113,] 4.360059e-190 6.230821e-06 9.999938e-01
## [114,] 6.142256e-151 1.423414e-02 9.857659e-01
## [115,] 2.201426e-186 1.393247e-06 9.999986e-01
## [116,] 2.949945e-191 6.128385e-07 9.999994e-01
## [117,] 2.909076e-168 2.152843e-03 9.978472e-01
## [118,] 1.347608e-281 2.872996e-12 1.000000e+00
## [119,] 2.786402e-306 1.151469e-12 1.000000e+00
## [120,] 2.082510e-123 9.561626e-01 4.383739e-02
## [121,] 2.194169e-217 1.712166e-08 1.000000e+00
## [122,] 3.325791e-145 1.518718e-02 9.848128e-01
## [123,] 6.251357e-269 1.170872e-09 1.000000e+00
## [124,] 4.415135e-135 1.360432e-01 8.639568e-01
## [125,] 6.315716e-201 1.300512e-06 9.999987e-01
## [126,] 5.257347e-203 9.507989e-06 9.999905e-01
## [127,] 1.476391e-129 2.067703e-01 7.932297e-01
## [128,] 8.772841e-134 1.130589e-01 8.869411e-01
## [129,] 5.230800e-194 1.395719e-05 9.999860e-01
## [130,] 7.014892e-179 8.232518e-04 9.991767e-01
## [131,] 6.306820e-218 1.214497e-06 9.999988e-01
## [132,] 2.539020e-247 4.668891e-10 1.000000e+00
## [133,] 2.210812e-201 2.000316e-06 9.999980e-01
## [134,] 1.128613e-128 7.118948e-01 2.881052e-01
## [135,] 8.114869e-151 4.900992e-01 5.099008e-01
## [136,] 7.419068e-249 1.448050e-10 1.000000e+00
## [137,] 1.004503e-215 9.743357e-09 1.000000e+00
## [138,] 1.346716e-167 2.186989e-03 9.978130e-01
## [139,] 1.994716e-128 1.999894e-01 8.000106e-01
## [140,] 8.440466e-185 6.769126e-06 9.999932e-01
## [141,] 2.334365e-218 7.456220e-09 1.000000e+00
## [142,] 2.179139e-183 6.352663e-07 9.999994e-01
## [143,] 1.228659e-149 2.724406e-02 9.727559e-01
## [144,] 3.426814e-229 6.597015e-09 1.000000e+00
## [145,] 2.011574e-232 2.620636e-10 1.000000e+00
## [146,] 1.078519e-187 7.915543e-07 9.999992e-01
## [147,] 1.061392e-146 2.770575e-02 9.722942e-01
## [148,] 1.846900e-164 4.398402e-04 9.995602e-01
## [149,] 1.439996e-195 3.384156e-07 9.999997e-01
## [150,] 2.771480e-143 5.987903e-02 9.401210e-01

This will print out a bunch of lines. Near the top you can see the predicted classes, and in the bottom half you will see the posterior probabilities for each class. As we are only interested in the class predictions, we can grab just those with the following line.

predict(model$finalModel,x)$class
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor virginica  versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor virginica  versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor virginica 
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  versicolor virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  versicolor
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  versicolor virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

Let’s build a confusion matrix so that we can visualize the classification errors.

table(predict(model$finalModel,x)$class,y)
##             y
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          3        47

This will generate a confusion matrix of the predictions of your Naive Bayes model versus the actual classification of the data instances.
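
If you would rather not eyeball the raw table, caret’s confusionMatrix function computes the same matrix along with accuracy, kappa, and per-class statistics:

confusionMatrix(predict(model$finalModel,x)$class, y) # accuracy, kappa, sensitivity, etc.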

Now, what I have done here is actually a terrible idea. You never want to use the same data you trained on for testing, but this is only an example. I will provide a better example later on.
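
(If you want a quick proper split on iris right away, caret ships createDataPartition, which samples row indices while preserving the class proportions; model_holdout below is just an illustrative name.)

idx = createDataPartition(y, p=0.7, list=FALSE) # stratified 70% of row indices
model_holdout = train(x[idx,], y[idx], 'nb', trControl=trainControl(method='cv',number=10))
table(predict(model_holdout$finalModel, x[-idx,])$class, y[-idx]) # test on the unseen 30%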

That is basically how you do Naive Bayes classification in R with cross-validation.

“Spam Emails” data set:

Now, let’s try this on a more interesting dataset: spam emails.

library('ElemStatLearn')
library("klaR")
library("caret")

head(spam)
##    A.1  A.2  A.3 A.4  A.5  A.6  A.7  A.8  A.9 A.10 A.11 A.12 A.13 A.14
## 1 0.00 0.64 0.64   0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00 0.00
## 2 0.21 0.28 0.50   0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79 0.65 0.21
## 3 0.06 0.00 0.71   0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45 0.12 0.00
## 4 0.00 0.00 0.00   0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00
## 5 0.00 0.00 0.00   0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00
## 6 0.00 0.00 0.00   0 1.85 0.00 0.00 1.85 0.00 0.00 0.00 0.00 0.00 0.00
##   A.15 A.16 A.17 A.18 A.19 A.20 A.21 A.22 A.23 A.24 A.25 A.26 A.27 A.28
## 1 0.00 0.32 0.00 1.29 1.93 0.00 0.96    0 0.00 0.00    0    0    0    0
## 2 0.14 0.14 0.07 0.28 3.47 0.00 1.59    0 0.43 0.43    0    0    0    0
## 3 1.75 0.06 0.06 1.03 1.36 0.32 0.51    0 1.16 0.06    0    0    0    0
## 4 0.00 0.31 0.00 0.00 3.18 0.00 0.31    0 0.00 0.00    0    0    0    0
## 5 0.00 0.31 0.00 0.00 3.18 0.00 0.31    0 0.00 0.00    0    0    0    0
## 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00    0 0.00 0.00    0    0    0    0
##   A.29 A.30 A.31 A.32 A.33 A.34 A.35 A.36 A.37 A.38 A.39 A.40 A.41 A.42
## 1    0    0    0    0    0    0    0    0 0.00    0    0 0.00    0    0
## 2    0    0    0    0    0    0    0    0 0.07    0    0 0.00    0    0
## 3    0    0    0    0    0    0    0    0 0.00    0    0 0.06    0    0
## 4    0    0    0    0    0    0    0    0 0.00    0    0 0.00    0    0
## 5    0    0    0    0    0    0    0    0 0.00    0    0 0.00    0    0
## 6    0    0    0    0    0    0    0    0 0.00    0    0 0.00    0    0
##   A.43 A.44 A.45 A.46 A.47 A.48 A.49  A.50 A.51  A.52  A.53  A.54  A.55
## 1 0.00    0 0.00 0.00    0    0 0.00 0.000    0 0.778 0.000 0.000 3.756
## 2 0.00    0 0.00 0.00    0    0 0.00 0.132    0 0.372 0.180 0.048 5.114
## 3 0.12    0 0.06 0.06    0    0 0.01 0.143    0 0.276 0.184 0.010 9.821
## 4 0.00    0 0.00 0.00    0    0 0.00 0.137    0 0.137 0.000 0.000 3.537
## 5 0.00    0 0.00 0.00    0    0 0.00 0.135    0 0.135 0.000 0.000 3.537
## 6 0.00    0 0.00 0.00    0    0 0.00 0.223    0 0.000 0.000 0.000 3.000
##   A.56 A.57 spam
## 1   61  278 spam
## 2  101 1028 spam
## 3  485 2259 spam
## 4   40  191 spam
## 5   40  191 spam
## 6   15   54 spam
dim(spam)
## [1] 4601   58
index = sample(nrow(spam), floor(nrow(spam) * 0.7)) #70/30 split.
train = spam[index,]
test = spam[-index,]

xTrain = train[,-58] # attributes only; column 58 is the 'spam' label.
yTrain = train$spam # labels only.

xTest = test[,-58]
yTest = test$spam

model = train(xTrain,yTrain,'nb',trControl=trainControl(method='cv',number=10))

prop.table(table(predict(model$finalModel,xTest)$class,yTest)) # table() gives a frequency table; prop.table() converts it to proportions.
##        yTest
##              email       spam
##   email 0.33671253 0.01520637
##   spam  0.27081825 0.37726285

Here we take 70% of the dataset to train on, and then we test on the remaining 30%.
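
To boil that table down to a single accuracy number, sum the diagonal (the correct predictions) and divide by the total:

cm = table(predict(model$finalModel,xTest)$class, yTest)
sum(diag(cm)) / sum(cm) # overall accuracy on the held-out 30%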

These results will be different on each run, as sample is a random function. The results aren’t perfect, but for a very simple classifier, they are surprisingly good!
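
One last tip: if you want a reproducible split (and therefore reproducible numbers), fix the random seed before calling sample; the seed value itself is arbitrary:

set.seed(123) # any fixed seed makes sample() reproducible
index = sample(nrow(spam), floor(nrow(spam) * 0.7))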