Introduction

In this document I attempt a naïve Bayes analysis of made-up data in order to deepen my understanding of the process. I will restrict the features to categorical variables for simplicity, and I will sanity-check my findings against an existing software package's results.

Overview

Naïve Bayes predicts the class of new observations from the class distribution of existing data, under two important assumptions:

  1. The data are independent and identically distributed (iid)
  2. The features are conditionally independent given the class

It relies on Bayes' theorem:

\[ P(class|data) = \frac{P(data|class)P(class)}{P(data)} \]

As \(P(data)\) is constant across classes, the implementation dispenses with the denominator and relies on:

\[ P(class|data) \propto P(data|class)P(class) \]

The conditional independence assumption arrives in the decomposition of \(P(data|class)\). With observations indexed by \(i\), features indexed by \(j\), and classes \(C_k\), the probability of the \(i^{th}\) observation belonging to class \(C_k\) satisfies:

\[ P(y_i \in C_k \mid x_i) \propto P(C_k)\prod_j P(x_{ij}|C_k) \]

The assigned class is the one for which the above product is maximized:

\[ C_i = \mathop{\mathrm{arg\,max}}_k \, P(C_k)\prod_j P(x_{ij}|C_k) \]
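
To make the decision rule concrete, here is a toy sketch in R that scores a single observation against each class and picks the arg max. The probabilities are invented for illustration and are not drawn from the data below.

# Toy illustration of the decision rule with invented probabilities.
prior <- c(A = 0.5, B = 0.3, C = 0.2)          # P(C_k) for three classes
cond  <- rbind(A = c(0.10, 0.40),              # P(x_j | C_k): one row per class,
               B = c(0.30, 0.20),              # one column per feature of the
               C = c(0.25, 0.25))              # observation being scored
score <- prior * apply(cond, 1, prod)          # P(C_k) * prod_j P(x_j | C_k)
names(which.max(score))                        # the predicted class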

Data

I will randomly generate a target variable and three categorical features. The danger with this is that purely random data contains no signal to extract from the noise.

library(caret)
library(data.table)
set.seed(18426)
y <- sample(c("Animal", "Vegetable", "Mineral"), 50, replace = TRUE)
x1 <- sample(c("Green", "Red", "Blue", "Orange", "Yellow", "White", "Black",
               "Pink"), 50, replace = TRUE)
x2 <- sample(c("NorAm", "SouAm", "Eur", "Aisa", "Afr", "Aus"), 50, replace = TRUE)
x3 <- sample(c("Sphere", "Cube", "Pyramid", "Cylinder"), 50, replace = TRUE)
DT <- data.table(y = y, x1 = x1, x2 = x2, x3 = x3)
knitr::kable(head(DT, 10), caption = "First 10 Entries", format = 'html')
Table: First 10 Entries

|y         |x1     |x2    |x3      |
|:---------|:------|:-----|:-------|
|Vegetable |Pink   |Aus   |Pyramid |
|Mineral   |Black  |SouAm |Sphere  |
|Mineral   |Red    |Aisa  |Sphere  |
|Mineral   |Pink   |SouAm |Sphere  |
|Animal    |Yellow |NorAm |Cube    |
|Vegetable |White  |Eur   |Cube    |
|Animal    |White  |SouAm |Pyramid |
|Animal    |Blue   |SouAm |Sphere  |
|Mineral   |Blue   |NorAm |Cube    |
|Vegetable |Green  |SouAm |Sphere  |

Modeling

Training & Testing

set.seed(142)
trIDX <- createDataPartition(DT$y, p = 0.75)
trnSet <- DT[trIDX$Resample1]
tstSet <- DT[-trIDX$Resample1]

Marginal Class Probabilities

This section calculates \(P(C_k)\) for each class. This is simply the count of each class in the training set as a proportion of the total number of training observations.

ntrn <-  dim(trnSet)[[1]]
classP <- trnSet[, .(p = .N / ntrn), by = y]
classP[]
##            y         p
## 1: Vegetable 0.2368421
## 2:   Mineral 0.3947368
## 3:    Animal 0.3684211

Conditional Class Probabilities

This section calculates \(P(data|class)\). This is a little more complicated. What is being measured is the fraction of observations having a particular feature value among those of class \(C_k\), so the denominator changes from class to class. The code below uses data.table to set st to the number of observations for each class y, and then divides the within-class counts of x1, x2, and x3 by it.

condPX1 <- trnSet[, {st = .N;
                     .SD[, .(p = .N / st), by = x1]
                    }, by = y]
condPX2 <- trnSet[, {st = .N;
                     .SD[, .(p = .N / st), by = x2]
                    }, by = y]
condPX3 <- trnSet[, {st = .N;
                     .SD[, .(p = .N / st), by = x3]
                    }, by = y]
condPX1[]
##             y     x1          p
##  1: Vegetable   Pink 0.22222222
##  2: Vegetable  White 0.11111111
##  3: Vegetable  Green 0.33333333
##  4: Vegetable Orange 0.22222222
##  5: Vegetable Yellow 0.11111111
##  6:   Mineral  Black 0.13333333
##  7:   Mineral    Red 0.13333333
##  8:   Mineral   Blue 0.20000000
##  9:   Mineral Orange 0.06666667
## 10:   Mineral  White 0.20000000
## 11:   Mineral  Green 0.20000000
## 12:   Mineral   Pink 0.06666667
## 13:    Animal  White 0.07142857
## 14:    Animal Yellow 0.21428571
## 15:    Animal   Blue 0.14285714
## 16:    Animal  Black 0.14285714
## 17:    Animal    Red 0.28571429
## 18:    Animal   Pink 0.07142857
## 19:    Animal Orange 0.07142857
condPX2[]
##             y    x2          p
##  1: Vegetable   Aus 0.22222222
##  2: Vegetable   Eur 0.11111111
##  3: Vegetable SouAm 0.22222222
##  4: Vegetable NorAm 0.22222222
##  5: Vegetable  Aisa 0.22222222
##  6:   Mineral SouAm 0.26666667
##  7:   Mineral  Aisa 0.13333333
##  8:   Mineral NorAm 0.46666667
##  9:   Mineral   Aus 0.06666667
## 10:   Mineral   Eur 0.06666667
## 11:    Animal SouAm 0.21428571
## 12:    Animal NorAm 0.21428571
## 13:    Animal   Aus 0.28571429
## 14:    Animal  Aisa 0.14285714
## 15:    Animal   Afr 0.14285714
condPX3[]
##             y       x3          p
##  1: Vegetable  Pyramid 0.22222222
##  2: Vegetable     Cube 0.44444444
##  3: Vegetable   Sphere 0.33333333
##  4:   Mineral   Sphere 0.46666667
##  5:   Mineral     Cube 0.26666667
##  6:   Mineral  Pyramid 0.20000000
##  7:   Mineral Cylinder 0.06666667
##  8:    Animal  Pyramid 0.35714286
##  9:    Animal Cylinder 0.21428571
## 10:    Animal     Cube 0.21428571
## 11:    Animal   Sphere 0.21428571
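
As a quick cross-check (not part of the original workflow), the same conditional probabilities can be recovered with base R by row-normalizing a contingency table of class against feature; the non-zero entries of each row should match the corresponding condPX1 values.

# P(x1 | y) as row proportions of a class-by-feature contingency table;
# unseen combinations show up as 0 rather than being absent.
round(prop.table(table(trnSet$y, trnSet$x1), margin = 1), 4)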

Predictions

Given these class and conditional probabilities, the unnormalized (proportional) posterior probabilities for the test data can be calculated.

xtst <- tstSet[, -1]
ntst <- dim(xtst)[[1]]

# When you can, pre-allocate your data structures. It is faster and more
# memory efficient. 
yPreds <- data.table(Vegetable = double(ntst),
                     Mineral = double(ntst),
                     Animal = double(ntst))

# This is a four-way merge. First x1 probabilities are joined to the test set.
# Then the x2, x3, and class probabilities as well. Once that is done, the
# probabilities are multiplied per naïve Bayes assumptions, and this value is
# stored in the appropriate cell of the output table, yPreds. For exposition
# the temporary join table is printed inside the loop.
for (i in seq_along(classP$y)) {
   expandedProbs <- classP[condPX3[y == classP$y[i]
                                   ][condPX2[y == classP$y[i]
                                             ][condPX1[y == classP$y[i]
                                                       ][xtst, on = 'x1',
                                                         all = TRUE],
                                               on = 'x2'],
                                     on = 'x3'],
                           on = 'y']
  print(knitr::kable(expandedProbs, caption = paste("Probabilities for class",
                                              classP$y[i])))
  cat('\n\n<!-- -->\n\n')
   yPreds[, i] <- expandedProbs[, p * i.p * i.p.2 * i.p.1]
   
}
## 
## 
## Table: Probabilities for class Vegetable
## 
## |y         |         p|x3       |       i.p|i.y       |x2    |     i.p.2|i.y.1     |x1     |     i.p.1|
## |:---------|---------:|:--------|---------:|:---------|:-----|---------:|:---------|:------|---------:|
## |Vegetable | 0.2368421|Sphere   | 0.3333333|Vegetable |SouAm | 0.2222222|Vegetable |Pink   | 0.2222222|
## |Vegetable | 0.2368421|Cube     | 0.4444444|Vegetable |NorAm | 0.2222222|Vegetable |Yellow | 0.1111111|
## |Vegetable | 0.2368421|Sphere   | 0.3333333|Vegetable |SouAm | 0.2222222|NA        |Blue   |        NA|
## |NA        |        NA|Cylinder |        NA|Vegetable |Eur   | 0.1111111|Vegetable |White  | 0.1111111|
## |NA        |        NA|Cylinder |        NA|NA        |Afr   |        NA|NA        |Blue   |        NA|
## |Vegetable | 0.2368421|Pyramid  | 0.2222222|Vegetable |Eur   | 0.1111111|NA        |Blue   |        NA|
## |Vegetable | 0.2368421|Sphere   | 0.3333333|Vegetable |Aus   | 0.2222222|Vegetable |Yellow | 0.1111111|
## |Vegetable | 0.2368421|Cube     | 0.4444444|Vegetable |NorAm | 0.2222222|Vegetable |White  | 0.1111111|
## |Vegetable | 0.2368421|Sphere   | 0.3333333|Vegetable |Aisa  | 0.2222222|NA        |Blue   |        NA|
## |NA        |        NA|Cylinder |        NA|NA        |Afr   |        NA|Vegetable |Green  | 0.3333333|
## |Vegetable | 0.2368421|Cube     | 0.4444444|Vegetable |Aus   | 0.2222222|NA        |Red    |        NA|
## |Vegetable | 0.2368421|Cube     | 0.4444444|NA        |Afr   |        NA|NA        |Red    |        NA|
## 
## 
## <!-- -->
## 
## 
## 
## Table: Probabilities for class Mineral
## 
## |y       |         p|x3       |       i.p|i.y     |x2    |     i.p.2|i.y.1   |x1     |     i.p.1|
## |:-------|---------:|:--------|---------:|:-------|:-----|---------:|:-------|:------|---------:|
## |Mineral | 0.3947368|Sphere   | 0.4666667|Mineral |SouAm | 0.2666667|Mineral |Pink   | 0.0666667|
## |Mineral | 0.3947368|Cube     | 0.2666667|Mineral |NorAm | 0.4666667|NA      |Yellow |        NA|
## |Mineral | 0.3947368|Sphere   | 0.4666667|Mineral |SouAm | 0.2666667|Mineral |Blue   | 0.2000000|
## |Mineral | 0.3947368|Cylinder | 0.0666667|Mineral |Eur   | 0.0666667|Mineral |White  | 0.2000000|
## |Mineral | 0.3947368|Cylinder | 0.0666667|NA      |Afr   |        NA|Mineral |Blue   | 0.2000000|
## |Mineral | 0.3947368|Pyramid  | 0.2000000|Mineral |Eur   | 0.0666667|Mineral |Blue   | 0.2000000|
## |Mineral | 0.3947368|Sphere   | 0.4666667|Mineral |Aus   | 0.0666667|NA      |Yellow |        NA|
## |Mineral | 0.3947368|Cube     | 0.2666667|Mineral |NorAm | 0.4666667|Mineral |White  | 0.2000000|
## |Mineral | 0.3947368|Sphere   | 0.4666667|Mineral |Aisa  | 0.1333333|Mineral |Blue   | 0.2000000|
## |Mineral | 0.3947368|Cylinder | 0.0666667|NA      |Afr   |        NA|Mineral |Green  | 0.2000000|
## |Mineral | 0.3947368|Cube     | 0.2666667|Mineral |Aus   | 0.0666667|Mineral |Red    | 0.1333333|
## |Mineral | 0.3947368|Cube     | 0.2666667|NA      |Afr   |        NA|Mineral |Red    | 0.1333333|
## 
## 
## <!-- -->
## 
## 
## 
## Table: Probabilities for class Animal
## 
## |y      |         p|x3       |       i.p|i.y    |x2    |     i.p.2|i.y.1  |x1     |     i.p.1|
## |:------|---------:|:--------|---------:|:------|:-----|---------:|:------|:------|---------:|
## |Animal | 0.3684211|Sphere   | 0.2142857|Animal |SouAm | 0.2142857|Animal |Pink   | 0.0714286|
## |Animal | 0.3684211|Cube     | 0.2142857|Animal |NorAm | 0.2142857|Animal |Yellow | 0.2142857|
## |Animal | 0.3684211|Sphere   | 0.2142857|Animal |SouAm | 0.2142857|Animal |Blue   | 0.1428571|
## |Animal | 0.3684211|Cylinder | 0.2142857|NA     |Eur   |        NA|Animal |White  | 0.0714286|
## |Animal | 0.3684211|Cylinder | 0.2142857|Animal |Afr   | 0.1428571|Animal |Blue   | 0.1428571|
## |Animal | 0.3684211|Pyramid  | 0.3571429|NA     |Eur   |        NA|Animal |Blue   | 0.1428571|
## |Animal | 0.3684211|Sphere   | 0.2142857|Animal |Aus   | 0.2857143|Animal |Yellow | 0.2142857|
## |Animal | 0.3684211|Cube     | 0.2142857|Animal |NorAm | 0.2142857|Animal |White  | 0.0714286|
## |Animal | 0.3684211|Sphere   | 0.2142857|Animal |Aisa  | 0.1428571|Animal |Blue   | 0.1428571|
## |Animal | 0.3684211|Cylinder | 0.2142857|Animal |Afr   | 0.1428571|NA     |Green  |        NA|
## |Animal | 0.3684211|Cube     | 0.2142857|Animal |Aus   | 0.2857143|Animal |Red    | 0.2857143|
## |Animal | 0.3684211|Cube     | 0.2142857|Animal |Afr   | 0.1428571|Animal |Red    | 0.2857143|
## 
## 
## <!-- -->

The tables above show the class probabilities and the probabilities of the features given the classes for each row of the test set.
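
As a sanity check on the arithmetic, the score for the first row of the tables above (x1 = Pink, x2 = SouAm, x3 = Sphere) under the Vegetable class can be assembled by hand from classP and the conditional probability tables; it should come out to roughly 0.0039, matching the first entry of the Vegetable column below. This snippet is purely illustrative and not part of the pipeline.

# Score for the first test observation under class Vegetable, term by term.
pC <- classP[y == "Vegetable", p]                    # P(Vegetable)
p1 <- condPX1[y == "Vegetable" & x1 == "Pink", p]    # P(Pink | Vegetable)
p2 <- condPX2[y == "Vegetable" & x2 == "SouAm", p]   # P(SouAm | Vegetable)
p3 <- condPX3[y == "Vegetable" & x3 == "Sphere", p]  # P(Sphere | Vegetable)
pC * p1 * p2 * p3                                    # approximately 0.0039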

The product of all the probabilities, with 0 substituted for NA, is shown below.

yPreds[is.na(yPreds)] <- 0
predictions <- classP$y[max.col(yPreds)]
yPreds[, `:=`(Prediction = predictions,
              Actual = tstSet$y)]
knitr::kable(yPreds, caption = "Naïve Bayes Probabilities")
Table: Naïve Bayes Probabilities

| Vegetable|   Mineral|    Animal|Prediction |Actual    |
|---------:|---------:|---------:|:----------|:---------|
| 0.0038986| 0.0032749| 0.0012084|Vegetable  |Mineral   |
| 0.0025991| 0.0000000| 0.0036251|Animal     |Animal    |
| 0.0000000| 0.0098246| 0.0024168|Mineral    |Animal    |
| 0.0000000| 0.0003509| 0.0000000|Mineral    |Vegetable |
| 0.0000000| 0.0000000| 0.0016112|Animal     |Animal    |
| 0.0000000| 0.0010526| 0.0000000|Mineral    |Vegetable |
| 0.0019493| 0.0000000| 0.0048335|Animal     |Mineral   |
| 0.0025991| 0.0098246| 0.0012084|Mineral    |Animal    |
| 0.0000000| 0.0049123| 0.0016112|Mineral    |Mineral   |
| 0.0000000| 0.0000000| 0.0000000|Animal     |Mineral   |
| 0.0000000| 0.0009357| 0.0064447|Animal     |Vegetable |
| 0.0000000| 0.0000000| 0.0032223|Animal     |Mineral   |

Finally, the predictions and the confusion matrix for them are shown below:

ytst <- factor(tstSet$y, levels = classP$y)
confusionMatrix(data = factor(predictions, levels = classP$y),
                reference = ytst)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Vegetable Mineral Animal
##   Vegetable         0       1      0
##   Mineral           2       1      2
##   Animal            1       3      2
## 
## Overall Statistics
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0549, 0.5719)
##     No Information Rate : 0.4167          
##     P-Value [Acc > NIR] : 0.9329          
##                                           
##                   Kappa : -0.1739         
##                                           
##  Mcnemar's Test P-Value : 0.6746          
## 
## Statistics by Class:
## 
##                      Class: Vegetable Class: Mineral Class: Animal
## Sensitivity                   0.00000        0.20000        0.5000
## Specificity                   0.88889        0.42857        0.5000
## Pos Pred Value                0.00000        0.20000        0.3333
## Neg Pred Value                0.72727        0.42857        0.6667
## Prevalence                    0.25000        0.41667        0.3333
## Detection Rate                0.00000        0.08333        0.1667
## Detection Prevalence          0.08333        0.41667        0.5000
## Balanced Accuracy             0.44444        0.31429        0.5000

Discussion

The results are awful. The accuracy is below the no-information rate. However, that may be a result of the test set containing feature/class combinations not found in the training set. For example, looking at the table of predictions, we see that row 4 for Vegetable is 0. This is because there are no examples of cylindrical vegetables in the training set, so the conditional probability of \(x_3\) being Cylinder given the class Vegetable is 0.

trnSet[x3 == 'Cylinder'][]
##          y    x1    x2       x3
## 1:  Animal Black   Aus Cylinder
## 2:  Animal  Blue   Aus Cylinder
## 3:  Animal   Red  Aisa Cylinder
## 4: Mineral Black SouAm Cylinder

Similarly, there is no Animal from Europe in the training set.

trnSet[x2 == 'Eur'][]
##            y    x1  x2      x3
## 1: Vegetable White Eur    Cube
## 2:   Mineral Green Eur Pyramid

So the prediction can only be Mineral, as that is the only class with training instances of “White”, “Eur”, and “Cylinder”. They do not all occur in the same observation, but since naïve Bayes assumes conditional independence, the product of the class prior and the conditional probabilities is greater than 0.

trnSet[y == 'Mineral' & (x1 == 'White' | x2 == 'Eur' | x3 == 'Cylinder')]
##          y    x1    x2       x3
## 1: Mineral White NorAm   Sphere
## 2: Mineral White NorAm  Pyramid
## 3: Mineral Green   Eur  Pyramid
## 4: Mineral White SouAm   Sphere
## 5: Mineral Black SouAm Cylinder

Yet a white, European, cylindrical Vegetable does exist in the test set, in row 4. This highlights one of the shortcomings of naïve Bayes: when the training set is too small, some feature/class combinations never appear, their conditional probabilities are estimated as 0, and the whole product collapses to 0 for that class.

tstSet[4]
##            y    x1  x2       x3
## 1: Vegetable White Eur Cylinder
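
A common remedy, not used in this analysis, is Laplace (add-one) smoothing: pretend every feature value was observed once more than it actually was within each class, so no conditional probability is estimated as exactly zero. Below is a minimal sketch for \(P(x_3|y)\) using the training set defined earlier; condPX3s is a hypothetical name for the smoothed table.

# Add-one smoothed estimate of P(x3 | y): add 1 to every within-class count
# and add the number of levels to the denominator, so unseen feature/class
# combinations get a small positive probability instead of zero.
x3lvls <- c("Sphere", "Cube", "Pyramid", "Cylinder")
condPX3s <- trnSet[, .(x3 = x3lvls,
                       p  = (tabulate(factor(x3, levels = x3lvls),
                                      nbins = length(x3lvls)) + 1) /
                            (.N + length(x3lvls))), by = y]
condPX3s[y == "Vegetable" & x3 == "Cylinder"]   # now greater than zero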

Comparison with Statistical Package

For comparison, below are results using the caret and bnclassify packages (caret's method = 'nbDiscrete' calls bnclassify under the hood). Setting method = 'none' in trainControl fits the model once on the entire training set instead of using cross-validation or bootstrapping, which allows a direct comparison with the manual run.

xtrn <- data.table(x1 = factor(trnSet$x1,
                               levels = c("Green", "Red", "Blue", "Orange",
                                          "Yellow", "White", "Black", "Pink")),
                   x2 = factor(trnSet$x2,
                               levels = c("NorAm", "SouAm", "Eur", "Aisa",
                                          "Afr", "Aus")),
                   x3 = factor(trnSet$x3,
                               levels = c("Sphere", "Cube", "Pyramid",
                                          "Cylinder")))
ytrn <- factor(trnSet$y, levels = classP$y)

xtst <- data.table(x1 = factor(tstSet$x1,
                               levels = c("Green", "Red", "Blue", "Orange",
                                          "Yellow", "White", "Black", "Pink")),
                   x2 = factor(tstSet$x2,
                               levels = c("NorAm", "SouAm", "Eur", "Aisa",
                                          "Afr", "Aus")),
                   x3 = factor(tstSet$x3,
                               levels = c("Sphere", "Cube", "Pyramid",
                                          "Cylinder")))

                   
trC <- trainControl(method = 'none', classProbs = TRUE,
                    summaryFunction = multiClassSummary)

Fit <- train(x = xtrn, y = ytrn, method = 'nbDiscrete', trControl = trC)
Fit
## Naive Bayes Classifier 
## 
## 38 samples
##  3 predictor
##  3 classes: 'Vegetable', 'Mineral', 'Animal' 
## 
## No pre-processing
## Resampling: None
FitP <- predict(Fit, xtst)
confusionMatrix(data = FitP,
                reference = ytst)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Vegetable Mineral Animal
##   Vegetable         0       1      0
##   Mineral           2       1      2
##   Animal            1       3      2
## 
## Overall Statistics
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0549, 0.5719)
##     No Information Rate : 0.4167          
##     P-Value [Acc > NIR] : 0.9329          
##                                           
##                   Kappa : -0.1739         
##                                           
##  Mcnemar's Test P-Value : 0.6746          
## 
## Statistics by Class:
## 
##                      Class: Vegetable Class: Mineral Class: Animal
## Sensitivity                   0.00000        0.20000        0.5000
## Specificity                   0.88889        0.42857        0.5000
## Pos Pred Value                0.00000        0.20000        0.3333
## Neg Pred Value                0.72727        0.42857        0.6667
## Prevalence                    0.25000        0.41667        0.3333
## Detection Rate                0.00000        0.08333        0.1667
## Detection Prevalence          0.08333        0.41667        0.5000
## Balanced Accuracy             0.44444        0.31429        0.5000

As expected, the results are exactly the same.

Larger Data Set

To test the hypothesis that the poor results are a function of the paucity of the training set, a much larger sample will be created and run.

set.seed(8176)
n <- 10000
y <- sample(c("Animal", "Vegetable", "Mineral"), n, replace = TRUE)
x1 <- sample(c("Green", "Red", "Blue", "Orange", "Yellow", "White", "Black",
               "Pink"), n, replace = TRUE)
x2 <- sample(c("NorAm", "SouAm", "Eur", "Aisa", "Afr", "Aus"), n,
             replace = TRUE)
x3 <- sample(c("Sphere", "Cube", "Pyramid", "Cylinder"), n, replace = TRUE)
DT <- data.table(y = y, x1 = x1, x2 = x2, x3 = x3)
trIDX <- createDataPartition(DT$y, p = 0.75)
trnSet <- DT[trIDX$Resample1]
tstSet <- DT[-trIDX$Resample1]
ntrn <-  dim(trnSet)[[1]]
classP <- trnSet[, .(p = .N / ntrn), by = y]
classP[]
##            y      p
## 1: Vegetable 0.3340
## 2:    Animal 0.3304
## 3:   Mineral 0.3356
condPX1 <- trnSet[, {st = .N;
                     .SD[, .(p = .N / st), by = x1]
                    }, by = y]
condPX2 <- trnSet[, {st = .N;
                     .SD[, .(p = .N / st), by = x2]
                    }, by = y]
condPX3 <- trnSet[, {st = .N;
                     .SD[, .(p = .N / st), by = x3]
                    }, by = y]
condPX1[]
##             y     x1         p
##  1: Vegetable   Blue 0.1325349
##  2: Vegetable  Black 0.1221557
##  3: Vegetable Orange 0.1173653
##  4: Vegetable  White 0.1453094
##  5: Vegetable   Pink 0.1217565
##  6: Vegetable  Green 0.1209581
##  7: Vegetable    Red 0.1229541
##  8: Vegetable Yellow 0.1169661
##  9:    Animal Orange 0.1283293
## 10:    Animal  Green 0.1230831
## 11:    Animal  Black 0.1234867
## 12:    Animal  White 0.1230831
## 13:    Animal    Red 0.1255044
## 14:    Animal   Pink 0.1368039
## 15:    Animal   Blue 0.1299435
## 16:    Animal Yellow 0.1097659
## 17:   Mineral   Blue 0.1299166
## 18:   Mineral    Red 0.1068733
## 19:   Mineral  Black 0.1283274
## 20:   Mineral  Green 0.1279301
## 21:   Mineral Orange 0.1338896
## 22:   Mineral  White 0.1239571
## 23:   Mineral Yellow 0.1279301
## 24:   Mineral   Pink 0.1211760
##             y     x1         p
condPX2[]
##             y    x2         p
##  1: Vegetable SouAm 0.1676647
##  2: Vegetable   Eur 0.1768463
##  3: Vegetable   Afr 0.1540918
##  4: Vegetable NorAm 0.1796407
##  5: Vegetable  Aisa 0.1624750
##  6: Vegetable   Aus 0.1592814
##  7:    Animal SouAm 0.1626312
##  8:    Animal  Aisa 0.1614205
##  9:    Animal NorAm 0.1836158
## 10:    Animal   Eur 0.1602098
## 11:    Animal   Aus 0.1702986
## 12:    Animal   Afr 0.1618241
## 13:   Mineral   Afr 0.1720302
## 14:   Mineral   Eur 0.1656734
## 15:   Mineral   Aus 0.1624950
## 16:   Mineral SouAm 0.1779897
## 17:   Mineral NorAm 0.1573302
## 18:   Mineral  Aisa 0.1644815
condPX3[]
##             y       x3         p
##  1: Vegetable   Sphere 0.2379242
##  2: Vegetable     Cube 0.2590818
##  3: Vegetable  Pyramid 0.2530938
##  4: Vegetable Cylinder 0.2499002
##  5:    Animal  Pyramid 0.2401130
##  6:    Animal Cylinder 0.2510089
##  7:    Animal   Sphere 0.2469734
##  8:    Animal     Cube 0.2619048
##  9:   Mineral  Pyramid 0.2487088
## 10:   Mineral     Cube 0.2495034
## 11:   Mineral Cylinder 0.2566547
## 12:   Mineral   Sphere 0.2451331
xtst <- tstSet[, -1]
ntst <- dim(xtst)[[1]]
yPreds <- data.table(Vegetable = double(ntst),
                     Mineral = double(ntst),
                     Animal = double(ntst))
for (i in seq_along(classP$y)) {
   expandedProbs <- classP[condPX3[y == classP$y[i]
                                   ][condPX2[y == classP$y[i]
                                             ][condPX1[y == classP$y[i]
                                                       ][xtst, on = 'x1',
                                                         all = TRUE],
                                               on = 'x2'],
                                     on = 'x3'],
                           on = 'y']
   yPreds[, i] <- expandedProbs[, p * i.p * i.p.2 * i.p.1]
   
}
yPreds[is.na(yPreds)] <- 0
predictions <- classP$y[max.col(yPreds)]
yPreds[, `:=`(Prediction = predictions,
              Actual = tstSet$y)]
ytst <- factor(tstSet$y, levels = classP$y)
confusionMatrix(data = factor(predictions, levels = classP$y),
                reference = ytst)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Vegetable Animal Mineral
##   Vegetable       241    235     248
##   Animal          226    232     224
##   Mineral         368    359     367
## 
## Overall Statistics
##                                           
##                Accuracy : 0.336           
##                  95% CI : (0.3175, 0.3549)
##     No Information Rate : 0.3356          
##     P-Value [Acc > NIR] : 0.4906          
##                                           
##                   Kappa : 0.0034          
##                                           
##  Mcnemar's Test P-Value : 7.527e-12       
## 
## Statistics by Class:
## 
##                      Class: Vegetable Class: Animal Class: Mineral
## Sensitivity                    0.2886        0.2809         0.4374
## Specificity                    0.7099        0.7312         0.5623
## Pos Pred Value                 0.3329        0.3402         0.3355
## Neg Pred Value                 0.6655        0.6733         0.6643
## Prevalence                     0.3340        0.3304         0.3356
## Detection Rate                 0.0964        0.0928         0.1468
## Detection Prevalence           0.2896        0.2728         0.4376
## Balanced Accuracy              0.4993        0.5060         0.4999
xtrn <- data.table(x1 = factor(trnSet$x1,
                               levels = c("Green", "Red", "Blue", "Orange",
                                          "Yellow", "White", "Black", "Pink")),
                   x2 = factor(trnSet$x2,
                               levels = c("NorAm", "SouAm", "Eur", "Aisa",
                                          "Afr", "Aus")),
                   x3 = factor(trnSet$x3,
                               levels = c("Sphere", "Cube", "Pyramid",
                                          "Cylinder")))
ytrn <- factor(trnSet$y, levels = classP$y)

xtst <- data.table(x1 = factor(tstSet$x1,
                               levels = c("Green", "Red", "Blue", "Orange",
                                          "Yellow", "White", "Black", "Pink")),
                   x2 = factor(tstSet$x2,
                               levels = c("NorAm", "SouAm", "Eur", "Aisa",
                                          "Afr", "Aus")),
                   x3 = factor(tstSet$x3,
                               levels = c("Sphere", "Cube", "Pyramid",
                                          "Cylinder")))

                   
trC <- trainControl(method = 'none', classProbs = TRUE,
                    summaryFunction = multiClassSummary)

Fit <- train(x = xtrn, y = ytrn, method = 'nbDiscrete', trControl = trC)
FitP <- predict(Fit, xtst)
confusionMatrix(data = FitP,
                reference = ytst)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Vegetable Animal Mineral
##   Vegetable       241    235     248
##   Animal          226    232     224
##   Mineral         368    359     367
## 
## Overall Statistics
##                                           
##                Accuracy : 0.336           
##                  95% CI : (0.3175, 0.3549)
##     No Information Rate : 0.3356          
##     P-Value [Acc > NIR] : 0.4906          
##                                           
##                   Kappa : 0.0034          
##                                           
##  Mcnemar's Test P-Value : 7.527e-12       
## 
## Statistics by Class:
## 
##                      Class: Vegetable Class: Animal Class: Mineral
## Sensitivity                    0.2886        0.2809         0.4374
## Specificity                    0.7099        0.7312         0.5623
## Pos Pred Value                 0.3329        0.3402         0.3355
## Neg Pred Value                 0.6655        0.6733         0.6643
## Prevalence                     0.3340        0.3304         0.3356
## Detection Rate                 0.0964        0.0928         0.1468
## Detection Prevalence           0.2896        0.2728         0.4376
## Balanced Accuracy              0.4993        0.5060         0.4999

Once again, the hand-written code and the packaged code match, which is reassuring. However, the accuracy statistics are still poor, and there is a reason for this: the class labels were assigned at random, so there is no signal to extract. Guessing at random, and being right about a third of the time, is about the best one can do with this data set!