Introduction
In this document I attempt a naïve Bayes analysis on made-up data in order to deepen my understanding of the process. I will restrict the features to categorical variables for simplicity, and I will sanity-check my findings against the results of an existing software package.
Overview
Naïve Bayes predicts the class of new observations based on the distribution of the classes of existing data under two important assumptions:
- The observations are independent and identically distributed (iid)
- The features are conditionally independent given the class
It relies on Bayes' theorem:
\[ P(class|data) = \frac{P(data|class)P(class)}{P(data)} \]
As \(P(data)\) is constant across classes for a given observation, the implementation dispenses with the denominator and relies on:
\[ P(class|data) \propto P(data|class)P(class) \]
The conditional independence enters through the decomposition of \(P(data|class)\). With observations indexed by \(i\), features indexed by \(j\), and classes indexed by \(k\), the probability of the \(i^{th}\) observation belonging to class \(C_k\) is:
\[ P(y_i \in C_k|x_i) \propto P(C_k)\prod_j P(x_{ij}|C_k) \]
The assigned class is the one for which the above product is maximized:
\[ C_i = \mathop{\mathrm{arg\,max}}_k P(C_k)\prod_j P(x_{ij}|C_k) \]
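To make the decision rule concrete, here is a minimal sketch with invented numbers; the priors and conditional probabilities below are made up purely for illustration and are not taken from the data that follows.
# Hypothetical priors P(C_k) and per-feature conditionals P(x_ij | C_k) for a
# single observation; the numbers are invented for illustration only.
priors <- c(Animal = 0.4, Vegetable = 0.3, Mineral = 0.3)
cond   <- rbind(x1 = c(Animal = 0.10, Vegetable = 0.20, Mineral = 0.15),
                x2 = c(Animal = 0.25, Vegetable = 0.10, Mineral = 0.30),
                x3 = c(Animal = 0.30, Vegetable = 0.40, Mineral = 0.20))
# P(C_k) * prod_j P(x_ij | C_k), then pick the class with the largest score.
scores <- priors * apply(cond, 2, prod)
names(which.max(scores))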
Data
I will randomly select a target variable and three features. The danger with this approach is that there may be no signal to extract from the noise.
# Packages used throughout: data.table for the tables, and caret (with
# bnclassify backing the 'nbDiscrete' method) for the packaged comparison.
library(data.table)
library(caret)
library(bnclassify)
set.seed(18426)
y <- sample(c("Animal", "Vegetable", "Mineral"), 50, replace = TRUE)
x1 <- sample(c("Green", "Red", "Blue", "Orange", "Yellow", "White", "Black",
"Pink"), 50, replace = TRUE)
x2 <- sample(c("NorAm", "SouAm", "Eur", "Aisa", "Afr", "Aus"), 50, replace = TRUE)
x3 <- sample(c("Sphere", "Cube", "Pyramid", "Cylinder"), 50, replace = TRUE)
DT <- data.table(y = y, x1 = x1, x2 = x2, x3 = x3)
knitr::kable(head(DT, 10), caption = "First 10 Entries", format = 'html')

| y | x1 | x2 | x3 |
|---|---|---|---|
| Vegetable | Pink | Aus | Pyramid |
| Mineral | Black | SouAm | Sphere |
| Mineral | Red | Aisa | Sphere |
| Mineral | Pink | SouAm | Sphere |
| Animal | Yellow | NorAm | Cube |
| Vegetable | White | Eur | Cube |
| Animal | White | SouAm | Pyramid |
| Animal | Blue | SouAm | Sphere |
| Mineral | Blue | NorAm | Cube |
| Vegetable | Green | SouAm | Sphere |
Modeling
Training & Testing
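The code that creates the training and test sets is not echoed in this section. Based on the 38-sample training set reported below and the split used in the larger run later on, it presumably looks like the following sketch: a stratified 75/25 split with caret's createDataPartition.
trIDX <- createDataPartition(DT$y, p = 0.75)
trnSet <- DT[trIDX$Resample1]
tstSet <- DT[-trIDX$Resample1]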
Marginal Class Probabilities
This section calculates \(P(C_k)\) for each class. This is simply the count of each class in the training set as a proportion of the total observations in the training set.
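The chunk that computes these proportions is not echoed either; it presumably mirrors the one used in the larger run later on:
ntrn <- dim(trnSet)[[1]]
classP <- trnSet[, .(p = .N / ntrn), by = y]
classP[]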
## y p
## 1: Vegetable 0.2368421
## 2: Mineral 0.3947368
## 3: Animal 0.3684211
Conditional Class Probabilities
This section calculates \(P(data|class)\), which is a little more complicated. What is being measured is the fraction of the observations of class \(C_k\) that have a given feature value, so the denominator changes with the class as well. The code below uses data.table to set st to the count of observations for each class y, and then divides that into the counts of the x1, x2, and x3 values within that class.
condPX1 <- trnSet[, {st = .N;
.SD[, .(p = .N / st), by = x1]
}, by = y]
condPX2 <- trnSet[, {st = .N;
.SD[, .(p = .N / st), by = x2]
}, by = y]
condPX3 <- trnSet[, {st = .N;
.SD[, .(p = .N / st), by = x3]
}, by = y]
condPX1[]

## y x1 p
## 1: Vegetable Pink 0.22222222
## 2: Vegetable White 0.11111111
## 3: Vegetable Green 0.33333333
## 4: Vegetable Orange 0.22222222
## 5: Vegetable Yellow 0.11111111
## 6: Mineral Black 0.13333333
## 7: Mineral Red 0.13333333
## 8: Mineral Blue 0.20000000
## 9: Mineral Orange 0.06666667
## 10: Mineral White 0.20000000
## 11: Mineral Green 0.20000000
## 12: Mineral Pink 0.06666667
## 13: Animal White 0.07142857
## 14: Animal Yellow 0.21428571
## 15: Animal Blue 0.14285714
## 16: Animal Black 0.14285714
## 17: Animal Red 0.28571429
## 18: Animal Pink 0.07142857
## 19: Animal Orange 0.07142857
condPX2[]

## y x2 p
## 1: Vegetable Aus 0.22222222
## 2: Vegetable Eur 0.11111111
## 3: Vegetable SouAm 0.22222222
## 4: Vegetable NorAm 0.22222222
## 5: Vegetable Aisa 0.22222222
## 6: Mineral SouAm 0.26666667
## 7: Mineral Aisa 0.13333333
## 8: Mineral NorAm 0.46666667
## 9: Mineral Aus 0.06666667
## 10: Mineral Eur 0.06666667
## 11: Animal SouAm 0.21428571
## 12: Animal NorAm 0.21428571
## 13: Animal Aus 0.28571429
## 14: Animal Aisa 0.14285714
## 15: Animal Afr 0.14285714
condPX3[]

## y x3 p
## 1: Vegetable Pyramid 0.22222222
## 2: Vegetable Cube 0.44444444
## 3: Vegetable Sphere 0.33333333
## 4: Mineral Sphere 0.46666667
## 5: Mineral Cube 0.26666667
## 6: Mineral Pyramid 0.20000000
## 7: Mineral Cylinder 0.06666667
## 8: Animal Pyramid 0.35714286
## 9: Animal Cylinder 0.21428571
## 10: Animal Cube 0.21428571
## 11: Animal Sphere 0.21428571
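As a quick sanity check (not part of the original analysis), the conditional probabilities within each class should sum to 1 for every feature:
# Each of these should return p = 1 for every class, up to floating-point
# rounding.
condPX1[, .(p = sum(p)), by = y]
condPX2[, .(p = sum(p)), by = y]
condPX3[, .(p = sum(p)), by = y]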
Predictions
Given these conditional probabilities and class probabilities, the proportional (unnormalized) class probabilities for the test data can be calculated.
xtst <- tstSet[, -1]
ntst <- dim(xtst)[[1]]
# When you can, pre-allocate your data structures. It is faster and more
# memory efficient.
yPreds <- data.table(Vegetable = double(ntst),
Mineral = double(ntst),
Animal = double(ntst))
# This is a four-way merge. First x1 probabilities are joined to the test set.
# Then the x2, x3, and class probabilities as well. Once that is done, the
# probabilities are multiplied per naïve Bayes assumptions, and this value is
# stored in the appropriate cell of the output table, yPreds. For exposition
# the temporary join table is printed inside the loop.
for (i in seq_along(classP$y)) {
expandedProbs <- classP[condPX3[y == classP$y[i]
][condPX2[y == classP$y[i]
][condPX1[y == classP$y[i]
][xtst, on = 'x1',
all = TRUE],
on = 'x2'],
on = 'x3'],
on = 'y']
print(knitr::kable(expandedProbs, caption = paste("Probabilities for class",
classP$y[i])))
cat('\n\n<!-- -->\n\n')
yPreds[, i] <- expandedProbs[, p * i.p * i.p.2 * i.p.1]
}

##
##
## Table: Probabilities for class Vegetable
##
## |y | p|x3 | i.p|i.y |x2 | i.p.2|i.y.1 |x1 | i.p.1|
## |:---------|---------:|:--------|---------:|:---------|:-----|---------:|:---------|:------|---------:|
## |Vegetable | 0.2368421|Sphere | 0.3333333|Vegetable |SouAm | 0.2222222|Vegetable |Pink | 0.2222222|
## |Vegetable | 0.2368421|Cube | 0.4444444|Vegetable |NorAm | 0.2222222|Vegetable |Yellow | 0.1111111|
## |Vegetable | 0.2368421|Sphere | 0.3333333|Vegetable |SouAm | 0.2222222|NA |Blue | NA|
## |NA | NA|Cylinder | NA|Vegetable |Eur | 0.1111111|Vegetable |White | 0.1111111|
## |NA | NA|Cylinder | NA|NA |Afr | NA|NA |Blue | NA|
## |Vegetable | 0.2368421|Pyramid | 0.2222222|Vegetable |Eur | 0.1111111|NA |Blue | NA|
## |Vegetable | 0.2368421|Sphere | 0.3333333|Vegetable |Aus | 0.2222222|Vegetable |Yellow | 0.1111111|
## |Vegetable | 0.2368421|Cube | 0.4444444|Vegetable |NorAm | 0.2222222|Vegetable |White | 0.1111111|
## |Vegetable | 0.2368421|Sphere | 0.3333333|Vegetable |Aisa | 0.2222222|NA |Blue | NA|
## |NA | NA|Cylinder | NA|NA |Afr | NA|Vegetable |Green | 0.3333333|
## |Vegetable | 0.2368421|Cube | 0.4444444|Vegetable |Aus | 0.2222222|NA |Red | NA|
## |Vegetable | 0.2368421|Cube | 0.4444444|NA |Afr | NA|NA |Red | NA|
##
##
## <!-- -->
##
##
##
## Table: Probabilities for class Mineral
##
## |y | p|x3 | i.p|i.y |x2 | i.p.2|i.y.1 |x1 | i.p.1|
## |:-------|---------:|:--------|---------:|:-------|:-----|---------:|:-------|:------|---------:|
## |Mineral | 0.3947368|Sphere | 0.4666667|Mineral |SouAm | 0.2666667|Mineral |Pink | 0.0666667|
## |Mineral | 0.3947368|Cube | 0.2666667|Mineral |NorAm | 0.4666667|NA |Yellow | NA|
## |Mineral | 0.3947368|Sphere | 0.4666667|Mineral |SouAm | 0.2666667|Mineral |Blue | 0.2000000|
## |Mineral | 0.3947368|Cylinder | 0.0666667|Mineral |Eur | 0.0666667|Mineral |White | 0.2000000|
## |Mineral | 0.3947368|Cylinder | 0.0666667|NA |Afr | NA|Mineral |Blue | 0.2000000|
## |Mineral | 0.3947368|Pyramid | 0.2000000|Mineral |Eur | 0.0666667|Mineral |Blue | 0.2000000|
## |Mineral | 0.3947368|Sphere | 0.4666667|Mineral |Aus | 0.0666667|NA |Yellow | NA|
## |Mineral | 0.3947368|Cube | 0.2666667|Mineral |NorAm | 0.4666667|Mineral |White | 0.2000000|
## |Mineral | 0.3947368|Sphere | 0.4666667|Mineral |Aisa | 0.1333333|Mineral |Blue | 0.2000000|
## |Mineral | 0.3947368|Cylinder | 0.0666667|NA |Afr | NA|Mineral |Green | 0.2000000|
## |Mineral | 0.3947368|Cube | 0.2666667|Mineral |Aus | 0.0666667|Mineral |Red | 0.1333333|
## |Mineral | 0.3947368|Cube | 0.2666667|NA |Afr | NA|Mineral |Red | 0.1333333|
##
##
## <!-- -->
##
##
##
## Table: Probabilities for class Animal
##
## |y | p|x3 | i.p|i.y |x2 | i.p.2|i.y.1 |x1 | i.p.1|
## |:------|---------:|:--------|---------:|:------|:-----|---------:|:------|:------|---------:|
## |Animal | 0.3684211|Sphere | 0.2142857|Animal |SouAm | 0.2142857|Animal |Pink | 0.0714286|
## |Animal | 0.3684211|Cube | 0.2142857|Animal |NorAm | 0.2142857|Animal |Yellow | 0.2142857|
## |Animal | 0.3684211|Sphere | 0.2142857|Animal |SouAm | 0.2142857|Animal |Blue | 0.1428571|
## |Animal | 0.3684211|Cylinder | 0.2142857|NA |Eur | NA|Animal |White | 0.0714286|
## |Animal | 0.3684211|Cylinder | 0.2142857|Animal |Afr | 0.1428571|Animal |Blue | 0.1428571|
## |Animal | 0.3684211|Pyramid | 0.3571429|NA |Eur | NA|Animal |Blue | 0.1428571|
## |Animal | 0.3684211|Sphere | 0.2142857|Animal |Aus | 0.2857143|Animal |Yellow | 0.2142857|
## |Animal | 0.3684211|Cube | 0.2142857|Animal |NorAm | 0.2142857|Animal |White | 0.0714286|
## |Animal | 0.3684211|Sphere | 0.2142857|Animal |Aisa | 0.1428571|Animal |Blue | 0.1428571|
## |Animal | 0.3684211|Cylinder | 0.2142857|Animal |Afr | 0.1428571|NA |Green | NA|
## |Animal | 0.3684211|Cube | 0.2142857|Animal |Aus | 0.2857143|Animal |Red | 0.2857143|
## |Animal | 0.3684211|Cube | 0.2142857|Animal |Afr | 0.1428571|Animal |Red | 0.2857143|
##
##
## <!-- -->
The tables above show the class probabilities and the probabilities of the features given the classes for each row of the test set.
The product of all the probabilities, with 0 substituted for NA, is shown below.
yPreds[is.na(yPreds)] <- 0
predictions <- classP$y[max.col(yPreds)]
yPreds[, `:=`(Prediction = predictions,
Actual = tstSet$y)]
knitr::kable(yPreds, caption = "Naïve Bayes Probabilities")

| Vegetable | Mineral | Animal | Prediction | Actual |
|---|---|---|---|---|
| 0.0038986 | 0.0032749 | 0.0012084 | Vegetable | Mineral |
| 0.0025991 | 0.0000000 | 0.0036251 | Animal | Animal |
| 0.0000000 | 0.0098246 | 0.0024168 | Mineral | Animal |
| 0.0000000 | 0.0003509 | 0.0000000 | Mineral | Vegetable |
| 0.0000000 | 0.0000000 | 0.0016112 | Animal | Animal |
| 0.0000000 | 0.0010526 | 0.0000000 | Mineral | Vegetable |
| 0.0019493 | 0.0000000 | 0.0048335 | Animal | Mineral |
| 0.0025991 | 0.0098246 | 0.0012084 | Mineral | Animal |
| 0.0000000 | 0.0049123 | 0.0016112 | Mineral | Mineral |
| 0.0000000 | 0.0000000 | 0.0000000 | Animal | Mineral |
| 0.0000000 | 0.0009357 | 0.0064447 | Animal | Vegetable |
| 0.0000000 | 0.0000000 | 0.0032223 | Animal | Mineral |
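These values are proportional scores, not true posterior probabilities. If posteriors were wanted, each row could be normalized by its sum; a small sketch, not part of the original analysis:
# Divide each row by its sum so the three class scores sum to 1; rows that are
# all zero (no usable information, such as row 10 above) produce NaN.
scores     <- as.matrix(yPreds[, .(Vegetable, Mineral, Animal)])
posteriors <- round(scores / rowSums(scores), 4)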
Finally, the predictions and the confusion matrix for them are shown below:
ytst <- factor(tstSet$y, levels = classP$y)
confusionMatrix(data = factor(predictions, levels = classP$y),
reference = ytst)

## Confusion Matrix and Statistics
##
## Reference
## Prediction Vegetable Mineral Animal
## Vegetable 0 1 0
## Mineral 2 1 2
## Animal 1 3 2
##
## Overall Statistics
##
## Accuracy : 0.25
## 95% CI : (0.0549, 0.5719)
## No Information Rate : 0.4167
## P-Value [Acc > NIR] : 0.9329
##
## Kappa : -0.1739
##
## Mcnemar's Test P-Value : 0.6746
##
## Statistics by Class:
##
## Class: Vegetable Class: Mineral Class: Animal
## Sensitivity 0.00000 0.20000 0.5000
## Specificity 0.88889 0.42857 0.5000
## Pos Pred Value 0.00000 0.20000 0.3333
## Neg Pred Value 0.72727 0.42857 0.6667
## Prevalence 0.25000 0.41667 0.3333
## Detection Rate 0.00000 0.08333 0.1667
## Detection Prevalence 0.08333 0.41667 0.5000
## Balanced Accuracy 0.44444 0.31429 0.5000
Discussion
The results are awful; the accuracy is below the no-information rate. However, that may be a result of the test set containing feature values that never occur with a given class in the training set. For example, looking at the table of predictions, row 4 has a probability of 0 for Vegetable. There are no cylindrical vegetables in the training set, so the conditional probability of \(x_3\) being Cylinder given the class Vegetable is 0, which drives the whole product to 0.
## y x1 x2 x3
## 1: Animal Black Aus Cylinder
## 2: Animal Blue Aus Cylinder
## 3: Animal Red Aisa Cylinder
## 4: Mineral Black SouAm Cylinder
Similarly, there is no Animal from Europe in the training set.
## y x1 x2 x3
## 1: Vegetable White Eur Cube
## 2: Mineral Green Eur Pyramid
So the prediction for that row can only be Mineral, as that is the only class in the training set with instances of "White", "Eur", and "Cylinder". These values never occur together in a single training observation, but since naïve Bayes assumes conditional independence, the product of the class prior and the conditional probabilities is still greater than 0.
## y x1 x2 x3
## 1: Mineral White NorAm Sphere
## 2: Mineral White NorAm Pyramid
## 3: Mineral Green Eur Pyramid
## 4: Mineral White SouAm Sphere
## 5: Mineral Black SouAm Cylinder
Yet a white, European, cylindrical Vegetable does exist in the test set, in row 4. This highlights one of the shortcomings of naïve Bayes: if the data set is too small, some conditional probabilities are estimated as 0, and a single zero wipes out the entire product. A common remedy is sketched after the listing below.
## y x1 x2 x3
## 1: Vegetable White Eur Cylinder
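That remedy is Laplace (add-one) smoothing, which gives every (class, feature value) pair a small non-zero probability. Below is a minimal sketch of how the x3 conditional table could be rebuilt with it, using the same trnSet; the names x3levels and condPX3s are introduced here for illustration, and the analysis above does not use smoothing.
# Add 1 to every (class, level) count and add the number of levels to each
# class total, so unseen combinations such as (Vegetable, Cylinder) get a
# small non-zero probability instead of 0.
x3levels <- c("Sphere", "Cube", "Pyramid", "Cylinder")
condPX3s <- trnSet[, {
  st  <- .N
  cnt <- table(factor(x3, levels = x3levels))
  .(x3 = x3levels, p = (as.numeric(cnt) + 1) / (st + length(x3levels)))
}, by = y]
condPX3s[y == "Vegetable"]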
Comparison with Statistical Package
For comparison, below are results using the caret and bnclassify packages. Setting method = 'none' in trainControl fits the model once on the full training set rather than resampling via cross-validation or bootstrapping, which allows a direct comparison with the manual run.
xtrn <- data.table(x1 = factor(trnSet$x1,
levels = c("Green", "Red", "Blue", "Orange",
"Yellow", "White", "Black", "Pink")),
x2 = factor(trnSet$x2,
levels = c("NorAm", "SouAm", "Eur", "Aisa",
"Afr", "Aus")),
x3 = factor(trnSet$x3,
levels = c("Sphere", "Cube", "Pyramid",
"Cylinder")))
ytrn <- factor(trnSet$y, levels = classP$y)
xtst <- data.table(x1 = factor(tstSet$x1,
levels = c("Green", "Red", "Blue", "Orange",
"Yellow", "White", "Black", "Pink")),
x2 = factor(tstSet$x2,
levels = c("NorAm", "SouAm", "Eur", "Aisa",
"Afr", "Aus")),
x3 = factor(tstSet$x3,
levels = c("Sphere", "Cube", "Pyramid",
"Cylinder")))
trC <- trainControl(method = 'none', classProbs = TRUE,
summaryFunction = multiClassSummary)
Fit <- train(x = xtrn, y = ytrn, method = 'nbDiscrete', trControl = trC)
Fit

## Naive Bayes Classifier
##
## 38 samples
## 3 predictor
## 3 classes: 'Vegetable', 'Mineral', 'Animal'
##
## No pre-processing
## Resampling: None
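The confusion matrix that follows comes from predicting on the test set; that call is not echoed above, but it presumably mirrors the one used in the larger run later:
FitP <- predict(Fit, xtst)
confusionMatrix(data = FitP, reference = ytst)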
## Confusion Matrix and Statistics
##
## Reference
## Prediction Vegetable Mineral Animal
## Vegetable 0 1 0
## Mineral 2 1 2
## Animal 1 3 2
##
## Overall Statistics
##
## Accuracy : 0.25
## 95% CI : (0.0549, 0.5719)
## No Information Rate : 0.4167
## P-Value [Acc > NIR] : 0.9329
##
## Kappa : -0.1739
##
## Mcnemar's Test P-Value : 0.6746
##
## Statistics by Class:
##
## Class: Vegetable Class: Mineral Class: Animal
## Sensitivity 0.00000 0.20000 0.5000
## Specificity 0.88889 0.42857 0.5000
## Pos Pred Value 0.00000 0.20000 0.3333
## Neg Pred Value 0.72727 0.42857 0.6667
## Prevalence 0.25000 0.41667 0.3333
## Detection Rate 0.00000 0.08333 0.1667
## Detection Prevalence 0.08333 0.41667 0.5000
## Balanced Accuracy 0.44444 0.31429 0.5000
As expected, the results are exactly the same.
Larger Data Set
To test the hypothesis that the poor results are a function of the paucity of the training set, a much larger sample will be created and run.
set.seed(8176)
n <- 10000
y <- sample(c("Animal", "Vegetable", "Mineral"), n, replace = TRUE)
x1 <- sample(c("Green", "Red", "Blue", "Orange", "Yellow", "White", "Black",
"Pink"), n, replace = TRUE)
x2 <- sample(c("NorAm", "SouAm", "Eur", "Aisa", "Afr", "Aus"), n,
replace = TRUE)
x3 <- sample(c("Sphere", "Cube", "Pyramid", "Cylinder"), n, replace = TRUE)
DT <- data.table(y = y, x1 = x1, x2 = x2, x3 = x3)
trIDX <- createDataPartition(DT$y, p = 0.75)
trnSet <- DT[trIDX$Resample1]
tstSet <- DT[-trIDX$Resample1]
ntrn <- dim(trnSet)[[1]]
classP <- trnSet[, .(p = .N / ntrn), by = y]
classP[]

## y p
## 1: Vegetable 0.3340
## 2: Animal 0.3304
## 3: Mineral 0.3356
condPX1 <- trnSet[, {st = .N;
.SD[, .(p = .N / st), by = x1]
}, by = y]
condPX2 <- trnSet[, {st = .N;
.SD[, .(p = .N / st), by = x2]
}, by = y]
condPX3 <- trnSet[, {st = .N;
.SD[, .(p = .N / st), by = x3]
}, by = y]
condPX1[]

## y x1 p
## 1: Vegetable Blue 0.1325349
## 2: Vegetable Black 0.1221557
## 3: Vegetable Orange 0.1173653
## 4: Vegetable White 0.1453094
## 5: Vegetable Pink 0.1217565
## 6: Vegetable Green 0.1209581
## 7: Vegetable Red 0.1229541
## 8: Vegetable Yellow 0.1169661
## 9: Animal Orange 0.1283293
## 10: Animal Green 0.1230831
## 11: Animal Black 0.1234867
## 12: Animal White 0.1230831
## 13: Animal Red 0.1255044
## 14: Animal Pink 0.1368039
## 15: Animal Blue 0.1299435
## 16: Animal Yellow 0.1097659
## 17: Mineral Blue 0.1299166
## 18: Mineral Red 0.1068733
## 19: Mineral Black 0.1283274
## 20: Mineral Green 0.1279301
## 21: Mineral Orange 0.1338896
## 22: Mineral White 0.1239571
## 23: Mineral Yellow 0.1279301
## 24: Mineral Pink 0.1211760
## y x1 p
condPX2[]

## y x2 p
## 1: Vegetable SouAm 0.1676647
## 2: Vegetable Eur 0.1768463
## 3: Vegetable Afr 0.1540918
## 4: Vegetable NorAm 0.1796407
## 5: Vegetable Aisa 0.1624750
## 6: Vegetable Aus 0.1592814
## 7: Animal SouAm 0.1626312
## 8: Animal Aisa 0.1614205
## 9: Animal NorAm 0.1836158
## 10: Animal Eur 0.1602098
## 11: Animal Aus 0.1702986
## 12: Animal Afr 0.1618241
## 13: Mineral Afr 0.1720302
## 14: Mineral Eur 0.1656734
## 15: Mineral Aus 0.1624950
## 16: Mineral SouAm 0.1779897
## 17: Mineral NorAm 0.1573302
## 18: Mineral Aisa 0.1644815
condPX3[]

## y x3 p
## 1: Vegetable Sphere 0.2379242
## 2: Vegetable Cube 0.2590818
## 3: Vegetable Pyramid 0.2530938
## 4: Vegetable Cylinder 0.2499002
## 5: Animal Pyramid 0.2401130
## 6: Animal Cylinder 0.2510089
## 7: Animal Sphere 0.2469734
## 8: Animal Cube 0.2619048
## 9: Mineral Pyramid 0.2487088
## 10: Mineral Cube 0.2495034
## 11: Mineral Cylinder 0.2566547
## 12: Mineral Sphere 0.2451331
xtst <- tstSet[, -1]
ntst <- dim(xtst)[[1]]
# Build one column per class, in the same order as classP$y, so that column i
# holds the probabilities for classP$y[i] (the class order here differs from
# the first run).
yPreds <- setnames(as.data.table(matrix(0, nrow = ntst, ncol = length(classP$y))),
                   classP$y)
for (i in seq_along(classP$y)) {
expandedProbs <- classP[condPX3[y == classP$y[i]
][condPX2[y == classP$y[i]
][condPX1[y == classP$y[i]
][xtst, on = 'x1',
all = TRUE],
on = 'x2'],
on = 'x3'],
on = 'y']
yPreds[, i] <- expandedProbs[, p * i.p * i.p.2 * i.p.1]
}
yPreds[is.na(yPreds)] <- 0
predictions <- classP$y[max.col(yPreds)]
yPreds[, `:=`(Prediction = predictions,
Actual = tstSet$y)]
ytst <- factor(tstSet$y, levels = classP$y)
confusionMatrix(data = factor(predictions, levels = classP$y),
reference = ytst)

## Confusion Matrix and Statistics
##
## Reference
## Prediction Vegetable Animal Mineral
## Vegetable 241 235 248
## Animal 226 232 224
## Mineral 368 359 367
##
## Overall Statistics
##
## Accuracy : 0.336
## 95% CI : (0.3175, 0.3549)
## No Information Rate : 0.3356
## P-Value [Acc > NIR] : 0.4906
##
## Kappa : 0.0034
##
## Mcnemar's Test P-Value : 7.527e-12
##
## Statistics by Class:
##
## Class: Vegetable Class: Animal Class: Mineral
## Sensitivity 0.2886 0.2809 0.4374
## Specificity 0.7099 0.7312 0.5623
## Pos Pred Value 0.3329 0.3402 0.3355
## Neg Pred Value 0.6655 0.6733 0.6643
## Prevalence 0.3340 0.3304 0.3356
## Detection Rate 0.0964 0.0928 0.1468
## Detection Prevalence 0.2896 0.2728 0.4376
## Balanced Accuracy 0.4993 0.5060 0.4999
xtrn <- data.table(x1 = factor(trnSet$x1,
levels = c("Green", "Red", "Blue", "Orange",
"Yellow", "White", "Black", "Pink")),
x2 = factor(trnSet$x2,
levels = c("NorAm", "SouAm", "Eur", "Aisa",
"Afr", "Aus")),
x3 = factor(trnSet$x3,
levels = c("Sphere", "Cube", "Pyramid",
"Cylinder")))
ytrn <- factor(trnSet$y, levels = classP$y)
xtst <- data.table(x1 = factor(tstSet$x1,
levels = c("Green", "Red", "Blue", "Orange",
"Yellow", "White", "Black", "Pink")),
x2 = factor(tstSet$x2,
levels = c("NorAm", "SouAm", "Eur", "Aisa",
"Afr", "Aus")),
x3 = factor(tstSet$x3,
levels = c("Sphere", "Cube", "Pyramid",
"Cylinder")))
trC <- trainControl(method = 'none', classProbs = TRUE,
summaryFunction = multiClassSummary)
Fit <- train(x = xtrn, y = ytrn, method = 'nbDiscrete', trControl = trC)
FitP <- predict(Fit, xtst)
confusionMatrix(data = FitP,
reference = ytst)

## Confusion Matrix and Statistics
##
## Reference
## Prediction Vegetable Animal Mineral
## Vegetable 241 235 248
## Animal 226 232 224
## Mineral 368 359 367
##
## Overall Statistics
##
## Accuracy : 0.336
## 95% CI : (0.3175, 0.3549)
## No Information Rate : 0.3356
## P-Value [Acc > NIR] : 0.4906
##
## Kappa : 0.0034
##
## Mcnemar's Test P-Value : 7.527e-12
##
## Statistics by Class:
##
## Class: Vegetable Class: Animal Class: Mineral
## Sensitivity 0.2886 0.2809 0.4374
## Specificity 0.7099 0.7312 0.5623
## Pos Pred Value 0.3329 0.3402 0.3355
## Neg Pred Value 0.6655 0.6733 0.6643
## Prevalence 0.3340 0.3304 0.3356
## Detection Rate 0.0964 0.0928 0.1468
## Detection Prevalence 0.2896 0.2728 0.4376
## Balanced Accuracy 0.4993 0.5060 0.4999
Once again, the hand-written code and the packaged code match, which is reassuring. However, the accuracy statistics are still poor, and for good reason: the class labels were assigned completely at random, so there is no signal to extract. Guessing at random, and being right about a third of the time, is about the best one can do with this data set!