Reading the data:
> known <- read.csv("naive_known.csv")
> unknown <- read.csv("naive_unknown.csv")
Cleaning the data a bit:
> known$X <- NULL # removing the useless first column of row indices
> names(known) <- sub("data8.", "", names(known), fixed = TRUE) # removing the "data8." prefix from the variables' names
> # and the same for the unknown dataset:
> unknown$X <- NULL
> names(unknown) <- sub("data8.", "", names(unknown), fixed = TRUE)
> # removing the pathogen column from the unknown dataset:
> unknown$pathogen5 <- NULL
Removing missing values:
> known <- known[known$EatCookRawMeat != "Unknown", ] # dropping rows with "Unknown" raw-meat status
> known$EatCookRawMeat <- factor(known$EatCookRawMeat) # dropping the now-unused factor level
> known <- na.exclude(known) # removing rows with missing values
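A quick sanity check that the cleaning worked: there should be no missing values left and "Unknown" should no longer be a level of EatCookRawMeat:
> sum(is.na(known)) # expected to be 0
> levels(known$EatCookRawMeat) # "Unknown" should be gone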
The packages:
> library(caret) # createDataPartition, train, trainControl, predict.train, confusionMatrix
Loading required package: lattice
Loading required package: ggplot2
Partitioning the known dataset into a training set (75% of the data) and a testing set (the remaining 25%), stratified on pathogen5:
> index <- createDataPartition(known$pathogen5, p = .75, list = FALSE)
> train <- known[index, ]
> test <- known[-index, ]
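Note that createDataPartition samples at random, so for a fully reproducible split one would fix the seed before partitioning, e.g.:
> set.seed(1) # any fixed value works; 1 is arbitrary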
Separating the predictors from the response variable (pathogen5, the first column) in both the train and test datasets:
> xTrain <- train[, -1]
> yTrain <- train$pathogen5
> xTest <- test[, -1]
> yTest <- test$pathogen5
Setting the training method to a 10-fold cross-validation:
> ctrl <- trainControl("cv", 10)
We additionally tune the laplace and usekernel arguments of the naivebayes::naive_bayes function (holding adjust constant at 1):
> tgrd <- expand.grid(laplace = 0:5, usekernel = c(FALSE, TRUE), adjust = 1)
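For intuition, the laplace argument adds a pseudo-count to the frequency tables of the categorical predictors, so that a category never observed for a given pathogen does not force the whole class probability to zero. A minimal sketch of the idea (the laplace_prob helper is made up for illustration, it is not part of naivebayes):
> laplace_prob <- function(counts, laplace) (counts + laplace) / sum(counts + laplace)
> laplace_prob(c(yes = 0, no = 10), laplace = 0) # P(yes) = 0, which would zero out the whole product
> laplace_prob(c(yes = 0, no = 10), laplace = 1) # P(yes) is now small but non-zero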
Training the naive Bayes model using the naivebayes::naive_bayes function:
> model <- train(xTrain, yTrain, "naive_bayes", trControl = ctrl, tuneGrid = tgrd)
Which gives:
> model
Naive Bayes

264 samples
 17 predictor
  8 classes: 'DENV', 'Enterovirus', 'HSV', 'JEV', 'N. meningitidis', 'Other', 'S. Pneumoniae', 'S. Suis'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 238, 240, 236, 238, 238, 238, ...
Resampling results across tuning parameters:

  laplace  usekernel  Accuracy   Kappa
  0        FALSE      0.3403805  0.1815506
  0         TRUE      0.3199371  0.1685653
  1        FALSE      0.3928637  0.2025005
  1         TRUE      0.3678698  0.1789218
  2        FALSE      0.3973208  0.2011511
  2         TRUE      0.3680281  0.1708140
  3        FALSE      0.4003620  0.1995447
  3         TRUE      0.3769043  0.1727865
  4        FALSE      0.3953762  0.1888870
  4         TRUE      0.3765864  0.1694003
  5        FALSE      0.3776351  0.1633945
  5         TRUE      0.3590996  0.1414896

Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 3, usekernel = FALSE and adjust = 1.
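The effect of the tuning parameters on the cross-validated accuracy can also be inspected graphically with caret's plot method for train objects:
> plot(model) # cross-validated accuracy across the tuning grid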
Testing the model on the test dataset to estimate its predictive performance:
> predictions <- predict(model, xTest)
> (cm <- confusionMatrix(predictions, yTest))
Confusion Matrix and Statistics

                 Reference
Prediction        DENV Enterovirus HSV JEV N. meningitidis Other
  DENV               0           0   0   0               0     0
  Enterovirus        0           1   0   0               0     1
  HSV                0           0   0   0               0     0
  JEV                0           0   0   5               0     8
  N. meningitidis    0           0   0   0               0     0
  Other              1           4   1  11               1    11
  S. Pneumoniae      0           0   1   1               0     0
  S. Suis            1           1   1   1               0     5
                 Reference
Prediction        S. Pneumoniae S. Suis
  DENV                        0       0
  Enterovirus                 0       0
  HSV                         0       0
  JEV                         1       1
  N. meningitidis             0       0
  Other                       8       5
  S. Pneumoniae               1       0
  S. Suis                     1      14
Overall Statistics

               Accuracy : 0.3721
                 95% CI : (0.2702, 0.483)
    No Information Rate : 0.2907
    P-Value [Acc > NIR] : 0.06371

                  Kappa : 0.1634

 Mcnemar's Test P-Value : NA
Statistics by Class:

                     Class: DENV Class: Enterovirus Class: HSV Class: JEV
Sensitivity              0.00000            0.16667    0.00000    0.27778
Specificity              1.00000            0.98750    1.00000    0.85294
Pos Pred Value               NaN            0.50000        NaN    0.33333
Neg Pred Value           0.97674            0.94048    0.96512    0.81690
Prevalence               0.02326            0.06977    0.03488    0.20930
Detection Rate           0.00000            0.01163    0.00000    0.05814
Detection Prevalence     0.00000            0.02326    0.00000    0.17442
Balanced Accuracy        0.50000            0.57708    0.50000    0.56536
                     Class: N. meningitidis Class: Other
Sensitivity                         0.00000       0.4400
Specificity                         1.00000       0.4918
Pos Pred Value                          NaN       0.2619
Neg Pred Value                      0.98837       0.6818
Prevalence                          0.01163       0.2907
Detection Rate                      0.00000       0.1279
Detection Prevalence                0.00000       0.4884
Balanced Accuracy                   0.50000       0.4659
                     Class: S. Pneumoniae Class: S. Suis
Sensitivity                       0.09091         0.7000
Specificity                       0.97333         0.8485
Pos Pred Value                    0.33333         0.5833
Neg Pred Value                    0.87952         0.9032
Prevalence                        0.12791         0.2326
Detection Rate                    0.01163         0.1628
Detection Prevalence              0.03488         0.2791
Balanced Accuracy                 0.53212         0.7742
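The per-class statistics are also available programmatically in the byClass matrix of the confusionMatrix output, which makes it easy to rank the pathogens, for example by balanced accuracy:
> sort(cm$byClass[, "Balanced Accuracy"], decreasing = TRUE)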
The model does rather well for Japanese encephalitis and S. suis, and rather poorly for the rest. Let's now use this model to infer the pathogens of the unknown cases:
> inferences <- predict(model, unknown)
> sort(table(inferences) / length(inferences), decreasing = TRUE)
inferences
          Other         S. Suis             JEV   S. Pneumoniae
     0.70502431      0.15072934      0.10534846      0.02755267
    Enterovirus            DENV             HSV N. meningitidis
     0.01134522      0.00000000      0.00000000      0.00000000
To compare with the pathogen distribution in the known dataset:
> sort(table(known$pathogen5) / nrow(known), decreasing = TRUE)
          Other         S. Suis             JEV   S. Pneumoniae
     0.28571429      0.22857143      0.20571429      0.12857143
    Enterovirus             HSV            DENV N. meningitidis
     0.07142857      0.03714286      0.02285714      0.02000000
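The two distributions can also be put side by side (a small sketch reusing the objects defined above):
> p_inferred <- prop.table(table(inferences))
> p_known <- prop.table(table(known$pathogen5))
> round(cbind(inferred = p_inferred[names(p_known)], known = p_known), 3)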
The following function uses the positive and negative predictive values from the output of confusionMatrix to estimate, for a given pathogen, the expected true proportion of cases among the unknowns: the cases correctly detected (x * PPV, where x is the proportion of positive predictions) plus the cases missed by the classifier ((1 - x) * (1 - NPV)).
> expected_missed <- function(path, confmat) {
+   x <- sum(inferences == path) / length(inferences) # proportion predicted as path
+   class <- paste("Class:", path)
+   ppv <- confmat$byClass[class, "Pos Pred Value"]
+   npv <- confmat$byClass[class, "Neg Pred Value"]
+   x * ppv + (1 - x) * (1 - npv) # detected cases plus missed cases
+ }
Let’s apply it to Japanese encephalitis:
> expected_missed("JEV", cm)
[1] 0.1989256
And to S. suis:
> expected_missed("S. Suis", cm)
[1] 0.1701129
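And, assuming the same objects, the function can be mapped over all the pathogens at once (for classes that are never predicted the PPV is undefined, so the result propagates as NaN):
> sapply(levels(known$pathogen5), expected_missed, confmat = cm)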