Reading the data:

> known <- read.csv("naive_known.csv")
> unknown <- read.csv("naive_unknown.csv")
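
A quick sanity check on what was read in (output not shown here):

> dim(known)   # the labelled dataset
> dim(unknown) # the dataset with unknown pathogens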

Cleaning the data a bit:

> known$X <- NULL # removing the first useless column
> names(known) <- sub("data8.", "", names(known)) # removing "data8." from variables' names
> # and the same for the unknown:
> unknown$X <- NULL
> names(unknown) <- sub("data8.", "", names(unknown))
> # removing pathogen column for unknown:
> unknown$pathogen5 <- NULL
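
Since exactly the same cleaning applies to both datasets, these steps could also be wrapped in a small helper function (just a sketch, not what was run above):

> clean_dataset <- function(df) {
+   df$X <- NULL                              # drop the useless index column
+   names(df) <- sub("data8.", "", names(df)) # strip the "data8." prefix
+   df
+ }
> # known <- clean_dataset(known); unknown <- clean_dataset(unknown)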

Removing observations with unknown or missing values:

> known <- known[known$EatCookRawMeat != "Unknown", ] # removing unknown for raw meat
> known$EatCookRawMeat <- factor(known$EatCookRawMeat) # recoding the factor
> known <- na.exclude(known) # removing missing values
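
Essentially the same filtering could be written in one line with complete.cases() and droplevels() (a sketch; droplevels() additionally drops any other factor levels that become empty):

> # known <- droplevels(known[complete.cases(known) & known$EatCookRawMeat != "Unknown", ])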

The packages:

> library(caret) # createDataPartition, predict.train, train, trainControl
Loading required package: lattice
Loading required package: ggplot2
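
caret's "naive_bayes" method calls naivebayes::naive_bayes under the hood, so the naivebayes package must be installed as well (it does not need to be attached explicitly):

> # install.packages(c("caret", "naivebayes")) # if not already installed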

Partitioning the known dataset into a training set (75% of the data) and a test set (the remaining 25%):

> index <- createDataPartition(known$pathogen5, p = .75, list = FALSE)
> train <- known[index, ]
> test <- known[-index, ]
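
Note that createDataPartition (like the cross-validation below) relies on random sampling, so the exact figures reported further down will vary from run to run; for a reproducible analysis one would set the seed before this step, for example:

> # set.seed(20200711) # any fixed value; no seed was set for the run shown here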

Separating the response variable (pathogen5) from the predictors for both the training and test datasets:

> xTrain <- train[, -1]
> yTrain <- train$pathogen5
> xTest <- test[, -1]
> yTest <- test$pathogen5
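
The [, -1] indexing assumes that pathogen5 is the first column of known; a stopifnot() call makes that assumption explicit:

> stopifnot(identical(names(known)[1], "pathogen5"))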

Setting the training method to a 10-fold cross-validation:

> ctrl <- trainControl("cv", 10)
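
Written with explicit argument names this is trainControl(method = "cv", number = 10). A more thorough (but slower) alternative would be repeated cross-validation, for example:

> # ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)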

And additionally defining a tuning grid for the laplace and usekernel arguments of the naivebayes::naive_bayes function:

> tgrd <- expand.grid(laplace = 0:5, usekernel = c(FALSE, TRUE), adjust = 1)
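
The adjust argument (the bandwidth adjustment of the kernel density estimates, only relevant when usekernel = TRUE) is held fixed at 1 here; it could be tuned as well at the cost of a larger grid, for example:

> # tgrd <- expand.grid(laplace = 0:5, usekernel = c(FALSE, TRUE), adjust = c(0.5, 1, 2))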

Training the naive Bayes model, using the naivebayes::naive_bayes function:

> model <- train(xTrain, yTrain, "naive_bayes", trControl = ctrl, tuneGrid = tgrd)

Which gives:

> model
Naive Bayes 

264 samples
 17 predictor
  8 classes: 'DENV', 'Enterovirus', 'HSV', 'JEV', 'N. meningitidis', 'Other', 'S. Pneumoniae', 'S. Suis' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 238, 240, 236, 238, 238, 238, ... 
Resampling results across tuning parameters:

  laplace  usekernel  Accuracy   Kappa    
  0        FALSE      0.3403805  0.1815506
  0         TRUE      0.3199371  0.1685653
  1        FALSE      0.3928637  0.2025005
  1         TRUE      0.3678698  0.1789218
  2        FALSE      0.3973208  0.2011511
  2         TRUE      0.3680281  0.1708140
  3        FALSE      0.4003620  0.1995447
  3         TRUE      0.3769043  0.1727865
  4        FALSE      0.3953762  0.1888870
  4         TRUE      0.3765864  0.1694003
  5        FALSE      0.3776351  0.1633945
  5         TRUE      0.3590996  0.1414896

Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were laplace = 3, usekernel =
 FALSE and adjust = 1.
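
The accuracy profile across the tuning grid can also be visualised with caret's plot method for train objects (figure not shown):

> plot(model) # accuracy across the values of laplace and usekernel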

Testing the model on the test dataset to assess its performance:

> predictions <- predict(model, xTest)
> (cm <- confusionMatrix(predictions, yTest))
Confusion Matrix and Statistics

                 Reference
Prediction        DENV Enterovirus HSV JEV N. meningitidis Other
  DENV               0           0   0   0               0     0
  Enterovirus        0           1   0   0               0     1
  HSV                0           0   0   0               0     0
  JEV                0           0   0   5               0     8
  N. meningitidis    0           0   0   0               0     0
  Other              1           4   1  11               1    11
  S. Pneumoniae      0           0   1   1               0     0
  S. Suis            1           1   1   1               0     5
                 Reference
Prediction        S. Pneumoniae S. Suis
  DENV                        0       0
  Enterovirus                 0       0
  HSV                         0       0
  JEV                         1       1
  N. meningitidis             0       0
  Other                       8       5
  S. Pneumoniae               1       0
  S. Suis                     1      14

Overall Statistics
                                         
               Accuracy : 0.3721         
                 95% CI : (0.2702, 0.483)
    No Information Rate : 0.2907         
    P-Value [Acc > NIR] : 0.06371        
                                         
                  Kappa : 0.1634         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: DENV Class: Enterovirus Class: HSV Class: JEV
Sensitivity              0.00000            0.16667    0.00000    0.27778
Specificity              1.00000            0.98750    1.00000    0.85294
Pos Pred Value               NaN            0.50000        NaN    0.33333
Neg Pred Value           0.97674            0.94048    0.96512    0.81690
Prevalence               0.02326            0.06977    0.03488    0.20930
Detection Rate           0.00000            0.01163    0.00000    0.05814
Detection Prevalence     0.00000            0.02326    0.00000    0.17442
Balanced Accuracy        0.50000            0.57708    0.50000    0.56536
                     Class: N. meningitidis Class: Other
Sensitivity                         0.00000       0.4400
Specificity                         1.00000       0.4918
Pos Pred Value                          NaN       0.2619
Neg Pred Value                      0.98837       0.6818
Prevalence                          0.01163       0.2907
Detection Rate                      0.00000       0.1279
Detection Prevalence                0.00000       0.4884
Balanced Accuracy                   0.50000       0.4659
                     Class: S. Pneumoniae Class: S. Suis
Sensitivity                       0.09091         0.7000
Specificity                       0.97333         0.8485
Pos Pred Value                    0.33333         0.5833
Neg Pred Value                    0.87952         0.9032
Prevalence                        0.12791         0.2326
Detection Rate                    0.01163         0.1628
Detection Prevalence              0.03488         0.2791
Balanced Accuracy                 0.53212         0.7742
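
To see the per-class performance at a glance, the balanced accuracies can be pulled out of the confusionMatrix object (output not shown):

> round(sort(cm$byClass[, "Balanced Accuracy"], decreasing = TRUE), 2)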

The model does rather well for Japanese encephalitis and S. suis, and rather poorly for the rest. Let’s now use this model to infer the unknown pathogens:

> inferences <- predict(model, unknown)
> sort(table(inferences) / length(inferences), TRUE)
inferences
          Other         S. Suis             JEV   S. Pneumoniae 
     0.70502431      0.15072934      0.10534846      0.02755267 
    Enterovirus            DENV             HSV N. meningitidis 
     0.01134522      0.00000000      0.00000000      0.00000000 

To compare with the distribution of pathogens in the known dataset:

> sort(table(known$pathogen5) / nrow(known), TRUE)

          Other         S. Suis             JEV   S. Pneumoniae 
     0.28571429      0.22857143      0.20571429      0.12857143 
    Enterovirus             HSV            DENV N. meningitidis 
     0.07142857      0.03714286      0.02285714      0.02000000 
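
The two distributions can also be lined up in a single table (output not shown):

> p_unknown <- table(inferences) / length(inferences)
> p_known <- table(known$pathogen5) / nrow(known)
> round(rbind(known = p_known, unknown = p_unknown[names(p_known)]), 3)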

The following function uses the positive and negative predictive values from the output of confusionMatrix to estimate the true proportion of a given pathogen among the unknown samples: the first term keeps the predictions expected to be true positives, and the second term adds back the cases expected to be missed among the negative predictions.

> expected_missed <- function(path, confmat) {
+   x <- sum(inferences == path) / length(inferences) # proportion predicted as path (uses the global inferences)
+   class <- paste("Class:", path)
+   ppv <- confmat$byClass[class, "Pos Pred Value"]
+   npv <- confmat$byClass[class, "Neg Pred Value"]
+   x * ppv + (1 - x) * (1 - npv) # expected true positives + expected missed cases
+ }

Let’s apply it to Japanese encephalitis:

> expected_missed("JEV", cm)
[1] 0.1989256

And to S. suis:

> expected_missed("S. Suis", cm)
[1] 0.1701129
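
And to apply it to all the pathogens at once (output not shown; the result is NaN for the classes that were never predicted on the test dataset, such as DENV, HSV and N. meningitidis, because their positive predictive value is undefined):

> sapply(colnames(cm$table), expected_missed, confmat = cm)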