Statistical Inference for Data Science: Bayes' Theorem

Bayes' Theorem

Bayes' Theorem is a theorem of probability theory originally stated by the Reverend Thomas Bayes. It can be seen as a way of understanding how the probability that a theory is true is affected by a new piece of evidence.

Prior Probability * Test Evidence = Posterior Probability

An Example: Cancer Screening

Suppose that 1% of patients have cancer (the prior incidence), that the test correctly returns a positive result for 80% of patients who do have cancer (the true positive rate), and that it incorrectly returns a positive result for 9.6% of patients who do not (the false positive rate). We plug these values into the expanded form of Bayes' Theorem:

P(cancer | positive) = [ P(positive | cancer) * P(cancer) ] / [ P(positive | cancer) * P(cancer) + P(positive | no cancer) * P(no cancer) ]

PRIOR_incidence <- .01   # P(cancer): prior probability that a patient has cancer
NO_cancer <- .99         # P(no cancer)
TRUE_positive <- .80     # P(positive | cancer): true positive rate of the test
FALSE_positive <- .096   # P(positive | no cancer): false positive rate of the test

# P(positive | cancer) * P(cancer)
numerator <- TRUE_positive * PRIOR_incidence
# P(positive): total probability of a positive result
denominator <- (TRUE_positive * PRIOR_incidence) + (FALSE_positive * NO_cancer)

# posterior probability, expressed as a percentage
POSTERIOR <- (numerator/denominator)*100 

So the probability that a given patient actually has cancer given a positive test result is approximately 7.8%:

# round to two significant figures for reporting
a <- signif(POSTERIOR, digits = 2)
sprintf("There is a %s percent chance that you have cancer given that your test result was positive.", a)
## [1] "There is a 7.8 percent chance that you have cancer given that your test result was positive."


Keep in mind that this equation can be stated more succinctly as

P(A | X) = P(X | A) * P(A) / P(X)

where the denominator, P(X), acts as a normalizing constant representing the chance of getting ANY positive result (you'll notice it is exactly the sum that forms the denominator in the previous equation).
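As a quick sanity check, here is a minimal sketch of that succinct form as an R function (the name bayes_posterior is our own, not from any package), reusing the values defined above:

# minimal sketch: posterior = likelihood * prior / normalizing constant
bayes_posterior <- function(likelihood, prior, evidence) {
  (likelihood * prior) / evidence
}

# P(X): the chance of any positive result, i.e. the denominator from before
evidence <- (TRUE_positive * PRIOR_incidence) + (FALSE_positive * NO_cancer)
bayes_posterior(TRUE_positive, PRIOR_incidence, evidence)   # ~0.0776, the 7.8% above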


Implementing Bayes' Theorem with Naive Bayes Classification

NOTE: Some statisticians are disturbed by the widespread use of Naive Bayes classifiers, which they dub "Idiot's Bayes", because the naive assumption that the features are independent is almost always invalid in the real world. However, the method has been shown to perform surprisingly well in a wide variety of contexts.
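Concretely, the "naive" part is the assumption that the predictors are conditionally independent given the class, so the class-conditional likelihood factorizes into one-dimensional pieces:

P(x1, x2, ..., xn | C) = P(x1 | C) * P(x2 | C) * ... * P(xn | C)

and the classifier simply assigns the class C that maximizes P(C) * P(x1 | C) * ... * P(xn | C).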


We're using the well-known "iris" dataset in this example:

library("klaR")
library("caret")
library(dplyr)

#split the predictors (x) from the class labels (y)
x <- select(iris, -Species)
y <- iris$Species

#x = attributes, y = labels, 'nb' = use the Naive Bayes model (from klaR),
#trControl = use cross-validation with 10 folds
model <- train(x, y, method = 'nb', trControl = trainControl(method = 'cv', number = 10))
model
## Naive Bayes 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
## 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa  Accuracy SD  Kappa SD  
##   FALSE      0.9533333  0.93   0.05488484   0.08232726
##    TRUE      0.9600000  0.94   0.04661373   0.06992059
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were fL = 0 and usekernel = TRUE.
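The usekernel tuning parameter comes from klaR's NaiveBayes function: when it is FALSE each feature's class-conditional density is modeled as a normal distribution, and when it is TRUE a kernel density estimate is used instead; here the kernel version was selected. If you want to poke at the fitted model itself, the final klaR NaiveBayes object is stored in model$finalModel (a quick sketch of what to look at, based on klaR's documented components):

model$finalModel$apriori         # prior class distribution (each species appears equally often here)
names(model$finalModel$tables)   # one class-conditional density estimate per predictor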

Interpreting the training output:

Kappa is a less biased estimate of the model's performance than the regular "accuracy" measure, because it compares the observed accuracy with the accuracy expected by random chance.

A more thorough explanation is given here: http://stats.stackexchange.com/questions/82162/kappa-statistic-in-plain-english
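For illustration only (the numbers below are taken from the confusion matrix in the next section, not from new model output), kappa can be computed by hand from the observed accuracy and the accuracy expected by chance:

# observed accuracy: 144 of the 150 flowers are classified correctly
observed <- 144/150
# chance-expected accuracy: predictions and true labels are both split evenly
# across three classes, so agreement by chance happens 1/3 of the time
expected <- 1/3
(observed - expected)/(1 - expected)   # 0.94, the Kappa reported below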

Evaluating the fitted model (note that we are predicting on the same 150 observations the model was trained on, so this is an in-sample check rather than a true held-out test set):

predictions <- predict(model$finalModel, x)
confusionMatrix(predictions$class, y)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          3        47
## 
## Overall Statistics
##                                          
##                Accuracy : 0.96           
##                  95% CI : (0.915, 0.9852)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.94           
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9400           0.9400
## Specificity                 1.0000            0.9700           0.9700
## Pos Pred Value              1.0000            0.9400           0.9400
## Neg Pred Value              1.0000            0.9700           0.9700
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3133           0.3133
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.9550           0.9550
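
Because the confusion matrix above was computed on the same 150 observations the model was trained on, its accuracy is optimistic; the cross-validated accuracy reported earlier is a more honest estimate of out-of-sample performance. If we wanted a genuine held-out test set, a minimal sketch using caret's createDataPartition might look like the following (object names such as nb_model are our own, and the exact results will vary with the random split):

# hold out 20% of the rows as a test set, stratified by species
set.seed(123)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

train_x <- select(iris[in_train, ], -Species)
train_y <- iris$Species[in_train]
test_x  <- select(iris[-in_train, ], -Species)
test_y  <- iris$Species[-in_train]

# fit Naive Bayes with 10-fold cross-validation on the training portion only
nb_model <- train(train_x, train_y, method = 'nb',
                  trControl = trainControl(method = 'cv', number = 10))

# caret's predict method returns predicted classes for the held-out rows
confusionMatrix(predict(nb_model, test_x), test_y)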