Jeff Hung

Data scientist at the Institute of Manufacturing Information and Systems, National Cheng Kung University


Outline

  1. Bayesian Analysis
  2. Naive Bayes Classifier

Bayesian Analysis

Bayesian Thinking

  • Statistical Inference (Frequentist)

    Define Problem and Collect Data => Modeling => Estimation => Hypothesis Test (CI) => Prediction => Decision

  • Bayesian Thinking

Considers not only what the data have to say, but also what your “expertise” or “experience” tells you.

  • Frequentist versus Bayesian

  1. The frequentist definition sees probability as the long-run expected frequency of occurrence, e.g. \(P(A) = \frac{n}{N}\), where \(n\) is the number of times event A occurs in \(N\) trials.
  2. The Bayesian view of probability is related to degree of “belief”. It is a measure of the plausibility of an event given incomplete knowledge.

Bayes’ Theorem

  • Bayes’ Theorem

    Bayesian methods are derived from the principles of Bayesian inference, which is the process of inductive learning via Bayes’ Theorem.

    \(P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^{c})P(A^{c})}\)

    \(P(\theta|y) = \frac{P(y|\theta)P(\theta)}{\int_{\theta}P(y|\theta)P(\theta)d\theta}\)

    \(Posterior = \frac{Likelihood \times Prior}{Constant}\)

    The parameter space Θ is the set of possible parameter values, from which we hope to identify the value that best represents the true population characteristics.
  • Prior distribution \(p(\theta)\) describes our belief that \(\theta\) represents the true population characteristics.
  • Sampling model \(p(y|\theta)\) describes our belief that \(y\) would be the outcome of our study if we knew \(\theta\) to be true.
    Once we obtain the data \(y\), the last step is to update our beliefs about \(\theta\).
  • Posterior distribution \(p(\theta|y)\) describes our belief that \(\theta\) is the true value, having observed dataset \(y\).
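
    As a quick numerical illustration of the formula \(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\) above, the short R sketch below plugs in made-up probabilities (purely illustrative, not from the slides).

p_A      <- 0.3                               # prior P(A), hypothetical
p_B_A    <- 0.8                               # P(B|A), hypothetical
p_B_notA <- 0.2                               # P(B|A^c), hypothetical
p_B   <- p_B_A * p_A + p_B_notA * (1 - p_A)   # total probability P(B) = 0.38
p_A_B <- p_B_A * p_A / p_B                    # posterior P(A|B) = 0.24 / 0.38, about 0.63
p_A_B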

One-parameter Model

  • The Binomial Model

    We would like to study the happiness of college students aged 18 to 22 in Taiwan. Each student was asked whether or not they were generally happy. A reasonable sampling model for this study is

    \(X_1, ..., X_n \sim Bernoulli(\theta)\)
    \(Y=X_1+ ...+ X_n \sim Binomial(n, \theta)\)

    where \(\theta\) is the fraction of happy students aged 18 to 22 in Taiwan, and it lies between 0 and 1. Suppose we found that 118 out of 129 students are generally happy.

  • Prior (Conjugate) Distribution

    The conjugate prior for the binomial sampling model is a Beta distribution, \(\theta \sim Beta(a, b)\).

  • Posterior Distribution

    Combining the Beta prior with the binomial likelihood gives another Beta distribution, \(\theta|y \sim Beta(a+y, b+n-y)\).

  • Posterior Mean

    \(E[\theta|y] = \frac{a+y}{a+b+n}\), a compromise between the prior mean \(\frac{a}{a+b}\) and the sample proportion \(\frac{y}{n}\).
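
    A minimal sketch of this update for the happiness example, assuming a uniform Beta(1, 1) prior (the slides do not specify the prior hyperparameters, so this choice is an assumption made only for illustration):

a <- 1; b <- 1                           # assumed Beta(1, 1) prior hyperparameters
y <- 118; n <- 129                       # 118 of 129 students reported being happy
post_a <- a + y                          # posterior is Beta(a + y, b + n - y)
post_b <- b + n - y
post_a / (post_a + post_b)               # posterior mean (a + y)/(a + b + n), about 0.91
qbeta(c(0.025, 0.975), post_a, post_b)   # 95% credible interval for theta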

Strategies for Prior Determination

  • Strategies

    How do we choose our priors?
      1. We typically have some information (e.g., from literature or scientific knowledge) about \(\theta\).
      2. We can elicit information from experts.
      3. We can choose a prior that is mathematically convenient.
    In all cases, the difficulty lies in condensing the information into the form of a distribution.

Naive Bayes Classifier

In machine learning, naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.

Naive Bayes Classifier

  • Probability Model

    Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector \(x = (x_1, \dots, x_n)\) of n features (independent variables), it assigns to this instance the probabilities

    \(P(C_k|x_1, \dots, x_n)\)

    for each of the K possible outcomes or classes \(C_k\).
    Using Bayes’ theorem, the conditional probability can be decomposed as

    \(P(C_k|x)=\frac{P(x|C_k)P(C_k)}{P(x)}\)

    Using Bayesian probability terminology, the above equation can be written as

    \(Posterior = \frac{Likelihood \times Prior}{Constant\ (Evidence)}\)

    Now the “naive” conditional independence assumptions come into play: assume that each feature \(x_i\) is conditionally independent of every other feature \(x_j\) (\(j \neq i\)), given the category \(C_k\). Thus, the joint model can be expressed as \[\begin{align} P(C_k|x_1, \dots, x_n) &\propto P(C_k, x_1, \dots, x_n) \\ &= P(C_k)P(x_1|C_k)P(x_2|C_k)P(x_3|C_k)\cdots \\ &= P(C_k)\prod_{i=1}^{n} P(x_i|C_k) \end{align}\]
  • Constructing a classifier from the probability model

    The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. \[\begin{align} \hat{y} = \mathop{\arg\max}_{k \in \{1,\dots,K\}} P(C_k)\prod_{i=1}^{n} P(x_i|C_k) \end{align}\]
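
    As a toy sketch of the MAP rule (all numbers are made up for illustration and are not from the data set used below), with two classes and two binary features both observed to be 1:

prior        <- c(C1 = 0.4, C2 = 0.6)          # P(C_k), hypothetical priors
p_x1_given_C <- c(C1 = 0.8, C2 = 0.1)          # P(x1 = 1 | C_k), hypothetical
p_x2_given_C <- c(C1 = 0.3, C2 = 0.5)          # P(x2 = 1 | C_k), hypothetical
score <- prior * p_x1_given_C * p_x2_given_C   # proportional to P(C_k | x)
score / sum(score)                             # normalized posterior probabilities
names(which.max(score))                        # the MAP class, i.e. the predicted label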

An Example in R

  • The Data

    For demonstration purposes, we will build a Naive Bayes classifier here. We will use a data set that contains information on 200 students: their scores in different subjects and their educational choices (general, academic, or vocational). There are other variables indicating their socio-economic status and their gender.
library(foreign)
data <- read.dta("/Users/hungyushin/Desktop/R/Website/Rpubs/Bayesian Analysis and Naive Bayes Classifier/hsbdemo.dta")
head(data,5)
##    id female    ses schtyp     prog read write math science socst
## 1  45 female    low public vocation   34    35   41      29    26
## 2 108   male middle public  general   34    33   41      36    36
## 3  15   male   high public vocation   39    39   44      26    42
## 4  67   male    low public vocation   37    37   42      33    32
## 5 153   male middle public vocation   39    31   40      39    51
##         honors awards cid
## 1 not enrolled      0   1
## 2 not enrolled      0   1
## 3 not enrolled      0   1
## 4 not enrolled      0   1
## 5 not enrolled      0   1
str(data)
## 'data.frame':    200 obs. of  13 variables:
##  $ id     : num  45 108 15 67 153 51 164 133 2 53 ...
##  $ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
##  $ ses    : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
##  $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
##  $ prog   : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...
##  $ read   : num  34 34 39 37 39 42 31 50 39 34 ...
##  $ write  : num  35 33 39 37 31 36 36 31 41 37 ...
##  $ math   : num  41 41 44 42 40 42 46 40 33 46 ...
##  $ science: num  29 36 26 33 39 31 39 34 42 39 ...
##  $ socst  : num  26 36 42 32 51 39 46 31 41 31 ...
##  $ honors : Factor w/ 2 levels "not enrolled",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ awards : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cid    : int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "datalabel")= chr "highschool and beyond (200 cases)"
##  - attr(*, "time.stamp")= chr "30 Oct 2009 14:13"
##  - attr(*, "formats")= chr  "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int  254 254 254 254 254 254 254 254 254 254 ...
##  - attr(*, "val.labels")= chr  "" "fl" "sl" "scl" ...
##  - attr(*, "var.labels")= chr  "" "" "" "type of school" ...
##  - attr(*, "version")= int 8
##  - attr(*, "label.table")=List of 5
##   ..$ sl    : Named int  1 2 3
##   .. ..- attr(*, "names")= chr  "low" "middle" "high"
##   ..$ scl   : Named int  1 2
##   .. ..- attr(*, "names")= chr  "public" "private"
##   ..$ sel   : Named int  1 2 3
##   .. ..- attr(*, "names")= chr  "general" "academic" "vocation"
##   ..$ fl    : Named int  0 1
##   .. ..- attr(*, "names")= chr  "male" "female"
##   ..$ honlab: Named int  0 1
##   .. ..- attr(*, "names")= chr  "not enrolled" "enrolled"
summary(data)
##        id            female        ses         schtyp          prog    
##  Min.   :  1.00   male  : 91   low   :47   public :168   general : 45  
##  1st Qu.: 50.75   female:109   middle:95   private: 32   academic:105  
##  Median :100.50                high  :58                 vocation: 50  
##  Mean   :100.50                                                        
##  3rd Qu.:150.25                                                        
##  Max.   :200.00                                                        
##       read           write            math          science     
##  Min.   :28.00   Min.   :31.00   Min.   :33.00   Min.   :26.00  
##  1st Qu.:44.00   1st Qu.:45.75   1st Qu.:45.00   1st Qu.:44.00  
##  Median :50.00   Median :54.00   Median :52.00   Median :53.00  
##  Mean   :52.23   Mean   :52.77   Mean   :52.65   Mean   :51.85  
##  3rd Qu.:60.00   3rd Qu.:60.00   3rd Qu.:59.00   3rd Qu.:58.00  
##  Max.   :76.00   Max.   :67.00   Max.   :75.00   Max.   :74.00  
##      socst                honors        awards          cid       
##  Min.   :26.00   not enrolled:147   Min.   :0.00   Min.   : 1.00  
##  1st Qu.:46.00   enrolled    : 53   1st Qu.:0.00   1st Qu.: 5.00  
##  Median :52.00                      Median :1.00   Median :10.50  
##  Mean   :52.41                      Mean   :1.67   Mean   :10.43  
##  3rd Qu.:61.00                      3rd Qu.:2.00   3rd Qu.:15.00  
##  Max.   :71.00                      Max.   :7.00   Max.   :20.00
  • Naive Bayes classifier

    Now we will build a Naive Bayes classifier for our data. We will make a 70/30 partition into training and testing sets, using createDataPartition from the caret package to make a balanced partition. This is very important in this context because we are going to calculate the prior probabilities from the class counts of the training data, so the training set should follow the class proportions of the parent dataset.
# Training and Testing Set
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1)
trainIndex <- createDataPartition(data$prog, p=0.7)$Resample1
train <- data[trainIndex, ]
test<- data[-trainIndex, ]

# Classifier
library(e1071)
NBclassfier <- naiveBayes(prog~., data=train)

# Prior
NBclassfier$apriori
## Y
##  general academic vocation 
##       32       74       35
# Conditional distributions of the predictors given the class
NBclassfier$tables$science
##           science
## Y              [,1]      [,2]
##   general  52.90625  8.726524
##   academic 54.14865  8.582688
##   vocation 47.68571 11.260998
NBclassfier$tables$honors
##           honors
## Y          not enrolled   enrolled
##   general    0.78125000 0.21875000
##   academic   0.64864865 0.35135135
##   vocation   0.91428571 0.08571429

The naiveBayes function assumes Gaussian distributions for numeric variables. For the numeric variable “science”, each row of the table gives the mean and standard deviation of the predictor within the corresponding class. For the categorical variable “honors”, each row gives the conditional probabilities of its levels within the corresponding class.
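
We can check this interpretation directly from the training data; the short sketch below recomputes the class-conditional means, standard deviations, and proportions, which should reproduce the tables printed above.

# Class-conditional mean and sd of "science"; should match NBclassfier$tables$science
aggregate(science ~ prog, data = train, FUN = function(x) c(mean = mean(x), sd = sd(x)))

# Class-conditional proportions of "honors"; should match NBclassfier$tables$honors
prop.table(table(train$prog, train$honors), margin = 1)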

Now let us make predictions for the training data and for the test data.

trainPred <- predict(NBclassfier, newdata = train, type = "class")
trainTable <- table(train$prog, trainPred)
testPred <- predict(NBclassfier, newdata=test, type="class")
testTable <- table(test$prog, testPred)
trainAcc <- sum(diag(trainTable))/sum(trainTable)
testAcc<- sum(diag(testTable))/sum(testTable)
message("Confusion Matrix for Training Data")
## Confusion Matrix for Training Data
print(trainTable)
##           trainPred
##            general academic vocation
##   general        4       14       14
##   academic       7       54       13
##   vocation       4        7       24
message("Confusion Matrix for Test Data")
## Confusion Matrix for Test Data
print(testTable)
##           testPred
##            general academic vocation
##   general        5        3        5
##   academic       2       20        9
##   vocation       5        1        9
message("Accuracy")
## Accuracy
print(round(cbind(trainAccuracy=trainAcc, testAccuracy=testAcc),3))
##      trainAccuracy testAccuracy
## [1,]         0.582        0.576
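
Beyond the hard class labels, predict can also return the per-class posterior probabilities with type = "raw"; the short sketch below shows how (its output is not reproduced here since it was not part of the original run).

# Per-class posterior probabilities P(C_k | x) for the first few test cases
testProb <- predict(NBclassfier, newdata = test, type = "raw")
head(round(testProb, 3))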