Jeff Hung

Data scientist at the Institute of Manufacturing Information and Systems, National Cheng Kung University


Outline

  1. Bayesian Analysis
  2. Naive Bayes Classifier

Bayesian Analysis

Bayesian Thinking

  • Statistical Inference (Frequentist)

    Define Problem and Collect Data => Modeling => Estimation => Hypothesis Test (CI) => Prediction => Decision

  • Bayesian Thinking

Considers not only what the data have to say, but also what your “expertise” or “experience” tells you.

  • Frequentist versus Bayesian

  1. The frequentist definition sees probability as the long-run expected frequency of occurrence, e.g. \(P(A) = \frac{n}{N}\), where \(n\) is the number of times event A occurs in \(N\) trials.
  2. The Bayesian view of probability is related to degree of “belief”. It is a measure of the plausibility of an event given incomplete knowledge.

Bayes’ Theorem

  • Bayes’ Theorem

    Bayesian methods are derived from the principles of Bayesian inference, which is the process of inductive learning via Bayes’ Theorem.

    \(P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^{c})P(A^{c})}\)

    \(P(\theta|y) = \frac{P(y|\theta)P(\theta)}{\int_{\theta}P(y|\theta)P(\theta)d\theta}\)

    \(Posterior = \frac{Likelihood \times Prior}{Constant}\)

    The parameter space Θ is the set of possible parameter values, from which we hope to identify the value that best represents the true population characteristics.
  • Prior distribution \(p(\theta)\) describes our belief that \(\theta\) represents the true population characteristics.
  • Sampling model \(p(y|\theta)\) describes our belief that \(y\) would be the outcome of our study if we knew \(\theta\) to be true.
    Once we obtain the data \(y\), the last step is to update our beliefs about \(\theta\).
  • Posterior distribution \(p(\theta|y)\) describes our belief that \(\theta\) is the true value, having observed dataset \(y\).
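
    As a quick numerical illustration of the formula \(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\) above, the short R sketch below plugs in made-up probabilities (purely illustrative, not from the slides).

p_A      <- 0.3                               # prior P(A), hypothetical
p_B_A    <- 0.8                               # P(B|A), hypothetical
p_B_notA <- 0.2                               # P(B|A^c), hypothetical
p_B   <- p_B_A * p_A + p_B_notA * (1 - p_A)   # total probability P(B) = 0.38
p_A_B <- p_B_A * p_A / p_B                    # posterior P(A|B) = 0.24 / 0.38, about 0.63
p_A_B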

One-parameter Model

  • The Binomial Model

    We would like to study the happiness of college students aged 18 to 22 in Taiwan. Each student was asked whether or not they were generally happy. A reasonable sampling model for this study is

    \(X_1, ..., X_n \sim Bernoulli(\theta)\)
    \(Y=X_1+ ...+ X_n \sim Binomial(n, \theta)\)

    where \(\theta\) is the fraction of happy students aged 18 to 22 in Taiwan, and it lies between 0 and 1. Suppose we found that 118 out of 129 students are generally happy.

  • Prior (Conjugate) Distribution

    The conjugate prior for the binomial sampling model is a Beta distribution, \(\theta \sim Beta(a, b)\).

  • Posterior Distribution

    Combining the Beta prior with the binomial likelihood gives another Beta distribution, \(\theta|y \sim Beta(a+y, b+n-y)\).

  • Posterior Mean

    \(E[\theta|y] = \frac{a+y}{a+b+n}\), a compromise between the prior mean \(\frac{a}{a+b}\) and the sample proportion \(\frac{y}{n}\).
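
    A minimal sketch of this update for the happiness example, assuming a uniform Beta(1, 1) prior (the slides do not specify the prior hyperparameters, so this choice is an assumption made only for illustration):

a <- 1; b <- 1                           # assumed Beta(1, 1) prior hyperparameters
y <- 118; n <- 129                       # 118 of 129 students reported being happy
post_a <- a + y                          # posterior is Beta(a + y, b + n - y)
post_b <- b + n - y
post_a / (post_a + post_b)               # posterior mean (a + y)/(a + b + n), about 0.91
qbeta(c(0.025, 0.975), post_a, post_b)   # 95% credible interval for theta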

Strategies for Prior Determination

  • Strategies

    How do we choose our priors?
      1. We typically have some information (e.g., from literature or scientific knowledge) about \(\theta\).
      2. We can elicit information from experts.
      3. We can choose a prior that is mathematically convenient.
    In all cases, the difficulty lies in condensing the information into the form of a distribution.

Naive Bayes Classifier

In machine learning, naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.

Naive Bayes Classifier

  • Probability Model

    Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector \(x = (x_1, \dots, x_n)\) of n features (independent variables), it assigns to this instance the probabilities

    \(P(C_k|x_1, \dots, x_n)\)

    for each of the K possible outcomes or classes \(C_k\).
    Using Bayes’ theorem, the conditional probability can be decomposed as

    \(P(C_k|x)=\frac{P(x|C_k)P(C_k)}{P(x)}\)

    Using Bayesian probability terminology, the above equation can be written as

    \(Posterior = \frac{Likelihood \times Prior}{Constant\ (Evidence)}\)

    Now the “naive” conditional independence assumptions come into play: assume that each feature \(x_i\) is conditionally independent of every other feature \(x_j\) (\(j \neq i\)), given the category \(C_k\). Thus, the joint model can be expressed as \[\begin{align} P(C_k|x_1, \dots, x_n) &\propto P(C_k, x_1, \dots, x_n) \\ &= P(C_k)P(x_1|C_k)P(x_2|C_k)P(x_3|C_k)\cdots \\ &= P(C_k)\prod_{i=1}^{n} P(x_i|C_k) \end{align}\]
  • Constructing a classifier from the probability model

    The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. \[\begin{align} \hat{y} = \mathop{\arg\max}_{k \in \{1,\dots,K\}} P(C_k)\prod_{i=1}^{n} P(x_i|C_k) \end{align}\]
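
    As a toy sketch of the MAP rule (all numbers are made up for illustration and are not from the data set used below), with two classes and two binary features both observed to be 1:

prior        <- c(C1 = 0.4, C2 = 0.6)          # P(C_k), hypothetical priors
p_x1_given_C <- c(C1 = 0.8, C2 = 0.1)          # P(x1 = 1 | C_k), hypothetical
p_x2_given_C <- c(C1 = 0.3, C2 = 0.5)          # P(x2 = 1 | C_k), hypothetical
score <- prior * p_x1_given_C * p_x2_given_C   # proportional to P(C_k | x)
score / sum(score)                             # normalized posterior probabilities
names(which.max(score))                        # the MAP class, i.e. the predicted label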

An Example in R

  • The Data

    For demonstration purposes, we will build a Naive Bayes classifier here. We will use a data set that contains information on 200 students: their scores in different subjects and their educational choices (general, academic, or vocational). There are other variables indicating their socio-economic status and their gender.
library(foreign)
data <- read.dta("/Users/hungyushin/Desktop/R/Website/Rpubs/Bayesian Analysis and Naive Bayes Classifier/hsbdemo.dta")
head(data,5)
##    id female    ses schtyp     prog read write math science socst
## 1  45 female    low public vocation   34    35   41      29    26
## 2 108   male middle public  general   34    33   41      36    36
## 3  15   male   high public vocation   39    39   44      26    42
## 4  67   male    low public vocation   37    37   42      33    32
## 5 153   male middle public vocation   39    31   40      39    51
##         honors awards cid
## 1 not enrolled      0   1
## 2 not enrolled      0   1
## 3 not enrolled      0   1
## 4 not enrolled      0   1
## 5 not enrolled      0   1
str(data)
## 'data.frame':    200 obs. of  13 variables:
##  $ id     : num  45 108 15 67 153 51 164 133 2 53 ...
##  $ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
##  $ ses    : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
##  $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
##  $ prog   : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...
##  $ read   : num  34 34 39 37 39 42 31 50 39 34 ...
##  $ write  : num  35 33 39 37 31 36 36 31 41 37 ...
##  $ math   : num  41 41 44 42 40 42 46 40 33 46 ...
##  $ science: num  29 36 26 33 39 31 39 34 42 39 ...
##  $ socst  : num  26 36 42 32 51 39 46 31 41 31 ...
##  $ honors : Factor w/ 2 levels "not enrolled",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ awards : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cid    : int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "datalabel")= chr "highschool and beyond (200 cases)"
##  - attr(*, "time.stamp")= chr "30 Oct 2009 14:13"
##  - attr(*, "formats")= chr  "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int  254 254 254 254 254 254 254 254 254 254 ...
##  - attr(*, "val.labels")= chr  "" "fl" "sl" "scl" ...
##  - attr(*, "var.labels")= chr  "" "" "" "type of school" ...
##  - attr(*, "version")= int 8
##  - attr(*, "label.table")=List of 5
##   ..$ sl    : Named int  1 2 3
##   .. ..- attr(*, "names")= chr  "low" "middle" "high"
##   ..$ scl   : Named int  1 2
##   .. ..- attr(*, "names")= chr  "public" "private"
##   ..$ sel   : Named int  1 2 3
##   .. ..- attr(*, "names")= chr  "general" "academic" "vocation"
##   ..$ fl    : Named int  0 1
##   .. ..- attr(*, "names")= chr  "male" "female"
##   ..$ honlab: Named int  0 1
##   .. ..- attr(*, "names")= chr  "not enrolled" "enrolled"
summary(data)
##        id            female        ses         schtyp          prog    
##  Min.   :  1.00   male  : 91   low   :47   public :168   general : 45  
##  1st Qu.: 50.75   female:109   middle:95   private: 32   academic:105  
##  Median :100.50                high  :58                 vocation: 50  
##  Mean   :100.50                                                        
##  3rd Qu.:150.25                                                        
##  Max.   :200.00                                                        
##       read           write            math          science     
##  Min.   :28.00   Min.   :31.00   Min.   :33.00   Min.   :26.00  
##  1st Qu.:44.00   1st Qu.:45.75   1st Qu.:45.00   1st Qu.:44.00  
##  Median :50.00   Median :54.00   Median :52.00   Median :53.00  
##  Mean   :52.23   Mean   :52.77   Mean   :52.65   Mean   :51.85  
##  3rd Qu.:60.00   3rd Qu.:60.00   3rd Qu.:59.00   3rd Qu.:58.00  
##  Max.   :76.00   Max.   :67.00   Max.   :75.00   Max.   :74.00  
##      socst                honors        awards          cid       
##  Min.   :26.00   not enrolled:147   Min.   :0.00   Min.   : 1.00  
##  1st Qu.:46.00   enrolled    : 53   1st Qu.:0.00   1st Qu.: 5.00  
##  Median :52.00                      Median :1.00   Median :10.50  
##  Mean   :52.41                      Mean   :1.67   Mean   :10.43  
##  3rd Qu.:61.00                      3rd Qu.:2.00   3rd Qu.:15.00  
##  Max.   :71.00                      Max.   :7.00   Max.   :20.00
  • Naive Bayes classifier

    Now we will build a Naive Bayes classifier for our data. We will make a 70/30 partition into training and testing sets, using createDataPartition from the caret package to make a balanced partition. This is very important in this context because we are going to calculate the prior probabilities from the class counts of the training data, so the training set should follow the class proportions of the parent dataset.
# Training and Testing Set
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1)
trainIndex <- createDataPartition(data$prog, p=0.7)$Resample1
train <- data[trainIndex, ]
test<- data[-trainIndex, ]

# Classifier
library(e1071)
NBclassfier <- naiveBayes(prog~., data=train)

# Prior
NBclassfier$apriori
## Y
##  general academic vocation 
##       32       74       35
# Conditional distributions of the predictors given the class
NBclassfier$tables$science
##           science
## Y              [,1]      [,2]
##   general  52.90625  8.726524
##   academic 54.14865  8.582688
##   vocation 47.68571 11.260998
NBclassfier$tables$honors
##           honors
## Y          not enrolled   enrolled
##   general    0.78125000 0.21875000
##   academic   0.64864865 0.35135135
##   vocation   0.91428571 0.08571429

The naiveBayes function assumes Gaussian distributions for numeric variables. For the numeric variable “science”, each row of the table gives the mean and standard deviation of the predictor within the corresponding class. For the categorical variable “honors”, each row gives the conditional probabilities of its levels within the corresponding class.
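
We can check this interpretation directly from the training data; the short sketch below recomputes the class-conditional means, standard deviations, and proportions, which should reproduce the tables printed above.

# Class-conditional mean and sd of "science"; should match NBclassfier$tables$science
aggregate(science ~ prog, data = train, FUN = function(x) c(mean = mean(x), sd = sd(x)))

# Class-conditional proportions of "honors"; should match NBclassfier$tables$honors
prop.table(table(train$prog, train$honors), margin = 1)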

Now let us make predictions for the training data and for the test data.

trainPred <- predict(NBclassfier, newdata = train, type = "class")
trainTable <- table(train$prog, trainPred)
testPred <- predict(NBclassfier, newdata=test, type="class")
testTable <- table(test$prog, testPred)
trainAcc <- sum(diag(trainTable))/sum(trainTable)
testAcc<- sum(diag(testTable))/sum(testTable)
message("Confusion Matrix for Training Data")
## Confusion Matrix for Training Data
print(trainTable)
##           trainPred
##            general academic vocation
##   general        4       14       14
##   academic       7       54       13
##   vocation       4        7       24
message("Confusion Matrix for Test Data")
## Confusion Matrix for Test Data
print(testTable)
##           testPred
##            general academic vocation
##   general        5        3        5
##   academic       2       20        9
##   vocation       5        1        9
message("Accuracy")
## Accuracy
print(round(cbind(trainAccuracy=trainAcc, testAccuracy=testAcc),3))
##      trainAccuracy testAccuracy
## [1,]         0.582        0.576
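
Beyond the hard class labels, predict can also return the per-class posterior probabilities with type = "raw"; the short sketch below shows how (its output is not reproduced here since it was not part of the original run).

# Per-class posterior probabilities P(C_k | x) for the first few test cases
testProb <- predict(NBclassfier, newdata = test, type = "raw")
head(round(testProb, 3))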