Jeff Hung
Data scientist at the Institute of Manufacturing Information and Systems, National Cheng Kung University
Gmail
LinkedIn
Github
Polab
Outline
- Bayesian Analysis
- Naive Bayes Classifier
Bayesian Analysis
Bayesian Thinking
Statistical Inference (Frequentist)
Define Problem and Collect Data => Modeling => Estimation => Hypothesis Test (CI) => Prediction => Decision
Bayesian Thinking
Bayesian thinking considers not only what the data have to say, but also what your “expertise” or “experience” tells you.
Frequentist versus Bayesian
- The frequentist definition sees probability as the long-run expected frequency of occurrence, e.g. \(P(A) = \frac{n}{N}\).
- The Bayesian view of probability is related to degree of “belief”. It is a measure of the plausibility of an event given incomplete knowledge.
Bayes’ Theorem
Bayes’ Theorem
Bayesian methods are derived from the principles of Bayesian inference, which is the process of inductive learning via Bayes’ Theorem.
\(P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^{c})P(A^{c})}\)
\(P(\theta|y) = \frac{P(y|\theta)P(\theta)}{\int_{\theta}P(y|\theta)P(\theta)d\theta}\)
\(Posterior = \frac{Likelihood*Prior}{constant}\)
- The parameter space \(\Theta\) is the set of possible parameter values, from which we hope to identify the value that best represents the true population characteristics.
- Prior distribution \(p(\theta)\) describes our belief that \(\theta\) represents the true population characteristics.
- Sampling model \(p(y|\theta)\) describes our belief that \(y\) would be the outcome of our study if we knew \(\theta\) to be true.
- Once we obtain the data \(y\), the last step is to update our beliefs about \(\theta\): the posterior distribution \(p(\theta|y)\) describes our belief that \(\theta\) is the true value, having observed the dataset \(y\).
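To see how the prior, the likelihood, and the normalizing constant combine in practice, here is a rough R sketch (not from the original slides) that approximates the integral in the denominator on a discrete grid of \(\theta\) values, using a flat Beta(1, 1) prior and made-up data of 7 successes out of 10 trials:
# Grid approximation of Bayes' theorem: posterior is proportional to likelihood * prior
theta <- seq(0, 1, by = 0.01)                     # grid of candidate parameter values
prior <- dbeta(theta, 1, 1)                       # flat Beta(1, 1) prior (an assumption)
likelihood <- dbinom(7, size = 10, prob = theta)  # hypothetical data: 7 successes in 10 trials
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)           # normalize: discrete analogue of the integral
plot(theta, posterior, type = "l",
     xlab = expression(theta), ylab = "posterior probability")
The normalization step plays the role of the constant in the denominator; it does not depend on \(\theta\) and only rescales the curve so that it sums to one.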
One-parameter Model
The Binomial Model
We would like to study happiness among college students aged 18 to 22 in Taiwan. Each student was asked whether or not they were generally happy. A reasonable sampling model for this study is
\(X_1, ..., X_n \sim Bernoulli(\theta)\)
\(Y=X_1+ ...+ X_n \sim Binomial(n, \theta)\)
where \(\theta \in [0, 1]\) is the fraction of happy students aged 18 to 22 in Taiwan. Suppose we found that 118 out of 129 students are generally happy.
Prior (Conjugate) Distribution
…
Posterior Distribution
…
Posterior Mean
…
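The derivation itself is elided above; as a minimal sketch of the standard conjugate analysis, assume a Beta(a, b) prior (the uniform Beta(1, 1) is used here purely for illustration). The posterior is then Beta(a + y, b + n - y), which for the happiness data can be computed directly in R:
# Conjugate Beta-Binomial update for the happiness example (assumed Beta(1, 1) prior)
a <- 1; b <- 1        # prior Beta(a, b); a = b = 1 is the uniform prior
y <- 118; n <- 129    # 118 of 129 students reported being generally happy
post_a <- a + y       # posterior: Beta(a + y, b + n - y)
post_b <- b + n - y
post_a / (post_a + post_b)                 # posterior mean (a + y)/(a + b + n), about 0.91
qbeta(c(0.025, 0.975), post_a, post_b)     # 95% credible interval for theta
The posterior mean \(\frac{a+y}{a+b+n}\) is a weighted compromise between the prior mean \(\frac{a}{a+b}\) and the sample proportion \(\frac{y}{n}\).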
Strategies for Prior Determination
Strategies
How do we choose our priors?
1. We typically have some information about the parameter (e.g., from the literature or from scientific knowledge).
2. We can elicit information from experts.
3. We can choose a prior that is mathematically convenient.
In all cases, the difficulty lies in condensing the information into the form of a distribution.
Naive Bayes Classifier
In machine learning, naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.
Naive Bayes Classifier
Probability Model
Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector \(x = (x_1, \dots, x_n)\) of n features (independent variables), it assigns to this instance probabilities \(P(C_k|x_1, \dots, x_n)\) for each of K possible outcomes or classes \(C_k\).
Using Bayes’ theorem, the conditional probability can be decomposed as \(P(C_k|x)=\frac{P(x|C_k)P(C_k)}{P(x)}\)
Using Bayesian probability terminology, the above equation can be written as \(Posterior = \frac{Likelihood*Prior}{Constant(Evidence)}\)
Now the “naive” conditional independence assumptions come into play: assume that each feature \(x_i\) is conditionally independent of every other feature \(x_j\) for \(j \neq i\), given the category \(C_k\). Thus, the joint model can be expressed as \[\begin{align} P(C_k|x_1, \dots, x_n) &\propto P(C_k,x_1, \dots, x_n) \\ &= P(C_k)P(x_1|C_k)P(x_2|C_k)P(x_3|C_k)\cdots \\ &= P(C_k)\prod_{i=1}^{n} P(x_i|C_k) \end{align}\]
Constructing a classifier from the probability model
The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. \[\begin{align} \hat{y} = \mathop{\arg\max}_{k \in {1,\dots,K}} P(C_k)\prod_{i=1}^{n} P(x_i|C_k) \end{align}\]
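To make the MAP rule concrete, here is a small from-scratch R sketch (a toy illustration, not the author's analysis below) that estimates class priors and per-feature conditional probabilities from a tiny table of categorical features, then picks the class with the highest posterior score on the log scale:
# Toy naive Bayes with categorical features, illustrating the MAP decision rule
train_toy <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "rain"),
  windy   = c("yes",   "no",    "yes",  "no",   "no",       "yes"),
  play    = c("no",    "yes",   "no",   "yes",  "yes",      "no")
)
classes <- unique(train_toy$play)
log_prior <- log(prop.table(table(train_toy$play)))   # log P(C_k) from class frequencies
predict_map <- function(x) {
  scores <- sapply(classes, function(k) {
    rows <- train_toy$play == k
    # log P(C_k) + sum_i log P(x_i | C_k), with conditional probabilities
    # estimated by class-wise relative frequencies (no smoothing, for simplicity)
    log_lik <- sum(sapply(names(x), function(f) log(mean(train_toy[rows, f] == x[[f]]))))
    unname(log_prior[k]) + log_lik
  })
  names(which.max(scores))   # arg max over classes = the MAP decision
}
predict_map(list(outlook = "sunny", windy = "yes"))   # returns the most probable class
In practice one would add Laplace smoothing so that an unseen feature value does not zero out an entire class; library implementations such as e1071::naiveBayes expose this via the laplace argument.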
An Example in R
The Data
For demonstration purposes, we will build a Naive Bayes classifier. We will use a data set that contains information on 200 students: their scores in different subjects, their educational choices (general, academic, or vocational), and other variables indicating their socio-economic status and their gender.
library(foreign)
data <- read.dta("/Users/hungyushin/Desktop/R/Website/Rpubs/Bayesian Analysis and Naive Bayes Classifier/hsbdemo.dta")
head(data,5)
## id female ses schtyp prog read write math science socst
## 1 45 female low public vocation 34 35 41 29 26
## 2 108 male middle public general 34 33 41 36 36
## 3 15 male high public vocation 39 39 44 26 42
## 4 67 male low public vocation 37 37 42 33 32
## 5 153 male middle public vocation 39 31 40 39 51
## honors awards cid
## 1 not enrolled 0 1
## 2 not enrolled 0 1
## 3 not enrolled 0 1
## 4 not enrolled 0 1
## 5 not enrolled 0 1
str(data)
## 'data.frame': 200 obs. of 13 variables:
## $ id : num 45 108 15 67 153 51 164 133 2 53 ...
## $ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
## $ ses : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
## $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
## $ prog : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...
## $ read : num 34 34 39 37 39 42 31 50 39 34 ...
## $ write : num 35 33 39 37 31 36 36 31 41 37 ...
## $ math : num 41 41 44 42 40 42 46 40 33 46 ...
## $ science: num 29 36 26 33 39 31 39 34 42 39 ...
## $ socst : num 26 36 42 32 51 39 46 31 41 31 ...
## $ honors : Factor w/ 2 levels "not enrolled",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ awards : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cid : int 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "datalabel")= chr "highschool and beyond (200 cases)"
## - attr(*, "time.stamp")= chr "30 Oct 2009 14:13"
## - attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int 254 254 254 254 254 254 254 254 254 254 ...
## - attr(*, "val.labels")= chr "" "fl" "sl" "scl" ...
## - attr(*, "var.labels")= chr "" "" "" "type of school" ...
## - attr(*, "version")= int 8
## - attr(*, "label.table")=List of 5
## ..$ sl : Named int 1 2 3
## .. ..- attr(*, "names")= chr "low" "middle" "high"
## ..$ scl : Named int 1 2
## .. ..- attr(*, "names")= chr "public" "private"
## ..$ sel : Named int 1 2 3
## .. ..- attr(*, "names")= chr "general" "academic" "vocation"
## ..$ fl : Named int 0 1
## .. ..- attr(*, "names")= chr "male" "female"
## ..$ honlab: Named int 0 1
## .. ..- attr(*, "names")= chr "not enrolled" "enrolled"
summary(data)
## id female ses schtyp prog
## Min. : 1.00 male : 91 low :47 public :168 general : 45
## 1st Qu.: 50.75 female:109 middle:95 private: 32 academic:105
## Median :100.50 high :58 vocation: 50
## Mean :100.50
## 3rd Qu.:150.25
## Max. :200.00
## read write math science
## Min. :28.00 Min. :31.00 Min. :33.00 Min. :26.00
## 1st Qu.:44.00 1st Qu.:45.75 1st Qu.:45.00 1st Qu.:44.00
## Median :50.00 Median :54.00 Median :52.00 Median :53.00
## Mean :52.23 Mean :52.77 Mean :52.65 Mean :51.85
## 3rd Qu.:60.00 3rd Qu.:60.00 3rd Qu.:59.00 3rd Qu.:58.00
## Max. :76.00 Max. :67.00 Max. :75.00 Max. :74.00
## socst honors awards cid
## Min. :26.00 not enrolled:147 Min. :0.00 Min. : 1.00
## 1st Qu.:46.00 enrolled : 53 1st Qu.:0.00 1st Qu.: 5.00
## Median :52.00 Median :1.00 Median :10.50
## Mean :52.41 Mean :1.67 Mean :10.43
## 3rd Qu.:61.00 3rd Qu.:2.00 3rd Qu.:15.00
## Max. :71.00 Max. :7.00 Max. :20.00
Naive Bayes classifier
Now we will build a Naive Bayes classifier for our data, using a 70/30 split into training and test sets. We use createDataPartition from the caret package to make a stratified (balanced) partition. This matters here because the prior probabilities are calculated from the class counts in the training data, so the class proportions should follow those of the parent dataset.
# Training and Testing Set
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1)
trainIndex <- createDataPartition(data$prog, p=0.7)$Resample1
train <- data[trainIndex, ]
test<- data[-trainIndex, ]
# Classifier
library(e1071)
NBclassfier <- naiveBayes(prog~., data=train)
# Prior
NBclassfier$apriori
## Y
## general academic vocation
## 32 74 35
# Posterior
NBclassfier$tables$science
## science
## Y [,1] [,2]
## general 52.90625 8.726524
## academic 54.14865 8.582688
## vocation 47.68571 11.260998
NBclassfier$tables$honors
## honors
## Y not enrolled enrolled
## general 0.78125000 0.21875000
## academic 0.64864865 0.35135135
## vocation 0.91428571 0.08571429
The naiveBayes function assumes Gaussian distributions for numeric variables. For the numeric variable "science", the table shows, for each class of Y, the mean ([,1]) and standard deviation ([,2]) of the predictor within that class. For the categorical variable "honors", the table shows the conditional probability of each level of the predictor within each class.
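To check this reading of the output, one can reproduce the "science" table directly from the training data (a quick verification, not part of the original analysis):
# Class-wise mean and standard deviation of science, matching NBclassfier$tables$science
aggregate(science ~ prog, data = train, FUN = function(x) c(mean = mean(x), sd = sd(x)))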
Now let us make predictions for the training data and for the test data.
trainPred <- predict(NBclassfier, newdata = train, type = "class")
trainTable <- table(train$prog, trainPred)
testPred <- predict(NBclassfier, newdata=test, type="class")
testTable <- table(test$prog, testPred)
trainAcc <- sum(diag(trainTable))/sum(trainTable)
testAcc<- sum(diag(testTable))/sum(testTable)
message("Confusion Matrix for Training Data")
## Confusion Matrix for Training Data
print(trainTable)
## trainPred
## general academic vocation
## general 4 14 14
## academic 7 54 13
## vocation 4 7 24
message("Confusion Matrix for Test Data")
## Confusion Matrix for Test Data
print(testTable)
## testPred
## general academic vocation
## general 5 3 5
## academic 2 20 9
## vocation 5 1 9
message("Accuracy")
## Accuracy
print(round(cbind(trainAccuracy=trainAcc, testAccuracy=testAcc),3))
## trainAccuracy testAccuracy
## [1,] 0.582 0.576
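Beyond overall accuracy, per-class measures such as sensitivity and specificity can be informative here because the three programs are imbalanced. Assuming the caret package is still loaded, a quick follow-up could use its confusionMatrix function:
# Per-class sensitivity, specificity, and balanced accuracy on the test set
confusionMatrix(testPred, test$prog)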