keywords: Naive Bayes classifier, supervised learning, classification, package e1071, package naivebayes
The Naive Bayes classifier is a simple classifier that is based on the well-known Bayes theorem. Despite its simplicity, it remains a popular choice for text classification1.
In this tutorial we will cover the theory behind the Naive Bayes classifier and then build one in R.
As stated earlier, the Naive Bayes classifier applies the well-known Bayes theorem for conditional probability. In its simplest form, for events \(A\) and \(B\), Bayes theorem relates two conditional probabilities as follows:
\[ P(A\cap B)=P(A, B)=P(A)P(B|A)=P(B)P(A|B) \]
\[ \implies P(B|A)=\frac{P(B)P(A|B)}{P(A)} \] Now let us see how this simple formula can be used to make a classifier.
In a classification problem, we have some predictors (features/covariates/independent variables) and an outcome (target/class/dependent variable). Each observation has values for the predictors and a class label. From these predictors and associated classes we want to learn, so that when new feature values are given we can predict the class (that is what supervised learning is all about!). In Naive Bayes, the algorithm evaluates a probability for each class given the predictor values, and, intuitively, we go for the class with the highest probability.
Now, suppose there are \(n\) predictors, denoted by \(X_1, X_2, X_3,\dots, X_n\), and an outcome variable \(y\) coming from one of \(K\) classes, denoted by \(C_1, C_2, C_3,\dots ,C_K\). Suppose the predictor values are given as follows:
\[ X_1=x_1,\ X_2=x_2,\ X_3=x_3,\dots , X_n=x_n \] We want to evaluate the probability that the observation comes from a particular class, say \(C_k\). We can write this in conditional probability notation as:
\[ P(y=C_k|X_1=x_1,\ X_2=x_2,\ X_3=x_3,\dots , X_n=x_n), \\ \text{or in short,}\ \ \ \ \ \ P(C_k|x_1, x_2,x_3,\dots , x_n) \]
Now in the above Bayes formula, if we plug in \(B=C_k\) and \(A=\{X_1=x_1,\ X_2=x_2,\dots,\ X_n=x_n\}\), we can write the above conditional probability as follows:
\[ P(C_k|x_1, x_2,x_3,\dots , x_n)=\frac{P(C_k)P(x_1, x_2,\dots,x_n|C_k)}{P(x_1, x_2,\dots,x_n)} \]
Now the numerator of the above is just the joint probability of \(C_k\) and \(\{x_1, x_2,\dots,x_n\}\), that is \(P(C_k\cap \{ x_1, x_2,\dots,x_n\})=P(C_k, x_1, x_2,\dots,x_n)\). We can apply the Bayes formula repeatedly to get the following:
\[ P(C_k, x_1, x_2,\dots,x_n)=P(x_1, x_2,\dots,x_n, C_k)=P(x_2,x_3,\dots,x_n,C_k)\ P(x_1|x_2,x_3,\dots,x_n,C_k) \]
\[ =P(x_3,x_4,\dots,x_n,C_k)\ P(x_2|x_3,x_4,\dots,x_n,C_k)P(x_1|x_2,x_3,\dots,x_n,C_k) \] \[ =P(x_4,x_5,\dots,x_n,C_k)P(x_3|x_4,x_5,\dots,x_n,C_k) P(x_2|x_3,x_4,\dots,x_n,C_k)P(x_1|x_2,x_3,\dots,x_n,C_k) \]
\[ \vdots \]
\[ =P(C_k)P(x_n|C_k)P(x_{n-1}|x_n,C_k)\dots P(x_1|x_2,x_3, \dots,x_n, C_k) \]
The expression above is quite long and involves a number of conditional probabilities, but with a conditional independence assumption it reduces to a very simple form. The assumption is that, given a class \(C_k\), the predictor/feature values are independent of each other: once the class is known, one feature carries no information about another. Mathematically, if events \(A\) and \(B\) are independent conditioned on event \(C\), then the following holds:
\[ P(A|B,C)=P(A|C) \]
Now this formula can be applied to our case. We assume that all the predictors \(x_1,x_2,\dots,x_n\) are independent conditioned on the class \(C_k\). Therefore,
\[ P(x_1|x_2,x_3,\dots, x_n,C_k)=P(x_1|C_k) \]
\[ P(x_2|x_3,x_4,\dots, x_n,C_k)=P(x_2|C_k) \]
and so on. So the long expression above can be written as:
\[ P(x_1, x_2,\dots,x_n, C_k)=P(C_k)P(x_n|C_k)P(x_{n-1}|C_k)P(x_{n-2}|C_k)\dots P(x_1|C_k) \]
\[ \implies P(x_1, x_2,\dots,x_n, C_k)=P(C_k)\prod_{j=1}^{n}P(x_j|C_k) \]
So, we have
\[ P(C_k|x_1, x_2,x_3,\dots , x_n)=\frac{P(C_k)\prod_{j=1}^{n}P(x_j|C_k)}{P(x_1, x_2,\dots,x_n)} \]
Now we are trying to get the probability of \(C_k\) for the given feature values. The denominator above is a constant for a given set of features, so it is enough to compare only the numerators across the classes. That is, we evaluate the numerator for every \(k\in \{1,2,\dots,K\}\) and pick the class with the highest value.
\(P(C_k)\) and \(\prod_{j=1}^{n}P(x_j|C_k)\) in the above expression are known as the prior and the likelihood respectively. The prior can be estimated as \(P(C_k)=\frac{n_{\text{class } k}}{n_{\text{total}}}\). For the likelihood, we need the conditional distributions \(f(X_j|C_k)\) for \(j=1,2,\dots, n\). If \(X_j\) is continuous, normality is a common assumption; if \(X_j\) is discrete, a multinomial distribution is a common choice. Non-parametric (e.g., kernel) density estimates can also be used.
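To make this concrete, here is a minimal sketch in R of evaluating the numerator \(P(C_k)\prod_{j}P(x_j|C_k)\) by hand for a single observation, assuming two classes, one Gaussian predictor and one binary predictor. All names and numbers below are invented for illustration.
# Toy example (invented numbers): two classes A and B,
# one continuous predictor x1 (Gaussian within each class)
# and one binary predictor x2.
prior=c(A=0.6, B=0.4)        # P(C_k), estimated from class counts
mu=c(A=50, B=40)             # mean of x1 within each class
sigma=c(A=10, B=8)           # sd of x1 within each class
p_x2_1=c(A=0.2, B=0.7)       # P(x2 = 1 | C_k)
x1=45; x2=1                  # observed feature values
# Numerator of the posterior: P(C_k) * f(x1 | C_k) * P(x2 | C_k)
num=prior*dnorm(x1, mean=mu, sd=sigma)*(if(x2==1) p_x2_1 else 1-p_x2_1)
num/sum(num)                 # normalized posterior probabilities
names(which.max(num))        # predicted class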
For demonstration purposes, we will build a Naive Bayes classifier here. We will use a data set that contains information on 200 students: their scores in different subjects and their educational choice (general, academic or vocational). There are other variables indicating their socio-economic status and their gender.
Let us look at the data2.
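A minimal sketch of loading and inspecting the data (the file name student_data.csv is hypothetical; use whatever source holds the data):
mydata=read.csv("student_data.csv")   # hypothetical file name
head(mydata)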
## female ses prog read write math science socst honors awards cid
## 1 1 1 vocational 34 35 41 29 26 0 0 1
## 2 0 2 general 34 33 41 36 36 0 0 1
## 3 0 3 vocational 39 39 44 26 42 0 0 1
## 4 0 1 vocational 37 37 42 33 32 0 0 1
## 5 0 2 vocational 39 31 40 39 51 0 0 1
## 6 1 3 general 42 36 42 31 39 0 0 1
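The within-class correlation matrices printed below could be produced along the following lines (a sketch; the loop and rounding are choices made here to match the printed layout):
# Correlation of the score variables within each program type
scores=c("read", "write", "math", "science", "socst")
for(cls in c("general", "academic", "vocational")){
  cat("Correlation matrix within", cls, "\n")
  print(round(cor(mydata[mydata$prog==cls, scores]), 7))
}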
## Correlation matrix within general
## read write math science socst
## read 1.0000000 0.4739121 0.3945974 0.6586988 0.5418732
## write 0.4739121 1.0000000 0.3586417 0.5629392 0.6505204
## math 0.3945974 0.3586417 1.0000000 0.5752819 0.3787115
## science 0.6586988 0.5629392 0.5752819 1.0000000 0.4222026
## socst 0.5418732 0.6505204 0.3787115 0.4222026 1.0000000
## Correlation matrix within academic
## read write math science socst
## read 1.0000000 0.5608413 0.6917634 0.6250391 0.5851566
## write 0.5608413 1.0000000 0.6130255 0.5128848 0.4538175
## math 0.6917634 0.6130255 1.0000000 0.6410174 0.4591657
## science 0.6250391 0.5128848 0.6410174 1.0000000 0.4383806
## socst 0.5851566 0.4538175 0.4591657 0.4383806 1.0000000
## Correlation matrix within vocational
## read write math science socst
## read 1.0000000 0.4615702 0.4570520 0.5132068 0.4325037
## write 0.4615702 1.0000000 0.5090928 0.5225355 0.4926333
## math 0.4570520 0.5090928 1.0000000 0.5706508 0.3769207
## science 0.5132068 0.5225355 0.5706508 1.0000000 0.3348232
## socst 0.4325037 0.4926333 0.3769207 0.3348232 1.0000000
We see that the numeric variables are highly correlated within each class, whereas Naive Bayes assumes conditional independence. So, for demonstration purposes, we will use only the science and socst variables to build our classifier.
Let us check whether our chosen predictors have any discriminative power. Below we plot boxplots and densities of the chosen predictors for each category of the target.
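The figures are not reproduced here; a rough sketch with base R graphics that produces comparable boxplots and class-conditional densities (the plotting code used in the original may differ) is:
# Boxplots and class-conditional densities of the two chosen predictors
par(mfrow=c(2,2))
boxplot(science~prog, data=mydata, main="science by prog")
boxplot(socst~prog, data=mydata, main="socst by prog")
for(v in c("science", "socst")){
  dens=lapply(split(mydata[[v]], mydata$prog), density)
  plot(dens$academic, main=paste("density of", v), xlab=v)
  lines(dens$general, lty=2)
  lines(dens$vocational, lty=3)
  legend("topright", legend=names(dens), lty=1:3, bty="n")
}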
We observe a couple of things from the visualization:
The chosen variables have some discriminative power.
Most of the distributions are approximately normal; at least, they are not terribly far from normality. Let us run a formal test (the Anderson-Darling test) for conditional normality:
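The p-values below can be obtained with ad.test() from the nortest package (an assumption; the original may have used a different implementation), applied to each predictor within each class. The column labels in the printed matrix are abbreviations chosen by the author.
library(nortest)   # provides ad.test() for the Anderson-Darling normality test
# p-values of the Anderson-Darling test for science and socst within each program type
advals=sapply(split(mydata[, c("science", "socst")], mydata$prog),
              function(d) c(science=ad.test(d$science)$p.value,
                            socst=ad.test(d$socst)$p.value))
round(advals, 2)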
## P values of Anderson Darling Test
## GenScience GenSoct AcaScience AcaSoct VocScience VocSoct
## [1,] 0.48 0.1 0.01 0 0.38 0.1
So we see that the scores of the academic group are significantly non-normal. For the other four distributions the p-values are high, suggesting no significant departure from normality.
Now we will build a Naive Bayes classifier for our data. We will make a 70/30 split into training and validation sets, using createDataPartition from the caret package to get a balanced (stratified) partition. This matters here because the prior probabilities are estimated from class counts in the training data, so the training set should follow the class proportions of the full dataset.
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
set.seed(7267166)
trainIndex=createDataPartition(mydata$prog, p=0.7)$Resample1
train=mydata[trainIndex, ]
test=mydata[-trainIndex, ]
## check the balance
print(table(mydata$prog))
##
## academic general vocational
## 105 45 50
print(table(train$prog))
##
## academic general vocational
## 74 32 35
Now we will build the classifier, using the naiveBayes function from the e1071 package.
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.3
NBclassfier=naiveBayes(prog~science+socst, data=train)
print(NBclassfier)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## academic general vocational
## 0.5248227 0.2269504 0.2482270
##
## Conditional probabilities:
## science
## Y [,1] [,2]
## academic 54.21622 9.360761
## general 52.18750 8.847954
## vocational 47.31429 9.969871
##
## socst
## Y [,1] [,2]
## academic 56.58108 9.635845
## general 51.12500 8.377196
## vocational 44.82857 10.279865
The naiveBayes function assumes Gaussian distributions for numeric variables. The a-priori probabilities are calculated from the class proportions of the training data and are shown when the object is printed. For each predictor, the two columns under Y give the mean ([,1]) and standard deviation ([,2]) of that predictor within each class.
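Besides hard class labels, predict() on an e1071 naiveBayes object can also return the posterior probabilities themselves with type = "raw", which is useful for seeing how confident the classifier is. For example:
# Posterior class probabilities for the first few test observations
round(predict(NBclassfier, newdata=head(test), type="raw"), 3)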
Now let us make prediction for the training data and for the test data.
printALL=function(model){
  # Predictions and contingency tables for the training and the test set
  trainPred=predict(model, newdata=train, type="class")
  trainTable=table(train$prog, trainPred)
  testPred=predict(model, newdata=test, type="class")
  testTable=table(test$prog, testPred)
  # Overall accuracy: sum of the diagonal over the table total
  trainAcc=sum(diag(trainTable))/sum(trainTable)
  testAcc=sum(diag(testTable))/sum(testTable)
  message("Contingency Table for Training Data")
  print(trainTable)
  message("Contingency Table for Test Data")
  print(testTable)
  message("Accuracy")
  print(round(cbind(trainAccuracy=trainAcc, testAccuracy=testAcc),3))
}
printALL(NBclassfier)
## Contingency Table for Training Data
## trainPred
## academic general vocational
## academic 60 0 14
## general 24 0 8
## vocational 15 0 20
## Contingency Table for Test Data
## testPred
## academic general vocational
## academic 27 0 4
## general 7 0 6
## vocational 5 0 10
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.567 0.627
So in this case, the accuracy on the test data is actually higher than on the training data! And 62.7% test accuracy is not that bad, given that we used just two variables. But one thing this classifier does terribly badly is that it never predicts the “general” class: the middle column is all zeros in both contingency tables.
Let us give it another try, adding the socio-economic status (ses) variable.
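The code for this model is not shown in the original; a sketch that reuses printALL (the object name NBclassfier2 is chosen here) would be:
# Add ses, treated as numeric, to the model
NBclassfier2=naiveBayes(prog~science+socst+ses, data=train)
printALL(NBclassfier2)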
## Results after including ses
## Contingency Table for Training Data
## trainPred
## academic general vocational
## academic 58 5 11
## general 16 10 6
## vocational 16 1 18
## Contingency Table for Test Data
## testPred
## academic general vocational
## academic 27 0 4
## general 7 0 6
## vocational 5 0 10
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.61 0.627
So the inclusion of the ses variable gives a noticeable improvement: the training accuracy goes up, while the test accuracy stays the same. Also, some observations are now classified as “general”, at least in the training data.
Above we included the ses variable as numeric. Including it as a factor gives a further slight improvement.
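Again the code is not shown; a sketch, creating the factor version sesf that is also used later in the naive_bayes call (the object name NBclassfier3 is chosen here):
# Treat ses as a categorical predictor
train$sesf=factor(train$ses)
test$sesf=factor(test$ses)
NBclassfier3=naiveBayes(prog~science+socst+sesf, data=train)
printALL(NBclassfier3)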
## Results after coverting ses as factor
## Contingency Table for Training Data
## trainPred
## academic general vocational
## academic 60 3 11
## general 15 10 7
## vocational 16 1 18
## Contingency Table for Test Data
## testPred
## academic general vocational
## academic 27 0 4
## general 7 0 6
## vocational 5 0 10
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.624 0.627
After a quick search, I found another package called naivebayes, dedicated to building Naive Bayes classifiers. A nice feature of this package is that we can use kernel-based densities for the continuous predictors. Let us use the kernel-based option with ses as a factor.
library(naivebayes)
## Warning: package 'naivebayes' was built under R version 3.4.3
newNBclassifier=naive_bayes(prog~sesf+science+socst,usekernel=T, data=train)
printALL(newNBclassifier)
## trainPred
## academic general vocational
## academic 61 4 9
## general 14 12 6
## vocational 13 4 18
## testPred
## academic general vocational
## academic 27 0 4
## general 7 0 6
## vocational 5 0 10
## trainAccuracy testAccuracy
## [1,] 0.645 0.627
So the kernel-based densities raise the training accuracy a little higher, with no change on the test set.