keywords: Naive Bayes classifier, supervised learning, classification, package e1071, package naivebayes


Introduction


The Naive Bayes classifier is a simple classifier that has its foundation in the well-known Bayes' theorem. Despite its simplicity, it remains a popular choice for text classification [1].

In this tutorial we will cover the maths behind the Naive Bayes classifier and then build one in R using the e1071 and naivebayes packages.


The maths of Naive Bayes classifier


As stated earlier, the Naive Bayes classifier applies the well-known Bayes theorem for conditional probability. In its simplest form, for events \(A\) and \(B\), Bayes theorem relates the two conditional probabilities as follows:

\[ P(A\cap B)=P(A, B)=P(A)P(B|A)=P(B)P(A|B) \]

\[ \implies P(B|A)=\frac{P(B)P(A|B)}{P(A)} \] Now let us see how this simple formula can be used to make a classifier.

In a classification problem, we have some predictors (features/covariates/independent variables) and an outcome (target/class/dependent variable). Each observation has some values for the predictors and a class. From these predictors and their associated classes we want to learn so that, when the feature values are given, we can predict the class (that is what supervised learning is all about!). In Naive Bayes, the algorithm evaluates a probability for each class given the predictor values. Intuitively, we then go for the class that has the highest probability.

Now, suppose there are \(n\) predictors, denoted by \(X_1, X_2, X_3,\dots, X_n\), and the outcome variable \(y\) comes from one of \(K\) classes, denoted by \(C_1, C_2, C_3,\dots ,C_K\). Suppose the predictor values are given as follows:

\[ X_1=x_1,\ X_2=x_2,\ X_3=x_3,\dots , X_n=x_n \] What we want to do is evaluate the probability of the observation coming from any one of the \(K\) classes, say \(C_k\). We can write this in conditional probability notation:

\[ P(y=C_k|X_1=x_1,\ X_2=x_2,\ X_3=x_3,\dots , X_n=x_n), \\ \text{or in short,}\ \ \ \ \ \ P(C_k|x_1, x_2,x_3,\dots , x_n) \]

Now, in the above Bayes formula, if we plug in \(B=C_k\) and \(A=x_1\cap x_2\cap\dots\cap x_n=\{x_1, x_2,\dots, x_n\}\), we can write the above conditional probability as follows:

\[ P(C_k|x_1, x_2,x_3,\dots , x_n)=\frac{P(C_k)P(x_1, x_2,\dots,x_n|C_k)}{P(x_1, x_2,\dots,x_n)} \]

Now the numerator above is just the joint probability of \(C_k\) and \(\{x_1, x_2,\dots,x_n\}\), that is, \(P(C_k\cap \{ x_1, x_2,\dots,x_n\})=P(C_k, x_1, x_2,\dots,x_n)\). We can apply the Bayes formula repeatedly to get the following:

\[ P(C_k, x_1, x_2,\dots,x_n)=P(x_1, x_2,\dots,x_n, C_k)=P(x_2,x_3,\dots,x_n,C_k)\ P(x_1|x_2,x_3,\dots,x_n,C_k) \]

\[ =P(x_3,x_4,\dots,x_n,C_k)\ P(x_2|x_3,x_4,\dots,x_n,C_k)P(x_1|x_2,x_3,\dots,x_n,C_k) \] \[ =P(x_4,x_5,\dots,x_n,C_k)P(x_3|x_4,x_5,\dots,x_n,C_k) P(x_2|x_3,x_4,\dots,x_n,C_k)P(x_1|x_2,x_3,\dots,x_n,C_k) \]

\[ \vdots \]

\[ =P(C_k)P(x_n|C_k)P(x_{n-1}|x_n,C_k)\dots P(x_1|x_2,x_3, \dots,x_n, C_k) \]

The conditional independence assumption

The expression above is quite long and involves a number of conditional probabilities. But with a conditional independence assumption, this long expression reduces to a very simple form. The conditional independence assumption is that, given a class, say \(C_k\), the predictor/feature values are independent of each other: there is no correlation between the features within a class. Mathematically, if events \(A\) and \(B\) are independent conditioned on event \(C\), then the following holds:

\[ P(A|B,C)=P(A|C) \]

Now this formula can be applied to our case. We assumed that all the predictors \(x_1,x_2,\dots,x_n\) are independent conditioned on the class \(C_k\). Therefore,

\[ P(x_1|x_2,x_3,\dots, x_n,C_k)=P(x_1|C_k) \]

\[ P(x_2|x_3,x_4,\dots, x_n,C_k)=P(x_2|C_k) \]

and so on. So the long expression above can be written as:

\[ P(x_1, x_2,\dots,x_n, C_k)=P(C_k)P(x_n|C_k)P(x_{n-1}|C_k)P(x_{n-2}|C_k)\dots P(x_1|C_k) \]

\[ \implies P(x_1, x_2,\dots,x_n, C_k)=P(C_k)\prod_{j=1}^{n}P(x_j|C_k) \]

So, we have

\[ P(C_k|x_1, x_2,x_3,\dots , x_n)=\frac{P(C_k)\prod_{j=1}^{n}P(x_j|C_k)}{P(x_1, x_2,\dots,x_n)} \]

Now we are trying to get the probability of \(C_k\) for the given feature values. The denominator above is a constant for the given features, so it is enough to compare only the numerator across the classes, conditioned on the fixed feature values. That is, we evaluate the numerator for every \(k\in \{1,2,\dots,K\}\) and vote for the class with the highest value, as written compactly below.
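In other words, the decision rule is simply the arg max of the numerator over the classes:

\[ \hat{y}=\arg\max_{k\in\{1,2,\dots,K\}}\ P(C_k)\prod_{j=1}^{n}P(x_j|C_k) \]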

The prior and the likelihood

\(P(C_k)\) and \(\prod_{j=1}^{n}P(x_j|C_k)\) in the above expression are known as the prior and the likelihood, respectively. The prior can be estimated as \(P(C_k)=\frac{n_{class \ k}}{n_{total}}\). For the likelihood, we need the conditional probability distributions \(f(X_j|C_k)\) for \(j=1,2,\dots, n\). If \(X_j\) is continuous, normality is a common assumption; if \(X_j\) is discrete, a common choice is the multinomial distribution. We can also use nonparametric (e.g., kernel) density estimates.
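As a minimal numerical sketch of these estimates (a toy one-predictor example with simulated data, not part of the original tutorial; the Gaussian assumption and the query value 6.2 are made up purely for illustration):

# toy data: one numeric predictor x, two classes A and B
set.seed(1)
toy <- data.frame(class = rep(c("A", "B"), each = 50),
                  x     = c(rnorm(50, mean = 5), rnorm(50, mean = 7)))

prior <- prop.table(table(toy$class))      # P(C_k) = n_class_k / n_total
mu    <- tapply(toy$x, toy$class, mean)    # class-conditional means
sigma <- tapply(toy$x, toy$class, sd)      # class-conditional standard deviations

# unnormalised posterior P(C_k) * f(x | C_k) for a new observation x = 6.2,
# then normalised over the classes
score <- prior * dnorm(6.2, mean = mu, sd = sigma)
score / sum(score)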


Naive Bayes in R - an example


The data

For demonstration purposes, we will build a Naive Bayes classifier here. We will use a data set that contains information on 200 students: their scores in different subjects and their educational choices (general, academic or vocational). There are also variables indicating their socio-economic status and their gender.

Let us look at the data [2].
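The chunk that loads and summarises the data is hidden in the post. A sketch along the following lines could produce output like what is shown below (the use of foreign::read.dta() is an assumption; the data URL is the one given in footnote 2, and the recoding of female and ses into the numeric codes seen in the output is additional preprocessing not sketched here):

library(foreign)   # read.dta() reads Stata .dta files, also from a URL
mydata=read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
head(mydata)

# correlation of the score variables within each program
scores=c("read", "write", "math", "science", "socst")
for (g in c("general", "academic", "vocational")) {
  cat("Correlation matrix within", g, "\n")
  print(cor(mydata[mydata$prog == g, scores]))
}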

##   female ses       prog read write math science socst honors awards cid
## 1      1   1 vocational   34    35   41      29    26      0      0   1
## 2      0   2    general   34    33   41      36    36      0      0   1
## 3      0   3 vocational   39    39   44      26    42      0      0   1
## 4      0   1 vocational   37    37   42      33    32      0      0   1
## 5      0   2 vocational   39    31   40      39    51      0      0   1
## 6      1   3    general   42    36   42      31    39      0      0   1
## Correlation matrix within general
##              read     write      math   science     socst
## read    1.0000000 0.4739121 0.3945974 0.6586988 0.5418732
## write   0.4739121 1.0000000 0.3586417 0.5629392 0.6505204
## math    0.3945974 0.3586417 1.0000000 0.5752819 0.3787115
## science 0.6586988 0.5629392 0.5752819 1.0000000 0.4222026
## socst   0.5418732 0.6505204 0.3787115 0.4222026 1.0000000
## Correlation matrix within academic
##              read     write      math   science     socst
## read    1.0000000 0.5608413 0.6917634 0.6250391 0.5851566
## write   0.5608413 1.0000000 0.6130255 0.5128848 0.4538175
## math    0.6917634 0.6130255 1.0000000 0.6410174 0.4591657
## science 0.6250391 0.5128848 0.6410174 1.0000000 0.4383806
## socst   0.5851566 0.4538175 0.4591657 0.4383806 1.0000000
## Correlation matrix within vocational
##              read     write      math   science     socst
## read    1.0000000 0.4615702 0.4570520 0.5132068 0.4325037
## write   0.4615702 1.0000000 0.5090928 0.5225355 0.4926333
## math    0.4570520 0.5090928 1.0000000 0.5706508 0.3769207
## science 0.5132068 0.5225355 0.5706508 1.0000000 0.3348232
## socst   0.4325037 0.4926333 0.3769207 0.3348232 1.0000000

We see that the numeric variables are highly correlated, while Naive Bayes assumes conditional independence. So, for demonstration purposes, we will choose only the science and socst variables to build our classifier.

Some visualizations

Let us have a look at whether our chosen predictors have discriminative power. Below we plot boxplots and densities of the chosen predictors for each category of the target.
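The plotting code is not shown in the post; a ggplot2 sketch like the following (the choice of ggplot2 is an assumption) would produce comparable boxplots and density plots:

library(ggplot2)

# boxplot and density of science by program
ggplot(mydata, aes(x=prog, y=science, fill=prog)) + geom_boxplot()
ggplot(mydata, aes(x=science, colour=prog)) + geom_density()

# the same two plots for socst
ggplot(mydata, aes(x=prog, y=socst, fill=prog)) + geom_boxplot()
ggplot(mydata, aes(x=socst, colour=prog)) + geom_density()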

We observe a couple of things from the visualizations:

  1. The chosen variables have some discriminative power.

  2. Most of the distributions are approximately normal, or at least not terribly far from normality. Let us run a formal test (the Anderson-Darling test) for conditional normality, as sketched below:
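The test code is also hidden; here is a sketch using ad.test() from the nortest package (the choice of package is an assumption) that would produce a p-value table like the one below:

library(nortest)   # ad.test(): Anderson-Darling test of normality

# p value of the AD test for one subject's scores within one program group
pv=function(g, v) round(ad.test(mydata[[v]][mydata$prog == g])$p.value, 2)

cbind(GenScience=pv("general", "science"),    GenSoct=pv("general", "socst"),
      AcaScience=pv("academic", "science"),   AcaSoct=pv("academic", "socst"),
      VocScience=pv("vocational", "science"), VocSoct=pv("vocational", "socst"))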

## P values of Anderson Darling Test
##      GenScience GenSoct AcaScience AcaSoct VocScience VocSoct
## [1,]       0.48     0.1       0.01       0       0.38     0.1

So we see that the scores of the academic group are significantly non-normal. For the other four, the p-values are larger, so the test does not reject normality.

Make a Naive Bayes classifier

Now we will build a Naive Bayes classifier for our data, using a 70/30 split into training and test sets. We use createDataPartition() from the caret package to make a balanced partitioning. This is important in this context because the prior probabilities are calculated from the class counts in the training data, so the training set should follow the class proportions of the parent dataset.

library(caret)
## Warning: package 'caret' was built under R version 3.4.3
set.seed(7267166)
trainIndex=createDataPartition(mydata$prog, p=0.7)$Resample1
train=mydata[trainIndex, ]
test=mydata[-trainIndex, ]

## check the balance
print(table(mydata$prog))
## 
##   academic    general vocational 
##        105         45         50
print(table(train$prog))
## 
##   academic    general vocational 
##         74         32         35

Now we will build the classifier, using the e1071 package.

library(e1071)
## Warning: package 'e1071' was built under R version 3.4.3
NBclassfier=naiveBayes(prog~science+socst, data=train)
print(NBclassfier)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   academic    general vocational 
##  0.5248227  0.2269504  0.2482270 
## 
## Conditional probabilities:
##             science
## Y                [,1]     [,2]
##   academic   54.21622 9.360761
##   general    52.18750 8.847954
##   vocational 47.31429 9.969871
## 
##             socst
## Y                [,1]      [,2]
##   academic   56.58108  9.635845
##   general    51.12500  8.377196
##   vocational 44.82857 10.279865

The naiveBayes() function assumes Gaussian distributions for numeric variables, and the a-priori probabilities (the priors) are calculated from the class proportions in the training data. The priors are shown when the object is printed. In the conditional probability tables, for each level of Y the first column ([,1]) is the within-class mean of the predictor and the second column ([,2]) is the within-class standard deviation.
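These conditional parameters can be verified by hand; for example, for science (the same works for socst):

# within-class means and standard deviations of science in the training data
tapply(train$science, train$prog, mean)
tapply(train$science, train$prog, sd)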

Now let us make prediction for the training data and for the test data.

printALL=function(model){
  # confusion tables for the training and the test set
  trainPred=predict(model, newdata = train, type = "class")
  trainTable=table(train$prog, trainPred)
  testPred=predict(model, newdata = test, type = "class")
  testTable=table(test$prog, testPred)
  # overall accuracy: sum of the diagonal over the total count
  trainAcc=sum(diag(trainTable))/sum(trainTable)
  testAcc=sum(diag(testTable))/sum(testTable)
  message("Contingency Table for Training Data")
  print(trainTable)
  message("Contingency Table for Test Data")
  print(testTable)
  message("Accuracy")
  print(round(cbind(trainAccuracy=trainAcc, testAccuracy=testAcc),3))
}
printALL(NBclassfier)
## Contingency Table for Training Data
##             trainPred
##              academic general vocational
##   academic         60       0         14
##   general          24       0          8
##   vocational       15       0         20
## Contingency Table for Test Data
##             testPred
##              academic general vocational
##   academic         27       0          4
##   general           7       0          6
##   vocational        5       0         10
## Accuracy
##      trainAccuracy testAccuracy
## [1,]         0.567        0.627

So in this case, we get higher accuracy on the test data than on the training data! And 62.7% accuracy on the test set is not that bad, given that we used just two variables. But one thing this classifier does terribly badly: it never predicts the "general" class. The middle columns are all zeros in both contingency tables.

Let us give it another try, adding the socio-economic status (ses) variable, as sketched below.
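The refitting code is not shown in the post; presumably it is something like the following (the object name NBclassfier2 is a placeholder introduced here):

NBclassfier2=naiveBayes(prog~ses+science+socst, data=train)
printALL(NBclassfier2)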

## Results after including ses
## Contingency Table for Training Data
##             trainPred
##              academic general vocational
##   academic         58       5         11
##   general          16      10          6
##   vocational       16       1         18
## Contingency Table for Test Data
##             testPred
##              academic general vocational
##   academic         27       0          4
##   general           7       0          6
##   vocational        5       0         10
## Accuracy
##      trainAccuracy testAccuracy
## [1,]          0.61        0.627

So the inclusion of the ses variable makes a clear improvement on the training side: training accuracy went up from 0.567 to 0.61, while test accuracy is unchanged. Also, some observations are now classified as "general", at least in the training data.

Above we included the ses variable as numeric. Including it as a factor gives a slight further improvement, as sketched below.
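Again the code is hidden; presumably it looks something like the following, where sesf is the factor version of ses used later in the naive_bayes() call (the object name NBclassfier3 is a placeholder introduced here):

train$sesf=factor(train$ses)
test$sesf=factor(test$ses)
NBclassfier3=naiveBayes(prog~sesf+science+socst, data=train)
printALL(NBclassfier3)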

## Results after coverting ses as factor
## Contingency Table for Training Data
##             trainPred
##              academic general vocational
##   academic         60       3         11
##   general          15      10          7
##   vocational       16       1         18
## Contingency Table for Test Data
##             testPred
##              academic general vocational
##   academic         27       0          4
##   general           7       0          6
##   vocational        5       0         10
## Accuracy
##      trainAccuracy testAccuracy
## [1,]         0.624        0.627

After a quick search, I found another package called naivebayes, dedicated to building Naive Bayes classifiers. The good thing about this package is that we can use kernel-based densities for the continuous predictors. Let us use the kernel-based option with ses as a factor.

library(naivebayes)
## Warning: package 'naivebayes' was built under R version 3.4.3
newNBclassifier=naive_bayes(prog~sesf+science+socst, usekernel=TRUE, data=train)
printALL(newNBclassifier)
##             trainPred
##              academic general vocational
##   academic         61       4          9
##   general          14      12          6
##   vocational       13       4         18
##             testPred
##              academic general vocational
##   academic         27       0          4
##   general           7       0          6
##   vocational        5       0         10
##      trainAccuracy testAccuracy
## [1,]         0.645        0.627

So what it did is raise the training accuracy a little, with no change on the test set.


  1. https://en.wikipedia.org/wiki/Naive_Bayes_classifier

  2. data link: https://stats.idre.ucla.edu/stat/data/hsbdemo.dta