First, learn some basic R Markdown syntax by going through my tutorial at https://rpubs.com/BSMM8740/rmarkdowntutorial. Some of the most useful pieces of syntax are the chunk options for R code chunks: eval=FALSE displays the code but does not run it; echo=FALSE runs the code and displays the result but hides the code; include=FALSE runs the code but hides both the code and the result.
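For reference, here is how those options appear in the header of a chunk in the .Rmd source (the chunk names shown-not-run, hidden-code, and silent are placeholders, not names from the tutorial):

```{r shown-not-run, eval=FALSE}
# this code is displayed in the document but never executed
```

```{r hidden-code, echo=FALSE}
# this code runs and its result is displayed, but the code itself is hidden
```

```{r silent, include=FALSE}
# this code runs, but both the code and its result are hidden
```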

Example dataset
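The raw data are not reproduced in this section. The sketch below rebuilds company.df from the classification table shown later on; the data frame name and the columns Trouble, Size, and Status are taken from the code and table that follow, so treat this as a reconstruction rather than the original import step.

## reconstruct the 10-company dataset
company.df <- data.frame(
  Trouble = factor(c("Yes", "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Yes")),
  Size    = factor(c("Small", "Small", "Large", "Large", "Small", "Small", "Small", "Large", "Large", "Large")),
  Status  = factor(c(rep("Truthful", 6), rep("Fraudulent", 4)))
)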

Two types of probabilities

A-priori probabilities are computed without knowledge of the predictors. Example: given the dataset above, what is the probability that a company is fraudulent? Out of 10 companies, 4 are fraudulent, so this probability is 4/10 = 0.4.
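Assuming the reconstructed company.df above, this prior can be computed directly:

## a-priori class probabilities, ignoring all predictors
## (returns 0.4 for Fraudulent and 0.6 for Truthful, matching 4/10 above)
prop.table(table(company.df$Status))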
Conditional (posterior) probabilities, by contrast, are computed using knowledge of the predictors; these are what the naive Bayes classifier estimates for each record. We now use the NB classifier to classify companies based on their predictors, using the whole dataset.
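The fitting step itself is not shown in this section. A minimal sketch, assuming train.full.nb was fit with naiveBayes() from the e1071 package, whose predict() method accepts the type = "raw" argument used below:

library(e1071)

## fit a naive Bayes classifier for Status from both predictors,
## training on the entire dataset
train.full.nb <- naiveBayes(Status ~ Trouble + Size, data = company.df)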
## predict probabilities
pred.prob <- predict(train.full.nb, newdata=company.df, type="raw")
pred.prob
##       Fraudulent Truthful
##  [1,]     0.5294    0.471
##  [2,]     0.0698    0.930
##  [3,]     0.3103    0.690
##  [4,]     0.3103    0.690
##  [5,]     0.0698    0.930
##  [6,]     0.0698    0.930
##  [7,]     0.5294    0.471
##  [8,]     0.8710    0.129
##  [9,]     0.3103    0.690
## [10,]     0.8710    0.129
## predict class membership
pred.class <- predict(train.full.nb, newdata=company.df)
pred.class
##  [1] Fraudulent Truthful   Truthful   Truthful   Truthful   Truthful  
##  [7] Fraudulent Fraudulent Truthful   Fraudulent
## Levels: Fraudulent Truthful
## combine the predicted probabilities and classes with the original data
pred.df <- data.frame(pred.prob, pred.class)
complete.df <- cbind(company.df, pred.df)
Display the classification result
kab <- knitr::kable(complete.df, caption = "Classification Result",
                    booktabs = TRUE, label = "result-table")

kableExtra::kable_classic_2(kab, full_width = TRUE)
Classification Result

Trouble   Size    Status       Fraudulent   Truthful   pred.class
Yes       Small   Truthful          0.529      0.471   Fraudulent
No        Small   Truthful          0.070      0.930   Truthful
No        Large   Truthful          0.310      0.690   Truthful
No        Large   Truthful          0.310      0.690   Truthful
No        Small   Truthful          0.070      0.930   Truthful
No        Small   Truthful          0.070      0.930   Truthful
Yes       Small   Fraudulent        0.529      0.471   Fraudulent
Yes       Large   Fraudulent        0.871      0.129   Fraudulent
No        Large   Fraudulent        0.310      0.690   Truthful
Yes       Large   Fraudulent        0.871      0.129   Fraudulent
Display the confusion matrix
## confusionMatrix() comes from the caret package
caret::confusionMatrix(predict(train.full.nb, newdata = company.df), company.df$Status)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Fraudulent Truthful
##   Fraudulent          3        1
##   Truthful            1        5
##                                         
##                Accuracy : 0.8           
##                  95% CI : (0.444, 0.975)
##     No Information Rate : 0.6           
##     P-Value [Acc > NIR] : 0.167         
##                                         
##                   Kappa : 0.583         
##                                         
##  Mcnemar's Test P-Value : 1.000         
##                                         
##             Sensitivity : 0.750         
##             Specificity : 0.833         
##          Pos Pred Value : 0.750         
##          Neg Pred Value : 0.833         
##              Prevalence : 0.400         
##          Detection Rate : 0.300         
##    Detection Prevalence : 0.400         
##       Balanced Accuracy : 0.792         
##                                         
##        'Positive' Class : Fraudulent    
##