Lab 5: Naive Bayes: Toy example

Learning objectives:

  • calculate conditional probabilities by hand
  • estimate a naive Bayes model using naiveBayes()
  • calculate conditional probabilities using the estimated naive Bayes model and predict()

1. Introduction

In this lab we will illustrate naive Bayes using a very simple and very small example. We will calculate conditional probabilities by hand and also using an R function. This will give us confidence in that function when we later apply it to much bigger and more complicated data.

2. Create example data

Let’s create a data frame with example data. There are two variables: class, which identifies an email message as ham or spam, and viagra, which identifies whether or not the word “viagra” is contained in the message.

train <- data.frame(class=c("spam","ham","ham","ham"), 
                    viagra=c("yes","no","no","yes"))
train
##   class viagra
## 1  spam    yes
## 2   ham     no
## 3   ham     no
## 4   ham    yes

3. Calculate conditional probabilities by hand

Suppose we see a message that has the word “viagra” in it. What is the probability that it is spam? Using Bayes formula \(P(spam|viagra=yes)=\frac{P(viagra=yes|spam)P(spam)}{P(viagra=yes)}\). To get \(P(viagra=yes|spam)\) we look at the fraction of spam messages that have “viagra” in them. In this case we have only one spam message and that message has “viagra” in it, so \(P(viagra=yes|spam)=1\). The probability of spam is \(P(spam)=\tfrac{1}{4}\) since one in four messages is spam. The probability of a message with “viagra” in it is \(P(viagra=yes)=\tfrac{1}{2}\) since two of the four messages have the word “viagra” in them. So, \(P(spam|viagra=yes)=\frac{1 \cdot \tfrac{1}{4}}{\tfrac{1}{2}}=\frac{1}{2}\). This makes sense since of the two messages with “viagra”, one is spam and one is ham.
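
We can check this arithmetic directly in R by counting rows of the training data. Below is a minimal sketch that uses only the train data frame created above (no model yet):

p_viagra_given_spam <- mean(train$viagra[train$class == "spam"] == "yes")  # P(viagra=yes|spam)
p_spam <- mean(train$class == "spam")                                      # P(spam)
p_viagra <- mean(train$viagra == "yes")                                    # P(viagra=yes)
p_viagra_given_spam * p_spam / p_viagra                                    # Bayes' formula
## [1] 0.5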

4. Estimating naive Bayes model

We will use the naiveBayes() function, which is part of the e1071 package. The function has two main arguments. The first is a formula that names the variable to predict and the list of predictors; the second is the data. In our case we predict class using the variable viagra. The function returns a ‘model’ object.

library(e1071)
classifier <- naiveBayes(class ~ viagra, train)
classifier
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##  ham spam 
## 0.75 0.25 
## 
## Conditional probabilities:
##       viagra
## Y             no       yes
##   ham  0.6666667 0.3333333
##   spam 0.0000000 1.0000000

The classifier object contains the prior probabilities of the predicted variable (class) and the conditional probabilities of each value of the predictor (viagra) given each value of the predicted variable. For example, \(P(viagra=no|ham)=0.67\) (two thirds of ham messages have no ‘viagra’ in them) and \(P(viagra=no|spam)=0\) (there are no spam messages without ‘viagra’). What we are interested in, however, is the probability of spam given a certain value of the predictor variable.
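
If you want these numbers as R objects rather than printed output, the fitted object is a list: its apriori component holds the class counts behind the printed priors, and its tables component holds the conditional probability tables (see ?naiveBayes). A quick sketch:

classifier$apriori        # class counts; the printed priors are these counts normalized
classifier$tables$viagra  # the P(viagra|class) table printed above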

5. Using estimated model to calculate conditional probability

We need to create a test data set that contains the values of the predictor variables we want to condition on. For example, let’s create a test data set in which the variable viagra takes on the value yes.

test <- data.frame(viagra=c("yes"))
test$viagra <- factor(test$viagra, levels=c("no","yes"))
test
##   viagra
## 1    yes

Our test data has just one variable with one observation. This is because we have only one predictor variable, viagra, and we want to predict class for a single value of that variable, viagra=yes. Notice that we needed to specify that the variable viagra is a factor that can take two values; without this the classifier would not interpret the test data correctly. Note also that the factor levels have to be specified in the same order as in the training data set. Normally, factor levels are sorted alphabetically.
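
One way to guard against a level mismatch is to check the levels explicitly. A small sketch follows; note that on R 4.0 and later data.frame() keeps strings as character vectors, so we convert the training column to a factor first (on older R versions this line changes nothing):

train$viagra <- factor(train$viagra, levels=c("no","yes"))
identical(levels(train$viagra), levels(test$viagra))  # the orders must match exactly
## [1] TRUE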

So now we are ready to feed our test data to our classifier object. For this we use the function predict(), which takes as arguments a ‘model’ object and test data. We also specify that we want the function to return ‘raw’ predictions, which in the case of a naive Bayes ‘model’ object means conditional probabilities.

prediction <- predict(classifier, test, type="raw")
prediction
##      ham spam
## [1,] 0.5  0.5

Fantastic, we got R to compute the probability of spam given that a message includes the word “viagra”: \(P(spam|viagra=yes)=0.5\). Reassuringly, this is the same as what we computed by hand.

6. Doing the same with two predictors (what makes Naive Bayes naive)

Let’s add a variable meet to our spam data. This variable indicates whether or not the word “meet” appears in the message. (Note that in this example we call the class variable type rather than class.)

train <- data.frame(type=c("spam","ham","ham","ham"), 
                    viagra=c("yes","no","no","yes"),
                    meet=c("yes","yes","yes", "no"))
train
##   type viagra meet
## 1 spam    yes  yes
## 2  ham     no  yes
## 3  ham     no  yes
## 4  ham    yes   no

7. Calculate conditional probabilities by hand

Suppose we see a message that has both the word “viagra” and the word “meet” in it. What is the probability that it is spam? Using Bayes formula \(P(spam|viagra=yes, meet=yes)=\frac{P(viagra=yes, meet=yes|spam)P(spam)}{P(viagra=yes,meet=yes)}\). To get \(P(viagra=yes, meet=yes|spam)\) we look at the fraction of spam messages that have both “viagra” and “meet” in them. In this case we have only one spam message and that message has both “viagra” and “meet” in it, so \(P(viagra=yes, meet=yes|spam)=1\). The probability of spam is \(P(spam)=\tfrac{1}{4}\) since one in four messages is spam. The probability of a message with both “viagra” and “meet” in it is \(P(viagra=yes, meet=yes)=\tfrac{1}{4}\). So, \(P(spam|viagra=yes, meet=yes)=\frac{1 \cdot \tfrac{1}{4}}{\tfrac{1}{4}}=1\). This makes sense since there is only one message that has both “viagra” and “meet” in it, and that message is spam.
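
As before, we can verify this arithmetic in R by counting rows; a minimal sketch using the train data frame above:

both <- train$viagra == "yes" & train$meet == "yes"    # messages containing both words
p_both_given_spam <- mean(both[train$type == "spam"])  # P(viagra=yes, meet=yes|spam)
p_spam <- mean(train$type == "spam")                   # P(spam)
p_both <- mean(both)                                   # P(viagra=yes, meet=yes)
p_both_given_spam * p_spam / p_both                    # Bayes' formula
## [1] 1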

8. Estimating naive Bayes model

We now have two predictors of type: viagra and meet.

library(e1071)
classifier <- naiveBayes(type ~ viagra + meet, train)
classifier
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##  ham spam 
## 0.75 0.25 
## 
## Conditional probabilities:
##       viagra
## Y             no       yes
##   ham  0.6666667 0.3333333
##   spam 0.0000000 1.0000000
## 
##       meet
## Y             no       yes
##   ham  0.3333333 0.6666667
##   spam 0.0000000 1.0000000

9. Using estimated model to calculate conditional probability

Our test data set has two variables, viagra and meet, both taking on the value yes.

test <- data.frame(viagra=c("yes"), meet=c("yes"))
test$viagra <- factor(test$viagra, levels=c("no","yes"))
test$meet <- factor(test$meet, levels=c("no","yes"))
test
##   viagra meet
## 1    yes  yes

We are ready to feed our test data to our classifier object.

prediction <- predict(classifier, test, type="raw")
prediction
##      ham spam
## [1,] 0.4  0.6

The naive Bayes algorithm says that, given that a message contains the words “viagra” and “meet”, the probability of it being spam is 0.6. This is DIFFERENT from the true conditional probability of 1 that we calculated by hand above. Why?

10. What makes naive Bayes naive?

The naive Bayes algorithm makes the assumption that the predictors are independent conditional on the class. For example, in our case it assumes that the probability that a message contains “viagra” given that it is spam does not depend on whether or not the message contains “meet”, i.e. \(P(viagra=yes, meet=yes|spam) = P(viagra=yes|spam) \cdot P(meet=yes|spam)\). This assumption greatly simplifies the numerator in the conditional probability formula. Ignoring the denominator for a moment we have \(P(spam|viagra=yes, meet=yes) \propto P(viagra=yes|spam) \cdot P(meet=yes|spam) \cdot P(spam)= 1 \cdot 1 \cdot \tfrac{1}{4}=\tfrac{1}{4}\). Similarly, \(P(ham|viagra=yes, meet=yes) \propto P(viagra=yes|ham) \cdot P(meet=yes|ham) \cdot P(ham)=\tfrac{1}{3} \cdot \tfrac{2}{3} \cdot \tfrac{3}{4} =\tfrac{1}{6}\). These numbers are only proportional to the probabilities; to get the actual probabilities we scale each number by the sum of the two, giving \(P(spam|viagra=yes, meet=yes)=\frac{1/4}{1/4+1/6}=\tfrac{3}{5}=0.6\) and \(P(ham|viagra=yes, meet=yes)=\frac{1/6}{1/4+1/6}=\tfrac{2}{5}=0.4\), which is exactly what the computer calculated.
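
To make the mechanics concrete, here is a minimal sketch that reproduces this arithmetic from the hand-counted quantities (no model object involved):

score_spam <- 1   * 1   * 1/4  # P(viagra=yes|spam) * P(meet=yes|spam) * P(spam)
score_ham  <- 1/3 * 2/3 * 3/4  # P(viagra=yes|ham)  * P(meet=yes|ham)  * P(ham)
c(ham = score_ham, spam = score_spam) / (score_ham + score_spam)  # normalize to sum to one
##  ham spam 
##  0.4  0.6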

Intuitively, naive Bayes understates how strongly ‘meet’ and ‘viagra’ go together in spam messages. The true conditional probability is 1, but naive Bayes gives us only 0.6.


Exercises

  1. Suppose you have a database on four customers. You know their income and whether or not they bought your product. Create a data frame with this data.
##   buy income
## 1 yes   high
## 2  no   high
## 3  no medium
## 4 yes    low
      1. Using Bayes rule, calculate the probability that a customer will buy your product given that he or she has a high income.

      2. Estimate a naive Bayes model using your data above. What is the prior probability of someone buying your product? What is the probability that a customer has a high income given that he or she bought your product?

      3. Using the model you estimated above, predict the probability of buying given that a customer has a high income. Is your result the same as the one you calculated by hand in question 1?

  2. Suppose you have a database on four customers. You know their gender, income and whether or not they bought your product. Create a data frame with this data.

##   buy income gender
## 1 yes   high   male
## 2  no   high female
## 3  no medium female
## 4 yes    low   male
      1. Using Bayes rule, calculate the probability that a customer will buy your product given that he has a high income and is male.

      2. Estimate a naive Bayes model using your data above. What is the prior probability of someone buying your product? What is the probability that a customer has a high income given that he bought your product? What is the probability that a customer is male given that he bought your product?

      3. Using the model you estimated above, predict the probability of buying given that a customer has a high income and is male. Is your result the same as the one you calculated by hand in question 1?