- calculate conditional probabilities by hand
- estimate a naive Bayes model using `naiveBayes()`
- calculate conditional probabilities using the naive Bayes model and `predict()`

In this lab we will illustrate naive Bayes using a very simple and very small data example. We will calculate conditional probabilities by hand and also using an `R` function. This will give us confidence in that function when we apply it to larger and more complicated data later on.

Let’s create a data frame with example data. There are two variables: `class`, which identifies an email message as ham or spam, and `viagra`, which identifies whether or not the word “viagra” is contained in the message.

```
train <- data.frame(class=c("spam","ham","ham","ham"),
viagra=c("yes","no","no","yes"))
train
```

```
## class viagra
## 1 spam yes
## 2 ham no
## 3 ham no
## 4 ham yes
```

Suppose we see a message that has the word “viagra” in it. What is the probability that it is spam? Using Bayes’ formula, \(P(spam|viagra=yes)=\frac{P(viagra=yes|spam)P(spam)}{P(viagra=yes)}\). To get \(P(viagra=yes|spam)\) we look at the fraction of spam messages that contain “viagra”. In this case we have only one spam message and that message has “viagra” in it, so \(P(viagra=yes|spam)=1\). The probability of spam is \(P(spam)=\tfrac{1}{4}\) since one in four messages is spam. The probability of a message with “viagra” in it is \(P(viagra=yes)=\tfrac{1}{2}\) since two of the four messages contain the word. So, \(P(spam|viagra=yes)=\frac{1 \cdot \tfrac{1}{4}}{\tfrac{1}{2}}=\frac{1}{2}\). This makes sense since, of the two messages with “viagra”, one is spam and one is ham.
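The same hand calculation can be sketched in base `R` by counting rows. The variable names `p_viagra_given_spam`, `p_spam`, `p_viagra` and `p_spam_given_viagra` are introduced here for illustration only; the data frame repeats the example data so the block runs on its own.

```r
# Example data, repeated so that this block is self-contained
train <- data.frame(class  = c("spam", "ham", "ham", "ham"),
                    viagra = c("yes", "no", "no", "yes"))

# P(viagra = yes | spam): share of spam messages that contain "viagra"
p_viagra_given_spam <- mean(train$viagra[train$class == "spam"] == "yes")

# P(spam): share of all messages that are spam
p_spam <- mean(train$class == "spam")

# P(viagra = yes): share of all messages that contain "viagra"
p_viagra <- mean(train$viagra == "yes")

# Bayes' formula
p_spam_given_viagra <- p_viagra_given_spam * p_spam / p_viagra
p_spam_given_viagra
```

Taking the `mean()` of a logical vector gives the share of `TRUE` values, which is why each probability reduces to a single line.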

We will use the `naiveBayes()` function, which is part of the `e1071` package. The function has two main arguments. The first is a formula that names the variable to predict and lists the predictors. The second is the data. In our case we predict `class` using the variable `viagra`. The function returns a ‘model’ object.

```
library(e1071)
classifier <- naiveBayes(class ~ viagra,train)
classifier
```

```
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## ham spam
## 0.75 0.25
##
## Conditional probabilities:
## viagra
## Y no yes
## ham 0.6666667 0.3333333
## spam 0.0000000 1.0000000
```

The classifier object contains the prior probabilities of the classes (\(P(ham)=0.75\), \(P(spam)=0.25\)) and the conditional probabilities of the different values of the predictor (`viagra`) given the different values of the predicted variable (`class`). For example, \(P(viagra=no|ham)=0.67\) (two thirds of ham messages have no ‘viagra’ in them) and \(P(viagra=no|spam)=0\) (there are no spam messages without ‘viagra’). What we are interested in, however, is the probability of spam given a certain value of the predictor variable.
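As a cross-check, the two tables in the printed output can be reproduced with base `R` alone. This is only a sketch of the same counting, not a description of how `naiveBayes()` works internally; the names `priors` and `cond_viagra` are introduced here for illustration.

```r
# Example data, repeated so that this block is self-contained
train <- data.frame(class  = c("spam", "ham", "ham", "ham"),
                    viagra = c("yes", "no", "no", "yes"))

# Prior class probabilities: P(ham) and P(spam)
priors <- prop.table(table(train$class))
priors

# Conditional probabilities P(viagra | class);
# margin = 1 normalizes within each row, so every row sums to 1
cond_viagra <- prop.table(table(train$class, train$viagra), margin = 1)
cond_viagra
```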

We need to create a test data set that contains those values of our predictor variables. For example, let’s create a test data set in which the variable `viagra` takes on the value `yes`.

```
test <- data.frame(viagra=c("yes"))
test$viagra <- factor(test$viagra, levels=c("no","yes"))
test
```

```
## viagra
## 1 yes
```

Our test data has just one variable with one observation. This is because we only have one predictor variable, `viagra`, and we want to predict `class` given only one value of that variable, `viagra=yes`. Notice that we needed to specify that the variable `viagra` is a factor that can take two values. Without this the classifier would not interpret the test data correctly. Note that the factor levels have to be specified in the same order as in the training data set. By default, `R` sorts factor levels alphabetically.
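The effect of the `levels` argument can be seen directly in the integer codes that `R` stores for a factor. The sketch below uses only base `R`; the names `train_levels`, `code_default` and `code_explicit` are introduced here for illustration. It is offered as a plausible explanation of why a mismatch in levels can mislead a function that relies on those codes, not as a statement about the package's internals.

```r
# Levels of the training predictor are sorted alphabetically by default
train_levels <- levels(factor(c("yes", "no", "no", "yes")))
train_levels

# Without explicit levels the only level is "yes", so its integer code is 1
code_default <- as.integer(factor("yes"))
code_default

# With the training levels supplied, "yes" is the second level, so its code is 2
code_explicit <- as.integer(factor("yes", levels = c("no", "yes")))
code_explicit
```

With the default levels, a lone `yes` gets the same code that `no` has in the training data, which is why the levels are set explicitly in the test data.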

Now we are ready to feed our data to our classifier object. For this we use the function `predict()`, which takes as arguments a ‘model’ object and test data. We also specify that we want the function to return ‘raw’ predictions, which in the case of a naive Bayes ‘model’ object means conditional probabilities.

```
prediction <- predict(classifier, test ,type="raw")
prediction
```

```
## ham spam
## [1,] 0.5 0.5
```

Fantastic, we got `R` to compute the probability of spam given that a message includes the word “viagra”: \(P(spam|viagra=yes)=0.5\). Reassuringly, this is the same as what we computed by hand.

Let’s add the variable `meet` to our spam data. This variable indicates whether or not the word “meet” appears in the message. Note that the column identifying ham and spam is now named `type` rather than `class`.

```
train <- data.frame(type=c("spam","ham","ham","ham"),
viagra=c("yes","no","no","yes"),
meet=c("yes","yes","yes", "no"))
train
```

```
## type viagra meet
## 1 spam yes yes
## 2 ham no yes
## 3 ham no yes
## 4 ham yes no
```

Suppose we see a message that has the word “viagra” and the word “meet” in it. What is the probability that it is spam? Using Bayes’ formula, \(P(spam|viagra=yes, meet=yes)=\frac{P(viagra=yes, meet=yes|spam)P(spam)}{P(viagra=yes,meet=yes)}\). To get \(P(viagra=yes, meet=yes|spam)\) we look at the fraction of spam messages that contain both “viagra” and “meet”. In this case we have only one spam message and that message has both “viagra” and “meet” in it, so \(P(viagra=yes, meet=yes|spam)=1\). The probability of spam is \(P(spam)=\tfrac{1}{4}\) since one in four messages is spam. The probability of a message with both “viagra” and “meet” in it is \(P(viagra=yes, meet=yes)=\tfrac{1}{4}\) since only one of the four messages contains both words. So, \(P(spam|viagra=yes, meet=yes)=\frac{1 \cdot \tfrac{1}{4}}{\tfrac{1}{4}}=1\). This makes sense since there is only one message that has both “viagra” and “meet”, and that message is spam.
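The two-word calculation can be sketched in base `R` in the same way as before, this time counting the rows in which both words appear. The names `both`, `p_both_given_spam`, `p_spam`, `p_both` and `p_spam_given_both` are introduced here for illustration.

```r
# Example data, repeated so that this block is self-contained
train <- data.frame(type   = c("spam", "ham", "ham", "ham"),
                    viagra = c("yes", "no", "no", "yes"),
                    meet   = c("yes", "yes", "yes", "no"))

# TRUE for messages that contain both "viagra" and "meet"
both <- train$viagra == "yes" & train$meet == "yes"

# P(viagra = yes, meet = yes | spam)
p_both_given_spam <- mean(both[train$type == "spam"])

# P(spam) and P(viagra = yes, meet = yes)
p_spam <- mean(train$type == "spam")
p_both <- mean(both)

# Bayes' formula with the joint probability of both words
p_spam_given_both <- p_both_given_spam * p_spam / p_both
p_spam_given_both
```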

We now have two predictors of `type`: `viagra` and `meet`.

```
library(e1071)
classifier <- naiveBayes(type ~ viagra + meet,train)
classifier
```

```
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## ham spam
## 0.75 0.25
##
## Conditional probabilities:
## viagra
## Y no yes
## ham 0.6666667 0.3333333
## spam 0.0000000 1.0000000
##
## meet
## Y no yes
## ham 0.3333333 0.6666667
## spam 0.0000000 1.0000000
```

Our test data set has two variables, `viagra` and `meet`, with both variables taking on the value `yes`.

```
test <- data.frame(viagra=c("yes"), meet=c("yes"))
test$viagra <- factor(test$viagra, levels=c("no","yes"))
test$meet <- factor(test$meet, levels=c("no","yes"))
test
```

```
## viagra meet
## 1 yes yes
```

We are ready to feed our test data to our classifier object.

```
prediction <- predict(classifier, test ,type="raw")
prediction
```

```
## ham spam
## [1,] 0.4 0.6
```

The naive Bayes algorithm says that, given that the message contains the words “viagra” and “meet”, the probability of it being spam is 0.6. This is DIFFERENT from the true conditional probability of 1 that we calculated by hand above. Why?

The naive Bayes algorithm makes the assumption that the predictors are independent of each other within each class. For example, in our case it assumes that the probability that a message contains “viagra” given that it is spam is independent of whether or not the message contains “meet”, i.e. \(P(viagra=yes, meet=yes|spam) = P(viagra=yes|spam) \cdot P(meet=yes|spam)\). This assumption greatly simplifies the numerator in the conditional probability formula. Ignoring the denominator for a moment we have: \(P(spam|viagra=yes, meet=yes) \propto P(viagra=yes|spam) \cdot P(meet=yes|spam) \cdot P(spam)= 1 \cdot 1 \cdot 1/4=1/4\). Similarly, \(P(ham|viagra=yes, meet=yes) \propto P(viagra=yes|ham) \cdot P(meet=yes|ham) \cdot P(ham)=1/3 \cdot 2/3 \cdot 3/4 =1/6\). These numbers are only proportional to the probabilities. To get the actual probabilities we divide each number by the sum of the two, getting \(P(spam|viagra=yes, meet=yes)=\frac{1/4}{1/4+1/6}=3/5=0.6\) and \(P(ham|viagra=yes, meet=yes)=\frac{1/6}{1/4+1/6}=2/5=0.4\), which is what the computer calculated.
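The naive Bayes arithmetic above can be sketched by hand in base `R`. This follows the steps in the text rather than the package's implementation; the names `spam`, `ham`, `score_spam`, `score_ham` and `posterior` are introduced here for illustration.

```r
# Example data, repeated so that this block is self-contained
train <- data.frame(type   = c("spam", "ham", "ham", "ham"),
                    viagra = c("yes", "no", "no", "yes"),
                    meet   = c("yes", "yes", "yes", "no"))

spam <- train[train$type == "spam", ]
ham  <- train[train$type == "ham", ]

# Unnormalized scores: P(viagra = yes | class) * P(meet = yes | class) * P(class)
score_spam <- mean(spam$viagra == "yes") * mean(spam$meet == "yes") *
  mean(train$type == "spam")
score_ham  <- mean(ham$viagra == "yes") * mean(ham$meet == "yes") *
  mean(train$type == "ham")

# Divide by the sum of the scores so that the two probabilities add up to 1
posterior <- c(ham = score_ham, spam = score_spam) / (score_ham + score_spam)
posterior
```

The normalized values match the 0.4 and 0.6 returned by `predict()`.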

Intuitively, in our case naive Bayes understates the fact that ‘meet’ and ‘viagra’ go together in spam messages. The true probability is 1 but naive Bayes only gives us 0.6.
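One way to see where the gap comes from is to compare, within the ham messages, the observed share of messages containing both words with the share implied by the independence assumption. This is a sketch on the same toy data; the names `ham`, `p_joint_ham` and `p_naive_ham` are introduced here for illustration.

```r
# Example data, repeated so that this block is self-contained
train <- data.frame(type   = c("spam", "ham", "ham", "ham"),
                    viagra = c("yes", "no", "no", "yes"),
                    meet   = c("yes", "yes", "yes", "no"))

ham <- train[train$type == "ham", ]

# Observed share of ham messages that contain both words
p_joint_ham <- mean(ham$viagra == "yes" & ham$meet == "yes")
p_joint_ham

# Share implied by independence: P(viagra = yes | ham) * P(meet = yes | ham)
p_naive_ham <- mean(ham$viagra == "yes") * mean(ham$meet == "yes")
p_naive_ham
```

No ham message in the data contains both words, yet the independence assumption assigns that combination a positive probability under ham, and this is what pulls the naive Bayes estimate for spam below 1.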

- Suppose you have a database on four customers. You know their income and whether or not they bought your product. Create a data frame with this data.

```
## buy income
## 1 yes high
## 2 no high
## 3 no medium
## 4 yes low
```

1. Using Bayes’ rule, calculate the probability that a customer will buy your product given that he or she has a high income.

2. Estimate a naive Bayes model using your data above. What is the prior probability of someone buying your product? What is the probability that a customer has a high income given that he or she bought your product?

3. Using the model you estimated above, predict the probability of buying given that a customer has a high income. Is your result the same as the one you calculated by hand in question 1?

- Suppose you have a database on four customers. You know their gender, income and whether or not they bought your product. Create a data frame with this data.

```
## buy income gender
## 1 yes high male
## 2 no high female
## 3 no medium female
## 4 yes low male
```

1. Using Bayes’ rule, calculate the probability that a customer will buy your product given that the customer has a high income and is male.

2. Estimate a naive Bayes model using your data above. What is the prior probability of someone buying your product? What is the probability that a customer has a high income given that he or she bought your product? What is the probability that a customer is male given that the customer bought your product?

3. Using the model you estimated above, predict the probability of buying given that a customer has a high income and is male. Is your result the same as the one you calculated by hand in question 1?