Math 301 Lab 1

General Instructions

Each lab in this course will have multiple components. First, there will be a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in by the date indicated.

If you are comfortable doing so, I strongly suggest using RMarkdown to type your lab write-up. However, if you are new to R, you may complete your write-up in LaTeX. All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.

Lab Overview

In this lab, you will use Bayes’s Theorem to build a spam filter for SMS text messages. This lab uses a data set (hosted at the UCI Machine Learning Repository) of 5,574 SMS messages in English collected from multiple sources. None of these sources are from the United States, so slang and spellings may be unfamiliar. The general workflow of this lab is inspired by an RPubs project done by Jason Chan. The data has been preprocessed by:

removing all non-letter characters (e.g., numbers, punctuation), except for spaces
making all letters lower case
removing extra spaces
removing the most frequent words (called stop words, e.g., “the”, “a”, “is”)

and formatted in a way that is appropriate for the analysis you will do.
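To make the preprocessing steps above concrete, here is a minimal sketch in base R. It is not the code that was used to prepare the lab data: the function name clean_message and the three-word stop-word list are assumptions made up for illustration.

```r
# Hypothetical stop-word list; the list actually used for the lab data is
# not given in this document.
stop_words <- c("the", "a", "is")

# clean_message is a made-up helper that mirrors the four steps listed above.
clean_message <- function(text) {
  text  <- gsub("[^A-Za-z ]", "", text)  # remove non-letter characters, keep spaces
  text  <- tolower(text)                 # make all letters lower case
  words <- strsplit(text, " +")[[1]]     # split on runs of spaces (drops extra spaces)
  words <- words[words != "" & !(words %in% stop_words)]  # remove stop words
  paste(words, collapse = " ")
}

cleaned <- clean_message("WIN a NEW iPhone!!  The offer is 100% free")
cleaned
```

Each line of the function corresponds to one item in the list, so you can run the steps one at a time on your own message to see what each one removes.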

R Tutorial

We need a little tutorial in manipulating data frames to compute probabilities. If you are comfortable with the dplyr package in R, feel free to skip this section.

Below is a fake data set that will motivate the filtering and counting tools we’ll use in this lab.

data

##    group grade count recheck
## 1      1     A     3     Yes
## 2      1     B     4     Yes
## 3      1     C     1     Yes
## 4      1     E     1      No
## 5      1     G     2      No
## 6      4     B     7     Yes
## 7      4     C    10      No
## 8      4     G    14     Yes
## 9      8     A     3      No
## 10     8     B     9      No
## 11     8     E     4     Yes
## 12     8     G    11     Yes
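If you would like to follow along on your own, the data frame printed above can be recreated with a sketch like the following. The values are copied from the printout; storing grade and recheck as character columns is an assumption, since the printout does not show the column types.

```r
# Recreate the fake data set shown above (values copied from the printout).
data <- data.frame(
  group   = c(1, 1, 1, 1, 1, 4, 4, 4, 8, 8, 8, 8),
  grade   = c("A", "B", "C", "E", "G", "B", "C", "G", "A", "B", "E", "G"),
  count   = c(3, 4, 1, 1, 2, 7, 10, 14, 3, 9, 4, 11),
  recheck = c("Yes", "Yes", "Yes", "No", "No", "Yes",
              "No", "Yes", "No", "No", "Yes", "Yes")
)
data
```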

Suppose we want only the entries in the data frame that correspond to grade B. We can use the filter function from the dplyr package to find these rows. We’ll save this filtered data set so we can use it later.

data_filtered <- filter(data, grade == "B")
data_filtered

##   group grade count recheck
## 1     1     B     4     Yes
## 2     4     B     7     Yes
## 3     8     B     9      No

The count column in this data frame represents how many objects have the given group and grade. For example, there are 7 objects in group 4 with a grade of “B”. Suppose we wanted to find the total number of objects with grade “B”. We can sum the count column of the filtered data frame to find this value.

sum(data_filtered$count)

## [1] 20

We can also filter a data frame more than once. For example, suppose we are now only interested in the objects with a grade of “B” that are also selected for rechecking. We can filter by the value of “Yes” in the recheck column.

data_filtered2 <- filter(data_filtered, recheck == "Yes")
data_filtered2

##   group grade count recheck
## 1     1     B     4     Yes
## 2     4     B     7     Yes

We could also do this in a single step by passing both conditions to the filter function, which keeps only the rows that satisfy all of them.

filter(data, grade == "B", recheck == "Yes")

##   group grade count recheck
## 1     1     B     4     Yes
## 2     4     B     7     Yes

Finally, we can quickly count the number of rows in a data frame using the nrow function.

nrow(data_filtered2)

## [1] 2
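The tools above are enough to turn counts into probabilities. The sketch below uses the same fake data set (recreated here so the block runs on its own) to estimate the probability that an object has grade “B”, and the probability that an object is selected for rechecking given that it has grade “B”. The variable names are assumptions chosen for this example.

```r
library(dplyr)  # provides filter()

# Fake data set from the tutorial (values copied from the printout above).
data <- data.frame(
  group   = c(1, 1, 1, 1, 1, 4, 4, 4, 8, 8, 8, 8),
  grade   = c("A", "B", "C", "E", "G", "B", "C", "G", "A", "B", "E", "G"),
  count   = c(3, 4, 1, 1, 2, 7, 10, 14, 3, 9, 4, 11),
  recheck = c("Yes", "Yes", "Yes", "No", "No", "Yes",
              "No", "Yes", "No", "No", "Yes", "Yes")
)

total   <- sum(data$count)              # total number of objects
grade_b <- filter(data, grade == "B")   # rows with grade B
p_b     <- sum(grade_b$count) / total   # P(grade B)

# Among grade B objects only, the fraction selected for rechecking.
b_recheck         <- filter(grade_b, recheck == "Yes")
p_recheck_given_b <- sum(b_recheck$count) / sum(grade_b$count)

p_b
p_recheck_given_b
```

Notice that the conditional probability divides by the number of grade “B” objects, not by the total number of objects. The same pattern of filtering and then dividing applies to the message data in this lab.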

Background and Theory

We want to know, based on the words included in a text message, if our phone’s software should block that message as spam or not (we will, from now on, refer to non-spam as “ham”). In particular, we want to know the probability that a message is spam given its contents. For now, let’s say that the message gets classified as spam if that probability is larger than 0.5.

As a toy example, consider the word “win”. Let \(A_{win}\) be the event that a message contains the word “win”, let \(B_{spam}\) be the event that a message is spam, and let \(B_{ham}\) be the event that a message is ham.

(\(\star\)) Explain how Bayes’s Theorem can be used to rewrite the probability of a message being spam given that it contains the word “win”. Compute this probability in your R script and report its value here.

Of course, most text messages contain more than one word. Say we get a text message that says “win new iphone”. What is the probability that this message is spam? We will let \(A_{new}\) be the event that a message contains the word “new”, and \(A_{iphone}\) be the event that a message contains the word “iphone”. So, we can interpret our desired probability as \[P(B_{spam} \,\vert\, A_{win}\cap A_{new} \cap A_{iphone})\].

We will use a method called naive Bayes classification to compute probabilities like this one, where a message contains several words. The term “naive” comes from the assumption we will make that all events of the form \(A_{word1}\) and \(A_{word2}\) are independent.

(\(\star\)) With this independence assumption, how can you use Bayes’s Theorem and independence to compute \(P(B_{spam} \,\vert\, A_{win}\cap A_{new} \cap A_{iphone})\)?

Model

It would be (very) tedious to do all of these computations by hand, particularly when we have large numbers of words to deal with. Fortunately, the package e1071 includes a function called naiveBayes that does all of the dirty work for you. Even better, you can use the output of this function to take a new text message and predict if it is spam or ham.
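To give a feel for how naiveBayes is called, here is a minimal sketch on a tiny made-up data set. It is not the code from your R script: the column names win, new, and type, the six training messages, and the choice laplace = 1 (which keeps estimated probabilities away from zero) are all assumptions for illustration.

```r
library(e1071)  # provides naiveBayes()

# Hypothetical training set: one row per message, one column per word
# recording whether the word appears. These values are made up.
word_levels <- c("No", "Yes")
train <- data.frame(
  win  = factor(c("Yes", "Yes", "No", "No", "Yes", "No"), levels = word_levels),
  new  = factor(c("Yes", "No", "No", "Yes", "Yes", "No"), levels = word_levels),
  type = factor(c("spam", "spam", "ham", "ham", "spam", "ham"))
)

# Fit the model: estimate P(word | class) and P(class) from the training set.
model <- naiveBayes(type ~ win + new, data = train, laplace = 1)

# Two hypothetical new messages: one contains both words, one contains neither.
new_messages <- data.frame(
  win = factor(c("Yes", "No"), levels = word_levels),
  new = factor(c("Yes", "No"), levels = word_levels)
)

predicted <- predict(model, new_messages)              # predicted classes
posterior <- predict(model, new_messages, type = "raw") # class probabilities
predicted
posterior
```

With type = "raw", predict returns one row per message and one column per class, so you can see how close each message is to the 0.5 cutoff described earlier.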

The data set in this lab was split into two chunks, a training set and a test set. The training set was used to construct the model (i.e., give estimates for all of the probabilities we use when computing Bayes’s Theorem with our independence assumption). We can then use the test set to determine how well our model did by matching the predicted classes (spam or ham) with the actual classes.

In your R script, run the code to build the model, predict the classes in the test set, and compare to the actual classes. All this code is written for you, and you do not need to alter anything to run it.
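As a rough illustration of what comparing predicted and actual classes can look like, the sketch below tabulates two short vectors of labels. The labels are made up for this example and are not output from the lab model; the names actual, predicted, confusion, and accuracy are assumptions.

```r
# Hypothetical actual and predicted labels for eight test messages (made up).
actual    <- factor(c("ham", "ham", "spam", "ham", "spam", "ham", "spam", "ham"))
predicted <- factor(c("ham", "ham", "spam", "spam", "spam", "ham", "ham", "ham"))

# Cross-tabulate predictions (rows) against the truth (columns).
confusion <- table(predicted, actual)
confusion

# Overall accuracy: the fraction of messages whose prediction matches the truth.
accuracy <- mean(predicted == actual)
accuracy
```

The off-diagonal entries of the table separate the two kinds of mistakes: ham that was flagged as spam, and spam that slipped through as ham.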

Now, make your own example of a text message using the template in your R script. Compute the predicted class of your example. Do this for at least three examples, recording your results (your message along with the predicted class) for your write-up.

(\(\star\)) Discuss the overall behavior of the model in terms of how well it classified messages. What are some reasons you can think of for why we don’t have perfect accuracy in this model? How well do you think the model did with the examples you came up with?
(\(\star\)) Discuss the reasonableness of the independence assumptions we made. Do you think the events \(A_{word1}\) and \(A_{word2}\) are truly independent? Why or why not? What is the impact of this assumption on the model?