Author: Daniel R. Brown, Jr.

Discussion

Defining Naïve Bayes

The Bayes Theorem

To understand the Naïve Bayes algorithm, one must first understand the Bayes Theorem, which states \[ P\left(A\vert B\right) = \frac{P\left(B\vert A\right) P\left( A \right )}{P\left(B\right)} \] \(P\left(A\vert B\right)\) is known as the conditional probability or the posterior probability (as in, looking back one can see it more clearly), and is read as the probability of A given that B has occurred. If one considers basic probability to be the ratio of the number of times something happens to the number of times it could happen, then conditional probability is the ratio of the number of times A and B happen together to the number of times B happens regardless of A.

Along with the posterior probability, there is the prior probability, or \(P\left(A\right)\). This is an educated guess, made before seeing any evidence, as to how likely something is to have a certain attribute. Lantz, for example, mentions how twenty percent of the emails received in the text’s scenario were spam emails. Next we have the likelihood, \(P\left(B\vert A\right)\). Similar to our prior probability, the likelihood is estimated from previous results: it measures how often B was observed in the cases where A occurred. Finally, our marginal likelihood, \(P\left(B\right)\), is the probability that event B occurred at all.

Lantz puts it into simple terms best, saying “…if we know event B occurred, the probability of event A is higher the more often that A and B occur together each time B is observed”. (Lantz 2015)
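As a quick numeric check of the theorem, here is a minimal sketch in R. Only the twenty percent prior comes from the discussion above; the two likelihood values and the variable names are made-up assumptions chosen purely for illustration.

```r
# Hypothetical inputs: only the 20% prior is taken from the text above.
p_spam <- 0.20              # prior, P(A): a message is spam
p_word_given_spam <- 0.30   # likelihood, P(B|A): assumed value
p_word_given_ham  <- 0.02   # P(B|not A): assumed value

# Marginal likelihood, P(B), via the law of total probability
p_word <- p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior, P(A|B), from the Bayes Theorem
p_spam_given_word <- p_word_given_spam * p_spam / p_word
round(p_spam_given_word, 3)
```

With these assumed numbers, observing the word raises the probability of spam from the 20 percent prior to roughly 79 percent.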

What is Naïve Bayes?

Naïve Bayes is an interesting application of the Bayes theorem to the problem of classification. Many machine learning methods use Bayesian probability, but Naïve Bayes is the most common.

Why Naïve?

The algorithm is called naïve because it assumes that 1.) each feature is independent of the other features within a class, and 2.) each feature in the dataset is as important as any other feature (often neither is the case). Fortunately, Naïve Bayes still tends to produce strong results even when these assumptions are violated.
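To see what the independence assumption buys us, the sketch below scores a message that contains two words. All of the numbers and the word names are hypothetical; the point is only that, under the naïve assumption, the joint likelihood is the product of the per-word likelihoods.

```r
# Hypothetical per-class likelihoods, P(word present | class)
likelihood <- rbind(
  spam = c(free = 0.30, meeting = 0.01),
  ham  = c(free = 0.02, meeting = 0.10)
)
prior <- c(spam = 0.20, ham = 0.80)   # assumed class priors

# For a message containing both words, multiply the per-word likelihoods
# within each class, then weight by the class prior
score <- prior * apply(likelihood, 1, prod)

# Normalise so the class scores sum to 1
posterior <- score / sum(score)
round(posterior, 3)
```

Without the assumption we would need an estimate for every combination of words, which quickly becomes impractical as the vocabulary grows.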

There are some distinct advantages for using Naïve Bayes to classify data:

  1. Algorithm is simple, fast, and effective
  2. Algorithm can handle missing or disorganized datasets relatively well
  3. Algorithm requires fewer training examples than many other methods
  4. Algorithm can easily obtain the estimated probability for a prediction.

The method also has some disadvantages:

  1. Algorithm relies on the above two assumptions:
    1. Equally important features
    2. Independent features
  2. Algorithm is not as well equipped to handle datasets with many numerical features.
  3. Estimated probabilities are less reliable than predicted classes.

Using the Naïve Bayes Algorithm

Lantz’s book utilizes a collection of SMS text messages that were previously classified as either spam (unwanted messages) or ham (desired messages). The dataset provided contains 5559 messages, each belonging to one of these two categories (Almeida and Hidalgo 2011). By looking at the data and checking for various patterns, we can estimate the probability that a message is spam or ham based on the words within the texts.

Naïve Bayes on sms data

Step one - collecting data

The data mentioned above has already been collected and as such we can move on to step two.

Step two - exploring and preparing the data

Let’s begin by taking a look at the dataset:

sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
str(sms_raw)
'data.frame':   5559 obs. of  2 variables:
 $ type: chr  "ham" "ham" "ham" "spam" ...
 $ text: chr  "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline "| __truncated__ ...

This shows us that we have two features - one containing the categorical data of either spam or ham called type and one called text, which has the actual text messages contained within. As the data in type is currently not represented as categorical data, we can set that manually:

sms_raw$type <- factor(sms_raw$type)
str(sms_raw$type) #Success!
 Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...

Cleaning and standardizing text data

We can now utilize the R package tm to clean the strings of text contained in the text field. tm is a text mining package developed by Ingo Feinerer and colleagues (Feinerer, Hornik, and Meyer 2008).

The tm package contains a function called VCorpus() which creates a corpus, or a “body” of text documents as a list object. The V is short for volatile, meaning it is stored into your computer’s RAM and is not permanent.

sms_corpus <- VCorpus(VectorSource(sms_raw$text))
typeof(sms_corpus) # Just to show that it is a list
[1] "list"
print(sms_corpus)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 5559

If we would like to see a specific document or sms from our corpus, we can inspect() individual messages from our new list sms_corpus. The book selected the first two documents from the list, but I think we can go a little classier than that and select two random documents using sample() and the pipe operator %>%, which is made available by the package dplyr (conveniently loaded in our tidyverse package; see note 1):

# Draw two random document indices, then inspect those documents
sample(length(sms_corpus), 2) %>%
  sms_corpus[.] %>%
  inspect()
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 70

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 24

Next, we will need to standardize the documents within our corpus to ensure that the computer can properly characterize them. This includes reducing character strings like Hello!, Hello, and hello to a single form and removing stop words like to, and, but, and or. Stop words are very common words that humans use constantly but that carry little information for a classifier. After that we will stem our documents. Stemming refers to the process where a word is stripped to its root. For example, a verb that has the suffixes ed, ing and s will have those suffixes removed, i.e. jump, jumped, jumping and jumps will all be standardized to jump. Note that irregular forms (like run and ran) may not be merged by this process. Stemming can be performed using the wordStem() function from the SnowballC package, demonstrated below:

wordStem(c("jump", "jumping", "jumped", "jumps"))
[1] "jump" "jump" "jump" "jump"

After performing our stemming, we will need to remove the whitespace or blank spaces that are left from the previous preprocessing steps using the function stripWhitespace().

sms_corpus_clean <- sms_corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(removePunctuation) %>%
  tm_map(stemDocument) %>%
  tm_map(stripWhitespace)

The text has us do all of these steps individually, but by using the tidyverse package and the pipe operator %>%, we are able to do this in one step without continuously modifying our sms_corpus_clean list every time. Now we can look at the change before processing and after processing.

cat("The text document prior processing:", "\n")
The text document prior processing: 
for(i in 1:3){
  print(as.character(sms_corpus[[i]]))
}
[1] "Hope you are having a good week. Just checking in"
[1] "K..give back my thanks."
[1] "Am also doing in cbe only. But have to pay."
cat("\n")
cat("The text document after processing:", "\n")
The text document after processing: 
for(i in 1:3){
  print(as.character(sms_corpus_clean[[i]]))
}
[1] "hope good week just check"
[1] "kgive back thank"
[1] "also cbe pay"

As a note, I had some difficulty using the code as.character(sms_corpus[1:3]) to print just the text from sms_corpus. R would output the text along with a bunch of metadata that we may not want to display.

Splitting text documents into words

The final step in our standardization journey is to split the messages into individual words while still maintaining the identity of the messages. This can be done through a process called tokenization, where each word in the text string is considered a token.

To do this, we can use another function of the tm package: DocumentTermMatrix(), which builds a document-term matrix or DTM. A DTM is a data structure provided by tm in which each text message is a row and each word is a column, creating an \(N\times M\) matrix. The package also contains functionality for the transpose of our \(N\times M\) matrix, which may be useful for performing analysis on fewer documents that are longer. This is called (surprise!) a Term Document Matrix or TDM. Both the DTM and the TDM can be considered sparse matrices, because there is a column (in a DTM) or a row (in a TDM) for every individual word contained in the corpus. The matrix is “sparse” because the majority of its cells are \(0\): any single message contains only a tiny fraction of all the words in the corpus.

sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
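To make the structure concrete, here is a small base R sketch that builds a document-term matrix by hand. The three toy documents and the names docs, vocab and toy_dtm are assumptions for illustration only; this is not how tm stores its matrices internally.

```r
# Three toy documents, already lower-cased and stripped of punctuation
docs <- c(d1 = "hope good week", d2 = "good week good pay", d3 = "pay back")

# Tokenize on single spaces and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document: one row per document,
# one column per term
toy_dtm <- t(vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(toy_dtm) <- vocab
toy_dtm

# Share of cells that are zero, i.e. the sparsity of the matrix
mean(toy_dtm == 0)
```

Even in this tiny example almost half of the cells are zero; with thousands of messages and thousands of terms the share of zeros approaches 100%, as the printed summaries of the real matrices show.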

For future reference, you can also have DocumentTermMatrix() perform all of the preprocessing from the previous steps itself by passing control options, as shown below:

sms_dtm_no_prep <- DocumentTermMatrix(
  sms_corpus,
  control = list(
    tolower = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE,
    removePunctuation = TRUE,
    stemming = TRUE
  )
)

Let’s compare the two matrices:

cat("Our matrix with preprocessing:", "\n")
Our matrix with preprocessing: 
sms_dtm
<<DocumentTermMatrix (documents: 5559, terms: 6559)>>
Non-/sparse entries: 42147/36419334
Sparsity           : 100%
Maximal term length: 40
Weighting          : term frequency (tf)
cat("\n")
cat("Our matrix without preprocessing:", "\n")
Our matrix without preprocessing: 
sms_dtm_no_prep
<<DocumentTermMatrix (documents: 5559, terms: 6961)>>
Non-/sparse entries: 43221/38652978
Sparsity           : 100%
Maximal term length: 40
Weighting          : term frequency (tf)

Creating training and test datasets

Now we are in the familiar territory of splitting our data into test and training sets. Because the data was stored randomly, we can simply take the first 75% of our entries as our training set and the remainder can be taken as our test set.

sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type
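The row numbers above follow from the 75% rule. A minimal sketch of that arithmetic, assuming only the message count reported by str() (the names n, n_train, train_idx and test_idx are illustrative):

```r
n <- 5559                     # number of messages in the dataset
n_train <- floor(0.75 * n)    # size of the training set

# The first 75% of row indices go to training, the rest to testing
train_idx <- seq_len(n_train)
test_idx <- setdiff(seq_len(n), train_idx)

c(train = length(train_idx), test = length(test_idx))
```

If the rows were not already in random order, one option would be to shuffle the indices first, for example with sample(n) after a call to set.seed(), and then split the shuffled vector in the same way.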

A proportion table will help show if our test and training sets are both representative of the whole dataset.

cat("Our training data")
Our training data
sms_train_labels %>%
  table %>%
  prop.table
.
      ham      spam 
0.8647158 0.1352842 
cat("\n")
cat("Our testing data")
Our testing data
sms_test_labels %>%
  table %>%
  prop.table
.
      ham      spam 
0.8683453 0.1316547 

Creating a Word Cloud visualization

A word cloud will allow us to see at a glance how frequently words occur within the corpus. Words that appear more frequently will be shown in a larger font size in the cloud, and words that appear less frequently in a smaller font size. This can be done via the wordcloud R package.

wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)

We can also create visualizations of frequency for our raw sms data, using the subset() function to split it on the categorical feature $type:

par(mfcol = c(1, 2))
spam <- sms_raw %>%
  subset(type == "spam")
spamCloud <- wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
ham <- sms_raw %>%
  subset(type == "ham")
hamCloud <- wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))

Creating indicator features for frequent words

To complete our data preprocessing, we must reduce the number of features in our test and training DT Matrices. To do this, we will use the findFreqTerms() function (again found in the tm package).

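# Pick the frequent terms from the training set only, then keep those same
# columns in both matrices so that train and test share one vocabulary
sms_freq_words <- sms_dtm_train %>%
  findFreqTerms(5)
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]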

Now we shall write a function that converts our sparse DT matrices from numeric to categorical “yes/no” matrices that our algorithm can process.

convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

Applying our convert_counts function:

sms_train <- sms_dtm_freq_train %>%
  apply(MARGIN = 2, convert_counts)
sms_test <- sms_dtm_freq_test %>%
  apply(MARGIN = 2, convert_counts)
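To show what this conversion does, the sketch below applies the same function to a small hand-made count matrix. The matrix toy_counts and its row and column names are hypothetical.

```r
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# Hypothetical term counts: two documents, three terms
toy_counts <- matrix(
  c(0, 2, 1, 0, 0, 3),
  nrow = 2,
  dimnames = list(c("doc1", "doc2"), c("call", "free", "win"))
)

# MARGIN = 2 applies the function column by column
toy_flags <- apply(toy_counts, MARGIN = 2, convert_counts)
toy_flags
```

The result is a character matrix of the same shape, in which a count of 2 or 3 is treated the same as a count of 1: only the presence of a word matters to the classifier.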

Step 3 - training a model on the data

Now comes the fairly straightforward step of training our model on the data and then using that classifier to make predictions on the test set. This requires the e1071 package, which provides the naiveBayes() function:

sms_classifier <- naiveBayes(sms_train, sms_train_labels)
sms_pred <- predict(sms_classifier, sms_test)

Step 4 - evaluating model performance

Now we can use the CrossTable() function from the gmodels package to see how our predictions fared.

CrossTable(sms_pred, sms_test_labels, prop.chisq = FALSE, chisq = FALSE, 
           prop.t = FALSE,
           dnn = c("Predicted", "Actual"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1390 

 
             | Actual 
   Predicted |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1201 |        30 |      1231 | 
             |     0.976 |     0.024 |     0.886 | 
             |     0.995 |     0.164 |           | 
-------------|-----------|-----------|-----------|
        spam |         6 |       153 |       159 | 
             |     0.038 |     0.962 |     0.114 | 
             |     0.005 |     0.836 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1207 |       183 |      1390 | 
             |     0.868 |     0.132 |           | 
-------------|-----------|-----------|-----------|

 

Our cross table shows us that our classifier has \(\approx 97.4\%\) accuracy in predicting whether a message is ham or spam. Can we do better?
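The accuracy figure can be recomputed from the four cell counts in the cross table. A minimal sketch, with the counts copied from the table above and the object names chosen for illustration:

```r
# Cell counts from the cross table (rows: predicted, columns: actual)
conf_mat <- matrix(
  c(1201, 6, 30, 153),
  nrow = 2,
  dimnames = list(Predicted = c("ham", "spam"), Actual = c("ham", "spam"))
)

# Accuracy: correctly classified messages over all messages
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
round(accuracy, 4)

# Share of actual spam that was caught (the 0.836 in the table)
spam_recall <- conf_mat["spam", "spam"] / sum(conf_mat[, "spam"])
round(spam_recall, 3)
```

Accuracy alone hides the asymmetry between the two kinds of error: 30 spam messages slip through as ham, while 6 legitimate messages are flagged as spam.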

Step 5 - Improving model performance

Let’s attempt to improve our model performance by setting the laplace argument of our classifier to 1:

sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_pred2, sms_test_labels, prop.chisq = FALSE, chisq = FALSE, 
           prop.t = FALSE,
           dnn = c("Predicted", "Actual"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1390 

 
             | Actual 
   Predicted |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1202 |        29 |      1231 | 
             |     0.976 |     0.024 |     0.886 | 
             |     0.996 |     0.158 |           | 
-------------|-----------|-----------|-----------|
        spam |         5 |       154 |       159 | 
             |     0.031 |     0.969 |     0.114 | 
             |     0.004 |     0.842 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1207 |       183 |      1390 | 
             |     0.868 |     0.132 |           | 
-------------|-----------|-----------|-----------|

 

We obtain a slightly better result of \(\approx 97.6\%\) accuracy (1356 of 1390 messages classified correctly)! But what did adding Laplace smoothing do for us?

Laplace Smoothing

What is smoothing? In a 1996 research paper, Chen and Goodman of Harvard University stated that “Smoothing is a technique essential in the construction of n-gram language models… [or] probability distributions over strings \(P\left(s\right)\) that attempts to reflect the frequency with which each string \(s\) occurs as a sentence in natural text.”(Chen and Goodman 1996) In a Naïve Bayes classifier, smoothing keeps a word that never appears alongside a given class in the training data from receiving a probability of exactly zero, because a single zero would wipe out the entire product of likelihoods for that class. For example, if the word “Supercalifragilisticexpialidocious” occurs in our text messages only once, in a spam message, should that one occurrence be enough to rule out ham for every future message that contains it? Smoothing avoids this. In the book Introduction to Information Retrieval, the authors have this to say about smoothing: “[T]he probability of words occurring once in the document is normally over-estimated, since their one occurrence was partly by chance… the smoothing of terms actually implements major parts of the term weighting component. It is not just that an unsmoothed model has conjunctive semantics [that is, a single unseen word forces the probability to zero]; an unsmoothed model works badly because it lacks parts of the term weighting component”(Manning, Raghavan, and Schütze 2008).

Laplace smoothing is one of the more common smoothing methods. It is also known as additive smoothing or Lidstone smoothing. A general Bayesian form of smoothing can be expressed as \[ \hat{P}\left(t\vert d\right) = \frac{tf_{t,d} + \alpha\hat{P}\left(t\vert M_{c}\right)}{L_{d}+\alpha} \] where \(tf_{t,d}\) is the raw term frequency of term \(t\) in document \(d\), \(L_{d}\) is the number of tokens in the document, \(M_{c}\) is a language model built from all documents, and \(\alpha\) is a pseudocount that sets how strongly the estimate is pulled toward the prior \(\hat{P}\left(t\vert M_{c}\right)\). When that prior is uniform over a vocabulary of \(V\) terms and \(\alpha = V\), this reduces to the familiar add-one estimate \(\left(tf_{t,d} + 1\right)/\left(L_{d} + V\right)\).

Long story short, our use of Laplace smoothing makes a great model slightly better because no single word that went unseen for a class during training can force that class’s probability to zero.
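The effect of the pseudocount is easy to see on a toy example. The sketch below applies add-one smoothing to hypothetical word counts. The counts, the class sizes and the variable names are all assumptions, and the formula is the generic additive estimate, not a claim about the exact internals of any package.

```r
# Hypothetical counts: messages in each class that contain each word
word_counts <- rbind(
  spam = c(free = 40, meeting = 0),
  ham  = c(free = 5,  meeting = 60)
)
class_totals <- c(spam = 100, ham = 400)   # assumed messages per class

laplace <- 1    # pseudocount added to every cell
n_levels <- 2   # each word feature has two values: "Yes" or "No"

# Without smoothing, "meeting" has probability 0 under spam
unsmoothed <- word_counts / class_totals

# With smoothing, every estimate is strictly positive
smoothed <- (word_counts + laplace) / (class_totals + laplace * n_levels)

round(unsmoothed, 3)
round(smoothed, 3)
```

In the unsmoothed table, any message containing the word meeting could never be scored as spam, no matter what else it said; after smoothing that word merely counts strongly against spam.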

Creating analysis on iris dataset

Again we will be taking a look at the iris dataset(Anderson 1936)

Collecting data

Data is already available in base R

iris.df <- iris

Exploring and preparing the data

random <- sample(nrow(iris.df), 112, replace = FALSE)
iris_test <- iris.df[-random, ]
iris_train <- iris.df[random, ]

Training a model on the data

iris_classifier <- naiveBayes(iris_train[ , -5], iris_train$Species)
iris_pred <- predict(iris_classifier, iris_test)

Evaluating model performance

CrossTable(iris_pred, iris_test$Species, prop.chisq = FALSE, chisq = FALSE, 
           prop.t = FALSE,
           dnn = c("Predicted", "Actual"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  38 

 
             | Actual 
   Predicted |     setosa | versicolor |  virginica |  Row Total | 
-------------|------------|------------|------------|------------|
      setosa |         11 |          0 |          0 |         11 | 
             |      1.000 |      0.000 |      0.000 |      0.289 | 
             |      1.000 |      0.000 |      0.000 |            | 
-------------|------------|------------|------------|------------|
  versicolor |          0 |         16 |          2 |         18 | 
             |      0.000 |      0.889 |      0.111 |      0.474 | 
             |      0.000 |      1.000 |      0.182 |            | 
-------------|------------|------------|------------|------------|
   virginica |          0 |          0 |          9 |          9 | 
             |      0.000 |      0.000 |      1.000 |      0.237 | 
             |      0.000 |      0.000 |      0.818 |            | 
-------------|------------|------------|------------|------------|
Column Total |         11 |         16 |         11 |         38 | 
             |      0.289 |      0.421 |      0.289 |            | 
-------------|------------|------------|------------|------------|

 

This is about \(94.7\%\) accurate (36 of the 38 test flowers are classified correctly). Let’s smack on that Laplace Smoothing and see what it does for us.

Improving model performance

iris_classifier2 <- naiveBayes(iris_train[ , -5], iris_train$Species, laplace = 1)
iris_pred2 <- predict(iris_classifier2, iris_test)

CrossTable(iris_pred2, iris_test$Species, prop.chisq = FALSE, chisq = FALSE, 
           prop.t = FALSE,
           dnn = c("Predicted", "Actual"))

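Well, that didn’t change much. This is expected: the four iris predictors are numeric, and in e1071’s naiveBayes() the laplace argument only affects categorical predictors, so the predictions should be the same as before.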

Conclusion

We showed the high accuracy that the Naïve Bayes algorithm has for classifying text-based data, and that in this comparison it was somewhat less accurate on numerical data. The iris data set was selected because I really like working with it and also because I wanted to try to make a model with numerical values the second time around.

References

Almeida, Tiago A., and José María Gómez Hidalgo. 2011. “SMS Spam Collection.” Federal University of Sao Carlos. http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/.

Anderson, Edgar. 1936. “The Species Problem in Iris.” Annals of the Missouri Botanical Garden. doi:10.2307/2394164.

Chen, Stanley F., and Joshua Goodman. 1996. “An Empirical Study of Smoothing Techniques for Language Modeling.” In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 310–18. Santa Cruz, California, USA: Association for Computational Linguistics. doi:10.3115/981863.981904.

Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software. doi:10.18637/jss.v025.i05.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. New York, New York: Cambridge University Press.

Lantz, Brett. 2015. Machine Learning with R. Birmingham, United Kingdom: Packt Publishing.

Wickham, Hadley. 2017. “Tidyverse.” https://www.tidyverse.org/.


  1. Just as a shameless plug for tidyverse(Wickham 2017), because I absolutely love the package and think that it’s an amazingly useful and dare I say fun package to work with.

---
title: "Using Naïve Bayes algorithm for predictive classification"
output:
  html_notebook:
    toc: yes
    toc_float: yes
  html_document:
    df_print: paged
    toc: yes
bibliography: ref.bib
---
```{r Packages, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(tidyverse)
library(tm)
library(SnowballC)
library(wordcloud)
library(e1071)
library(gmodels)
```

Author: Daniel R. Brown, Jr.

# Discussion
## Defining Naïve Bayes
### The Bayes Theorem
To understand the Naïve Bayes algorithm, one must first understand the Bayes
Theorem, which states
$$
P\left(A\vert B\right) = \frac{P\left(B\vert A\right) P\left( A \right )}{P\left(B\right)}
$$
$P\left(A\vert B\right)$ is known as the *conditional probability* or the *posterior probability* (as in, looking back one can see it more clearly), and is read as the probability of *A* given that *B* has occurred. If one considers basic probability to be the ratio of the number of times something happens to the number of times it *could* happen, then conditional probability is the ratio of the number of times *A* and *B* happen together to the number of times *B* happens regardless of *A*.

Along with the posterior probability, there is the *prior probability*, or $P\left(A\right)$. This is an educated guess, made before seeing any evidence, as to how likely something is to have a certain attribute. Lantz, for example, mentions how twenty percent of the emails received in the text's scenario were spam emails. Next we have the *likelihood*, $P\left(B\vert A\right)$. Similar to our prior probability, the likelihood is estimated from previous results: it measures how often *B* was observed in the cases where *A* occurred. Finally, our *marginal likelihood*, $P\left(B\right)$, is the probability that event *B* occurred at all.

Lantz puts it into simple terms best, saying "...if we know event *B* occurred, the probability of event *A* is higher the more often that *A* and *B* occur together each time *B* is observed". [@lantz15]

## What is Naïve Bayes?
**Naïve Bayes** is an interesting application of the Bayes theorem to the problem of classification. Many machine learning methods use Bayesian probability, but Naïve Bayes is the most common.

### Why Naïve?
The algorithm is called *naïve* because it assumes that *1.)* each feature is independent of the other features within a class, and *2.)* each feature in the dataset is as important as any other feature (often neither is the case). Fortunately, Naïve Bayes still tends to produce strong results even when these assumptions are violated.

There are some distinct advantages for using Naïve Bayes to classify data:

1. Algorithm is simple, fast, and effective
2. Algorithm can handle missing or disorganized datasets relatively well
3. Algorithm requires fewer training examples than many other methods
4. Algorithm can easily obtain the estimated probability for a prediction.

The method also has some disadvantages:

1. Algorithm relies on the above two assumptions:
    1. Equally important features
    2. Independent features
2. Algorithm is not as well equipped to handle datasets with many numerical
features.
3. Estimated probabilities are less reliable than predicted classes.

## Using the Naïve Bayes Algorithm
Lantz's book utilizes a collection of SMS text messages that were previously classified as either **spam** (unwanted messages) or **ham** (desired messages). The dataset provided contains 5559 messages, each belonging to one of these two categories [@almei11]. By looking at the data and checking for various patterns, we can estimate the probability that a message is spam or ham based on the words within the texts.

## Naïve Bayes on sms data
### Step one - collecting data
The data mentioned above has already been collected and as such we can move on to step two.
### Step two - exploring and preparing the data
Let's begin by taking a look at the dataset
```{r Importing sms dataset}
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

str(sms_raw)
```
This shows us that we have two features - one containing the categorical data of either spam or ham called **type** and one called **text**, which has the actual text messages contained within. As the data in **type** is currently not represented as categorical data, we can set that manually:
```{r encoding char into factor}
sms_raw$type <- factor(sms_raw$type)
str(sms_raw$type) #Success!
```
#### Cleaning and standardizing text data
We can now utilize the R package ```tm``` to clean the strings of text contained in the text field. ```tm``` is a text mining package developed by Ingo Feinerer and colleagues [@feine08].

The ```tm``` package contains a function called ```VCorpus()``` which creates a **corpus**, or a "body" of text documents as a list object. The *V* is short for volatile, meaning it is stored into your computer's RAM and is not permanent.

```{r creating our VCorpus}
sms_corpus <- VCorpus(VectorSource(sms_raw$text))

typeof(sms_corpus) # Just to show that it is a list
print(sms_corpus)
```
If we would like to see a specific document or sms from our corpus, we can ```inspect()``` individual messages from our new list ```sms_corpus```. The book selected the first two documents from the list, but I think we can go a little classier than that and select two random documents using ```sample()``` and the pipe operator ```%>%```, which is made available by the package ```dplyr``` (conveniently loaded in our ```tidyverse``` package[^1]):

```{r Inspecting our VCorpus}
# Draw two random document indices, then inspect those documents
sample(length(sms_corpus), 2) %>%
  sms_corpus[.] %>%
  inspect()
```
Next, we will need to standardize the documents within our corpus to ensure that the computer can properly characterize them. This includes reducing character strings like **Hello!**, **Hello**, and **hello** to a single form and removing **stop words** like **to**, **and**, **but**, and **or**. Stop words are very common words that humans use constantly but that carry little information for a classifier. After that we will **stem** our documents. **Stemming** refers to the process where a word is stripped to its root. For example, a verb that has the suffixes *ed*, *ing* and *s* will have those suffixes removed, i.e. *jump*, *jumped*, *jumping* and *jumps* will all be standardized to *jump*. Note that irregular forms (like *run* and *ran*) may not be merged by this process. Stemming can be performed using the ```wordStem()``` function from the ```SnowballC``` package, demonstrated below:

```{r SnowballC demo}
wordStem(c("jump", "jumping", "jumped", "jumps"))
```
After performing our *stemming*, we will need to remove the *whitespace* or blank spaces that are left from the previous preprocessing steps using the function ```stripWhitespace()```.

```{r Cleaning our corpus}
sms_corpus_clean <- sms_corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(removePunctuation) %>%
  tm_map(stemDocument) %>%
  tm_map(stripWhitespace)
```

The text has us do all of these steps individually, but by using the ```tidyverse``` package and the pipe operator ```%>%```, we are able to do this in one step without continuously modifying our ```sms_corpus_clean``` list every time. Now we can look at the change before processing and after processing.
```{r The fruits of our preprocessing labor}
cat("The text document prior processing:", "\n")
for(i in 1:3){
  print(as.character(sms_corpus[[i]]))
}
cat("\n")
cat("The text document after processing:", "\n")
for(i in 1:3){
  print(as.character(sms_corpus_clean[[i]]))
}
```
As a note, I had some difficulty using the code ```as.character(sms_corpus[1:3])``` to print just the text from ```sms_corpus```. R would output the text along with a bunch of metadata that we may not want to display.

#### Splitting text documents into words
The final step in our standardization journey is to split the messages into individual words while still maintaining the identity of the messages. This can be done through a process called **tokenization**, where each word in the text string is considered a *token*.

To do this, we can use another function of the ```tm``` package: ```DocumentTermMatrix()```, which builds a document-term matrix or DTM. A DTM is a data structure provided by ```tm``` in which each text message is a row and each word is a column, creating an $N\times M$ matrix. The package also contains functionality for the transpose of our $N\times M$ matrix, which may be useful for performing analysis on fewer documents that are longer. This is called (surprise!) a Term Document Matrix or TDM. Both the DTM and the TDM can be considered **sparse matrices**, because there is a column (in a DTM) or a row (in a TDM) for every individual word contained in the corpus. The matrix is "*sparse*" because the majority of its cells are $0$: any single message contains only a tiny fraction of all the words in the corpus.

```{r Creating our DocumentTermMatrix}
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
```

For future reference, you can also have ```DocumentTermMatrix()``` perform all of the preprocessing from the previous steps itself by passing control options, as shown below:
 
```{r Creating another DocumentTermMatrix}
sms_dtm_no_prep <- DocumentTermMatrix(
  sms_corpus,
  control = list(
    tolower = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE,
    removePunctuation = TRUE,
    stemming = TRUE
  )
)
```

Let's compare the two matrices:

```{r Comparing our DTMs}
cat("Our matrix with preprocessing:", "\n")
sms_dtm
cat("\n")
cat("Our matrix without preprocessing:", "\n")
sms_dtm_no_prep
```
#### Creating training and test datasets
Now we are in the familiar territory of splitting our data into test and training sets. Because the data was stored randomly, we can simply take the first 75% of our entries as our training set and the remainder can be taken as our test set.

```{r Splitting training and test datasets}
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type
```

A proportion table will help show if our test and training sets are both representative of the whole dataset.
```{r}
cat("Our training data")
sms_train_labels %>%
  table %>%
  prop.table
cat("\n")
cat("Our testing data")
sms_test_labels %>%
  table %>%
  prop.table
```
#### Creating a Word Cloud visualization
A **word cloud** lets us see at a glance how frequently words occur in the corpus. Words that appear more often are drawn in a larger font, and words that appear less often in a smaller one. This can be done via the ```wordcloud``` R package.

```{r Creating a Word Cloud, echo=TRUE, message=FALSE, warning=FALSE}
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)
```
We can also visualize word frequencies in the raw SMS data separately for each class, using the ```subset()``` function to filter on the categorical feature ```type```:
```{r Wordcloud: Spam or Ham?, warning=FALSE}
par(mfcol = c(1, 2))
spam <- sms_raw %>%
  subset(type == "spam")
spamCloud <- wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
ham <- sms_raw %>%
  subset(type == "ham")
hamCloud <- wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))
```

#### Creating indicator features for frequent words
To complete our data preprocessing, we must reduce the number of features in our training and test DTMs. To do this, we will use the ```findFreqTerms()``` function (again found in the ```tm``` package) to keep only the words that appear at least five times.

```{r Finding those freq-y terms}
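# Choose the vocabulary from the training data only, so that the
# training and test sets end up with exactly the same columns
sms_freq_words <- sms_dtm_train %>%
  findFreqTerms(5)

sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]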
```
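
As a rough illustration of what this filtering does, the base R sketch below keeps only the columns of a made-up count matrix whose total count reaches a threshold, and applies the vocabulary chosen on the training rows to the test rows so that both sets share the same features. The matrices, the threshold of 2, and all `toy_` names are assumptions for illustration; ```findFreqTerms()``` itself works directly on the sparse DTM.

```r
# Made-up term counts: rows are messages, columns are terms
toy_train <- matrix(c(1, 0, 2, 0,
                      1, 0, 0, 0,
                      0, 1, 1, 0),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(NULL, c("call", "rare", "free", "unseen")))
toy_test <- matrix(c(0, 0, 1, 1),
                   nrow = 1,
                   dimnames = list(NULL, colnames(toy_train)))

# Terms whose total count in the training rows is at least 2
toy_keep <- colnames(toy_train)[colSums(toy_train) >= 2]
toy_keep

# Use the same vocabulary for both sets
toy_train_freq <- toy_train[, toy_keep, drop = FALSE]
toy_test_freq  <- toy_test[, toy_keep, drop = FALSE]
```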

Now we shall write a function that converts the numeric counts in our sparse DTMs into categorical "Yes"/"No" values that our algorithm can process.

```{r Creating a convert_counts function}
convert_counts <- function(x) {
  # Any positive count means the word is present in the message
  ifelse(x > 0, "Yes", "No")
}
```

Applying our ```convert_counts``` function:
```{r Converting sparse numerical matrix into YES/NO matrix}
sms_train <- sms_dtm_freq_train %>%
  apply(MARGIN = 2, convert_counts)
sms_test <- sms_dtm_freq_test %>%
  apply(MARGIN = 2, convert_counts)
```
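
For intuition, here is the same conversion on a tiny made-up count matrix. The helper is redefined under the hypothetical name `to_yes_no` so the sketch is self-contained, and the `toy_` matrix is invented for illustration.

```r
# Same idea as convert_counts(), redefined so this sketch runs on its own
to_yes_no <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# Made-up counts: two messages, two terms
toy_mat <- matrix(c(0, 2,
                    1, 0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("msg1", "msg2"), c("call", "free")))

# MARGIN = 2 applies the function column by column
toy_yes_no <- apply(toy_mat, MARGIN = 2, to_yes_no)
toy_yes_no
```

The result is a character matrix of the same shape, which is the categorical form the classifier is trained on here.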

### Step 3 - training a model on the data
Now comes the fairly straightforward step of training our model on the data and then using that classifier to make predictions on the test set. This requires the ```e1071``` package, which provides the ```naiveBayes()``` function:

```{r Building the classifier and our predictor}
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
sms_pred <- predict(sms_classifier, sms_test)
```
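
To connect this call back to Bayes' theorem, the sketch below works out a posterior by hand for one and then two words. All probabilities are invented for illustration; they are not estimates from the SMS data, and the variable names are hypothetical.

```r
# Invented numbers for one word, say "free"
prior <- c(ham = 0.80, spam = 0.20)                # P(class)
p_word_given_class <- c(ham = 0.05, spam = 0.40)   # P(word appears | class)

# Bayes' theorem: P(class | word) = P(word | class) * P(class) / P(word)
joint <- p_word_given_class * prior
posterior <- joint / sum(joint)
posterior

# With a second word, the "naive" independence assumption lets us
# multiply the per-word likelihoods before normalising
p_word2_given_class <- c(ham = 0.10, spam = 0.30)
joint2 <- prior * p_word_given_class * p_word2_given_class
posterior2 <- joint2 / sum(joint2)
posterior2
```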
### Step 4 - evaluating model performance
Now we can use the ```CrossTable()``` function from the ```gmodels``` package to see how our predictions fared.
```{r creating a Crosstable}
CrossTable(sms_pred, sms_test_labels, prop.chisq = FALSE, chisq = FALSE, 
           prop.t = FALSE,
           dnn = c("Predicted", "Actual"))
```
Our cross table shows us that our classifier is $\approx 97.4\%$ accurate at predicting whether a message is **ham** or **spam**. Can we do better?
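
Accuracy is the share of predictions that land on the diagonal of the confusion table. As a minimal sketch with made-up labels (these are not the SMS results, and the `toy_` names are illustrative):

```r
# Made-up true labels and predictions for ten messages
toy_actual <- factor(c("ham", "ham", "ham", "ham", "ham",
                       "ham", "spam", "spam", "spam", "spam"),
                     levels = c("ham", "spam"))
toy_pred   <- factor(c("ham", "ham", "ham", "ham", "ham",
                       "spam", "spam", "spam", "spam", "ham"),
                     levels = c("ham", "spam"))

# Confusion table: rows are predictions, columns are the truth
toy_table <- table(Predicted = toy_pred, Actual = toy_actual)
toy_table

# Accuracy = correct predictions / all predictions
toy_accuracy <- sum(diag(toy_table)) / sum(toy_table)
toy_accuracy
```

The off-diagonal cells are the two kinds of mistakes: ham flagged as spam, and spam that slipped through as ham.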

### Step 5 - Improving model performance
Let's attempt to improve our model performance by using a different ```laplace``` value for our classifier:
```{r model tuning}
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_pred2, sms_test_labels, prop.chisq = FALSE, chisq = FALSE, 
           prop.t = FALSE,
           dnn = c("Predicted", "Actual"))
```
We obtain a slightly better result of $97.5\%$ accuracy! But what did adding a Laplace smoothing parameter do for us?

#### Laplace Smoothing
What is smoothing? In a research paper from 1996, Chen and Goodman of Harvard University stated that "*Smoothing* is a technique essential in the construction of *n*-gram language models... [or] probability distributions over strings $P\left(s\right)$ that attempts to reflect the frequency with which each string $s$ occurs as a sentence in natural text."[@goodm96]
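Smoothing keeps a rare word from having an outsized effect on the classification. For example, if the word "Supercalifragilisticexpialidocious" occurs in our text messages only once, in a spam message, then its estimated probability in ham messages is zero. Because Naive Bayes multiplies the word probabilities together, that single zero would rule out ham for any message containing the word, no matter what the other words say. Smoothing avoids this by giving every word a small nonzero probability in every class.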
In the book *Introduction to Information Retrieval*, the authors have this to say about smoothing: "[T]he probability of words occurring once in the document is normally over-estimated, since their one occurrence was partly by chance... the smoothing of terms actually implements major parts of the term weighting component. It is not just that an unsmoothed model has conjunctive semantics [that is, a document must contain every term to receive a nonzero probability]; an unsmoothed model works badly because it lacks parts of the term weighting components"[@manni08].

*Laplace smoothing* is one of the more common smoothing methods. It is also known as *additive smoothing* or *Lidstone smoothing* and can be expressed as
$$
\hat{P}\left(t\vert d\right) = \frac{tf_{t,d} + \alpha\hat{P}\left(t\vert M_{c}\right)}{L_{d}+\alpha}
$$
where $tf_{t,d}$ is the raw frequency of term $t$ in document $d$, $L_{d}$ is the number of tokens in the document, $M_{c}$ is a language model built from all documents, and $\alpha$ is a *pseudocount*: a number that represents how strongly we believe in the prior distribution (here taken to be uniform) relative to the observed counts.
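
As a hedged numerical sketch of the formula above: with a uniform background model over a vocabulary of $V$ terms, $\hat{P}\left(t\vert M_{c}\right) = 1/V$, so every term receives the same pseudocount $\alpha / V$. The counts, the four-word vocabulary, and the `smooth_estimate()` helper below are illustrative assumptions, not part of the original analysis.

```r
# Made-up term counts for one document over a four-word vocabulary
tf <- c(free = 3, call = 1, prize = 0, later = 0)
L_d <- sum(tf)            # number of tokens in the document
V <- length(tf)           # vocabulary size
prior <- rep(1 / V, V)    # uniform background model P(t | M_c)

# Smoothed estimate from the formula: (tf + alpha * prior) / (L_d + alpha)
smooth_estimate <- function(tf, prior, alpha) {
  (tf + alpha * prior) / (sum(tf) + alpha)
}

unsmoothed <- tf / L_d
smoothed <- smooth_estimate(tf, prior, alpha = V)  # alpha = V adds 1 to every count

rbind(unsmoothed, smoothed)
```

With `alpha = V` this is the familiar add-one estimate: the unseen words move from a probability of zero to a small positive value, and the estimates still sum to one.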

Long story short: Laplace smoothing makes a great model even better because a word that is missing from one class in the training data no longer forces that class's probability to zero.

[^1]: Just as a shameless plug for ```tidyverse```[@wickh17], because I absolutely love the package and think that it's an amazingly useful and dare I say *fun* package to work with.

## Creating analysis on iris dataset
Again we will be taking a look at the iris dataset[@ander36].

### Collecting data
The data is already available in base R.
```{r Import iris dataset}
iris.df <- iris
```

### Exploring and preparing the data
```{r Splitting data into test / training sets}
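# Randomly pick 112 of the 150 rows (about 75%) for training
random <- sample(nrow(iris.df), 112, replace = FALSE)
# Every row that was not sampled goes to the test set
iris_test <- iris.df[-random, ]
iris_train <- iris.df[random, ]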
```
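
Because ```sample()``` draws a different split on every run, the accuracy figures for iris will vary somewhat from knit to knit. One hedged option is to fix the random seed first. The seed value `123` and the `demo_` names are arbitrary choices for illustration, not the values used to produce the results in this document.

```r
set.seed(123)  # arbitrary seed, chosen only so the split is repeatable

demo_idx <- sample(nrow(iris), 112, replace = FALSE)
demo_train <- iris[demo_idx, ]
demo_test <- iris[-demo_idx, ]

# Sizes of the two sets and the class balance in each
c(train = nrow(demo_train), test = nrow(demo_test))
prop.table(table(demo_train$Species))
prop.table(table(demo_test$Species))
```

With only 38 test rows, each misclassified flower moves the accuracy by more than two percentage points, so checking the class balance of the split is worthwhile.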

### Training a model on the data
```{r Training our model}
iris_classifier <- naiveBayes(iris_train[ , -5], iris_train$Species)
iris_pred <- predict(iris_classifier, iris_test)
```

### Evaluating model performance
```{r Iris Cross Table}
CrossTable(iris_pred, iris_test$Species, prop.chisq = FALSE, chisq = FALSE, 
           prop.t = FALSE,
           dnn = c("Predicted", "Actual"))
```
This is about $92\%$ accurate. Let's smack on that Laplace smoothing and see what it does for us.
### Improving model performance
```{r Improve that model, yo}
iris_classifier2 <- naiveBayes(iris_train[ , -5], iris_train$Species, laplace = 1)
iris_pred2 <- predict(iris_classifier2, iris_test)

CrossTable(iris_pred2, iris_test$Species, prop.chisq = FALSE, chisq = FALSE, 
           prop.t = FALSE,
           dnn = c("Predicted", "Actual"))
```
Well, that didn't change much.
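
One likely explanation, offered as a hedged sketch: in ```e1071```, ```naiveBayes()``` models numeric predictors with a per-class Gaussian (a mean and a standard deviation), and the ```laplace``` argument only adjusts the count tables of categorical predictors. Since all four iris predictors are numeric, the two fits should be identical. The check below uses the full iris data and `check_` names chosen for illustration.

```r
library(e1071)

check_fit0 <- naiveBayes(iris[, -5], iris$Species)
check_fit1 <- naiveBayes(iris[, -5], iris$Species, laplace = 1)

# Per-class mean and sd for one predictor, with and without laplace
check_fit0$tables$Petal.Length
check_fit1$tables$Petal.Length

# Do the two models predict the same class on every row?
identical(predict(check_fit0, iris[, -5]), predict(check_fit1, iris[, -5]))
```

To see smoothing make a difference on this dataset, the measurements would first have to be binned into categories, which is a different model from the one fitted here.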

## Conclusion
We showed that the Naive Bayes algorithm is highly accurate at classifying text-based data, and that in our experiments it was somewhat less accurate on the numeric iris data.
The iris dataset was selected because I really like working with it and also because I wanted to try to build a model with numerical values the second time around.

## References
