It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/

I will use a dataset from the UCI Machine Learning Repository to build a spam classifier for SMS messages. Each message comes with a label indicating whether it is spam (unwanted) or ham (legitimate).

A quick Google search suggests that roughly 300 billion emails were sent per day in 2017. The figure that turns up for text messages is 8 trillion, although that estimate appears to be an annual rather than a daily count. Text messages are harder to classify than emails because they are short. Moreover, the ubiquity of texting has given rise to its own lingo, e.g. lol and brb. I will apply a Naive Bayes classifier to filter spam text messages.

# Loading data
raw_text <- read.csv("https://raw.githubusercontent.com/saayedalam/Data/master/sms_spam.csv", stringsAsFactors = FALSE)

# Looking at the structure of data
str(raw_text)
## 'data.frame':    5559 obs. of  2 variables:
##  $ type: chr  "ham" "ham" "ham" "spam" ...
##  $ text: chr  "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline"| __truncated__ ...
# Converting character vector to categorical vector
raw_text$type <- factor(raw_text$type)
str(raw_text$type) # verifying the conversion
##  Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...
# Using table function to get a count of spam and ham texts 
table(raw_text$type)
## 
##  ham spam 
## 4812  747
# Exploring, cleaning, and preparing data for analysis using the tm package
library(tm) # loading the package

# Creating a collection of text documents (a corpus)
text_corpus <- VCorpus(VectorSource(raw_text$text))

# We can get a summary of individual texts from the corpus (a corpus is a list)
inspect(text_corpus[1:5]) # viewing first 5 texts
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 49
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 23
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 43
## 
## [[4]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 150
## 
## [[5]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 161
# Viewing the content of the first text
as.character(text_corpus[[1]])
## [1] "Hope you are having a good week. Just checking in"
# Viewing the content of several texts using the lapply() function
lapply(text_corpus[1:5], as.character) # applies as.character() to each text, like a for loop but more concise
## $`1`
## [1] "Hope you are having a good week. Just checking in"
## 
## $`2`
## [1] "K..give back my thanks."
## 
## $`3`
## [1] "Am also doing in cbe only. But have to pay."
## 
## $`4`
## [1] "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+"
## 
## $`5`
## [1] "okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm"
# Standardizing the words by removing punctuation and other characters
text_corpus_clean <- tm_map(text_corpus, content_transformer(tolower)) # lowercase all texts
text_corpus_clean <- tm_map(text_corpus_clean, removeNumbers) # remove all numbers
text_corpus_clean <- tm_map(text_corpus_clean, removeWords, stopwords()) # remove common words such as "to" and "but"
text_corpus_clean <- tm_map(text_corpus_clean, removePunctuation) # remove all punctuation

# Stemming words, i.e. reducing words like learned, learning, and learns to their base form, learn
library(SnowballC) # this package provides a stemming function
text_corpus_clean <- tm_map(text_corpus_clean, stemDocument) # stemming words
text_corpus_clean <- tm_map(text_corpus_clean, stripWhitespace) # collapse runs of whitespace into a single space

# Splitting texts into individual words using Document Term Matrix (DTM)
text_dtm <- DocumentTermMatrix(text_corpus_clean) # rows are texts, columns are words
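
To make the row/column layout of a document-term matrix concrete, here is a minimal base R sketch that builds one by hand. The three messages and every `toy_` name are hypothetical and not taken from the SMS dataset. The sketch only mimics the counting step, not the internals of `DocumentTermMatrix()`.

```r
# Hypothetical toy messages (not from the SMS dataset)
toy_texts <- c("free prize call now", "call me later", "free free entry")

# Split each message into words on whitespace
toy_tokens <- strsplit(toy_texts, "\\s+")

# Vocabulary: every distinct word, sorted alphabetically
toy_vocab <- sort(unique(unlist(toy_tokens)))

# For each message, count how often each vocabulary word occurs
toy_dtm <- t(vapply(
  toy_tokens,
  function(words) tabulate(match(words, toy_vocab), nbins = length(toy_vocab)),
  integer(length(toy_vocab))
))
dimnames(toy_dtm) <- list(Docs = as.character(seq_along(toy_texts)),
                          Terms = toy_vocab)

toy_dtm # one row per text, one column per word
```

In the resulting matrix, the row for the third message holds a 2 in the `free` column, because that word occurs twice in it.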

# Creating train and test datasets for accurate assessment of the performance of the predictive model on unseen data.
text_train <- text_dtm[1:4169, ] # 75% for training
text_test <- text_dtm[4170:5559, ] # 25% for testing
text_train_type <- raw_text[1:4169, ]$type
text_test_type <- raw_text[4170:5559, ]$type

# Verifying that both sets have a similar proportion of spam and ham
prop.table(table(text_train_type))
## text_train_type
##       ham      spam 
## 0.8647158 0.1352842
prop.table(table(text_test_type))
## text_test_type
##       ham      spam 
## 0.8683453 0.1316547
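
The split above takes the first 75% of rows, which only gives similar class proportions if the messages are stored in no particular order. If that cannot be assumed, a stratified random split is one alternative. The sketch below is a minimal base R version under that assumption. The labels are hypothetical and only reuse the class counts from the table above, and the seed is arbitrary.

```r
set.seed(123) # arbitrary seed so the sketch is reproducible

# Hypothetical labels with the same class counts as the dataset (4812 ham, 747 spam)
toy_type <- factor(rep(c("ham", "spam"), times = c(4812, 747)))

# Within each class, randomly pick 75% of the row indices for training
train_idx <- unlist(lapply(
  split(seq_along(toy_type), toy_type),
  function(idx) sample(idx, size = round(0.75 * length(idx)))
))

# Everything not chosen for training goes to the test set
test_idx <- setdiff(seq_along(toy_type), train_idx)

prop.table(table(toy_type[train_idx]))
prop.table(table(toy_type[test_idx]))
```

With the real data, the same index vectors could be used to subset both the document-term matrix and the label vector, for example `text_dtm[train_idx, ]` and `raw_text$type[train_idx]`.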
# Visualizing text data using word clouds
library(wordcloud) # loading the package

# Visualizing text by their type i.e. spam or ham
text_spam <- subset(raw_text, type == "spam") # selecting spam texts
wordcloud(text_spam$text, max.words = 40, scale = c(3, 0.5)) # free, stop: good indicators of spam texts

text_ham <- subset(raw_text, type == "ham") # selecting ham texts
wordcloud(text_ham$text, max.words = 40, scale = c(3, 0.5)) # love, sorry: good indicators of legitimate texts

# Creating indicator features for frequent words for better analysis
text_freq_words <- findFreqTerms(text_train, 5) # keeping only terms that occur at least 5 times in the training texts
str(text_freq_words) # notice the reduced number of terms
##  chr [1:1139] "€â\200œ" "â£wk" "abiola" "abl" "abt" "accept" ...
# Selecting only the frequent words from the train and test datasets
text_freq_words_train <- text_train[ , text_freq_words]
text_freq_words_test <- text_test[ , text_freq_words]

# Converting the word counts in the DTM to categorical yes/no indicators for the model
convert <- function(x) ifelse(x > 0, "y", "n") # "y" if the word occurs in the text, "n" otherwise
text_train <- apply(text_freq_words_train, MARGIN = 2, convert)
text_test <- apply(text_freq_words_test, MARGIN = 2, convert)
str(text_train) # verifying the conversion
##  chr [1:4169, 1:1139] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:4169] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:1139] "€â\200œ" "â£wk" "abiola" "abl" ...
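
The conversion step can be checked on a small example. The sketch below applies the same rule to a hypothetical 3 x 3 count matrix. The matrix and the `to_indicator` name are made up for illustration.

```r
# Hypothetical word counts: 3 texts (rows) by 3 terms (columns)
toy_counts <- matrix(c(0, 2, 1,
                       1, 0, 0,
                       0, 0, 3),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(Docs = c("1", "2", "3"),
                                     Terms = c("call", "free", "prize")))

# Same rule as convert(): any positive count becomes "y", zero becomes "n"
to_indicator <- function(x) ifelse(x > 0, "y", "n")

# MARGIN = 2 applies the function column by column, one term at a time
toy_indicators <- apply(toy_counts, MARGIN = 2, to_indicator)
toy_indicators
```

The count of 2 for `free` in the first text becomes `"y"`, just like a count of 1 would: the indicator features record only whether a word occurs, not how often.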
# Training model on the dataset
library(e1071) # this package provides a Naive Bayes classifier

# Creating a Naive Bayes classifier (the "naive" assumption is that, given the class, observing one term in a text is independent of observing another)
text_classifier <- naiveBayes(text_train, text_train_type)
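
To show what the classifier estimates, the following sketch works through Bayes' theorem by hand on a hypothetical training set of six texts and two indicator features. It illustrates the calculation under the independence assumption and is not the internal code of `naiveBayes()`. All names and values are invented.

```r
# Hypothetical training set: 6 texts, 2 indicator features
toy_train <- data.frame(
  free = c("y", "y", "n", "n", "n", "y"),
  call = c("y", "n", "n", "y", "n", "y"),
  type = c("spam", "spam", "ham", "ham", "ham", "ham"),
  stringsAsFactors = FALSE
)

# New text to classify: contains "free" but not "call"
new_text <- c(free = "y", call = "n")

# Unnormalized score per class: prior times the product of the
# per-word conditional probabilities (the independence assumption)
class_score <- function(cls) {
  rows <- toy_train[toy_train$type == cls, ]
  prior <- nrow(rows) / nrow(toy_train)
  likelihood <- prod(sapply(names(new_text),
                            function(w) mean(rows[[w]] == new_text[[w]])))
  prior * likelihood
}

scores <- sapply(c(ham = "ham", spam = "spam"), class_score)

# Normalize so the two posterior probabilities sum to 1
posterior <- scores / sum(scores)
posterior
```

Here the spam score is (2/6) x 1 x 0.5 = 1/6 and the ham score is (4/6) x 0.25 x 0.5 = 1/12, so the posterior probability of spam is 2/3.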

# Evaluating performance of the model
text_test_prediction <- predict(text_classifier, text_test) # using the classifier to make predictions

# Comparing the prediction to the true values
library(gmodels) # this package provides the CrossTable() function for the comparison
CrossTable(text_test_prediction, text_test_type, 
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1390 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1201 |        30 |      1231 | 
##              |     0.976 |     0.024 |     0.886 | 
##              |     0.995 |     0.164 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         6 |       153 |       159 | 
##              |     0.038 |     0.962 |     0.114 | 
##              |     0.005 |     0.836 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1207 |       183 |      1390 | 
##              |     0.868 |     0.132 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
# Looking at the table, 36 of the 1390 texts are mislabeled: 6 legitimate texts were flagged as spam and 30 spam texts were missed.
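
The counts in the cross table can be condensed into a few summary metrics. The sketch below copies the four cell counts from the table above into a matrix and computes accuracy, precision, and recall for the spam class. The variable names are chosen for illustration.

```r
# Counts copied from the cross table above (rows = predicted, columns = actual)
confusion <- matrix(c(1201, 30,
                         6, 153),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(predicted = c("ham", "spam"),
                                    actual = c("ham", "spam")))

# Share of all texts that were labeled correctly
accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of texts predicted as spam that really are spam
precision <- confusion["spam", "spam"] / sum(confusion["spam", ])

# Share of actual spam texts that the model caught
recall <- confusion["spam", "spam"] / sum(confusion[, "spam"])

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
# accuracy 0.974, precision 0.962, recall 0.836
```

Precision and recall match the 0.962 row proportion and the 0.836 column proportion printed in the spam/spam cell of the cross table.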
# Improving the model's performance
text_classifier_improved <- naiveBayes(text_train, text_train_type, laplace = 1) # Laplace smoothing adds 1 to every count so that a word never seen in one class does not get a zero probability for that class
text_test_prediction_improved <- predict(text_classifier_improved, text_test) # using the improved model to make predictions
CrossTable(text_test_prediction_improved, text_test_type, 
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1390 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1202 |        28 |      1230 | 
##              |     0.977 |     0.023 |     0.885 | 
##              |     0.996 |     0.153 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         5 |       155 |       160 | 
##              |     0.031 |     0.969 |     0.115 | 
##              |     0.004 |     0.847 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1207 |       183 |      1390 | 
##              |     0.868 |     0.132 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
# Looking at the improved table, 33 texts are mislabeled (5 legitimate texts flagged as spam, 28 spam texts missed). Overall, the model classified over 97% of all texts (1357 of 1390) correctly as spam or ham.
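
Why `laplace = 1` helps can be seen from a single conditional probability. The sketch below uses hypothetical counts and the usual add-k smoothing formula for a feature with two levels ("y" and "n"). The `smoothed` helper is invented for illustration, and it is an assumption that it matches what `naiveBayes()` computes for categorical features.

```r
# Hypothetical counts: number of training texts per class, and how many contain the word
n_spam <- 100
n_ham <- 500
word_in_spam <- 8
word_in_ham <- 0 # the word never appeared in a ham text

# Estimate of P(word present | class) with add-k smoothing over n_levels feature levels
smoothed <- function(count, n_class, k, n_levels = 2) {
  (count + k) / (n_class + k * n_levels)
}

# Without smoothing (k = 0) the ham estimate is exactly zero,
# which would force the whole ham score of any text containing the word to zero
no_smoothing <- c(spam = smoothed(word_in_spam, n_spam, k = 0),
                  ham = smoothed(word_in_ham, n_ham, k = 0))

# With k = 1 the ham estimate becomes small but nonzero
with_smoothing <- c(spam = smoothed(word_in_spam, n_spam, k = 1),
                    ham = smoothed(word_in_ham, n_ham, k = 1))

no_smoothing
with_smoothing
```

Because Naive Bayes multiplies the per-word probabilities, a single zero would otherwise override the evidence from every other word in the text.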