Introduction

For this project we will attempt to classify SMS messages as spam or ham using a naive Bayes model. The data was obtained from:
http://archive.ics.uci.edu/ml/datasets/sms+spam+collection

The following packages will be used (loaded in the sketch below):

  1. tm
  2. wordcloud
  3. e1071
  4. magrittr (provides the %>% pipe used in the cleaning function)
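
A minimal loading sketch, assuming the packages are already installed. SnowballC is not in the list above, but it must be installed because tm's stemDocument() uses it as its stemming backend.

library(tm)         # corpus handling, cleaning, term-document matrices
library(wordcloud)  # word cloud plots
library(e1071)      # naiveBayes classifier
library(magrittr)   # %>% pipe used in clean_corp()
library(SnowballC)  # stemming backend for stemDocument(); attaching is optional, installation is required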

Load Data

First we load the data into a data frame with two columns, “type” and “text”. The type column labels whether each message is spam or ham, and the text column contains the raw text of the SMS message. The data contains 747 spam messages and 4827 ham messages.

file <- "data/SMSSpamCollection"
# tab-separated, no header row; quote = "" so apostrophes in messages are not treated as quote characters
df <- read.table(file, sep = "\t", quote = "", header = FALSE, stringsAsFactors = FALSE)
colnames(df) <- c("type", "text")

length(which(df$type == "spam"))
## [1] 747
length(which(df$type == "ham"))
## [1] 4827
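
Since the classes are imbalanced, it is worth keeping the spam share in mind when interpreting accuracy later; a quick sketch:

# proportion of spam vs. ham (747 / 5574, roughly 13% spam)
prop.table(table(df$type))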

Create & Clean Corpus

Below we create a corpus from the data frame and perform the cleaning operations. The cleaning steps are wrapped in a function so they can be reused later when we build separate corpora for the training and testing subsets. A summary of the resulting term-document matrix (TDM) is shown below.

corpus <- Corpus(VectorSource(df$text))

clean_corp <- function(corp){
    corp <- corp %>% tm_map(removeNumbers)                 # drop digits
    corp <- corp %>% tm_map(removePunctuation)             # drop punctuation
    corp <- corp %>% tm_map(removeWords, stopwords())      # drop common English stop words
    corp <- corp %>% tm_map(stripWhitespace)               # collapse repeated whitespace
    corp <- corp %>% tm_map(content_transformer(tolower))  # lowercase all text
    corp <- corp %>% tm_map(stemDocument)                  # stem words (via SnowballC)
    return(corp)
}

corpus <- clean_corp(corpus)

tdm <- TermDocumentMatrix(corpus)
tdm
## <<TermDocumentMatrix (terms: 6887, documents: 5574)>>
## Non-/sparse entries: 44714/38343424
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
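
As a quick sanity check on the cleaned corpus, tm's findFreqTerms() lists the terms above a frequency threshold; the cutoff of 100 below is an arbitrary choice for illustration.

# terms that occur at least 100 times across all messages
findFreqTerms(tdm, lowfreq = 100)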

The last cleanup step is the removal of sparse terms. Below we drop terms that appear in fewer than 10 messages.

# remove terms appearing in fewer than 10 messages
tdm <- tdm %>% removeSparseTerms(1 - 10/length(corpus))
tdm
## <<TermDocumentMatrix (terms: 814, documents: 5574)>>
## Non-/sparse entries: 32642/4504594
## Sparsity           : 99%
## Maximal term length: 13
## Weighting          : term frequency (tf)

Term Frequency Word Cloud

Below we look at the most frequent words for spam and for ham using word clouds.

spam_indices <- which(df$type == "spam")
wordcloud(corpus[spam_indices], min.freq=40)

ham_indices <- which(df$type == "ham")
wordcloud(corpus[ham_indices], min.freq=40)

Supervised Training

Below we split the data into a training set and a testing set, with 70% of the messages used for training and the remaining 30% for testing.

We isolate the corresponding subsets, create a new corpus for each, clean each corpus, and then turn each into a document-term matrix (DTM). We then use a naive Bayes classifier to predict whether an SMS is spam or ham.

set.seed(123)

# 70/30 training/testing split
sample_size <- floor(0.70 * nrow(df))
train_idx <- sample(seq_len(nrow(df)), size = sample_size)

training_df <- df[train_idx, ]
testing_df <- df[-train_idx, ]
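
# sanity check on the split sizes: floor(0.7 * 5574) = 3901 training, 1673 testing messages
nrow(training_df)
nrow(testing_df)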

# Create cleaned corpus for training and test data
training_corp <- clean_corp(Corpus(VectorSource(training_df$text)))
testing_corp <- clean_corp(Corpus(VectorSource(testing_df$text)))

# document-term matrices from the new corpora
training_dtm <- DocumentTermMatrix(training_corp)
testing_dtm <- DocumentTermMatrix(testing_corp)
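
Because the two corpora are cleaned independently, the training and testing DTMs can end up with somewhat different vocabularies. A common alternative, not used here (the results below come from the code above), is to build the test DTM against the training vocabulary; a hedged sketch:

# restrict the test DTM to terms seen during training (alternative approach, not used above)
testing_dtm_dict <- DocumentTermMatrix(testing_corp,
                                       control = list(dictionary = Terms(training_dtm)))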

# convert word counts to a binary "present / not present" factor
counter <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}

# apply counter to every column (term) of each DTM
train_sms <- apply(training_dtm, 2, counter)
test_sms <- apply(testing_dtm, 2, counter)

# train the naive Bayes classifier on the term-presence features
classifier <- naiveBayes(train_sms, factor(training_df$type))
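
As a side note, e1071's naiveBayes() also accepts a laplace argument for Laplace smoothing of the conditional probabilities; a sketch of that variant, which is not used for the results reported below:

# same model with Laplace smoothing (alternative, not used for the results below)
classifier_laplace <- naiveBayes(train_sms, factor(training_df$type), laplace = 1)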

Classification of the Test Data

Below we run the classifier on the test data:

predict_test <- predict(classifier, newdata=test_sms)

table(predict_test, testing_df$type)
##             
## predict_test  ham spam
##         ham  1445   11
##         spam   11  206
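
The headline numbers can be computed directly from the confusion matrix; a minimal sketch (the values in the comments are taken from the table above):

conf_mat <- table(predict_test, testing_df$type)
sum(diag(conf_mat)) / sum(conf_mat)                 # overall accuracy: (1445 + 206) / 1673, about 0.987
conf_mat["spam", "spam"] / sum(conf_mat[, "spam"])  # spam recall: 206 / 217, about 0.95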

As shown above, the naive Bayes classifier reaches an overall accuracy of about 98.7% on the test data (1651 of 1673 messages classified correctly) and correctly flags roughly 95% of the spam messages (206 of 217).

Sources:

  1. http://archive.ics.uci.edu/ml/datasets/sms+spam+collection
  2. https://www.rdocumentation.org/packages/e1071/versions/1.7-1/topics/naiveBayes
  3. https://rpubs.com/riazakhan94/naive_bayes_classifier_e1071
  4. https://rpubs.com/mzc/mlwr_nb_sms_spam