1. Introduction

For this assignment, we’re asked to work with R textual analysis tools to download a text corpus and classify its documents. I chose an SMS message dataset from the UC Irvine Machine Learning Repository. I used the tidytext and tm packages to explore the data and prepare it for modeling, then fit a Support Vector Machine to predict whether messages were spam or ham. The model did well at predicting ham, but not spam. Overall prediction accuracy was about 84 percent. Sensitivity – the probability of correctly predicting ham – was 96 percent. Specificity – the probability of correctly predicting spam – was only 1 percent. Detailed results are summarized in the confusion matrix below. I did not have enough time to tune the model.



2. Get the SMS Message data from UCI

# note: quoting is disabled in read.delim() below to avoid parse errors from embedded quote characters

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"

if (!file.exists("smsspamcollection.zip")) {
  download.file(url = url, destfile = "smsspamcollection.zip", method = "curl")
}

unzip("smsspamcollection.zip")

corpus <- read.delim("SMSSpamCollection", sep="\t", header=F, colClasses="character", quote="")

# str(corpus)
colnames(corpus) <- c("cat", "text")
corpus$cat <- factor(corpus$cat)

pander(table(corpus$cat), caption="Spam and Ham Count in Raw Data")
Spam and Ham Count in Raw Data

 ham   spam
4827    747


3. Use tidytext to explore the data

In these steps we tokenize the data – i.e., break each message into its individual words – remove “stop words” (common words such as prepositions and articles that typically don’t add meaning), and then chart the most frequently used words.
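
To make the tokenizing step concrete, here is a minimal illustration on a made-up message (not one from the corpus): unnest_tokens() returns a data frame with one lowercased word per row.

# toy example (hypothetical message) of what unnest_tokens() produces
tibble(msg = 1, text = "Win a FREE prize now!") %>%
  unnest_tokens(word, text)
# one row per word: "win", "a", "free", "prize", "now"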

# first tokenize
tidy_corpus <- corpus %>%
  unnest_tokens(word, text)

# remove stop words
data("stop_words")

tidy_corpus <- tidy_corpus %>%
                  anti_join(stop_words)

tidy_corpus %>%
  count(word, sort = TRUE) %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
    geom_col() +
    xlab(NULL) +
    coord_flip() +
  theme_tufte()



4. Sentiment analysis, ham vs. spam

tidytext has tools for sentiment analysis. Following examples in our text, I was able to find the words that contributed most to messages with positive or negative sentiment.
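
For context, a lexicon returned by get_sentiments() is just a lookup table of words and sentiment labels; the inner_join() calls below attach those labels to our tokens. A quick peek at a few words we know appear in the messages (illustrative only):

# the bing lexicon is a two-column table: word and sentiment
get_sentiments("bing") %>%
  filter(word %in% c("free", "urgent", "prize"))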

tidy_spam <- tidy_corpus %>% 
  filter(cat == 'spam') %>% 
  count(word, sort = TRUE)

tidy_spam <- 
  tidy_spam %>%
  mutate(prop = n/sum(n))

tidy_ham <- tidy_corpus %>% 
  filter(cat == 'ham') %>% 
  count(word, sort = TRUE)

tidy_ham <- 
  tidy_ham %>%
  mutate(prop = n/sum(n))

# try sentiment analysis

nrcmad <- get_sentiments("nrc") %>% 
  filter(sentiment == "anger")

tidy_corpus %>%
  inner_join(nrcmad) %>%
  count(word, sort = TRUE)
## # A tibble: 182 × 2
##       word     n
##      <chr> <int>
## 1     cash    88
## 2    money    59
## 3      ill    43
## 4     shit    35
## 5      bad    32
## 6      mob    25
## 7     hurt    23
## 8      hot    22
## 9    words    22
## 10 feeling    21
## # ... with 172 more rows
# spam is much more positive
corpus_sentiment <- tidy_corpus %>%
  inner_join(get_sentiments("bing")) %>%
  count(cat, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

# sentiment of words in sms messages
sms_word_counts <- tidy_corpus %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

sms_word_counts
## # A tibble: 752 × 3
##      word sentiment     n
##     <chr>     <chr> <int>
## 1    free  positive   283
## 2    love  positive   207
## 3   happy  positive   108
## 4   prize  positive    92
## 5    miss  negative    79
## 6     won  positive    73
## 7     win  positive    72
## 8  urgent  negative    66
## 9    nice  positive    58
## 10   fine  positive    50
## # ... with 742 more rows
# plot it
sms_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

# spam sentiments

spam_word_counts <- tidy_corpus %>%
  filter(cat == 'spam') %>% 
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

ham_word_counts <- tidy_corpus %>%
  filter(cat == 'ham') %>% 
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

spam_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() +
  theme_tufte()

ham_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() +
  theme_tufte()



5. Word Cloud – I hate ’em

Sure, they’re ugly and a total cliche. But I still hate ’em.

library(wordcloud)
## Loading required package: RColorBrewer
plot.new()
par(mfrow=c(1,2))

tidy_ham %>%
  with(wordcloud(word, n, max.words = 50))

tidy_spam %>%
  with(wordcloud(word, n, max.words = 50))



6. Create and Inspect a Document Term Matrix

tm has functions that help prepare text data for modeling. One essential data structure is the Document Term Matrix (DTM). In a DTM, every document is a row and every term (word) is a column. Like tidytext, tm has a number of built-in functions to find frequent words, inspect the matrix, and so on. For instance, inspect() on the full matrix reports the number of documents in our corpus (5,574), the number of terms (1,592) and the maximal term length (19 characters). You can combine other R functions to determine, say, that 854 terms appear 10 or more times in the corpus. For our model, we keep only those frequent terms, excluding any term that appears fewer than 10 times in the messages.

The output of the tm inspect() function below summarizes the number of documents and terms in the filtered matrix and displays a sample of 10 documents and 10 frequent terms.
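
For intuition, here is what a DTM looks like for a tiny made-up corpus (illustrative only; note that tm’s default tokenizer drops words shorter than three characters):

# toy illustration: two hypothetical messages become two rows of term counts
toy <- Corpus(VectorSource(c("free prize now", "see you at home now")))
inspect(DocumentTermMatrix(toy))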

## cleanup steps ##

# first put the corpus in tm format
corpus2 <- Corpus(VectorSource(corpus$text))
# standardize to lowercase
corpus2 <- tm_map(corpus2, content_transformer(tolower))
# remove tm stopwords
corpus2 <- tm_map(corpus2, removeWords, stopwords())
# standardize whitespaces
corpus2 <- tm_map(corpus2, stripWhitespace)
# remove punctuation
corpus2 <- tm_map(corpus2, removePunctuation)

# corpus2[[1]]$content

# make a dtm with the cleaned-up corpus

dtm <- DocumentTermMatrix(corpus2)
# inspect(dtm)

# words appearing at least 10 times
features <- findFreqTerms(dtm, 10)
# summary(features)
# head(features)
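
# added check (not part of the original run): the number of frequent terms
# should match the 854 terms reported by inspect(dtm2) below
# length(features)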

# limit the DTM to those frequent terms (10 or more appearances) using the dictionary parameter
dtm2 <- DocumentTermMatrix(corpus2, list(global = c(2, Inf),
                                         dictionary = features))
inspect(dtm2)
## <<DocumentTermMatrix (documents: 5574, terms: 854)>>
## Non-/sparse entries: 29424/4730772
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   call can free get just know like ltgt now will
##   1086    0   0    0   1    0    0    1    0   0   11
##   1361    0   0    0   1    0    0    1    3   0    2
##   1580    0   0    0   0    0    0    0   18   0    0
##   1864    0   0    0   0    0    1    1    0   0    0
##   2135    1   0    0   0    2    0    0    0   0    1
##   2159    0   0    0   0    0    0    0    0   0    0
##   2381    0   1    0   0    0    0    0    1   0    0
##   2435    0   3    1   1    0    0    0    6   0    0
##   2850    0   0    0   0    0    0    0    0   0    0
##   3018    0   0    0   0    0    0    0    2   0    0
# freq <- colSums(as.matrix(dtm2))
# length(freq)

# ord <- order(freq)
# freq[head(ord)]
# freq[tail(ord)]

#head(table(freq), 10)
# tail(table(freq), 10)


7. Split into test and training sets

We’ve standardized the corpus by removing stopwords, punctuation and extra whitespace and converting to lowercase. In text messages, capitalization and punctuation sometimes do carry meaning; we’ll assume for the moment that they don’t and fit a model that sets a baseline classification rate. Next we split the data into training and test sets before building a classifier. The function createDataPartition() from the caret package does this for us. The table shows that the proportions of ham and spam messages in the training and test sets are approximately the same as in the original data.

# set.seed(8080)   # uncomment to make the partition reproducible

train_idx <- createDataPartition(corpus$cat, p=0.75, list=FALSE)

# set for the original raw data 
train1 <- corpus[train_idx,]
test1 <- corpus[-train_idx,]

# set for the cleaned-up data
train2 <- corpus2[train_idx]
test2 <- corpus2[-train_idx]

# check to see if the proportions of ham/spam are the same 
frqtab <- function(x) {
    round(100 * prop.table(table(x)), 1)
}

ft_orig <- frqtab(corpus$cat)
ft_train <- frqtab(train1$cat)
ft_test <- frqtab(test1$cat)
ft_df <- as.data.frame(cbind(ft_orig, ft_train, ft_test))
colnames(ft_df) <- c("Original", "Training set", "Test set")

pander(head(ft_df), caption="Ham/Spam in Test and Training Sets")
Ham/Spam in Test and Training Sets

       Original   Training set   Test set
ham        86.6           86.6       86.6
spam       13.4           13.4       13.4

8. Convert the Document Term Matrix

To work with classifiers, the Document Term Matrix has to be converted from a matrix of word counts to a matrix of Boolean indicators that simply record whether each word is present in a message. Here we build new matrices from the training and test data and convert them for modeling.

dict2 <- findFreqTerms(dtm2, lowfreq=10)

sms_train <- DocumentTermMatrix(train2, list(dictionary=dict2))
sms_test <- DocumentTermMatrix(test2, list(dictionary=dict2))

# convert the DTM counts to 0/1 presence indicators; the factor version (useful for Naive Bayes) is left commented out
convert_counts <- function(x) {
    x <- ifelse(x > 0, 1, 0)
    # x <- factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}

sms_train <- sms_train %>% apply(MARGIN=2, FUN=convert_counts)
sms_test <- sms_test %>% apply(MARGIN=2, FUN=convert_counts)

sms_train <- as.data.frame(sms_train)
sms_test <- as.data.frame(sms_test)
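
# quick sanity check of convert_counts() on a toy vector (illustrative, not from the data):
# any count above zero becomes 1
print(convert_counts(c(0, 2, 5)))   # 0 1 1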

str(sms_train)
## 'data.frame':    4182 obs. of  854 variables:
##  $ available      : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ crazy          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ got            : num  0 1 0 0 0 0 1 0 0 0 ...
##  $ great          : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ point          : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ wat            : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ world          : num  0 1 0 0 0 1 0 1 0 0 ...
##  $ lar            : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ wif            : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ apply          : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ comp           : num  0 1 0 0 0 0 1 1 0 0 ...
##  $ entry          : num  0 1 0 0 0 0 1 0 0 0 ...
##  $ final          : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ free           : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ may            : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ receive        : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ text           : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ txt            : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ win            : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ wkly           : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ already        : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ dun            : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ early          : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ say            : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ around         : num  0 0 0 0 1 0 0 1 0 0 ...
##  $ goes           : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ nah            : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ think          : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ though         : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ usf            : num  0 0 0 0 1 0 0 1 0 0 ...
##  $ 150            : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ back           : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ freemsg        : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ fun            : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ hey            : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ like           : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ now            : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ send           : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ still          : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ weeks          : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ word           : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ xxx            : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ brother        : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ even           : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ speak          : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ treat          : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ callertune     : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ friends        : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ per            : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ press          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ set            : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ call           : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ claim          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ code           : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ customer       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ hours          : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ network        : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ prize          : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ reward         : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ selected       : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ valid          : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ valued         : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ winner         : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ camera         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ colour         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ latest         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mobile         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mobiles        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ months         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ update         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ enough         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gonna          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ home           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ soon           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ stuff          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ talk           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ today          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tonight        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ want           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 100            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cash           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cost           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ info           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pounds         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ reply          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pobox          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ urgent         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ week           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ won            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ help           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ right          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ take           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thank          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ times          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ will           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ wonderful      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ wont           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ words          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ date           : num  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]


9. Train a Support Vector Machine

I tried a naive Bayes model and a logistic regression but could only get an SVM to work.
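
A naive Bayes attempt with e1071 (the same package that provides svm()) would look roughly like the sketch below. This is illustrative and not run here; naive Bayes generally works better with the factor version of convert_counts() shown above than with numeric 0/1 columns.

# sketch only: naive Bayes on the same features (uses sms_train1/sms_test1 built below)
fit_nb <- naiveBayes(sms_train1[, -1], sms_train1$cat)
nb_pred <- predict(fit_nb, sms_test1[, -1])
table(Actual = sms_test1$cat, Predicted = nb_pred)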

# prep the data
sms_train1 <- cbind(cat=factor(train1$cat), sms_train)
sms_test1 <- cbind(cat=factor(test1$cat), sms_test)

# sms_train1[,-1]<-apply(sms_train1[,-1],MARGIN=2,as.numeric)
# sms_test1<-apply(sms_test, MARGIN=2, as.numeric)

sms_train1<-as.data.frame(sms_train1)
sms_test1<-as.data.frame(sms_test1)

# model specification
fit1 <- svm(cat~., data=sms_train1)

# print a summary
fit1
## 
## Call:
## svm(formula = cat ~ ., data = sms_train1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.00117096 
## 
## Number of Support Vectors:  1544
fit1.pred <- predict(fit1, na.omit(sms_test1))

fit1.perf <- table(na.omit(sms_test1$cat), fit1.pred, dnn=c("Actual", "Predicted"))

fit1.perf
##       Predicted
## Actual  ham spam
##   ham  1163   43
##   spam  184    2
# head(fit1.pred, 20)


10. Prediction

As noted earlier, the model doesn’t do well at predicting spam. On the test data it correctly identified only 2 of the 186 spam messages; the other 184 were classed as ham. It also misclassified 43 ham messages as spam. Clearly more tuning is needed.
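
As a quick hand check, the headline rates from the introduction fall straight out of the table above; the caret confusionMatrix() output below reports the same values.

# recompute the headline rates from the fit1.perf counts (positive class = ham)
tp <- 1163   # ham predicted as ham
fn <- 43     # ham predicted as spam
fp <- 184    # spam predicted as ham
tn <- 2      # spam predicted as spam
(tp + tn) / (tp + tn + fp + fn)   # accuracy    ~ 0.837
tp / (tp + fn)                    # sensitivity ~ 0.964
tn / (tn + fp)                    # specificity ~ 0.011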

confMatrix1 <- confusionMatrix(fit1.pred, sms_test1$cat, positive="ham")
confMatrix1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  ham spam
##       ham  1163  184
##       spam   43    2
##                                          
##                Accuracy : 0.8369         
##                  95% CI : (0.8165, 0.856)
##     No Information Rate : 0.8664         
##     P-Value [Acc > NIR] : 0.9993         
##                                          
##                   Kappa : -0.0367        
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.96434        
##             Specificity : 0.01075        
##          Pos Pred Value : 0.86340        
##          Neg Pred Value : 0.04444        
##              Prevalence : 0.86638        
##          Detection Rate : 0.83549        
##    Detection Prevalence : 0.96767        
##       Balanced Accuracy : 0.48755        
##                                          
##        'Positive' Class : ham            
##
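
Given the class imbalance, an obvious next step would be to re-weight the classes or tune the SVM’s cost and gamma parameters. Here is a rough sketch with e1071 (not run; the weights and grid values are placeholders, not recommendations):

# up-weight the minority class and grid-search cost/gamma
fit2 <- svm(cat ~ ., data = sms_train1, class.weights = c(ham = 1, spam = 5))

tuned <- tune.svm(cat ~ ., data = sms_train1,
                  gamma = c(0.001, 0.01, 0.1),
                  cost = c(1, 10, 100))
summary(tuned)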