1. Introduction
For this assignment, we're asked to use R text-analysis tools to download a text corpus and classify its documents. I chose an SMS message dataset from the UCI Machine Learning Repository. I used the tidytext and tm packages to explore the data and prepare it for modeling, then fit a Support Vector Machine to predict whether messages were spam or ham. The model did well at predicting ham, but not spam. Overall prediction accuracy was about 84 percent. Sensitivity – the probability of correctly predicting ham – was about 96 percent. Specificity – the probability of correctly predicting spam – was barely 1 percent. Detailed results are summarized in the confusion matrix at the end of this document. I did not have enough time to tune the model.
2. Get the SMS Message data from UCI
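The code below assumes a handful of packages are already loaded (the setup chunk isn't shown in this write-up); a minimal version of that chunk might look like this:
# packages assumed loaded for the rest of the analysis;
# wordcloud is attached later, in the section where it is first used
library(tidytext)   # unnest_tokens(), get_sentiments()
library(dplyr)      # pipes, count(), mutate(), joins
library(tidyr)      # spread()
library(ggplot2)
library(ggthemes)   # theme_tufte()
library(tm)         # Corpus(), DocumentTermMatrix()
library(pander)     # pander() tables
library(caret)      # createDataPartition(), confusionMatrix()
library(e1071)      # svm()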
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
if (!file.exists("smsspamcollection.zip")) {
download.file(url=url, destfile="smsspamcollection.zip", method="curl")}
unzip("smsspamcollection.zip")
# disable quoting (quote="") to avoid a parsing error in the raw text
corpus <- read.delim("SMSSpamCollection", sep="\t", header=F, colClasses="character", quote="")
# str(corpus)
colnames(corpus) <- c("cat", "text")
corpus$cat <- factor(corpus$cat)
pander(table(corpus$cat), caption="Spam and Ham Count in Raw Data")
| ham | spam |
|---|---|
| 4827 | 747 |
3. Use tidytext to explore the data
In these steps we tokenize the data – i.e., break the messages into individual words – then remove “stop words” (common words such as prepositions and articles that typically don’t add meaning) and chart the most frequently used words.
# first tokenize
tidy_corpus <- corpus %>%
unnest_tokens(word, text)
# remove stop words
data("stop_words")
tidy_corpus <- tidy_corpus %>%
anti_join(stop_words)
tidy_corpus %>%
count(word, sort = TRUE) %>%
filter(n > 100) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
theme_tufte()
4. Sentiment analysis, ham vs. spam
tidytext has tools for sentiment analysis. Following examples in our text, I was able to find the words that contributed most to messages with positive or negative sentiment.
tidy_spam <- tidy_corpus %>%
filter(cat == 'spam') %>%
count(word, sort = TRUE)
tidy_spam <-
tidy_spam %>%
mutate(prop = n/sum(n))
tidy_ham <- tidy_corpus %>%
filter(cat == 'ham') %>%
count(word, sort = TRUE)
tidy_ham <-
tidy_ham %>%
mutate(prop = n/sum(n))
# try sentiment analysis
nrcmad <- get_sentiments("nrc") %>%
filter(sentiment == "anger")
tidy_corpus %>%
inner_join(nrcmad) %>%
count(word, sort = TRUE)
## # A tibble: 182 × 2
## word n
## <chr> <int>
## 1 cash 88
## 2 money 59
## 3 ill 43
## 4 shit 35
## 5 bad 32
## 6 mob 25
## 7 hurt 23
## 8 hot 22
## 9 words 22
## 10 feeling 21
## # ... with 172 more rows
# spam is much more positive
corpus_sentiment <- tidy_corpus %>%
inner_join(get_sentiments("bing")) %>%
count(cat, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
# sentiment of words in sms messages
sms_word_counts <- tidy_corpus %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
sms_word_counts
## # A tibble: 752 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 free positive 283
## 2 love positive 207
## 3 happy positive 108
## 4 prize positive 92
## 5 miss negative 79
## 6 won positive 73
## 7 win positive 72
## 8 urgent negative 66
## 9 nice positive 58
## 10 fine positive 50
## # ... with 742 more rows
# plot it
sms_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
# spam sentiments
spam_word_counts <- tidy_corpus %>%
filter(cat == 'spam') %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
ham_word_counts <- tidy_corpus %>%
filter(cat == 'ham') %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
spam_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip() +
theme_tufte()
ham_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip() +
theme_tufte()
5. Word Cloud – I hate ’em
Sure, they’re ugly and a total cliché. But here are a couple anyway.
library(wordcloud)
## Loading required package: RColorBrewer
plot.new()
par(mfrow=c(1,2))
tidy_ham %>%
with(wordcloud(word, n, max.words = 50))
tidy_spam %>%
with(wordcloud(word, n, max.words = 50))
6. Create and Inspect a Document Term Matrix
tm has functions that help prepare text data for modeling. One essential data structure is the Document Term Matrix (DTM), in which every row is a document and every column is a term (word). Like tidytext, tm has a number of built-in functions to find frequent words, inspect the matrix, etc. For instance, inspect() on the reduced matrix below reports 5,574 documents, 854 terms, a maximal term length of 15 characters and term-frequency weighting. Combined with base R functions, tm also tells us that those 854 terms are exactly the words that appear 10 or more times in the corpus. For our model, we keep only these frequent terms and exclude words with fewer than 10 appearances in the messages.
Output of the tm inspect() function summarizes the number of documents and terms in the matrix and displays a sample of 10 documents and 10 frequent terms.
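For instance, once the full DTM is built in the next chunk, the 854-term count can be checked with a one-liner like the following (a quick sketch, not part of the original workflow):
# count the terms that occur 10 or more times across the whole corpus
length(findFreqTerms(dtm, lowfreq = 10))   # should match the 854 terms reported below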
## cleanup steps ##
# first put the corpus in tm format
corpus2 <- Corpus(VectorSource(corpus$text))
# standardize to lowercase
corpus2 <- tm_map(corpus2, content_transformer(tolower))
# remove tm stopwords
corpus2 <- tm_map(corpus2, removeWords, stopwords())
# standardize whitespaces
corpus2 <- tm_map(corpus2, stripWhitespace)
# remove punctuation
corpus2 <- tm_map(corpus2, removePunctuation)
# corpus2[[1]]$content
# make a dtm with the cleaned-up corpus
dtm <- DocumentTermMatrix(corpus2)
# inspect(dtm)
# words appearing at least 10 times
features <- findFreqTerms(dtm, 10)
# summary(features)
# head(features)
# limit the DTM to the frequent terms (10 or more appearances) via the dictionary option
dtm2 <- DocumentTermMatrix(corpus2, control = list(dictionary = features))
inspect(dtm2)
## <<DocumentTermMatrix (documents: 5574, terms: 854)>>
## Non-/sparse entries: 29424/4730772
## Sparsity : 99%
## Maximal term length: 15
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs call can free get just know like ltgt now will
## 1086 0 0 0 1 0 0 1 0 0 11
## 1361 0 0 0 1 0 0 1 3 0 2
## 1580 0 0 0 0 0 0 0 18 0 0
## 1864 0 0 0 0 0 1 1 0 0 0
## 2135 1 0 0 0 2 0 0 0 0 1
## 2159 0 0 0 0 0 0 0 0 0 0
## 2381 0 1 0 0 0 0 0 1 0 0
## 2435 0 3 1 1 0 0 0 6 0 0
## 2850 0 0 0 0 0 0 0 0 0 0
## 3018 0 0 0 0 0 0 0 2 0 0
# freq <- colSums(as.matrix(dtm2))
# length(freq)
# ord <- order(freq)
# freq[head(ord)]
# freq[tail(ord)]
#head(table(freq), 10)
# tail(table(freq), 10)
7. Split into test, training sets
We’ve standardized the corpus by removing stopwords, punctuation and extra white space and by converting everything to lowercase. In text messages, capitalization and punctuation sometimes do carry meaning; for now we’ll assume they don’t and fit a model that sets a baseline classification rate. Next we split the data into training and test sets before building a classifier. The function createDataPartition() from the caret package does this for us. The table shows that the proportions of ham and spam messages in the training and test sets are approximately the same as in the original data.
set.seed(8080) # fix the seed so the partition is reproducible
train_idx <- createDataPartition(corpus$cat, p=0.75, list=FALSE)
# set for the original raw data
train1 <- corpus[train_idx,]
test1 <- corpus[-train_idx,]
# set for the cleaned-up data
train2 <- corpus2[train_idx]
test2 <- corpus2[-train_idx]
# check to see if the proportions of ham/spam are the same
# helper: percentage breakdown of a factor, rounded to one decimal
frqtab <- function(x) {
round(100*prop.table(table(x)), 1)
}
ft_orig <- frqtab(corpus$cat)
ft_train <- frqtab(train1$cat)
ft_test <- frqtab(test1$cat)
ft_df <- as.data.frame(cbind(ft_orig, ft_train, ft_test))
colnames(ft_df) <- c("Original", "Training set", "Test set")
pander(head(ft_df), caption="Ham/Spam in Test and Training Sets")
|  | Original | Training set | Test set |
|---|---|---|---|
| ham | 86.6 | 86.6 | 86.6 |
| spam | 13.4 | 13.4 | 13.4 |
8. Convert Document Term Matrix
To work with classifiers, the Document Term Matrix has to be converted from a matrix of counts to a matrix of boolean indicators showing whether or not each word appears in a document. Here we build new matrices from the training and test data and convert them for modeling.
dict2 <- findFreqTerms(dtm2, lowfreq=10)
sms_train <- DocumentTermMatrix(train2, list(dictionary=dict2))
sms_test <- DocumentTermMatrix(test2, list(dictionary=dict2))
# convert counts to binary presence/absence indicators; the factor conversion
# (useful for Naive Bayes) is left commented out because the SVM needs numeric input
convert_counts <- function(x) {
x <- ifelse(x > 0, 1, 0)
# x <- factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}
sms_train <- sms_train %>% apply(MARGIN=2, FUN=convert_counts)
sms_test <- sms_test %>% apply(MARGIN=2, FUN=convert_counts)
sms_train <- as.data.frame(sms_train)
sms_test <- as.data.frame(sms_test)
str(sms_train)
## 'data.frame': 4182 obs. of 854 variables:
## $ available : num 1 0 0 0 0 0 0 0 0 0 ...
## $ crazy : num 1 0 0 0 0 0 0 0 0 0 ...
## $ got : num 0 1 0 0 0 0 1 0 0 0 ...
## $ great : num 0 1 0 0 0 0 0 0 0 0 ...
## $ point : num 0 1 0 0 0 0 0 0 0 0 ...
## $ wat : num 0 1 0 0 0 0 0 0 0 0 ...
## $ world : num 0 1 0 0 0 1 0 1 0 0 ...
## $ lar : num 0 1 0 0 0 0 0 0 0 0 ...
## $ wif : num 0 1 0 0 0 0 0 0 0 0 ...
## $ apply : num 0 1 0 0 0 0 0 0 0 0 ...
## $ comp : num 0 1 0 0 0 0 1 1 0 0 ...
## $ entry : num 0 1 0 0 0 0 1 0 0 0 ...
## $ final : num 0 1 0 0 0 0 0 0 0 0 ...
## $ free : num 0 0 1 0 0 0 0 0 0 0 ...
## $ may : num 0 0 1 0 0 0 0 0 0 0 ...
## $ receive : num 0 0 1 0 0 0 0 0 0 0 ...
## $ text : num 0 0 1 0 0 0 0 0 0 0 ...
## $ txt : num 0 0 0 1 0 0 0 0 0 0 ...
## $ win : num 0 0 0 1 0 0 0 0 0 0 ...
## $ wkly : num 0 0 0 1 0 0 0 0 0 0 ...
## $ already : num 0 0 0 1 0 0 0 0 0 0 ...
## $ dun : num 0 0 0 1 0 0 0 0 0 0 ...
## $ early : num 0 0 0 1 0 0 0 0 0 0 ...
## $ say : num 0 0 0 0 1 1 0 0 0 0 ...
## $ around : num 0 0 0 0 1 0 0 1 0 0 ...
## $ goes : num 0 0 0 0 1 0 0 0 0 0 ...
## $ nah : num 0 0 0 0 1 0 0 0 0 0 ...
## $ think : num 0 0 0 0 1 0 0 0 0 0 ...
## $ though : num 0 0 0 0 1 0 0 0 0 0 ...
## $ usf : num 0 0 0 0 1 0 0 1 0 0 ...
## $ 150 : num 0 0 0 0 1 0 0 0 0 0 ...
## $ back : num 0 0 0 0 1 0 0 0 0 0 ...
## $ freemsg : num 0 0 0 0 1 0 0 0 0 0 ...
## $ fun : num 0 0 0 0 1 0 0 0 0 0 ...
## $ hey : num 0 0 0 0 1 0 0 0 0 0 ...
## $ like : num 0 0 0 0 0 1 0 0 0 0 ...
## $ now : num 0 0 0 0 0 1 0 0 0 0 ...
## $ send : num 0 0 0 0 0 1 0 0 0 0 ...
## $ still : num 0 0 0 0 0 1 0 0 0 0 ...
## $ weeks : num 0 0 0 0 0 1 0 0 0 0 ...
## $ word : num 0 0 0 0 0 1 0 0 0 0 ...
## $ xxx : num 0 0 0 0 0 1 0 0 0 0 ...
## $ brother : num 0 0 0 0 0 0 1 0 0 0 ...
## $ even : num 0 0 0 0 0 0 1 0 0 0 ...
## $ speak : num 0 0 0 0 0 0 1 0 0 0 ...
## $ treat : num 0 0 0 0 0 0 1 0 0 0 ...
## $ callertune : num 0 0 0 0 0 0 1 0 0 0 ...
## $ friends : num 0 0 0 0 0 0 1 0 0 0 ...
## $ per : num 0 0 0 0 0 0 1 0 0 0 ...
## $ press : num 0 0 0 0 0 0 0 1 0 0 ...
## $ set : num 0 0 0 0 0 0 0 1 0 0 ...
## $ call : num 0 0 0 0 0 0 0 1 0 0 ...
## $ claim : num 0 0 0 0 0 0 0 1 0 0 ...
## $ code : num 0 0 0 0 0 0 0 1 0 0 ...
## $ customer : num 0 0 0 0 0 0 0 0 1 0 ...
## $ hours : num 0 0 0 0 0 0 0 0 1 0 ...
## $ network : num 0 0 0 0 0 0 0 0 1 0 ...
## $ prize : num 0 0 0 0 0 0 0 0 1 0 ...
## $ reward : num 0 0 0 0 0 0 0 0 1 0 ...
## $ selected : num 0 0 0 0 0 0 0 0 1 1 ...
## $ valid : num 0 0 0 0 0 0 0 0 1 0 ...
## $ valued : num 0 0 0 0 0 0 0 0 1 0 ...
## $ winner : num 0 0 0 0 0 0 0 0 1 0 ...
## $ camera : num 0 0 0 0 0 0 0 0 0 1 ...
## $ colour : num 0 0 0 0 0 0 0 0 0 0 ...
## $ latest : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mobile : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mobiles : num 0 0 0 0 0 0 0 0 0 0 ...
## $ months : num 0 0 0 0 0 0 0 0 0 0 ...
## $ update : num 0 0 0 0 0 0 0 0 0 0 ...
## $ enough : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gonna : num 0 0 0 0 0 0 0 0 0 0 ...
## $ home : num 0 0 0 0 0 0 0 0 0 0 ...
## $ soon : num 0 0 0 0 0 0 0 0 0 0 ...
## $ stuff : num 0 0 0 0 0 0 0 0 0 0 ...
## $ talk : num 0 0 0 0 0 0 0 0 0 0 ...
## $ today : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tonight : num 0 0 0 0 0 0 0 0 0 0 ...
## $ want : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 100 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cost : num 0 0 0 0 0 0 0 0 0 0 ...
## $ info : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pounds : num 0 0 0 0 0 0 0 0 0 0 ...
## $ reply : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pobox : num 0 0 0 0 0 0 0 0 0 0 ...
## $ urgent : num 0 0 0 0 0 0 0 0 0 0 ...
## $ week : num 0 0 0 0 0 0 0 0 0 0 ...
## $ won : num 0 0 0 0 0 0 0 0 0 0 ...
## $ help : num 0 0 0 0 0 0 0 0 0 0 ...
## $ right : num 0 0 0 0 0 0 0 0 0 0 ...
## $ take : num 0 0 0 0 0 0 0 0 0 0 ...
## $ thank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ times : num 0 0 0 0 0 0 0 0 0 0 ...
## $ will : num 0 0 0 0 0 0 0 0 0 0 ...
## $ wonderful : num 0 0 0 0 0 0 0 0 0 0 ...
## $ wont : num 0 0 0 0 0 0 0 0 0 0 ...
## $ words : num 0 0 0 0 0 0 0 0 0 0 ...
## $ date : num 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
9. Train a Support Vector Machine
I tried a naive Bayes model and a logistic regression but could only get an SVM to work.
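For reference, a naive Bayes attempt along the lines hinted at by the commented-out Absent/Present conversion above might look like the sketch below (not the code actually used here; it assumes e1071 is loaded):
# sketch only: refit with e1071::naiveBayes(), which expects categorical features
to_factor <- function(x) factor(ifelse(x > 0, "Present", "Absent"), levels=c("Absent", "Present"))
nb_train <- sms_train; nb_train[] <- lapply(nb_train, to_factor)
nb_test <- sms_test; nb_test[] <- lapply(nb_test, to_factor)
nb_fit <- naiveBayes(nb_train, train1$cat, laplace=1)
nb_pred <- predict(nb_fit, nb_test)
table(Actual=test1$cat, Predicted=nb_pred)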
# prep the data
sms_train1 <- cbind(cat=factor(train1$cat), sms_train)
sms_test1 <- cbind(cat=factor(test1$cat), sms_test)
# sms_train1[,-1]<-apply(sms_train1[,-1],MARGIN=2,as.numeric)
# sms_test1<-apply(sms_test, MARGIN=2, as.numeric)
sms_train1<-as.data.frame(sms_train1)
sms_test1<-as.data.frame(sms_test1)
# model specification
fit1 <- svm(cat~., data=sms_train1)
# print a summary
fit1
##
## Call:
## svm(formula = cat ~ ., data = sms_train1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.00117096
##
## Number of Support Vectors: 1544
fit1.pred <- predict(fit1, na.omit(sms_test1))
fit1.perf <- table(na.omit(sms_test1$cat), fit1.pred, dnn=c("Actual", "Predicted"))
fit1.perf
## Predicted
## Actual ham spam
## ham 1163 43
## spam 184 2
# head(fit1.pred, 20)
10. Prediction
As noted earlier, the model doesn’t do well at predicting spam. On the test data it correctly identified only 2 of the 186 spam messages, misclassifying the other 184 as ham, and it also flagged 43 ham messages as spam. Clearly more tuning is needed.
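Before turning to caret’s fuller summary below, the headline rates quoted in the introduction can be recovered by hand from fit1.perf (a quick sketch):
# recompute accuracy, sensitivity (ham) and specificity (spam) from the Actual x Predicted table
accuracy <- sum(diag(fit1.perf)) / sum(fit1.perf) # (1163 + 2) / 1392
sensitivity <- fit1.perf["ham", "ham"] / sum(fit1.perf["ham", ]) # 1163 / 1206
specificity <- fit1.perf["spam", "spam"] / sum(fit1.perf["spam", ]) # 2 / 186
round(c(accuracy=accuracy, sensitivity=sensitivity, specificity=specificity), 3)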
confMatrix1 <- confusionMatrix(fit1.pred, sms_test1$cat, positive="ham")
confMatrix1
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 1163 184
## spam 43 2
##
## Accuracy : 0.8369
## 95% CI : (0.8165, 0.856)
## No Information Rate : 0.8664
## P-Value [Acc > NIR] : 0.9993
##
## Kappa : -0.0367
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.96434
## Specificity : 0.01075
## Pos Pred Value : 0.86340
## Neg Pred Value : 0.04444
## Prevalence : 0.86638
## Detection Rate : 0.83549
## Detection Prevalence : 0.96767
## Balanced Accuracy : 0.48755
##
## 'Positive' Class : ham
##
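Given the class imbalance and the untuned cost and gamma defaults, one natural next step (not run here, and only a sketch with illustrative grid values) would be to grid-search the SVM parameters and up-weight the rare spam class using e1071’s tune() and class.weights:
# illustrative sketch (not run): search over cost/gamma and up-weight spam to
# counter the roughly 6.5:1 ham/spam imbalance in the training data
tuned <- tune(svm, cat ~ ., data = sms_train1,
ranges = list(cost = c(0.1, 1, 10, 100), gamma = c(0.0005, 0.001, 0.005)),
class.weights = c(ham = 1, spam = 6))
summary(tuned)
fit2 <- tuned$best.model
confusionMatrix(predict(fit2, sms_test1), sms_test1$cat, positive = "ham")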