Tasks:

  1. Text mining (grouping texts that belong to the same cluster) using the Latent Dirichlet allocation (LDA) algorithm
  2. Using the Naive Bayes algorithm to predict whether or not an SMS message is spam

Get data

For both tasks, I will use the sms.csv dataset from the UCI Machine Learning Repository.


setwd("C:/Users/Owner/Desktop/MachineLearningR_sampleData")
text <- read.csv("sms.csv", header = TRUE)
str(text)
## 'data.frame':    5559 obs. of  2 variables:
##  $ type: Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...
##  $ Text: Factor w/ 5156 levels "'An Amazing Quote'' - Sometimes in life its difficult to decide whats wrong!! a lie that brings a smile or the "| __truncated__,..: 1651 2557 257 626 3308 190 357 3392 2726 1079 ...
summary(text)
##    type     
##  ham :4812  
##  spam: 747  
##             
##             
##             
##             
##             
##                                                                                                                                                     Text     
##  Sorry, I'll call later                                                                                                                               :  30  
##  I cant pick the phone right now. Pls send a message                                                                                                  :  12  
##  Ok...                                                                                                                                                :  10  
##  7 wonders in My WORLD 7th You 6th Ur style 5th Ur smile 4th Ur Personality 3rd Ur Nature 2nd Ur SMS and 1st Ur Lovely Friendship... good morning dear:   4  
##  Ok                                                                                                                                                   :   4  
##  Ok.                                                                                                                                                  :   4  
##  (Other)                                                                                                                                              :5495
head(text, n = 10)
##    type
## 1   ham
## 2   ham
## 3   ham
## 4  spam
## 5  spam
## 6   ham
## 7   ham
## 8   ham
## 9  spam
## 10  ham
##                                                                                                                                                                 Text
## 1                                                                                                                  Hope you are having a good week. Just checking in
## 2                                                                                                                                            K..give back my thanks.
## 3                                                                                                                        Am also doing in cbe only. But have to pay.
## 4             complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+
## 5  okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm
## 6                                                                                                                 Aiya we discuss later lar... Pick u up at 4 is it?
## 7                                                                                                                                             Are you this much buzy
## 8                                                                                                                                    Please ask mummy to call father
## 9  Marvel Mobile Play the official Ultimate Spider-man game (£4.50) on ur mobile right now. Text SPIDER to 83338 for the game & we ll send u a FREE 8Ball wallpaper
## 10                                                                                                                    fyi I'm at usf now, swing by the room whenever

Text mining using the Latent Dirichlet allocation (LDA) algorithm

Data preparation

From the first 10 rows of the “text” dataset, it is evident that the messages contain several special characters. Some of these characters cannot be removed using the functions in the tm package, so I will start by removing them with the gsub() function.

text$Text2 <- as.data.frame(sapply(text$Text, gsub, pattern = "[^a-zA-Z0-9]", replacement = " "))

head(text$Text2, n = 10)
##                                                                                                 sapply(text$Text, gsub, pattern = "[^a-zA-Z0-9]", replacement = " ")
## 1                                                                                                                  Hope you are having a good week  Just checking in
## 2                                                                                                                                            K  give back my thanks 
## 3                                                                                                                        Am also doing in cbe only  But have to pay 
## 4             complimentary 4 STAR Ibiza Holiday or   10 000 cash needs your URGENT collection  09066364349 NOW from Landline not to lose out  Box434SK38WP150PPM18 
## 5  okmail  Dear Dave this is your final notice to collect your 4  Tenerife Holiday or  5000 CASH award  Call 09061743806 from landline  TCs SAE Box326 CW25WX 150ppm
## 6                                                                                                                 Aiya we discuss later lar    Pick u up at 4 is it 
## 7                                                                                                                                             Are you this much buzy
## 8                                                                                                                                    Please ask mummy to call father
## 9  Marvel Mobile Play the official Ultimate Spider man game    4 50  on ur mobile right now  Text SPIDER to 83338 for the game   we ll send u a FREE 8Ball wallpaper
## 10                                                                                                                    fyi I m at usf now  swing by the room whenever
str(text)
## 'data.frame':    5559 obs. of  3 variables:
##  $ type : Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...
##  $ Text : Factor w/ 5156 levels "'An Amazing Quote'' - Sometimes in life its difficult to decide whats wrong!! a lie that brings a smile or the "| __truncated__,..: 1651 2557 257 626 3308 190 357 3392 2726 1079 ...
##  $ Text2:'data.frame':   5559 obs. of  1 variable:
##   ..$ sapply(text$Text, gsub, pattern = "[^a-zA-Z0-9]", replacement = " "): Factor w/ 5155 levels "   ","       ",..: 1652 2524 257 626 3308 190 357 3392 2726 1079 ...

I will create a corpus from the cleaned “Text2” column, with one text document per SMS message. This will generate 5559 text documents.

library(tm)
## Warning: package 'tm' was built under R version 3.4.2
## Loading required package: NLP
# extract the character vector from the nested data frame so that
# each message becomes its own document
text_corpus <- VCorpus(VectorSource(as.character(text$Text2[[1]])))

print(text_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5559

Using the tm package, I will perform a series of cleaning steps: lowercasing, word stemming, removal of numbers, stopwords and punctuation, and whitespace stripping.

text_corpus_clean <- tm_map(text_corpus, content_transformer(tolower)) # lowercase
text_corpus_clean <- tm_map(text_corpus_clean, stemDocument)             # stem words
text_corpus_clean <- tm_map(text_corpus_clean, removeNumbers)            # remove digits
text_corpus_clean <- tm_map(text_corpus_clean, removeWords, stopwords()) # remove stopwords
text_corpus_clean <- tm_map(text_corpus_clean, removePunctuation)        # remove punctuation
text_corpus_clean <- tm_map(text_corpus_clean, stripWhitespace)          # collapse extra spaces
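
As a quick sanity check, I can compare the first message before and after cleaning (output not shown):

# inspect the first document before and after the cleaning steps
as.character(text_corpus[[1]])
as.character(text_corpus_clean[[1]])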

Data Exploration

I will examine the words that appear at least 50 times across all the documents.

library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.4.2
## Loading required package: RColorBrewer
wordcloud(text_corpus_clean, min.freq = 50, random.order = FALSE,
          colors=brewer.pal(8, "Dark2"))

Model training

I will use the DocumentTermMatrix() function from the tm package to build a matrix of word frequencies for each document.

text_dtm <- DocumentTermMatrix(text_corpus_clean)
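
Each row of this matrix corresponds to a document and each column to a term, with cells holding term counts (mostly zeros). To get a feel for the structure, a small corner of the matrix can be inspected (a minimal sketch; output not shown):

# peek at a few documents and terms of the sparse matrix
inspect(text_dtm[1:5, 10:15])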

I will use the LDA() function from the topicmodels package to fit a two-topic LDA model.

library(topicmodels)
## Warning: package 'topicmodels' was built under R version 3.4.2
text_lda <- LDA(text_dtm, k = 2, method = "VEM", control = NULL)
text_lda
## A LDA_VEM topic model with 2 topics.

Model Evaluation

I will determine the 10 terms with the highest beta (word-topic probability) in each topic, using the tidy() function from the tidytext package and ggplot2 for visualization.

library(tidytext)
## Warning: package 'tidytext' was built under R version 3.4.2
text_topics <- tidy(text_lda, matrix = "beta")
## Warning: package 'bindrcpp' was built under R version 3.4.2
text_topics
## # A tibble: 12,168 x 3
##    topic        term         beta
##    <int>       <chr>        <dbl>
##  1     1         aah 1.241060e-04
##  2     2         aah 3.361311e-05
##  3     1       aaniy 1.350850e-05
##  4     2       aaniy 2.623806e-05
##  5     1 aaooooright 4.381687e-05
##  6     2 aaooooright 9.883284e-06
##  7     1       aathi 1.300980e-04
##  8     2       aathi 1.309620e-04
##  9     1       abbey 1.449761e-05
## 10     2       abbey 2.570433e-05
## # ... with 12,158 more rows
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
text_top_terms <- text_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

text_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

I will also create a model with 5 topics. (One of the disadvantages of the LDA algorithm is that the number of topics must be chosen in advance, and there is no definitive way to tell whether the chosen number is optimal; one common heuristic is sketched below.)
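
One rough heuristic, not used in this analysis, is to compare the perplexity of models fitted with different numbers of topics; lower perplexity indicates a better statistical fit, although it does not always yield the most interpretable topics. A minimal sketch using the perplexity() function from topicmodels (the candidate values of k here are arbitrary):

library(topicmodels)
# fit candidate models and compare their perplexity (lower is better);
# ideally perplexity would be computed on held-out documents
candidate_k <- c(2, 3, 5, 10)
perplexities <- sapply(candidate_k, function(k) {
  fit <- LDA(text_dtm, k = k, method = "VEM", control = list(seed = 1234))
  perplexity(fit, text_dtm)
})
data.frame(k = candidate_k, perplexity = perplexities)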

library(topicmodels)
text_lda <- LDA(text_dtm, k = 5, method = "VEM", control = NULL)
text_lda
## A LDA_VEM topic model with 5 topics.

Model Evaluation

Again, I will determine the 10 terms with the highest beta in each topic, using the tidy() function from the tidytext package and ggplot2 for visualization.

library(tidytext)
text_topics <- tidy(text_lda, matrix = "beta")
text_topics
## # A tibble: 30,420 x 3
##    topic  term         beta
##    <int> <chr>        <dbl>
##  1     1   aah 5.491811e-05
##  2     2   aah 7.560095e-05
##  3     3   aah 7.076431e-05
##  4     4   aah 3.100332e-05
##  5     5   aah 9.144448e-05
##  6     1 aaniy 1.234812e-06
##  7     2 aaniy 3.448618e-05
##  8     3 aaniy 3.646055e-06
##  9     4 aaniy 3.090558e-05
## 10     5 aaniy 3.478404e-05
## # ... with 30,410 more rows
library(ggplot2)
library(dplyr)

text_top_terms <- text_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

text_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

A slightly different approach is to determine the terms with the greatest difference in beta between topic 1 and topic 2, calculated as log2(beta2/beta1) for terms with a beta greater than 0.001 in at least one of the two topics.

library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.2
beta_spread <- text_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread
## # A tibble: 273 x 7
##       term       topic1       topic2       topic3       topic4
##      <chr>        <dbl>        <dbl>        <dbl>        <dbl>
##  1 account 0.0012547224 0.0007640345 0.0001829796 0.0009127409
##  2 address 0.0006427047 0.0010315629 0.0005301596 0.0001662997
##  3   aight 0.0011712310 0.0006687148 0.0009430776 0.0004354113
##  4 alreadi 0.0025329355 0.0009595809 0.0023938813 0.0017457056
##  5    also 0.0014501787 0.0011277313 0.0021345402 0.0013796392
##  6   alway 0.0008326815 0.0014345918 0.0002037295 0.0010203782
##  7     ani 0.0037893770 0.0011106809 0.0011147840 0.0033197864
##  8  answer 0.0011276646 0.0005401424 0.0011193087 0.0006660775
##  9   anyth 0.0029081518 0.0002152274 0.0015243450 0.0015748219
## 10  around 0.0020112457 0.0002674228 0.0001493653 0.0018431603
## # ... with 263 more rows, and 2 more variables: topic5 <dbl>,
## #   log_ratio <dbl>
beta_spread %>%
  group_by(direction = log_ratio > 0) %>%
  top_n(10, abs(log_ratio)) %>%
  ungroup() %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(term, log_ratio)) +
  geom_col() +
  labs(y = "Log2 ratio of beta in topic 2 / topic 1") +
  coord_flip()

Besides beta (the per-topic word probabilities), tidy() can also extract gamma, the estimated proportion of each document's words that come from each topic.

text_topics <- tidy(text_lda, matrix = "gamma")
text_topics
## # A tibble: 5 x 3
##   document topic     gamma
##      <chr> <int>     <dbl>
## 1        1     1 0.2081219
## 2        1     2 0.1966877
## 3        1     3 0.1736241
## 4        1     4 0.1981873
## 5        1     5 0.2233790
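
To assign each message to its single most probable topic, one could keep the row with the largest gamma per document (a sketch in the same dplyr style; output not shown):

library(dplyr)
# keep each document's most probable topic
doc_topics <- text_topics %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup()
head(doc_topics)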

Task 2: Using the Naive Bayes algorithm to predict whether an SMS message is spam

Data Exploration

ggplot(text, aes(x = type)) +
  theme_bw() +
  geom_bar() +
  theme(text = element_text(size = 20)) +
  labs(y = "Number of sms",
       title = "Classification of sms messages")

Word cloud for spam and ham messages

spam <- subset(text, type == "spam")

wordcloud(spam$Text, max.words = 50, random.order = FALSE,
          colors=brewer.pal(8, "Dark2"), main = "spam")

ham <- subset(text, type != "spam")
wordcloud(ham$Text, max.words = 50, random.order = FALSE,
          colors=brewer.pal(8, "Dark2"), main = "ham")

Data preparation

This time I will build the corpus and the document-term matrix in a single step, using the control list of DocumentTermMatrix() to apply the same cleaning steps as before (lowercasing, number removal, stopword removal, punctuation removal and stemming).

text_corpus <- VCorpus(VectorSource(text$Text))

text_dtm <- DocumentTermMatrix(text_corpus, control = list(
tolower = TRUE,
removeNumbers = TRUE,
stopwords = TRUE,
removePunctuation = TRUE,
stemming = TRUE
))


text_dtm
## <<DocumentTermMatrix (documents: 5559, terms: 6965)>>
## Non-/sparse entries: 43231/38675204
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)

I will split the DTM into a training set (the first 4,500 messages, about 81%) and a test set (the remaining 1,059 messages).

text_dtm_train <- text_dtm[1:4500, ]
text_dtm_test <- text_dtm[4501:5559, ]

text_dtm_train_labels <- text[1:4500, ]$type
text_dtm_test_labels <- text[4501:5559, ]$type
freq_words <- findFreqTerms(text_dtm, 10)
text_dtm_train_freq_words <- text_dtm_train[ , freq_words]
text_dtm_test_freq_words <- text_dtm_test[ , freq_words]

This trims the vocabulary down to words that appear at least 10 times in the entire text dataset.

Next, I will convert the numerical word counts into a categorical Yes/No feature, since Naive Bayes classifiers are typically trained on categorical features.

convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")  # any positive count becomes "Yes"
}
text_dtm_train_ready <- apply(text_dtm_train_freq_words, MARGIN = 2,
                              convert_counts)

text_dtm_test_ready <- apply(text_dtm_test_freq_words, MARGIN = 2,
                             convert_counts)

Model training using the naiveBayes() function in the e1071 package

library(e1071)
## Warning: package 'e1071' was built under R version 3.4.2
model_text <- naiveBayes(text_dtm_train_ready, text_dtm_train_labels, laplace = 1)
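
The fitted model stores, for each word, the conditional probability of observing "Yes" or "No" given ham or spam. These tables can be inspected directly (a quick sketch; "call" is just an example term assumed to be among the frequent words):

# conditional probability table for one example term
model_text$tables[["call"]]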

Model evaluation

predict_test <- predict(model_text, text_dtm_test_ready)
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.2
CrossTable(predict_test, text_dtm_test_labels,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1059 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |       911 |        22 |       933 | 
##              |     0.976 |     0.024 |     0.881 | 
##              |     0.997 |     0.152 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         3 |       123 |       126 | 
##              |     0.024 |     0.976 |     0.119 | 
##              |     0.003 |     0.848 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       914 |       145 |      1059 | 
##              |     0.863 |     0.137 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
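
The model classifies (911 + 123) / 1059, or about 97.6%, of the test messages correctly, misclassifying 22 spam messages as ham and 3 ham messages as spam.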

References:

  1. Machine Learning with R (2nd edition), by Brett Lantz
  2. ggplot2: Elegant Graphics for Data Analysis (2nd edition), by Hadley Wickham
  3. Text Mining with R, by Julia Silge and David Robinson