Introduction

Document classification or Document categorization is to classify documents into one or more classes/categories manually or algorithmically. Today we try to classify classify new “test” documents using already classified “training” documents, using spam/ham dataset, then predict the class of new documents.

classDiagram

classDiagram

The Classification Model:

We can divide the steps into:

We implement the document classification using tm/plyr packages, as preliminary steps, we need to load the required libraries into R environment:

Corpus creation and The Data for our SPAM model

Corpus is a large and structured set of texts used for analysis. We will use the dataset from the SMS Spam Collection to create a Spam Classifier. This can be downloaded from the UCI Machine Learning Repository.

This dataset includes the text of SMS messages along with a label indicating whether the message is unwanted. Junk messages are labeled spam, while legitimate messages are labeled ham.

The collection is composed by just one text file, where each line has the correct class followed by the raw message. We offer some examples bellow:

classDiagram

classDiagram

#Import libraries
library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer) 
library(e1071)         #For Naive Bayes
library(caret)         #For the Confusion Matrix
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(gmodels) #provides CrossTable() function for comparison

#Import data
RawSMS <- read.csv("data/SMSSpamCollection.csv", sep="\t", header=TRUE)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## EOF within quoted string
head(RawSMS)
##    ham
## 1  ham
## 2 spam
## 3  ham
## 4  ham
## 5 spam
## 6  ham
##                                               Go.until.jurong.point..crazy...Available.only.in.bugis.n.great.world.la.e.buffet....Cine.there.got.amore.wat...
## 1                                                                                                                               Ok lar... Joking wif u oni...
## 2 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 3                                                                                                           U dun say so early hor... U c already then say...
## 4                                                                                               Nah I don't think he goes to usf, he lives around here though
## 5         FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
## 6                                                                               Even my brother is not like to speak with me. They treat me like aids patent.
#Select & rename appropriate columns of the dataset
RawSMS <- RawSMS[, 1:2]
colnames(RawSMS) <- c("Tag", "Msg")
str(RawSMS)
## 'data.frame':    3183 obs. of  2 variables:
##  $ Tag: Factor w/ 2 levels "ham","spam": 1 2 1 1 2 1 1 2 2 1 ...
##  $ Msg: Factor w/ 3026 levels " &lt;#&gt;  mins but i had to stop somewhere first.",..: 1883 624 2506 1677 642 577 234 2791 755 1292 ...
RawSMS$Tag <- factor(RawSMS$Tag)

# creating our corpus
text_corpus <- VCorpus(VectorSource(RawSMS$Msg))

# Viewing the content of more than one texts using lapply() function
lapply(text_corpus[1:5], as.character) 
## $`1`
## [1] "Ok lar... Joking wif u oni..."
## 
## $`2`
## [1] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
## 
## $`3`
## [1] "U dun say so early hor... U c already then say..."
## 
## $`4`
## [1] "Nah I don't think he goes to usf, he lives around here though"
## 
## $`5`
## [1] "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"

Cleaning our Corpus

cleanCorpus <- tm_map(text_corpus, content_transformer(tolower)) # lowercase all texts
cleanCorpus <- tm_map(cleanCorpus, removeNumbers) # remove all numbers
cleanCorpus <- tm_map(cleanCorpus, removeWords, stopwords('english')) # remove all common words such as to, but and etc.
cleanCorpus <- tm_map(cleanCorpus, removePunctuation) # remove all punctuation
cleanCorpus <- tm_map(cleanCorpus, stripWhitespace) # remove all whitespace


text_dtm <- DocumentTermMatrix(cleanCorpus)
inspect(text_dtm)
## <<DocumentTermMatrix (documents: 3183, terms: 7956)>>
## Non-/sparse entries: 34573/25289375
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   call can free get ham just ltgt now spam will
##   1517    0   5    1   1  23    0    8   5    4    1
##   1875   18  17    6   9 157   15   14  17   23   10
##   2452   15   8    9  23 185    6   15  20   26   11
##   2471    8  15    4  10  93    5   16   8   12    5
##   2550   30   9    9  12 145   10    9  21   33   12
##   2963   35  19    9  24 269   16    9  24   39   16
##   3183   53  38   27  36 434   30   25  36   58   45
##   472     0   0    0   1   0    0    0   0    0   11
##   670    15  13   12  21 247   27    9  29   30   17
##   98     70  45   33  37 528   37   25  50   85   40

Creating Train & test data

As the dataset is a randomnly sorted, we can directly divide the data into the training and test dataset. Let’s take a proportion of 75:25 for the training:test data.

# Creating train and test portions 
train <- text_dtm[1:2387, ] # 75% for training
test <- text_dtm[2387:3183, ] # 25% for testing
train_type <- RawSMS[1:2387, ]$Tag
test_type <- RawSMS[2387:3183, ]$Tag
#training portion
tbl_train <- prop.table(table(train_type))
tbl_train
## train_type
##       ham      spam 
## 0.8638458 0.1361542

Using the prop.table()function we can validate that the data in the training & test sets are both split into 85% ham & 15% spam messages.

#testing portion
tbl_test <- prop.table(table(test_type))
tbl_test
## test_type
##       ham      spam 
## 0.8582183 0.1417817

Build the spam cloud

Wordclouds are simple yet effective for text visualization, so let’s visualize the spam/ham messages using some wordclouds.

A word cloud depicts the frequency of words appearing in the text. The larger the size of the word, the greater the frequency.

spamText <- subset(RawSMS, Tag == "spam") 
wordcloud(spamText$Msg, max.words = 50, scale = c(5, 0.3),random.order = FALSE, rot.per = 0.15, colors = brewer.pal(8, "Dark2") )
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

hamText <- subset(RawSMS, Tag =="ham") # selecting ham texts
wordcloud(hamText$Msg, max.words = 50, scale = c(5, 0.3),random.order = FALSE, rot.per = 0.15, colors = brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

### Creating Indicator Features

We are going to transform the sparse matrix into something the Naive Bayes model can train. We will extract the most frequent words in the texts using using findFreqTerms()

freq_words <- findFreqTerms(train, 5) 
str(freq_words)
##  chr [1:1084] "abiola" "able" "abt" "accept" "access" "account" "across" ...

Converting numerical vectors of the DTM to categorical vector for the model

# Selecting only the frequent words from the train and test datasets
freq_words_train <- train[ , freq_words]
freq_words_test <- test[ , freq_words]


# creating a function for conversion
convert <- function(x) {x <- ifelse(x > 0, "y", "n")} 
train <- apply(freq_words_train, MARGIN = 2, convert)
test <- apply(freq_words_test, MARGIN = 2, convert)
str(train) # verifying the conversion
##  chr [1:2387, 1:1084] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:2387] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:1084] "abiola" "able" "abt" "accept" ...

Training our Model

We will use naiveBayes() function from the e1071 package to train our classifier. The algorithm uses the presence or absence of words to assess the probability that a given SMS message is spam.

# Creating a Naive Bayes classifier
sms_classifier <- naiveBayes(train, train_type)

# Making prediction & evaluation with the classifier
test_prediction <- predict(sms_classifier, test)

CrossTable(test_prediction, test_type, 
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  797 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |       681 |        26 |       707 | 
##              |     0.963 |     0.037 |     0.887 | 
##              |     0.996 |     0.230 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         3 |        87 |        90 | 
##              |     0.033 |     0.967 |     0.113 | 
##              |     0.004 |     0.770 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       684 |       113 |       797 | 
##              |     0.858 |     0.142 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

This Classifier gives us a higher accuracy.

let’s improve the model by tweaking one of the parameters of the NaiveBayes classifier, that is laplace parameter, and see what’s going to happen.

#laplace assures that one word is not mislabled just because it appeared once on ham/spam texts
sms_classifier_improved <- naiveBayes(train, train_type, laplace = 1)
test_prediction_improved <- predict(sms_classifier_improved, test)

CrossTable(test_prediction_improved, test_type, 
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  797 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |       678 |        28 |       706 | 
##              |     0.960 |     0.040 |     0.886 | 
##              |     0.991 |     0.248 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         6 |        85 |        91 | 
##              |     0.066 |     0.934 |     0.114 | 
##              |     0.009 |     0.752 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       684 |       113 |       797 | 
##              |     0.858 |     0.142 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

As we can see, only 24 messages are now mislabeled.