Wouldn't it be nice to classify a text message as spam or ham (a legitimate message)? This project aims to do exactly that. We will use the SMS Spam Collection dataset to build a document classifier with the Naive Bayes algorithm, predict the class of new messages in a held-out test set, and evaluate the model's performance with a confusion matrix.
library(tm)
## Loading required package: NLP
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
## Loading required package: lattice
library(e1071)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
The dataset can be downloaded from Kaggle.
It contains the text of SMS messages together with their respective label, spam or ham.
sms_raw <- read.csv("https://raw.githubusercontent.com/yinaS1234/data-607/main/project%204/spamham.csv", header = FALSE, stringsAsFactors = FALSE)
sms_raw <- sms_raw[,1:2]
colnames(sms_raw) <- c("Type", "Text")
sms_raw$Type <- factor(sms_raw$Type)
# Check the structure of the dataset
str(sms_raw)
## 'data.frame': 5573 obs. of 2 variables:
## $ Type: Factor w/ 3 levels "Category","ham",..: 1 2 2 3 2 2 3 2 2 3 ...
## $ Text: chr "Message" "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ ...
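Note that because the file is read with header = FALSE, the CSV's header row ("Category", "Message") comes in as an ordinary data row; that is why Type shows a third factor level, "Category", and Text starts with "Message". The analysis below keeps that row as read, which is why an empty "Category" level keeps appearing in later tables. A sketch of how one could drop it (all counts and proportions reported below were produced with the row still present):
# Hedged fix: drop the header row that was read in as data and
# remove the now-unused "Category" factor level
sms_raw <- sms_raw[sms_raw$Type != "Category", ]
sms_raw$Type <- droplevels(sms_raw$Type)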
# Check the number of spam and ham messages
table(sms_raw$Type)
##
## Category ham spam
## 1 4825 747
prop.table(table(sms_raw$Type))
##
## Category ham spam
## 0.0001794366 0.8657814463 0.1340391172
About 87% of the messages are ham, while the remaining 13% are spam.
Now that we have the tm package installed and loaded, we can create a corpus from the text data. A corpus is a collection of text documents. The cleaning steps are:
- Use the VCorpus() function to create a volatile corpus.
- Transform all text to lower case and remove punctuation and numbers.
- Remove stop words.
- Trim words to their root form through stemming with stemDocument().
- Remove extra white spaces.
# Create corpus
sms_corpus <- VCorpus(x = VectorSource(sms_raw$Text))
# Print corpus
sms_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5573
# Check the text in some messages and their type
lapply(sms_corpus[5:8], as.character)
## $`5`
## [1] "U dun say so early hor... U c already then say..."
##
## $`6`
## [1] "Nah I don't think he goes to usf, he lives around here though"
##
## $`7`
## [1] "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
##
## $`8`
## [1] "Even my brother is not like to speak with me. They treat me like aids patent."
sms_raw$Type[5:8]
## [1] ham ham spam ham
## Levels: Category ham spam
corpus_clean <- sms_corpus
# Remove Numbers
corpus_clean <- tm_map(x = corpus_clean, FUN = removeNumbers)
# Transform all letters to lower case
corpus_clean <- tm_map(x = corpus_clean, content_transformer(tolower))
# Remove punctuation
corpus_clean <- tm_map(x = corpus_clean, FUN = removePunctuation)
# Print stopwords()
stopwords()
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
# Remove stop words
corpus_clean <- tm_map(x = corpus_clean, FUN = removeWords, stopwords())
# Test the function
wordStem(words = c("cooks", "cooking", "cooked"))
## [1] "cook" "cook" "cook"
# Stem words in corpus
corpus_clean <- tm_map(x = corpus_clean, FUN = stemDocument)
# Remove extra white spaces
corpus_clean <- tm_map(x = corpus_clean, FUN = stripWhitespace)
A document-term matrix (DTM) is a data structure in which each row represents a document and each column represents a term (word); each cell holds the frequency of that term in that document.
# Create Document Term Matrix
DTM <- DocumentTermMatrix(x = corpus_clean)
DTM
## <<DocumentTermMatrix (documents: 5573, terms: 6956)>>
## Non-/sparse entries: 43715/38722073
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
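To get a feel for the structure, tm's inspect() can display a small corner of the matrix; the row and column ranges below are chosen arbitrarily for illustration:
# Peek at a few documents (rows) and terms (columns) of the sparse matrix
inspect(DTM[1:5, 1:5])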
Word clouds provide a very intuitive way to visualize the frequency of words in a corpus.
wordcloud(words = corpus_clean,
          min.freq = 100, # minimum number of times a word must appear before it is drawn
          random.order = FALSE, # place the most frequent words in the center of the cloud
          colors = c("#4575b4","#74add1","#abd9e9","#e0f3f8","#fee090","#fdae61","#f46d43","#d73027") # colour gradient for the font
          )
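One could also compare the two vocabularies by drawing a separate cloud for each class; wordcloud() accepts a raw character vector and computes the term frequencies itself. A sketch, with max.words and scale chosen arbitrarily:
# Separate clouds for each class, built from the raw message text
spam_text <- sms_raw$Text[sms_raw$Type == "spam"]
ham_text  <- sms_raw$Text[sms_raw$Type == "ham"]
wordcloud(words = spam_text, max.words = 40, scale = c(3, 0.5))
wordcloud(words = ham_text,  max.words = 40, scale = c(3, 0.5))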
Now split the dataset into training and test sets. We will use 80% of the data for training and the remaining 20% for testing.
# Create Training Set
DTM_train <- DTM[1:round(nrow(DTM)*0.80, 0), ]
# Create Test Set
DTM_test <- DTM[(round(nrow(DTM)*0.80, 0)+1):nrow(DTM), ]
# Create vectors with labels for the training and test set
train_labels <- sms_raw[1:round(nrow(sms_raw)*0.80, 0), ]$Type
test_labels <- sms_raw[(round(nrow(sms_raw)*0.80, 0)+1):nrow(sms_raw), ]$Type
# Check proportion of ham and spam is similar on the training and test set
prop.table(table(train_labels))
## train_labels
## Category ham spam
## 0.0002243158 0.8647375505 0.1350381337
prop.table(table(test_labels))
## test_labels
## Category ham spam
## 0.0000000 0.8699552 0.1300448
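The split above simply takes the first 80% of rows, which works here because the class proportions come out similar in both sets. A stratified random split with caret's createDataPartition() (caret is already loaded) is a common alternative; a sketch, where the seed value is arbitrary and using this split would change the exact counts reported below:
set.seed(123) # arbitrary seed, for reproducibility
# Stratified sampling keeps the ham/spam ratio the same in both sets
train_idx <- createDataPartition(sms_raw$Type, p = 0.80, list = FALSE)[, 1]
DTM_train <- DTM[train_idx, ]
DTM_test  <- DTM[-train_idx, ]
train_labels <- sms_raw$Type[train_idx]
test_labels  <- sms_raw$Type[-train_idx]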
Our document-term matrix has 6,956 features. Since this is far too many to feed to the Naive Bayes algorithm efficiently, we will reduce the number of features by keeping only the most frequent words.
# Filter out words that appear in fewer than 0.1% of messages
threshold <- 0.1 # in %
min_freq <- round(DTM$nrow*(threshold/100), 0) # calculate minimum frequency
min_freq
## [1] 6
# Create vector of most frequent words
frequent_words <- findFreqTerms(x = DTM, lowfreq = min_freq)
str(frequent_words)
## chr [1:1251] "‘ll" "abiola" "abl" "abt" "accept" "access" "account" ...
# Filter DTM to only have most frequent words
DTM_train_most_frequent <- DTM_train[, frequent_words]
DTM_test_most_frequent <- DTM_test[, frequent_words]
# Check dimension of DTM
dim(DTM_train_most_frequent)
## [1] 4458 1251
In order to use the Naive Bayes classifier we need to convert the numeric counts in our document-term matrix (DTM) into categorical features: e1071's naiveBayes() would otherwise model numeric predictors with a Gaussian distribution, which is a poor fit for sparse word counts.
# Create function that converts numeric values to "Yes" or "No" if word is present or absent in document
is_present <- function(x) {
  ifelse(test = x > 0, yes = "Yes", no = "No")
}
# Test function
x <- is_present(c(1, 0, 3, 4, 0, 0))
x
## [1] "Yes" "No" "Yes" "Yes" "No" "No"
# Apply is_present() function to training and test DTM
DTM_train_most_frequent <- apply(X = DTM_train_most_frequent,
MARGIN = 2, # Apply function to columns
FUN = is_present) # Specify function to be used
DTM_test_most_frequent <- apply(X = DTM_test_most_frequent,
MARGIN = 2, # Apply function to columns
FUN = is_present) # Specify function to be used
# Create model from the training dataset
spam_classifier <- naiveBayes(x = DTM_train_most_frequent, y = train_labels)
# Print probability tables for some words
spam_classifier$tables$call
## call
## train_labels No Yes
## Category 1.00000000 0.00000000
## ham 0.94422827 0.05577173
## spam 0.57308970 0.42691030
spam_classifier$tables$friend
## friend
## train_labels No Yes
## Category 1.00000000 0.00000000
## ham 0.98184176 0.01815824
## spam 0.98172757 0.01827243
spam_classifier$tables$free
## free
## train_labels No Yes
## Category 1.00000000 0.00000000
## ham 0.98806744 0.01193256
## spam 0.77574751 0.22425249
We can see from the probability tables above that the probabilities of finding the words "call" and "free" in spam messages are several times higher than in legitimate messages. On the other hand, the word "friend" is almost equally likely to be found in both ham and spam.
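The model above is trained with naiveBayes() defaults, so a word never seen in a given class gets a zero conditional probability that wipes out all other evidence for that class. A common refinement is Laplace smoothing via the laplace argument; a sketch (the evaluation below uses the unsmoothed model, so the resulting numbers may differ slightly):
# Retrain with Laplace smoothing so unseen word/class combinations
# receive a small non-zero probability instead of zero
spam_classifier_smoothed <- naiveBayes(x = DTM_train_most_frequent,
                                       y = train_labels,
                                       laplace = 1)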
# Make predictions on test set
test_predictions <- predict(object = spam_classifier, newdata = DTM_test_most_frequent)
# Create confusion matrix
confusionMatrix(data = test_predictions, reference = test_labels, positive = "spam", dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
##
## Actual
## Prediction Category ham spam
## Category 0 0 0
## ham 0 965 16
## spam 0 5 129
##
## Overall Statistics
##
## Accuracy : 0.9812
## 95% CI : (0.9714, 0.9883)
## No Information Rate : 0.87
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.914
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Category Class: ham Class: spam
## Sensitivity NA 0.9948 0.8897
## Specificity 1 0.8897 0.9948
## Pos Pred Value NA 0.9837 0.9627
## Neg Pred Value NA 0.9627 0.9837
## Prevalence 0 0.8700 0.1300
## Detection Rate 0 0.8655 0.1157
## Detection Prevalence 0 0.8798 0.1202
## Balanced Accuracy NA 0.9423 0.9423
We tested the performance of our spam classifier on the test dataset. As we can see from the output above, the model achieves an overall accuracy of 98.12%. The Naive Bayes model we built appears able to classify spam and ham messages effectively.
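The pROC package was loaded at the top but never used; to go beyond a single accuracy figure, one could plot an ROC curve from the classifier's posterior probabilities. A sketch (the droplevels() call and the explicit levels argument handle the empty "Category" level in the labels):
# Posterior probabilities for each class on the test set
test_probs <- predict(object = spam_classifier,
                      newdata = DTM_test_most_frequent,
                      type = "raw")
# ROC curve and AUC for the spam class
roc_obj <- roc(response = droplevels(test_labels),
               predictor = test_probs[, "spam"],
               levels = c("ham", "spam"))
plot(roc_obj)
auc(roc_obj)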