We will be classifying SMS messages as spam or ham. Junk messages are labeled spam and legitimate messages are labeled ham. We will be using the data from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection#.
This corpus was collected from free or free-for-research sources on the Internet:
-> A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of the spam messages in these claims is a hard and time-consuming task that involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
-> A subset of 3,375 SMS randomly chosen ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: [Web Link].
-> A list of 450 SMS ham messages collected from Caroline Tag’s PhD Thesis available at [Web Link].
-> Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and it is publicly available at: [Web Link]. This corpus has been used in the following academic papers:
[1] Gómez Hidalgo, J.M., Cajigas Bringas, G., Puertas Sanz, E., Carrero García, F. Content Based SMS Spam Filtering. Proceedings of the 2006 ACM Symposium on Document Engineering (ACM DOCENG’06), Amsterdam, The Netherlands, 10-13, 2006.
[2] Cormack, G. V., Gómez Hidalgo, J. M., and Puertas Sánz, E. Feature engineering for mobile (SMS) spam filtering. Proceedings of the 30th Annual international ACM Conference on Research and Development in information Retrieval (ACM SIGIR’07), New York, NY, 871-872, 2007.
[3] Cormack, G. V., Gómez Hidalgo, J. M., and Puertas Sánz, E. Spam filtering for short messages. Proceedings of the 16th ACM Conference on Information and Knowledge Management (ACM CIKM’07). Lisbon, Portugal, 313-320, 2007.
library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(e1071) #For Naive Bayes
library(caret) #For the confusion matrix
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
#Load the csv file into R
sms_raw <- read.csv("C:/Users/tonym/OneDrive/Documents/sms.csv")
#View the first lines of the data
head(sms_raw)
## type
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## 6 spam
## text
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
#Select & rename appropriate columns of the dataset
sms_raw <- sms_raw[, 1:2]
colnames(sms_raw) <- c("Tag", "Msg")
str(sms_raw)
## 'data.frame': 5572 obs. of 2 variables:
## $ Tag: Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...
## $ Msg: Factor w/ 5168 levels "'An Amazing Quote'' - \"Sometimes in life its difficult to decide whats wrong!! a lie that brings a smile or th"| __truncated__,..: 1121 3222 1020 4258 2868 1045 951 383 4763 1256 ...
#Find the proportions of junk vs legitimate sms messages
table(sms_raw$Tag)
##
## ham spam
## 4825 747
prop.table(table(sms_raw$Tag))
##
## ham spam
## 0.8659368 0.1340632
We will use word clouds to visualize the spam and ham subsets. A word cloud shows how frequently words appear in the text: the larger the word, the higher its frequency.
spam <- subset(sms_raw, Tag == "spam")
suppressWarnings(wordcloud(spam$Msg, max.words = 60, colors = brewer.pal(7, "Paired"), random.order = FALSE))
ham <- subset(sms_raw, Tag == "ham")
suppressWarnings(wordcloud(ham$Msg, max.words = 60, colors = brewer.pal(7, "Paired"), random.order = FALSE))
Ham messages frequently contain: can, come, just, get, will. Spam messages frequently contain: call, now, claim, free, prize.
I can see that the ham messages are about interactions between real people; for example, a friend asking for help. The spam messages, by contrast, tend to urge the recipient to call back and to claim prizes.
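As a rough numerical check on these impressions, we can count term frequencies in each subset directly. This is only an illustrative sketch: it reuses the spam and ham subsets created above, applies the same basic cleaning options used later for the document-term matrix, and relies on the slam package, which is installed alongside tm.
# Count the most common terms in a set of messages (sketch; exact counts depend on cleaning choices)
count_terms <- function(msgs) {
  tdm <- TermDocumentMatrix(VCorpus(VectorSource(msgs)),
                            control = list(tolower = TRUE, removeNumbers = TRUE,
                                           stopwords = TRUE, removePunctuation = TRUE))
  sort(slam::row_sums(tdm), decreasing = TRUE)
}
head(count_terms(spam$Msg), 10)
head(count_terms(ham$Msg), 10)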
# Using VectorSource function to create one document for each sms message
# Using the VCorpus function to create a volatile corpus for each sms message.
sms_corpus <- VCorpus(VectorSource(sms_raw$Msg))
# Create a DocumentTermMatrix to split each sms message(each document) into individual words. We also clean up the data by removing numbers, stopwords, punctuation and applying stemming.
sms_dtm <- DocumentTermMatrix(sms_corpus, control =
list(tolower = TRUE,
removeNumbers = TRUE,
stopwords = TRUE,
removePunctuation = TRUE,
stemming = TRUE))
# Checking the dimension
dim(sms_dtm)
## [1] 5572 7011
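To get a feel for the structure of the document-term matrix, we can inspect a small corner of it. This is illustrative only; the terms shown depend on the vocabulary in your copy of the data.
# Peek at a few documents and terms; most entries are 0, since each SMS uses only a few words
inspect(sms_dtm[1:5, 10:15])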
We will split the data 80:20 into training and testing sets: 4,457 of the 5,572 messages for training and the remaining 1,115 for testing.
#Training and Testing Data
sms_dtm_train <- sms_dtm[1:4457, ]
sms_dtm_test <- sms_dtm[4458:5572, ]
#Training & Test Label
sms_train_labels <- sms_raw[1:4457, ]$Tag
sms_test_labels <- sms_raw[4458:5572, ]$Tag
#Proportion for training & test labels
prop.table(table(sms_train_labels))
## sms_train_labels
## ham spam
## 0.8649316 0.1350684
prop.table(table(sms_test_labels))
## sms_test_labels
## ham spam
## 0.8699552 0.1300448
Using the prop.table function we can confirm that both the training and test sets keep roughly the original split of 87% ham to 13% spam.
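The split above simply takes the first 4,457 rows, and the proportions confirm the class balance is roughly preserved. If you wanted a stratified random split instead, the caret package (already loaded) offers one way to do it. This is only a sketch and is not used in the results below.
# Alternative: stratified random 80/20 split on the labels (not used here)
set.seed(123)
train_idx <- createDataPartition(sms_raw$Tag, p = 0.8, list = FALSE)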
We use findFreqTerms to identify the most frequent words in the text and then filter the DTM to keep only those words. With a threshold of 0.1% of the 5,572 messages, a word must appear at least 6 times overall to be retained.
threshold <- 0.1
min_freq = round(sms_dtm$nrow*(threshold/100),0)
min_freq
## [1] 6
# Create vector of most frequent words
freq_words <- findFreqTerms(x = sms_dtm, lowfreq = min_freq)
str(freq_words)
## chr [1:1266] "abiola" "abl" "about" "abt" "accept" "access" "account" ...
#Filter the DTM
sms_dtm_freq_train <- sms_dtm_train[ , freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , freq_words]
dim(sms_dtm_freq_test)
## [1] 1115 1266
dim(sms_dtm_freq_train)
## [1] 4457 1266
# Convert any non-zero count to "Yes"
# Convert all zero counts to "No"
convert_values <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2,
convert_values)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2,
convert_values)
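A quick peek at the converted training matrix confirms the cells are now categorical rather than counts (illustrative; which terms appear in the first few columns depends on your vocabulary).
sms_train[1:3, 1:5]   # each cell is now "Yes" or "No" instead of a word count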
We use the naiveBayes function to train the classifier; it uses the presence and absence of words to estimate the probability that an SMS message is spam.
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
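The fitted object stores the class priors and, for each word, a table of the conditional distribution of "Yes"/"No" given ham or spam; these tables are what drive the predictions. A quick look is sketched below; "claim" is just an example term and must be one of the frequent words for this to work.
sms_classifier$apriori            # counts of ham and spam in the training labels
sms_classifier$tables[["claim"]]  # P(word present/absent | ham) vs P(word present/absent | spam)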
#Make predictions on test set
sms_test_pred <- predict(sms_classifier, sms_test)
#Create confusion matrix
confusionMatrix(data = sms_test_pred, reference = sms_test_labels,
positive = "spam", dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
##
## Actual
## Prediction ham spam
## ham 965 16
## spam 5 129
##
## Accuracy : 0.9812
## 95% CI : (0.9714, 0.9883)
## No Information Rate : 0.87
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.914
##
## Mcnemar's Test P-Value : 0.0291
##
## Sensitivity : 0.8897
## Specificity : 0.9948
## Pos Pred Value : 0.9627
## Neg Pred Value : 0.9837
## Prevalence : 0.1300
## Detection Rate : 0.1157
## Detection Prevalence : 0.1202
## Balanced Accuracy : 0.9423
##
## 'Positive' Class : spam
##
We can see that we have an accuracy of 98.12%. The model correctly classifies 965 ham messages and 129 spam messages, while misclassifying 5 ham messages as spam and 16 spam messages as ham.
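A natural follow-up experiment, sketched here but not run, is to retrain with Laplace smoothing so that a word seen in only one class during training does not force a zero probability for the other class.
# Retrain with Laplace smoothing and re-evaluate (results may differ slightly from above)
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
confusionMatrix(data = sms_test_pred2, reference = sms_test_labels,
                positive = "spam", dnn = c("Prediction", "Actual"))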