SMS Spam Filtering Based on Natural Language Processing

Wu Mingzhen (S2165138), Yuyang Su (S2165168), Ke Yang (S2139578), Xin Ji (S2116049)

Contents

1. Introduction

a. Background; b. About Dataset; c. Project Objectives

2. Data Preparation

a. Loading libraries; b. Loading dataset; c. Cleaning

3. Exploratory Data Analysis

a. Feature Engineering; b. Descriptive Statistics

4. Modeling

a. Naive Bayes; b. Random Forest

5. Conclusion


1. Introduction

a. Background

Text messaging, more commonly known as the short message service (SMS), is an essential communication platform and an effective marketing tool for businesses. However, its worldwide dominance and convenience have also made it a vehicle for harmful spam: criminals deliver fraudulent messages to millions of people, exploiting technological advances to carry out smishing attacks that propagate scams over mobile networks at great speed. These text messages often contain links that lure unsuspecting victims to phishing sites. Victims then expose themselves to the risk of monetary loss by divulging personal information, downloading malware onto their mobile devices, or providing a one-time passcode that allows a criminal to bypass multi-factor authentication (MFA). Drager (2022) reports that SMS attacks have skyrocketed in recent years, rising by around 328% in 2020 and by a further 700% during the first half of 2021. As a countermeasure, organizations and researchers develop robust and effective spam filters that screen text messages before they reach the end recipients. Machine learning models in particular have been used to filter, detect, and classify message inputs. For instance, Mishra & Soni (2022) implemented Neural Network, Naive Bayes, and Decision Tree models to detect smishing, each with accuracy above 93%, demonstrating the efficacy of machine learning in classifying text messages as legitimate (ham) or illegitimate (smishing) traffic.

b. About Dataset

The SMS Spam Collection is a compilation of labeled SMS messages gathered for the purpose of studying SMS spam. It contains a total of 5,574 English SMS messages, each classified as either legitimate (ham) or spam. Each message occupies a separate line divided into two columns: v1 holds the label (ham or spam) and v2 holds the raw text. https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

c. Project Objectives

1. Use the Naive Bayes algorithm and Random Forest to detect spam.

2. Determine which model achieves the higher detection accuracy.


2. Data Preparation

a. Loading libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(jiebaR)
## Warning: package 'jiebaR' was built under R version 4.2.2
## Loading required package: jiebaRD
## Warning: package 'jiebaRD' was built under R version 4.2.2
## 
## Attaching package: 'jiebaR'
## The following object is masked from 'package:psych':
## 
##     distance
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
library(tm)
## Warning: package 'tm' was built under R version 4.2.2
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(quanteda)
## Warning: package 'quanteda' was built under R version 4.2.2
## Warning in stringi::stri_info(): Your current locale is not in the list
## of available locales. Some functions may not work properly. Refer to
## stri_locale_list() for more details on known locale specifiers.
## Package version: 3.2.4
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
## 
##     stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
b. Loading dataset
data <- read.csv("spam.csv", encoding = "latin1")
head(data,5)
##     v1
## 1  ham
## 2  ham
## 3 spam
## 4  ham
## 5  ham
##                                                                                                                                                            v2
## 1                                             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2                                                                                                                               Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4                                                                                                           U dun say so early hor... U c already then say...
## 5                                                                                               Nah I don't think he goes to usf, he lives around here though
##   X X.1 X.2
## 1          
## 2          
## 3          
## 4          
## 5

The dataset we use is a set of labeled data collected for SMS spam research, publicly available on Kaggle, an online open platform for data science collaboration. It contains 5,574 English SMS messages, each tagged according to its type: ham (legitimate) or spam (illegitimate). The file holds one message per line in two columns: (1) the label (ham or spam) and (2) the raw text.

This corpus was collected and incorporated from the following sources: the Grumbletext website (425 SMS spam messages), a UK forum in which mobile phone users make public claims about SMS spam; the NUS SMS Corpus (NSC) (a subset of 3,375 randomly chosen ham messages from a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore), whose messages largely originate from Singaporean volunteers, mostly students at the university, who were aware that their contributions would be made publicly available; and Caroline Tagg's PhD thesis (a list of 450 SMS ham messages).

c. Cleaning


data <- data[,-(3:5)]                     #Delete unneeded columns
colnames(data) <- c("label","message")                  
head(data,5)
##   label
## 1   ham
## 2   ham
## 3  spam
## 4   ham
## 5   ham
##                                                                                                                                                       message
## 1                                             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2                                                                                                                               Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4                                                                                                           U dun say so early hor... U c already then say...
## 5                                                                                               Nah I don't think he goes to usf, he lives around here though
#Encode the labels: ham = 0, spam = 1
data[which(data$label=="ham"),1] <- "0"
data[which(data$label=="spam"),1] <- "1"
data$label <- as.integer(data$label)               
table(data$label)                                 
## 
##    0    1 
## 4825  747
sum(is.na(data))                            #Check for missing values
## [1] 0

3. Exploratory Data Analysis

a. Feature Engineering
dt = data.frame(A = c(4825,747), B = c("ham","spam"))          #class counts
dt = dt[order(dt$A, decreasing = TRUE),]
myLabel = as.vector(dt$B)
myLabel = paste(myLabel, "(", round(dt$A / sum(dt$A) * 100, 2), "%)", sep = "")   #legend labels with percentages

p = ggplot(dt, aes(x = "", y = A, fill = B)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +                                    #turn the stacked bar into a pie chart
  labs(x = "", y = "", title = "") +
  theme(axis.ticks = element_blank()) +
  theme(legend.title = element_blank(), legend.position = "top") +
  scale_fill_discrete(breaks = dt$B, labels = myLabel) +
  theme(axis.text.x = element_blank()) +
  geom_text(aes(y = A/2 + c(0, cumsum(A)[-length(A)]), x = sum(A)/5572, label = myLabel), size = 5)   #label each slice at its midpoint

p

btx <- data.frame(count = c(4825,747), label = as.character(c("0","1")))
ggplot(data=btx,mapping=aes(x=label,y=count,fill=label,group=factor(1)))+
  geom_bar(stat="identity",width=0.5)

In this section we describe the exploratory data analysis (EDA) of the dataset. The dataset consists of two columns, v1 and v2: v1 holds the spam or ham tag and v2 contains the raw text. There are 5,572 raw messages before data cleaning, of which spam accounts for 13.41% and ham for 86.59%. The figures above show a clear imbalance between the amounts of spam and ham text.
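These proportions can also be recomputed directly from the cleaned data rather than hard-coded; a minimal sketch, assuming the 0/1 label encoding from the cleaning step above:

#Recompute the class proportions from the cleaned data (0 = ham, 1 = spam)
round(prop.table(table(data$label)) * 100, 2)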

Sys.setlocale(category = "LC_ALL",locale = "English_United States.1252")
## Warning in Sys.setlocale(category = "LC_ALL", locale = "English_United
## States.1252"): using locale code page other than 65001 ("UTF-8") may cause
## problems
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
#Number of characters per message
for (i in 1:nrow(data)){data$char[i] <- nchar(data$message[i])}

#Number of words
wk <- worker()
for(i in 1:nrow(data)){data$words[i] <- length(segment(data$message[i],wk))}
data$words <- as.numeric(data$words)


#Number of comma-separated segments per message
for(i in 1:nrow(data)){data$sen[i] <- lengths(strsplit(data$message[i],","))}
head(data)
##   label
## 1     0
## 2     0
## 3     1
## 4     0
## 5     0
## 6     1
##                                                                                                                                                       message
## 1                                             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2                                                                                                                               Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4                                                                                                           U dun say so early hor... U c already then say...
## 5                                                                                               Nah I don't think he goes to usf, he lives around here though
## 6        FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv
##   char words sen
## 1  111    20   2
## 2   29     6   1
## 3  155    35   1
## 4   49    11   1
## 5   61    14   2
## 6  148    35   2
b. Descriptive Statistics
d1 <- describe(data[,-2]);d1                           #Summary statistics for all messages
##       vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## label    1 5572  0.13  0.34      0    0.04  0.00   0   1     1 2.15     2.61
## char     2 5572 80.12 59.69     61   73.67 46.70   2 910   908 2.51    17.46
## words    3 5572 16.20 11.85     13   14.83 10.38   0 190   190 2.70    20.53
## sen      4 5572  1.34  0.81      1    1.16  0.00   1  14    13 4.33    31.06
##         se
## label 0.00
## char  0.80
## words 0.16
## sen   0.01
d2 <- describe(data[which(data$label==0),-2]);d2       #Ham messages only
##       vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## label    1 4825  0.00  0.00      0    0.00  0.00   0   0     0  NaN      NaN
## char     2 4825 71.02 58.02     52   62.20 35.58   2 910   908 3.39    25.14
## words    3 4825 14.67 11.78     11   12.88  7.41   0 190   190 3.39    26.27
## sen      4 4825  1.31  0.77      1    1.14  0.00   1  14    13 4.80    39.69
##         se
## label 0.00
## char  0.84
## words 0.17
## sen   0.01
d3 <- describe(data[which(data$label==1),-2]);d3       #Spam messages only
##       vars   n   mean    sd median trimmed   mad min max range  skew kurtosis
## label    1 747   1.00  0.00      1    1.00  0.00   1   1     0   NaN      NaN
## char     2 747 138.87 29.18    149  144.39 14.83  13 224   211 -1.78     3.19
## words    3 747  26.07  6.22     27   26.81  4.45   2  39    37 -1.19     1.56
## sen      4 747   1.54  1.03      1    1.30  0.00   1   7     6  2.71     8.94
##         se
## label 0.00
## char  1.07
## words 0.23
## sen   0.04

Visualization Comparison

data$label <- as.factor(data$label)


ggplot(data, aes(x = char, fill = label)) +
  geom_histogram(position = "identity", alpha = 0.4, bins = 30) + scale_fill_brewer(palette = "Set1")

ggplot(data, aes(x = words, fill = label)) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.4) + scale_fill_brewer(palette = "Set1")

As a basic overview of the dataset content, we counted the number of characters and words in each message in v2 and visualized them. In the histograms above (ham in red, spam in blue), the total character count of ham messages far exceeds that of spam messages, but this is simply because there are many more ham messages than spam. The histograms reveal a more useful insight: most ham messages fall in the range of 0-100 characters, while most spam messages fall roughly between 100 and 150 characters. Likewise, ham messages are most often composed of 0-25 words, while spam messages typically contain 25-40 words.

#Remove outliers
i <- which(data$char>500)
data <- data[-i,]

#Correlation coefficient matrix
data$label <- as.integer(data$label)        #as.integer on the factor yields 1/2
data$label[which(data$label==1)] <- 0       #map back to 0 = ham, 1 = spam
data$label[which(data$label==2)] <- 1
data$label <- as.integer(data$label)
corx <- cor(data[,-2])
corrplot(corx)

corrplot(corx,method="shade",
         shade.col=NA,
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "white",
         cl.pos = "n", 
         order="AOE")
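As a quick sanity check on the matrix, a single pairwise correlation can be computed directly; a minimal sketch using the feature columns built above:

#Verify one entry of the matrix directly
cor(data$char, data$words)   #expected to be about 0.98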

This is a correlation matrix. Correlation analysis examines two or more variables and measures how strongly they are related. In Figure 5, the shading runs from dark to light as the correlation rises from low to high. The solid light-coloured 1s on the diagonal mean that each variable is being compared with itself, so the correlation is exactly 1. We can also see that char and words have a correlation of 0.98; this is expected, because the two variables essentially capture the same content: words are made up of characters.

Data pre-processing

#Convert all text to lowercase
data$message = tolower(data$message)

#Remove stop words
sw <- stopwords("english")
head(sw,9)
## [1] "i"         "me"        "my"        "myself"    "we"        "our"      
## [7] "ours"      "ourselves" "you"
for (i in 1:nrow(data)){data$message[i] <- removeWords(data$message[i],sw)}

#Removing URLs
for (i in 1:nrow(data)){data$message[i] <- gsub("http\\S+","",data$message[i])}

#Removing numbers
for (i in 1:nrow(data)){data$message[i] <- gsub("\\d+","",data$message[i])}

#Removing emails
for (i in 1:nrow(data)){data$message[i] <- gsub("\\S*@\\S*\\S?","",data$message[i])}

head(data,5)
##   label
## 1     0
## 2     0
## 3     1
## 4     0
## 5     0
##                                                                                                                     message
## 1                           go  jurong point, crazy.. available   bugis n great world la e buffet... cine  got amore wat...
## 2                                                                                             ok lar... joking wif u oni...
## 3 free entry    wkly comp  win fa cup final tkts st may . text fa    receive entry question(std txt rate)t&c's apply over's
## 4                                                                               u dun say  early hor... u c already  say...
## 5                                                                             nah   think  goes  usf,  lives around  though
##   char words sen
## 1  111    20   2
## 2   29     6   1
## 3  155    35   1
## 4   49    11   1
## 5   61    14   2
#Statistical word frequency
spam_wc <- data[which(data$label==1),]  #spam messages (label 1)
ham_wc <- data[which(data$label==0),]   #ham messages (label 0)

a <- c()                                #all tokens from spam messages
for (i in 1:nrow(spam_wc)){
  b <- segment(spam_wc$message[i],wk)
  a <- c(a,b)
}
c <- c()                                #all tokens from ham messages
for (i in 1:nrow(ham_wc)){
  d <- segment(ham_wc$message[i],wk)
  c <- c(c,d)
}
word_spam <- freq(a)                    #word frequency tables (jiebaR::freq)
word_ham <- freq(c)

head(word_ham,10);head(word_spam,10)
##        char freq
## 1      pity    1
## 2  salesman    1
## 3      dump    1
## 4    dental    1
## 5     units    1
## 6      shud    1
## 7      kane    1
## 8   indians    1
## 9    influx    1
## 10   sudden    1
##         char freq
## 1      house    1
## 2       shit    1
## 3        sed    1
## 4      servs    1
## 5      inclu    1
## 6  chatlines    1
## 7       gsex    1
## 8       ball    1
## 9     spider    2
## 10    marvel    1

Mapping word clouds

wordcloud2(word_ham, 
           size = 1,                  
           fontFamily = 'Segoe UI',   
           fontWeight = 'bold',       
           color = 'random-dark',     
           backgroundColor = "white", 
           minRotation = -pi/4,       
           maxRotation = pi/4,
           rotateRatio = 0.4,         
           shape = "circle"           
)
wordcloud2(word_spam, 
           size = 1,                  
           fontFamily = 'Segoe UI',   
           fontWeight = 'bold',       
           color = 'random-dark',     
           backgroundColor = "white", 
           minRotation = -pi/4,       
           maxRotation = pi/4,
           rotateRatio = 0.4,        
           shape = "circle"           
)

In this section, we tried to identify some commonalities between the fraudulent text messages. The dataset we use has a large amount of text information, and using word clouds makes it easy to quickly identify the most commonly used words in the text, which helps highlight key themes and topics. It is a striking visualization method for highlighting essential textual data points. It can make dull data shine and deliver crucial information quickly.

A word cloud is a grouping of words displayed in various sizes: the larger and bolder the term, the more frequently it appears in a document and the more important it is. Also known as tag clouds or text clouds, word clouds combine data visualization, font size and colour, and word frequency analysis. They are an ideal technique for extracting the most relevant parts of textual material, from blog posts to databases, and can also help users compare and contrast two separate pieces of text to identify phrasing similarities.

The word clouds above show the most frequent words in the ham and spam messages respectively. Below, we analyze several of the most frequent spam words to understand why they occur so often.
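As a numeric complement to the clouds, the frequency table can be sorted directly; a minimal sketch using the word_spam table computed above:

#Top spam terms by frequency
head(word_spam[order(word_spam$freq, decreasing = TRUE), ], 10)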

“Free”: This word is often used to entice people to open the message, as it suggests that the recipient has won something or that a special offer is available.

“Mobile” or “Call” : These words are often used to encourage recipients to contact the sender by phone, as it allows the spammer to reach the recipient more directly.

“Collect” or “Send” : These words are often used to direct the recipient to take a specific action, such as collecting a prize or sending personal information.

“SMS” or “txt” : These words are used to suggest that the message is being sent via text message, which is a common method of communication.

“Guaranteed” : This word is used to alleviate any concerns the recipient may have about the offer, and to make it seem more legitimate.

“Tone” or “Urgent” : These words are used to create a sense of urgency and to make the message seem more important.

“Chance to win” or “New” : These words are used to create excitement and to make the offer seem more attractive.

“Please call” or “Stop” : These words are often used to create a sense of urgency and to encourage the recipient to take immediate action.


4. Modeling

a. Naive Bayes

Messages are classified with a naive Bayes classifier, where the label ham represents legitimate messages and spam represents unsolicited ones; the final model correctly classifies over 97% of text messages.
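The idea behind naive Bayes is that the posterior probability of a class is proportional to the class prior times the product of the per-word conditional probabilities. A minimal sketch with toy numbers (the priors approximate the dataset proportions; the word probabilities are assumptions for illustration only):

#Toy naive Bayes update for a message containing the single word "free"
p_spam <- 0.13; p_ham <- 0.87            #class priors (rough dataset proportions)
p_free_spam <- 0.20                      #assumed P("free" | spam)
p_free_ham  <- 0.01                      #assumed P("free" | ham)
post_spam <- p_spam * p_free_spam        #unnormalized posterior for spam
post_ham  <- p_ham  * p_free_ham         #unnormalized posterior for ham
post_spam / (post_spam + post_ham)       #probability of spam, about 0.75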

# ham stands for legitimate messages
# spam stands for unsolicited messages
DATA1 <- read.csv("spam.csv")#read sms data
head(DATA1)
##     v1
## 1  ham
## 2  ham
## 3 spam
## 4  ham
## 5  ham
## 6 spam
##                                                                                                                                                            v2
## 1                                             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2                                                                                                                               Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4                                                                                                           U dun say so early hor... U c already then say...
## 5                                                                                               Nah I don't think he goes to usf, he lives around here though
## 6        FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv
##   X X.1 X.2
## 1          
## 2          
## 3          
## 4          
## 5          
## 6
data=DATA1[1:2]
data <- sapply(data,function(row) iconv(row, "latin1", "ASCII", sub=""))
data <- data[,1:2]
data <- data.frame(data)
names(data) <- c('type','text')
data$type <- factor(data$type)
table(data$type)# Check Variables
## 
##  ham spam 
## 4825  747

The first line imports the ‘tm’ library, which contains functions for text mining and natural language processing.

library('tm')#Text mining utilities
sms_corpus <- VCorpus(VectorSource(data$text))#Build a corpus from the SMS content
head(sms_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 6
print(sms_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5572

The “VCorpus” function from the ‘tm’ library builds the corpus from the text stored in the “data” variable; the “inspect” function below summarizes the first two messages.

inspect(sms_corpus[1:2])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 111
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 29
as.character(sms_corpus[[1]])#View the first text message content
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
lapply(sms_corpus[1:2], as.character)
## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
## 
## $`2`
## [1] "Ok lar... Joking wif u oni..."

The “tolower” function is used to convert all text in the corpus to lowercase, the “removeNumbers” function is used to remove numbers from the text, “removeWords” function is used to delete filler words such as “to”, “and”, “but” which are defined in stopwords().

sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))#All information in lowercase
as.character(sms_corpus[[1]])
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
as.character(sms_corpus_clean[[1]])
## [1] "go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat..."
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)# remove numbers
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())#Delete filler words, such as to,and,but

The “removePunctuation” function removes punctuation from the text. A custom “replacePunctuation” function is also defined to show how punctuation could instead be replaced with spaces.

sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)#Remove punctuation
removePunctuation("hello...world")#removePunctuation deletes the marks outright
## [1] "helloworld"
replacePunctuation <- function(x) { gsub("[[:punct:]]+", " ", x) }#gsub-based alternative that replaces punctuation with spaces
replacePunctuation("hello...world")
## [1] "hello world"

The “SnowballC” library is imported; it contains the “wordStem” function, which reduces words to their root (stem) form.

library(SnowballC)#Provides wordStem, which reduces words to their stems
wordStem(c("learn", "learned", "learning", "learns"))
## [1] "learn" "learn" "learn" "learn"
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)#In order for wordStem to be applied to the entire text corpus
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)# Delete extra spaces
lapply(sms_corpus[1:3], as.character)#Observe the results before and after some transformations
## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
## 
## $`2`
## [1] "Ok lar... Joking wif u oni..."
## 
## $`3`
## [1] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
lapply(sms_corpus_clean[1:3], as.character)
## $`1`
## [1] "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
## 
## $`2`
## [1] "ok lar joke wif u oni"
## 
## $`3`
## [1] "free entri wkli comp win fa cup final tkts st may text fa receiv entri questionstd txt ratetc appli s"
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)#Creating sparse matrices
sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list( tolower = TRUE, removeNumbers = TRUE, stopwords = TRUE, removePunctuation = TRUE, stemming = TRUE))

The second “DocumentTermMatrix” call builds a sparse matrix directly from the raw corpus, applying the same cleaning steps through the control parameter.

#create a document-term sparse matrix directly from the SMS corpus
sms_dtm3 <- DocumentTermMatrix(sms_corpus, control = list( tolower = TRUE, removeNumbers = TRUE, stopwords = function(x) { removeWords(x, stopwords()) }, removePunctuation = TRUE, stemming = TRUE))
#alternative solution: using custom stop words function ensures identical result
sms_dtm#compare the result
## <<DocumentTermMatrix (documents: 5572, terms: 6472)>>
## Non-/sparse entries: 42009/36019975
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
sms_dtm2#compare the result
## <<DocumentTermMatrix (documents: 5572, terms: 6866)>>
## Non-/sparse entries: 43435/38213917
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
sms_dtm3#compare the result
## <<DocumentTermMatrix (documents: 5572, terms: 6472)>>
## Non-/sparse entries: 42009/36019975
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)

This code uses the document-term matrix (DTM) to train and test a Naive Bayes classifier for text classification.

sms_dtm_train <- sms_dtm[1:4169, ]#creating training and test datasets
sms_dtm_test <- sms_dtm[4170:5559, ]
sms_train_labels <- data[1:4169, ]$type#also save the labels
sms_test_labels <- data[4170:5559, ]$type
prop.table(table(sms_train_labels))#check that the proportion of spam is similar
## sms_train_labels
##       ham      spam 
## 0.8644759 0.1355241
prop.table(table(sms_test_labels))
## sms_test_labels
##       ham      spam 
## 0.8705036 0.1294964
spam <- subset(data, type == "spam")#Get a subset
ham <- subset(data, type == "ham")
sms_dtm_freq_train <- removeSparseTerms(sms_dtm_train, 0.999)#Drop words occurring in fewer than 0.1% of training records (superseded by the frequent-terms filter below)
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)#Find the word that appears at least 5 times
str(sms_freq_words)
##  chr [1:1155] "abiola" "abl" "abt" "accept" "access" "account" "across" ...
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]#Check the frequency of the filtered words above
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]

A helper function “convert_counts” converts word counts into categorical features: “Yes” if the word appears in a message, “No” otherwise.

convert_counts <- function(x) { x <- ifelse(x > 0, "Yes", "No")}#Yes if the count is greater than 0

The next two lines use the “apply” function with MARGIN = 2 to run “convert_counts” over every column, turning the count matrices into Yes/No indicators.

sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)#Replace all that are not 0 in the corpus with Yes
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)

The “e1071” library is imported; its “naiveBayes” function trains the classifier on the training data and labels, and the “predict” function then generates predictions on the test data.

library(e1071)# Classification model building
## Warning: package 'e1071' was built under R version 4.2.2
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
sms_test_pred <- predict(sms_classifier, sms_test)#Evaluate model performance

The “gmodels” library is imported and its “CrossTable” function creates a confusion matrix of the predictions against the true labels. A second classifier is then trained with Laplace smoothing, with the parameter set to 1.

library(gmodels)#Performance improvement of the model
## Warning: package 'gmodels' was built under R version 4.2.2
CrossTable(sms_test_pred, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1390 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1202 |        25 |      1227 | 
##              |     0.993 |     0.139 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         8 |       155 |       163 | 
##              |     0.007 |     0.861 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1210 |       180 |      1390 | 
##              |     0.871 |     0.129 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
#Improving the model: with the laplace value set to 1, 35 SMS messages are incorrectly classified.
sms_test_pred2 <- predict(sms_classifier2, sms_test)

The “CrossTable” function is used again to create a confusion matrix of the predictions from the second classifier compared to the true labels.

CrossTable(sms_test_pred2, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1390 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1185 |        10 |      1195 | 
##              |     0.979 |     0.056 |           | 
## -------------|-----------|-----------|-----------|
##         spam |        25 |       170 |       195 | 
##              |     0.021 |     0.944 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1210 |       180 |      1390 | 
##              |     0.871 |     0.129 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
#The final model correctly classifies over 97% of SMS messages into spam and non-spam messages
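As a quick check of the quoted figure, overall accuracy can be computed directly from the predictions above; a minimal sketch:

#Overall accuracy of the two classifiers on the test set
mean(sms_test_pred  == sms_test_labels)   #about 0.976
mean(sms_test_pred2 == sms_test_labels)   #about 0.975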
b. Random Forest

We implement a random forest model by first building binary indicator variables from the second column of the dataset. Sequences of numbers turn out to be by far the most important variable for predicting whether a message is spam; website addresses, uppercase letters, the pound symbol, and words such as “reply”, “call”, “send”, or “free” are also among the most important. On the test data the model achieves more than 98% accuracy, and, most importantly, the sensitivity and specificity rates look good.

library('ggplot2') # visualization
library('randomForest') # classification algorithm
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
## 
##     outlier
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library('ROCR')
## Warning: package 'ROCR' was built under R version 4.2.2

Read the SMS data

DATA1 <- read.csv("spam.csv")
data=DATA1[1:2]
data <- sapply(data,function(row) iconv(row, "latin1", "ASCII", sub=""))
data <- data[,1:2]
data <- data.frame(data)
names(data) <- c("Spam", "Msg")

Convert all messages to lowercase

data$Msg=lapply(data$Msg,FUN="tolower")
words<-paste(data$Msg,collapse=" ")
words<-strsplit(as.character(words),"(\\.)|(\\?)|( )|(,)")
mostusedwords<-sort(table(words),decreasing=TRUE)[2:101]
mostusedwords<-names(mostusedwords)

Remove short filler words like “to”, “a”, “you”, etc., keeping only words longer than 3 characters

mostusedwords<-(mostusedwords[nchar(mostusedwords)>3])[-11]#Drop one token that is not a real word
head(mostusedwords)
## [1] "your" "call" "have" "that" "will" "with"

Encode the labels as binary values

newdata<-as.character(data$Spam)#Recode the class labels as 0/1
#For each of these characters, create a variable equal to 1 if the observation contains the character, 0 otherwise
newdata[newdata=="ham"]<-"0"
newdata[newdata=="spam"]<-"1"
newdata<-as.data.frame(as.factor(newdata))
characters<-c("[0-9]{4}","[A-Z]{4}","£","\\!\\!","\\,","\\.\\.\\.","www")
var<-c(characters,mostusedwords)
for (i in var){
  newdata<-cbind(newdata,as.numeric(grepl(i,data$Msg)))
}

names(newdata)<-c("Spam","Numbers","upperCase","Pounds","exclMark","Coma","suspPoints","WWW",mostusedwords)
names(newdata)[c(34,44)]=c("ill","dont2")

Set a random number seed, then split the dataset into a training set and a test set.

set.seed(123)
train<-sample(1:nrow(newdata),floor(0.7*nrow(data)),replace=FALSE)
test <- -train
data_train<-newdata[train,]
data_test<-newdata[test,]

Apply a random forest to the new dataset, which contains only 0/1 binary variables, and find the best number of variables to try at each node

mtry <- tuneRF(data_train[,2:49],data_train$Spam, ntreeTry=500,
               stepFactor=1.5,improve=0.01, trace=TRUE, plot=TRUE)
## mtry = 6  OOB error = 1.74% 
## Searching left ...
## mtry = 4     OOB error = 1.79% 
## -0.02941176 0.01 
## Searching right ...
## mtry = 9     OOB error = 1.72% 
## 0.01470588 0.01 
## mtry = 13    OOB error = 1.74% 
## -0.01492537 0.01

best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
print(mtry)
##        mtry   OOBError
## 4.OOB     4 0.01794872
## 6.OOB     6 0.01743590
## 9.OOB     9 0.01717949
## 13.OOB   13 0.01743590
print(best.m)
## [1] 9

The out-of-bag accuracy comes to about 98.3% (OOB error rate of 1.74%)

rf <-randomForest(formula=Spam~.,data=data_train, mtry=best.m, importance=TRUE,ntree=500)
print(rf)
## 
## Call:
##  randomForest(formula = Spam ~ ., data = data_train, mtry = best.m,      importance = TRUE, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 1.74%
## Confusion matrix:
##      0   1 class.error
## 0 3366  13 0.003847292
## 1   55 466 0.105566219
plot(rf$err.rate[, 1], type = "l", xlab = "number of trees", ylab = "OOB error")

importance(rf)
##                      0           1 MeanDecreaseAccuracy MeanDecreaseGini
## Numbers    172.0952632 158.9210477          192.7006745      472.5880208
## upperCase    0.0000000   0.0000000            0.0000000        0.0000000
## Pounds       0.0000000   0.0000000            0.0000000        0.0000000
## exclMark     1.1172944   0.9518261            1.4611442        0.3897062
## Coma         2.5576589   5.5515550            5.1631299        5.0863397
## suspPoints  15.6240145  28.2926274           31.1079850        9.1870434
## WWW         42.1480971  37.0279400           45.5684279       43.2396349
## your        17.3968641  15.1039507           22.1924877       18.2203764
## call         4.9739854  11.0261732           12.6937196       63.5490818
## have         7.9207155   9.5891217           12.2705674        7.2832101
## that         8.5722514  12.2016499           14.9669126        4.0561673
## will         1.2846415   6.7813127            4.5745800        4.9889402
## with         5.0002467   9.8576525            8.9888667        5.3972163
## just         1.4625509   1.9019000            2.2428273        2.3665446
## this         7.5962640  11.9049181           12.5090645        5.6678405
## when         3.6428213   8.6545601            8.3442605        3.8258773
## from         6.9154267   8.1305261           10.5303280       10.8283921
## what         3.7842649   8.1035873            8.3404882        2.0047049
## free        24.5118944  18.2338050           28.9535444       62.6756473
## know         6.9854280   8.3851279           10.2243796        2.2938199
## like         2.5398414   5.1816434            5.2261148        1.5303796
## then         2.9031492   9.1363941            9.2942601        1.9013251
## good         1.0109416   5.2688058            4.5184982        1.4808404
## come        -1.7205852   8.8227633            5.8765615        2.6485506
## time        -0.4881598   2.5324650            1.2915051        1.7262956
## only         5.1756840   1.4668279            4.4397403        4.9464783
## love         7.0736281   6.0648690            9.0664503        1.8468871
## there       -0.3526756   6.5865567            4.9595361        1.2709411
## send         3.9678642  12.1730326           10.7051732        6.5700517
## want         0.2991947   3.0837357            1.9210482        1.2952892
## text         9.3254846  10.3293970           11.8658324       19.8174526
## going        3.5903117   6.2480815            7.0031946        2.0117357
## ill         -0.5891839   8.3433648            7.9086753        2.7600318
## need         0.5714576   4.1866527            3.3650452        1.0664132
## about        1.6957860   6.0007984            3.9627875        1.9372744
## home        -0.9031657   5.9859760            4.7334337        1.1271422
## sorry        0.1930208   5.3476282            4.7436238        1.8187669
## still        0.6187893   3.5846136            3.3594496        0.1975442
## dont        -1.0365547   2.6521010            0.1766586        1.4798857
## back        -2.6287934   5.6372402            0.9127701        2.9239906
## stop        17.9283156  19.2062722           24.6281754       26.6493107
## reply       18.4912200  17.4257333           21.5951034       19.0004104
## dont2        2.2824449   4.2306777            4.3941038        0.8357554
## tell        -1.5755429   4.6085529            1.4401804        0.9694497
## take         1.3340541   3.7569550            3.0861831        1.3830636
## today        0.3267968   3.0031252            2.2927939        1.3598401
## later        1.2027158   6.8765101            5.2442509        2.7529123
## think        1.1453301   5.5068129            5.2540978        1.3322912
varImpPlot(rf)

Predict on the test dataset

pred1=as.numeric(predict(rf,type = "response",data_test))-1
head(pred1)
## [1] 1 0 0 0 0 0
table(pred1,as.numeric(data_test[,1])-1)
##      
## pred1    0    1
##     0 1445   20
##     1    1  206
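From this table, overall accuracy, sensitivity, and specificity can be derived directly; a minimal sketch treating spam (1) as the positive class:

#Accuracy, sensitivity and specificity from the confusion table above
cm <- table(pred1, as.numeric(data_test[,1])-1)
sum(diag(cm)) / sum(cm)   #overall accuracy
cm[2,2] / sum(cm[,2])     #sensitivity: share of spam correctly flagged
cm[1,1] / sum(cm[,1])     #specificity: share of ham correctly passed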

Evaluate the performance of the random forest for classification.

pred2 <- predict(rf,type = "prob",data_test)

“prediction” is the ROCR function that converts the predicted probabilities and true labels into a standardized prediction object

perf <- prediction(pred2[,2], data_test$Spam)

Measure performance in terms of true and false positive rates

auc <- performance(perf, "auc")
pred3 = performance(perf, "tpr","fpr")
plot(pred3,main="ROC Curve for Random Forest",col=2,lwd=2)
abline(a=0,b=1,lwd=2,lty=2,col="gray")
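The AUC object computed above is never printed; its value can be extracted from the ROCR performance object's y.values slot:

#Extract the area under the ROC curve
auc@y.values[[1]]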


5. Conclusion

In conclusion, both the Naive Bayes and Random Forest algorithms were effective in filtering SMS spam messages. Based on the results above, the best-performing algorithm is the random forest classifier, which has the highest accuracy (about 98.3%) among the models tested. The results also show that the random forest model minimizes the misclassification of normal text messages as spam. The Naive Bayes model performed somewhat worse: although its overall accuracy exceeded 97%, it correctly identified only 94.4% of the spam messages in the test set.

We believe that the imbalance between spam and non-spam in the dataset can bias models trained by these two algorithms, and attempting to balance the dataset carries a risk of overfitting or losing key information. The random forest algorithm, however, is not very sensitive to imbalanced data: its effect can be reduced by adjusting the classifier itself, so random forest is less affected than Naive Bayes and naturally performs better.
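As an illustration of one such adjustment, randomForest supports stratified sampling per class through its strata and sampsize arguments; a hedged sketch with hypothetical per-class sample sizes (rf_bal and the sizes below are illustrative, not part of the analysis above):

#Hypothetical illustration: draw equal samples per class inside each tree
rf_bal <- randomForest(Spam ~ ., data = data_train,
                       strata = data_train$Spam,
                       sampsize = c(500, 500),   #assumed per-class sample sizes
                       ntree = 500)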