Content
1. Introduction
a. Background; b. About Dataset; c. Project Objectives
2. Data Preparation
a. Loading libraries; b. Loading dataset; c. Cleaning
3. Exploratory Data Analysis
a. Feature Engineering; b. Descriptive Statistics
4. Modeling
a. Naive Bayes; b. Random Forest
5. Conclusion
1. Introduction
a. Background
Text messaging, more commonly known as the short message service (SMS), has become an essential communication platform and an effective marketing channel for businesses. However, its worldwide reach and convenience have also made it attractive to criminals, who design harmful spam messages and deliver them to millions of people. They exploit advances in technology to carry out smishing, propagating scams over mobile networks at great speed. These text messages often contain links that lure unsuspecting victims to phishing sites. The victims then expose themselves to monetary loss by divulging personal information, downloading malware onto their mobile devices, or handing over a one-time passcode that allows a criminal to bypass multi-factor authentication (MFA). Drager (2022) reports that SMS attacks have been skyrocketing: the rate increased by around 328% in 2020 and grew by a further 700% during the first half of 2021. As a countermeasure, organizations and researchers develop robust and effective spam filters that screen text messages before they are delivered to recipients. Machine learning models have been used to filter, detect, and classify incoming messages. For instance, Mishra & Soni (2022) implemented Neural Network, Naive Bayes, and Decision Tree models to detect smishing, with model accuracies above 93%, demonstrating the efficacy of machine learning in classifying text messages as legitimate (ham) or illegitimate (smishing) traffic.
b. About Dataset
The SMS Spam Collection is a compilation of labeled SMS messages gathered for the purpose of studying SMS spam. It contains a total of 5,574 English SMS messages, each classified as either legitimate (ham) or spam. The dataset stores one message per line, divided into two columns: v1 holds the label (ham or spam) and v2 holds the raw text. https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
c. Project Objectives
1. Use the Naive Bayes and Random Forest algorithms to detect SMS spam.
2. Compare the two models to determine which achieves higher detection accuracy.
2. Data Preparation
a. Loading libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(jiebaR)
## Warning: package 'jiebaR' was built under R version 4.2.2
## Loading required package: jiebaRD
## Warning: package 'jiebaRD' was built under R version 4.2.2
##
## Attaching package: 'jiebaR'
## The following object is masked from 'package:psych':
##
## distance
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
library(tm)
## Warning: package 'tm' was built under R version 4.2.2
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(quanteda)
## Warning: package 'quanteda' was built under R version 4.2.2
## Warning in stringi::stri_info(): Your current locale is not in the list
## of available locales. Some functions may not work properly. Refer to
## stri_locale_list() for more details on known locale specifiers.
## Warning in stringi::stri_info(): Your current locale is not in the list
## of available locales. Some functions may not work properly. Refer to
## stri_locale_list() for more details on known locale specifiers.
## Package version: 3.2.4
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
##
## stopwords
## The following objects are masked from 'package:NLP':
##
## meta, meta<-
b. Loading dataset
data <- read.csv("spam.csv", encoding = "Latin1")
head(data,5)
## v1
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## v2
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## X X.1 X.2
## 1
## 2
## 3
## 4
## 5
The dataset we use is a set of labeled data collected for SMS spam research, publicly available on Kaggle, an online open-source platform for data science collaboration. It contains 5,574 English SMS messages, each tagged by type as ham (legitimate) or spam (illegitimate). The file holds one message per line with two columns: (1) the label (ham or spam) and (2) the raw text.
This corpus has been collected and incorporated from the following sources: the Grumbletext website (425 SMS spam messages), a UK forum in which cell phone users make public claims about SMS spam; the NUS SMS Corpus (NSC), from which a subset of 3,375 ham messages was randomly chosen. The NSC is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science of the National University of Singapore, contributed largely by Singaporean volunteers, mostly students at the university, who were aware that their contributions would be made publicly available; and Caroline Tagg's PhD thesis (a list of 450 collected SMS ham messages).
c. Cleaning
data <- data[,-(3:5)] #Delete unneeded columns
colnames(data) <- c("label","message")
head(data,5)
## label
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## message
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
#Encode the labels: ham = 0, spam = 1
data[which(data$label=="ham"),1] <- "0"
data[which(data$label=="spam"),1] <- "1"
data$label <- as.integer(data$label)
table(data$label)
##
## 0 1
## 4825 747
sum(is.na(data)) #Check for missing values
## [1] 0
3. Exploratory Data Analysis
a. Feature Engineering
dt = data.frame(A = c(4825,747), B = c("ham","spam"))
dt = dt[order(dt$A, decreasing = TRUE),]
myLabel = as.vector(dt$B)
myLabel = paste(myLabel, "(", round(dt$A / sum(dt$A) * 100, 2), "%)", sep = "")
p = ggplot(dt, aes(x = "", y = A, fill = B)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(x = "", y = "", title = "") +
theme(axis.ticks = element_blank()) +
theme(legend.title = element_blank(), legend.position = "top") +
scale_fill_discrete(breaks = dt$B, labels = myLabel) +
theme(axis.text.x = element_blank()) +
geom_text(aes(y = A/2 + c(0, cumsum(A)[-length(A)]), x = sum(A)/5572, label = myLabel), size = 5)
p
btx <- data.frame(count = c(4825,747), label = as.character(c("0","1")))
ggplot(data=btx,mapping=aes(x=label,y=count,fill=label,group=factor(1)))+
geom_bar(stat="identity",width=0.5)
In this section we describe the exploratory data analysis (EDA) of the dataset. The dataset originally consists of two columns, v1 and v2: v1 holds the spam/ham tag and v2 holds the raw text. Before cleaning there are 5,572 messages in total, of which spam accounts for 13.41% and ham for 86.59%. The figures above show a clear imbalance between the amount of spam and ham text.
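Rather than hard-coding the class counts, the proportions can also be computed directly from the encoded label column; a minimal sketch, assuming data$label holds the 0/1 encoding created earlier:
#Class balance computed from the encoded labels (0 = ham, 1 = spam)
round(prop.table(table(data$label)) * 100, 2)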
Sys.setlocale(category = "LC_ALL",locale = "English_United States.1252")
## Warning in Sys.setlocale(category = "LC_ALL", locale = "English_United
## States.1252"): using locale code page other than 65001 ("UTF-8") may cause
## problems
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Sys.setlocale(category = "LC_ALL",locale = "English_United States.1252")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
#Number of Character
for (i in 1:nrow(data)){data$char[i] <- nchar(data$message[i])}
data$char[1] <- nchar(data$message[1])
#Number of words
wk <- worker()
for(i in 1:nrow(data)){data$words[i] <- length(segment(data$message[i],wk))}
data$words <- as.numeric(data$words)
#Number of comma-separated segments (a rough sentence count)
for(i in 1:nrow(data)){data$sen[i] <- lengths(strsplit(data$message[i],","))}
head(data)
## label
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 1
## message
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv
## char words sen
## 1 111 20 2
## 2 29 6 1
## 3 155 35 1
## 4 49 11 1
## 5 61 14 2
## 6 148 35 2
b. Descriptive Statistics
d1 <- describe(data[,-2]);d1
## vars n mean sd median trimmed mad min max range skew kurtosis
## label 1 5572 0.13 0.34 0 0.04 0.00 0 1 1 2.15 2.61
## char 2 5572 80.12 59.69 61 73.67 46.70 2 910 908 2.51 17.46
## words 3 5572 16.20 11.85 13 14.83 10.38 0 190 190 2.70 20.53
## sen 4 5572 1.34 0.81 1 1.16 0.00 1 14 13 4.33 31.06
## se
## label 0.00
## char 0.80
## words 0.16
## sen 0.01
d2 <- describe(data[which(data$label==0),-2]);d2
## vars n mean sd median trimmed mad min max range skew kurtosis
## label 1 4825 0.00 0.00 0 0.00 0.00 0 0 0 NaN NaN
## char 2 4825 71.02 58.02 52 62.20 35.58 2 910 908 3.39 25.14
## words 3 4825 14.67 11.78 11 12.88 7.41 0 190 190 3.39 26.27
## sen 4 4825 1.31 0.77 1 1.14 0.00 1 14 13 4.80 39.69
## se
## label 0.00
## char 0.84
## words 0.17
## sen 0.01
d3 <- describe(data[which(data$label==1),-2]);d3
## vars n mean sd median trimmed mad min max range skew kurtosis
## label 1 747 1.00 0.00 1 1.00 0.00 1 1 0 NaN NaN
## char 2 747 138.87 29.18 149 144.39 14.83 13 224 211 -1.78 3.19
## words 3 747 26.07 6.22 27 26.81 4.45 2 39 37 -1.19 1.56
## sen 4 747 1.54 1.03 1 1.30 0.00 1 7 6 2.71 8.94
## se
## label 0.00
## char 1.07
## words 0.23
## sen 0.04
##Visualization Comparison
data$label <- as.factor(data$label)
ggplot(data, aes(x = char, fill = label)) +
geom_histogram(position = "identity", alpha = 0.4, bins = 30) + scale_fill_brewer(palette = "Set1")
ggplot(data, aes(x = words, fill = label)) +
geom_histogram(position = "identity", bins = 30, alpha = 0.4) + scale_fill_brewer(palette = "Set1")
As a basic overview of the dataset, we counted the number of characters and words in the v2 text and visualized the distributions. According to Figure 2 (ham in red, spam in blue), the total character count of ham messages far exceeds that of spam messages, largely because there are far more ham messages than spam messages. The histograms reveal a further insight: most ham messages fall in the range of roughly 0-100 characters, while spam messages cluster around 100-150 characters. Likewise, ham messages are most often composed of roughly 0-25 words, while spam messages typically contain 25-40 words.
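The same comparison can be summarised numerically; a small sketch using the char and words columns computed earlier:
#Median character and word counts per class (0 = ham, 1 = spam)
aggregate(cbind(char, words) ~ label, data = data, FUN = median)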
#Remove outliers
i <- which(data$char>500)
data <- data[-i,]
#Correlation coefficient matrix
data$label <- as.integer(data$label)
data$label[which(data$label==1)] <- 0
data$label[which(data$label==2)] <- 1
data$label <- as.integer(data$label)
corx <- cor(data[,-2])
corrplot(corx)
corrplot(corx,method="shade",
shade.col=NA,
tl.col = "black",
tl.srt = 45,
addCoef.col = "white",
cl.pos = "n",
order="AOE")
This is a correlation matrix. Correlation analysis measures how strongly pairs of variables are related. In Figure 5, the colour scale runs from dark to light as the correlation runs from low to high; a solid light-coloured 1 on the diagonal means a variable is being compared with itself, so the correlation is exactly 1. We can also see that char and words have a correlation of 0.98, which is expected because the two variables essentially capture the same content: words are made up of characters.
Data pre-processing
#Convert all to lowercase
data$message = tolower(data$message)
#Remove stop words
sw <- stopwords("english")
head(sw,9)
## [1] "i" "me" "my" "myself" "we" "our"
## [7] "ours" "ourselves" "you"
for (i in 1:nrow(data)){data$message[i] <- removeWords(data$message[i],sw)}
#Removing URLs
for (i in 1:nrow(data)){data$message[i] <- gsub("http\\S+","",data$message[i])}
#Removing numbers
for (i in 1:nrow(data)){data$message[i] <- gsub("\\d+","",data$message[i])}
#Removing emails
for (i in 1:nrow(data)){data$message[i] <- gsub("\\S*@\\S*\\S?","",data$message[i])}
head(data,5)
## label
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## message
## 1 go jurong point, crazy.. available bugis n great world la e buffet... cine got amore wat...
## 2 ok lar... joking wif u oni...
## 3 free entry wkly comp win fa cup final tkts st may . text fa receive entry question(std txt rate)t&c's apply over's
## 4 u dun say early hor... u c already say...
## 5 nah think goes usf, lives around though
## char words sen
## 1 111 20 2
## 2 29 6 1
## 3 155 35 1
## 4 49 11 1
## 5 61 14 2
#Word frequency by class (label 1 = spam, label 0 = ham)
spam_wc <- data[which(data$label==1),]
ham_wc <- data[which(data$label==0),]
a <- c()
for (i in 1:nrow(spam_wc)){
b <- segment(spam_wc$message[i],wk)
a <- c(a,b)
}
c <- c()
for (i in 1:nrow(ham_wc)){
d <- segment(ham_wc$message[i],wk)
c <- c(c,d)
}
word_spam <- freq(a)
word_ham <- freq(c)
head(word_spam,10);head(word_ham,10)
## char freq
## 1 house 1
## 2 shit 1
## 3 sed 1
## 4 servs 1
## 5 inclu 1
## 6 chatlines 1
## 7 gsex 1
## 8 ball 1
## 9 spider 2
## 10 marvel 1
## char freq
## 1 pity 1
## 2 salesman 1
## 3 dump 1
## 4 dental 1
## 5 units 1
## 6 shud 1
## 7 kane 1
## 8 indians 1
## 9 influx 1
## 10 sudden 1
##Mapping word clouds
wordcloud2(word_spam,
size = 1,
fontFamily = 'Segoe UI',
fontWeight = 'bold',
color = 'random-dark',
backgroundColor = "white",
minRotation = -pi/4,
maxRotation = pi/4,
rotateRatio = 0.4,
shape = "circle"
)
wordcloud2(word_ham,
size = 1,
fontFamily = 'Segoe UI',
fontWeight = 'bold',
color = 'random-dark',
backgroundColor = "white",
minRotation = -pi/4,
maxRotation = pi/4,
rotateRatio = 0.4,
shape = "circle"
)
In this section, we tried to identify some commonalities between the fraudulent text messages. The dataset we use has a large amount of text information, and using word clouds makes it easy to quickly identify the most commonly used words in the text, which helps highlight key themes and topics. It is a striking visualization method for highlighting essential textual data points. It can make dull data shine and deliver crucial information quickly.
A word cloud is a grouping of words displayed in various sizes: the larger and bolder a term, the more frequently it appears in the document and the more important it is. Word clouds combine data visualization, text data, font colours, word frequency analysis, and per-word graphics. Also known as tag clouds or text clouds, they are an effective way to surface the most relevant parts of textual material, from blog posts to databases, and they can help users compare and contrast two separate pieces of text to identify similarities in phrasing.
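As a complement to the word clouds, the most frequent terms can also be shown as a simple bar chart; a sketch, assuming word_spam has the char and freq columns produced by freq() above:
#Top 10 most frequent terms in spam messages, shown as a horizontal bar chart
top_spam <- head(word_spam[order(word_spam$freq, decreasing = TRUE), ], 10)
ggplot(top_spam, aes(x = reorder(char, freq), y = freq)) +
geom_col() +
coord_flip() +
labs(x = "term", y = "frequency")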
The first word cloud above shows some of the most frequent terms across the spam SMS in the dataset. We now look at several of these frequent words to understand why they occur so often; a rough frequency check of a few of them is sketched after the list.
“Free”: This word is often used to entice people to open the message, as it suggests that the recipient has won something or that a special offer is available.
“Mobile” or “Call” : These words are often used to encourage recipients to contact the sender by phone, as it allows the spammer to reach the recipient more directly.
“Collect” or “Send” : These words are often used to direct the recipient to take a specific action, such as collecting a prize or sending personal information.
“SMS” or “txt” : These words are used to suggest that the message is being sent via text message, which is a common method of communication.
“Guaranteed” : This word is used to alleviate any concerns the recipient may have about the offer, and to make it seem more legitimate.
“Tone” or “Urgent” : These words are used to create a sense of urgency and to make the message seem more important.
“Chance to win” or “New” : These words are used to create excitement and to make the offer seem more attractive.
“Please call” or “Stop” : These words are often used to create a sense of urgency and to encourage the recipient to take immediate action.
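As a rough check on these observations, the share of spam and ham messages containing a few of these keywords can be compared directly; a minimal sketch, assuming data$message is the cleaned lowercase text and data$label is the 0/1 encoding from above (1 = spam):
#Proportion of messages containing each keyword, split by class (0 = ham, 1 = spam)
keywords <- c("free", "call", "txt", "urgent", "stop")
sapply(keywords, function(w) tapply(grepl(w, data$message, fixed = TRUE), data$label, mean))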
4. Modeling
a. Naive Bayes
The messages are classified with a naive Bayes classifier, where ham denotes legitimate (non-spam) messages and spam denotes unsolicited ones; the final model correctly classifies over 97% of text messages.
# ham stands for legitimate (non-spam) messages
# spam stands for unsolicited (spam) messages
DATA1 <- read.csv("spam.csv")#read sms data
head(DATA1)
## v1
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## 6 spam
## v2
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv
## X X.1 X.2
## 1
## 2
## 3
## 4
## 5
## 6
data=DATA1[0:2]
data <- sapply(data,function(row) iconv(row, "latin1", "ASCII", sub=""))
data <- data[,1:2]
data <- data.frame(data)
names(data) <- c('type','text')
data$type <- factor(data$type)
table(data$type)# Check Variables
##
## ham spam
## 4825 747
The first line imports the ‘tm’ library, which contains functions for text mining and natural language processing.
library('tm')#Building a corpus (SMS content)
sms_corpus <- VCorpus(VectorSource(data$text))#Observe the summary of the first and second text messages
head(sms_corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 6
print(sms_corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5572
The corpus of SMS text messages was created with the “VCorpus” function from the ‘tm’ library using the text stored in the “data” variable; “inspect” summarises the first two documents.
inspect(sms_corpus[1:2])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 111
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 29
as.character(sms_corpus[[1]])#Watch the first text message content
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
lapply(sms_corpus[1:2], as.character)
## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
##
## $`2`
## [1] "Ok lar... Joking wif u oni..."
The “tolower” function is used to convert all text in the corpus to lowercase, the “removeNumbers” function is used to remove numbers from the text, “removeWords” function is used to delete filler words such as “to”, “and”, “but” which are defined in stopwords().
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))#All information in lowercase
as.character(sms_corpus[[1]])
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
as.character(sms_corpus_clean[[1]])
## [1] "go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat..."
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)# remove numbers
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())#Delete filler words, such as to,and,but
The “removePunctuation” function is used to remove punctuation from the text. The “replacePunctuation” function is used to replace any punctuation marks with spaces.
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)#Remove punctuation
removePunctuation("hello...world")#Create gsub functions # Replace any punctuation marks with spaces
## [1] "helloworld"
replacePunctuation <- function(x) { gsub("[[:punct:]]+", " ", x) }
replacePunctuation("hello...world")
## [1] "hello world"
The “SnowballC” library is imported; it contains the “wordStem” function, which reduces words to their root (stem) form.
library(SnowballC)#This function changes all words to their original form
wordStem(c("learn", "learned", "learning", "learns"))
## [1] "learn" "learn" "learn" "learn"
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)#In order for wordStem to be applied to the entire text corpus
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)# Delete extra spaces
lapply(sms_corpus[1:3], as.character)#Observe the results before and after some transformations
## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
##
## $`2`
## [1] "Ok lar... Joking wif u oni..."
##
## $`3`
## [1] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
lapply(sms_corpus_clean[1:3], as.character)
## $`1`
## [1] "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
##
## $`2`
## [1] "ok lar joke wif u oni"
##
## $`3`
## [1] "free entri wkli comp win fa cup final tkts st may text fa receiv entri questionstd txt ratetc appli s"
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)#Creating sparse matrices
sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list( tolower = TRUE, removeNumbers = TRUE, stopwords = TRUE, removePunctuation = TRUE, stemming = TRUE))
The second “DocumentTermMatrix” function is used to create a sparse matrix of the cleaned text data with the control parameter.
#create a document-term sparse matrix directly from the SMS corpus
sms_dtm3 <- DocumentTermMatrix(sms_corpus, control = list( tolower = TRUE, removeNumbers = TRUE, stopwords = function(x) { removeWords(x, stopwords()) }, removePunctuation = TRUE, stemming = TRUE))
#alternative solution: using custom stop words function ensures identical result
sms_dtm#compare the result
## <<DocumentTermMatrix (documents: 5572, terms: 6472)>>
## Non-/sparse entries: 42009/36019975
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
sms_dtm2#compare the result
## <<DocumentTermMatrix (documents: 5572, terms: 6866)>>
## Non-/sparse entries: 43435/38213917
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
sms_dtm3#compare the result
## <<DocumentTermMatrix (documents: 5572, terms: 6472)>>
## Non-/sparse entries: 42009/36019975
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
This code is using a document-term matrix (DTM) to train and test a Naive Bayes classifier for text classification.
sms_dtm_train <- sms_dtm[1:4169, ]#creating training and test datasets
sms_dtm_test <- sms_dtm[4170:5559, ]
sms_train_labels <- data[1:4169, ]$type#also save the labels
sms_test_labels <- data[4170:5559, ]$type
prop.table(table(sms_train_labels))#check that the proportion of spam is similar
## sms_train_labels
## ham spam
## 0.8644759 0.1355241
prop.table(table(sms_test_labels))
## sms_test_labels
## ham spam
## 0.8705036 0.1294964
spam <- subset(data, type == "spam")#Get a subset
ham <- subset(data, type == "ham")
sms_dtm_freq_train <- removeSparseTerms(sms_dtm_train, 0.999)#Eliminate words that occur less than 0.1% of the total number of records in the training data
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)#Find the word that appears at least 5 times
str(sms_freq_words)
## chr [1:1155] "abiola" "abl" "abt" "accept" "access" "account" "across" ...
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]#Keep only the frequent terms found above
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
The “convert_counts” helper converts word counts into a categorical indicator: “Yes” if the word appears in a message at all, “No” otherwise.
convert_counts <- function(x) { x <- ifelse(x > 0, "Yes", "No")}#Any count greater than 0 becomes Yes
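For example, applied to a small vector of counts the helper behaves as follows (an illustrative call, not part of the original script):
convert_counts(c(0, 2, 5)) #returns "No" "Yes" "Yes"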
The next two lines use the “apply” function to convert the counts in the DTM into these Yes/No features, column by column (MARGIN = 2).
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)#Replace all that are not 0 in the corpus with Yes
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
The “e1071” library is imported; it contains the “naiveBayes” function used to train the classifier on the training data and labels.
library(e1071)# Classification model building
## Warning: package 'e1071' was built under R version 4.2.2
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
sms_test_pred <- predict(sms_classifier, sms_test)#Evaluate model performance
The “gmodels” library is imported and the “CrossTable” function is used to create a confusion matrix of the predictions compared to the true labels. A second classifier is then trained with Laplace smoothing (laplace = 1).
library(gmodels)#Performance improvement of the model
## Warning: package 'gmodels' was built under R version 4.2.2
CrossTable(sms_test_pred, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1390
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1202 | 25 | 1227 |
## | 0.993 | 0.139 | |
## -------------|-----------|-----------|-----------|
## spam | 8 | 155 | 163 |
## | 0.007 | 0.861 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1210 | 180 | 1390 |
## | 0.871 | 0.129 | |
## -------------|-----------|-----------|-----------|
##
##
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
#Attempting to improve the model: with the laplace value set to 1, 35 SMS messages are misclassified (see the table below)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
The “CrossTable” function is used again to create a confusion matrix of the predictions from the second classifier compared to the true labels
CrossTable(sms_test_pred2, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1390
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1185 | 10 | 1195 |
## | 0.979 | 0.056 | |
## -------------|-----------|-----------|-----------|
## spam | 25 | 170 | 195 |
## | 0.021 | 0.944 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1210 | 180 | 1390 |
## | 0.871 | 0.129 | |
## -------------|-----------|-----------|-----------|
##
##
#The final model correctly classifies over 97% of SMS messages into spam and non-spam messages
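As a quick sanity check, the overall accuracy implied by the table above can be recomputed directly from the stored predictions (an illustrative addition, not part of the original script):
#Overall test-set accuracy of the Laplace-smoothed classifier
cm <- table(predicted = sms_test_pred2, actual = sms_test_labels)
sum(diag(cm)) / sum(cm)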
b. Random Forest
We implement a random forest model by first building binary indicator variables from the message text in the second column of the dataset. Sequences of digits turn out to be by far the most important variable for predicting whether a message is spam. Website addresses and words such as “reply”, “call”, “send”, or “free” are also among the most important variables (the uppercase-letter and pound-sign indicators end up uninformative here, because the text was lowercased and non-ASCII characters were stripped when the data was loaded). On the test data, the model achieves more than 98% accuracy, and the sensitivity and specificity rates also look good.
library('ggplot2') # visualization
library('randomForest') # classification algorithm
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library('ROCR')
## Warning: package 'ROCR' was built under R version 4.2.2
Read the SMS data.
DATA1 <- read.csv("spam.csv")
data=DATA1[0:2]
data <- sapply(data,function(row) iconv(row, "latin1", "ASCII", sub=""))
data <- data[,1:2]
data <- data.frame(data)
names(data) <- c("Spam", "Msg")
Convert all messages to lowercase and find the most frequently used words.
data$Msg=lapply(data$Msg,FUN="tolower")
words<-paste(data$Msg,collapse=" ")
words<-strsplit(as.character(words),"(\\.)|(\\?)|( )|(,)")
mostusedwords<-sort(table(words),decreasing=TRUE)[2:101]
mostusedwords<-names(mostusedwords)
Remove short filler words like “to”, “a”, “you”, and keep only words longer than 3 characters.
mostusedwords<-(mostusedwords[nchar(mostusedwords)>3])[-11]#There's a string that's not a real word. move it off.
head(mostusedwords)
## [1] "your" "call" "have" "that" "will" "with"
Recode the labels and build the binary indicator variables.
newdata<-as.character(data$Spam)#Convert to integer
#For each of these characters, create a variable equal to 1 if the observation contains the character, 0 otherwise
newdata[newdata=="ham"]<-"0"
newdata[newdata=="spam"]<-"1"
newdata<-as.data.frame(as.factor(newdata))
characters<-c("[0-9]{4}","[A-Z]{4}","£","\\!\\!","\\,","\\.\\.\\.","www")
var<-c(characters,mostusedwords)
for (i in var){
newdata<-cbind(newdata,as.numeric(grepl(i,data$Msg)))
}
names(newdata)<-c("Spam","Numbers","upperCase","Pounds","exclMark","Coma","suspPoints","WWW",mostusedwords)
names(newdata)[c(34,44)]=c("ill","dont2")
Set a random seed and split the dataset into a training set and a test set.
set.seed(123)
train<-sample(1:nrow(newdata),floor(0.7*nrow(data)),replace=FALSE)
test <- -train #negative indices select the rows not in the training sample
data_train<-newdata[train,]
data_test<-newdata[test,]
Apply the random forest to the new dataset, which contains only 0/1 binary variables, and find the best number of variables to try at each split.
mtry <- tuneRF(data_train[,2:49],data_train$Spam, ntreeTry=500,
stepFactor=1.5,improve=0.01, trace=TRUE, plot=TRUE)
## mtry = 6 OOB error = 1.74%
## Searching left ...
## mtry = 4 OOB error = 1.79%
## -0.02941176 0.01
## Searching right ...
## mtry = 9 OOB error = 1.72%
## 0.01470588 0.01
## mtry = 13 OOB error = 1.74%
## -0.01492537 0.01
best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
print(mtry)
## mtry OOBError
## 4.OOB 4 0.01794872
## 6.OOB 6 0.01743590
## 9.OOB 9 0.01717949
## 13.OOB 13 0.01743590
print(best.m)
## [1] 9
The out-of-bag accuracy is about 98.3% (OOB error rate of 1.74%).
rf <-randomForest(formula=Spam~.,data=data_train, mtry=best.m, importance=TRUE,ntree=500)
print(rf)
##
## Call:
## randomForest(formula = Spam ~ ., data = data_train, mtry = best.m, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 9
##
## OOB estimate of error rate: 1.74%
## Confusion matrix:
## 0 1 class.error
## 0 3366 13 0.003847292
## 1 55 466 0.105566219
plot(rf$err.rate[, 1], type = "l", xlab = "number of trees", ylab = "OOB error")
importance(rf)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## Numbers 172.0952632 158.9210477 192.7006745 472.5880208
## upperCase 0.0000000 0.0000000 0.0000000 0.0000000
## Pounds 0.0000000 0.0000000 0.0000000 0.0000000
## exclMark 1.1172944 0.9518261 1.4611442 0.3897062
## Coma 2.5576589 5.5515550 5.1631299 5.0863397
## suspPoints 15.6240145 28.2926274 31.1079850 9.1870434
## WWW 42.1480971 37.0279400 45.5684279 43.2396349
## your 17.3968641 15.1039507 22.1924877 18.2203764
## call 4.9739854 11.0261732 12.6937196 63.5490818
## have 7.9207155 9.5891217 12.2705674 7.2832101
## that 8.5722514 12.2016499 14.9669126 4.0561673
## will 1.2846415 6.7813127 4.5745800 4.9889402
## with 5.0002467 9.8576525 8.9888667 5.3972163
## just 1.4625509 1.9019000 2.2428273 2.3665446
## this 7.5962640 11.9049181 12.5090645 5.6678405
## when 3.6428213 8.6545601 8.3442605 3.8258773
## from 6.9154267 8.1305261 10.5303280 10.8283921
## what 3.7842649 8.1035873 8.3404882 2.0047049
## free 24.5118944 18.2338050 28.9535444 62.6756473
## know 6.9854280 8.3851279 10.2243796 2.2938199
## like 2.5398414 5.1816434 5.2261148 1.5303796
## then 2.9031492 9.1363941 9.2942601 1.9013251
## good 1.0109416 5.2688058 4.5184982 1.4808404
## come -1.7205852 8.8227633 5.8765615 2.6485506
## time -0.4881598 2.5324650 1.2915051 1.7262956
## only 5.1756840 1.4668279 4.4397403 4.9464783
## love 7.0736281 6.0648690 9.0664503 1.8468871
## there -0.3526756 6.5865567 4.9595361 1.2709411
## send 3.9678642 12.1730326 10.7051732 6.5700517
## want 0.2991947 3.0837357 1.9210482 1.2952892
## text 9.3254846 10.3293970 11.8658324 19.8174526
## going 3.5903117 6.2480815 7.0031946 2.0117357
## ill -0.5891839 8.3433648 7.9086753 2.7600318
## need 0.5714576 4.1866527 3.3650452 1.0664132
## about 1.6957860 6.0007984 3.9627875 1.9372744
## home -0.9031657 5.9859760 4.7334337 1.1271422
## sorry 0.1930208 5.3476282 4.7436238 1.8187669
## still 0.6187893 3.5846136 3.3594496 0.1975442
## dont -1.0365547 2.6521010 0.1766586 1.4798857
## back -2.6287934 5.6372402 0.9127701 2.9239906
## stop 17.9283156 19.2062722 24.6281754 26.6493107
## reply 18.4912200 17.4257333 21.5951034 19.0004104
## dont2 2.2824449 4.2306777 4.3941038 0.8357554
## tell -1.5755429 4.6085529 1.4401804 0.9694497
## take 1.3340541 3.7569550 3.0861831 1.3830636
## today 0.3267968 3.0031252 2.2927939 1.3598401
## later 1.2027158 6.8765101 5.2442509 2.7529123
## think 1.1453301 5.5068129 5.2540978 1.3322912
varImpPlot(rf)
Predict on the test dataset.
pred1=as.numeric(predict(rf,type = "response",data_test))-1
head(pred1)
## [1] 1 0 0 0 0 0
table(pred1,as.numeric(data_test[,1])-1)
##
## pred1 0 1
## 0 1445 20
## 1 1 206
Evaluate the performance of the random forest for classification.
pred2 <- predict(rf,type = "prob",data_test)
prediction() is the ROCR function that pairs the predicted spam probabilities with the true labels.
perf <- prediction(pred2[,2], data_test$Spam)
Measure performance in terms of true and false positive rates, and compute the AUC.
auc <- performance(perf, "auc")
pred3 = performance(perf, "tpr","fpr")
plot(pred3,main="ROC Curve for Random Forest",col=2,lwd=2)
abline(a=0,b=1,lwd=2,lty=2,col="gray")
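The area under the ROC curve can also be reported numerically from the auc object computed above (ROCR stores it in the y.values slot):
#AUC value extracted from the ROCR performance object
auc@y.values[[1]]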
5. Conclusion
In conclusion, both the Naive Bayes and Random Forest algorithms were effective at filtering SMS spam messages. Based on the results above, the best-performing algorithm is the random forest classifier, with an accuracy of about 98.3%, the highest among the models tested. The results also show that the random forest minimizes the misclassification of normal text messages as spam. The Naive Bayes model performed slightly worse, with a test-set accuracy of roughly 97.5%.
We believe the imbalance between spam and non-spam in the dataset can bias models trained with either algorithm, and balancing the dataset carries a risk of overfitting or losing key information. The random forest algorithm, however, is not very sensitive to imbalanced data: its effect can be mitigated by adjusting the classifier to the unbalanced classes, so it is less affected than the Naive Bayes algorithm and naturally performs better; a sketch of one such adjustment follows.
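One way to apply such an adjustment without resampling the full dataset is stratified sampling inside the random forest itself; a minimal sketch (the per-class sample sizes below are illustrative placeholders, not tuned values):
#Illustrative only: draw a balanced bootstrap sample from each class for every tree
rf_balanced <- randomForest(x = data_train[, -1], y = data_train$Spam,
strata = data_train$Spam,
sampsize = c(400, 400),
ntree = 500)
rf_balanced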