Contents
1. Introduction
a. Background; b. About Dataset; c. Project Objectives
2. Data Preparation
a. Loading libraries; b. Loading dataset; c. Cleaning
3. EDA
4. Modeling
5. Evaluation
6. Members' Contribution
1. Introduction
a. Background
Text messaging, better known as the Short Message Service (SMS), is one of the essential communication platforms and has become an effective marketing tool for businesses. Its worldwide dominance and convenience, however, have also drawn criminals, who design harmful spam messages for delivery to millions of people. Exploiting advances in mobile technology, they carry out smishing attacks, propagating scams over mobile networks in a swift manner. The text messages often contain links that lure unsuspecting victims to phishing sites, where the victims expose themselves to the risk of monetary loss by divulging personal information, downloading malware onto their mobile devices, or providing a one-time passcode that allows a criminal to bypass multi-factor authentication (MFA). Drager (2022) reports that SMS attacks have been skyrocketing: the attack rate rose by around 328% in 2020 and grew by a further 700% during the first half of 2021.
As a countermeasure, organizations and researchers develop robust and effective spam filters that screen text messages before they are delivered to recipients. In particular, machine learning models have been used to filter, detect, and classify message traffic. For instance, Mishra & Soni (2022) implemented Neural Network, Naïve Bayes, and Decision Tree models that detected smishing with accuracies above 93%, demonstrating the efficacy of machine learning in classifying messages as legitimate (ham) or illegitimate (smishing).
b. About Dataset
The SMS Spam Collection is a compilation of labeled SMS messages gathered for the purpose of studying SMS spam. It comprises 5,574 English messages, each classified as either legitimate (ham) or spam, with one message per line. Each line is divided into two columns: v1 holds the label (ham or spam) and v2 holds the raw text.
c. Project Objectives
1. To build machine learning models that classify SMS messages as legitimate (ham) or spam.
2. To evaluate and compare the performance of these models on the SMS Spam Collection dataset.
2. Data Preparation
a. Loading libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(jiebaR)
## Warning: package 'jiebaR' was built under R version 4.2.2
## Loading required package: jiebaRD
## Warning: package 'jiebaRD' was built under R version 4.2.2
##
## Attaching package: 'jiebaR'
## The following object is masked from 'package:psych':
##
## distance
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
library(tm)
## Warning: package 'tm' was built under R version 4.2.2
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(quanteda)
## Warning: package 'quanteda' was built under R version 4.2.2
## Warning in stringi::stri_info(): Your current locale is not in the list
## of available locales. Some functions may not work properly. Refer to
## stri_locale_list() for more details on known locale specifiers.
## Package version: 3.2.4
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 16 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
##
## stopwords
## The following objects are masked from 'package:NLP':
##
## meta, meta<-
b. Loading dataset
data <- read.csv("spam.csv", encoding = "latin1")
head(data,5)
## v1
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## v2
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## X X.1 X.2
## 1
## 2
## 3
## 4
## 5
As noted in the introduction, the dataset is the SMS Spam Collection, publicly available on Kaggle, an online open-source platform for data science collaborations. Besides the label column (v1) and the raw-text column (v2), the imported file contains three additional, essentially empty columns (X, X.1, X.2), which we drop in the cleaning step below.
c. Cleaning
data <- data[,-(3:5)] # drop the unneeded columns
colnames(data) <- c("label","message") # rename the columns
head(data,5)
## label
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## message
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
# Encode the labels: ham = 0, spam = 1
data[which(data$label=="ham"),1] <- "0"
data[which(data$label=="spam"),1] <- "1"
data$label <- as.integer(data$label) # convert to integer
table(data$label) # frequency table
##
## 0 1
## 4825 747
sum(is.na(data)) # check for missing values
## [1] 0
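As a light sanity check on the recode, the following assertions can be run; this is a small sketch and not part of the original pipeline:
# Sanity checks: labels are strictly 0/1 and every message has text
stopifnot(all(data$label %in% c(0L, 1L)))
stopifnot(all(nchar(data$message) > 0))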
3. EDA
dt = data.frame(A = c(4825,747), B = c("ham","spam"))
dt = dt[order(dt$A, decreasing = TRUE),]
myLabel = as.vector(dt$B)
myLabel = paste(myLabel, "(", round(dt$A / sum(dt$A) * 100, 2), "%)", sep = "")
p = ggplot(dt, aes(x = "", y = A, fill = B)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(x = "", y = "", title = "") +
theme(axis.ticks = element_blank()) +
theme(legend.title = element_blank(), legend.position = "top") +
scale_fill_discrete(breaks = dt$B, labels = myLabel) +
theme(axis.text.x = element_blank()) +
geom_text(aes(y = A/2 + c(0, cumsum(A)[-length(A)]), x = sum(A)/5572, label = myLabel), size = 5)
## Add percentage labels to the chart: x adjusts the label's distance from the centre, y its position along the arc
p
btx <- data.frame(count = c(4825,747), label = as.character(c("0","1")))
ggplot(data=btx,mapping=aes(x=label,y=count,fill=label,group=factor(1)))+
geom_bar(stat="identity",width=0.5)
In this section we describe the exploratory data analysis (EDA) of the dataset. The raw data comprise 5,572 messages across two columns: v1 holds the spam/ham tag and v2 holds the raw text. Before cleaning, spam messages account for 13.41% of the total and ham messages for 86.59%. As the figures above show, the two classes are heavily imbalanced.
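The counts above are hard-coded into the plotting code; as a small sketch, the same figures can be derived directly from the cleaned data frame:
# Class counts and percentages computed from the data (0 = ham, 1 = spam)
counts <- table(data$label)
counts
round(prop.table(counts) * 100, 2) # roughly 86.59% ham vs 13.41% spam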
## Feature Engineering
Sys.setlocale(category = "LC_ALL",locale = "English_United States.1252")
## Warning in Sys.setlocale(category = "LC_ALL", locale = "English_United
## States.1252"): using locale code page other than 65001 ("UTF-8") may cause
## problems
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Sys.setlocale(category = "LC_ALL",locale = "English_United States.1252")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
# Character count (nchar is vectorised, so no loop is needed)
data$char <- nchar(data$message)
# Word count, using the jiebaR tokenizer
wk <- worker()
data$words <- sapply(data$message, function(m) length(segment(m, wk)), USE.NAMES = FALSE)
# Sentence count, approximated by the number of comma-separated segments
data$sen <- lengths(strsplit(data$message, ","))
head(data)
## label
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 1
## message
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
## char words sen
## 1 111 20 2
## 2 29 6 1
## 3 155 35 1
## 4 49 11 1
## 5 61 14 2
## 6 148 35 2
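To make the three features concrete, here is the same computation for the first message only (a sketch reusing the wk tokenizer; the values match row 1 of the table above):
# Feature values for message 1: 111 characters, 20 tokens, 2 comma-separated segments
msg <- data$message[1]
nchar(msg)
length(segment(msg, wk))
lengths(strsplit(msg, ","))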
## Descriptive Statistics
d1 <- describe(data[,-2]);d1 # all messages
## vars n mean sd median trimmed mad min max range skew kurtosis
## label 1 5572 0.13 0.34 0 0.04 0.00 0 1 1 2.15 2.61
## char 2 5572 80.12 59.69 61 73.67 46.70 2 910 908 2.51 17.46
## words 3 5572 16.20 11.85 13 14.83 10.38 0 190 190 2.70 20.53
## sen 4 5572 1.34 0.81 1 1.16 0.00 1 14 13 4.33 31.06
## se
## label 0.00
## char 0.80
## words 0.16
## sen 0.01
d2 <- describe(data[which(data$label==0),-2]);d2 # ham messages
## vars n mean sd median trimmed mad min max range skew kurtosis
## label 1 4825 0.00 0.00 0 0.00 0.00 0 0 0 NaN NaN
## char 2 4825 71.02 58.02 52 62.20 35.58 2 910 908 3.39 25.14
## words 3 4825 14.67 11.78 11 12.88 7.41 0 190 190 3.39 26.27
## sen 4 4825 1.31 0.77 1 1.14 0.00 1 14 13 4.80 39.69
## se
## label 0.00
## char 0.84
## words 0.17
## sen 0.01
d3 <- describe(data[which(data$label==1),-2]);d3 # spam messages
## vars n mean sd median trimmed mad min max range skew kurtosis
## label 1 747 1.00 0.00 1 1.00 0.00 1 1 0 NaN NaN
## char 2 747 138.87 29.18 149 144.39 14.83 13 224 211 -1.78 3.19
## words 3 747 26.07 6.22 27 26.81 4.45 2 39 37 -1.19 1.56
## sen 4 747 1.54 1.03 1 1.30 0.00 1 7 6 2.71 8.94
## se
## label 0.00
## char 1.07
## words 0.23
## sen 0.04
## Visual Comparison
data$label <- as.factor(data$label)
# Character count
ggplot(data, aes(x = char, fill = label)) +
# Histogram: position = "identity" overlays the two groups instead of stacking them
geom_histogram(position = "identity", alpha = 0.4, bins = 30) + scale_fill_brewer(palette = "Set1")
# Word count
ggplot(data, aes(x = words, fill = label)) +
geom_histogram(position = "identity", bins = 30, alpha = 0.4) + scale_fill_brewer(palette = "Set1")
As a basic overview of the dataset content, we counted the characters and words in each v2 message and visualized the distributions. In Figure 2 (ham in red, spam in blue), ham accounts for far more total characters than spam, simply because there are many more ham messages. The histograms reveal a more useful pattern: most ham messages fall in the >0-100 character range, while most spam messages fall between roughly 100 and 150 characters. Likewise, ham messages typically contain >0-25 words, whereas spam messages typically contain 25-40 words.
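A quick numeric check of this reading, using the char and words features computed earlier (the medians agree with the describe() tables above):
# Median message length per class: spam is markedly longer than ham
aggregate(cbind(char, words) ~ label, data = data, FUN = median)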
# Remove outliers (messages longer than 500 characters)
i <- which(data$char>500)
data <- data[-i,]
# Correlation matrix: convert the factor label back to 0/1 integers
# (as.integer() on a factor yields 1/2, so recode)
data$label <- as.integer(data$label)
data$label[which(data$label==1)] <- 0
data$label[which(data$label==2)] <- 1
corx <- cor(data[,-2])
corrplot(corx)
corrplot(corx,method="shade",
shade.col=NA,
tl.col = "black",
tl.srt = 45,
addCoef.col = "white",
cl.pos = "n",
order="AOE")
This is a correlation matrix; it measures, for every pair of numeric variables, how strongly the two are related. In Figure 5 the colour runs from dark to light as the correlation rises. The light-coloured 1s mark each variable's correlation with itself, which is always exactly 1. More interestingly, char and words correlate at 0.98, which is expected: the two features capture essentially the same information, since words are made up of characters.
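A direct numeric check of the strongest off-diagonal entry (a one-line sketch):
# Correlation between character and word counts, roughly 0.98
cor(data$char, data$words)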
## Data Preprocessing
# Convert everything to lower case
data$message = tolower(data$message)
# Remove English stop words
sw <- stopwords("english")
head(sw,9)
## [1] "i" "me" "my" "myself" "we" "our"
## [7] "ours" "ourselves" "you"
data$message <- removeWords(data$message, sw) # removeWords is vectorised
# Remove web links
data$message <- gsub("http\\S+", "", data$message)
# Remove digits
data$message <- gsub("\\d+", "", data$message)
# Remove email addresses
data$message <- gsub("\\S*@\\S*\\S?", "", data$message)
head(data,5)
## label
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## message
## 1 go jurong point, crazy.. available bugis n great world la e buffet... cine got amore wat...
## 2 ok lar... joking wif u oni...
## 3 free entry wkly comp win fa cup final tkts st may . text fa receive entry question(std txt rate)t&c's apply over's
## 4 u dun say early hor... u c already say...
## 5 nah think goes usf, lives around though
## char words sen
## 1 111 20 2
## 2 29 6 1
## 3 155 35 1
## 4 49 11 1
## 5 61 14 2
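For reuse, the cleaning steps above could be bundled into a single helper; clean_sms below is a hypothetical name, not part of the original pipeline, and applies the same operations in the same order:
# Hypothetical helper bundling the preprocessing steps above
clean_sms <- function(x, sw = stopwords("english")) {
  x <- tolower(x)                  # lower-case everything
  x <- removeWords(x, sw)          # drop English stop words
  x <- gsub("http\\S+", "", x)     # drop web links
  x <- gsub("\\d+", "", x)         # drop digits
  gsub("\\S*@\\S*\\S?", "", x)     # drop email addresses
}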
# Word-frequency counts per class (label 1 = spam, label 0 = ham)
spam_wc <- data[which(data$label==1),]
ham_wc <- data[which(data$label==0),]
spam_tokens <- unlist(lapply(spam_wc$message, function(m) segment(m, wk)))
ham_tokens <- unlist(lapply(ham_wc$message, function(m) segment(m, wk)))
word_spam <- freq(spam_tokens)
word_ham <- freq(ham_tokens)
head(word_ham,10);head(word_spam,10)
## char freq
## 1 pity 1
## 2 salesman 1
## 3 dump 1
## 4 dental 1
## 5 units 1
## 6 shud 1
## 7 kane 1
## 8 indians 1
## 9 influx 1
## 10 sudden 1
## char freq
## 1 house 1
## 2 shit 1
## 3 sed 1
## 4 servs 1
## 5 inclu 1
## 6 chatlines 1
## 7 gsex 1
## 8 ball 1
## 9 spider 2
## 10 marvel 1
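The head() calls above show unsorted entries, most with frequency 1; sorting by frequency surfaces the genuinely common terms that the word cloud emphasises (a sketch; top_spam is a new name):
# Most frequent spam tokens, in decreasing order of frequency
top_spam <- word_spam[order(word_spam$freq, decreasing = TRUE), ]
head(top_spam, 10)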
## Word Cloud
wordcloud2(word_spam,
size = 1, # font size
fontFamily = 'Segoe UI', # font family
fontWeight = 'bold', # font weight
color = 'random-dark', # font colour
backgroundColor = "white", # background colour
minRotation = -pi/4, # minRotation and maxRotation bound the text rotation angle
maxRotation = pi/4,
rotateRatio = 0.4, # probability of rotation: 0.4 means about 40% of words are rotated
shape = "circle" # outline shape
)
In this section, we try to identify commonalities among the fraudulent text messages. Because the dataset contains a large amount of text, a word cloud makes it easy to spot the most commonly used words at a glance, highlighting key themes and topics.
A word cloud (also called a tag cloud or text cloud) displays words in varying sizes: the larger and bolder a term appears, the more frequently it occurs in the corpus and the more important it is. Word clouds are a convenient technique for surfacing the most relevant parts of textual material, from blog posts to databases, and can also help compare two separate pieces of text for similarities in phrasing.
The word cloud above shows some of the most frequent words across all spam SMS. Below, we examine several of these words and consider why they occur so often.
- “Free”: often used to entice people to open the message by suggesting the recipient has won something or that a special offer is available.
- “Mobile” or “Call”: encourage recipients to contact the sender by phone, which lets the spammer reach the recipient more directly.
- “Collect” or “Send”: direct the recipient to take a specific action, such as collecting a prize or sending personal information.
- “SMS” or “txt”: signal that the message is being sent via text message, a common method of communication.
- “Guaranteed”: alleviates any concerns the recipient may have about the offer, making it seem more legitimate.
- “Tone” or “Urgent”: create a sense of urgency and make the message seem more important.
- “Chance to win” or “New”: create excitement and make the offer seem more attractive.
- “Please call” or “Stop”: create a sense of urgency and encourage the recipient to take immediate action.
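As a closing check, the frequencies of the terms discussed above can be looked up directly in the spam vocabulary (a sketch; single-word terms only, since freq() counts individual tokens, and all text was lower-cased during preprocessing):
# Frequencies of the highlighted single-word terms in the spam vocabulary
terms <- c("free", "mobile", "call", "collect", "send", "txt", "guaranteed", "urgent", "stop", "new")
word_spam[word_spam$char %in% terms, ]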