Contents
1. Introduction
a. Background; b. About Dataset; c. Project Objectives
2. Data Preparation
a. Loading libraries; b. Loading dataset; c. Cleaning
3. EDA
4. Modeling
5. Evaluation
6. Members' Contribution
1. Introduction
a. Background
Text messaging, better known as the Short Message Service (SMS), is one of the essential communication platforms and has become an effective marketing tool for businesses. Its worldwide dominance and convenience, however, have also drawn criminals, who design harmful spam messages for delivery to millions of people. Exploiting advances in mobile technology, they carry out smishing attacks, propagating scams over mobile networks in a swift manner. The text messages often contain links that lure unsuspecting victims to phishing sites, where the victims expose themselves to the risk of monetary loss by divulging personal information, downloading malware onto their mobile devices, or providing a one-time passcode that allows a criminal to bypass multi-factor authentication (MFA). Drager (2022) reports that SMS attacks have been skyrocketing: the attack rate rose by around 328% in 2020 and grew by a further 700% during the first half of 2021.
As a countermeasure, organizations and researchers develop robust and effective spam filters that screen text messages before they are delivered to recipients. In particular, machine learning models have been used to filter, detect, and classify message traffic. For instance, Mishra & Soni (2022) implemented Neural Network, Naïve Bayes, and Decision Tree models that detected smishing with accuracies above 93%, demonstrating the efficacy of machine learning in classifying messages as legitimate (ham) or illegitimate (smishing).
b. About Dataset
The SMS Spam Collection is a compilation of labeled SMS messages gathered for the purpose of studying SMS spam. It comprises 5,574 English messages, each classified as either legitimate (ham) or spam, with one message per line. Each line is divided into two columns: v1 holds the label (ham or spam) and v2 holds the raw text.
c. Project Objectives
1. To build machine learning models that classify SMS messages as legitimate (ham) or spam.
2. To evaluate and compare the performance of these models on the SMS Spam Collection dataset.
2. Data Preparation
a. Loading libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(jiebaR)
## Warning: package 'jiebaR' was built under R version 4.2.2
## Loading required package: jiebaRD
## Warning: package 'jiebaRD' was built under R version 4.2.2
##
## Attaching package: 'jiebaR'
## The following object is masked from 'package:psych':
##
## distance
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
library(tm)
## Warning: package 'tm' was built under R version 4.2.2
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(quanteda)
## Warning: package 'quanteda' was built under R version 4.2.2
## Warning in stringi::stri_info(): Your current locale is not in the list
## of available locales. Some functions may not work properly. Refer to
## stri_locale_list() for more details on known locale specifiers.
## Package version: 3.2.4
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 16 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
##
## stopwords
## The following objects are masked from 'package:NLP':
##
## meta, meta<-
b. Loading dataset
data <- read.csv("spam.csv", encoding = "latin1")
head(data,5)
## v1
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## v2
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## X X.1 X.2
## 1
## 2
## 3
## 4
## 5
As noted in the introduction, the dataset is the SMS Spam Collection, publicly available on Kaggle, an online open-source platform for data science collaborations. Besides the label column (v1) and the raw-text column (v2), the imported file contains three additional, essentially empty columns (X, X.1, X.2), which we drop in the cleaning step below.
c. Cleaning
data <- data[,-(3:5)] # drop the unneeded columns
colnames(data) <- c("label","message") # rename the columns
head(data,5)
## label
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## message
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
# Encode the labels: ham = 0, spam = 1
data[which(data$label=="ham"),1] <- "0"
data[which(data$label=="spam"),1] <- "1"
data$label <- as.integer(data$label) # convert to integer
table(data$label) # frequency table
##
## 0 1
## 4825 747
sum(is.na(data)) # check for missing values
## [1] 0
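As a light sanity check on the recode, the following assertions can be run; this is a small sketch and not part of the original pipeline:
# Sanity checks: labels are strictly 0/1 and every message has text
stopifnot(all(data$label %in% c(0L, 1L)))
stopifnot(all(nchar(data$message) > 0))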
3. EDA
dt = data.frame(A = c(4825,747), B = c("ham","spam"))
dt = dt[order(dt$A, decreasing = TRUE),]
myLabel = as.vector(dt$B)
myLabel = paste(myLabel, "(", round(dt$A / sum(dt$A) * 100, 2), "%)", sep = "")
p = ggplot(dt, aes(x = "", y = A, fill = B)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(x = "", y = "", title = "") +
theme(axis.ticks = element_blank()) +
theme(legend.title = element_blank(), legend.position = "top") +
scale_fill_discrete(breaks = dt$B, labels = myLabel) +
theme(axis.text.x = element_blank()) +
geom_text(aes(y = A/2 + c(0, cumsum(A)[-length(A)]), x = sum(A)/5572, label = myLabel), size = 5)
## Add percentage labels to the chart: x adjusts the label's distance from the centre, y its position along the arc
p
btx <- data.frame(count = c(4825,747), label = as.character(c("0","1")))
ggplot(data=btx,mapping=aes(x=label,y=count,fill=label,group=factor(1)))+
geom_bar(stat="identity",width=0.5)
In this section we describe the exploratory data analysis (EDA) of the dataset. The raw data comprise 5,572 messages across two columns: v1 holds the spam/ham tag and v2 holds the raw text. Before cleaning, spam messages account for 13.41% of the total and ham messages for 86.59%. As the figures above show, the two classes are heavily imbalanced.
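The counts above are hard-coded into the plotting code; as a small sketch, the same figures can be derived directly from the cleaned data frame:
# Class counts and percentages computed from the data (0 = ham, 1 = spam)
counts <- table(data$label)
counts
round(prop.table(counts) * 100, 2) # roughly 86.59% ham vs 13.41% spam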
## Feature Engineering
Sys.setlocale(category = "LC_ALL",locale = "English_United States.1252")
## Warning in Sys.setlocale(category = "LC_ALL", locale = "English_United
## States.1252"): using locale code page other than 65001 ("UTF-8") may cause
## problems
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Sys.setlocale(category = "LC_ALL",locale = "English_United States.1252")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
# Character count (nchar is vectorised, so no loop is needed)
data$char <- nchar(data$message)
# Word count, using the jiebaR tokenizer
wk <- worker()
data$words <- sapply(data$message, function(m) length(segment(m, wk)), USE.NAMES = FALSE)
# Sentence count, approximated by the number of comma-separated segments
data$sen <- lengths(strsplit(data$message, ","))
head(data)
## label
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 1
## message
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
## char words sen
## 1 111 20 2
## 2 29 6 1
## 3 155 35 1
## 4 49 11 1
## 5 61 14 2
## 6 148 35 2
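To make the three features concrete, here is the same computation for the first message only (a sketch reusing the wk tokenizer; the values match row 1 of the table above):
# Feature values for message 1: 111 characters, 20 tokens, 2 comma-separated segments
msg <- data$message[1]
nchar(msg)
length(segment(msg, wk))
lengths(strsplit(msg, ","))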
## Descriptive Statistics
d1 <- describe(data[,-2]);d1 # all messages
## vars n mean sd median trimmed mad min max range skew kurtosis
## label 1 5572 0.13 0.34 0 0.04 0.00 0 1 1 2.15 2.61
## char 2 5572 80.12 59.69 61 73.67 46.70 2 910 908 2.51 17.46
## words 3 5572 16.20 11.85 13 14.83 10.38 0 190 190 2.70 20.53
## sen 4 5572 1.34 0.81 1 1.16 0.00 1 14 13 4.33 31.06
## se
## label 0.00
## char 0.80
## words 0.16
## sen 0.01
d2 <- describe(data[which(data$label==0),-2]);d2 # ham messages
## vars n mean sd median trimmed mad min max range skew kurtosis
## label 1 4825 0.00 0.00 0 0.00 0.00 0 0 0 NaN NaN
## char 2 4825 71.02 58.02 52 62.20 35.58 2 910 908 3.39 25.14
## words 3 4825 14.67 11.78 11 12.88 7.41 0 190 190 3.39 26.27
## sen 4 4825 1.31 0.77 1 1.14 0.00 1 14 13 4.80 39.69
## se
## label 0.00
## char 0.84
## words 0.17
## sen 0.01
d3 <- describe(data[which(data$label==1),-2]);d3 # spam messages
## vars n mean sd median trimmed mad min max range skew kurtosis
## label 1 747 1.00 0.00 1 1.00 0.00 1 1 0 NaN NaN
## char 2 747 138.87 29.18 149 144.39 14.83 13 224 211 -1.78 3.19
## words 3 747 26.07 6.22 27 26.81 4.45 2 39 37 -1.19 1.56
## sen 4 747 1.54 1.03 1 1.30 0.00 1 7 6 2.71 8.94
## se
## label 0.00
## char 1.07
## words 0.23
## sen 0.04
## Visual Comparison
data$label <- as.factor(data$label)
# Character count
ggplot(data, aes(x = char, fill = label)) +
# Histogram: position = "identity" overlays the two groups instead of stacking them
geom_histogram(position = "identity", alpha = 0.4, bins = 30) + scale_fill_brewer(palette = "Set1")
# Word count
ggplot(data, aes(x = words, fill = label)) +
geom_histogram(position = "identity", bins = 30, alpha = 0.4) + scale_fill_brewer(palette = "Set1")
As a basic overview of the dataset content, we counted the characters and words in each v2 message and visualized the distributions. In Figure 2 (ham in red, spam in blue), ham accounts for far more total characters than spam, simply because there are many more ham messages. The histograms reveal a more useful pattern: most ham messages fall in the >0-100 character range, while most spam messages fall between roughly 100 and 150 characters. Likewise, ham messages typically contain >0-25 words, whereas spam messages typically contain 25-40 words.
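A quick numeric check of this reading, using the char and words features computed earlier (the medians agree with the describe() tables above):
# Median message length per class: spam is markedly longer than ham
aggregate(cbind(char, words) ~ label, data = data, FUN = median)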
# Remove outliers (messages longer than 500 characters)
i <- which(data$char>500)
data <- data[-i,]
# Correlation matrix: convert the factor label back to 0/1 integers
# (as.integer() on a factor yields 1/2, so recode)
data$label <- as.integer(data$label)
data$label[which(data$label==1)] <- 0
data$label[which(data$label==2)] <- 1
corx <- cor(data[,-2])
corrplot(corx)
corrplot(corx,method="shade",
shade.col=NA,
tl.col = "black",
tl.srt = 45,
addCoef.col = "white",
cl.pos = "n",
order="AOE")
This is a correlation matrix; it measures, for every pair of numeric variables, how strongly the two are related. In Figure 5 the colour runs from dark to light as the correlation rises. The light-coloured 1s mark each variable's correlation with itself, which is always exactly 1. More interestingly, char and words correlate at 0.98, which is expected: the two features capture essentially the same information, since words are made up of characters.
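A direct numeric check of the strongest off-diagonal entry (a one-line sketch):
# Correlation between character and word counts, roughly 0.98
cor(data$char, data$words)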
## Data Preprocessing
# Convert everything to lower case
data$message = tolower(data$message)
# Remove English stop words
sw <- stopwords("english")
head(sw,9)
## [1] "i" "me" "my" "myself" "we" "our"
## [7] "ours" "ourselves" "you"
data$message <- removeWords(data$message, sw) # removeWords is vectorised
# Remove web links
data$message <- gsub("http\\S+", "", data$message)
# Remove digits
data$message <- gsub("\\d+", "", data$message)
# Remove email addresses
data$message <- gsub("\\S*@\\S*\\S?", "", data$message)
head(data,5)
## label
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## message
## 1 go jurong point, crazy.. available bugis n great world la e buffet... cine got amore wat...
## 2 ok lar... joking wif u oni...
## 3 free entry wkly comp win fa cup final tkts st may . text fa receive entry question(std txt rate)t&c's apply over's
## 4 u dun say early hor... u c already say...
## 5 nah think goes usf, lives around though
## char words sen
## 1 111 20 2
## 2 29 6 1
## 3 155 35 1
## 4 49 11 1
## 5 61 14 2
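For reuse, the cleaning steps above could be bundled into a single helper; clean_sms below is a hypothetical name, not part of the original pipeline, and applies the same operations in the same order:
# Hypothetical helper bundling the preprocessing steps above
clean_sms <- function(x, sw = stopwords("english")) {
  x <- tolower(x)                  # lower-case everything
  x <- removeWords(x, sw)          # drop English stop words
  x <- gsub("http\\S+", "", x)     # drop web links
  x <- gsub("\\d+", "", x)         # drop digits
  gsub("\\S*@\\S*\\S?", "", x)     # drop email addresses
}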
# Word-frequency counts per class (label 1 = spam, label 0 = ham)
spam_wc <- data[which(data$label==1),]
ham_wc <- data[which(data$label==0),]
spam_tokens <- unlist(lapply(spam_wc$message, function(m) segment(m, wk)))
ham_tokens <- unlist(lapply(ham_wc$message, function(m) segment(m, wk)))
word_spam <- freq(spam_tokens)
word_ham <- freq(ham_tokens)
head(word_ham,10);head(word_spam,10)
## char freq
## 1 pity 1
## 2 salesman 1
## 3 dump 1
## 4 dental 1
## 5 units 1
## 6 shud 1
## 7 kane 1
## 8 indians 1
## 9 influx 1
## 10 sudden 1
## char freq
## 1 house 1
## 2 shit 1
## 3 sed 1
## 4 servs 1
## 5 inclu 1
## 6 chatlines 1
## 7 gsex 1
## 8 ball 1
## 9 spider 2
## 10 marvel 1
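The head() calls above show unsorted entries, most with frequency 1; sorting by frequency surfaces the genuinely common terms that the word cloud emphasises (a sketch; top_spam is a new name):
# Most frequent spam tokens, in decreasing order of frequency
top_spam <- word_spam[order(word_spam$freq, decreasing = TRUE), ]
head(top_spam, 10)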
## Word Cloud
wordcloud2(word_spam,
size = 1, # font size
fontFamily = 'Segoe UI', # font family
fontWeight = 'bold', # font weight
color = 'random-dark', # font colour
backgroundColor = "white", # background colour
minRotation = -pi/4, # minRotation and maxRotation bound the text rotation angle
maxRotation = pi/4,
rotateRatio = 0.4, # probability of rotation: 0.4 means about 40% of words are rotated
shape = "circle" # outline shape
)
In this section, we try to identify commonalities among the fraudulent text messages. Because the dataset contains a large amount of text, a word cloud makes it easy to spot the most commonly used words at a glance, highlighting key themes and topics.
A word cloud (also called a tag cloud or text cloud) displays words in varying sizes: the larger and bolder a term appears, the more frequently it occurs in the corpus and the more important it is. Word clouds are a convenient technique for surfacing the most relevant parts of textual material, from blog posts to databases, and can also help compare two separate pieces of text for similarities in phrasing.
The word cloud above shows some of the most frequent words across all spam SMS. Below, we examine several of these words and consider why they occur so often.
- “Free”: often used to entice people to open the message by suggesting the recipient has won something or that a special offer is available.
- “Mobile” or “Call”: encourage recipients to contact the sender by phone, which lets the spammer reach the recipient more directly.
- “Collect” or “Send”: direct the recipient to take a specific action, such as collecting a prize or sending personal information.
- “SMS” or “txt”: signal that the message is being sent via text message, a common method of communication.
- “Guaranteed”: alleviates any concerns the recipient may have about the offer, making it seem more legitimate.
- “Tone” or “Urgent”: create a sense of urgency and make the message seem more important.
- “Chance to win” or “New”: create excitement and make the offer seem more attractive.
- “Please call” or “Stop”: create a sense of urgency and encourage the recipient to take immediate action.
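As a closing check, the frequencies of the terms discussed above can be looked up directly in the spam vocabulary (a sketch; single-word terms only, since freq() counts individual tokens, and all text was lower-cased during preprocessing):
# Frequencies of the highlighted single-word terms in the spam vocabulary
terms <- c("free", "mobile", "call", "collect", "send", "txt", "guaranteed", "urgent", "stop", "new")
word_spam[word_spam$char %in% terms, ]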