Natural Language Processing based on SMS Spam filter

Wu Mingzhen(S2165138),

YUYANG SU(S2165168),

KE YANG (S2139578),

XIN JI(S2116049)


Content

1. Introduction

a.Background; b.Project Objectives

2. Data Preparation

a.Loading libraries; b.Loading dataset; c.Cleaning

3. EDA

4. Modeling

5. Evaluation

7. Member’s Contribution


1. Introduction

a.Background

Among the essential platforms for communication, text messaging, more commonly known as the short message service (SMS), has become an effective marketing tool for businesses. However, because SMS is so widespread and convenient, criminals also use it to deliver harmful, spam and fraudulent messages to millions of people. They take advantage of advances in technology to carry out smishing (SMS phishing), in which scams are propagated swiftly over mobile networks. These text messages often contain links that lure unsuspecting victims to a phishing site. Consequently, the victims expose themselves to the risk of monetary loss by divulging personal information, downloading malware onto their mobile device, or providing a one-time passcode that allows a criminal to bypass multi-factor authentication (MFA). Drager (2022) claims that SMS attacks have skyrocketed in recent years, rising by around 328% in 2020 and by a further 700% or so during the first half of 2021. To counter SMS attacks, organizations and researchers develop robust and effective spam filters that screen text messages before they are delivered to the end recipients. Machine learning models have been used for this purpose to filter, detect and classify incoming messages. For instance, Mishra & Soni (2022) implemented Neural Network, Naïve Bayes, and Decision Tree models to detect smishing, with model accuracies above 93%. Their results demonstrate the efficacy of machine learning models in classifying text messages as legitimate (ham) or illegitimate (smishing).

c. About Dataset

The SMS Spam Collection is a compilation of labeled SMS messages gathered for the purpose of studying SMS spam. It contains a total of 5,574 English SMS messages, each classified as either legitimate (ham) or spam. Each message occupies a separate line, and each line is divided into two columns: v1 holds the label (either ham or spam) and v2 holds the unedited text. Source: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

b. Project Objectives

** 1.To

2.To **


2. Data Preparation

a.Loading libraries
library(ggplot2)
library(wordcloud2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(jiebaR)
## Loading required package: jiebaRD
## 
## Attaching package: 'jiebaR'
## The following object is masked from 'package:psych':
## 
##     distance
library(corrplot)
## corrplot 0.92 loaded
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(quanteda)
## Warning in stringi::stri_info(): Your current locale is not in the list
## of available locales. Some functions may not work properly. Refer to
## stri_locale_list() for more details on known locale specifiers.
## Package version: 3.2.4
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
## 
##     stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
b.Loading dataset
data <- read.csv("spam.csv", encoding = "latin1")
head(data,5)
##     v1
## 1  ham
## 2  ham
## 3 spam
## 4  ham
## 5  ham
##                                                                                                                                                            v2
## 1                                             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2                                                                                                                               Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4                                                                                                           U dun say so early hor... U c already then say...
## 5                                                                                               Nah I don't think he goes to usf, he lives around here though
##   X X.1 X.2
## 1          
## 2          
## 3          
## 4          
## 5

The dataset is a set of labeled messages collected for SMS spam research and publicly available on Kaggle, an online platform for data science collaboration. It contains 5,574 English SMS messages, each tagged by type: ham (legitimate) or spam (illegitimate). The file holds one message per line, organized in two columns: v1 contains the label (ham or spam) and v2 contains the raw text. The head() output above also shows three additional columns (X, X.1, X.2), which are empty in the first five rows.
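The two-column structure described above can be illustrated with a minimal sketch in base R. It is not the report's own cleaning step: the toy data frame `raw` stands in for `spam.csv` (its three messages are copied, one of them shortened, from the output above), and the column names `label` and `text` are hypothetical choices made here for readability.

```r
# Toy stand-in for the first rows of spam.csv (assumption: same column
# layout as the head() output above, including the empty X columns)
raw <- data.frame(
  v1  = c("ham", "ham", "spam"),
  v2  = c("Ok lar... Joking wif u oni...",
          "U dun say so early hor... U c already then say...",
          "Free entry in 2 a wkly comp to win FA Cup final tkts"),
  X   = c("", "", ""),
  X.1 = c("", "", ""),
  X.2 = c("", "", ""),
  stringsAsFactors = FALSE
)

# Keep only the label and the raw text; 'label' and 'text' are
# hypothetical names, not ones used elsewhere in this report
sms <- data.frame(
  label = factor(raw$v1, levels = c("ham", "spam")),
  text  = raw$v2,
  stringsAsFactors = FALSE
)

# Inspect the structure and the class balance
str(sms)
table(sms$label)
```

On the full dataset, it would be worth confirming that X, X.1 and X.2 are empty in every row before discarding them, since only the first five rows are shown above.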