Introduction

Communication media are increasingly significant in our community. They have become a necessary tool for society, with use cases such as information exchange, relationship building, transactions, and business collaboration. Among the essential communication platforms, text messaging, more commonly known as the short message service (SMS), has become an effective tool for business marketing. According to Rizzo Young (2021), organizations prioritize SMS as a marketing channel because its response rate is about 209% higher than that of other communication tools such as email, phone calls, and messaging via social media platforms.

Because SMS is so widespread and convenient, criminals craft harmful, spam, and fake messages and deliver them to millions of people. They exploit advances in technology to carry out smishing (SMS phishing), in which scams propagate swiftly over mobile networks. Drager (2022) claims that SMS attacks have skyrocketed in recent years, increasing by around 328% in 2020 and by a further 700% or so during the first half of 2021.

According to Christoper Yap (2022), the consequences of phishing can be destructive: a smishing SMS typically appears to come from a legitimate organization, usually a financial or banking institution. The text messages often contain links that lead unsuspecting victims to a phishing site. Consequently, victims expose themselves to the risk of monetary loss by divulging personal information, downloading malware onto their mobile device, or providing a one-time passcode that allows a criminal to bypass multi-factor authentication (MFA). Meanwhile, the reputation of the impersonated organization is jeopardized.

To prevent SMS attacks, organizations and researchers develop robust and effective spam filters that screen text messages before they are delivered to the end recipients. Machine learning models have been used to filter, detect, and classify incoming messages. For instance, Mishra & Soni (2022) implemented Neural Network, Naïve Bayes, and Decision Tree models to detect smishing, with model accuracies above 93%. Their results support the efficacy of machine learning models in classifying text messages as legitimate (ham) or illegitimate (smishing).
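To make the idea of a learned spam filter concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing, written with the Python standard library only. It is not the implementation used in the cited study or in this paper; the class name, the tokenizer, and the four training messages are hypothetical and made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyNaiveBayes:
    """Hypothetical minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        # Vocabulary: every token seen in any training message.
        self.vocab = set().union(*self.token_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training messages, for illustration only.
train_texts = [
    "Win a free prize now",
    "Claim your free voucher",
    "Are we still meeting for lunch",
    "Call me when you get home",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = TinyNaiveBayes().fit(train_texts, train_labels)
print(model.predict("free prize inside"))
print(model.predict("see you at lunch"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the add-one smoothing keeps a single unseen token from driving a class probability to zero.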

In this paper, we aim to develop a model that classifies text messages as either illegitimate (spam) or legitimate (ham). We will also examine the spam messages to determine their commonalities. The machine learning models applied in this paper are Support Vector Machine (SVM), Naïve Bayes, and Random Forest, selected for their suitability for classification tasks. We will evaluate these models to determine which is best at detecting smishing.

About Dataset

The SMS Spam Collection is a compilation of labeled SMS messages gathered for the purpose of studying SMS spam. It features a total of 5,574 English SMS messages, classified as either legitimate (ham) or spam.

The SMS Spam Collection dataset consists of individual messages, each occupying a separate line. Each line is divided into two columns: v1 contains the label (either ham or spam) and v2 contains the unedited text. The corpus was gathered from various free-to-use or research-specific sources on the internet, including:

- A collection of 425 SMS spam messages manually extracted from the Grumbletext website, a UK forum where cell phone users share complaints about SMS spam messages without necessarily providing the actual messages. Identifying the text of spam messages from these complaints was difficult and time-consuming, requiring close examination of multiple web pages.

- A subset of 3,375 randomly selected legitimate SMS messages from the NUS SMS Corpus (NSC), a dataset of around 10,000 messages collected for research at the National University of Singapore’s Department of Computer Science. The majority of these messages were sent by Singaporeans, mostly university students, who were aware that their contributions would be made publicly available.

- A set of 450 legitimate SMS messages taken from Caroline Tag’s PhD thesis.

- Finally, the SMS Spam Corpus v.0.1 Big, consisting of 1,002 legitimate SMS messages and 322 spam messages. This corpus has been used in multiple academic studies.
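As a quick consistency check, the per-source counts listed above can be tallied to confirm that they add up to the stated total of 5,574 messages. The sketch below uses only the figures given in this section; the dictionary layout and variable names are illustrative choices.

```python
# Per-source message counts as listed in this section.
sources = {
    "Grumbletext": {"ham": 0, "spam": 425},
    "NUS SMS Corpus": {"ham": 3375, "spam": 0},
    "Caroline Tag's PhD thesis": {"ham": 450, "spam": 0},
    "SMS Spam Corpus v.0.1 Big": {"ham": 1002, "spam": 322},
}

ham_total = sum(counts["ham"] for counts in sources.values())
spam_total = sum(counts["spam"] for counts in sources.values())
total = ham_total + spam_total

print(ham_total)   # 4827
print(spam_total)  # 747
print(total)       # 5574
print(f"{spam_total / total:.1%}")  # 13.4%
```

The tally shows that spam makes up only about 13.4% of the corpus, so the two classes are noticeably imbalanced.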

Project Objectives

This paper aims to develop a data product that acts as a filtration tool, classifying text messages as illegitimate (spam) or legitimate (ham) effectively and efficiently.

To develop an effective classifier for detecting spam text messages, one objective of this paper is to identify the commonalities of spam text messages. We will study and determine the common underlying patterns used in spam messages. One parameter for estimating the probability that a message is spam is a spam score, computed from the number of times a message was sent.
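One simple way to start looking for commonalities, offered here as an assumption rather than as the method used in this paper, is to compare how often each token appears in spam messages versus ham messages. The sketch below ranks tokens by the difference between the share of spam messages and the share of ham messages that contain them. The function names and the sample messages are hypothetical and made up for illustration; they are not taken from the dataset.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequency(texts):
    """Count, for each token, how many messages contain it at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(tokenize(text)))
    return counts


# Made-up messages, for illustration only.
spam_texts = [
    "FREE entry! Text WIN to claim your prize",
    "Claim your FREE voucher now, call today",
    "You have won a prize, call now",
]
ham_texts = [
    "Are you free for lunch today?",
    "Call me when you get home",
]

spam_df = document_frequency(spam_texts)
ham_df = document_frequency(ham_texts)


def spam_lean(token):
    """Share of spam messages containing the token minus the share of ham."""
    return spam_df[token] / len(spam_texts) - ham_df[token] / len(ham_texts)


# Tokens most characteristic of spam first; ties broken alphabetically.
ranked = sorted(spam_df, key=lambda token: (-spam_lean(token), token))
print(ranked[:4])
```

Tokens such as "free" or "call" appear in both classes in this toy sample and therefore rank lower than tokens that appear only in spam, which is the kind of pattern the analysis is meant to surface.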

Apart from that, this paper also aims to determine the best model for classifying text messages by comparing the performance of Support Vector Machine (SVM), Naïve Bayes, and Random Forest machine learning models.
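Comparing models requires a scoring rule. The sketch below computes accuracy, precision, recall, and F1 for the spam class from a list of true labels and a list of predicted labels, then picks the candidate with the highest F1. The choice of F1 as the selection criterion is an assumption made for this example, not a decision stated in this paper. The labels and predictions are made-up placeholders, not results, and `model_a` and `model_b` are hypothetical names standing in for any of the classifiers under comparison.

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Placeholder labels and predictions, for illustration only.
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
predictions = {
    "model_a": ["ham", "ham", "spam", "ham", "ham", "spam"],
    "model_b": ["ham", "spam", "spam", "spam", "ham", "spam"],
}

scores = {name: evaluate(y_true, y_pred) for name, y_pred in predictions.items()}
best = max(scores, key=lambda name: scores[name]["f1"])

for name, metrics in scores.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
print("best by F1:", best)
```

In this toy example both candidates have the same accuracy but differ in precision and recall, which illustrates why accuracy alone can be a weak basis for comparison when spam is the minority class.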