# Introduction
As the final step of the machine learning course, every student must complete one case as their machine learning capstone project. In this article, we walk through one such case: “Spam SMS Classification”.
# Datasets
The train dataset is used to train and evaluate the model, while the test dataset is used for the final evaluation. The final evaluation requires you to submit your predictions on the test dataset to the leaderboard in order to obtain the final model score (more details are provided below). Both datasets share the same scheme: datetime (the timestamp of the SMS), text (the message content), and status (the ham/spam label, which is empty in the test data).
# Case Study
The SMS dataset was collected by the team for educational purposes. It is a real SMS dataset with a spam/ham label for each message.
SMS: “I didn’t get your message!”
Someone might contact you through the old-school way of SMS, and you might even miss the message because the amount of spam in your inbox is just too much. An SMS is classified as spam based on users’ reports of unwanted messages. Can we build a spam classifier?
The problem above asks you to classify whether a text message is SPAM or HAM based on its content.
For this case study, there are two SMS dataset sources:
1. SMS Train Dataset
2. SMS Test Dataset
# Data Preprocessing and Exploratory Data Analysis
## A. Text data preprocessing
First, load the following packages:
library(lubridate) #Make Dealing with Dates a Little Easier
library(tidyverse) #The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command -> ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
library(readr) #Read Rectangular Text Data
library(caret) #Classification and Regression Training
library(dplyr) #A Grammar of Data Manipulation
library(e1071) #Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071)
library(ROCR) #Visualizing the Performance of Scoring Classifiers
library(SnowballC) #Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library
In this phase, we read the source SMS data. (The read_csv() function from the readr package could be used to speed up reading; the code below uses base read.csv().)
sms_datatrain <- read.csv("4. sms-cl-spam/data/data-train.csv",
                          stringsAsFactors = FALSE,
                          encoding = "UTF-8") %>%
  mutate(status = as.factor(status),       # target label: ham/spam
         datetime = as_datetime(datetime), # parse the timestamp
         hourly = hour(datetime))          # extract the hour of day as a feature
glimpse(sms_datatrain)
#> Rows: 2,004
#> Columns: 4
#> $ datetime <dttm> 2017-02-15 14:48:00, 2017-02-15 15:24:00, 2017-02-15 16:0...
#> $ text <chr> "Telegram code 53784", "Rezeki Nomplok Dompetku Pengiriman...
#> $ status <fct> ham, spam, ham, ham, ham, ham, ham, spam, spam, spam, spam...
#> $ hourly <int> 14, 15, 16, 16, 18, 18, 18, 10, 11, 18, 18, 9, 9, 11, 13, ...
Take 20 text samples from the ‘spam’ category and observe which words or phrases can serve as indicators (predictors) of spam. We can also use a wordcloud to look at the most frequent words in the overall collection of words (the “bag of words”).
To see the “SPAM” word pattern through the bag of words:
library(wordcloud)
sms_datatrain %>%
  filter(status == "spam") %>%
  head(20) %>%
  pull(text) # take the text column as a character vector
#> [1] "Rezeki Nomplok Dompetku Pengiriman Uang! Kirim uang di Alfamart & dptkan hadiah jutaan rupiah setiap hari.Periode s.d. 28Feb17.Info: http://bit.ly/dmpurna MFI1"
#> [2] "YEAY! Free Ice Tea atau Cashback up to 30% dg transaksi di AH Resto! Hanya untuk pengguna TCASH TAP. S&K Berlaku. Info tsel.me/tappromo"
#> [3] "Voting your Offer. Disc 40%, 1 crispy chicken+1 spicy chicken+ nasi+lotteria tea Rp.26rb. Tukar SMS ini di LOTTERIA terdekat. Berlaku hari ini. SKB. Promo *606#"
#> [4] "Ayo bergabung dgn Freedom Postpaid! Makin rame makin seru, ajak teman & keluarga diskonnya lebih besar. Daftar di http://im3.do/uxU PAI1"
#> [5] "Nikmati kemudahan mewujudkan impian kamu dan pasangan utk masa depan yg lebih cerah. Cek Dana Bantuan Sahabat di DOMPETKU! Info: http://bit.ly/dmpdbs MFI3"
#> [6] "Gratis 1 bulan Spotify Premium khusus FreedomCombo. Bisa bebas dengar musik,bikin playlist sepuasnya tanpa iklan dgn Spotify Premium. Aktifkan di *123*123# CVI1"
#> [7] "Masukan Username & Password ini di aplikasi Spotify. Username:085722688068 Password:1khF1SpC Mohon lgsg ganti alamat email & password di aplikasi Spotify."
#> [8] "TIPS HEMAT DATA: pakai resolusi 480p ketika menonton video online, temukan di setting layanan video yg kamu nikmati. Ini akan membuat pemakaian data lbh hemat"
#> [9] "Ayam (syp/paha bwh),Nasi,Ades Rp.18.181. Add on CD Bebi Glenn Rp.22.727. Tkr SMS hari ini di CFC CIBUBUR JUNC. Selama persediaan msh ada. Promo*606#"
#> [10] "YEAY! Kejutan cashback & freebies dg TCASH TAP! Terus #pakeTCASH, cek HP kamu & dapatkan kejutannya. S&K berlaku. Info cek tsel.me/yeay"
#> [11] "Beli EXTRA kuota 1GB harga DISKON cuma Rp10rb. Ketik YA9 kirim ke 929 sd. 20/02/17"
#> [12] "Hari Senin saatnya Nonton Hemat hanya 25ribu di Cinema XXI dgn TCASH TAP! Dptkan stiker TCASH TAP di GraPARI terdekat. Info tsel.me/tappromo"
#> [13] "Disc 50% setiap Senin, 10% di hari lainnya di Coffee Bean dgn TCASH TAP! Dapatkan stiker TCASH TAP di graPARI terdekat. Info tsel.me/tappromo"
#> [14] "Beli EXTRA kuota 1GB harga DISKON cuma Rp10rb. Ketik YA9 kirim ke 929 sd. 20/02/17"
#> [15] "Pelanggan 085722688068, Ada yang Ngajak Kamu Chatting. Hubungi *858*11# untuk baca. Pesan akan dihapus dalam 5 menit. Silakan hubungi *858*11# sekarang."
#> [16] "Dapatkan hasil investasi yg bersaing dgn deposito, segera maksimalkan hasil Investasi Anda dgn fitur Auto-Invest di DOMPETKU! Info: http://bit.ly/bnp2211 MFI2"
#> [17] "Harga spesial Rp 75ribu atau Disc 44% utk tiket Jungle Land. Khusus pengguna TCASH TAP! S&K Berlaku. Info tsel.me/tappromo"
#> [18] "Cashback 50% setiap di Haagen-Dasz Selasa&Kamis / 10% hari lainnya khusus dengan TCASH TAP! S&K berlaku. Info tsel.me/tappromo"
#> [19] "Raih kesempatan mendapat Smartphone keren, cukup download iflix skrg & tonton film sebanyak2nya. Unduh iflix di im3.do/iflix Info kunjungi website kami. CVI2"
#> [20] "Hanya dengan Isi ulang akumulasi 150rb sebelum 24 Febuari 2017, Dapatkan GRATIS 3GB berlaku 30 hari. Mau? Ayo buruan isi ulang sekarang juga."
sms_datatrain %>%
  filter(status == "spam") %>%
  pull(text) %>%
  wordcloud(max.words = 250, scale = c(2, 0.4), random.order = FALSE,
            colors = brewer.pal(8, "BrBG"))
Words that potentially indicate that a text is spam: kuota, pulsa, paket, sms, internetan, rezeki, hadiah, disc, promo, diskon, gratis, bonus.
See the hourly distribution of SMS using a histogram:
library(ggplot2)
hist(sms_datatrain$hourly, breaks = 30) # distribution of SMS by hour of day
See the hourly pattern of SMS by status using a box plot:
# box plot of hour vs status, with individual points overlaid
ggplot(data = sms_datatrain, mapping = aes(x = hourly, y = status)) +
  geom_boxplot() +
  geom_point()
For a more interesting view of the hourly pattern, use a bar chart (geom_col), so we can see the hourly proportion/composition of SPAM vs HAM SMS.
sms_datatrain_freq <- as.data.frame(table(sms_datatrain$status,
sms_datatrain$hourly))
sms_datatrain_freq
#> Var1 Var2 Freq
#> 1 ham 0 4
#> 2 spam 0 30
#> 3 ham 1 1
#> 4 spam 1 19
#> 5 ham 2 2
#> 6 spam 2 4
#> 7 ham 3 0
#> 8 spam 3 5
#> 9 ham 4 0
#> 10 spam 4 2
#> 11 ham 5 7
#> 12 spam 5 4
#> 13 ham 6 10
#> 14 spam 6 11
#> 15 ham 7 85
#> 16 spam 7 28
#> 17 ham 8 91
#> 18 spam 8 78
#> 19 ham 9 130
#> 20 spam 9 65
#> 21 ham 10 86
#> 22 spam 10 64
#> 23 ham 11 58
#> 24 spam 11 60
#> 25 ham 12 73
#> 26 spam 12 71
#> 27 ham 13 68
#> 28 spam 13 79
#> 29 ham 14 60
#> 30 spam 14 55
#> 31 ham 15 63
#> 32 spam 15 58
#> 33 ham 16 65
#> 34 spam 16 40
#> 35 ham 17 77
#> 36 spam 17 38
#> 37 ham 18 49
#> 38 spam 18 16
#> 39 ham 19 60
#> 40 spam 19 31
#> 41 ham 20 56
#> 42 spam 20 28
#> 43 ham 21 67
#> 44 spam 21 23
#> 45 ham 22 40
#> 46 spam 22 18
#> 47 ham 23 10
#> 48 spam 23 15
ggplot(data = sms_datatrain_freq, mapping = aes(x = Freq, y = Var2 )) +
geom_col(mapping = aes(fill = Var1), position = "fill") +
labs(x = "Spam & Ham count Proportion",
y = "Hour",
fill = "",
title = "Proportion of Spam & Ham SMS",
subtitle = "Publish Hour vs Spam/Ham") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(legend.position = "top")
ggplot(data = sms_datatrain_freq, mapping = aes(x = Freq, y = Var2 )) +
geom_col(mapping = aes(fill = Var1), position = "dodge") +
labs(x = "Spam & Ham counts",
y = "Hour",
fill = "",
title = "No. of Spam & Ham SMS",
subtitle = "Publish Hour vs Spam/Ham") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(legend.position = "right")
Based on the data visualizations (wordcloud, box plot, and bar charts), we can conclude:
- Most SPAM SMS come from telco provider promotions (e.g. kuota, paket, sms, gratis).
- SPAM SMS are also sent in the early hours of the day.
- The peak of SMS broadcasting activity runs from morning until afternoon.
Before starting model development, we need to clean the text data. The text is first converted to corpus format and then cleaned.
A corpus is a collection of documents. In this case, one document is equivalent to one SMS observation; one SMS can contain one or more sentences.
In general, the steps that are often performed for text cleansing are (a compact example follows this list):
1. Case-folding: change all words to lower case.
2. Remove numbers: remove all digits.
3. Remove stopwords: delete words that appear frequently in the corpus and are usually not meaningful.
4. Remove punctuation: replace certain characters with spaces/blanks.
5. Stemming: cut words down to their base (root) form.
6. Remove white space: remove excess white space, because the subsequent tokenizing step will split words on the space character (" ").
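A compact sketch of these six steps applied to a single hypothetical message (illustration only; the full pipeline on the real corpus follows below):
library(tm)
library(SnowballC)
doc <- VCorpus(VectorSource("GRATIS 3GB!! Ketik YA9, kirim ke 929 skrg"))
doc <- tm_map(doc, content_transformer(tolower))      # 1. case-folding
doc <- tm_map(doc, removeNumbers)                     # 2. remove numbers
doc <- tm_map(doc, removeWords, stopwords("english")) # 3. remove stopwords
doc <- tm_map(doc, removePunctuation)                 # 4. remove punctuation
doc <- tm_map(doc, stemDocument)                      # 5. stemming
doc <- tm_map(doc, stripWhitespace)                   # 6. remove excess white space
doc[[1]]$content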
One of the packages we can use for text mining is tm. A character vector can be converted to a corpus with the VCorpus() function:
library(tm)
# convert the text vector into a corpus
sms.corpus_train <- VCorpus(VectorSource(sms_datatrain$text))
sms.corpus_train
#> <<VCorpus>>
#> Metadata: corpus specific: 0, document level (indexed): 0
#> Content: documents: 2004
nrow(sms_datatrain)
#> [1] 2004
Let’s inspect the content of one SMS (for example, the 9th):
sms.corpus_train[[9]]$content
#> [1] "Voting your Offer. Disc 40%, 1 crispy chicken+1 spicy chicken+ nasi+lotteria tea Rp.26rb. Tukar SMS ini di LOTTERIA terdekat. Berlaku hari ini. SKB. Promo *606#"
library(stopwords)
# 1. case-folding: convert all text to lowercase
sms.corpus_train <- tm_map(sms.corpus_train, content_transformer(tolower))
# 2. remove numbers
sms.corpus_train <- tm_map(sms.corpus_train, removeNumbers)
# 3. remove stopwords: words that appear often in the corpus and are usually not meaningful
sms.corpus_train <- tm_map(sms.corpus_train, removeWords, stopwords("english"))
# check the 9th document
sms.corpus_train[[9]]$content
#> [1] "voting offer. disc %, crispy chicken+ spicy chicken+ nasi+lotteria tea rp.rb. tukar sms ini di lotteria terdekat. berlaku hari ini. skb. promo *#"
We create a transformer function that replaces a given pattern with a space (" "):
transformer <- content_transformer(FUN = function(x, pattern){
  # replace every match of `pattern` with a single space
  gsub(x = x, pattern = pattern, replacement = " ")
})
We can use the removePunctuation function to drop punctuation characters. Punctuation removed: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
# replace ".", "/", "@", "-" and other patterns with a white space
sms.corpus_train <-
  VCorpus(VectorSource(sms_datatrain$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(transformer, "/") %>%
  tm_map(transformer, "@") %>%
  tm_map(transformer, "-") %>%
  tm_map(transformer, "\\.") %>%                    # "\\." matches "." anywhere in the text
  tm_map(transformer, "<[^>]+>.") %>%               # remove HTML-like tags
  tm_map(transformer, "@\\S+.") %>%                 # remove mentions
  tm_map(transformer, "*#") %>%                     # remove hashtags
  tm_map(transformer, "http[^[:space:]]*.") %>%     # remove URLs
  tm_map(transformer, "&.") %>%                     # remove "&" and the following character
  tm_map(transformer, "[[:punct:]].") %>%           # remove punctuation followed by any character (e.g. "...")
  tm_map(transformer, "[^[:alpha:][:space:]].") %>% # remove remaining non-letter characters
  tm_map(removePunctuation)                         # remove any remaining punctuation (from tm; no space inserted)
# check the 9th document again
sms.corpus_train[[9]]$content
#> [1] "voting offer disc crispy chicken spicy chicken nasi otteria tea rp rb tukar sms ini di lotteria terdekat berlaku hari ini skb promo "
sms.corpus_train[[5]]$content
#> [1] "apakah anda mencoba mengakses akun anda dari perangkat lain jika ya mohon klik tautan ini api gojek co id customers device oken e e bac dc dalam jam ke depan jika tidak mohon abaikan pesan ini"
Next we perform **stemming**, cutting each word down to its root. For example, *walking, walked, walks* all become *walk*.
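A quick check with SnowballC’s wordStem() illustrates the idea:
library(SnowballC)
wordStem(c("walking", "walked", "walks"))
#> [1] "walk" "walk" "walk"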
library(SnowballC) # requires the SnowballC package to be installed
# stemming: cut each word to its root
sms.corpus_train <- tm::tm_map(sms.corpus_train, stemDocument)
# check the 9th document again
sms.corpus_train[[9]]$content
#> [1] "vote offer disc crispi chicken spici chicken nasi otteria tea rp rb tukar sms ini di lotteria terdekat berlaku hari ini skb promo"
Finally, we remove excess white space, because the subsequent tokenizing step will split words on the space character (" ").
# remove white space
sms.corpus_train <- tm_map(sms.corpus_train, stripWhitespace)
sms.corpus_train[[9]]$content
#> [1] "vote offer disc crispi chicken spici chicken nasi otteria tea rp rb tukar sms ini di lotteria terdekat berlaku hari ini skb promo"
sms.corpus_train %>%
  wordcloud(max.words = 200, scale = c(2, 0.5), random.order = FALSE,
            colors = brewer.pal(8, "BrBG"))
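Before splitting, the cleaned corpus has to be tokenized into a document-term matrix. That step is implied but not shown here; a minimal sketch of it, assuming the cleaned sms.corpus_train from above:
# build the document-term matrix used for the train-test split below
sms_train.dtm <- DocumentTermMatrix(x = sms.corpus_train)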
To obtain a robust prediction model, the training data is split into two parts:
- Data Train: the base data for model development
- Data Test (validation): to test the model built on the train data
Split the data into sms_train_val and sms_test_val with a 75%–25% ratio.
RNGkind(sample.kind = "Rounding")
set.seed(100)
# train-test splitting
index <- sample(nrow(sms_train.dtm), nrow(sms_train.dtm)*0.75)
# sms_train.dtm is a DocumentTermMatrix without labels
sms_train_val <- sms_train.dtm[index,]
sms_test_val <- sms_train.dtm[-index,]
The labels for the prediction target:
# labels for train and test, taken from the sms_datatrain data frame
label_train_val <- sms_datatrain[index, 'status']
label_test_val <- sms_datatrain[-index, 'status']
Check the class proportions of the target in label_train_val and label_test_val:
prop.table(table(label_train_val))
#> label_train_val
#> ham spam
#> 0.584165 0.415835
prop.table(table(label_test_val))
#> label_test_val
#> ham spam
#> 0.5668663 0.4331337
Check the dimensions of sms_train_val and sms_test_val, which will be used for model development:
dim(sms_train_val)
#> [1] 1503 2877
dim(sms_test_val)
#> [1] 501 2877
We have a large number of predictors: 2,877. Let’s reduce the noise in our data by keeping only words that appear reasonably often, for example at least 8 times across all SMS. Use the findFreqTerms() function:
# frequent terms from the training DTM
sms_freq_val <- findFreqTerms(sms_train_val, lowfreq = 8)
length(sms_freq_val)
#> [1] 462
Note: the choice lowfreq = 8 is not absolute and can be changed for feature selection. Keep in mind: the larger the lowfreq, the fewer terms we keep as features/predictors.
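For example, raising lowfreq shrinks the vocabulary; a quick sketch (the second count is illustrative and depends on the data):
length(findFreqTerms(sms_train_val, lowfreq = 8))  # 462 terms, as above
length(findFreqTerms(sms_train_val, lowfreq = 20)) # fewer terms survive the stricter cut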
Let’s subset the train data to only the words that appear in sms_freq_val:
sms_train_Val <- sms_train_val[, sms_freq_val] # terms are in the columns
# note: the subset is assigned to `sms_train_Val` (capital V) and is not used again;
# the steps below continue with the full 2,877-term `sms_train_val`
inspect(sms_train_val)
#> <<DocumentTermMatrix (documents: 1503, terms: 2877)>>
#> Non-/sparse entries: 16826/4307305
#> Sparsity : 100%
#> Maximal term length: 22
#> Weighting : term frequency (tf)
#> Sample :
#> Terms
#> Docs anda atau dgn info kamu kuota paket pulsa saya sms
#> 1400 1 1 0 0 0 2 1 1 0 1
#> 196 0 0 0 0 0 0 0 0 0 0
#> 197 0 0 0 0 0 0 0 0 0 1
#> 225 0 0 0 0 0 0 0 0 0 0
#> 239 0 1 0 0 0 0 0 0 4 0
#> 29 0 1 0 0 0 0 0 0 4 0
#> 378 0 0 0 0 0 0 0 0 0 0
#> 403 0 0 0 0 0 0 0 0 0 0
#> 409 0 0 0 0 0 0 0 0 0 0
#> 410 1 0 1 2 0 0 2 2 0 7
The values in the train matrix are still frequencies. For the probability calculation, the frequencies are converted to presence indicators: appears (1) or not (0). One way is a Bernoulli converter:
bernoulli_conv <- function(x){
  # ifelse(condition, value if TRUE, value if FALSE)
  x <- as.factor(ifelse(x > 0, 1, 0))
  return(x)
}
# test the function
bernoulli_conv(c(3, 0, 0, 1, 4, 0))
#> [1] 1 0 0 1 1 0
#> Levels: 0 1
Next, apply bernoulli_conv to sms_train_val and sms_test_val:
sms_train_bn_val <- apply(X = sms_train_val, MARGIN = 2, FUN = bernoulli_conv)
sms_test_bn_val <- apply(X = sms_test_val, MARGIN = 2, FUN = bernoulli_conv)
dim(sms_train_bn_val)
#> [1] 1503 2877
dim(sms_test_bn_val)
#> [1] 501 2877
Check the result:
sms_train_bn_val[1:25, 35:40]
#> Terms
#> Docs ahlinya ahmad air aja ajaa ajak
#> 617 "0" "0" "0" "0" "0" "0"
#> 517 "0" "0" "0" "0" "0" "0"
#> 1106 "0" "0" "0" "0" "0" "0"
#> 113 "0" "0" "0" "0" "0" "0"
#> 938 "0" "0" "0" "0" "0" "0"
#> 968 "0" "0" "0" "0" "0" "0"
#> 1624 "0" "0" "0" "0" "0" "0"
#> 740 "0" "0" "0" "0" "0" "0"
#> 1091 "0" "0" "0" "0" "0" "0"
#> 340 "0" "0" "0" "0" "0" "0"
#> 1247 "0" "0" "0" "0" "0" "0"
#> 1759 "0" "0" "0" "0" "0" "0"
#> 559 "0" "0" "0" "0" "0" "0"
#> 794 "0" "0" "0" "0" "0" "0"
#> 1518 "0" "0" "0" "0" "0" "0"
#> 1331 "0" "0" "0" "0" "0" "0"
#> 407 "0" "0" "0" "0" "0" "0"
#> 711 "0" "0" "0" "0" "0" "0"
#> 714 "0" "0" "0" "0" "0" "0"
#> 1371 "0" "0" "0" "0" "0" "0"
#> 1064 "0" "0" "0" "0" "0" "0"
#> 1410 "0" "0" "0" "0" "0" "0"
#> 1068 "0" "0" "0" "0" "0" "0"
#> 1484 "0" "0" "0" "0" "0" "0"
#> 832 "0" "0" "0" "0" "0" "0"
For this project we will use a Naive Bayes model; the sms_train_bn_val matrix is ready for training, and we use library(e1071). The Naive Bayes model script:
library(e1071)
# train
naive_spam <- naiveBayes(x = sms_train_bn_val, # predictor matrix
                         y = label_train_val)  # label / target variable
head(naive_spam$tables)
#> $aagc
#> aagc
#> label_train_val 0 1
#> ham 0.998861048 0.001138952
#> spam 1.000000000 0.000000000
#>
#> $abaikan
#> abaikan
#> label_train_val 0 1
#> ham 0.98974943 0.01025057
#> spam 0.99840000 0.00160000
#>
#> $abi
#> abi
#> label_train_val 0 1
#> ham 1.0000 0.0000
#> spam 0.9968 0.0032
#>
#> $abu
#> abu
#> label_train_val 0 1
#> ham 0.998861048 0.001138952
#> spam 1.000000000 0.000000000
#>
#> $abung
#> abung
#> label_train_val 0
#> ham 1
#> spam 1
#>
#> $acara
#> acara
#> label_train_val 0 1
#> ham 0.998861048 0.001138952
#> spam 1.000000000 0.000000000
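To make the mechanics concrete, here is a minimal sketch, considering a single term only, of how naiveBayes combines the class priors with one of the conditional tables shown above (the full model multiplies such likelihoods over all terms):
# posterior P(class | "abaikan" present), using only this one term
prior <- naive_spam$apriori / sum(naive_spam$apriori) # class priors from the training labels
lik   <- naive_spam$tables$abaikan[, "1"]             # P("abaikan" present | class)
prior * lik / sum(prior * lik)                        # normalized posterior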
Predict the target class for sms_test_bn_val. Save the result to the sms_pred_class object, which will be used for the confusion matrix evaluation.
# predict
dim(sms_test_bn_val)
#> [1] 501 2877
sms_pred_class <- predict(object = naive_spam,       # the naive bayes model
                          newdata = sms_test_bn_val, # validation data
                          type = "class")            # predict the class label
head(sms_pred_class)
#> [1] ham spam spam ham spam spam
#> Levels: ham spam
table(sms_pred_class)
#> sms_pred_class
#> ham spam
#> 273 228
prop.table(table(sms_pred_class))
#> sms_pred_class
#> ham spam
#> 0.5449102 0.4550898
Check the composition of the spam/ham predictions:
sms_test_bn_val_df <- data.frame(sms_test_bn_val)
sms_pred_class_df <- data.frame(sms_pred_class)
dim(sms_test_bn_val_df)
#> [1] 501 2877
prop.table(table(sms_pred_class_df))
#> sms_pred_class_df
#> ham spam
#> 0.5449102 0.4550898
(table(sms_pred_class_df))
#> sms_pred_class_df
#> ham spam
#> 273 228
Evaluate the naive_spam model using a confusion matrix and its metrics, via library(caret):
library(caret)
confusionMatrix(data = sms_pred_class,      # predicted labels
                reference = label_test_val, # actual labels
                positive = "spam")          # positive class: spam
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction ham spam
#> ham 264 9
#> spam 20 208
#>
#> Accuracy : 0.9421
#> 95% CI : (0.9179, 0.9609)
#> No Information Rate : 0.5669
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.8828
#>
#> Mcnemar's Test P-Value : 0.06332
#>
#> Sensitivity : 0.9585
#> Specificity : 0.9296
#> Pos Pred Value : 0.9123
#> Neg Pred Value : 0.9670
#> Prevalence : 0.4331
#> Detection Rate : 0.4152
#> Detection Prevalence : 0.4551
#> Balanced Accuracy : 0.9441
#>
#> 'Positive' Class : spam
#>
Which cases do we want to minimize?
- False Negative: the SMS is actually spam but is predicted as ham (so spam enters our inbox).
- False Positive: the SMS is actually ham but is predicted as spam (so a legitimate SMS ends up in our spam folder).
Minimize: False Positive
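Given that focus, the metric to watch is precision (Pos Pred Value): of all messages predicted as spam, how many really are spam. A quick check from the confusion matrix above (TP = 208, FP = 20):
precision <- 208 / (208 + 20)
precision
#> [1] 0.9122807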
ROC and AUC as model evaluation tools:
a. The ROC is a curve of the true positive rate (TPR = TP / (TP + FN)) against the false positive rate (FPR = FP / (FP + TN)) at each threshold; the AUC is the area under the ROC curve.
b. The more convex the ROC curve (i.e., the closer the AUC is to one), the better the model separates the positive and negative classes.
First, prepare the prediction results as probabilities for sms_test_bn_val, and save them to an object named sms_pred_prob:
sms_pred_prob <- predict(naive_spam, sms_test_bn_val, type = "raw")
head(sms_pred_prob)
#> ham spam
#> [1,] 1.00000000000000000000 0.00000000000000004174301
#> [2,] 0.00000000000004828383 0.99999999999995181632073
#> [3,] 0.00001158056778772665 0.99998841943221228323324
#> [4,] 0.99964574107841930317 0.00035425892158074773165
#> [5,] 0.00024669691554197196 0.99975330308445797822259
#> [6,] 0.14715885403056078173 0.85284114596943916275507
To make things easier, prepare the ROC data and save it to an object named data_sms_roc:
data_sms_roc <- data.frame(pred_prob = sms_pred_prob[,'spam'],
actual_label = ifelse(label_test_val == 'spam', 1, 0))
head(data_sms_roc)
#> pred_prob actual_label
#> 1 0.00000000000000004174301 0
#> 2 0.99999999999995181632073 1
#> 3 0.99998841943221228323324 1
#> 4 0.00035425892158074773165 1
#> 5 0.99975330308445797822259 1
#> 6 0.85284114596943916275507 0
Create the ROC curve, saving the resulting prediction() object under the name sms_roc:
library(ROCR)
sms_roc <- prediction(predictions = data_sms_roc$pred_prob,
labels = data_sms_roc$actual_label)
plot(performance(sms_roc, "tpr", "fpr"))
abline(0, 1, lty = 2) # diagonal reference line: the performance of random guessing
Calculate the AUC value:
sms_auc <- performance(sms_roc, measure = "auc")
sms_auc@y.values
#> [[1]]
#> [1] 0.9841955
The AUC value is 0.9841955, so we can conclude that the Naive Bayes spam classifier performs very well at separating the positive class (spam) from the negative class (ham).
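Since we want to minimize false positives, we can also trade some recall for precision by raising the decision threshold for spam above the default 0.5. A sketch using the sms_pred_prob probabilities from above (the 0.9 cutoff is an arbitrary illustration):
# classify as spam only when the predicted spam probability exceeds 0.9
pred_strict <- factor(ifelse(sms_pred_prob[, "spam"] > 0.9, "spam", "ham"),
                      levels = c("ham", "spam"))
confusionMatrix(data = pred_strict, reference = label_test_val, positive = "spam")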
To summarize:
- Naive Bayes is often used for text classification because its computation is fast enough to handle many word features.
- The text-mining workflow included feature selection with findFreqTerms() and conversion with the Bernoulli converter.
- For this capstone project, SMS SPAM classification with a Naive Bayes model, the validation results together with the ROC and AUC analysis (AUC = 0.9841955) show that the model is highly predictive and separates the positive class (spam) from the negative class (ham) very well.
Prepare the test data: read and transform it in the same way as the train/validation data used in model development.
sms_datatest <- read.csv("4. sms-cl-spam/data/data-test.csv",
                         stringsAsFactors = FALSE,
                         encoding = "UTF-8") %>%
  mutate(status = as.factor(status)) # status is empty in the test data
transformer <- content_transformer(FUN = function(x, pattern){
gsub(x = x,
pattern = pattern,
replacement = " ")
})
sms.corpus_test <-
  VCorpus(VectorSource(sms_datatest$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(transformer, "/") %>%
  tm_map(transformer, "@") %>%
  tm_map(transformer, "-") %>%
  tm_map(transformer, "\\.") %>%                    # "\\." matches "." anywhere in the text
  tm_map(transformer, "<[^>]+>.") %>%               # remove HTML-like tags
  tm_map(transformer, "@\\S+.") %>%                 # remove mentions
  tm_map(transformer, "*#") %>%                     # remove hashtags
  tm_map(transformer, "http[^[:space:]]*.") %>%     # remove URLs
  tm_map(transformer, "&.") %>%                     # remove "&" and the following character
  tm_map(transformer, "[[:punct:]].") %>%           # remove punctuation followed by any character (e.g. "...")
  tm_map(transformer, "[^[:alpha:][:space:]].") %>% # remove remaining non-letter characters
  tm_map(removePunctuation) %>%
  tm::tm_map(stemDocument) %>%
  tm_map(stripWhitespace)
glimpse(sms_datatest)
#> Rows: 283
#> Columns: 3
#> $ datetime <chr> "2018-03-01T00:32:00Z", "2018-03-01T08:57:00Z", "2018-03-0...
#> $ text <chr> "Km baru saja akses Apps Sehari-hari terpopuler.Nikmati ak...
#> $ status <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
# check the 9th document
sms.corpus_test[[9]]$content
#> [1] "preorder samsung galaxi s cashback s d rpribu disc s d rpribu bln dgn kartukredit di samsung store tertentu s d mar s info"
sms_test.dtm <- DocumentTermMatrix(x = sms.corpus_test)
inspect(sms_test.dtm)
#> <<DocumentTermMatrix (documents: 283, terms: 808)>>
#> Non-/sparse entries: 4021/224643
#> Sparsity : 98%
#> Maximal term length: 22
#> Weighting : term frequency (tf)
#> Sample :
#> Terms
#> Docs anda axi diblokir info jam kamu ketik registrasi sms ulang
#> 134 0 0 0 1 0 1 0 0 2 0
#> 180 1 0 0 0 0 0 1 0 1 0
#> 200 0 0 0 0 0 0 0 0 0 0
#> 208 1 0 0 0 0 0 1 1 1 2
#> 209 1 0 0 0 0 0 1 1 1 2
#> 212 1 0 0 0 0 0 1 1 1 2
#> 64 2 0 0 0 0 0 0 0 1 0
#> 82 2 0 0 0 0 0 0 0 1 0
#> 84 2 0 0 0 0 0 0 0 1 0
#> 88 2 0 0 0 0 0 0 0 1 0
index_test <- sample(nrow(sms_test.dtm))
label_test_test <- sms_datatest[index_test, 'status'] # all NA: the test data is unlabeled
sms_test_test <- sms_test.dtm
prop.table(table(label_test_test))
#> numeric(0)
dim(label_test_test)
#> NULL
dim(sms_test_test)
#> [1] 283 808
sms_freq_test <- findFreqTerms(sms_test_test, lowfreq = 8)
length(sms_freq_test)
#> [1] 125
sms_test_test1 <- sms_test_test[, sms_freq_test] # terms are in the columns
inspect(sms_test_test1)
#> <<DocumentTermMatrix (documents: 283, terms: 125)>>
#> Non-/sparse entries: 2586/32789
#> Sparsity : 93%
#> Maximal term length: 11
#> Weighting : term frequency (tf)
#> Sample :
#> Terms
#> Docs anda axi diblokir info jam kamu ketik registrasi sms ulang
#> 27 1 0 2 0 0 0 1 1 2 2
#> 32 1 0 2 1 0 0 1 1 2 2
#> 38 1 0 2 1 0 0 1 1 2 2
#> 44 1 0 2 1 0 0 1 1 2 2
#> 60 1 0 2 0 0 0 1 1 2 2
#> 61 1 0 2 0 0 0 1 1 2 2
#> 69 1 0 2 1 0 0 1 1 2 2
#> 72 1 0 2 0 0 0 1 1 2 2
#> 79 1 0 2 0 0 0 1 1 2 2
#> 81 1 0 2 1 0 0 1 1 2 2
dim(sms_test_test1)
#> [1] 283 125
bernoulli_conv <- function(x){
  # ifelse(condition, value if TRUE, value if FALSE)
  x <- as.factor(ifelse(x > 0, 1, 0))
  return(x)
}
sms_test_bn_test <- apply(X = sms_test_test1, MARGIN = 2, FUN = bernoulli_conv)
dim(sms_test_bn_test)
#> [1] 283 125
sms_test_bn_test_df <- data.frame(sms_test_bn_test)
# predict.naiveBayes matches newdata columns to the model's features by name,
# so only terms shared with the training vocabulary contribute to the prediction
pred_test <- predict(object = naive_spam,           # the naive bayes model
                     newdata = sms_test_bn_test_df, # final test data
                     type = "class")                # predict the class label
pred_test_df <- data.frame(pred_test)
dim(sms_test_bn_test_df)
#> [1] 283 125
(table(pred_test_df))
#> pred_test_df
#> ham spam
#> 92 191
head(pred_test)
#> [1] spam spam spam spam ham spam
#> Levels: ham spam
table(pred_test)
#> pred_test
#> ham spam
#> 92 191
prop.table(table(pred_test))
#> pred_test
#> ham spam
#> 0.3250883 0.6749117
library(wordcloud)
wordcloud(sms_datatest$text, max.words = 200, scale = c(2, 0.5),random.order = FALSE)
submission <- sms_test_bn_test_df %>%
mutate(datetime = sms_datatest$datetime,
text = sms_datatest$text
) %>%
mutate(status = pred_test) %>%
select(datetime,status)
table(submission$status)
#>
#> ham spam
#> 92 191
prop.table(table(submission$status))
#>
#> ham spam
#> 0.3250883 0.6749117
# save data
write.csv(submission, "submission-dwi_susiyanto.csv", row.names = F)
# check the first 3 rows
head(submission, 3)
#> datetime status
#> 1 2018-03-01T00:32:00Z spam
#> 2 2018-03-01T08:57:00Z spam
#> 3 2018-03-01T09:15:00Z spam
nrow(submission)
#> [1] 283