Capstone ML : SMS
Introduction
Capstone Project
This project is the final step of my machine learning course, and I chose an SMS dataset as my capstone project. The dataset is a collection of tagged SMS messages gathered by Team Algoritma: 2,004 messages in Bahasa Indonesia, each labelled as ham (legitimate) or spam.
I will use this dataset to build a prediction model that accurately classifies texts as spam or ham (legitimate). I use the Naive Bayes algorithm and compare it with the Decision Tree algorithm to find the best model for this dataset.
Setting up the environment
First, load all the required libraries (packages).
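The library chunk itself is not echoed in this report; a minimal sketch of the packages it appears to rely on (the exact set is an assumption):
library(dplyr)     # glimpse(), select(), mutate()
library(readr)     # read_lines() for the stopword list
library(tm)        # corpus, tm_map(), DocumentTermMatrix()
library(e1071)     # naiveBayes()
library(caret)     # confusionMatrix()
library(partykit)  # ctree() decision tree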
Data Preparation
Reading the data
sms <- read.csv("C2/C2/data_input/sms-cl-spam/data/data-train.csv", stringsAsFactors = FALSE, encoding = "UTF-8")
glimpse(sms)
## Observations: 2,004
## Variables: 3
## $ datetime <chr> "2017-02-15T14:48:00Z", "2017-02-15T15:24:00Z", "2017...
## $ text <chr> "Telegram code 53784", "Rezeki Nomplok Dompetku Pengi...
## $ status <chr> "ham", "spam", "ham", "ham", "ham", "ham", "ham", "sp...
Visualize the Data
Exploring Data
set.seed(100)
sms <- sms %>%
select("label" = status, "text" = text) %>%
mutate("label" = as.factor(label))
class(sms)
## [1] "data.frame"
## [1] "GRATIS INTERNET 2GB plus UNLIMITED CHAT & SOSMED selama 30hari. DataRollover. PROMO 35Rb (Normal 40rb). MAU? Tekan OK kirim SMS ke 929 sd 26/02/18"
## [2] "jngan sampai kelewatan iRing Iis Dahlia-Rindu (SP), FREE lho.Lngsung bls YA aja ya!"
## [3] "G-679179 is your Google verification code."
## [4] "awas ketinggalan iRing Aldy Maldini-Biar Aku Yang Pergi, Rp.0,1/3hr prpnjngan Rp.3190 .Lngsung bls YA aja ya!"
## [5] "iRing keren cuman buat km, Rizki Ridho DAcademy-Terlanjur Sakit (DM17),Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!"
## [6] "Km baru saja akses Apps Sehari-hari terpopuler.Nikmati akses YOUTUBE ga habis habis dgn beli pkt Unlimited HANYA di *123# atau myIM3 http://im3.do/m3"
## [7] "Dapatkan cashback s.d Rp50rb di Pepper Lunch, \n Bakerzin,Raffel's Sandwich atau cashback s.d Rp100rb di \n Inul Vizta jika bayar pakai PayPro.Info @indosa"
## [8] "Bentar lagi turun"
## [9] "Sisa kuota 116.0MB. Beli EXTRA Kuota atau upgrade FREEDOM COMBO di*123# atau im3.do/mc"
## [10] "AGAR TDK DIBLOKIR,sgr registrasi ulang kartu prabayar Anda.Tlp dan SMS keluar diblokir mulai 1-31 Maret 2018.Ketik ULANG#NIK#No.KK SMS ke 4444.Info 838"
Data Processing
Change data type to corpus
I prepare a corpus of all the documents in the dataframe.
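The corpus-building chunk is not echoed; a minimal sketch, assuming the usual ‘tm’ workflow:
# Build a volatile corpus from the text column
sms.corpus <- VCorpus(VectorSource(sms$text))
class(sms.corpus)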
## [1] "VCorpus" "Corpus"
## [1] "Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini"
Data Cleaning
Next, I clean up the corpus by removing numbers, punctuation, stopwords, and extra whitespace, and by converting text to lower case. I use the tm_map() function from the ‘tm’ package for each of these steps.
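Most of the cleaning chunks are not echoed in the output; a minimal sketch of the full chain (each step's effect on one example message is shown below, and stopword_id is loaded in the removeWords subsection):
sms.corpus <- tm_map(sms.corpus, content_transformer(tolower))       # lower case
sms.corpus <- tm_map(sms.corpus, removeNumbers)                      # drop digits
sms.corpus <- tm_map(sms.corpus, removeWords, stopword_id)           # Indonesian stopwords
sms.corpus <- tm_map(sms.corpus, removeWords, stopwords("english"))  # English stopwords
sms.corpus <- tm_map(sms.corpus, removePunctuation)
sms.corpus <- tm_map(sms.corpus, stripWhitespace)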
tolower
## [1] "apakah anda mencoba mengakses akun anda dari perangkat lain? jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. jika tidak, mohon abaikan pesan ini"
removeNumbers
## [1] "apakah anda mencoba mengakses akun anda dari perangkat lain? jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=fe--e-bac-dc dalam jam ke depan. jika tidak, mohon abaikan pesan ini"
removeWords / stopwords
stopword_id <- readr::read_lines("C2/stoplist-id.txt")
sms.corpus <- tm_map(sms.corpus, removeWords, stopword_id )
sms.corpus <- tm_map(sms.corpus, removeWords, stopwords("english") )
sms.corpus[[6]]$content
## [1] " mencoba mengakses akun perangkat ? ya, mohon klik tautan https://api.gojek.co.id/customers/device?token=fe--e-bac-dc jam . , mohon abaikan pesan "
removePunctuation
## [1] " mencoba mengakses akun perangkat ya mohon klik tautan httpsapigojekcoidcustomersdevicetokenfeebacdc jam mohon abaikan pesan "
stripWhitespace
## [1] " mencoba mengakses akun perangkat ya mohon klik tautan httpsapigojekcoidcustomersdevicetokenfeebacdc jam mohon abaikan pesan "
The Document Term Matrix or Tokenization
In this approach, I represent each word in a document as a token (or feature) in a document-term matrix (DTM). The rows of the DTM correspond to documents in the collection, the columns correspond to terms, and its elements are the term frequencies. I use the built-in DocumentTermMatrix() function from the ‘tm’ package to create the DTM, as sketched below.
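A minimal sketch of the chunk assumed to produce the output below:
sms.dtm <- DocumentTermMatrix(sms.corpus)
inspect(sms.dtm)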
## <<DocumentTermMatrix (documents: 2004, terms: 2819)>>
## Non-/sparse entries: 16944/5632332
## Sparsity : 100%
## Maximal term length: 79
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs beli bonus dgn info internet kuota paket pulsa sms utk
## 1124 0 0 2 1 1 1 0 0 0 0
## 1177 0 0 2 1 1 1 0 0 0 0
## 193 0 0 0 0 0 0 0 0 0 0
## 197 0 0 0 0 0 0 0 0 1 0
## 225 0 0 0 0 1 0 0 0 0 0
## 29 0 0 0 0 0 0 0 0 0 0
## 378 0 0 0 0 0 0 0 0 0 0
## 409 0 0 0 0 0 0 0 0 0 0
## 410 0 1 1 2 0 0 2 2 0 0
## 955 0 0 0 0 2 0 3 2 0 0
The Term-Document Matrix
To see the number of occurrences of each word across the documents, I also build a term-document matrix (the transpose of the DTM).
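As above, a minimal sketch of the chunk assumed to produce the output below:
sms.tdm <- TermDocumentMatrix(sms.corpus)
inspect(sms.tdm)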
## <<TermDocumentMatrix (terms: 2819, documents: 2004)>>
## Non-/sparse entries: 16944/5632332
## Sparsity : 100%
## Maximal term length: 79
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 1124 1177 193 197 225 29 378 409 410 955
## beli 0 0 0 0 0 0 0 0 0 0
## bonus 0 0 0 0 0 0 0 0 1 0
## dgn 2 2 0 0 0 0 0 0 1 0
## info 1 1 0 0 0 0 0 0 2 0
## internet 1 1 0 0 1 0 0 0 0 2
## kuota 1 1 0 0 0 0 0 0 0 0
## paket 0 0 0 0 0 0 0 0 2 3
## pulsa 0 0 0 0 0 0 0 0 2 2
## sms 0 0 0 1 0 0 0 0 0 0
## utk 0 0 0 0 0 0 0 0 0 0
m <- as.matrix(sms.tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d)
## word freq
## info info 314
## kuota kuota 308
## pulsa pulsa 250
## sms sms 244
## dgn dgn 208
## paket paket 187
Build Model
Cross Validation (Train-Test Split)
Next, I create 80:20 partitions of the document term matrix for training and testing purposes.
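The splitting chunk is not echoed; a minimal sketch (the index name is an assumption):
set.seed(100)
index <- sample(nrow(sms.dtm), floor(nrow(sms.dtm) * 0.8))  # 80% of rows for training
train <- sms.dtm[index, ]
test  <- sms.dtm[-index, ]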
Feature Selection
## [1] 1603 2819
The DTM contains 2,819 features, but not all of them will be useful for classification. I reduce the number of features by ignoring words that appear in fewer than 20 messages. To do this, I use the findFreqTerms() function to identify the frequent terms, as sketched below.
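A minimal sketch of the feature selection (note that findFreqTerms() filters on overall term frequency; the object names are assumptions):
freq_terms <- findFreqTerms(train, lowfreq = 20)  # terms occurring at least 20 times
train <- train[, freq_terms]
test  <- test[, freq_terms]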
Train and test labels
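The labels come from the same split; a minimal sketch, reusing the index assumed above:
train_label <- sms[index, "label"]
test_label  <- sms[-index, "label"]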
Bernoulli
## <<DocumentTermMatrix (documents: 1603, terms: 201)>>
## Non-/sparse entries: 7700/314503
## Sparsity : 98%
## Maximal term length: 14
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs beli bonus dgn info internet kuota paket pulsa sms utk
## 1042 0 1 2 1 0 1 0 0 0 0
## 1067 1 0 0 1 0 1 0 0 0 0
## 1068 1 0 0 1 0 1 0 0 0 0
## 1072 1 1 0 1 1 2 0 0 0 0
## 1400 0 0 0 0 0 2 1 1 1 1
## 410 0 1 1 2 0 0 2 2 0 0
## 566 2 0 1 0 1 1 1 0 0 2
## 650 1 0 1 0 1 0 0 0 0 1
## 955 0 0 0 0 2 0 3 2 0 0
## 999 0 1 2 1 0 1 0 0 0 0
# Convert term counts to Bernoulli features: 1 if a term occurs in a message, 0 otherwise
bernoulli_conv <- function(x){
  as.factor(as.numeric(x > 0))
}
train_bn <- apply(train, MARGIN = 2, bernoulli_conv)
test_bn <- apply(test, MARGIN = 2, bernoulli_conv)
class(train_bn)
## [1] "matrix"
The Naive Bayes Algorithm
To train the model I use the naiveBayes function from the ‘e1071’ package. Since Naive Bayes evaluates products of probabilities, I need some way of assigning non-zero probabilities to words which do not occur in the sample. I use Laplace 1 smoothing to this end.
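The training chunk is not echoed; a minimal sketch, assuming the label vector from the split above:
model_nb <- naiveBayes(x = train_bn, y = train_label, laplace = 1)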
Naive Bayes Model
Testing the predictions
Model evaluation
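A minimal sketch of the prediction and evaluation steps (object names are assumptions), which produce the statistics below:
pred_nb <- predict(model_nb, test_bn)
confusionMatrix(pred_nb, test_label, positive = "spam")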
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 206 9
## spam 22 164
##
## Accuracy : 0.9227
## 95% CI : (0.8921, 0.9469)
## No Information Rate : 0.5686
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.8438
##
## Mcnemar's Test P-Value : 0.03114
##
## Sensitivity : 0.9480
## Specificity : 0.9035
## Pos Pred Value : 0.8817
## Neg Pred Value : 0.9581
## Prevalence : 0.4314
## Detection Rate : 0.4090
## Detection Prevalence : 0.4638
## Balanced Accuracy : 0.9257
##
## 'Positive' Class : spam
##
The Decision Tree Algorithm
Data Preparation
Data Processing
## [1] "matrix"
Decision Tree Model
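The fitted model printed below is a conditional inference tree ("Fitted party"); a minimal sketch, assuming partykit::ctree() on a data frame that combines the reduced term matrix with the labels:
train_dt <- as.data.frame(as.matrix(train))  # term counts as columns
train_dt$label <- train_label
model_dt <- ctree(label ~ ., data = train_dt)
model_dt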
##
## Model formula:
## label ~ aja + akses + aktif + aktifkan + aplikasi + apps + aspen +
## axis + axisnet + ayo + balas + banking + bebas + beli + berhasil +
## berita + berlaku + bersifat + biaya + blm + blok + bls +
## bni + bonus + booking + bronet + bukit + cashback + cek +
## cimb + combo + cuman + customer + cvi + daftar + dana + dapatkan +
## dapetin + data + datarollover + dgn + diaktifkan + diblokir +
## diblokirsgr + digit + dihapus + dipotong + diprosesreply +
## diskon + dlm + download + dpt + dptkan + extra + film + free +
## freedom + freedomcombo + gojek + gopay + gratis + habis +
## hadiah + harga + hemat + hny + hnya + httpimm + httpimmc +
## hub + hubungi + idr + iflix + immc + indosat + info + internet +
## internetan + iring + isi + iya + jalan + jam + jaringan +
## kartu + kasih + kelebihan + keren + kerennya + kesempatan +
## ketik + khusus + kirim + klik + kode + kuota + kursi + layanan +
## lho + login + maaf + mai + malam + maret + masuk + mba +
## memberitahukan + mencukupi + menit + mobile + myim + nelp +
## nelpon + niaga + nikmati + nomor + nonton + norm + normal +
## nya + oke + ooredoo + pai + pakai + paket + paketnya + passcode +
## pastikan + pay + paypro + pelanggan + pemakaian + penawaran +
## perkb + permai + pesan + pesanan + pin + pkt + point + prabayar +
## premium + promo + prpnjngan + puas + pulsa + rahasia + rbhr +
## registrasi + rek + rollover + rphr + rprb + rprbhr + saldo +
## sdh + selamat + sepuasnya + sesuai + sih + silakan + sisa +
## skrg + sms + sos + spi + spotify + sukses + tap + tarif +
## tcash + tdk + tekan + telp + tercatat + terima + tersedia +
## tkn + tlp + top + topup + tower + transaksi + transfer +
## trf + tseltappromo + tujuan + tukarkan + tunai + tunggu +
## udah + udh + ulang + ulangnikkk + unlimited + usage + utama +
## utk + via + youtube + yth
##
## Fitted party:
## [1] root
## | [2] info in 0
## | | [3] ooredoo in 0
## | | | [4] iring in 0
## | | | | [5] pelanggan in 0
## | | | | | [6] gratis in 0
## | | | | | | [7] beli in 0
## | | | | | | | [8] isi in 0
## | | | | | | | | [9] kuota in 0
## | | | | | | | | | [10] ayo in 0
## | | | | | | | | | | [11] berlaku in 0
## | | | | | | | | | | | [12] mai in 0: ham (n = 925, err = 8.1%)
## | | | | | | | | | | | [13] mai in 1: spam (n = 12, err = 0.0%)
## | | | | | | | | | | [14] berlaku in 1: spam (n = 17, err = 11.8%)
## | | | | | | | | | [15] ayo in 1: spam (n = 24, err = 12.5%)
## | | | | | | | | [16] kuota in 1: spam (n = 47, err = 21.3%)
## | | | | | | | [17] isi in 1
## | | | | | | | | [18] mencukupi in 0: spam (n = 43, err = 14.0%)
## | | | | | | | | [19] mencukupi in 1: ham (n = 9, err = 0.0%)
## | | | | | | [20] beli in 1
## | | | | | | | [21] sisa in 0: spam (n = 50, err = 2.0%)
## | | | | | | | [22] sisa in 1: ham (n = 12, err = 0.0%)
## | | | | | [23] gratis in 1: spam (n = 47, err = 4.3%)
## | | | | [24] pelanggan in 1: spam (n = 57, err = 5.3%)
## | | | [25] iring in 1: spam (n = 53, err = 0.0%)
## | | [26] ooredoo in 1: spam (n = 66, err = 0.0%)
## | [27] info in 1: spam (n = 241, err = 14.9%)
##
## Number of inner nodes: 13
## Number of terminal nodes: 14
Testing the prediction
Model Evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 208 16
## spam 20 157
##
## Accuracy : 0.9102
## 95% CI : (0.8779, 0.9363)
## No Information Rate : 0.5686
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.8175
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9075
## Specificity : 0.9123
## Pos Pred Value : 0.8870
## Neg Pred Value : 0.9286
## Prevalence : 0.4314
## Detection Rate : 0.3915
## Detection Prevalence : 0.4414
## Balanced Accuracy : 0.9099
##
## 'Positive' Class : spam
##
Conclusion
From the results, the Naive Bayes model achieves the higher accuracy and recall (sensitivity). Therefore, I use it for the data submission.
Data Submission
Preparing Data
## [1] "VCorpus" "Corpus"
## [1] "iRing keren cuman buat km, Via Vallen-Bojo Galak (Reff),Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!"
## <<DocumentTermMatrix (documents: 283, terms: 1197)>>
## Non-/sparse entries: 4645/334106
## Sparsity : 99%
## Maximal term length: 101
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs 24jam bronet info838 kamu kartu kuota paket registrasi sms ulang
## 134 0 0 0 0 0 0 0 0 0 0
## 139 1 1 1 0 0 0 0 0 0 0
## 160 0 0 0 0 0 0 0 0 0 0
## 162 0 0 0 0 0 0 0 0 0 0
## 169 1 1 1 0 0 0 0 0 0 0
## 180 0 0 0 0 0 2 1 0 1 0
## 183 0 0 0 0 0 0 0 0 1 0
## 186 1 1 1 0 0 0 0 0 0 0
## 223 1 1 1 0 0 0 0 0 0 0
## 272 0 0 0 0 0 1 1 0 0 0
# The same Bernoulli conversion, applied to the submission DTM
bernoulli_conv <- function(x){
  as.factor(as.numeric(x > 0))
}
data_sub_bn <- apply(data_sub_dtm, MARGIN = 2, bernoulli_conv)
class(data_sub_bn)
## [1] "matrix"
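The final prediction and file-writing steps are not shown; a minimal sketch using the Naive Bayes model (the submission data frame data_sub and the file name are assumptions):
pred_sub <- predict(model_nb, data_sub_bn)
submission <- data.frame(datetime = data_sub$datetime, status = pred_sub)
write.csv(submission, "submission-naivebayes.csv", row.names = FALSE)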