Capstone ML: SMS

Introduction

Capstone Project

This project is the final step in my machine learning course, and I chose an SMS dataset as my capstone project. The SMS dataset is a set of tagged SMS messages collected by Team Algoritma. It contains 2,004 SMS messages in Bahasa Indonesia, each tagged as either ham (legitimate) or spam.

I will use this dataset to build a prediction model that accurately classifies which texts are spam and which are ham (legitimate). I use the Naive Bayes algorithm and compare it with a Decision Tree algorithm to find the best model for this dataset.

Data Preparation

Reading the data
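A minimal sketch of this step, assuming the data ships as a CSV (the file path is an assumption):

library(dplyr)

# read the raw SMS data; the path is an assumption
sms <- read.csv("data_input/sms.csv", stringsAsFactors = FALSE)

# inspect the structure of the data
glimpse(sms)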

## Observations: 2,004
## Variables: 3
## $ datetime <chr> "2017-02-15T14:48:00Z", "2017-02-15T15:24:00Z", "2017...
## $ text     <chr> "Telegram code 53784", "Rezeki Nomplok Dompetku Pengi...
## $ status   <chr> "ham", "spam", "ham", "ham", "ham", "ham", "ham", "sp...

Visualize the Data

Exploring Data
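The checks below can be reproduced along these lines (a sketch; the report does not show which rows were displayed, so head() here is an assumption):

# confirm the object type
class(sms)

# peek at a few raw messages
head(sms$text, 10)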

## [1] "data.frame"
##  [1] "GRATIS INTERNET 2GB plus UNLIMITED CHAT & SOSMED selama 30hari. DataRollover. PROMO 35Rb (Normal 40rb). MAU? Tekan OK kirim SMS ke 929 sd 26/02/18"         
##  [2] "jngan sampai kelewatan iRing Iis Dahlia-Rindu (SP), FREE lho.Lngsung bls YA aja ya!"                                                                        
##  [3] "G-679179 is your Google verification code."                                                                                                                 
##  [4] "awas ketinggalan iRing Aldy Maldini-Biar Aku Yang Pergi, Rp.0,1/3hr prpnjngan Rp.3190 .Lngsung bls YA aja ya!"                                              
##  [5] "iRing keren cuman buat km, Rizki Ridho DAcademy-Terlanjur Sakit (DM17),Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!"                                
##  [6] "Km baru saja akses Apps Sehari-hari terpopuler.Nikmati akses YOUTUBE ga habis habis dgn beli pkt Unlimited HANYA di *123# atau myIM3 http://im3.do/m3"      
##  [7] "Dapatkan cashback s.d Rp50rb di Pepper Lunch, \n Bakerzin,Raffel's Sandwich atau cashback s.d Rp100rb di \n Inul Vizta jika bayar pakai PayPro.Info @indosa"
##  [8] "Bentar lagi turun"                                                                                                                                          
##  [9] "Sisa kuota 116.0MB. Beli EXTRA Kuota atau upgrade FREEDOM COMBO di*123# atau im3.do/mc"                                                                     
## [10] "AGAR TDK DIBLOKIR,sgr registrasi ulang kartu prabayar Anda.Tlp dan SMS keluar diblokir mulai 1-31 Maret 2018.Ketik ULANG#NIK#No.KK SMS ke 4444.Info 838"

Data Processing

Converting the data to a corpus

I prepare a corpus of all the documents in the data frame, as sketched below.
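A minimal sketch using the ‘tm’ package (the document index used for inspection is an assumption):

library(tm)

# each SMS becomes one document in a volatile corpus
sms_corpus <- VCorpus(VectorSource(sms$text))

class(sms_corpus)

# view the content of a single document; the index is an assumption
sms_corpus[[2000]]$content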

## [1] "VCorpus" "Corpus"
## [1] "Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini"

Data Cleaning

Next, I clean up the corpus by eliminating numbers, punctuation, and extra white space, and by converting everything to lower case. I use the tm_map() function from the ‘tm’ package to this end, applying each transformation in turn, as sketched below.
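A minimal sketch of the full pipeline, assuming stopwords_id is an Indonesian stopword vector loaded from an external list (tm’s built-in stopwords() does not ship an Indonesian list); the subsections below show the effect of each step on one document:

# assumed external Indonesian stopword list
stopwords_id <- readLines("stopwords-id.txt")

sms_corpus_clean <- sms_corpus %>%
  tm_map(content_transformer(tolower)) %>%   # lower-case everything
  tm_map(removeNumbers) %>%                  # drop digits
  tm_map(removeWords, stopwords_id) %>%      # drop stopwords
  tm_map(removePunctuation) %>%              # drop punctuation
  tm_map(stripWhitespace)                    # collapse repeated spaces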

tolower


## [1] "apakah anda mencoba mengakses akun anda dari perangkat lain? jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. jika tidak, mohon abaikan pesan ini"

removeNumbers

## [1] "apakah anda mencoba mengakses akun anda dari perangkat lain? jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=fe--e-bac-dc dalam  jam ke depan. jika tidak, mohon abaikan pesan ini"

removeWords / stopwords

## [1] "  mencoba mengakses akun   perangkat ?  ya, mohon klik tautan  https://api.gojek.co.id/customers/device?token=fe--e-bac-dc   jam  .  , mohon abaikan pesan "

removePunctuation

## [1] "  mencoba mengakses akun   perangkat   ya mohon klik tautan  httpsapigojekcoidcustomersdevicetokenfeebacdc   jam     mohon abaikan pesan "

stripWhitespace

## [1] " mencoba mengakses akun perangkat ya mohon klik tautan httpsapigojekcoidcustomersdevicetokenfeebacdc jam mohon abaikan pesan "

The Document Term Matrix or Tokenization

In this approach, I represent each word in a document as a token (or feature) with a document term matrix (DTM). The rows of the DTM correspond to documents in the collection, columns correspond to terms, and its elements are the term frequencies. I use a built-in function from the ‘tm’ package to create the DTM.
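A minimal sketch; inspect() prints the summary shown below:

# rows = documents, columns = terms, cells = raw term counts
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

inspect(sms_dtm)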

## <<DocumentTermMatrix (documents: 2004, terms: 2819)>>
## Non-/sparse entries: 16944/5632332
## Sparsity           : 100%
## Maximal term length: 79
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   beli bonus dgn info internet kuota paket pulsa sms utk
##   1124    0     0   2    1        1     1     0     0   0   0
##   1177    0     0   2    1        1     1     0     0   0   0
##   193     0     0   0    0        0     0     0     0   0   0
##   197     0     0   0    0        0     0     0     0   1   0
##   225     0     0   0    0        1     0     0     0   0   0
##   29      0     0   0    0        0     0     0     0   0   0
##   378     0     0   0    0        0     0     0     0   0   0
##   409     0     0   0    0        0     0     0     0   0   0
##   410     0     1   1    2        0     0     2     2   0   0
##   955     0     0   0    0        2     0     3     2   0   0

TermDocumentMatrix

To see the number of occurrences of each word across the documents, I also build the transposed representation, a term document matrix (TDM).
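A minimal sketch; the frequency table printed after the TDM summary can be built by summing each term’s row:

sms_tdm <- TermDocumentMatrix(sms_corpus_clean)
inspect(sms_tdm)

# total frequency of every term, most common first
freq <- sort(rowSums(as.matrix(sms_tdm)), decreasing = TRUE)
head(data.frame(word = names(freq), freq = freq))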

## <<TermDocumentMatrix (terms: 2819, documents: 2004)>>
## Non-/sparse entries: 16944/5632332
## Sparsity           : 100%
## Maximal term length: 79
## Weighting          : term frequency (tf)
## Sample             :
##           Docs
## Terms      1124 1177 193 197 225 29 378 409 410 955
##   beli        0    0   0   0   0  0   0   0   0   0
##   bonus       0    0   0   0   0  0   0   0   1   0
##   dgn         2    2   0   0   0  0   0   0   1   0
##   info        1    1   0   0   0  0   0   0   2   0
##   internet    1    1   0   0   1  0   0   0   0   2
##   kuota       1    1   0   0   0  0   0   0   0   0
##   paket       0    0   0   0   0  0   0   0   2   3
##   pulsa       0    0   0   0   0  0   0   0   2   2
##   sms         0    0   0   1   0  0   0   0   0   0
##   utk         0    0   0   0   0  0   0   0   0   0
##        word freq
## info   info  314
## kuota kuota  308
## pulsa pulsa  250
## sms     sms  244
## dgn     dgn  208
## paket paket  187

Build Model

Train-Test Split

Next, I create 80:20 partitions of the document term matrix for training and testing purposes.
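A minimal sketch of the split (the seed value is an assumption):

set.seed(100)  # the seed value is an assumption

# 80% of the row indices go to training, the rest to testing
index <- sample(nrow(sms_dtm), size = floor(nrow(sms_dtm) * 0.8))
sms_dtm_train <- sms_dtm[index, ]
sms_dtm_test  <- sms_dtm[-index, ]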

Feature Selection

## [1] 1603 2819

The DTM contains 2,819 features, but not all of them will be useful for classification. I reduce the number of features by ignoring words which appear fewer than 20 times in the training messages. To do this, I use the findFreqTerms() function to identify the frequent words, as sketched below.
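A minimal sketch; note that findFreqTerms() thresholds on total term frequency:

# terms occurring at least 20 times in the training data
sms_freq_terms <- findFreqTerms(sms_dtm_train, lowfreq = 20)

# keep only those columns in both partitions
sms_dtm_train <- sms_dtm_train[, sms_freq_terms]
sms_dtm_test  <- sms_dtm_test[, sms_freq_terms]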

Train and test labels
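A sketch of extracting the class labels with the same index vector used for the DTM split:

# labels must be factors for the classifiers and for confusionMatrix()
train_label <- as.factor(sms$status[index])
test_label  <- as.factor(sms$status[-index])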

Bernoulli

## <<DocumentTermMatrix (documents: 1603, terms: 201)>>
## Non-/sparse entries: 7700/314503
## Sparsity           : 98%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   beli bonus dgn info internet kuota paket pulsa sms utk
##   1042    0     1   2    1        0     1     0     0   0   0
##   1067    1     0   0    1        0     1     0     0   0   0
##   1068    1     0   0    1        0     1     0     0   0   0
##   1072    1     1   0    1        1     2     0     0   0   0
##   1400    0     0   0    0        0     2     1     1   1   1
##   410     0     1   1    2        0     0     2     2   0   0
##   566     2     0   1    0        1     1     1     0   0   2
##   650     1     0   1    0        1     0     0     0   0   1
##   955     0     0   0    0        2     0     3     2   0   0
##   999     0     1   2    1        0     1     0     0   0   0
## [1] "matrix"

The Naive Bayes Algorithm

To train the model, I use the naiveBayes() function from the ‘e1071’ package. Since Naive Bayes evaluates products of probabilities, I need some way of assigning non-zero probabilities to words which do not occur in the sample. I use Laplace smoothing with a value of 1 to this end.
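A minimal sketch of the training call:

library(e1071)

# train Naive Bayes on the Bernoulli features with Laplace smoothing
model_nb <- naiveBayes(x = train_x, y = train_label, laplace = 1)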

Model Evaluation
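A sketch of the evaluation, assuming the ‘caret’ package produces the confusion matrix below:

library(caret)

# predict on the held-out 20% and compare against the true labels
pred_nb <- predict(model_nb, test_x)
confusionMatrix(pred_nb, test_label, positive = "spam")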

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  206    9
##       spam  22  164
##                                               
##                Accuracy : 0.9227              
##                  95% CI : (0.8921, 0.9469)    
##     No Information Rate : 0.5686              
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.8438              
##                                               
##  Mcnemar's Test P-Value : 0.03114             
##                                               
##             Sensitivity : 0.9480              
##             Specificity : 0.9035              
##          Pos Pred Value : 0.8817              
##          Neg Pred Value : 0.9581              
##              Prevalence : 0.4314              
##          Detection Rate : 0.4090              
##    Detection Prevalence : 0.4638              
##       Balanced Accuracy : 0.9257              
##                                               
##        'Positive' Class : spam                
## 

The Decision Tree Algorithm

Decision Tree Model
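The "Fitted party" output below comes from a conditional inference tree; a minimal sketch using ctree() from the ‘partykit’ package (the data-frame assembly is an assumption):

library(partykit)

# combine the Bernoulli features and the labels into one data frame
train_df <- data.frame(train_x, stringsAsFactors = TRUE)
train_df$label <- train_label

# fit a conditional inference tree on all features
model_dt <- ctree(label ~ ., data = train_df)
model_dt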

## 
## Model formula:
## label ~ aja + akses + aktif + aktifkan + aplikasi + apps + aspen + 
##     axis + axisnet + ayo + balas + banking + bebas + beli + berhasil + 
##     berita + berlaku + bersifat + biaya + blm + blok + bls + 
##     bni + bonus + booking + bronet + bukit + cashback + cek + 
##     cimb + combo + cuman + customer + cvi + daftar + dana + dapatkan + 
##     dapetin + data + datarollover + dgn + diaktifkan + diblokir + 
##     diblokirsgr + digit + dihapus + dipotong + diprosesreply + 
##     diskon + dlm + download + dpt + dptkan + extra + film + free + 
##     freedom + freedomcombo + gojek + gopay + gratis + habis + 
##     hadiah + harga + hemat + hny + hnya + httpimm + httpimmc + 
##     hub + hubungi + idr + iflix + immc + indosat + info + internet + 
##     internetan + iring + isi + iya + jalan + jam + jaringan + 
##     kartu + kasih + kelebihan + keren + kerennya + kesempatan + 
##     ketik + khusus + kirim + klik + kode + kuota + kursi + layanan + 
##     lho + login + maaf + mai + malam + maret + masuk + mba + 
##     memberitahukan + mencukupi + menit + mobile + myim + nelp + 
##     nelpon + niaga + nikmati + nomor + nonton + norm + normal + 
##     nya + oke + ooredoo + pai + pakai + paket + paketnya + passcode + 
##     pastikan + pay + paypro + pelanggan + pemakaian + penawaran + 
##     perkb + permai + pesan + pesanan + pin + pkt + point + prabayar + 
##     premium + promo + prpnjngan + puas + pulsa + rahasia + rbhr + 
##     registrasi + rek + rollover + rphr + rprb + rprbhr + saldo + 
##     sdh + selamat + sepuasnya + sesuai + sih + silakan + sisa + 
##     skrg + sms + sos + spi + spotify + sukses + tap + tarif + 
##     tcash + tdk + tekan + telp + tercatat + terima + tersedia + 
##     tkn + tlp + top + topup + tower + transaksi + transfer + 
##     trf + tseltappromo + tujuan + tukarkan + tunai + tunggu + 
##     udah + udh + ulang + ulangnikkk + unlimited + usage + utama + 
##     utk + via + youtube + yth
## 
## Fitted party:
## [1] root
## |   [2] info in 0
## |   |   [3] ooredoo in 0
## |   |   |   [4] iring in 0
## |   |   |   |   [5] pelanggan in 0
## |   |   |   |   |   [6] gratis in 0
## |   |   |   |   |   |   [7] beli in 0
## |   |   |   |   |   |   |   [8] isi in 0
## |   |   |   |   |   |   |   |   [9] kuota in 0
## |   |   |   |   |   |   |   |   |   [10] ayo in 0
## |   |   |   |   |   |   |   |   |   |   [11] berlaku in 0
## |   |   |   |   |   |   |   |   |   |   |   [12] mai in 0: ham (n = 925, err = 8.1%)
## |   |   |   |   |   |   |   |   |   |   |   [13] mai in 1: spam (n = 12, err = 0.0%)
## |   |   |   |   |   |   |   |   |   |   [14] berlaku in 1: spam (n = 17, err = 11.8%)
## |   |   |   |   |   |   |   |   |   [15] ayo in 1: spam (n = 24, err = 12.5%)
## |   |   |   |   |   |   |   |   [16] kuota in 1: spam (n = 47, err = 21.3%)
## |   |   |   |   |   |   |   [17] isi in 1
## |   |   |   |   |   |   |   |   [18] mencukupi in 0: spam (n = 43, err = 14.0%)
## |   |   |   |   |   |   |   |   [19] mencukupi in 1: ham (n = 9, err = 0.0%)
## |   |   |   |   |   |   [20] beli in 1
## |   |   |   |   |   |   |   [21] sisa in 0: spam (n = 50, err = 2.0%)
## |   |   |   |   |   |   |   [22] sisa in 1: ham (n = 12, err = 0.0%)
## |   |   |   |   |   [23] gratis in 1: spam (n = 47, err = 4.3%)
## |   |   |   |   [24] pelanggan in 1: spam (n = 57, err = 5.3%)
## |   |   |   [25] iring in 1: spam (n = 53, err = 0.0%)
## |   |   [26] ooredoo in 1: spam (n = 66, err = 0.0%)
## |   [27] info in 1: spam (n = 241, err = 14.9%)
## 
## Number of inner nodes:    13
## Number of terminal nodes: 14

Visualize Data
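The rendered tree plot is not reproduced here; a sketch of how it might be drawn:

# plot the fitted tree; type = "simple" keeps the nodes compact
plot(model_dt, type = "simple")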

Model Evaluation
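A sketch of the evaluation; in practice the factor levels of the train and test columns need to match for predict() to work:

# assemble the test features the same way as the training set
test_df <- data.frame(test_x, stringsAsFactors = TRUE)

pred_dt <- predict(model_dt, newdata = test_df)
confusionMatrix(pred_dt, test_label, positive = "spam")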

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  208   16
##       spam  20  157
##                                              
##                Accuracy : 0.9102             
##                  95% CI : (0.8779, 0.9363)   
##     No Information Rate : 0.5686             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.8175             
##                                              
##  Mcnemar's Test P-Value : 0.6171             
##                                              
##             Sensitivity : 0.9075             
##             Specificity : 0.9123             
##          Pos Pred Value : 0.8870             
##          Neg Pred Value : 0.9286             
##              Prevalence : 0.4314             
##          Detection Rate : 0.3915             
##    Detection Prevalence : 0.4414             
##       Balanced Accuracy : 0.9099             
##                                              
##        'Positive' Class : spam               
## 

Conclusion

From the results, the Naive Bayes model has the higher Accuracy (0.9227 vs. 0.9102) and Recall/Sensitivity (0.9480 vs. 0.9075). Therefore, I use it to generate the predictions for the data submission.

Data Submission

Preparing Data

## [1] "VCorpus" "Corpus"
## [1] "iRing keren cuman buat km, Via Vallen-Bojo Galak (Reff),Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!"
## <<DocumentTermMatrix (documents: 283, terms: 1197)>>
## Non-/sparse entries: 4645/334106
## Sparsity           : 99%
## Maximal term length: 101
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  24jam bronet info838 kamu kartu kuota paket registrasi sms ulang
##   134     0      0       0    0     0     0     0          0   0     0
##   139     1      1       1    0     0     0     0          0   0     0
##   160     0      0       0    0     0     0     0          0   0     0
##   162     0      0       0    0     0     0     0          0   0     0
##   169     1      1       1    0     0     0     0          0   0     0
##   180     0      0       0    0     0     2     1          0   1     0
##   183     0      0       0    0     0     0     0          0   1     0
##   186     1      1       1    0     0     0     0          0   0     0
##   223     1      1       1    0     0     0     0          0   0     0
##   272     0      0       0    0     0     1     1          0   0     0
## [1] "matrix"

Wahyudi Tri

Sunday, September 22, 2019