Capstone ML : SMS
Introduction
Capstone Project
This project is the final step of my machine learning course, and I chose an SMS dataset as my capstone project. The dataset is a collection of tagged SMS messages gathered by Team Algoritma: 2,004 messages in Bahasa Indonesia, each labelled as ham (legitimate) or spam.
I will use this dataset to build a prediction model that accurately classifies texts as spam or ham (legitimate). I use the Naive Bayes algorithm and compare it with the Decision Tree algorithm to find the best model for this dataset.
Setting up the environment
First, load all the required libraries (packages).
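The library chunk itself is not echoed in this report; a minimal sketch of the packages it appears to rely on (the exact set is an assumption):
library(dplyr)     # glimpse(), select(), mutate()
library(readr)     # read_lines() for the stopword list
library(tm)        # corpus, tm_map(), DocumentTermMatrix()
library(e1071)     # naiveBayes()
library(caret)     # confusionMatrix()
library(partykit)  # ctree() decision tree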
Data Preparation
Reading the data
sms <- read.csv("C2/C2/data_input/sms-cl-spam/data/data-train.csv", stringsAsFactors = FALSE, encoding = "UTF-8")
glimpse(sms)
## Observations: 2,004
## Variables: 3
## $ datetime <chr> "2017-02-15T14:48:00Z", "2017-02-15T15:24:00Z", "2017...
## $ text <chr> "Telegram code 53784", "Rezeki Nomplok Dompetku Pengi...
## $ status <chr> "ham", "spam", "ham", "ham", "ham", "ham", "ham", "sp...
Visualize the Data
Exploring Data
set.seed(100)
sms <- sms %>%
select("label" = status, "text" = text) %>%
mutate("label" = as.factor(label))
class(sms)
## [1] "data.frame"
## [1] "GRATIS INTERNET 2GB plus UNLIMITED CHAT & SOSMED selama 30hari. DataRollover. PROMO 35Rb (Normal 40rb). MAU? Tekan OK kirim SMS ke 929 sd 26/02/18"
## [2] "jngan sampai kelewatan iRing Iis Dahlia-Rindu (SP), FREE lho.Lngsung bls YA aja ya!"
## [3] "G-679179 is your Google verification code."
## [4] "awas ketinggalan iRing Aldy Maldini-Biar Aku Yang Pergi, Rp.0,1/3hr prpnjngan Rp.3190 .Lngsung bls YA aja ya!"
## [5] "iRing keren cuman buat km, Rizki Ridho DAcademy-Terlanjur Sakit (DM17),Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!"
## [6] "Km baru saja akses Apps Sehari-hari terpopuler.Nikmati akses YOUTUBE ga habis habis dgn beli pkt Unlimited HANYA di *123# atau myIM3 http://im3.do/m3"
## [7] "Dapatkan cashback s.d Rp50rb di Pepper Lunch, \n Bakerzin,Raffel's Sandwich atau cashback s.d Rp100rb di \n Inul Vizta jika bayar pakai PayPro.Info @indosa"
## [8] "Bentar lagi turun"
## [9] "Sisa kuota 116.0MB. Beli EXTRA Kuota atau upgrade FREEDOM COMBO di*123# atau im3.do/mc"
## [10] "AGAR TDK DIBLOKIR,sgr registrasi ulang kartu prabayar Anda.Tlp dan SMS keluar diblokir mulai 1-31 Maret 2018.Ketik ULANG#NIK#No.KK SMS ke 4444.Info 838"
Data Processing
Change data type to corpus
I prepare a corpus of all the documents in the dataframe.
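The corpus-building chunk is not echoed; a minimal sketch, assuming the usual ‘tm’ workflow:
# Build a volatile corpus from the text column
sms.corpus <- VCorpus(VectorSource(sms$text))
class(sms.corpus)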
## [1] "VCorpus" "Corpus"
## [1] "Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini"
Data Cleaning
Next, I clean up the corpus by removing numbers, punctuation, stopwords, and extra whitespace, and by converting text to lower case. I use the tm_map() function from the ‘tm’ package for each of these steps.
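Most of the cleaning chunks are not echoed in the output; a minimal sketch of the full chain (each step's effect on one example message is shown below, and stopword_id is loaded in the removeWords subsection):
sms.corpus <- tm_map(sms.corpus, content_transformer(tolower))       # lower case
sms.corpus <- tm_map(sms.corpus, removeNumbers)                      # drop digits
sms.corpus <- tm_map(sms.corpus, removeWords, stopword_id)           # Indonesian stopwords
sms.corpus <- tm_map(sms.corpus, removeWords, stopwords("english"))  # English stopwords
sms.corpus <- tm_map(sms.corpus, removePunctuation)
sms.corpus <- tm_map(sms.corpus, stripWhitespace)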
tolower
## [1] "apakah anda mencoba mengakses akun anda dari perangkat lain? jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. jika tidak, mohon abaikan pesan ini"
removeNumbers
## [1] "apakah anda mencoba mengakses akun anda dari perangkat lain? jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=fe--e-bac-dc dalam jam ke depan. jika tidak, mohon abaikan pesan ini"
removeWords / stopwords
stopword_id <- readr::read_lines("C2/stoplist-id.txt")
sms.corpus <- tm_map(sms.corpus, removeWords, stopword_id )
sms.corpus <- tm_map(sms.corpus, removeWords, stopwords("english") )
sms.corpus[[6]]$content
## [1] " mencoba mengakses akun perangkat ? ya, mohon klik tautan https://api.gojek.co.id/customers/device?token=fe--e-bac-dc jam . , mohon abaikan pesan "
removePunctuation
## [1] " mencoba mengakses akun perangkat ya mohon klik tautan httpsapigojekcoidcustomersdevicetokenfeebacdc jam mohon abaikan pesan "
stripWhitespace
## [1] " mencoba mengakses akun perangkat ya mohon klik tautan httpsapigojekcoidcustomersdevicetokenfeebacdc jam mohon abaikan pesan "
The Document Term Matrix or Tokenization
In this approach, I represent each word in a document as a token (or feature) in a document-term matrix (DTM). The rows of the DTM correspond to documents in the collection, the columns correspond to terms, and its elements are the term frequencies. I use the built-in DocumentTermMatrix() function from the ‘tm’ package to create the DTM, as sketched below.
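A minimal sketch of the chunk assumed to produce the output below:
sms.dtm <- DocumentTermMatrix(sms.corpus)
inspect(sms.dtm)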
## <<DocumentTermMatrix (documents: 2004, terms: 2819)>>
## Non-/sparse entries: 16944/5632332
## Sparsity : 100%
## Maximal term length: 79
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs beli bonus dgn info internet kuota paket pulsa sms utk
## 1124 0 0 2 1 1 1 0 0 0 0
## 1177 0 0 2 1 1 1 0 0 0 0
## 193 0 0 0 0 0 0 0 0 0 0
## 197 0 0 0 0 0 0 0 0 1 0
## 225 0 0 0 0 1 0 0 0 0 0
## 29 0 0 0 0 0 0 0 0 0 0
## 378 0 0 0 0 0 0 0 0 0 0
## 409 0 0 0 0 0 0 0 0 0 0
## 410 0 1 1 2 0 0 2 2 0 0
## 955 0 0 0 0 2 0 3 2 0 0
The Term-Document Matrix
To see the number of occurrences of each word across the documents, I also build a term-document matrix (the transpose of the DTM).
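As above, a minimal sketch of the chunk assumed to produce the output below:
sms.tdm <- TermDocumentMatrix(sms.corpus)
inspect(sms.tdm)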
## <<TermDocumentMatrix (terms: 2819, documents: 2004)>>
## Non-/sparse entries: 16944/5632332
## Sparsity : 100%
## Maximal term length: 79
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 1124 1177 193 197 225 29 378 409 410 955
## beli 0 0 0 0 0 0 0 0 0 0
## bonus 0 0 0 0 0 0 0 0 1 0
## dgn 2 2 0 0 0 0 0 0 1 0
## info 1 1 0 0 0 0 0 0 2 0
## internet 1 1 0 0 1 0 0 0 0 2
## kuota 1 1 0 0 0 0 0 0 0 0
## paket 0 0 0 0 0 0 0 0 2 3
## pulsa 0 0 0 0 0 0 0 0 2 2
## sms 0 0 0 1 0 0 0 0 0 0
## utk 0 0 0 0 0 0 0 0 0 0
m <- as.matrix(sms.tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d)
## word freq
## info info 314
## kuota kuota 308
## pulsa pulsa 250
## sms sms 244
## dgn dgn 208
## paket paket 187
Build Model
Cross Validation (Train-Test Split)
Next, I create 80:20 partitions of the document term matrix for training and testing purposes.
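The splitting chunk is not echoed; a minimal sketch (the index name is an assumption):
set.seed(100)
index <- sample(nrow(sms.dtm), floor(nrow(sms.dtm) * 0.8))  # 80% of rows for training
train <- sms.dtm[index, ]
test  <- sms.dtm[-index, ]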
Feature Selection
## [1] 1603 2819
The DTM contains 2,819 features, but not all of them will be useful for classification. I reduce the number of features by ignoring words that appear in fewer than 20 messages. To do this, I use the findFreqTerms() function to identify the frequent terms, as sketched below.
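A minimal sketch of the feature selection (note that findFreqTerms() filters on overall term frequency; the object names are assumptions):
freq_terms <- findFreqTerms(train, lowfreq = 20)  # terms occurring at least 20 times
train <- train[, freq_terms]
test  <- test[, freq_terms]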
Train and test labels
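The labels come from the same split; a minimal sketch, reusing the index assumed above:
train_label <- sms[index, "label"]
test_label  <- sms[-index, "label"]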
Bernoulli
## <<DocumentTermMatrix (documents: 1603, terms: 201)>>
## Non-/sparse entries: 7700/314503
## Sparsity : 98%
## Maximal term length: 14
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs beli bonus dgn info internet kuota paket pulsa sms utk
## 1042 0 1 2 1 0 1 0 0 0 0
## 1067 1 0 0 1 0 1 0 0 0 0
## 1068 1 0 0 1 0 1 0 0 0 0
## 1072 1 1 0 1 1 2 0 0 0 0
## 1400 0 0 0 0 0 2 1 1 1 1
## 410 0 1 1 2 0 0 2 2 0 0
## 566 2 0 1 0 1 1 1 0 0 2
## 650 1 0 1 0 1 0 0 0 0 1
## 955 0 0 0 0 2 0 3 2 0 0
## 999 0 1 2 1 0 1 0 0 0 0
# Convert term counts to Bernoulli features: 1 if a term occurs in a message, 0 otherwise
bernoulli_conv <- function(x){
  as.factor(as.numeric(x > 0))
}
train_bn <- apply(train, MARGIN = 2, bernoulli_conv)
test_bn <- apply(test, MARGIN = 2, bernoulli_conv)
class(train_bn)
## [1] "matrix"
The Naive Bayes Algorithm
To train the model I use the naiveBayes function from the ‘e1071’ package. Since Naive Bayes evaluates products of probabilities, I need some way of assigning non-zero probabilities to words which do not occur in the sample. I use Laplace 1 smoothing to this end.
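The training chunk is not echoed; a minimal sketch, assuming the label vector from the split above:
model_nb <- naiveBayes(x = train_bn, y = train_label, laplace = 1)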
Naive Bayes Model
Testing the predictions
Model evaluation
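A minimal sketch of the prediction and evaluation steps (object names are assumptions), which produce the statistics below:
pred_nb <- predict(model_nb, test_bn)
confusionMatrix(pred_nb, test_label, positive = "spam")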
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 206 9
## spam 22 164
##
## Accuracy : 0.9227
## 95% CI : (0.8921, 0.9469)
## No Information Rate : 0.5686
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.8438
##
## Mcnemar's Test P-Value : 0.03114
##
## Sensitivity : 0.9480
## Specificity : 0.9035
## Pos Pred Value : 0.8817
## Neg Pred Value : 0.9581
## Prevalence : 0.4314
## Detection Rate : 0.4090
## Detection Prevalence : 0.4638
## Balanced Accuracy : 0.9257
##
## 'Positive' Class : spam
##
The Decision Tree Algorithm
Data Preparation
Data Processing
## [1] "matrix"
Decision Tree Model
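The fitted model printed below is a conditional inference tree ("Fitted party"); a minimal sketch, assuming partykit::ctree() on a data frame that combines the reduced term matrix with the labels:
train_dt <- as.data.frame(as.matrix(train))  # term counts as columns
train_dt$label <- train_label
model_dt <- ctree(label ~ ., data = train_dt)
model_dt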
##
## Model formula:
## label ~ aja + akses + aktif + aktifkan + aplikasi + apps + aspen +
## axis + axisnet + ayo + balas + banking + bebas + beli + berhasil +
## berita + berlaku + bersifat + biaya + blm + blok + bls +
## bni + bonus + booking + bronet + bukit + cashback + cek +
## cimb + combo + cuman + customer + cvi + daftar + dana + dapatkan +
## dapetin + data + datarollover + dgn + diaktifkan + diblokir +
## diblokirsgr + digit + dihapus + dipotong + diprosesreply +
## diskon + dlm + download + dpt + dptkan + extra + film + free +
## freedom + freedomcombo + gojek + gopay + gratis + habis +
## hadiah + harga + hemat + hny + hnya + httpimm + httpimmc +
## hub + hubungi + idr + iflix + immc + indosat + info + internet +
## internetan + iring + isi + iya + jalan + jam + jaringan +
## kartu + kasih + kelebihan + keren + kerennya + kesempatan +
## ketik + khusus + kirim + klik + kode + kuota + kursi + layanan +
## lho + login + maaf + mai + malam + maret + masuk + mba +
## memberitahukan + mencukupi + menit + mobile + myim + nelp +
## nelpon + niaga + nikmati + nomor + nonton + norm + normal +
## nya + oke + ooredoo + pai + pakai + paket + paketnya + passcode +
## pastikan + pay + paypro + pelanggan + pemakaian + penawaran +
## perkb + permai + pesan + pesanan + pin + pkt + point + prabayar +
## premium + promo + prpnjngan + puas + pulsa + rahasia + rbhr +
## registrasi + rek + rollover + rphr + rprb + rprbhr + saldo +
## sdh + selamat + sepuasnya + sesuai + sih + silakan + sisa +
## skrg + sms + sos + spi + spotify + sukses + tap + tarif +
## tcash + tdk + tekan + telp + tercatat + terima + tersedia +
## tkn + tlp + top + topup + tower + transaksi + transfer +
## trf + tseltappromo + tujuan + tukarkan + tunai + tunggu +
## udah + udh + ulang + ulangnikkk + unlimited + usage + utama +
## utk + via + youtube + yth
##
## Fitted party:
## [1] root
## | [2] info in 0
## | | [3] ooredoo in 0
## | | | [4] iring in 0
## | | | | [5] pelanggan in 0
## | | | | | [6] gratis in 0
## | | | | | | [7] beli in 0
## | | | | | | | [8] isi in 0
## | | | | | | | | [9] kuota in 0
## | | | | | | | | | [10] ayo in 0
## | | | | | | | | | | [11] berlaku in 0
## | | | | | | | | | | | [12] mai in 0: ham (n = 925, err = 8.1%)
## | | | | | | | | | | | [13] mai in 1: spam (n = 12, err = 0.0%)
## | | | | | | | | | | [14] berlaku in 1: spam (n = 17, err = 11.8%)
## | | | | | | | | | [15] ayo in 1: spam (n = 24, err = 12.5%)
## | | | | | | | | [16] kuota in 1: spam (n = 47, err = 21.3%)
## | | | | | | | [17] isi in 1
## | | | | | | | | [18] mencukupi in 0: spam (n = 43, err = 14.0%)
## | | | | | | | | [19] mencukupi in 1: ham (n = 9, err = 0.0%)
## | | | | | | [20] beli in 1
## | | | | | | | [21] sisa in 0: spam (n = 50, err = 2.0%)
## | | | | | | | [22] sisa in 1: ham (n = 12, err = 0.0%)
## | | | | | [23] gratis in 1: spam (n = 47, err = 4.3%)
## | | | | [24] pelanggan in 1: spam (n = 57, err = 5.3%)
## | | | [25] iring in 1: spam (n = 53, err = 0.0%)
## | | [26] ooredoo in 1: spam (n = 66, err = 0.0%)
## | [27] info in 1: spam (n = 241, err = 14.9%)
##
## Number of inner nodes: 13
## Number of terminal nodes: 14
Testing the prediction
Model Evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 208 16
## spam 20 157
##
## Accuracy : 0.9102
## 95% CI : (0.8779, 0.9363)
## No Information Rate : 0.5686
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.8175
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9075
## Specificity : 0.9123
## Pos Pred Value : 0.8870
## Neg Pred Value : 0.9286
## Prevalence : 0.4314
## Detection Rate : 0.3915
## Detection Prevalence : 0.4414
## Balanced Accuracy : 0.9099
##
## 'Positive' Class : spam
##
Conclusion
From the results, the Naive Bayes model achieves the higher accuracy and recall (sensitivity). Therefore, I use it for the data submission.
Data Submission
Preparing Data
## [1] "VCorpus" "Corpus"
## [1] "iRing keren cuman buat km, Via Vallen-Bojo Galak (Reff),Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!"
## <<DocumentTermMatrix (documents: 283, terms: 1197)>>
## Non-/sparse entries: 4645/334106
## Sparsity : 99%
## Maximal term length: 101
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs 24jam bronet info838 kamu kartu kuota paket registrasi sms ulang
## 134 0 0 0 0 0 0 0 0 0 0
## 139 1 1 1 0 0 0 0 0 0 0
## 160 0 0 0 0 0 0 0 0 0 0
## 162 0 0 0 0 0 0 0 0 0 0
## 169 1 1 1 0 0 0 0 0 0 0
## 180 0 0 0 0 0 2 1 0 1 0
## 183 0 0 0 0 0 0 0 0 1 0
## 186 1 1 1 0 0 0 0 0 0 0
## 223 1 1 1 0 0 0 0 0 0 0
## 272 0 0 0 0 0 1 1 0 0 0
# The same Bernoulli conversion, applied to the submission DTM
bernoulli_conv <- function(x){
  as.factor(as.numeric(x > 0))
}
data_sub_bn <- apply(data_sub_dtm, MARGIN = 2, bernoulli_conv)
class(data_sub_bn)
## [1] "matrix"
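The final prediction and file-writing steps are not shown; a minimal sketch using the Naive Bayes model (the submission data frame data_sub and the file name are assumptions):
pred_sub <- predict(model_nb, data_sub_bn)
submission <- data.frame(datetime = data_sub$datetime, status = pred_sub)
write.csv(submission, "submission-naivebayes.csv", row.names = FALSE)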