What model will you use to classify the text? Answer: Naive Bayes
How many token or word you will use for training the model? Answer: 2280
Answer:
26
(1 Points) Accuracy in (your own) validation dataset reach > 80%. Answer: yes
(1 Points) Sensitivity in (your own) validation dataset reach > 80%. Answer: yes
(1 Points) Specificity in (your own) validation dataset reach > 85%. Answer: yes
(1 Points) Precision in (your own) validation dataset reach > 90%. Answer: yes
(2 Points) Accuracy in test dataset reach > 80%. Answer: yes
(2 Points) Sensitivity in test dataset reach > 80%. Answer: yes
(2 Points) Specificity in test dataset reach > 85%. Answer: yes
(2 Points) Precision in test dataset reach > 90%. Answer: yes
15 (3 Points) Use LIME method to interpret the model that you have used
What is the difference between interpreting black box model with LIME and using an interpretable machine learning model? Answer:model has higher performance in term of accuracy or precision,
How good is the explanation fit? What does it signify? Answer: We can see that for 1 ,2 ,and 3 text in Lime 100%, the probability to be not recommended is 100%.the first and second got 40% more accurate than third.
What are the most and the least important factors for each observation? Answer: The red-labeled text means that the word decrease probabilty of status SPAM, and the blue-labeled text means increase probabilty
sms <- data.table::fread("data/data-train.csv") %>%
mutate(status = as.factor(status))
sms <- tibble::rowid_to_column(sms, "ID")
head(sms)## ID datetime
## 1 1 2017-02-15T14:48:00Z
## 2 2 2017-02-15T15:24:00Z
## 3 3 2017-02-15T16:07:00Z
## 4 4 2017-02-15T16:59:00Z
## 5 5 2017-02-15T18:05:00Z
## 6 6 2017-02-15T18:05:00Z
## text
## 1 Telegram code 53784
## 2 Rezeki Nomplok Dompetku Pengiriman Uang! Kirim uang di Alfamart & dptkan hadiah jutaan rupiah setiap hari.Periode s.d. 28Feb17.Info: http://bit.ly/dmpurna MFI1
## 3 WhatsApp code 123-994.\r\n\r\nYou can also tap on this link to verify your phone: v.whatsapp.com/123994
## 4 Transaksi travel online pakai CIMB Clicks gratis perlindungan kecelakaan & tiket nonton di Pasarpolis.com. Ayo transaksi & nikmati manfaatnya! Info S&K 14041.
## 5 Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini
## 6 Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini
## status
## 1 ham
## 2 spam
## 3 ham
## 4 ham
## 5 ham
## 6 ham
sms$datetime <- ymd_hms(sms$datetime)
sms <- sms %>%
mutate(Hour = hour(datetime))
plot1 <- ggplot(sms, aes(Hour, fill= status)) +
geom_bar() +
facet_wrap(~sms$status)+
xlab("Hour")+
ggtitle("Overview of Ham & Spam to SMS")
ggplotly(plot1)## [1] "Rezeki Nomplok Dompetku Pengiriman Uang! Kirim uang di Alfamart & dptkan hadiah jutaan rupiah setiap hari.Periode s.d. 28Feb17.Info: http://bit.ly/dmpurna MFI1"
## [2] "YEAY! Free Ice Tea atau Cashback up to 30% dg transaksi di AH Resto! Hanya untuk pengguna TCASH TAP. S&K Berlaku. Info tsel.me/tappromo"
## [3] "Voting your Offer. Disc 40%, 1 crispy chicken+1 spicy chicken+ nasi+lotteria tea Rp.26rb. Tukar SMS ini di LOTTERIA terdekat. Berlaku hari ini. SKB. Promo *606#"
## [4] "Ayo bergabung dgn Freedom Postpaid! Makin rame makin seru, ajak teman & keluarga diskonnya lebih besar. Daftar di http://im3.do/uxU PAI1"
## [5] "Nikmati kemudahan mewujudkan impian kamu dan pasangan utk masa depan yg lebih cerah. Cek Dana Bantuan Sahabat di DOMPETKU! Info: http://bit.ly/dmpdbs MFI3"
## [6] "Gratis 1 bulan Spotify Premium khusus FreedomCombo. Bisa bebas dengar musik,bikin playlist sepuasnya tanpa iklan dgn Spotify Premium. Aktifkan di *123*123# CVI1"
## [1] "Telegram code 53784"
## [2] "WhatsApp code 123-994.\r\n\r\nYou can also tap on this link to verify your phone: v.whatsapp.com/123994"
## [3] "Transaksi travel online pakai CIMB Clicks gratis perlindungan kecelakaan & tiket nonton di Pasarpolis.com. Ayo transaksi & nikmati manfaatnya! Info S&K 14041."
## [4] "Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini"
## [5] "Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini"
## [6] "15/02/2017 18:08:02 Silakan gunakan passcode 7791 untuk Login Go Mobile CIMB Niaga. Passcode bersifat RAHASIA. Jangan memberitahukan kepada siapapun!"
set.seed(100)
data_train1 <- downSample(x = data_train$ID,
y = data_train$status, yname = "status")
data_train1 <- data_train1 %>% rename(ID = x)
table(data_train1$status) %>%
prop.table()##
## ham spam
## 0.5 0.5
data_train <- merge(data_train1, data_train, by = "ID")
data_train <- data_train %>%
select(ID, status.x, text) %>% rename(status= status.x)
table(data_train$status) %>%
prop.table()##
## ham spam
## 0.5 0.5
review_corpus <- VCorpus( VectorSource(data_train$text))
stopwords.id <- readLines("stopwords-id.txt", warn = FALSE)
review_corpus <- review_corpus %>%
tm_map(content_transformer(tolower)) %>% # lowercase
tm_map(removeNumbers) %>% # remove numerical character
tm_map(removeWords, stopwords("english")) %>% # remove stopwords (and, the, am)
tm_map(removeWords, stopwords.id) %>%
tm_map(removePunctuation) %>% # remove punctuation mark
tm_map(stemDocument) %>% # stem word (e.g. from walking to walk)
tm_map(stripWhitespace) wordcloud(review_corpus,
min.freq = 5,
max.words = 30,
random.order = FALSE,
colors = brewer.pal(8, "Set2"))## <<DocumentTermMatrix (documents: 1338, terms: 2280)>>
## Non-/sparse entries: 11935/3038705
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
corpus_test <- VCorpus( VectorSource(data_test$text))
stopwords.id <- readLines("stopwords-id.txt", warn = FALSE)
review_corpus_test <- corpus_test %>%
tm_map(content_transformer(tolower)) %>% # lowercase
tm_map(removeNumbers) %>% # remove numerical character
tm_map(removeWords, stopwords("english")) %>% # remove stopwords (and, the, am)
tm_map(removeWords, stopwords.id) %>%
tm_map(removePunctuation) %>% # remove punctuation mark
tm_map(stemDocument) %>% # stem word (e.g. from walking to walk)
tm_map(stripWhitespace)
dataframe<-data.frame(text=unlist(sapply(review_corpus_test, `[`, "content")),
stringsAsFactors=F)
head(data_test$text)## [1] "YEAY! Free Ice Tea atau Cashback up to 30% dg transaksi di AH Resto! Hanya untuk pengguna TCASH TAP. S&K Berlaku. Info tsel.me/tappromo"
## [2] "Voting your Offer. Disc 40%, 1 crispy chicken+1 spicy chicken+ nasi+lotteria tea Rp.26rb. Tukar SMS ini di LOTTERIA terdekat. Berlaku hari ini. SKB. Promo *606#"
## [3] "Ayo bergabung dgn Freedom Postpaid! Makin rame makin seru, ajak teman & keluarga diskonnya lebih besar. Daftar di http://im3.do/uxU PAI1"
## [4] "YEAY! Kejutan cashback & freebies dg TCASH TAP! Terus #pakeTCASH, cek HP kamu & dapatkan kejutannya. S&K berlaku. Info cek tsel.me/yeay"
## [5] "nanti saya ke depan gerbang bukit permai yg ditutup, yg di sebelah kimia farma ya, Pak"
## [6] "Sore, ini dengan drh yg di Jambore ya? dengan dokter siapa ya? saya mau tanya kalau untuk biaya panggil ke rumah itu dihitung per kucing atau per datangnya ya? makasih."
## text
## 1.content yeay free ice tea cashback dg transaksi ah resto pengguna tcash tap sk berlaku info tseltappromo
## 2.content vote offer disc crispi chicken spici chicken nasilotteria tea rprb tukar sms lotteria terdekat berlaku skb promo
## 3.content ayo bergabung dgn freedom postpaid rame seru ajak teman keluarga diskonnya daftar httpimuxu pai
## 4.content yeay kejutan cashback freebi dg tcash tap paketcash cek hp dapatkan kejutannya sk berlaku info cek tselyeay
## 5.content gerbang bukit permai yg ditutup yg sebelah kimia farma ya
## 6.content sore drh yg jambor ya dokter ya biaya panggil rumah dihitung kuce datangnya ya makasih
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 209 7
## spam 19 166
##
## Accuracy : 0.9352
## 95% CI : (0.9064, 0.9572)
## No Information Rate : 0.5686
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8689
##
## Mcnemar's Test P-Value : 0.03098
##
## Sensitivity : 0.9595
## Specificity : 0.9167
## Pos Pred Value : 0.8973
## Neg Pred Value : 0.9676
## Prevalence : 0.4314
## Detection Rate : 0.4140
## Detection Prevalence : 0.4613
## Balanced Accuracy : 0.9381
##
## 'Positive' Class : spam
##
tokenize_text <- function(text){
# Create Corpuse
data_corpus <- VCorpus(VectorSource(text))
stopwords.id <- readLines("stopwords-id.txt", warn = FALSE)
# Cleansing
data_corpus_clean <- data_corpus %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeNumbers) %>%
tm_map(removeWords, stopwords("english")) %>%
tm_map(removeWords, stopwords.id) %>%
tm_map(removePunctuation) %>%
tm_map(stemDocument) %>%
tm_map(stripWhitespace) %>%
tm_map(stemDocument)
# Document-Term Matrix and use only terms from data train
data_dtm <- DocumentTermMatrix(data_corpus_clean)
# Bernoulli Converter
data_text <- apply(data_dtm, 2, bernoulli_conv)
return(data_text)
}## [1] "naiveBayes"
# create a function named model_type
model_type.naiveBayes <- function(x){
return("classification")
}
# return the probability value and convert them to data.frame
predict_model.naiveBayes <- function(x, newdata, type = "raw") {
# return classification probabilities only
res <- predict(x, newdata, type = "raw") %>% as.data.frame()
return(res)
}
# prepare the input for the lime.
text_train <- data_train$text %>% as.character() # The text review from data train
explainer <- lime(text_train, # the input
model = model_bayes, # the model
preprocess = tokenize_text)text_test <- dataframe$text
set.seed(100)
explanation <- explain(text_test[2:5],
explainer = explainer,
n_labels = 1, # show only 1 label (recommend or not recommend)
n_features = 5,
feature_select = "none", # use all terms to explain the model
single_explanation = F)data_test2 <- data_test %>%
mutate(predicted.status = pred_test) %>%
mutate(result = ifelse(status == predicted.status, "TRUE", "FALSE"))
data_test2 <- data_test2[,-3]
data_test2 <- data_test2 %>%
mutate(text = dataframe$text)Count that misclassified
##
## FALSE TRUE
## 26 375
find the words that frequently misclasified
data_test3 <- data_test2 %>%
filter(result == "FALSE")
corpus3 <- VCorpus(VectorSource(data_test3$text))
obj_wordcloud <- wordcloud(corpus3,
min.freq = 3,
max.words = 3,
random.order = FALSE,
colors = brewer.pal(8, "Set2"))## NULL
predict_df <- predict(model_df, test_df, type = "response")
confusionMatrix(predict_df, test_df$status , positive = "spam")## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 199 10
## spam 29 163
##
## Accuracy : 0.9027
## 95% CI : (0.8694, 0.9299)
## No Information Rate : 0.5686
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8043
##
## Mcnemar's Test P-Value : 0.003948
##
## Sensitivity : 0.9422
## Specificity : 0.8728
## Pos Pred Value : 0.8490
## Neg Pred Value : 0.9522
## Prevalence : 0.4314
## Detection Rate : 0.4065
## Detection Prevalence : 0.4788
## Balanced Accuracy : 0.9075
##
## 'Positive' Class : spam
##
corpus_test2 <- VCorpus( VectorSource(csv_test$text))
stopwords.id <- readLines("stopwords-id.txt", warn = FALSE)
review_corpus_test2 <- corpus_test2 %>%
tm_map(content_transformer(tolower)) %>% # lowercase
tm_map(removeNumbers) %>% # remove numerical character
tm_map(removeWords, stopwords("english")) %>% # remove stopwords (and, the, am)
tm_map(removeWords, stopwords.id) %>%
tm_map(removePunctuation) %>% # remove punctuation mark
tm_map(stemDocument) %>% # stem word (e.g. from walking to walk)
tm_map(stripWhitespace)
dataframe2<-data.frame(text=unlist(sapply(review_corpus_test2, `[`, "content")),
stringsAsFactors=F)
head(csv_test$text)## [1] "Km baru saja akses Apps Sehari-hari terpopuler.Nikmati akses YOUTUBE ga habis habis dgn beli pkt Unlimited HANYA di *123# atau myIM3 http://im3.do/m3"
## [2] "GRATIS UNLIMITED YOUTUBE+INTERNET 10GB+CHAT&SOSMED+SMS+NELPON selama 30hari.Data Rollover.PROMO 100Rb (Normal 115rb). MAU? Tekan C25 kirim SMS ke 929 sekarang"
## [3] "Sisa kuota 285 MB.Beli pkt Internet TERBAIK dr IM3 ooredoo di *123# atau myIM3 http:// im3.do/m3 .Kelebihan pemakaian dikenakan tarif perKB"
## [4] "Ada banyak lowongan kerja baru! Ayo jgn sampai kamu ketinggalan update & tips di dunia kerja, tekan *123*543*2# . Tarif Rp2.200/3hari. Info: 08001401686 DSI7"
## [5] "Proses PEMBLOKIRAN kartu bagi yg blm registrasi sdg berjalan,segera registrasi kartu Anda,dapatkan bonus 250MB+250mnt+250SMS.Ketik ULANG#NIK#No.KK# SMS ke4444"
## [6] "iRing keren cuman buat km, Via Vallen-Bojo Galak (Reff),Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!"
## text
## 1.content km aks app sehari terpopulernikmati aks youtub ga habi habi dgn beli pkt unlimit myim httpimm
## 2.content grati unlimit youtubeinternet gbchatsosmedsmsnelpon data rolloverpromo rb normal rb tekan c kirim sms
## 3.content sisa kuota mbbeli pkt internet terbaik dr im ooredoo myim http imm kelebihan pemakaian dikenakan tarif perkb
## 4.content lowongan kerja ayo jgn ketinggalan updat tip dunia kerja tekan tarif rp info dsi
## 5.content prose pemblokiran kartu yg blm registrasi sdg berjalan registrasi kartu dapatkan bonus mbmntsmsketik ulangnikkk sms
## 6.content ire keren cuman km via vallenbojo galak reffrphr prpnjngan rp hnya bls ya lho
## datetime
## 1 2018-03-01T00:32:00Z
## 2 2018-03-01T08:57:00Z
## 3 2018-03-01T09:15:00Z
## 4 2018-03-01T16:42:00Z
## 5 2018-03-01T17:42:00Z
## 6 2018-03-01T22:04:00Z
## text
## 1 Km baru saja akses Apps Sehari-hari terpopuler.Nikmati akses YOUTUBE ga habis habis dgn beli pkt Unlimited HANYA di *123# atau myIM3 http://im3.do/m3
## 2 GRATIS UNLIMITED YOUTUBE+INTERNET 10GB+CHAT&SOSMED+SMS+NELPON selama 30hari.Data Rollover.PROMO 100Rb (Normal 115rb). MAU? Tekan C25 kirim SMS ke 929 sekarang
## 3 Sisa kuota 285 MB.Beli pkt Internet TERBAIK dr IM3 ooredoo di *123# atau myIM3 http:// im3.do/m3 .Kelebihan pemakaian dikenakan tarif perKB
## 4 Ada banyak lowongan kerja baru! Ayo jgn sampai kamu ketinggalan update & tips di dunia kerja, tekan *123*543*2# . Tarif Rp2.200/3hari. Info: 08001401686 DSI7
## 5 Proses PEMBLOKIRAN kartu bagi yg blm registrasi sdg berjalan,segera registrasi kartu Anda,dapatkan bonus 250MB+250mnt+250SMS.Ketik ULANG#NIK#No.KK# SMS ke4444
## 6 iRing keren cuman buat km, Via Vallen-Bojo Galak (Reff),Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!
## status
## 1 spam
## 2 spam
## 3 spam
## 4 spam
## 5 ham
## 6 spam