INTRODUCTION

Capstone Requirement Tab

Data Preprocessing and Exploratory Data Analysis

  1. (2 Points) Demonstrated how to properly preprocess text data:
  • Which packages will you use for text mining? Answer: tm, e1071, SnowballC, and wordcloud
  • Should you remove punctuation or emoticons? Answer: yes
  • Will you create a document-term matrix? Answer: yes
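
As a minimal sketch of what a document-term matrix holds, the base R snippet below counts terms for two cleaned messages taken from the "Text Cleansing" output later in this report. It is an illustration only and assumes whitespace-separated tokens; the report itself names the tm package, whose DocumentTermMatrix() function would normally do this step.

```r
# Two cleaned messages copied from the "Text Cleansing" output of this report.
docs <- c(
  "yeay free ice tea cashback dg transaksi ah resto pengguna tcash tap sk berlaku info tseltappromo",
  "yeay kejutan cashback freebi dg tcash tap paketcash cek hp dapatkan kejutannya sk berlaku info cek tselyeay"
)

# Split each message into word tokens on single spaces.
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary is the sorted set of distinct terms across all messages.
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per message; rows are documents, columns are terms.
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Inspect a few columns of the matrix.
dtm[, c("cashback", "cek", "free")]
```

Each row is one message and each column one term, so a classifier such as Naive Bayes can treat the columns as features.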

Model Selection and Evaluation

  1. (2 Points) Compare multiple approaches for the text classification task (e.g. Naive Bayes, Random Forest, Deep Learning)
  • Which model will you use to classify the text? Answer: Naive Bayes

  • How many tokens or words will you use to train the model? Answer: 2280

  1. (2 Points) Reported model selection and cross-validation results.
  • What percentage (%) of the data is used for training the model? Answer: 80%
  • How do you choose which model is better? Is it based on accuracy? Answer: based on precision, because false positives are the outcome we most want to avoid. A spam message that is predicted as not spam merely stays in the inbox, whereas a legitimate message that is predicted as spam is removed from the inbox and may never be read.
  • Which model is the best? Answer: Naive Bayes; see the previous answer
  1. (2 Points) Reported which words are important for the prediction problem.
  • How do you decide which words are important? Answer: by visualizing the frequency of the words that appear most often in the dataset, using a bar chart or a word cloud
  1. (2 Points) Reported which SMS were incorrectly predicted in your own test dataset.
  • Which SMS were incorrectly predicted on the test dataset? Answer: 26 messages (19 false positives and 7 false negatives)

  1. (2 Points) Based on the SMS that were misclassified, give an analysis of why this might happen.
  • Is there any common pattern among the misclassified texts? Answer: yes
  • Are there any particular words that are present in most of the misclassified texts? Answer: isi (top up), kuota (data quota), and pulsa (phone credit)

Prediction Performance

  1. (1 Points) Accuracy in (your own) validation dataset reaches > 80%. Answer: yes (93.52%)

  2. (1 Points) Sensitivity in (your own) validation dataset reaches > 80%. Answer: yes (95.95%)

  3. (1 Points) Specificity in (your own) validation dataset reaches > 85%. Answer: yes (91.67%)

  4. (1 Points) Precision in (your own) validation dataset reaches > 90%. Answer: no, the Naive Bayes model reaches 89.73% (Pos Pred Value), just below the threshold

  5. (2 Points) Accuracy in the test dataset reaches > 80%. Answer: yes

  6. (2 Points) Sensitivity in the test dataset reaches > 80%. Answer: yes

  7. (2 Points) Specificity in the test dataset reaches > 85%. Answer: yes

  8. (2 Points) Precision in the test dataset reaches > 90%. Answer: yes

Interpretation

15. (3 Points) Use the LIME method to interpret the model that you have used

  • Is there any pre-processing that you need in order to make the model more interpretable? Answer: yes, a text-tokenizing function is used as the pre-processing step so that the input is cleansed and the explanation is easier to interpret
  • How many features do you use to explain the model? Answer: 5
  • What is the difference between using LIME and using interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest? Answer: LIME works on an interpretable data representation; for text mining, that representation is the presence or absence of words.
  1. (3 Points) Interpret the first 4 observations of the plot
  • What is the difference between interpreting a black-box model with LIME and using an interpretable machine learning model? Answer: the black-box model has higher performance in terms of accuracy or precision.

  • How good is the explanation fit? What does it signify? Answer: for texts 1, 2, and 3, LIME reports a probability of 100% for the predicted class. The explanation fit of the first and second texts is about 40% higher than that of the third.

  • What are the most and the least important factors for each observation? Answer: red-labeled words decrease the probability of the SPAM status, and blue-labeled words increase it

Conclusion

  1. (2 Points) Write the conclusion of your capstone project
  • Is your goal achieved? Answer: yes
  • Can the problem be solved by machine learning? Answer: yes
  • What model did you use and how is the performance? Answer: Naive Bayes; based on the confusion matrices, it outperforms the decision tree from a precision perspective (89.7% vs. 84.9%)
  • What is the potential business implementation of your capstone project? Answer: it can be used by anyone who receives a large number of messages each day

Leaderboard Tab

Revision Tab C001+

  1. (2 Points) Reported a distribution plot of total hourly frequency for each status.
  • A ggplot bar chart (geom_bar), shown in the section marked C001+
  1. (2 Points) Reported some text characteristics related to spam and ham
  • head(df.spam$text) and head(df.ham$text), plus a word cloud to visualize the text of messages labeled as spam
  1. (2 Points) Compare multiple approaches for the text classification task (e.g. Naive Bayes, Random Forest, Deep Learning). Which model will you use to classify the text?
  • Naive Bayes, chosen by comparing the model evaluations on precision: Naive Bayes precision = 89.7%, decision tree precision = 84.9%. Precision matters most because it limits the cases where an SMS that is actually not spam is predicted as spam. The model uses 2280 terms and all documents.
  1. (2 Points) Reported which SMS were incorrectly predicted in your own test dataset.
  • From the model evaluation, adding the false positives and false negatives gives FP = 19 and FN = 7, for a total of 26 incorrectly predicted messages.
  1. (2 Points) Based on the SMS that were misclassified, give an analysis of why this might happen.
  • Because some words could not be classified as spam; words present in most of the misclassified texts include isi, kuota, and pulsa.
  1. (2 Points) Interpret the first 4 observations of the plot.
  • The explanation chunk has been revised to use 4 observations.

IMPORT DATASET

##   ID             datetime
## 1  1 2017-02-15T14:48:00Z
## 2  2 2017-02-15T15:24:00Z
## 3  3 2017-02-15T16:07:00Z
## 4  4 2017-02-15T16:59:00Z
## 5  5 2017-02-15T18:05:00Z
## 6  6 2017-02-15T18:05:00Z
##                                                                                                                                                                                                                                         text
## 1                                                                                                                                                                                                                        Telegram code 53784
## 2                                                                            Rezeki Nomplok Dompetku Pengiriman Uang! Kirim uang di Alfamart & dptkan hadiah jutaan rupiah setiap hari.Periode s.d. 28Feb17.Info: http://bit.ly/dmpurna MFI1
## 3                                                                                                                                    WhatsApp code 123-994.\r\n\r\nYou can also tap on this link to verify your phone: v.whatsapp.com/123994
## 4                                                                             Transaksi travel online pakai CIMB Clicks gratis perlindungan kecelakaan & tiket nonton di Pasarpolis.com. Ayo transaksi & nikmati manfaatnya! Info S&K 14041.
## 5 Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini
## 6 Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini
##   status
## 1    ham
## 2   spam
## 3    ham
## 4    ham
## 5    ham
## 6    ham

Exploratory Data Analysis

## 
##       ham      spam 
## 0.5798403 0.4201597
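
A class-proportion table like the one above can be produced with table() and prop.table(). The sketch below is illustrative only: it uses just the six preview rows printed in the IMPORT DATASET section, so its shares differ from the full-dataset values shown above.

```r
# Labels of the six preview rows printed in the "IMPORT DATASET" section.
status <- factor(c("ham", "spam", "ham", "ham", "ham", "ham"))

# Absolute counts per class, then the share of each class.
class_counts <- table(status)
class_share <- prop.table(class_counts)

round(class_share, 3)
```

Checking the class balance first matters because a heavily skewed dataset would make accuracy alone a misleading metric.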

Visualize Data C001+
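
The hourly frequency per status can be derived from the datetime column before plotting. The sketch below is a base R illustration that assumes the timestamps are in UTC and uses only the six preview rows from the IMPORT DATASET section; the data frame name sms and the hour column are names chosen here, not taken from the report.

```r
# Preview rows copied from the "IMPORT DATASET" output.
sms <- data.frame(
  datetime = c("2017-02-15T14:48:00Z", "2017-02-15T15:24:00Z",
               "2017-02-15T16:07:00Z", "2017-02-15T16:59:00Z",
               "2017-02-15T18:05:00Z", "2017-02-15T18:05:00Z"),
  status = c("ham", "spam", "ham", "ham", "ham", "ham"),
  stringsAsFactors = FALSE
)

# Parse the ISO 8601 timestamps as UTC and keep only the hour of day.
parsed <- as.POSIXct(sms$datetime, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
sms$hour <- format(parsed, "%H")

# Cross-tabulate hour by status; these counts are what a bar chart would draw.
hourly <- table(hour = sms$hour, status = sms$status)
hourly
```

With ggplot2, the same data could be drawn with ggplot(sms, aes(x = hour, fill = status)) + geom_bar(), which matches the ggplot and geom_bar approach named in the Revision Tab.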

## Text Characteristics C001+

## [1] "Rezeki Nomplok Dompetku Pengiriman Uang! Kirim uang di Alfamart & dptkan hadiah jutaan rupiah setiap hari.Periode s.d. 28Feb17.Info: http://bit.ly/dmpurna MFI1" 
## [2] "YEAY! Free Ice Tea atau Cashback up to 30% dg transaksi  di AH Resto! Hanya untuk pengguna TCASH TAP. S&K Berlaku. Info tsel.me/tappromo"                        
## [3] "Voting your Offer. Disc 40%, 1 crispy chicken+1 spicy chicken+ nasi+lotteria tea Rp.26rb. Tukar SMS ini di LOTTERIA terdekat. Berlaku hari ini. SKB. Promo *606#"
## [4] "Ayo bergabung dgn Freedom Postpaid! Makin rame makin seru, ajak teman & keluarga diskonnya lebih besar. Daftar di http://im3.do/uxU PAI1"                        
## [5] "Nikmati kemudahan mewujudkan impian kamu dan pasangan utk masa depan yg lebih cerah. Cek Dana Bantuan Sahabat di DOMPETKU! Info: http://bit.ly/dmpdbs MFI3"      
## [6] "Gratis 1 bulan Spotify Premium khusus FreedomCombo. Bisa bebas dengar musik,bikin playlist sepuasnya tanpa iklan dgn Spotify Premium. Aktifkan di *123*123# CVI1"
## [1] "Telegram code 53784"                                                                                                                                                                                                                       
## [2] "WhatsApp code 123-994.\r\n\r\nYou can also tap on this link to verify your phone: v.whatsapp.com/123994"                                                                                                                                   
## [3] "Transaksi travel online pakai CIMB Clicks gratis perlindungan kecelakaan & tiket nonton di Pasarpolis.com. Ayo transaksi & nikmati manfaatnya! Info S&K 14041."                                                                            
## [4] "Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini"
## [5] "Apakah Anda mencoba mengakses akun Anda dari perangkat lain? Jika ya, mohon klik tautan ini https://api.gojek.co.id/customers/device?token=f192293e-3117-46e9-bac3-1d1473c23113 dalam 72 jam ke depan. Jika tidak, mohon abaikan pesan ini"
## [6] "15/02/2017 18:08:02 Silakan gunakan passcode 7791 untuk Login Go Mobile CIMB Niaga. Passcode bersifat RAHASIA. Jangan memberitahukan kepada siapapun!"

NAIVE BAYES MODEL

Model Evaluation

Text Cleansing of the Test Data

## [1] "YEAY! Free Ice Tea atau Cashback up to 30% dg transaksi  di AH Resto! Hanya untuk pengguna TCASH TAP. S&K Berlaku. Info tsel.me/tappromo"                                
## [2] "Voting your Offer. Disc 40%, 1 crispy chicken+1 spicy chicken+ nasi+lotteria tea Rp.26rb. Tukar SMS ini di LOTTERIA terdekat. Berlaku hari ini. SKB. Promo *606#"        
## [3] "Ayo bergabung dgn Freedom Postpaid! Makin rame makin seru, ajak teman & keluarga diskonnya lebih besar. Daftar di http://im3.do/uxU PAI1"                                
## [4] "YEAY! Kejutan cashback & freebies dg TCASH TAP! Terus #pakeTCASH, cek HP kamu & dapatkan kejutannya. S&K berlaku. Info cek tsel.me/yeay"                                 
## [5] "nanti saya ke depan gerbang bukit permai yg ditutup, yg di sebelah kimia farma ya, Pak"                                                                                  
## [6] "Sore, ini dengan drh yg di Jambore ya? dengan dokter siapa ya? saya mau tanya kalau untuk biaya panggil ke rumah itu dihitung per kucing atau per datangnya ya? makasih."
##                                                                                                                       text
## 1.content                 yeay free ice tea cashback dg transaksi ah resto pengguna tcash tap sk berlaku info tseltappromo
## 2.content vote offer disc crispi chicken spici chicken nasilotteria tea rprb tukar sms lotteria terdekat berlaku skb promo
## 3.content                  ayo bergabung dgn freedom postpaid rame seru ajak teman keluarga diskonnya daftar httpimuxu pai
## 4.content      yeay kejutan cashback freebi dg tcash tap paketcash cek hp dapatkan kejutannya sk berlaku info cek tselyeay
## 5.content                                                        gerbang bukit permai yg ditutup yg sebelah kimia farma ya
## 6.content                           sore drh yg jambor ya dokter ya biaya panggil rumah dihitung kuce datangnya ya makasih
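
The cleansing shown above can be approximated in a few lines. This is a minimal base R sketch, not the report's actual pipeline: the function name clean_text is hypothetical, and it covers only lowercasing, punctuation removal, digit removal, and whitespace collapsing. The output above also reflects stop-word removal and stemming, which the sketch omits.

```r
# Hypothetical helper: a simplified stand-in for the report's cleansing step.
clean_text <- function(x) {
  x <- tolower(x)                     # fold to lower case
  x <- gsub("[[:punct:]]", "", x)     # drop punctuation without adding spaces
  x <- gsub("[[:digit:]]", "", x)     # drop digits
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)                           # strip leading and trailing spaces
}

# First raw test message copied from the output above.
raw <- "YEAY! Free Ice Tea atau Cashback up to 30% dg transaksi  di AH Resto! Hanya untuk pengguna TCASH TAP. S&K Berlaku. Info tsel.me/tappromo"

cleaned <- clean_text(raw)
cleaned
```

Because punctuation is deleted rather than replaced by a space, a URL such as tsel.me/tappromo collapses into the single token tseltappromo, as it does in the output above.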

Predict and Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  209    7
##       spam  19  166
##                                           
##                Accuracy : 0.9352          
##                  95% CI : (0.9064, 0.9572)
##     No Information Rate : 0.5686          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8689          
##                                           
##  Mcnemar's Test P-Value : 0.03098         
##                                           
##             Sensitivity : 0.9595          
##             Specificity : 0.9167          
##          Pos Pred Value : 0.8973          
##          Neg Pred Value : 0.9676          
##              Prevalence : 0.4314          
##          Detection Rate : 0.4140          
##    Detection Prevalence : 0.4613          
##       Balanced Accuracy : 0.9381          
##                                           
##        'Positive' Class : spam            
## 
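
The headline statistics can be recomputed by hand from the four cells of the matrix, which makes explicit what each metric measures. The worked example below uses the counts printed above, with spam as the positive class; the variable names are chosen here for illustration.

```r
# Counts copied from the Naive Bayes confusion matrix; "spam" is the positive class.
tp <- 166  # predicted spam, actually spam
fp <- 19   # predicted spam, actually ham
fn <- 7    # predicted ham, actually spam
tn <- 209  # predicted ham, actually ham

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)  # share of real spam that is caught
specificity <- tn / (tn + fp)  # share of real ham that is kept
precision   <- tp / (tp + fp)  # share of spam predictions that are right

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The results agree with the Accuracy, Sensitivity, Specificity, and Pos Pred Value rows above (0.9352, 0.9595, 0.9167, and 0.8973).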

Interpret Model with LIME

Explanation

## Incorrectly Predicted Test Dataset

Count of misclassified messages

## 
## FALSE  TRUE 
##    26   375
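
A count like this comes from comparing predicted and actual labels element by element. The report does not print those vectors, so the sketch below rebuilds stand-in vectors from the four cells of the Naive Bayes confusion matrix; the names predicted, actual, and wrong_rows are chosen here for illustration.

```r
# Rebuild label vectors from the four cells of the Naive Bayes confusion matrix.
cell_sizes <- c(209, 7, 19, 166)
predicted <- rep(c("ham", "ham", "spam", "spam"), times = cell_sizes)
actual    <- rep(c("ham", "spam", "ham", "spam"), times = cell_sizes)

# TRUE marks a correct prediction, FALSE a misclassified message.
correct <- predicted == actual
table(correct)

# Row positions of the misclassified messages, usable to index the test texts.
wrong_rows <- which(!correct)
length(wrong_rows)
```

On real data, wrong_rows would index the test set so that the misclassified messages can be read directly.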

Find the words that appear frequently in misclassified messages
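
One way to find words shared by many messages is to count, for each term, how many messages contain it. The misclassified messages themselves are not reproduced in this report, so the sketch below uses two cleaned messages from the data-test.csv output as stand-in input; the function name document_frequency is hypothetical.

```r
# Stand-in input: two cleaned messages from the data-test.csv output of this report.
texts <- c(
  "km aks app sehari terpopulernikmati aks youtub ga habi habi dgn beli pkt unlimit myim httpimm",
  "sisa kuota mbbeli pkt internet terbaik dr im ooredoo myim http imm kelebihan pemakaian dikenakan tarif perkb"
)

# Count in how many messages each term appears (document frequency).
document_frequency <- function(texts) {
  # Keep each term once per message so repeats inside one message do not inflate it.
  terms_per_text <- lapply(strsplit(texts, " ", fixed = TRUE), unique)
  sort(table(unlist(terms_per_text)), decreasing = TRUE)
}

doc_freq <- document_frequency(texts)
head(doc_freq, 5)
```

Applied to the 26 misclassified messages, the top entries would be the candidates for words such as isi, kuota, and pulsa.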


DECISION TREE

Model Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  199   10
##       spam  29  163
##                                           
##                Accuracy : 0.9027          
##                  95% CI : (0.8694, 0.9299)
##     No Information Rate : 0.5686          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8043          
##                                           
##  Mcnemar's Test P-Value : 0.003948        
##                                           
##             Sensitivity : 0.9422          
##             Specificity : 0.8728          
##          Pos Pred Value : 0.8490          
##          Neg Pred Value : 0.9522          
##              Prevalence : 0.4314          
##          Detection Rate : 0.4065          
##    Detection Prevalence : 0.4788          
##       Balanced Accuracy : 0.9075          
##                                           
##        'Positive' Class : spam            
## 
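
Because the model is chosen on precision, the two confusion matrices can be compared side by side. The sketch below recomputes precision and accuracy from the counts reported for Naive Bayes and the decision tree, with spam as the positive class; the data frame layout is an illustration, not the report's code.

```r
# Confusion-matrix cells from this report, with "spam" as the positive class.
models <- data.frame(
  model = c("Naive Bayes", "Decision Tree"),
  tp = c(166, 163),
  fp = c(19, 29),
  fn = c(7, 10),
  tn = c(209, 199),
  stringsAsFactors = FALSE
)

# Precision penalises false positives, i.e. ham wrongly flagged as spam.
models$precision <- models$tp / (models$tp + models$fp)
models$accuracy  <- (models$tp + models$tn) /
  (models$tp + models$fp + models$fn + models$tn)

# Pick the model with the highest precision.
best <- models$model[which.max(models$precision)]
best
```

Naive Bayes produces fewer false positives (19 against 29), which is why it wins on precision as well as on accuracy.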

DATA-TEST.CSV WITH NAIVE BAYES

Importing data-test.csv

## [1] "Km baru saja akses Apps Sehari-hari terpopuler.Nikmati akses YOUTUBE ga habis habis dgn beli pkt Unlimited HANYA di *123# atau myIM3 http://im3.do/m3"         
## [2] "GRATIS UNLIMITED YOUTUBE+INTERNET 10GB+CHAT&SOSMED+SMS+NELPON selama 30hari.Data Rollover.PROMO 100Rb (Normal 115rb). MAU? Tekan C25 kirim SMS ke 929 sekarang"
## [3] "Sisa kuota 285 MB.Beli pkt Internet TERBAIK dr IM3 ooredoo di *123# atau myIM3 http:// im3.do/m3 .Kelebihan pemakaian dikenakan tarif perKB"                   
## [4] "Ada banyak lowongan kerja baru! Ayo jgn sampai kamu ketinggalan update & tips di dunia kerja, tekan *123*543*2# . Tarif Rp2.200/3hari. Info: 08001401686 DSI7" 
## [5] "Proses PEMBLOKIRAN kartu bagi yg blm registrasi sdg berjalan,segera registrasi kartu Anda,dapatkan bonus 250MB+250mnt+250SMS.Ketik ULANG#NIK#No.KK# SMS ke4444"
## [6] "iRing keren cuman buat km, Via Vallen-Bojo Galak (Reff),Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!"
##                                                                                                                          text
## 1.content                       km aks app sehari terpopulernikmati aks youtub ga habi habi dgn beli pkt unlimit myim httpimm
## 2.content               grati unlimit youtubeinternet gbchatsosmedsmsnelpon data rolloverpromo rb normal rb tekan c kirim sms
## 3.content        sisa kuota mbbeli pkt internet terbaik dr im ooredoo myim http imm kelebihan pemakaian dikenakan tarif perkb
## 4.content                                    lowongan kerja ayo jgn ketinggalan updat tip dunia kerja tekan tarif rp info dsi
## 5.content prose pemblokiran kartu yg blm registrasi sdg berjalan registrasi kartu dapatkan bonus mbmntsmsketik ulangnikkk sms
## 6.content                                       ire keren cuman km via vallenbojo galak reffrphr prpnjngan rp hnya bls ya lho

Predict Using Naive Bayes Model

##               datetime
## 1 2018-03-01T00:32:00Z
## 2 2018-03-01T08:57:00Z
## 3 2018-03-01T09:15:00Z
## 4 2018-03-01T16:42:00Z
## 5 2018-03-01T17:42:00Z
## 6 2018-03-01T22:04:00Z
##                                                                                                                                                             text
## 1          Km baru saja akses Apps Sehari-hari terpopuler.Nikmati akses YOUTUBE ga habis habis dgn beli pkt Unlimited HANYA di *123# atau myIM3 http://im3.do/m3
## 2 GRATIS UNLIMITED YOUTUBE+INTERNET 10GB+CHAT&SOSMED+SMS+NELPON selama 30hari.Data Rollover.PROMO 100Rb (Normal 115rb). MAU? Tekan C25 kirim SMS ke 929 sekarang
## 3                    Sisa kuota 285 MB.Beli pkt Internet TERBAIK dr IM3 ooredoo di *123# atau myIM3 http:// im3.do/m3 .Kelebihan pemakaian dikenakan tarif perKB
## 4  Ada banyak lowongan kerja baru! Ayo jgn sampai kamu ketinggalan update & tips di dunia kerja, tekan *123*543*2# . Tarif Rp2.200/3hari. Info: 08001401686 DSI7
## 5 Proses PEMBLOKIRAN kartu bagi yg blm registrasi sdg berjalan,segera registrasi kartu Anda,dapatkan bonus 250MB+250mnt+250SMS.Ketik ULANG#NIK#No.KK# SMS ke4444
## 6                                                   iRing keren cuman buat km, Via Vallen-Bojo Galak (Reff),Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!
##   status
## 1   spam
## 2   spam
## 3   spam
## 4   spam
## 5    ham
## 6   spam