# Introduction
As the final step of the machine learning course, every student must complete one case as their machine learning capstone project. In this article, we walk through one such case: “Spam SMS Classification”.
# Datasets
The train dataset is used to train and evaluate the model, while the test dataset is used for the final evaluation. The final evaluation requires you to submit your predictions on the test dataset to the leaderboard in order to obtain the final model score (more details are provided below). Both datasets share the same scheme: datetime (the timestamp of the SMS), text (the message content), and status (the ham/spam label, which is empty in the test data).
# Case Study
The SMS dataset was collected by the team for educational purposes. It is a real SMS dataset with a spam/ham label for each message.
SMS: “I didn’t get your message!”
Someone might contact you through the old-school way of SMS, and you might even miss the message because the amount of spam in your inbox is just too much. An SMS is classified as spam based on users’ reports of unwanted messages. Can we build a spam classifier?
The problem above asks you to classify whether a text message is SPAM or HAM based on its content.
For this case study, there are two SMS dataset sources:
1. SMS Train Dataset
2. SMS Test Dataset
# Data Preprocessing and Exploratory Data Analysis
## A. Text data preprocessing
First, load the following packages:
library(lubridate) #Make Dealing with Dates a Little Easier
library(tidyverse) #The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command -> ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
library(readr) #Read Rectangular Text Data
library(caret) #Classification and Regression Training
library(dplyr) #A Grammar of Data Manipulation
library(e1071) #Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071)
library(ROCR) #Visualizing the Performance of Scoring Classifiers
library(SnowballC) #Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library
In this phase, we read the source SMS data. (The read_csv() function from the readr package could be used to speed up reading; the code below uses base read.csv().)
sms_datatrain <- read.csv("4. sms-cl-spam/data/data-train.csv",
                          stringsAsFactors = FALSE,
                          encoding = "UTF-8") %>%
  mutate(status = as.factor(status),       # target label: ham/spam
         datetime = as_datetime(datetime), # parse the timestamp
         hourly = hour(datetime))          # extract the hour of day as a feature
glimpse(sms_datatrain)
#> Rows: 2,004
#> Columns: 4
#> $ datetime <dttm> 2017-02-15 14:48:00, 2017-02-15 15:24:00, 2017-02-15 16:0...
#> $ text <chr> "Telegram code 53784", "Rezeki Nomplok Dompetku Pengiriman...
#> $ status <fct> ham, spam, ham, ham, ham, ham, ham, spam, spam, spam, spam...
#> $ hourly <int> 14, 15, 16, 16, 18, 18, 18, 10, 11, 18, 18, 9, 9, 11, 13, ...
Take 20 text samples from the ‘spam’ category and observe which words or phrases can serve as indicators (predictors) of spam. We can also use a wordcloud to look at the most frequent words in the overall collection of words (the “bag of words”).
To see the “SPAM” word pattern through the bag of words:
library(wordcloud)
sms_datatrain %>%
  filter(status == "spam") %>%
  head(20) %>%
  pull(text) # take the text column as a character vector
#> [1] "Rezeki Nomplok Dompetku Pengiriman Uang! Kirim uang di Alfamart & dptkan hadiah jutaan rupiah setiap hari.Periode s.d. 28Feb17.Info: http://bit.ly/dmpurna MFI1"
#> [2] "YEAY! Free Ice Tea atau Cashback up to 30% dg transaksi di AH Resto! Hanya untuk pengguna TCASH TAP. S&K Berlaku. Info tsel.me/tappromo"
#> [3] "Voting your Offer. Disc 40%, 1 crispy chicken+1 spicy chicken+ nasi+lotteria tea Rp.26rb. Tukar SMS ini di LOTTERIA terdekat. Berlaku hari ini. SKB. Promo *606#"
#> [4] "Ayo bergabung dgn Freedom Postpaid! Makin rame makin seru, ajak teman & keluarga diskonnya lebih besar. Daftar di http://im3.do/uxU PAI1"
#> [5] "Nikmati kemudahan mewujudkan impian kamu dan pasangan utk masa depan yg lebih cerah. Cek Dana Bantuan Sahabat di DOMPETKU! Info: http://bit.ly/dmpdbs MFI3"
#> [6] "Gratis 1 bulan Spotify Premium khusus FreedomCombo. Bisa bebas dengar musik,bikin playlist sepuasnya tanpa iklan dgn Spotify Premium. Aktifkan di *123*123# CVI1"
#> [7] "Masukan Username & Password ini di aplikasi Spotify. Username:085722688068 Password:1khF1SpC Mohon lgsg ganti alamat email & password di aplikasi Spotify."
#> [8] "TIPS HEMAT DATA: pakai resolusi 480p ketika menonton video online, temukan di setting layanan video yg kamu nikmati. Ini akan membuat pemakaian data lbh hemat"
#> [9] "Ayam (syp/paha bwh),Nasi,Ades Rp.18.181. Add on CD Bebi Glenn Rp.22.727. Tkr SMS hari ini di CFC CIBUBUR JUNC. Selama persediaan msh ada. Promo*606#"
#> [10] "YEAY! Kejutan cashback & freebies dg TCASH TAP! Terus #pakeTCASH, cek HP kamu & dapatkan kejutannya. S&K berlaku. Info cek tsel.me/yeay"
#> [11] "Beli EXTRA kuota 1GB harga DISKON cuma Rp10rb. Ketik YA9 kirim ke 929 sd. 20/02/17"
#> [12] "Hari Senin saatnya Nonton Hemat hanya 25ribu di Cinema XXI dgn TCASH TAP! Dptkan stiker TCASH TAP di GraPARI terdekat. Info tsel.me/tappromo"
#> [13] "Disc 50% setiap Senin, 10% di hari lainnya di Coffee Bean dgn TCASH TAP! Dapatkan stiker TCASH TAP di graPARI terdekat. Info tsel.me/tappromo"
#> [14] "Beli EXTRA kuota 1GB harga DISKON cuma Rp10rb. Ketik YA9 kirim ke 929 sd. 20/02/17"
#> [15] "Pelanggan 085722688068, Ada yang Ngajak Kamu Chatting. Hubungi *858*11# untuk baca. Pesan akan dihapus dalam 5 menit. Silakan hubungi *858*11# sekarang."
#> [16] "Dapatkan hasil investasi yg bersaing dgn deposito, segera maksimalkan hasil Investasi Anda dgn fitur Auto-Invest di DOMPETKU! Info: http://bit.ly/bnp2211 MFI2"
#> [17] "Harga spesial Rp 75ribu atau Disc 44% utk tiket Jungle Land. Khusus pengguna TCASH TAP! S&K Berlaku. Info tsel.me/tappromo"
#> [18] "Cashback 50% setiap di Haagen-Dasz Selasa&Kamis / 10% hari lainnya khusus dengan TCASH TAP! S&K berlaku. Info tsel.me/tappromo"
#> [19] "Raih kesempatan mendapat Smartphone keren, cukup download iflix skrg & tonton film sebanyak2nya. Unduh iflix di im3.do/iflix Info kunjungi website kami. CVI2"
#> [20] "Hanya dengan Isi ulang akumulasi 150rb sebelum 24 Febuari 2017, Dapatkan GRATIS 3GB berlaku 30 hari. Mau? Ayo buruan isi ulang sekarang juga."
sms_datatrain %>%
  filter(status == "spam") %>%
  pull(text) %>%
  wordcloud(max.words = 250, scale = c(2, 0.4), random.order = FALSE,
            colors = brewer.pal(8, "BrBG"))
Words that potentially indicate that a text is spam: kuota, pulsa, paket, sms, internetan, rezeki, hadiah, disc, promo, diskon, gratis, bonus.
See the hourly distribution of SMS using a histogram:
library(ggplot2)
hist(sms_datatrain$hourly, breaks = 30) # distribution of SMS by hour of day
See the hourly pattern of SMS by status using a box plot:
# box plot of hour vs status, with individual points overlaid
ggplot(data = sms_datatrain, mapping = aes(x = hourly, y = status)) +
  geom_boxplot() +
  geom_point()
For a more interesting view of the hourly pattern, use a bar chart (geom_col), so we can see the hourly proportion/composition of SPAM vs HAM SMS.
sms_datatrain_freq <- as.data.frame(table(sms_datatrain$status,
sms_datatrain$hourly))
sms_datatrain_freq
#> Var1 Var2 Freq
#> 1 ham 0 4
#> 2 spam 0 30
#> 3 ham 1 1
#> 4 spam 1 19
#> 5 ham 2 2
#> 6 spam 2 4
#> 7 ham 3 0
#> 8 spam 3 5
#> 9 ham 4 0
#> 10 spam 4 2
#> 11 ham 5 7
#> 12 spam 5 4
#> 13 ham 6 10
#> 14 spam 6 11
#> 15 ham 7 85
#> 16 spam 7 28
#> 17 ham 8 91
#> 18 spam 8 78
#> 19 ham 9 130
#> 20 spam 9 65
#> 21 ham 10 86
#> 22 spam 10 64
#> 23 ham 11 58
#> 24 spam 11 60
#> 25 ham 12 73
#> 26 spam 12 71
#> 27 ham 13 68
#> 28 spam 13 79
#> 29 ham 14 60
#> 30 spam 14 55
#> 31 ham 15 63
#> 32 spam 15 58
#> 33 ham 16 65
#> 34 spam 16 40
#> 35 ham 17 77
#> 36 spam 17 38
#> 37 ham 18 49
#> 38 spam 18 16
#> 39 ham 19 60
#> 40 spam 19 31
#> 41 ham 20 56
#> 42 spam 20 28
#> 43 ham 21 67
#> 44 spam 21 23
#> 45 ham 22 40
#> 46 spam 22 18
#> 47 ham 23 10
#> 48 spam 23 15
ggplot(data = sms_datatrain_freq, mapping = aes(x = Freq, y = Var2 )) +
geom_col(mapping = aes(fill = Var1), position = "fill") +
labs(x = "Spam & Ham count Proportion",
y = "Hour",
fill = "",
title = "Proportion of Spam & Ham SMS",
subtitle = "Publish Hour vs Spam/Ham") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(legend.position = "top")
ggplot(data = sms_datatrain_freq, mapping = aes(x = Freq, y = Var2 )) +
geom_col(mapping = aes(fill = Var1), position = "dodge") +
labs(x = "Spam & Ham counts",
y = "Hour",
fill = "",
title = "No. of Spam & Ham SMS",
subtitle = "Publish Hour vs Spam/Ham") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(legend.position = "right")
Based on the data visualizations (wordcloud, box plot, and bar charts), we can conclude:
- Most SPAM SMS come from telco provider promotions (e.g. kuota, paket, sms, gratis).
- SPAM SMS are also sent in the early hours of the day.
- The peak of SMS broadcasting activity runs from morning until afternoon.
Before starting model development, we need to clean the text data. The text is first converted to corpus format and then cleaned.
A corpus is a collection of documents. In this case, one document is equivalent to one SMS observation; one SMS can contain one or more sentences.
In general, the steps that are often performed for text cleansing are (a compact example follows this list):
1. Case-folding: change all words to lower case.
2. Remove numbers: remove all digits.
3. Remove stopwords: delete words that appear frequently in the corpus and are usually not meaningful.
4. Remove punctuation: replace certain characters with spaces/blanks.
5. Stemming: cut words down to their base (root) form.
6. Remove white space: remove excess white space, because the subsequent tokenizing step will split words on the space character (" ").
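A compact sketch of these six steps applied to a single hypothetical message (illustration only; the full pipeline on the real corpus follows below):
library(tm)
library(SnowballC)
doc <- VCorpus(VectorSource("GRATIS 3GB!! Ketik YA9, kirim ke 929 skrg"))
doc <- tm_map(doc, content_transformer(tolower))      # 1. case-folding
doc <- tm_map(doc, removeNumbers)                     # 2. remove numbers
doc <- tm_map(doc, removeWords, stopwords("english")) # 3. remove stopwords
doc <- tm_map(doc, removePunctuation)                 # 4. remove punctuation
doc <- tm_map(doc, stemDocument)                      # 5. stemming
doc <- tm_map(doc, stripWhitespace)                   # 6. remove excess white space
doc[[1]]$content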
One of the packages we can use for text mining is tm. A character vector can be converted to a corpus with the VCorpus() function:
library(tm)
# convert the text vector into a corpus
sms.corpus_train <- VCorpus(VectorSource(sms_datatrain$text))
sms.corpus_train
#> <<VCorpus>>
#> Metadata: corpus specific: 0, document level (indexed): 0
#> Content: documents: 2004
nrow(sms_datatrain)
#> [1] 2004
Let’s inspect the content of one SMS (for example, the 9th):
sms.corpus_train[[9]]$content
#> [1] "Voting your Offer. Disc 40%, 1 crispy chicken+1 spicy chicken+ nasi+lotteria tea Rp.26rb. Tukar SMS ini di LOTTERIA terdekat. Berlaku hari ini. SKB. Promo *606#"
library(stopwords)
# 1. case-folding: convert all text to lowercase
sms.corpus_train <- tm_map(sms.corpus_train, content_transformer(tolower))
# 2. remove numbers
sms.corpus_train <- tm_map(sms.corpus_train, removeNumbers)
# 3. remove stopwords: words that appear often in the corpus and are usually not meaningful
sms.corpus_train <- tm_map(sms.corpus_train, removeWords, stopwords("english"))
# check the 9th document
sms.corpus_train[[9]]$content
#> [1] "voting offer. disc %, crispy chicken+ spicy chicken+ nasi+lotteria tea rp.rb. tukar sms ini di lotteria terdekat. berlaku hari ini. skb. promo *#"
We create a transformer function that replaces a given pattern with a space (" "):
transformer <- content_transformer(FUN = function(x, pattern){
  # replace every match of `pattern` with a single space
  gsub(x = x, pattern = pattern, replacement = " ")
})
We can use the removePunctuation function to drop punctuation characters. Punctuation removed: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
# replace ".", "/", "@", "-" and other patterns with a white space
sms.corpus_train <-
  VCorpus(VectorSource(sms_datatrain$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(transformer, "/") %>%
  tm_map(transformer, "@") %>%
  tm_map(transformer, "-") %>%
  tm_map(transformer, "\\.") %>%                    # "\\." matches "." anywhere in the text
  tm_map(transformer, "<[^>]+>.") %>%               # remove HTML-like tags
  tm_map(transformer, "@\\S+.") %>%                 # remove mentions
  tm_map(transformer, "*#") %>%                     # remove hashtags
  tm_map(transformer, "http[^[:space:]]*.") %>%     # remove URLs
  tm_map(transformer, "&.") %>%                     # remove "&" and the following character
  tm_map(transformer, "[[:punct:]].") %>%           # remove punctuation followed by any character (e.g. "...")
  tm_map(transformer, "[^[:alpha:][:space:]].") %>% # remove remaining non-letter characters
  tm_map(removePunctuation)                         # remove any remaining punctuation (from tm; no space inserted)
# check the 9th document again
sms.corpus_train[[9]]$content
#> [1] "voting offer disc crispy chicken spicy chicken nasi otteria tea rp rb tukar sms ini di lotteria terdekat berlaku hari ini skb promo "
sms.corpus_train[[5]]$content
#> [1] "apakah anda mencoba mengakses akun anda dari perangkat lain jika ya mohon klik tautan ini api gojek co id customers device oken e e bac dc dalam jam ke depan jika tidak mohon abaikan pesan ini"
Next we perform **stemming**, cutting each word down to its root. For example, *walking, walked, walks* all become *walk*.
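A quick check with SnowballC’s wordStem() illustrates the idea:
library(SnowballC)
wordStem(c("walking", "walked", "walks"))
#> [1] "walk" "walk" "walk"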
library(SnowballC) # requires the SnowballC package to be installed
# stemming: cut each word to its root
sms.corpus_train <- tm::tm_map(sms.corpus_train, stemDocument)
# check the 9th document again
sms.corpus_train[[9]]$content
#> [1] "vote offer disc crispi chicken spici chicken nasi otteria tea rp rb tukar sms ini di lotteria terdekat berlaku hari ini skb promo"
Finally, we remove excess white space, because the subsequent tokenizing step will split words on the space character (" ").
# remove white space
sms.corpus_train <- tm_map(sms.corpus_train, stripWhitespace)
sms.corpus_train[[9]]$content
#> [1] "vote offer disc crispi chicken spici chicken nasi otteria tea rp rb tukar sms ini di lotteria terdekat berlaku hari ini skb promo"
sms.corpus_train %>%
  wordcloud(max.words = 200, scale = c(2, 0.5), random.order = FALSE,
            colors = brewer.pal(8, "BrBG"))
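Before splitting, the cleaned corpus has to be tokenized into a document-term matrix. That step is implied but not shown here; a minimal sketch of it, assuming the cleaned sms.corpus_train from above:
# build the document-term matrix used for the train-test split below
sms_train.dtm <- DocumentTermMatrix(x = sms.corpus_train)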
To obtain a robust prediction model, the training data is split into two parts:
- Data Train: the base data for model development
- Data Test (validation): to test the model built on the train data
Split the data into sms_train_val and sms_test_val with a 75%–25% ratio.
RNGkind(sample.kind = "Rounding")
set.seed(100)
# train-test splitting
index <- sample(nrow(sms_train.dtm), nrow(sms_train.dtm)*0.75)
# sms_train.dtm is a DocumentTermMatrix without labels
sms_train_val <- sms_train.dtm[index,]
sms_test_val <- sms_train.dtm[-index,]
The labels for the prediction target:
# labels for train and test, taken from the sms_datatrain data frame
label_train_val <- sms_datatrain[index, 'status']
label_test_val <- sms_datatrain[-index, 'status']
Check the class proportions of the target in label_train_val and label_test_val:
prop.table(table(label_train_val))
#> label_train_val
#> ham spam
#> 0.584165 0.415835
prop.table(table(label_test_val))
#> label_test_val
#> ham spam
#> 0.5668663 0.4331337
Check the dimensions of sms_train_val and sms_test_val, which will be used for model development:
dim(sms_train_val)
#> [1] 1503 2877
dim(sms_test_val)
#> [1] 501 2877
We have a large number of predictors: 2,877. Let’s reduce the noise in our data by keeping only words that appear reasonably often, for example at least 8 times across all SMS. Use the findFreqTerms() function:
# frequent terms from the training DTM
sms_freq_val <- findFreqTerms(sms_train_val, lowfreq = 8)
length(sms_freq_val)
#> [1] 462
Note: the choice lowfreq = 8 is not absolute and can be changed for feature selection. Keep in mind: the larger the lowfreq, the fewer terms we keep as features/predictors.
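For example, raising lowfreq shrinks the vocabulary; a quick sketch (the second count is illustrative and depends on the data):
length(findFreqTerms(sms_train_val, lowfreq = 8))  # 462 terms, as above
length(findFreqTerms(sms_train_val, lowfreq = 20)) # fewer terms survive the stricter cut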
Let’s subset the train data to only the words that appear in sms_freq_val:
sms_train_Val <- sms_train_val[, sms_freq_val] # terms are in the columns
# note: the subset is assigned to `sms_train_Val` (capital V) and is not used again;
# the steps below continue with the full 2,877-term `sms_train_val`
inspect(sms_train_val)
#> <<DocumentTermMatrix (documents: 1503, terms: 2877)>>
#> Non-/sparse entries: 16826/4307305
#> Sparsity : 100%
#> Maximal term length: 22
#> Weighting : term frequency (tf)
#> Sample :
#> Terms
#> Docs anda atau dgn info kamu kuota paket pulsa saya sms
#> 1400 1 1 0 0 0 2 1 1 0 1
#> 196 0 0 0 0 0 0 0 0 0 0
#> 197 0 0 0 0 0 0 0 0 0 1
#> 225 0 0 0 0 0 0 0 0 0 0
#> 239 0 1 0 0 0 0 0 0 4 0
#> 29 0 1 0 0 0 0 0 0 4 0
#> 378 0 0 0 0 0 0 0 0 0 0
#> 403 0 0 0 0 0 0 0 0 0 0
#> 409 0 0 0 0 0 0 0 0 0 0
#> 410 1 0 1 2 0 0 2 2 0 7
The values in the train matrix are still frequencies. For the probability calculation, the frequencies are converted to presence indicators: appears (1) or not (0). One way is a Bernoulli converter:
bernoulli_conv <- function(x){
  # ifelse(condition, value if TRUE, value if FALSE)
  x <- as.factor(ifelse(x > 0, 1, 0))
  return(x)
}
# test the function
bernoulli_conv(c(3, 0, 0, 1, 4, 0))
#> [1] 1 0 0 1 1 0
#> Levels: 0 1
Next, apply bernoulli_conv to sms_train_val and sms_test_val:
sms_train_bn_val <- apply(X = sms_train_val, MARGIN = 2, FUN = bernoulli_conv)
sms_test_bn_val <- apply(X = sms_test_val, MARGIN = 2, FUN = bernoulli_conv)
dim(sms_train_bn_val)
#> [1] 1503 2877
dim(sms_test_bn_val)
#> [1] 501 2877
Check the result:
sms_train_bn_val[1:25, 35:40]
#> Terms
#> Docs ahlinya ahmad air aja ajaa ajak
#> 617 "0" "0" "0" "0" "0" "0"
#> 517 "0" "0" "0" "0" "0" "0"
#> 1106 "0" "0" "0" "0" "0" "0"
#> 113 "0" "0" "0" "0" "0" "0"
#> 938 "0" "0" "0" "0" "0" "0"
#> 968 "0" "0" "0" "0" "0" "0"
#> 1624 "0" "0" "0" "0" "0" "0"
#> 740 "0" "0" "0" "0" "0" "0"
#> 1091 "0" "0" "0" "0" "0" "0"
#> 340 "0" "0" "0" "0" "0" "0"
#> 1247 "0" "0" "0" "0" "0" "0"
#> 1759 "0" "0" "0" "0" "0" "0"
#> 559 "0" "0" "0" "0" "0" "0"
#> 794 "0" "0" "0" "0" "0" "0"
#> 1518 "0" "0" "0" "0" "0" "0"
#> 1331 "0" "0" "0" "0" "0" "0"
#> 407 "0" "0" "0" "0" "0" "0"
#> 711 "0" "0" "0" "0" "0" "0"
#> 714 "0" "0" "0" "0" "0" "0"
#> 1371 "0" "0" "0" "0" "0" "0"
#> 1064 "0" "0" "0" "0" "0" "0"
#> 1410 "0" "0" "0" "0" "0" "0"
#> 1068 "0" "0" "0" "0" "0" "0"
#> 1484 "0" "0" "0" "0" "0" "0"
#> 832 "0" "0" "0" "0" "0" "0"
For this project we will use a Naive Bayes model; the sms_train_bn_val matrix is ready for training, and we use library(e1071). The Naive Bayes model script:
library(e1071)
# train
naive_spam <- naiveBayes(x = sms_train_bn_val, # predictor matrix
                         y = label_train_val)  # label / target variable
head(naive_spam$tables)
#> $aagc
#> aagc
#> label_train_val 0 1
#> ham 0.998861048 0.001138952
#> spam 1.000000000 0.000000000
#>
#> $abaikan
#> abaikan
#> label_train_val 0 1
#> ham 0.98974943 0.01025057
#> spam 0.99840000 0.00160000
#>
#> $abi
#> abi
#> label_train_val 0 1
#> ham 1.0000 0.0000
#> spam 0.9968 0.0032
#>
#> $abu
#> abu
#> label_train_val 0 1
#> ham 0.998861048 0.001138952
#> spam 1.000000000 0.000000000
#>
#> $abung
#> abung
#> label_train_val 0
#> ham 1
#> spam 1
#>
#> $acara
#> acara
#> label_train_val 0 1
#> ham 0.998861048 0.001138952
#> spam 1.000000000 0.000000000
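To make the mechanics concrete, here is a minimal sketch, considering a single term only, of how naiveBayes combines the class priors with one of the conditional tables shown above (the full model multiplies such likelihoods over all terms):
# posterior P(class | "abaikan" present), using only this one term
prior <- naive_spam$apriori / sum(naive_spam$apriori) # class priors from the training labels
lik   <- naive_spam$tables$abaikan[, "1"]             # P("abaikan" present | class)
prior * lik / sum(prior * lik)                        # normalized posterior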
Predict the target class for sms_test_bn_val. Save the result to the sms_pred_class object, which will be used for the confusion matrix evaluation.
# predict
dim(sms_test_bn_val)
#> [1] 501 2877
sms_pred_class <- predict(object = naive_spam,       # the naive bayes model
                          newdata = sms_test_bn_val, # validation data
                          type = "class")            # predict the class label
head(sms_pred_class)
#> [1] ham spam spam ham spam spam
#> Levels: ham spam
table(sms_pred_class)
#> sms_pred_class
#> ham spam
#> 273 228
prop.table(table(sms_pred_class))
#> sms_pred_class
#> ham spam
#> 0.5449102 0.4550898
Check the composition of the spam/ham predictions:
sms_test_bn_val_df <- data.frame(sms_test_bn_val)
sms_pred_class_df <- data.frame(sms_pred_class)
dim(sms_test_bn_val_df)
#> [1] 501 2877
prop.table(table(sms_pred_class_df))
#> sms_pred_class_df
#> ham spam
#> 0.5449102 0.4550898
(table(sms_pred_class_df))
#> sms_pred_class_df
#> ham spam
#> 273 228
Evaluate the naive_spam model using a confusion matrix and its metrics, via library(caret):
library(caret)
confusionMatrix(data = sms_pred_class,      # predicted labels
                reference = label_test_val, # actual labels
                positive = "spam")          # positive class: spam
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction ham spam
#> ham 264 9
#> spam 20 208
#>
#> Accuracy : 0.9421
#> 95% CI : (0.9179, 0.9609)
#> No Information Rate : 0.5669
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.8828
#>
#> Mcnemar's Test P-Value : 0.06332
#>
#> Sensitivity : 0.9585
#> Specificity : 0.9296
#> Pos Pred Value : 0.9123
#> Neg Pred Value : 0.9670
#> Prevalence : 0.4331
#> Detection Rate : 0.4152
#> Detection Prevalence : 0.4551
#> Balanced Accuracy : 0.9441
#>
#> 'Positive' Class : spam
#>
Which cases do we want to minimize?
- False Negative: the SMS is actually spam but is predicted as ham (so spam enters our inbox).
- False Positive: the SMS is actually ham but is predicted as spam (so a legitimate SMS ends up in our spam folder).
Minimize: False Positive
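Given that focus, the metric to watch is precision (Pos Pred Value): of all messages predicted as spam, how many really are spam. A quick check from the confusion matrix above (TP = 208, FP = 20):
precision <- 208 / (208 + 20)
precision
#> [1] 0.9122807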
ROC and AUC as model evaluation tools:
a. The ROC is a curve of the true positive rate (TPR = TP / (TP + FN)) against the false positive rate (FPR = FP / (FP + TN)) at each threshold; the AUC is the area under the ROC curve.
b. The more convex the ROC curve (i.e., the closer the AUC is to one), the better the model separates the positive and negative classes.
First, prepare the prediction results as probabilities for sms_test_bn_val, and save them to an object named sms_pred_prob:
sms_pred_prob <- predict(naive_spam, sms_test_bn_val, type = "raw")
head(sms_pred_prob)
#> ham spam
#> [1,] 1.00000000000000000000 0.00000000000000004174301
#> [2,] 0.00000000000004828383 0.99999999999995181632073
#> [3,] 0.00001158056778772665 0.99998841943221228323324
#> [4,] 0.99964574107841930317 0.00035425892158074773165
#> [5,] 0.00024669691554197196 0.99975330308445797822259
#> [6,] 0.14715885403056078173 0.85284114596943916275507
To make things easier, prepare the ROC data and save it to an object named data_sms_roc:
data_sms_roc <- data.frame(pred_prob = sms_pred_prob[,'spam'],
actual_label = ifelse(label_test_val == 'spam', 1, 0))
head(data_sms_roc)
#> pred_prob actual_label
#> 1 0.00000000000000004174301 0
#> 2 0.99999999999995181632073 1
#> 3 0.99998841943221228323324 1
#> 4 0.00035425892158074773165 1
#> 5 0.99975330308445797822259 1
#> 6 0.85284114596943916275507 0
Create the ROC curve, saving the resulting prediction() object under the name sms_roc:
library(ROCR)
sms_roc <- prediction(predictions = data_sms_roc$pred_prob,
labels = data_sms_roc$actual_label)
plot(performance(sms_roc, "tpr", "fpr"))
abline(0, 1, lty = 2) # diagonal reference line: the performance of random guessing
Calculate the AUC value:
sms_auc <- performance(sms_roc, measure = "auc")
sms_auc@y.values
#> [[1]]
#> [1] 0.9841955
The AUC value is 0.9841955, so we can conclude that the Naive Bayes spam classifier performs very well at separating the positive class (spam) from the negative class (ham).
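Since we want to minimize false positives, we can also trade some recall for precision by raising the decision threshold for spam above the default 0.5. A sketch using the sms_pred_prob probabilities from above (the 0.9 cutoff is an arbitrary illustration):
# classify as spam only when the predicted spam probability exceeds 0.9
pred_strict <- factor(ifelse(sms_pred_prob[, "spam"] > 0.9, "spam", "ham"),
                      levels = c("ham", "spam"))
confusionMatrix(data = pred_strict, reference = label_test_val, positive = "spam")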
To summarize:
- Naive Bayes is often used for text classification because its computation is fast enough to handle many word features.
- The text-mining workflow included feature selection with findFreqTerms() and conversion with the Bernoulli converter.
- For this capstone project, SMS SPAM classification with a Naive Bayes model, the validation results together with the ROC and AUC analysis (AUC = 0.9841955) show that the model is highly predictive and separates the positive class (spam) from the negative class (ham) very well.
Prepare the test data: read and transform it in the same way as the train/validation data used in model development.
sms_datatest <- read.csv("4. sms-cl-spam/data/data-test.csv",
                         stringsAsFactors = FALSE,
                         encoding = "UTF-8") %>%
  mutate(status = as.factor(status)) # status is empty in the test data
transformer <- content_transformer(FUN = function(x, pattern){
gsub(x = x,
pattern = pattern,
replacement = " ")
})
sms.corpus_test <-
  VCorpus(VectorSource(sms_datatest$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(transformer, "/") %>%
  tm_map(transformer, "@") %>%
  tm_map(transformer, "-") %>%
  tm_map(transformer, "\\.") %>%                    # "\\." matches "." anywhere in the text
  tm_map(transformer, "<[^>]+>.") %>%               # remove HTML-like tags
  tm_map(transformer, "@\\S+.") %>%                 # remove mentions
  tm_map(transformer, "*#") %>%                     # remove hashtags
  tm_map(transformer, "http[^[:space:]]*.") %>%     # remove URLs
  tm_map(transformer, "&.") %>%                     # remove "&" and the following character
  tm_map(transformer, "[[:punct:]].") %>%           # remove punctuation followed by any character (e.g. "...")
  tm_map(transformer, "[^[:alpha:][:space:]].") %>% # remove remaining non-letter characters
  tm_map(removePunctuation) %>%
  tm::tm_map(stemDocument) %>%
  tm_map(stripWhitespace)
glimpse(sms_datatest)
#> Rows: 283
#> Columns: 3
#> $ datetime <chr> "2018-03-01T00:32:00Z", "2018-03-01T08:57:00Z", "2018-03-0...
#> $ text <chr> "Km baru saja akses Apps Sehari-hari terpopuler.Nikmati ak...
#> $ status <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
# check the 9th document
sms.corpus_test[[9]]$content
#> [1] "preorder samsung galaxi s cashback s d rpribu disc s d rpribu bln dgn kartukredit di samsung store tertentu s d mar s info"
sms_test.dtm <- DocumentTermMatrix(x = sms.corpus_test)
inspect(sms_test.dtm)
#> <<DocumentTermMatrix (documents: 283, terms: 808)>>
#> Non-/sparse entries: 4021/224643
#> Sparsity : 98%
#> Maximal term length: 22
#> Weighting : term frequency (tf)
#> Sample :
#> Terms
#> Docs anda axi diblokir info jam kamu ketik registrasi sms ulang
#> 134 0 0 0 1 0 1 0 0 2 0
#> 180 1 0 0 0 0 0 1 0 1 0
#> 200 0 0 0 0 0 0 0 0 0 0
#> 208 1 0 0 0 0 0 1 1 1 2
#> 209 1 0 0 0 0 0 1 1 1 2
#> 212 1 0 0 0 0 0 1 1 1 2
#> 64 2 0 0 0 0 0 0 0 1 0
#> 82 2 0 0 0 0 0 0 0 1 0
#> 84 2 0 0 0 0 0 0 0 1 0
#> 88 2 0 0 0 0 0 0 0 1 0
index_test <- sample(nrow(sms_test.dtm))
label_test_test <- sms_datatest[index_test, 'status'] # all NA: the test data is unlabeled
sms_test_test <- sms_test.dtm
prop.table(table(label_test_test))
#> numeric(0)
dim(label_test_test)
#> NULL
dim(sms_test_test)
#> [1] 283 808
sms_freq_test <- findFreqTerms(sms_test_test, lowfreq = 8)
length(sms_freq_test)
#> [1] 125
sms_test_test1 <- sms_test_test[, sms_freq_test] # terms are in the columns
inspect(sms_test_test1)
#> <<DocumentTermMatrix (documents: 283, terms: 125)>>
#> Non-/sparse entries: 2586/32789
#> Sparsity : 93%
#> Maximal term length: 11
#> Weighting : term frequency (tf)
#> Sample :
#> Terms
#> Docs anda axi diblokir info jam kamu ketik registrasi sms ulang
#> 27 1 0 2 0 0 0 1 1 2 2
#> 32 1 0 2 1 0 0 1 1 2 2
#> 38 1 0 2 1 0 0 1 1 2 2
#> 44 1 0 2 1 0 0 1 1 2 2
#> 60 1 0 2 0 0 0 1 1 2 2
#> 61 1 0 2 0 0 0 1 1 2 2
#> 69 1 0 2 1 0 0 1 1 2 2
#> 72 1 0 2 0 0 0 1 1 2 2
#> 79 1 0 2 0 0 0 1 1 2 2
#> 81 1 0 2 1 0 0 1 1 2 2
dim(sms_test_test1)
#> [1] 283 125
bernoulli_conv <- function(x){
  # ifelse(condition, value if TRUE, value if FALSE)
  x <- as.factor(ifelse(x > 0, 1, 0))
  return(x)
}
sms_test_bn_test <- apply(X = sms_test_test1, MARGIN = 2, FUN = bernoulli_conv)
dim(sms_test_bn_test)
#> [1] 283 125
sms_test_bn_test_df <- data.frame(sms_test_bn_test)
# predict.naiveBayes matches newdata columns to the model's features by name,
# so only terms shared with the training vocabulary contribute to the prediction
pred_test <- predict(object = naive_spam,           # the naive bayes model
                     newdata = sms_test_bn_test_df, # final test data
                     type = "class")                # predict the class label
pred_test_df <- data.frame(pred_test)
dim(sms_test_bn_test_df)
#> [1] 283 125
(table(pred_test_df))
#> pred_test_df
#> ham spam
#> 92 191
head(pred_test)
#> [1] spam spam spam spam ham spam
#> Levels: ham spam
table(pred_test)
#> pred_test
#> ham spam
#> 92 191
prop.table(table(pred_test))
#> pred_test
#> ham spam
#> 0.3250883 0.6749117
library(wordcloud)
wordcloud(sms_datatest$text, max.words = 200, scale = c(2, 0.5),random.order = FALSE)
submission <- sms_test_bn_test_df %>%
mutate(datetime = sms_datatest$datetime,
text = sms_datatest$text
) %>%
mutate(status = pred_test) %>%
select(datetime,status)
table(submission$status)
#>
#> ham spam
#> 92 191
prop.table(table(submission$status))
#>
#> ham spam
#> 0.3250883 0.6749117
# save data
write.csv(submission, "submission-dwi_susiyanto.csv", row.names = F)
# check the first 3 rows
head(submission, 3)
#> datetime status
#> 1 2018-03-01T00:32:00Z spam
#> 2 2018-03-01T08:57:00Z spam
#> 3 2018-03-01T09:15:00Z spam
nrow(submission)
#> [1] 283