Do you keep getting spam SMS from anonymous numbers?
We have the same problem!
So this time we will build a machine learning model to detect spam messages.
Dataset source: https://www.kaggle.com/shravan3273/sms-spam
library(dplyr)
library(e1071)
library(tm)
library(caret)
library(wordcloud)
library(RColorBrewer)
sms <- read.csv("spamraw.csv")
Check data type
glimpse(sms)
## Rows: 5,559
## Columns: 2
## $ type <chr> "ham", "ham", "ham", "spam", "spam", "ham", "ham", "ham", "spam",…
## $ text <chr> "Hope you are having a good week. Just checking in", "K..give bac…
Check the first 5 rows of our dataset
head(sms,5)
Change data type
We have to convert the type column from character to factor.
sms <- sms %>%
  mutate(type = as.factor(type))
head(sms)
Check Missing Values
colSums(is.na(sms))
## type text
## 0 0
There are no missing values in our dataset.
Check the class proportions of our dataset
prop.table(table(sms$type))
##
## ham spam
## 0.8656233 0.1343767
See the first 5 SMS messages
head(sms$text,5)
## [1] "Hope you are having a good week. Just checking in"
## [2] "K..give back my thanks."
## [3] "Am also doing in cbe only. But have to pay."
## [4] "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+"
## [5] "okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm"
We use the tm package for text mining. We convert the message text into a corpus with the VCorpus() function.
Change Text to Corpus
sms.corpus <- VCorpus(VectorSource(sms$text))
We have to clean the text: remove numbers, convert to lowercase, remove stopwords and punctuation, stem the words, and strip extra whitespace.
sms.corpus <- sms.corpus %>%
  tm_map(removeNumbers) %>% # remove numeric characters
  tm_map(content_transformer(tolower)) %>% # convert to lowercase
  tm_map(removeWords, stopwords("english")) %>% # remove English stopwords (and, the, am)
  tm_map(removePunctuation) %>% # remove punctuation marks
  tm_map(stemDocument) %>% # stem words (e.g. from walking to walk)
  tm_map(stripWhitespace) # collapse repeated whitespace
Check text content
Get the content of document 111.
sms.corpus[[111]]$content
## [1] "wow healthi old airport rd lor cant thk anyth els b bath dog later"
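To make the effect of the cleaning steps more concrete, here is a rough base-R approximation of three of them (number removal, punctuation removal, whitespace stripping) plus lowercasing, applied to one made-up message. It is only a sketch: the message and the variable names `msg` and `clean` are hypothetical, and stopword removal and stemming are left out.

```r
# Hypothetical spam-like message (not taken from the dataset)
msg <- "URGENT! Call 09061743806 NOW to claim your FREE prize."

clean <- gsub("[[:digit:]]+", "", msg)     # drop digits
clean <- tolower(clean)                    # convert to lowercase
clean <- gsub("[[:punct:]]+", "", clean)   # drop punctuation
clean <- gsub("[[:space:]]+", " ", clean)  # collapse repeated whitespace

clean
## [1] "urgent call now to claim your free prize"
```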
After cleaning the text, we have to convert it into a document-term matrix (DTM). This step relies on tokenization: splitting each message into individual terms, so that every document becomes a row and every term becomes a column.
sms.dtm <- DocumentTermMatrix(sms.corpus)
as.data.frame(head(as.matrix(sms.dtm)))
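As a small illustration of what a document-term matrix holds, the sketch below builds one by hand in base R from three made-up, already-cleaned messages. The names `docs`, `tokens`, `vocab`, and `toy_dtm` are hypothetical, and this is not how tm builds its sparse matrix internally.

```r
# Three hypothetical cleaned messages
docs <- c("free prize call now", "call me later", "free free entry")

# Tokenize: split each message on spaces
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct term, sorted
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document (rows = documents, columns = terms)
toy_dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(toy_dtm) <- vocab
rownames(toy_dtm) <- seq_along(docs)

toy_dtm
```

Most cells are zero even in this tiny example, which is why the real DTM reports a sparsity close to 100%.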
Inspect DTM
Check our DTM data.
inspect(sms.dtm)
## <<DocumentTermMatrix (documents: 5559, terms: 6559)>>
## Non-/sparse entries: 42147/36419334
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs call can come day free get just know now will
## 1814 1 0 0 0 0 0 0 0 0 0
## 2046 0 0 0 0 0 0 0 1 0 0
## 295 0 0 0 0 1 0 0 0 0 0
## 2993 0 1 0 1 0 0 0 0 0 0
## 313 0 0 1 2 0 1 0 0 0 11
## 3201 0 0 0 0 1 0 0 0 0 0
## 3522 0 0 0 0 0 0 0 0 0 0
## 399 0 0 0 0 0 0 0 0 0 0
## 5068 0 0 0 6 0 0 0 0 0 0
## 5279 0 3 0 0 1 1 0 0 0 0
sms[1000,"text"]
## [1] "K..k...from tomorrow onwards started ah?"
Before building the model, we have to split our data into a training set and a test set, with 80% of the rows used for training.
set.seed(123)
index <- sample(nrow(sms.dtm), nrow(sms.dtm)*0.80)
data_train <- sms.dtm[index, ]
data_test <- sms.dtm[-index,]
Prepare the target labels
label_train <- sms[index, "type"]
label_test <- sms[-index, "type"]
Check the class proportions of the training labels
prop.table(table(label_train))
## label_train
## ham spam
## 0.8641781 0.1358219
The proportions in the training data are about 86% ham and 14% spam, close to those of the full dataset.
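The split relies on negative indexing to select every row that was not sampled. A minimal sketch on a hypothetical 10-row data frame shows the idea; `toy`, `n_train`, and `idx` are illustrative names, and floor() is written out here only to make the rounding of the 80% size explicit.

```r
set.seed(123)

# Hypothetical data: 8 ham and 2 spam messages
toy <- data.frame(id = 1:10,
                  type = rep(c("ham", "spam"), times = c(8, 2)))

n_train <- floor(nrow(toy) * 0.80)  # number of training rows
idx <- sample(nrow(toy), n_train)   # row numbers drawn without replacement

toy_train <- toy[idx, ]    # sampled rows
toy_test  <- toy[-idx, ]   # all remaining rows

nrow(toy_train)
nrow(toy_test)
length(intersect(toy_train$id, toy_test$id))  # 0: the two sets do not overlap
```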
sms_freq <- findFreqTerms(x = data_train, lowfreq = 20)
Check sms_freq head
head(sms_freq, 20)
## [1] "abl" "abt" "account" "actual" "afternoon" "aight"
## [7] "alreadi" "alright" "also" "alway" "anoth" "answer"
## [13] "anyth" "anyway" "appli" "around" "ask" "await"
## [19] "award" "away"
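The frequency filter can be pictured as keeping only the columns whose total count reaches a threshold. The base-R sketch below does this on a small dense matrix; the matrix, the threshold of 3, and the names `toy_dtm`, `term_totals`, and `frequent` are all hypothetical.

```r
# Hypothetical term counts: rows = documents, columns = terms
toy_dtm <- matrix(c(1, 0, 2,
                    0, 1, 0,
                    3, 0, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Docs = 1:3,
                                  Terms = c("call", "later", "free")))

# Total occurrences of each term across all documents
term_totals <- colSums(toy_dtm)

# Keep terms that occur at least 3 times in total
frequent <- names(term_totals[term_totals >= 3])
frequent
## [1] "call" "free"

# Restrict the matrix to those terms
toy_dtm[, frequent, drop = FALSE]
```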
Keep only the terms that appear in sms_freq
data_train <- data_train[, sms_freq]
Make a Bernoulli converter
# DIY function: map term counts to presence (1) or absence (0)
bernoulli_conv <- function(x){
  x <- as.factor(ifelse(x > 0, 1, 0))
  return(x)
}
# try the function
bernoulli_conv(c(0,1,3,0,12,4,0.3))
## [1] 0 1 1 0 1 1 1
## Levels: 0 1
Apply the Bernoulli converter to data_train and data_test:
data_train_bn <- apply(X = data_train, MARGIN = 2, FUN = bernoulli_conv)
data_test_bn <- apply(X = data_test, MARGIN = 2, FUN = bernoulli_conv)
See the result
data_train_bn[20:30, 50:60]
## Terms
## Docs check claim class code collect come contact cool cos cost credit
## 4444 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 1017 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 2013 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 5475 "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0"
## 2888 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 2567 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 1450 "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0"
## 1790 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 4307 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 2980 "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0"
## 1614 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
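Note that the values above are printed in quotes. When the function passed to apply() returns factors, apply() assembles the results into a character matrix. The sketch below reproduces this on a small hypothetical matrix; `toy`, `to_presence`, and `toy_bn` are illustrative names.

```r
# Hypothetical term counts for three documents and two terms
toy <- matrix(c(0, 2, 1,
                0, 0, 3),
              nrow = 3,
              dimnames = list(NULL, c("free", "call")))

# Same idea as bernoulli_conv: counts become presence/absence factors
to_presence <- function(x) as.factor(ifelse(x > 0, 1, 0))

toy_bn <- apply(toy, MARGIN = 2, FUN = to_presence)

toy_bn
is.character(toy_bn)  # TRUE: the factor results were coerced to character
```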
We build a Naive Bayes model from the processed training data.
model_naive <- naiveBayes(x = data_train_bn,
                          y = label_train,
                          laplace = 1)
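The laplace = 1 argument asks for add-one smoothing, so that a word never seen in one class does not force a probability of zero. The sketch below works through the usual add-one formula for a two-valued (present/absent) feature with made-up counts. All names and numbers are hypothetical, and it is meant as an illustration of the idea rather than a reproduction of e1071's internal computation.

```r
# Hypothetical counts: 4 spam messages, none of which contain the word
n_spam <- 4
n_spam_with_word <- 0
laplace <- 1
n_values <- 2  # the feature takes two values: present or absent

# Without smoothing the estimate is exactly zero
p_unsmoothed <- n_spam_with_word / n_spam

# With add-one smoothing every value gets one extra pseudo-count
p_smoothed <- (n_spam_with_word + laplace) / (n_spam + laplace * n_values)

p_unsmoothed
p_smoothed  # 1/6
```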
Convert data_train_bn to a data frame if we want to look at its contents.
as.data.frame(head(data_train_bn))
We predict the target class for the test data and save the result in sms_predClass.
sms_predClass <- predict(object = model_naive,
                         newdata = data_test_bn,
                         type = "class")
head(sms_predClass)
## [1] spam ham ham ham ham ham
## Levels: ham spam
We use a confusion matrix to evaluate our Naive Bayes model.
result <- confusionMatrix(data = sms_predClass, # predicted labels
                          reference = label_test, # actual labels
                          positive = "spam")
result
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 961 22
## spam 8 121
##
## Accuracy : 0.973
## 95% CI : (0.9617, 0.9817)
## No Information Rate : 0.8714
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8744
##
## Mcnemar's Test P-Value : 0.01762
##
## Sensitivity : 0.8462
## Specificity : 0.9917
## Pos Pred Value : 0.9380
## Neg Pred Value : 0.9776
## Prevalence : 0.1286
## Detection Rate : 0.1088
## Detection Prevalence : 0.1160
## Balanced Accuracy : 0.9189
##
## 'Positive' Class : spam
##
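Based on the result above, we got quite a good model: accuracy is 97%, sensitivity is 85%, specificity is 99%, and precision is 94%. In other words, the model catches 121 of the 143 spam messages in the test set and wrongly flags only 8 of the 969 ham messages. We can conclude that our model is good at detecting spam.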
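The reported metrics can be checked by hand from the four cells of the confusion matrix, with spam as the positive class. The counts below are copied from the output above; the variable names are illustrative.

```r
# Cells of the confusion matrix (positive class = spam)
tp <- 121  # predicted spam, actually spam
fn <- 22   # predicted ham,  actually spam
fp <- 8    # predicted spam, actually ham
tn <- 961  # predicted ham,  actually ham

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # share of spam that is caught
specificity <- tn / (tn + fp)  # share of ham that is left alone
precision   <- tp / (tp + fp)  # share of spam predictions that are correct

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```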
Make a word cloud
wordcloud(words = sms.corpus, min.freq = 100, random.order = FALSE, rot.per=0.35, colors=brewer.pal(12, "Paired"))
From the word cloud above, we can see that words such as call, now, get, can, will, come, free, and just appear most often in the messages.