1 Introduction

Do you keep getting spam SMS from unknown numbers?

We have the same problem!

So this time we will build a machine learning model to detect spam messages.

Dataset source: https://www.kaggle.com/shravan3273/sms-spam

2 Import Libraries

library(dplyr)
library(e1071)
library(tm)
library(caret)
library(wordcloud)
library(RColorBrewer)

3 Read Data

sms <- read.csv("spamraw.csv")

4 Data Wrangling

Check the data types

glimpse(sms)
## Rows: 5,559
## Columns: 2
## $ type <chr> "ham", "ham", "ham", "spam", "spam", "ham", "ham", "ham", "spam",…
## $ text <chr> "Hope you are having a good week. Just checking in", "K..give bac…

Check the first 5 rows of our dataset

head(sms,5)

Change data type

We have to convert the type column into a factor, because it is the class label (ham or spam) that the model will predict.

sms <- sms %>% 
    mutate(type = as.factor(type))
    
head(sms)

Check Missing Values

colSums(is.na(sms))
## type text 
##    0    0

There are no missing values in our dataset.

5 Exploratory Data Analysis (EDA)

Check the class proportion of our dataset

prop.table(table(sms$type))
## 
##       ham      spam 
## 0.8656233 0.1343767
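
The proportion above also gives a useful reference point: a classifier that always answers "ham" would already be right about 87% of the time, so any model has to beat that number to be worth using. A minimal sketch in base R, with the two proportions copied by hand from the output above (the variable names are only for illustration):

```r
# Class proportions copied from the prop.table() output above
class_prop <- c(ham = 0.8656233, spam = 0.1343767)

# A "always predict the majority class" baseline
majority_class <- names(which.max(class_prop))
baseline_accuracy <- max(class_prop)

majority_class
baseline_accuracy
```

Because the classes are imbalanced, accuracy alone can look good even for a weak model, which is why sensitivity and precision are also worth checking later.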

See the first 5 SMS messages

head(sms$text,5)
## [1] "Hope you are having a good week. Just checking in"                                                                                                                
## [2] "K..give back my thanks."                                                                                                                                          
## [3] "Am also doing in cbe only. But have to pay."                                                                                                                      
## [4] "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+"            
## [5] "okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm"

5.1 Data Pre-processing

5.1.1 Text to Corpus

We use the tm package for text mining. We convert the text column into a corpus with the function VCorpus().

Change Text to Corpus

sms.corpus <- VCorpus(VectorSource(sms$text))

5.1.2 Text Cleansing

We have to remove numbers, convert the text to lowercase, remove stopwords and punctuation marks, stem the words, and strip extra whitespace.

sms.corpus <- sms.corpus %>% 
              tm_map(removeNumbers) %>% # remove digits
              tm_map(content_transformer(tolower)) %>% # convert to lowercase
              tm_map(removeWords, stopwords("english")) %>% # remove English stopwords (e.g. and, the, am)
              tm_map(removePunctuation) %>%  # remove punctuation marks
              tm_map(stemDocument) %>% # stem words (e.g. walking becomes walk)
              tm_map(stripWhitespace) # collapse repeated whitespace

Check text content

Get the content of document 111.

sms.corpus[[111]]$content
## [1] "wow healthi old airport rd lor cant thk anyth els b bath dog later"
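
To make the cleansing steps more concrete, here is a rough base R sketch of three of them (digits, lowercase, punctuation) plus whitespace cleanup, applied to a made-up message. It is only an approximation of the tm pipeline: stopword removal and stemming are left out because they rely on tm's word list and stemmer, and the helper name clean_text is hypothetical.

```r
# Hypothetical helper that imitates part of the tm cleaning pipeline
clean_text <- function(x) {
  x <- gsub("[0-9]+", "", x)        # remove digits
  x <- tolower(x)                   # convert to lowercase
  x <- gsub("[[:punct:]]+", "", x)  # remove punctuation marks
  x <- gsub("\\s+", " ", x)         # collapse repeated whitespace
  trimws(x)                         # drop leading/trailing spaces
}

cleaned <- clean_text("URGENT!! Call 0906 NOW to claim   your FREE prize.")
cleaned
## [1] "urgent call now to claim your free prize"
```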

5.1.3 Document-Term Matrix (DTM)

After cleaning the text, we convert it into a Document-Term Matrix (DTM). This process is called tokenization: each message is split into individual terms, and the DTM stores how many times each term appears in each document.

sms.dtm <- DocumentTermMatrix(sms.corpus)
as.data.frame(head(as.matrix(sms.dtm)))
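
The idea behind a DTM can be sketched by hand in base R. The three tiny documents below are invented for illustration, and the code is not how tm builds its (sparse) matrix internally; it only shows the shape of the result: one row per document, one column per term, counts in the cells.

```r
# Three made-up, already-cleaned documents
docs <- c(d1 = "free call now", d2 = "call me later", d3 = "free free prize")

# Tokenization: split every document on spaces
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: all distinct terms, sorted alphabetically
terms <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(counts) <- terms

# Transpose so that rows are documents and columns are terms
toy_dtm <- t(counts)
toy_dtm
```

Most cells are zero even in this toy case, which is why the real DTM reports a sparsity close to 100%.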

Inspect DTM

Check our DTM.

inspect(sms.dtm)
## <<DocumentTermMatrix (documents: 5559, terms: 6559)>>
## Non-/sparse entries: 42147/36419334
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   call can come day free get just know now will
##   1814    1   0    0   0    0   0    0    0   0    0
##   2046    0   0    0   0    0   0    0    1   0    0
##   295     0   0    0   0    1   0    0    0   0    0
##   2993    0   1    0   1    0   0    0    0   0    0
##   313     0   0    1   2    0   1    0    0   0   11
##   3201    0   0    0   0    1   0    0    0   0    0
##   3522    0   0    0   0    0   0    0    0   0    0
##   399     0   0    0   0    0   0    0    0   0    0
##   5068    0   0    0   6    0   0    0    0   0    0
##   5279    0   3    0   0    1   1    0    0   0    0
sms[1000,"text"]
## [1] "K..k...from tomorrow onwards started ah?"

5.2 Cross Validation

Before building the model we have to split our data into training data and test data, with 80% of the rows used for training. Note that this is a single hold-out split rather than k-fold cross-validation.

set.seed(123)

index <- sample(nrow(sms.dtm), nrow(sms.dtm)*0.80)

data_train <- sms.dtm[index, ]
data_test <- sms.dtm[-index,]

Prepare data label target

label_train <- sms[index, "type"]
label_test <- sms[-index, "type"]
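
The split above is purely random, so the ham/spam ratio in each part can drift slightly from the full dataset. If you want the ratio to be preserved exactly, one option is a stratified split, sketched here in base R on made-up labels (80 ham, 20 spam). This is an alternative shown for illustration, not what the code above does, and all object names in it are hypothetical.

```r
set.seed(123)

# Made-up labels: 80 ham and 20 spam
toy_labels <- factor(rep(c("ham", "spam"), times = c(80, 20)))

# Row indices grouped by class
idx_by_class <- split(seq_along(toy_labels), toy_labels)

# Sample 80% of the rows inside each class separately
toy_train_idx <- unlist(
  lapply(idx_by_class, function(idx) sample(idx, round(0.8 * length(idx)))),
  use.names = FALSE
)

# Both parts keep the original 80/20 class ratio
prop.table(table(toy_labels[toy_train_idx]))
prop.table(table(toy_labels[-toy_train_idx]))
```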

Check the class proportion of the training labels

prop.table(table(label_train))
## label_train
##       ham      spam 
## 0.8641781 0.1358219

We can see that the training data is about 86% ham and 14% spam, which is close to the proportion in the full dataset.

5.3 Further Data Pre-processing

sms_freq <- findFreqTerms(x = data_train, lowfreq = 20)

Check sms_freq head

head(sms_freq, 20)
##  [1] "abl"       "abt"       "account"   "actual"    "afternoon" "aight"    
##  [7] "alreadi"   "alright"   "also"      "alway"     "anoth"     "answer"   
## [13] "anyth"     "anyway"    "appli"     "around"    "ask"       "await"    
## [19] "award"     "away"
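
The frequency filter can be pictured as "keep the columns whose total count reaches the threshold". The sketch below reproduces that idea in base R on a small dense matrix with invented counts and a threshold of 4 instead of 20; it is meant as an illustration of the filtering logic, not as a replacement for findFreqTerms().

```r
# Made-up counts: 3 documents x 3 terms
toy_counts <- matrix(
  c(3, 0, 1,   # call
    0, 1, 0,   # lor
    2, 2, 2),  # free
  nrow = 3,
  dimnames = list(Docs = c("1", "2", "3"), Terms = c("call", "lor", "free"))
)

# Total number of occurrences of each term across all documents
term_totals <- colSums(toy_counts)

# Keep the terms that occur at least 4 times in total
frequent_terms <- names(term_totals)[term_totals >= 4]
frequent_terms
## [1] "call" "free"
```

Dropping rare terms shrinks the number of columns considerably and removes words that appear too seldom to tell us anything reliable about the class.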

Keep only the terms that appear in sms_freq

data_train <- data_train[, sms_freq]
data_test <- data_test[, sms_freq] # keep the same columns in the test set

Make a Bernoulli Converter

# custom (DIY) function
bernoulli_conv <- function(x){
  x <- as.factor(ifelse(x > 0, 1, 0))
  return(x)
}

# try the function
bernoulli_conv(c(0,1,3,0,12,4,0.3))
## [1] 0 1 1 0 1 1 1
## Levels: 0 1

Apply the Bernoulli converter to data_train and data_test:

data_train_bn <- apply(X = data_train, MARGIN = 2, FUN = bernoulli_conv)
data_test_bn <- apply(X = data_test, MARGIN = 2, FUN = bernoulli_conv)

See the result

data_train_bn[20:30, 50:60]
##       Terms
## Docs   check claim class code collect come contact cool cos cost credit
##   4444 "0"   "0"   "0"   "0"  "0"     "0"  "0"     "0"  "0" "0"  "0"   
##   1017 "0"   "0"   "0"   "0"  "0"     "0"  "0"     "0"  "0" "0"  "0"   
##   2013 "0"   "0"   "0"   "0"  "0"     "0"  "0"     "0"  "0" "0"  "0"   
##   5475 "0"   "0"   "0"   "0"  "1"     "0"  "0"     "0"  "0" "0"  "0"   
##   2888 "0"   "0"   "0"   "0"  "0"     "0"  "0"     "0"  "0" "0"  "0"   
##   2567 "0"   "0"   "0"   "0"  "0"     "0"  "0"     "0"  "0" "0"  "0"   
##   1450 "0"   "0"   "0"   "0"  "0"     "0"  "0"     "0"  "1" "0"  "0"   
##   1790 "0"   "0"   "0"   "0"  "0"     "0"  "0"     "0"  "0" "0"  "0"   
##   4307 "0"   "0"   "0"   "0"  "0"     "0"  "0"     "0"  "0" "0"  "0"   
##   2980 "0"   "0"   "0"   "0"  "0"     "0"  "0"     "0"  "1" "0"  "0"   
##   1614 "0"   "0"   "0"   "0"  "0"     "0"  "0"     "0"  "0" "0"  "0"
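
The values above are printed in quotes because apply() returns a matrix, and a matrix cannot hold factors: factor results are coerced to character. The small sketch below shows this on an invented 3 x 2 count matrix, using the same converter as above.

```r
# Same converter as in the text above
bernoulli_conv <- function(x){
  x <- as.factor(ifelse(x > 0, 1, 0))
  return(x)
}

# Made-up counts: 3 documents x 2 terms
toy_counts <- matrix(
  c(0, 2, 1,   # call
    0, 0, 3),  # free
  nrow = 3,
  dimnames = list(Docs = c("1", "2", "3"), Terms = c("call", "free"))
)

toy_bn <- apply(X = toy_counts, MARGIN = 2, FUN = bernoulli_conv)

# The result is a character matrix of "0"/"1", not a matrix of factors
is.character(toy_bn)
toy_bn
```

So each cell now records only whether a term is present ("1") or absent ("0") in a message, not how many times it occurs.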

6 Model Fitting

We build a Naive Bayes model from the training data that we have already processed.

model_naive <- naiveBayes(x = data_train_bn,
                          y = label_train, 
                          laplace = 1)
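
The laplace = 1 argument asks for Laplace (add-one) smoothing. Without it, a word that never appears in any training spam message would get a conditional probability of exactly zero and would wipe out the whole product for that class. The sketch below shows the textbook formula for a binary (present/absent) feature; the helper name laplace_prob and the counts are hypothetical, and the exact computation inside e1071 may differ in detail.

```r
# Textbook Laplace smoothing for a categorical feature (hypothetical helper):
#   (count of the value in the class + laplace) /
#   (number of messages in the class + laplace * number of possible values)
laplace_prob <- function(n_present, n_class, laplace = 1, n_levels = 2) {
  (n_present + laplace) / (n_class + laplace * n_levels)
}

# Invented example: a word seen in 0 of 100 spam messages
p_unsmoothed <- 0 / 100
p_smoothed <- laplace_prob(n_present = 0, n_class = 100)

p_unsmoothed  # exactly zero
p_smoothed    # small, but no longer zero (1/102)
```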

Convert data_train_bn to a data frame if we want to inspect its contents.

as.data.frame(head(data_train_bn))

6.1 Prediction

We predict the target class for the test data and save the result into sms_predClass.

sms_predClass <- predict(object = model_naive, 
                         newdata = data_test_bn,
                         type = "class")

head(sms_predClass)
## [1] spam ham  ham  ham  ham  ham 
## Levels: ham spam
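
Behind each predicted class is a simple calculation: multiply the class prior by the probability of every observed feature given that class, then pick the class with the larger result. The sketch below does this by hand for one message and two words. All priors and word probabilities are invented for illustration; they are not taken from model_naive.

```r
# Invented class priors and P(word present | class)
prior <- c(ham = 0.86, spam = 0.14)
p_word <- rbind(
  ham  = c(free = 0.02, call = 0.10),
  spam = c(free = 0.30, call = 0.40)
)

# A message that contains "free" but not "call"
msg <- c(free = 1, call = 0)

# Log-likelihood per class: present words use log(p), absent words use log(1 - p)
log_lik <- log(p_word) %*% msg + log(1 - p_word) %*% (1 - msg)

# Unnormalised log-posterior, then normalise to probabilities
log_post <- log(prior) + log_lik[, 1]
posterior <- exp(log_post - max(log_post))
posterior <- posterior / sum(posterior)

predicted <- names(which.max(posterior))
round(posterior, 3)
predicted
## [1] "spam"
```

Working in log space avoids numerical underflow when there are hundreds of words, since the product of many small probabilities quickly becomes too small to represent.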

6.2 Model Evaluation

We use a confusion matrix to evaluate our Naive Bayes model.

result <- confusionMatrix(data = sms_predClass, # predicted labels
                reference = label_test, # actual labels
                positive = "spam") 

7 Conclusion

result
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  961   22
##       spam   8  121
##                                           
##                Accuracy : 0.973           
##                  95% CI : (0.9617, 0.9817)
##     No Information Rate : 0.8714          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8744          
##                                           
##  Mcnemar's Test P-Value : 0.01762         
##                                           
##             Sensitivity : 0.8462          
##             Specificity : 0.9917          
##          Pos Pred Value : 0.9380          
##          Neg Pred Value : 0.9776          
##              Prevalence : 0.1286          
##          Detection Rate : 0.1088          
##    Detection Prevalence : 0.1160          
##       Balanced Accuracy : 0.9189          
##                                           
##        'Positive' Class : spam            
## 

Based on the result above, we got quite a good model: accuracy is 97.3%, sensitivity (recall) is 84.6%, specificity is 99.2%, and precision is 93.8%. We can conclude that our model is good at detecting spam, although it still lets 22 of the 143 spam messages in the test data through as ham.
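
These metrics can be recomputed by hand from the four cells of the confusion matrix, which is a good way to check what each one means. The counts below are copied from the output above with spam as the positive class; the F1 score is an extra that is not part of the printed output.

```r
# Cells of the confusion matrix above (positive class = spam)
tp <- 121  # predicted spam, actually spam
fn <- 22   # predicted ham,  actually spam
fp <- 8    # predicted spam, actually ham
tn <- 961  # predicted ham,  actually ham

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all messages classified correctly
sensitivity <- tp / (tp + fn)                   # share of real spam that was caught
specificity <- tn / (tn + fp)                   # share of real ham that was kept
precision   <- tp / (tp + fp)                   # share of spam predictions that were right
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                   specificity = specificity, precision = precision, f1 = f1), 4)
metrics
```

For a spam filter, precision and specificity matter a lot because a false positive means a legitimate message gets flagged as spam.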

Make wordcloud

wordcloud(words = sms.corpus, min.freq = 100, random.order = FALSE, rot.per=0.35, colors=brewer.pal(12, "Paired"))

From the word cloud above, we can see that words such as call, now, get, can, will, come, free, and just appear most often in the messages.