The goal of a typical text classification exercise is to categorize a document into one or more predefined categories. One very interesting application we see in our day-to-day lives is how Google and other email providers identify whether a new mail is spam. Let's build a simple spam detector using support vector machines.

Basic Setup and Data Import

#set working directory 
setwd("C:/Users/awani/Documents/GitHub/50daysofAnalytics/Day 18 - NLP Spam Detector")

#load libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(wordcloud, tm, tidyr, tidytext, RColorBrewer, RTextTools, e1071, caret, knitr)

#read data
spam_data = read.csv("spam_data.csv", stringsAsFactors = F)
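
It is worth confirming the structure before going further; the rest of the code assumes the file has two columns, Category ("ham"/"spam") and Message (the raw text):

#quick sanity check of the expected columns
str(spam_data)
head(spam_data$Message, 3)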

Data Cleaning

Before we start the modelling process, it's always better to clean the data first. Raw text contains noise such as user mentions, URLs, punctuation and digits. Some custom cleaning might be required, but the code below takes care of the most common data cleaning steps.

text = spam_data$Message  # copy the messages to a working vector "text"

text = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",text)         #remove RT/via mention patterns
text = gsub("http[^[:blank:]]+","",text)                   #remove URLs
text = gsub("@\\w+","",text)                               #remove @handles
text = gsub("[[:punct:]]","",text)                         #remove punctuation
text = trimws(text, which = "both")                        # trim leading/trailing whitespace

text = gsub('[[:digit:]]+', '', text)                      # remove digits
text = gsub("[\r\n]", "", text)                            # remove line breaks
text = iconv(text, to = "ASCII//TRANSLIT")                 # transliterate non-ASCII characters
text = iconv(text, "ASCII", "UTF-8", sub="")               # drop any remaining non-ASCII characters
text = tolower(text)                                       # convert to lower case

spam_data$Message = text
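
If new messages need to be scored later, they must go through exactly the same cleaning, in the same order (note that URLs are stripped before punctuation). A minimal sketch bundling the steps above into a reusable function; clean_text is a hypothetical name, not part of the original script:

#hypothetical helper wrapping the cleaning steps above for reuse at prediction time
clean_text = function(x) {
  x = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)  # RT/via mention patterns
  x = gsub("http[^[:blank:]]+", "", x)            # URLs
  x = gsub("@\\w+", "", x)                        # @handles
  x = gsub("[[:punct:]]", "", x)                  # punctuation
  x = gsub("[[:digit:]]+", "", x)                 # digits
  x = gsub("[\r\n]", "", x)                       # line breaks
  x = iconv(x, to = "ASCII//TRANSLIT")            # transliterate non-ASCII
  x = iconv(x, "ASCII", "UTF-8", sub = "")        # drop remaining non-ASCII
  tolower(trimws(x))                              # trim whitespace and lower-case
}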

Exploratory Analysis

Getting a feel for the data is always useful in any type of model building. A word cloud is a quick and effective way to visualize word frequencies and identify the most repeated words in the corpus. One must, however, be vigilant while making word clouds: if not used properly, they often paint a misleading picture.

# dependent variable
kable(table(spam_data$Category), col.names = c("Category", "Count"), align = "l")
Category   Count
ham        4825
spam       747
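Note the class imbalance: only 747 of the 5572 messages (about 13%) are spam, so a classifier that always predicts "ham" would already be about 87% accurate. This is worth keeping in mind when reading the accuracy numbers later:

#class proportions: spam makes up roughly 13% of the corpus
prop.table(table(spam_data$Category))
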
#independent variable - word cloud
corpus = Corpus(VectorSource(text))                          # convert messages to a corpus


# word frequency
uniqwords = as.matrix(TermDocumentMatrix(corpus))            # convert corpus to a term-document matrix 
wordfreq = sort(rowSums(uniqwords),decreasing=TRUE)          # total frequency of each word across the corpus
WCinput = data.frame(word = names(wordfreq),freq=wordfreq)   # word frequency dataframe

#generate the wordcloud
wordcloud(words = WCinput$word, freq = WCinput$freq, min.freq = 2,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
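
Because word clouds scale words non-linearly, the raw counts are a useful complement; the WCinput data frame built above already holds them:

#top 10 most frequent words as plain counts, a less distortion-prone view
head(WCinput, 10)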

Data Manipulation

Before we move on to training the classifier, we need to split the data into training and validation sets. We will create a document-term matrix of all the unique words with a total frequency of at least 10 in the messages and use them as independent variables in our classifier. We also need to ensure that the data is in the correct format.

# Build Corpus
corpus = Corpus(VectorSource(text))

# Build Document-Term Matrix
dtm = DocumentTermMatrix(corpus)

# Convert DTM to Dataframe
tdm_data = data.frame(data.matrix(dtm), stringsAsFactors = FALSE)

# Remove features with total frequency less than 10
tdm_data = tdm_data[,colSums(tdm_data) >= 10]

#final data (Spam must be a factor for svm to do classification)
tdm_data = cbind(Spam = factor(spam_data$Category), tdm_data)
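
A quick dimension check shows how many word features survive the frequency cut-off (the exact count depends on the cleaning steps above):

#rows = messages, columns = the Spam label plus surviving word features
dim(tdm_data)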

We will now randomly split the data into a 70% training set and a 30% validation set so we can train and test separately.

#split data into training and validation sets
set.seed(123)                                    # hypothetical seed, added for reproducibility
Index = sample(1:nrow(tdm_data), size = round(0.7*nrow(tdm_data)), replace = FALSE)
text.train = tdm_data[Index, ]
text.test = tdm_data[-Index, ]
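
A quick check that both splits keep a similar ham/spam mix (random splitting usually preserves it approximately):

#class balance should be roughly similar in both splits
prop.table(table(text.train$Spam))
prop.table(table(text.test$Spam))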

Model and Evaluation

Now that the data is in the shape we need, training the model becomes very simple: we just fit the svm function with a linear kernel on the training set. To understand how well the classifier fits, we first predict labels on the training data; a confusion matrix then gives us metrics like overall accuracy, sensitivity and specificity. On the training data these metrics are near-perfect, but this measures fit rather than generalization, which we check on the validation set afterwards (see the sketch after the output).

#SVM Model
svm_model = svm(Spam~., data=text.train, scale=FALSE, kernel='linear')
summary(svm_model)
## 
## Call:
## svm(formula = Spam ~ ., data = text.train, kernel = "linear", 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.001077586 
## 
## Number of Support Vectors:  431
## 
##  ( 290 141 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  ham spam
#prediction on the training data
pred = predict(svm_model, text.train[,-1])

#Confusion Matrix
confusionMatrix(factor(pred),factor(text.train$Spam))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  ham spam
##       ham  3360    9
##       spam    0  531
##                                           
##                Accuracy : 0.9977          
##                  95% CI : (0.9956, 0.9989)
##     No Information Rate : 0.8615          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9903          
##  Mcnemar's Test P-Value : 0.007661        
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9833          
##          Pos Pred Value : 0.9973          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.8615          
##          Detection Rate : 0.8615          
##    Detection Prevalence : 0.8638          
##       Balanced Accuracy : 0.9917          
##                                           
##        'Positive' Class : ham             
##
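
The confusion matrix above was computed on the training data, so it mainly confirms the fit. A minimal sketch of the corresponding check on the held-out 30% split, using the same functions (the exact numbers will depend on the random split):

#evaluate on the validation set to estimate out-of-sample performance
pred_test = predict(svm_model, text.test[,-1])
confusionMatrix(factor(pred_test), factor(text.test$Spam))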