SMS SPAM DETECTION

Synopsis:

This analysis builds a machine learning classifier to detect spam SMS messages. The SMS Spam Collection Data Set from the UCI Machine Learning Repository is used to train SVM models. The data contains 5574 SMS messages, each with a corresponding label (spam/ham). The messages are processed into a numerical dataset on which Support Vector Machine models are built and validated.

Load Data:

# Load Libraries
library(tm)
library(plyr)
library(class)
library(caret)
library(e1071)
library(knitr)

# Read data
rawdata <- read.csv("SMSSpamCollection",sep="\t",header=FALSE,quote="",stringsAsFactors=FALSE)
names(rawdata) <- c("Class","Message")

Sample Data

kable(rawdata[1:8,])

Class	Message
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…
ham	Ok lar… Joking wif u oni…
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s
ham	U dun say so early hor… U c already then say…
ham	Nah I don’t think he goes to usf, he lives around here though
spam	FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham	Even my brother is not like to speak with me. They treat me like aids patent.
ham	As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune

Preprocessing:

- Because normal (ham) messages tend to contain fewer digits than spam messages, the total number of digits in each message is added to the feature set.
- Spam messages are also usually longer, as marketing campaigns use most of the available characters, so the total number of characters in each message is added as a second feature.
- The text messages are then processed to build a document-term matrix, which holds the frequency of each word in each message (document).
- Words that appear fewer than three times across all messages are removed, as they are unlikely to have a significant influence on classification.

# Find total number of characters in each SMS
NumberOfChar <- as.numeric(lapply(rawdata$Message,FUN=nchar))

# Find number of numeric digits in each SMS

number.digits <- function(vect) {
    length(as.numeric(unlist(strsplit(gsub("[^0-9]", "", unlist(vect)), ""))))
}

NumberOfDigits <- as.numeric(lapply(rawdata$Message,FUN=number.digits))
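
As a quick sanity check of the two numeric features, the digit count can also be written in vectorized form. This is a minimal sketch on made-up messages; `count.digits` and `msgs` are illustrative names that are not part of the analysis.

```r
# Illustrative helper: strip every non-digit, then count what is left.
count.digits <- function(x) nchar(gsub("[^0-9]", "", x))

# Made-up example messages (not taken from the dataset).
msgs <- c("Ok lar", "Text FA to 87121", "call 0800 now 4 free")

digits <- count.digits(msgs)   # 0 5 5
chars  <- nchar(msgs)          # total characters per message

print(data.frame(msgs, chars, digits))
```

Both `nchar` and `gsub` operate on whole character vectors, so no `lapply` is needed in this form.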

# Function to clean text in the SMS

clean.text = function(x)
{ 
  # tolower
  x = tolower(x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
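  # remove runs of two or more spaces, tabs or "|" characters
  # (note: the run is deleted outright, not collapsed to a single space)
  x = gsub("[ |\t]{2,}", "", x)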
  # remove blank spaces at the beginning
  x = gsub("^ ", "", x)
  # remove blank spaces at the end
  x = gsub(" $", "", x)
  # remove common words
  x = removeWords(x,stopwords("en"))
  return(x)
}

cleanText <- clean.text(rawdata$Message)
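
One detail of the cleaning step worth checking is how runs of whitespace are handled. The sketch below, on a made-up string, compares the pattern used in `clean.text` with a hypothetical alternative that collapses the run to a single space; the variable names are illustrative only.

```r
# Made-up input with two spaces between the words.
x <- "free  entry"

# Pattern used in clean.text: the whole run is removed, so the words are joined.
joined <- gsub("[ |\t]{2,}", "", x)    # "freeentry"

# Hypothetical alternative: collapse any run of spaces/tabs to one space.
kept <- gsub("[ \t]+", " ", x)         # "free entry"

print(c(joined, kept))
```

If joining words is not intended, the second form keeps them apart. Switching to it would change the vocabulary, so the feature count and accuracies reported in this analysis would need to be recomputed.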

# Build Corpus
corpus <- Corpus(VectorSource(cleanText))

# Build Document-Term Matrix (one row per message, one column per word)
tdm <- DocumentTermMatrix(corpus)

# Convert DTM to Dataframe
tdm.df <- as.data.frame(data.matrix(tdm),stringsAsFactors=FALSE)

# Remove features with total frequency less than 3
tdm.new <- tdm.df[,colSums(tdm.df) > 2]
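
To make the structure of the word-count matrix and the frequency filter concrete, here is a minimal base R sketch on four made-up documents. It does not use `tm`; `docs`, `vocab`, `counts`, `toy.df` and `toy.new` are illustrative names.

```r
# Made-up, already cleaned documents.
docs <- c("free entry win", "ok see you", "win free prize", "free call")

# Tokenize on single spaces and collect the vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Count each vocabulary word per document: rows = documents, columns = words.
counts <- t(vapply(tokens,
                   function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                   integer(length(vocab))))
colnames(counts) <- vocab
toy.df <- as.data.frame(counts)

# Keep only words whose total count is at least 3, as in the step above.
toy.new <- toy.df[, colSums(toy.df) > 2, drop = FALSE]
print(names(toy.new))   # "free"
```

`drop = FALSE` keeps the result a data frame even when a single column survives the filter, which matters only on small inputs like this one.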

Split Data:

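# Prepare final data with DTM, NumberOfChar, NumberOfDigits as features
# Class is converted to a factor so that svm() performs classification and
# confusionMatrix() accepts it (strings are not auto-converted in R >= 4.0)

cleandata <- cbind("Class" = factor(rawdata$Class), NumberOfChar, NumberOfDigits, tdm.new)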

# Split Data into training (80%) and testing(20%) datasets

set.seed(1234)
inTrain <- createDataPartition(cleandata$Class,p=0.8,list=FALSE)
train <- cleandata[inTrain,]
test <- cleandata[-inTrain,]
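
For a factor outcome, `createDataPartition` draws the sample within each class, so the spam/ham ratio is roughly the same in the training and testing sets. The base R sketch below shows the same idea on made-up labels; it is not caret's implementation, and all names in it are illustrative.

```r
set.seed(1234)

# Made-up labels: 80 ham and 20 spam.
labels <- factor(rep(c("ham", "spam"), times = c(80, 20)))

# Sample 80% of the row indices separately within each class.
idx <- unname(unlist(lapply(split(seq_along(labels), labels),
                            function(i) sample(i, size = round(0.8 * length(i))))))

train.labels <- labels[idx]
test.labels  <- labels[-idx]

print(table(train.labels))   # 64 ham, 16 spam
print(table(test.labels))    # 16 ham,  4 spam
```

On the real data, `prop.table(table(train$Class))` and `prop.table(table(test$Class))` give a comparable check.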

Build SVM Models:

- As the final dataset contains 2499 numeric features and a binary output class, Support Vector Machines are a suitable choice.
- Linear, Polynomial, Radial Basis and Sigmoid SVMs are trained and then used to predict the Class on the test dataset.
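
For reference, the four kernels can be written out as plain R functions. This is a minimal sketch following the kernel definitions given in the e1071 `svm` documentation; the function names are hypothetical, and `gamma`, `coef0` and `degree` are passed explicitly here, whereas the models in this analysis rely on the package defaults.

```r
# Kernel functions for two numeric vectors u and v (illustrative helper names).
k.linear  <- function(u, v) sum(u * v)
k.poly    <- function(u, v, gamma, coef0 = 0, degree = 3) (gamma * sum(u * v) + coef0)^degree
k.radial  <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))
k.sigmoid <- function(u, v, gamma, coef0 = 0) tanh(gamma * sum(u * v) + coef0)

# Made-up two-dimensional points.
u <- c(1, 2)
v <- c(3, 4)
gamma <- 1 / length(u)   # mirrors the 1 / (number of features) default

k.lin <- k.linear(u, v)           # 11
k.pol <- k.poly(u, v, gamma)      # (0.5 * 11)^3 = 166.375
k.rad <- k.radial(u, v, gamma)    # exp(-0.5 * 8) = exp(-4)
k.sig <- k.sigmoid(u, v, gamma)   # tanh(5.5)

print(c(linear = k.lin, polynomial = k.pol, radial = k.rad, sigmoid = k.sig))
```

The polynomial, radial and sigmoid kernels all depend on `gamma`, so their behaviour is sensitive to the scale of the features.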

## Linear Kernel
svm.linear <- svm(Class~., data=train, scale=FALSE, kernel='linear')
pred.linear <- predict(svm.linear, test[,-1])
linear <- confusionMatrix(pred.linear,test$Class)

## Polynomial Kernel
svm.poly <- svm(Class~., data=train, scale=FALSE, kernel='polynomial')
pred.poly <- predict(svm.poly, test[,-1])
poly <- confusionMatrix(pred.poly,test$Class)

## Radial Basis Kernel
svm.radial <- svm(Class~., data=train, scale=FALSE, kernel='radial')
pred.radial <- predict(svm.radial,test[,-1])
radial <- confusionMatrix(pred.radial,test$Class)

## Sigmoid Kernel
svm.sigmoid <- svm(Class~., data=train, scale=FALSE, kernel='sigmoid')
pred.sigmoid <- predict(svm.sigmoid,test[,-1])
sigmoid <- confusionMatrix(pred.sigmoid,test$Class)

Accuracies

Kernels <- c("Linear","Polynomial","Radial Basis","Sigmoid")
Accuracies <- round(c(linear$overall[1],poly$overall[1],radial$overall[1],sigmoid$overall[1]),4)
acc <- cbind(Kernels,Accuracies)
kable(acc,row.names=FALSE)

Kernels	Accuracies
Linear	0.9847
Polynomial	0.9847
Radial Basis	0.9695
Sigmoid	0.7325

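Results:

The confusion matrix below has an accuracy of 0.9847, the value obtained above for the linear and polynomial kernels.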

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  956    8
##       spam   9  141
##                                           
##                Accuracy : 0.9847          
##                  95% CI : (0.9757, 0.9911)
##     No Information Rate : 0.8662          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9343          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9463          
##          Pos Pred Value : 0.9917          
##          Neg Pred Value : 0.9400          
##              Prevalence : 0.8662          
##          Detection Rate : 0.8582          
##    Detection Prevalence : 0.8654          
##       Balanced Accuracy : 0.9685          
##                                           
##        'Positive' Class : ham             
##
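
The headline statistics can be reproduced by hand from the four counts in the confusion matrix. The sketch below copies those counts and recomputes the values in base R; the variable names are illustrative, and "ham" is treated as the positive class, as in the output above.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference).
cm <- matrix(c(956, 9, 8, 141), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n                        # 0.9847
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])      # 0.9907 (ham correctly kept)
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])   # 0.9463 (spam correctly caught)

# Share of messages flagged as spam that really are spam (the Neg Pred Value above).
spam.precision <- cm["spam", "spam"] / sum(cm["spam", ])   # 0.94

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen.kappa <- (accuracy - expected) / (1 - expected)      # 0.9343

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, spam.precision = spam.precision,
              kappa = cohen.kappa), 4))
```

Since "ham" is the positive class, the specificity of 0.9463 is the fraction of spam messages that were caught, and the 9 ham messages predicted as spam are the false alarms.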