Synopsis:
This analysis builds a machine learning classifier to detect spam SMS messages. The SMS Spam Collection Data Set from the UCI Machine Learning Repository is used to train SVM models. The data contain a total of 5574 SMS messages, each with a corresponding label (spam/ham). The messages are processed into a numerical dataset on which Support Vector Machine models are built and validated.
Load Data:
# Load Libraries
library(tm)
library(plyr)
library(class)
library(caret)
library(e1071)
library(knitr)
# Read data
rawdata <- read.csv("SMSSpamCollection",sep="\t",header=FALSE,quote="",stringsAsFactors=FALSE)
names(rawdata) <- c("Class","Message")
Sample Data
kable(rawdata[1:8,])
|Class |Message |
|:-----|:-------|
|ham |Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat… |
|ham |Ok lar… Joking wif u oni… |
|spam |Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s |
|ham |U dun say so early hor… U c already then say… |
|ham |Nah I don’t think he goes to usf, he lives around here though |
|spam |FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv |
|ham |Even my brother is not like to speak with me. They treat me like aids patent. |
|ham |As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune |
Preprocessing:
- Because normal (ham) messages tend to contain fewer digits than spam messages, the total number of digits in each message is added to the feature set.
- Spam messages are also usually longer, as marketing campaigns make use of most of the available characters, so the number of characters in each message is added as a second feature.
- The text messages are then processed to build a document-term matrix, which contains the frequency of each word in each message (document).
- Words that appear fewer than three times across all messages are removed, as they are unlikely to have any significant influence on classification.
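The two hand-built features can be illustrated on a few toy messages. This is a minimal sketch using only base R; the data frame `toy` and its messages are made up for illustration and are not taken from the data set.

```r
# Toy messages (hypothetical, for illustration only)
toy <- data.frame(
  Class   = c("ham", "spam", "ham", "spam"),
  Message = c("See you at 5",
              "WIN 1000 cash now! Call 08001234567",
              "Ok lar",
              "Free entry, text 55555 to claim your 2 tickets today"),
  stringsAsFactors = FALSE
)

# Message length in characters
toy$NumberOfChar <- nchar(toy$Message)

# Number of digits: delete every non-digit, then count what is left
toy$NumberOfDigits <- nchar(gsub("[^0-9]", "", toy$Message))

# Mean of each feature per class
aggregate(cbind(NumberOfChar, NumberOfDigits) ~ Class, data = toy, FUN = mean)
```

Running the same `aggregate()` call on the real features is a quick way to check whether the two assumptions above hold for this data.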
# Find total number of characters in each SMS
NumberOfChar <- as.numeric(lapply(rawdata$Message,FUN=nchar))
# Find number of numeric digits in each SMS
number.digits <- function(vect) {
length(as.numeric(unlist(strsplit(gsub("[^0-9]", "", unlist(vect)), ""))))
}
NumberOfDigits <- as.numeric(lapply(rawdata$Message,FUN=number.digits))
# Function to clean text in the SMS
clean.text = function(x)
{
# tolower
x = tolower(x)
# remove punctuation
x = gsub("[[:punct:]]", "", x)
# remove numbers
x = gsub("[[:digit:]]", "", x)
# remove runs of two or more spaces or tabs (note: the neighbouring words are joined)
x = gsub("[ |\t]{2,}", "", x)
# remove blank spaces at the beginning
x = gsub("^ ", "", x)
# remove blank spaces at the end
x = gsub(" $", "", x)
# remove common words
x = removeWords(x,stopwords("en"))
return(x)
}
cleanText <- clean.text(rawdata$Message)
# Build Corpus
corpus <- Corpus(VectorSource(cleanText))
# Build Document-Term Matrix (rows = messages, columns = terms)
tdm <- DocumentTermMatrix(corpus)
# Convert the Document-Term Matrix to a data frame
tdm.df <- as.data.frame(data.matrix(tdm),stringsAsFactors=FALSE)
# Remove features with total frequency less than 3
tdm.new <- tdm.df[,colSums(tdm.df) > 2]
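The frequency filter is easier to follow on a tiny example. The sketch below builds a document-term count matrix by hand in base R (no `tm` dependency) and applies the same `colSums(...) > 2` rule; the `docs` vector and all object names are hypothetical.

```r
# Toy cleaned messages (hypothetical)
docs <- c("free prize call now", "call me later",
          "free call today", "see you later")

# Tokenise on whitespace and list the vocabulary
tokens <- strsplit(docs, " +")
terms  <- sort(unique(unlist(tokens)))

# Count each term per document: one row per document, one column per term
dtm <- t(vapply(tokens,
                function(tok) tabulate(match(tok, terms), nbins = length(terms)),
                integer(length(terms))))
colnames(dtm) <- terms
dtm.df <- as.data.frame(dtm)

# Keep only terms whose total count over all documents exceeds 2
dtm.kept <- dtm.df[, colSums(dtm.df) > 2, drop = FALSE]
names(dtm.kept)
```

Only `call` survives here, since it is the only term that occurs three times. `drop = FALSE` keeps the result a data frame even when a single column remains.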
Split Data:
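# Prepare final data with the document-term matrix, NumberOfChar and NumberOfDigits as features
# Class is made a factor explicitly: svm() needs a factor outcome for classification,
# and since R 4.0 data frames no longer convert strings to factors by default
cleandata <- cbind("Class" = factor(rawdata$Class), NumberOfChar, NumberOfDigits, tdm.new)
# Split data into training (80%) and testing (20%) datasets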
set.seed(1234)
inTrain <- createDataPartition(cleandata$Class,p=0.8,list=FALSE)
train <- cleandata[inTrain,]
test <- cleandata[-inTrain,]
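For a factor outcome, `createDataPartition` samples within each class so that both partitions keep roughly the original ham/spam ratio. The base R sketch below shows the idea of such a stratified split on toy labels; it is an illustration under that assumption, not caret's actual implementation, and `labels`, `idx.by.class` and `in.train` are hypothetical names.

```r
set.seed(1234)

# Toy labels with roughly the class imbalance of the SMS data (hypothetical)
labels <- factor(c(rep("ham", 87), rep("spam", 13)))

# Draw 80% of the row indices within each class separately
idx.by.class <- split(seq_along(labels), labels)
in.train <- sort(unlist(lapply(idx.by.class, function(idx) {
  sample(idx, size = ceiling(0.8 * length(idx)))
}), use.names = FALSE))

# Class proportions in each partition
prop.table(table(labels[in.train]))
prop.table(table(labels[-in.train]))
```

The same two `prop.table(table(...))` calls on `train$Class` and `test$Class` confirm that the real split preserved the class balance.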
Build SVM Models:
- The final dataset contains 2499 numeric features and a binary output class, a setting for which Support Vector Machines are well suited.
- Linear, Polynomial, Radial Basis and Sigmoid kernels are each used to train an SVM and predict the Class on the test dataset.
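The four kernels differ only in how they measure similarity between two feature vectors. The sketch below writes them out in base R following the kernel formulas given in the `e1071::svm` documentation, with that function's documented defaults (`degree = 3`, `coef0 = 0`, `gamma = 1 / number of features`); the vectors `u` and `v` and the helper names are hypothetical.

```r
# Two toy feature vectors (hypothetical values)
u <- c(1, 0, 2)
v <- c(0, 1, 1)

# Parameter defaults as documented for e1071::svm
gamma.par <- 1 / length(u)   # 1 / (number of features)
coef0     <- 0
degree    <- 3

# Kernel functions
k.linear  <- function(u, v) sum(u * v)
k.poly    <- function(u, v) (gamma.par * sum(u * v) + coef0)^degree
k.radial  <- function(u, v) exp(-gamma.par * sum((u - v)^2))
k.sigmoid <- function(u, v) tanh(gamma.par * sum(u * v) + coef0)

c(linear     = k.linear(u, v),
  polynomial = k.poly(u, v),
  radial     = k.radial(u, v),
  sigmoid    = k.sigmoid(u, v))
```

All models in this report are fitted with these defaults, since no `gamma`, `coef0` or `degree` argument is passed to `svm()`.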
## Linear Kernel
svm.linear <- svm(Class~., data=train, scale=FALSE, kernel='linear')
pred.linear <- predict(svm.linear, test[,-1])
linear <- confusionMatrix(pred.linear,test$Class)
## Polynomial Kernel
svm.poly <- svm(Class~., data=train, scale=FALSE, kernel='polynomial')
pred.poly <- predict(svm.poly, test[,-1])
poly <- confusionMatrix(pred.poly,test$Class)
## Radial Basis Kernel
svm.radial <- svm(Class~., data=train, scale=FALSE, kernel='radial')
pred.radial <- predict(svm.radial,test[,-1])
radial <- confusionMatrix(pred.radial,test$Class)
## Sigmoid Kernel
svm.sigmoid <- svm(Class~., data=train, scale=FALSE, kernel='sigmoid')
pred.sigmoid <- predict(svm.sigmoid,test[,-1])
sigmoid <- confusionMatrix(pred.sigmoid,test$Class)
Accuracies
Kernels <- c("Linear","Polynomial","Radial Basis","Sigmoid")
Accuracies <- round(c(linear$overall[1],poly$overall[1],radial$overall[1],sigmoid$overall[1]),4)
acc <- cbind(Kernels,Accuracies)
kable(acc,row.names=FALSE)
|Kernels |Accuracies |
|:-------|:----------|
|Linear |0.9847 |
|Polynomial |0.9847 |
|Radial Basis |0.9695 |
|Sigmoid |0.7325 |
Results:
## Confusion Matrix and Statistics
##
##           Reference
## Prediction ham spam
##       ham  956    8
##       spam   9  141
##
## Accuracy : 0.9847
## 95% CI : (0.9757, 0.9911)
## No Information Rate : 0.8662
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9343
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9907
## Specificity : 0.9463
## Pos Pred Value : 0.9917
## Neg Pred Value : 0.9400
## Prevalence : 0.8662
## Detection Rate : 0.8582
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.9685
##
## 'Positive' Class : ham
##
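The headline statistics can be recomputed by hand from the four cell counts, which makes clear what each one measures. This is a minimal base R sketch; the counts are copied from the confusion matrix above, and the variable names are hypothetical.

```r
# Confusion table from the output above (rows: prediction, columns: reference)
cm <- matrix(c(956, 9, 8, 141), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

n <- sum(cm)

# Share of messages classified correctly
accuracy <- sum(diag(cm)) / n

# "ham" is the positive class, so sensitivity is the share of ham recognised as ham
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])

# Specificity is the share of spam recognised as spam
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])

balanced <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa.stat <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, BalancedAccuracy = balanced,
        Kappa = kappa.stat), 4)
```

Because ham is the positive class, the figure of most practical interest for a spam filter is the specificity: 141 of the 149 spam messages in the test set were caught, while 9 of 965 ham messages were wrongly flagged.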