Spam/Ham

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

Loading required packages and libraries

library(kernlab)
library(knitr)
library(rpart)
library(e1071)
library(RTextTools)

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

Data Collection

The dataset is retrieved from the UCI Machine Learning Repository, where information about the dataset is also available.

We will use two different approaches for reading the data into R.

Approach 1

We will load the kernlab package. It includes the spam dataset, which can be used directly to build spam-detection models.

We use the R command data(spam) to load the data into R.

data(spam)

kable(head(spam))

make	address	all	our	over	remove	internet	order	mail	receive	will	people	report	addresses	free	business	email	you	credit	your	num000	money	num1999	direct	original	re	edu	charSemicolon	charRoundbracket	charExclamation	charDollar	charHash	capitalAve	capitalLong	capitalTotal	type
0.00	0.64	0.64	0.32	0.00	0.00	0.00	0.00	0.00	0.00	0.64	0.00	0.00	0.00	0.32	0.00	1.29	1.93	0.00	0.96	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.000	0.778	0.000	0.000	3.756	61	278	spam
0.21	0.28	0.50	0.14	0.28	0.21	0.07	0.00	0.94	0.21	0.79	0.65	0.21	0.14	0.14	0.07	0.28	3.47	0.00	1.59	0.43	0.43	0.07	0.00	0.00	0.00	0.00	0.00	0.132	0.372	0.180	0.048	5.114	101	1028	spam
0.06	0.00	0.71	1.23	0.19	0.19	0.12	0.64	0.25	0.38	0.45	0.12	0.00	1.75	0.06	0.06	1.03	1.36	0.32	0.51	1.16	0.06	0.00	0.06	0.12	0.06	0.06	0.01	0.143	0.276	0.184	0.010	9.821	485	2259	spam
0.00	0.00	0.00	0.63	0.00	0.31	0.63	0.31	0.63	0.31	0.31	0.31	0.00	0.00	0.31	0.00	0.00	3.18	0.00	0.31	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.137	0.137	0.000	0.000	3.537	40	191	spam
0.00	0.00	0.00	0.63	0.00	0.31	0.63	0.31	0.63	0.31	0.31	0.31	0.00	0.00	0.31	0.00	0.00	3.18	0.00	0.31	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.135	0.135	0.000	0.000	3.537	40	191	spam
0.00	0.00	0.00	1.85	0.00	0.00	1.85	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.223	0.000	0.000	0.000	3.000	15	54	spam

dim(spam)

## [1] 4601   58

Approach 2

The second approach is to save the data in a CSV file and read it into R. A second file, name_data, contains the names of the variables. After both files are read into R, the names from name_data are assigned as the column headers of data.

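# Read the data and the variable names; both files are semicolon-separated with no header row
dataset <- read.csv("data.csv", header = FALSE, sep = ";")
var_names <- read.csv("name_data.csv", header = FALSE, sep = ";")
# Use the first column of name_data as the column names of the data
names(dataset) <- as.character(var_names[, 1])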

kable(head(dataset))

make	address	all	our	over	remove	internet	order	mail	receive	will	people	report	addresses	free	business	email	you	credit	your	num000	money	num1999	direct	original	re	edu	charSemicolon	charRoundbracket	charExclamation	charDollar	charHash	capitalAve	capitalLong	capitalTotal	type
0.00	0.64	0.64	0.32	0.00	0.00	0.00	0.00	0.00	0.00	0.64	0.00	0.00	0.00	0.32	0.00	1.29	1.93	0.00	0.96	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.000	0.778	0.000	0.000	3.756	61	278	1
0.21	0.28	0.50	0.14	0.28	0.21	0.07	0.00	0.94	0.21	0.79	0.65	0.21	0.14	0.14	0.07	0.28	3.47	0.00	1.59	0.43	0.43	0.07	0.00	0.00	0.00	0.00	0.00	0.132	0.372	0.180	0.048	5.114	101	1028	1
0.06	0.00	0.71	1.23	0.19	0.19	0.12	0.64	0.25	0.38	0.45	0.12	0.00	1.75	0.06	0.06	1.03	1.36	0.32	0.51	1.16	0.06	0.00	0.06	0.12	0.06	0.06	0.01	0.143	0.276	0.184	0.010	9.821	485	2259	1
0.00	0.00	0.00	0.63	0.00	0.31	0.63	0.31	0.63	0.31	0.31	0.31	0.00	0.00	0.31	0.00	0.00	3.18	0.00	0.31	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.137	0.137	0.000	0.000	3.537	40	191	1
0.00	0.00	0.00	0.63	0.00	0.31	0.63	0.31	0.63	0.31	0.31	0.31	0.00	0.00	0.31	0.00	0.00	3.18	0.00	0.31	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.135	0.135	0.000	0.000	3.537	40	191	1
0.00	0.00	0.00	1.85	0.00	0.00	1.85	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.223	0.000	0.000	0.000	3.000	15	54	1

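The two approaches yield the same data, except that the CSV version codes type numerically (1 for spam) where kernlab uses the labels spam and nonspam.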
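As the output above shows, the CSV version stores type as a number (1 for the spam rows). Classification functions generally expect a factor response, so the label would need recoding before the CSV data could be used in place of the kernlab data. A minimal sketch, assuming 0 means nonspam and using a small made-up data frame (toy) rather than the real file:

```r
# Toy stand-in for the CSV data; the real file has 58 columns.
toy <- data.frame(
  make = c(0.00, 0.21, 0.00),
  type = c(1, 1, 0)  # assumed coding: 1 = spam, 0 = nonspam
)

# Recode the numeric label as a factor with the level names kernlab uses.
toy$type <- factor(toy$type, levels = c(0, 1), labels = c("nonspam", "spam"))

levels(toy$type)
table(toy$type)
```

Listing nonspam first keeps the level order the same as in the kernlab version, so row and column positions in later confusion tables would match.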

Splitting the data into training and test data sets

Once we have the spam data, we split it into training and test datasets. We randomly select 80% of the rows to train the models and hold out the remaining 20% to test them.

sub <- sample(nrow(spam), floor(nrow(spam) * 0.8))
train <- spam[sub, ]
test <- spam[-sub, ]



dim(train)

## [1] 3680   58

dim(test)

## [1] 921  58
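Because sample() draws a different subset on every run, the split and any accuracy computed from it will vary slightly between runs. A minimal sketch of a reproducible split, assuming an arbitrary seed (123) and a made-up data frame (toy) standing in for spam:

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

# Made-up data standing in for the spam data frame.
toy <- data.frame(x = rnorm(100), type = factor(rep(c("nonspam", "spam"), 50)))

# Draw 80% of the row indices for training; the rest form the test set.
idx <- sample(nrow(toy), floor(nrow(toy) * 0.8))
toy_train <- toy[idx, ]
toy_test <- toy[-idx, ]

dim(toy_train)
dim(toy_test)
```

Calling set.seed() once before sample() is enough; rerunning the script then selects the same rows each time.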

The training and test datasets are now used for creating the models.

SVM model

svm.model <- svm(type~., data = train)


# Predict the class of every e-mail in the test set
svm.pred <- predict(svm.model,test)


tab <-table(predicted = svm.pred, actual = test[,58])


# Precision and accuracy can be computed with the caret library, but I had problems with it even after it was installed, so the accuracy of each model is calculated manually here.

#TP = True Positive
#TN = True Negative
#FP = False Positive
#FN = False Negative

# Rows of tab are predictions and columns are actual classes; spam is the positive class
TP <-tab[2,2]
TN <-tab[1,1]

FN<-tab[1,2]
FP<-tab[2,1]

Accuracy_SVM<- (TN + TP)/(TN+TP+FN+FP)
Accuracy_SVM

## [1] 0.9381107

Recursive Partitioning

rpart.model <- rpart(type~., data =train)
rpart.pred <- predict(rpart.model,test, type = "class")
tab <-table(predicted=rpart.pred,actual=test[,58])
TP <-tab[2,2]
TN <-tab[1,1]

FN<-tab[1,2]
FP<-tab[2,1]

Accuracy_RecPart<- (TN + TP)/(TN+TP+FN+FP)
Accuracy_RecPart

## [1] 0.8957655

Random Forest

spam.rf<-randomForest(type~.,data=train)
rf.pred <- predict(spam.rf, test)
tab <-table(predicted=rf.pred,actual=test[,58])

tab

##          actual
## predicted nonspam spam
##   nonspam     559   33
##   spam         11  318

TP <-tab[2,2]
TN <-tab[1,1]

FN<-tab[1,2]
FP<-tab[2,1]

Accuracy_RF<- (TN + TP)/(TN+TP+FN+FP)
Accuracy_RF

## [1] 0.9522258
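Accuracy alone does not show which kind of error a model makes. As a sketch, the Random Forest table printed above can be re-entered by hand to compute precision and recall with spam as the positive class; the object name rf_tab is illustrative and not part of the original code.

```r
# Confusion table copied from the Random Forest output
# (rows = predicted, columns = actual); matrix() fills column by column.
rf_tab <- matrix(
  c(559, 11, 33, 318),
  nrow = 2,
  dimnames = list(predicted = c("nonspam", "spam"), actual = c("nonspam", "spam"))
)

TP <- rf_tab["spam", "spam"]        # spam predicted as spam
TN <- rf_tab["nonspam", "nonspam"]  # nonspam predicted as nonspam
FP <- rf_tab["spam", "nonspam"]     # nonspam predicted as spam
FN <- rf_tab["nonspam", "spam"]     # spam predicted as nonspam

accuracy <- (TP + TN) / sum(rf_tab)
precision <- TP / (TP + FP)  # share of predicted spam that really is spam
recall <- TP / (TP + FN)     # share of actual spam that was caught

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
```

For this table precision is about 0.967 and recall about 0.906, so the model misses spam more often than it flags legitimate mail. Indexing by name rather than by position makes it harder to swap the two error types.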

Comparing the accuracies of the three models

Accuracy <- data.frame(Accuracy_SVM, Accuracy_RecPart, Accuracy_RF)

kable(Accuracy)

Accuracy_SVM	Accuracy_RecPart	Accuracy_RF
0.9381107	0.8957655	0.9522258

Random Forest has the highest accuracy (0.952), followed by SVM (0.938) and Recursive Partitioning (0.896).
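These figures come from a single random 80/20 split, so small differences between models may not be stable. One common check is k-fold cross-validation. The sketch below is not part of the original analysis: it uses rpart on made-up data (toy), and the seed and the choice of 5 folds are arbitrary.

```r
library(rpart)  # recommended package, included with standard R installations

set.seed(1)  # arbitrary seed for illustration

# Made-up two-class data standing in for the spam data frame.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$type <- factor(ifelse(toy$x1 + toy$x2 > 0, "spam", "nonspam"))

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# Fit on k - 1 folds, score on the held-out fold, and repeat for every fold.
cv_acc <- sapply(1:k, function(i) {
  fit <- rpart(type ~ ., data = toy[fold != i, ])
  pred <- predict(fit, toy[fold == i, ], type = "class")
  mean(pred == toy$type[fold == i])
})

cv_acc
mean(cv_acc)
```

The spread of cv_acc across folds gives a rough sense of how much an accuracy estimate can move from one split to another; the same loop could wrap svm() or randomForest() in place of rpart().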
