It is often useful to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

Loading required packages and libraries

library(kernlab)
library(knitr)
library(rpart)
library(e1071)
library(RTextTools)
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
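
The library() calls above only succeed if the packages are already installed. A minimal sketch of a pre-flight check follows, assuming the package list mirrors the calls above; the helper name missing_packages is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Package list assumed from the library() calls in this report.
required <- c("kernlab", "knitr", "rpart", "e1071",
              "RTextTools", "randomForest", "ROCR")
to_install <- missing_packages(required)

# Install only what is missing (commented out so the sketch has no side effects).
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

An empty result means every package can be loaded; otherwise the printed names are the ones to install first.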

Data Collection

The dataset is retrieved from the UCI Machine Learning Repository. Information about the dataset is available from the repository.

We will use two different approaches for reading the data into R.

Approach 1

We will install the kernlab package. It contains the spam dataset, which can be used directly for building spam-detection models.

We will use the R command data(spam) to load the data into R.

data(spam)

kable(head(spam))
make address all num3d our over remove internet order mail receive will people report addresses free business email you credit your font num000 money hp hpl george num650 lab labs telnet num857 data num415 num85 technology num1999 parts pm direct cs meeting original project re edu table conference charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0.00 0.64 0.64 0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00 0.00 0.00 0.32 0.00 1.29 1.93 0.00 0.96 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.000 0 0.778 0.000 0.000 3.756 61 278 spam
0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79 0.65 0.21 0.14 0.14 0.07 0.28 3.47 0.00 1.59 0 0.43 0.43 0 0 0 0 0 0 0 0 0 0 0 0 0.07 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.132 0 0.372 0.180 0.048 5.114 101 1028 spam
0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45 0.12 0.00 1.75 0.06 0.06 1.03 1.36 0.32 0.51 0 1.16 0.06 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.06 0 0 0.12 0 0.06 0.06 0 0 0.01 0.143 0 0.276 0.184 0.010 9.821 485 2259 spam
0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.137 0 0.137 0.000 0.000 3.537 40 191 spam
0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.135 0 0.135 0.000 0.000 3.537 40 191 spam
0.00 0.00 0.00 0 1.85 0.00 0.00 1.85 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.223 0 0.000 0.000 0.000 3.000 15 54 spam
dim(spam)
## [1] 4601   58

Approach 2

The second approach is to copy the data, save it in a CSV file, and read it into R. A second file, name_data, contains the names of the variables. After both files are read into R, the names in name_data are assigned as the column names of data.

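# Read the raw data and the variable names (neither file has a header row)
dataset <- read.csv("data.csv", header = FALSE, sep = ";")
var_names <- read.csv("name_data.csv", header = FALSE, sep = ";")
# Use the first column of var_names as the column names of dataset
names(dataset) <- as.character(var_names[, 1])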

kable(head(dataset))
make address all num3d our over remove internet order mail receive will people report addresses free business email you credit your font num000 money hp hpl george num650 lab labs telnet num857 data num415 num85 technology num1999 parts pm direct cs meeting original project re edu table conference charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0.00 0.64 0.64 0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00 0.00 0.00 0.32 0.00 1.29 1.93 0.00 0.96 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.000 0 0.778 0.000 0.000 3.756 61 278 1
0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79 0.65 0.21 0.14 0.14 0.07 0.28 3.47 0.00 1.59 0 0.43 0.43 0 0 0 0 0 0 0 0 0 0 0 0 0.07 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.132 0 0.372 0.180 0.048 5.114 101 1028 1
0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45 0.12 0.00 1.75 0.06 0.06 1.03 1.36 0.32 0.51 0 1.16 0.06 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.06 0 0 0.12 0 0.06 0.06 0 0 0.01 0.143 0 0.276 0.184 0.010 9.821 485 2259 1
0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.137 0 0.137 0.000 0.000 3.537 40 191 1
0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.135 0 0.135 0.000 0.000 3.537 40 191 1
0.00 0.00 0.00 0 1.85 0.00 0.00 1.85 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.223 0 0.000 0.000 0.000 3.000 15 54 1

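The two approaches produce the same data. In the rows shown, the only visible difference is that the CSV file codes type as 1 where the kernlab data uses the label spam.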
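
The Approach 2 output shows type as 1 where the kernlab data shows spam, so the column has to be recoded before the two data frames can be compared directly. A minimal sketch follows, assuming the CSV codes spam as 1 and non-spam as 0 (only the value 1 is visible in the rows above); the helper recode_type and the toy vector are hypothetical.

```r
# Assumption: in the CSV, type is coded 1 for spam and 0 for non-spam.
recode_type <- function(x) {
  factor(x, levels = c(0, 1), labels = c("nonspam", "spam"))
}

# Toy stand-in for dataset$type (made-up values, not real data).
toy_type <- c(1, 1, 0, 1, 0)
recoded <- recode_type(toy_type)
print(table(recoded))

# With both data frames loaded, one could then check that they agree:
# dataset$type <- recode_type(dataset$type)
# all.equal(dataset, spam, check.attributes = FALSE)
```

Using the same factor levels as the kernlab data keeps the confusion tables later in the report labeled consistently.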

Splitting the data into training and test data sets

Once we have the data for spam emails, we will split that dataset into training and test datasets. We will randomly select 80% of the rows to train the models and hold out the remaining 20% as the test dataset.

sub <- sample(nrow(spam), floor(nrow(spam) * 0.8))
train <- spam[sub, ]
test <- spam[-sub, ]



dim(train)
## [1] 3680   58
dim(test)
## [1] 921  58
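
Because sample() draws different rows on every run, the split and every accuracy that depends on it will vary between runs. A minimal sketch of a repeatable split follows, assuming a fixed seed is acceptable; the helper split_train_test, the seed value, and the toy data frame are hypothetical and were not used to produce the results in this report.

```r
# Hypothetical helper: split a data frame into train/test parts.
split_train_test <- function(df, frac = 0.8, seed = 123) {
  set.seed(seed)  # fix the random number generator so the split is repeatable
  idx <- sample(nrow(df), floor(nrow(df) * frac))
  list(train = df[idx, , drop = FALSE], test = df[-idx, , drop = FALSE])
}

# Toy data frame with the same number of rows as spam (4601); values are made up.
toy <- data.frame(x = seq_len(4601),
                  type = rep(c("spam", "nonspam"), length.out = 4601))
parts <- split_train_test(toy)
print(dim(parts$train))
print(dim(parts$test))
```

With 4601 rows, floor(4601 * 0.8) gives 3680 training rows and leaves 921 test rows, which agrees with the dimensions printed above.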

The training and test datasets are now used for creating the models.

SVM model

svm.model <- svm(type~., data = train)


# Predict on the 57 predictor columns of the test set (column 58 is type)
svm.pred <- predict(svm.model, test[, 1:57])


tab <-table(predicted = svm.pred, actual = test[,58])


# The caret library can compute precision and accuracy, but I had problems with it even after installing it, so I calculate the accuracy of the models manually from the confusion table.

#TP = True Positive
#TN = True Negative
#FP = False Positive
#FN = False Negative

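TP <- tab[2,2]  # predicted spam, actual spam
TN <- tab[1,1]  # predicted nonspam, actual nonspam

FP <- tab[2,1]  # predicted spam, actual nonspam
FN <- tab[1,2]  # predicted nonspam, actual spam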

Accuracy_SVM<- (TN + TP)/(TN+TP+FN+FP)
Accuracy_SVM
## [1] 0.9381107
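
Indexing the confusion table by position makes it easy to mix up the false positive and false negative cells. A minimal sketch that indexes by label instead follows, assuming the table is built as table(predicted = ..., actual = ...) with levels nonspam and spam, and that spam is the positive class; the helper confusion_metrics and the toy predictions are hypothetical.

```r
# Hypothetical helper: metrics from a table built as table(predicted = ..., actual = ...).
confusion_metrics <- function(tab, positive = "spam", negative = "nonspam") {
  TP <- tab[positive, positive]  # predicted spam, actual spam
  TN <- tab[negative, negative]  # predicted nonspam, actual nonspam
  FP <- tab[positive, negative]  # predicted spam, actual nonspam
  FN <- tab[negative, positive]  # predicted nonspam, actual spam
  c(accuracy  = (TP + TN) / sum(tab),
    precision = TP / (TP + FP),
    recall    = TP / (TP + FN))
}

# Toy predictions (made up, not taken from the models in this report).
lv <- c("nonspam", "spam")
predicted <- factor(c("spam", "spam", "nonspam", "nonspam", "spam", "nonspam"), levels = lv)
actual    <- factor(c("spam", "nonspam", "nonspam", "spam", "spam", "nonspam"), levels = lv)
m <- confusion_metrics(table(predicted = predicted, actual = actual))
print(m)
```

Accuracy is symmetric in FP and FN, so it is unaffected by which cell gets which name, but precision and recall are not.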

Recursive Partitioning

rpart.model <- rpart(type~., data =train)
rpart.pred <- predict(rpart.model,test, type = "class")
tab <-table(predicted=rpart.pred,actual=test[,58])
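TP <- tab[2,2]  # predicted spam, actual spam
TN <- tab[1,1]  # predicted nonspam, actual nonspam

FP <- tab[2,1]  # predicted spam, actual nonspam
FN <- tab[1,2]  # predicted nonspam, actual spam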

Accuracy_RecPart<- (TN + TP)/(TN+TP+FN+FP)
Accuracy_RecPart
## [1] 0.8957655

Random Forest

spam.rf<-randomForest(type~.,data=train)
rf.pred <- predict(spam.rf, test)
tab <-table(predicted=rf.pred,actual=test[,58])

tab
##          actual
## predicted nonspam spam
##   nonspam     559   33
##   spam         11  318
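TP <- tab[2,2]  # predicted spam, actual spam
TN <- tab[1,1]  # predicted nonspam, actual nonspam

FP <- tab[2,1]  # predicted spam, actual nonspam
FN <- tab[1,2]  # predicted nonspam, actual spam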

Accuracy_RF<- (TN + TP)/(TN+TP+FN+FP)
Accuracy_RF
## [1] 0.9522258
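
As a worked check, the counts in the Random Forest table above reproduce the reported accuracy. Taking spam as the positive class (an assumption; the report does not name one), the table gives TP = 318, TN = 559, FP = 11 and FN = 33. Precision and recall are not computed in the original analysis and are shown only for illustration.

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
                = \frac{318 + 559}{318 + 559 + 11 + 33}
                = \frac{877}{921} \approx 0.9522
\]
\[
\text{Precision} = \frac{TP}{TP + FP} = \frac{318}{329} \approx 0.9666,
\qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{318}{351} \approx 0.9060
\]
```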

Comparing the accuracies of the three models

Accuracy <- data.frame(Accuracy_SVM, Accuracy_RecPart, Accuracy_RF)

kable(Accuracy)
Accuracy_SVM Accuracy_RecPart Accuracy_RF
0.9381107 0.8957655 0.9522258

Random Forest has the highest accuracy (0.952), followed by SVM (0.938) and Recursive Partitioning (0.896).
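
The ranking can also be produced programmatically rather than read off the table. A small sketch follows, using the three accuracies reported above for this particular train/test split; the short model labels are chosen here for illustration.

```r
# Accuracies as reported above for this particular train/test split.
accuracy <- c(SVM = 0.9381107, RecPart = 0.8957655, RF = 0.9522258)

# Sort from best to worst.
ranked <- sort(accuracy, decreasing = TRUE)
print(round(ranked, 3))

# Gap between the best and worst model, in percentage points.
gap <- 100 * (max(accuracy) - min(accuracy))
print(round(gap, 1))
```

Since the split is random, a different draw of training rows could change these values.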