For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
library(kernlab)
library(knitr)
library(rpart)
library(e1071)
library(RTextTools)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
The dataset is retrieved from the UCI Machine Learning Repository. Information about the dataset can be accessed using the link
We will use two different approaches for reading the data into R.
The first approach uses the kernlab package, which ships the spam dataset in a form that is ready to use for building spam detection models.
We use the R command data(spam) to load the data.
data(spam)
kable(head(spam))
| make | address | all | num3d | our | over | remove | internet | order | mail | receive | will | people | report | addresses | free | business | email | you | credit | your | font | num000 | money | hp | hpl | george | num650 | lab | labs | telnet | num857 | data | num415 | num85 | technology | num1999 | parts | pm | direct | cs | meeting | original | project | re | edu | table | conference | charSemicolon | charRoundbracket | charSquarebracket | charExclamation | charDollar | charHash | capitalAve | capitalLong | capitalTotal | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.00 | 0.64 | 0.64 | 0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.64 | 0.00 | 0.00 | 0.00 | 0.32 | 0.00 | 1.29 | 1.93 | 0.00 | 0.96 | 0 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.000 | 0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | spam |
| 0.21 | 0.28 | 0.50 | 0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | 0.21 | 0.79 | 0.65 | 0.21 | 0.14 | 0.14 | 0.07 | 0.28 | 3.47 | 0.00 | 1.59 | 0 | 0.43 | 0.43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.132 | 0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | spam |
| 0.06 | 0.00 | 0.71 | 0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | 0.38 | 0.45 | 0.12 | 0.00 | 1.75 | 0.06 | 0.06 | 1.03 | 1.36 | 0.32 | 0.51 | 0 | 1.16 | 0.06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0.06 | 0 | 0 | 0.12 | 0 | 0.06 | 0.06 | 0 | 0 | 0.01 | 0.143 | 0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | spam |
| 0.00 | 0.00 | 0.00 | 0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | 0.31 | 0.31 | 0.31 | 0.00 | 0.00 | 0.31 | 0.00 | 0.00 | 3.18 | 0.00 | 0.31 | 0 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.137 | 0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | spam |
| 0.00 | 0.00 | 0.00 | 0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | 0.31 | 0.31 | 0.31 | 0.00 | 0.00 | 0.31 | 0.00 | 0.00 | 3.18 | 0.00 | 0.31 | 0 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.135 | 0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | spam |
| 0.00 | 0.00 | 0.00 | 0 | 1.85 | 0.00 | 0.00 | 1.85 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.223 | 0 | 0.000 | 0.000 | 0.000 | 3.000 | 15 | 54 | spam |
dim(spam)
## [1] 4601 58
The second approach is to copy the data into a CSV file and read that file into R. A second file, name_data, contains the names of the variables. After both files are read into R, the entries of name_data are assigned as the column names of data.
#### Approach 2
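# Read the raw data and the variable names (both semicolon-separated, no header row)
dataset <- read.csv("data.csv", header = FALSE, sep = ";")
var_names <- read.csv("name_data.csv", header = FALSE, sep = ";")
# Use the first column of the names file as the column names of the data
names(dataset) <- as.character(var_names[, 1])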
kable(head(dataset))
| make | address | all | num3d | our | over | remove | internet | order | mail | receive | will | people | report | addresses | free | business | email | you | credit | your | font | num000 | money | hp | hpl | george | num650 | lab | labs | telnet | num857 | data | num415 | num85 | technology | num1999 | parts | pm | direct | cs | meeting | original | project | re | edu | table | conference | charSemicolon | charRoundbracket | charSquarebracket | charExclamation | charDollar | charHash | capitalAve | capitalLong | capitalTotal | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.00 | 0.64 | 0.64 | 0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.64 | 0.00 | 0.00 | 0.00 | 0.32 | 0.00 | 1.29 | 1.93 | 0.00 | 0.96 | 0 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.000 | 0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | 1 |
| 0.21 | 0.28 | 0.50 | 0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | 0.21 | 0.79 | 0.65 | 0.21 | 0.14 | 0.14 | 0.07 | 0.28 | 3.47 | 0.00 | 1.59 | 0 | 0.43 | 0.43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.132 | 0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 0.06 | 0.00 | 0.71 | 0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | 0.38 | 0.45 | 0.12 | 0.00 | 1.75 | 0.06 | 0.06 | 1.03 | 1.36 | 0.32 | 0.51 | 0 | 1.16 | 0.06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0.06 | 0 | 0 | 0.12 | 0 | 0.06 | 0.06 | 0 | 0 | 0.01 | 0.143 | 0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | 1 |
| 0.00 | 0.00 | 0.00 | 0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | 0.31 | 0.31 | 0.31 | 0.00 | 0.00 | 0.31 | 0.00 | 0.00 | 3.18 | 0.00 | 0.31 | 0 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.137 | 0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| 0.00 | 0.00 | 0.00 | 0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | 0.31 | 0.31 | 0.31 | 0.00 | 0.00 | 0.31 | 0.00 | 0.00 | 3.18 | 0.00 | 0.31 | 0 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.135 | 0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| 0.00 | 0.00 | 0.00 | 0 | 1.85 | 0.00 | 0.00 | 1.85 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.223 | 0 | 0.000 | 0.000 | 0.000 | 3.000 | 15 | 54 | 1 |
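The two approaches produce the same data. The only visible difference is the type column, which holds the label spam in the kernlab version and the number 1 in the CSV version.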
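To use the CSV copy with the classifiers that follow, its numeric type column would likely need to be recoded as a factor so that the models treat the task as classification. The following is a minimal sketch on a tiny in-memory stand-in for data.csv and name_data.csv; the three toy columns and the coding 0 = nonspam are assumptions for illustration and are not taken from the files themselves.

```r
# Toy stand-ins for data.csv and name_data.csv (hypothetical contents).
data_txt  <- "0.00;0.64;1\n0.21;0.28;1\n0.10;0.00;0"
names_txt <- "make\naddress\ntype"

dataset   <- read.csv(text = data_txt, header = FALSE, sep = ";")
var_names <- read.csv(text = names_txt, header = FALSE, sep = ";",
                      stringsAsFactors = FALSE)

# Assign the first column of the names file as the column names.
names(dataset) <- as.character(var_names[[1]])

# Recode the numeric label as a factor (assumes 1 = spam, 0 = nonspam).
dataset$type <- factor(dataset$type, levels = c(0, 1),
                       labels = c("nonspam", "spam"))

str(dataset)
```

With type stored as a factor, a formula such as type~. should lead the model functions to fit a classifier rather than a regression.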
Once we have the spam data, we split it into training and test datasets. We randomly select 80% of the rows to train the models and hold out the remaining 20% to test them.
sub <- sample(nrow(spam), floor(nrow(spam) * 0.8))
train <- spam[sub, ]
test <- spam[-sub, ]
dim(train)
## [1] 3680 58
dim(test)
## [1] 921 58
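Because sample() draws a different subset on every run, the accuracies reported for the models will vary slightly from run to run. A minimal sketch of a repeatable 80/20 split follows, assuming a fixed seed; the seed value 123 and the toy data frame are arbitrary illustrations and were not part of the original analysis.

```r
set.seed(123)  # fix the random number generator so the split is repeatable

# Toy data frame standing in for spam (hypothetical values).
toy <- data.frame(x = rnorm(100),
                  type = factor(rep(c("nonspam", "spam"), 50)))

# Draw 80% of the row indices for training; the rest form the test set.
idx <- sample(nrow(toy), floor(nrow(toy) * 0.8))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

dim(toy_train)
dim(toy_test)
```

Calling set.seed() with the same value before sample() reproduces the same rows in each set, which makes model comparisons easier to repeat.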
The training and test datasets are now used for creating the models.
svm.model <- svm(type~., data = train)
# Predict the class of every email in the test set
svm.pred <- predict(svm.model, test)
tab <-table(predicted = svm.pred, actual = test[,58])
# Precision and accuracy can also be computed with the caret library. I had problems with caret even after it was installed, so I calculate the accuracy of the models manually.
# Rows of tab are predictions and columns are actual classes; spam is the positive class.
#TP = True Positive
#TN = True Negative
#FP = False Positive
#FN = False Negative
TP <- tab[2,2]
TN <- tab[1,1]
FP <- tab[2,1]
FN <- tab[1,2]
Accuracy_SVM<- (TN + TP)/(TN+TP+FN+FP)
Accuracy_SVM
## [1] 0.9381107
rpart.model <- rpart(type~., data =train)
rpart.pred <- predict(rpart.model,test, type = "class")
tab <-table(predicted=rpart.pred,actual=test[,58])
# Rows of tab are predictions and columns are actual classes; spam is the positive class.
TP <- tab[2,2]
TN <- tab[1,1]
FP <- tab[2,1]
FN <- tab[1,2]
Accuracy_RecPart<- (TN + TP)/(TN+TP+FN+FP)
Accuracy_RecPart
## [1] 0.8957655
spam.rf<-randomForest(type~.,data=train)
rf.pred <- predict(spam.rf, test)
tab <-table(predicted=rf.pred,actual=test[,58])
tab
## actual
## predicted nonspam spam
## nonspam 559 33
## spam 11 318
# Rows of tab are predictions and columns are actual classes; spam is the positive class.
TP <- tab[2,2]
TN <- tab[1,1]
FP <- tab[2,1]
FN <- tab[1,2]
Accuracy_RF<- (TN + TP)/(TN+TP+FN+FP)
Accuracy_RF
## [1] 0.9522258
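The manual calculation is repeated for each model, so it could be wrapped in a small helper that also returns precision and recall. The sketch below is one possible way to do this; the function name conf_metrics and its positive argument are hypothetical, and the matrix simply re-enters the random forest confusion matrix printed above.

```r
# Confusion-matrix helper: rows are predictions, columns are actual classes.
# 'positive' names the class treated as the positive label.
conf_metrics <- function(tab, positive = "spam") {
  negative <- setdiff(rownames(tab), positive)
  TP <- tab[positive, positive]
  TN <- tab[negative, negative]
  FP <- tab[positive, negative]   # predicted spam, actually nonspam
  FN <- tab[negative, positive]   # predicted nonspam, actually spam
  c(accuracy  = (TP + TN) / sum(tab),
    precision = TP / (TP + FP),
    recall    = TP / (TP + FN))
}

# Random forest confusion matrix as printed above (filled column by column).
rf_tab <- matrix(c(559, 11, 33, 318), nrow = 2,
                 dimnames = list(predicted = c("nonspam", "spam"),
                                 actual    = c("nonspam", "spam")))

round(conf_metrics(rf_tab), 4)
```

For this table the helper gives the same accuracy as the manual calculation (877/921), a precision of 318/329 (about 0.967), and a recall of 318/351 (about 0.906). It should also accept the tab objects created with table() above, since they carry the same row and column names.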
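Comparing the accuracies of the three models, the random forest performs best on this test split (0.952), followed by the SVM (0.938) and recursive partitioning (0.896):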
Accuracy <- data.frame(Accuracy_SVM, Accuracy_RecPart, Accuracy_RF)
kable(Accuracy)
| Accuracy_SVM | Accuracy_RecPart | Accuracy_RF |
|---|---|---|
| 0.9381107 | 0.8957655 | 0.9522258 |