The goal of this project is to compare three supervised machine learning algorithms that could be used as a spam filter. Essentially, we want to see which algorithm best predicts whether an email is spam, given a set of features extracted from it.
The algorithms are trained using methods from the caret package.
The spambase dataset contains word and character frequencies (plus capital-letter run lengths) measured in emails. Each email is labelled as spam (1) or not spam (0). The spam column (V58) is turned into a factor since this is a binary classification problem.
spam_data = read.csv(file = "spambase.data", header=FALSE)
colnames(spam_data)[58] = "spam" # change label of last column for convenience
spam_data$spam = as.factor(spam_data$spam)
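As a small illustration of the loading step, the sketch below reads a tiny inline data set the same way (no header, label in the last column). The three toy rows and the level names "not_spam" and "spam" are my own assumptions for illustration; the analysis here keeps the original 0/1 levels.

```r
# Toy stand-in for spambase.data: two feature columns and a 0/1 label.
raw <- "0.10,0.00,1\n0.00,0.30,0\n0.20,0.10,1"
toy <- read.csv(text = raw, header = FALSE)

# With header = FALSE, R names the columns V1, V2, ...
default_names <- colnames(toy)

# Rename the last column and convert it to a factor with explicit labels.
colnames(toy)[ncol(toy)] <- "spam"
toy$spam <- factor(toy$spam, levels = c(0, 1), labels = c("not_spam", "spam"))

print(default_names)
print(table(toy$spam))
```

Naming the levels explicitly is optional, but it makes later output easier to read than bare 0 and 1.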
To keep the run time manageable, we take a random sample of 750 rows from the data set and split it into training and testing sets using createDataPartition(). 70% of the sample is used for training and the remaining 30% is withheld for testing.
library(caret)
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1234)
spam_sample = spam_data[sample(nrow(spam_data), 750), ]
training_index = createDataPartition(spam_sample$spam, p=.7, list=FALSE)
training_set = spam_sample[training_index, ]
testing_set = spam_sample[-training_index, ]
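The point of a class-aware split is that both sets keep roughly the same proportion of spam. The base R sketch below shows the idea on made-up labels by sampling 70% within each class. It is only an approximation of what createDataPartition() does, and caret's internals may differ in detail.

```r
set.seed(1234)

# Hypothetical labels: 60 non-spam (0) and 40 spam (1).
labels <- factor(c(rep(0, 60), rep(1, 40)))

# Sample 70% of the row indices separately within each class.
by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE)

# Class proportions are preserved in both parts of the split.
print(prop.table(table(labels[train_idx])))
print(prop.table(table(labels[-train_idx])))
```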
After some research I decided to test Random Forest, Linear Discriminant Analysis and Neural Networks. These algorithms required the least tuning and are suited to predicting a categorical outcome.
These algorithms come from their respective packages and are called through caret’s train() function. The training control method is repeated cross-validation: 10-fold cross-validation is repeated 10 times, which gives a more stable estimate of each model's accuracy when caret selects tuning parameters.
library(randomForest) # randomForest method
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(MASS) # linear discriminant analysis
library(nnet) # neural network
# set up the training control
tr_ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10)
#train the models
forest_train = train(spam ~ .,
                     data = training_set,
                     method = "rf",
                     trControl = tr_ctrl)
lda_train = train(spam ~ .,
                  data = training_set,
                  method = "lda",
                  trControl = tr_ctrl)
nnet_train = train(spam ~ .,
                   data = training_set,
                   method = "nnet",
                   trControl = tr_ctrl)
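To picture what the repeated cross-validation setting asks for, here is a rough base R sketch that builds fold assignments for 10 repeats of 10-fold cross-validation. The row count of 50 is made up, and the folds are not stratified by class, so this is a simplification of what caret actually does.

```r
set.seed(1234)

n <- 50        # hypothetical number of training rows
k <- 10        # folds per repeat
repeats <- 10  # number of independent repeats

# For each repeat, shuffle a balanced vector of fold labels 1..k.
folds <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Every row is held out exactly once per repeat, so each candidate
# parameter setting is fitted and evaluated k * repeats times.
total_fits <- k * repeats
print(table(folds[[1]]))
print(total_fits)
```

Averaging over 100 held-out evaluations rather than 10 reduces the noise in the accuracy estimate, at the cost of a longer training time.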
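Now that the models are trained, we test them against the testing set. confusionMatrix() displays the results and relevant statistics for each model. Note that caret reports 0 (not spam) as the 'Positive' class in the output below, so sensitivity is the rate at which legitimate email is correctly identified and specificity is the rate at which spam is caught.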
pred_forest = predict(forest_train, testing_set[,-58])
pred_lda = predict(lda_train, testing_set[,-58])
pred_nnet = predict(nnet_train, testing_set[,-58])
confusionMatrix(pred_forest,testing_set$spam)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0 133   7
##          1   6  78
##
## Accuracy : 0.942
## 95% CI : (0.9028, 0.9687)
## No Information Rate : 0.6205
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8765
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9568
## Specificity : 0.9176
## Pos Pred Value : 0.9500
## Neg Pred Value : 0.9286
## Prevalence : 0.6205
## Detection Rate : 0.5938
## Detection Prevalence : 0.6250
## Balanced Accuracy : 0.9372
##
## 'Positive' Class : 0
##
confusionMatrix(pred_lda,testing_set$spam)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0 130  17
##          1   9  68
##
## Accuracy : 0.8839
## 95% CI : (0.8346, 0.9228)
## No Information Rate : 0.6205
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7489
## Mcnemar's Test P-Value : 0.1698
##
## Sensitivity : 0.9353
## Specificity : 0.8000
## Pos Pred Value : 0.8844
## Neg Pred Value : 0.8831
## Prevalence : 0.6205
## Detection Rate : 0.5804
## Detection Prevalence : 0.6562
## Balanced Accuracy : 0.8676
##
## 'Positive' Class : 0
##
confusionMatrix(pred_nnet,testing_set$spam)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0 127   9
##          1  12  76
##
## Accuracy : 0.9062
## 95% CI : (0.8603, 0.941)
## No Information Rate : 0.6205
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8023
## Mcnemar's Test P-Value : 0.6625
##
## Sensitivity : 0.9137
## Specificity : 0.8941
## Pos Pred Value : 0.9338
## Neg Pred Value : 0.8636
## Prevalence : 0.6205
## Detection Rate : 0.5670
## Detection Prevalence : 0.6071
## Balanced Accuracy : 0.9039
##
## 'Positive' Class : 0
##
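The headline statistics can be recomputed by hand from the counts, which helps in reading the output. The sketch below uses the Random Forest counts reported above, with class 0 (not spam) as the positive class, matching the output. The variable names are my own.

```r
# Random Forest confusion matrix: rows are predictions, columns are reference.
cm <- matrix(c(133, 6, 7, 78), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # correct / all
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # non-spam correctly kept
specificity <- cm["1", "1"] / sum(cm[, "1"])    # spam correctly caught

# Of the emails flagged as spam, the share that really is spam.
spam_precision <- cm["1", "1"] / sum(cm["1", ])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity,
              spam_precision = spam_precision), 4))
```

For a spam filter it may be more natural to treat spam as the positive class; confusionMatrix() accepts a `positive` argument for that purpose.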
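With this sample, Random Forest performed best, with 94% accuracy on the testing set and a 95% confidence interval of roughly 90% to 97%. LDA (88%) and Neural Networks (91%) scored somewhat lower, though their confidence intervals overlap with Random Forest's, so the ranking might not hold on a different sample. With fine tuning, their results would most likely improve.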
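The width of that interval follows from the size of the testing set. As a rough check, an exact binomial interval can be computed from the Random Forest counts (211 correct out of 224). I am assuming here that this is close to how caret derives its interval, and have not verified it against the package source.

```r
correct <- 133 + 78   # correctly classified test emails (Random Forest)
total   <- 224        # size of the testing set

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy.
ci <- binom.test(correct, total)$conf.int
print(round(ci, 4))
```

With only 224 test emails the interval spans several percentage points, which is why small differences between models should be read with caution.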