For this project, the task is to predict the class of new documents, either documents withheld from the training dataset (“spambase”) or documents from another source, such as your own spam folder.

To do this, I will demonstrate document classification using a supervised machine learning technique, the Support Vector Machine (SVM), evaluated on data withheld from the original dataset.

library(nutshell) # spambase dataset available in clean format in nutshell package
library(e1071) # provides svm() and classAgreement()

SVMs are supervised learning methods that can be used for both classification and regression. Because the method learns from labelled examples, we will split the labelled spambase data into a training set for fitting the model and a testing set for evaluating it. An SVM partitions the feature space into non-overlapping regions, typically using all attributes of the dataset, and it does so in a single pass, producing flat, linear decision boundaries. The method is based on the maximum-margin linear discriminant: the separating hyperplane that lies as far as possible from the nearest training points of each class.
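Before applying this to spambase, here is a minimal sketch of the svm() interface from e1071, using R's built-in iris data and a linear kernel purely for illustration; the spambase analysis below follows the same fit-then-predict pattern.

# illustration only: fit a maximum-margin (linear-kernel) SVM on two iris species
toy <- droplevels(subset(iris, Species != "virginica")) # keep two linearly separable classes
toy_fit <- svm(Species ~ ., data = toy, kernel = "linear")
table(pred = predict(toy_fit, toy), true = toy$Species) # the two species separate cleanly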

data(spambase) # load the data from the nutshell package
colnames(spambase) # check which column the dependent variable is in ("is_spam")
##  [1] "word_freq_make"             "word_freq_address"         
##  [3] "word_freq_all"              "word_freq_3d"              
##  [5] "word_freq_our"              "word_freq_over"            
##  [7] "word_freq_remove"           "word_freq_internet"        
##  [9] "word_freq_order"            "word_freq_mail"            
## [11] "word_freq_receive"          "word_freq_will"            
## [13] "word_freq_people"           "word_freq_report"          
## [15] "word_freq_addresses"        "word_freq_free"            
## [17] "word_freq_business"         "word_freq_email"           
## [19] "word_freq_you"              "word_freq_credit"          
## [21] "word_freq_your"             "word_freq_font"            
## [23] "word_freq_000"              "word_freq_money"           
## [25] "word_freq_hp"               "word_freq_hpl"             
## [27] "word_freq_george"           "word_freq_650"             
## [29] "word_freq_lab"              "word_freq_labs"            
## [31] "word_freq_telnet"           "word_freq_857"             
## [33] "word_freq_data"             "word_freq_415"             
## [35] "word_freq_85"               "word_freq_technology"      
## [37] "word_freq_1999"             "word_freq_parts"           
## [39] "word_freq_pm"               "word_freq_direct"          
## [41] "word_freq_cs"               "word_freq_meeting"         
## [43] "word_freq_original"         "word_freq_project"         
## [45] "word_freq_re"               "word_freq_edu"             
## [47] "word_freq_table"            "word_freq_conference"      
## [49] "char_freq_semicolon"        "char_freq_left_paren"      
## [51] "char_freq_left_bracket"     "char_freq_exclamation"     
## [53] "char_freq_dollar"           "char_freq_pound"           
## [55] "capital_run_length_average" "capital_run_length_longest"
## [57] "capital_run_length_total"   "is_spam"
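Later steps index the dependent variable by position, so it is worth confirming that is_spam is column 58 (the last column); this quick check is just a convenience.

which(colnames(spambase) == "is_spam") # returns 58; the 57 preceding columns are the predictors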
# create a training and testing set for our analysis
i <- 1:nrow(spambase)
test_i <- sample(i, trunc(length(i)/4)) # randomly select a quarter of the row indices for the test set
testset <- spambase[test_i,] # the test set is 1/4 of the total records
trainset <- spambase[-test_i,] # the training set is the rest of the dataset
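Note that sample() draws a different random split on every run, so your exact counts may differ slightly from the output shown here; if you want a reproducible split, you can seed the random number generator before sampling (optional, and not used for the results below).

# set.seed(1) # hypothetical seed; uncomment before the sample() call for a reproducible split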

# create a model using the training data
m <- svm(is_spam~., data = trainset)
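The formula interface models is_spam on all remaining columns. With a factor response, svm() defaults to C-classification with a radial-basis kernel; calling summary() on the fitted object reports these settings.

summary(m) # SVM type, kernel, cost parameter, and number of support vectors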

# test the model on the testing set leaving out the dependent variable
prediction <- predict(m, testset[,-58])

# create a confusion matrix to show correct and incorrect classifications (true/false positives and negatives)
cm <- table(pred = prediction, true = testset[,58]) # confusion matrix
cm
##     true
## pred   0   1
##    0 656  48
##    1  30 416
# check the accuracy rates of the model
classAgreement(cm)
## $diag
## [1] 0.9321739
## 
## $kappa
## [1] 0.8582069
## 
## $rand
## [1] 0.8734385
## 
## $crand
## [1] 0.7464198
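The $diag value reported by classAgreement() is simply the proportion of test documents on the main diagonal of the confusion matrix, i.e. the overall accuracy, which can be verified directly from cm.

sum(diag(cm)) / sum(cm) # (656 + 416) / 1150 = 0.9321739, matching $diag above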

Because we are working with a training and a testing set, we can alter the sizes of the two subsets and see how this affects the model’s accuracy.
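One way to experiment is to wrap the split-fit-evaluate steps into a small helper that takes the test-set fraction as an argument. The function below is a hypothetical sketch of that idea (the name and structure are mine, not part of the original analysis); the manual version follows.

# hypothetical helper: hold out a given fraction of spambase for testing,
# fit an SVM on the rest, and return the overall accuracy on the held-out rows
evaluate_split <- function(test_frac) {
  idx <- sample(1:nrow(spambase), trunc(nrow(spambase) * test_frac))
  fit <- svm(is_spam ~ ., data = spambase[-idx, ])
  mean(predict(fit, spambase[idx, -58]) == spambase[idx, 58])
}
evaluate_split(1/4) # a quarter held out, as in the first model
evaluate_split(1/8) # an eighth held out, as in the second model below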

# create a new set of training and testing subsets, this time with a larger training set
i2 <- 1:nrow(spambase)
test_i2 <- sample(i2, trunc(length(i2)/8)) # hold out only an eighth of the rows for testing
testset2 <- spambase[test_i2,] # the test set is 1/8 of the total records
trainset2 <- spambase[-test_i2,] # the training set is the rest of the dataset

# create a new model using the new larger training data
m2 <- svm(is_spam~., data = trainset2)

# test the model on the testing set leaving out the dependent variable
prediction2 <- predict(m2, testset2[,-58])

# create a confusion matrix showing correct and incorrect classifications for the second model
cm2 <- table(pred = prediction2, true = testset2[,58]) # confusion matrix
cm2
##     true
## pred   0   1
##    0 325  22
##    1  14 214
# check the accuracy rates of the model
classAgreement(cm2)
## $diag
## [1] 0.9373913
## 
## $kappa
## [1] 0.8699618
## 
## $rand
## [1] 0.8824178
## 
## $crand
## [1] 0.764541
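Beyond overall accuracy, the confusion matrix also separates the two kinds of error, which matter differently for a spam filter: a legitimate message wrongly flagged as spam (a false positive) is usually more costly than a missed spam message. For the second model, that rate can be read off cm2 directly.

cm2["1", "0"] / sum(cm2[, "0"]) # fraction of non-spam flagged as spam: 14 / (325 + 14)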

Comparing the confusion matrices and the classAgreement() accuracy rates, the overall accuracy rose from about 0.932 with the quarter-sized test set to about 0.937 with the eighth-sized test set. This suggests that the larger training set helped the model separate spam from non-spam slightly better, although because each figure comes from a single random split, some of the difference may be down to sampling variation.