The objective of this project is to classify documents using supervised learning techniques.
library(tm)
library(RTextTools)
The data used in this project comes from the spambase data set.
#Getting the data from the spambase dataset
data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data", header = FALSE)
In this data set, column 58 denotes whether the email was spam or ham: spam = 1 and ham = 0. For a full explanation of the variables, please visit this link. In this section, the data frame is prepared so it can be passed to the training functions.
In the first step, a label data frame is created from column 58. In the second step, a feature data frame is created from columns 1-57.
#Subsetting the spam-ham labels
labels <- data[58]
#Removing the labels column
data <- data[,1:57]
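As a small illustration of the two subsetting steps above, the sketch below uses a made-up three-column data frame (`toy`, not the spambase data) to show what each indexing form returns. Single-bracket indexing such as `data[58]` keeps a one-column data frame, which is presumably why the labels are later wrapped in `t()` so they can be indexed by position; double-bracket indexing would give a plain vector directly.

```r
# Toy stand-in for the spambase data frame (hypothetical values).
# Column names V1..V3 mimic what read.csv(header = FALSE) generates.
toy <- data.frame(V1 = c(0.1, 0.0, 0.4),
                  V2 = c(1.2, 0.3, 0.0),
                  V3 = c(1, 0, 1))

labels_df  <- toy[3]      # single brackets: still a one-column data frame
labels_vec <- toy[[3]]    # double brackets: the column as a numeric vector
features   <- toy[, 1:2]  # all rows, feature columns only

str(labels_df)
str(labels_vec)
dim(features)

# t() turns the one-column data frame into a 1 x n matrix,
# which can then be indexed by position like a vector
labels_mat <- t(labels_df)
labels_mat[2:3]
```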
Creating a Container with the Data
In this step, a container is created from the previously subsetted data frames. Rows 1-2500 are used for training and rows 2501-4601 for testing; virgin = FALSE indicates that the test rows have known labels.
container <- create_container(data, t(labels), trainSize = 1:2500, testSize = 2501:4601, virgin = FALSE)
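Because the container splits the rows by position, the class mix in each part depends on how the file happens to be ordered. The sketch below uses made-up labels (`y`, sorted by class purely as an assumption for illustration) to show how the class balance of a positional split can be checked, and how a shuffled split compares. The seed and the 50/50 cut are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, only to make the shuffle reproducible

# Hypothetical labels sorted by class: 40 spam (1) followed by 60 ham (0)
y <- c(rep(1, 40), rep(0, 60))
n <- length(y)

# Positional split: first 50 rows for training, the rest for testing
train_idx <- 1:50
test_idx  <- 51:n
table(y[train_idx])
table(y[test_idx])   # only class 0 appears in the test rows

# Shuffled split: permute the row indices first, then cut
perm       <- sample(n)
train_shuf <- perm[1:50]
test_shuf  <- perm[51:n]
prop.table(table(y[test_shuf]))
```

With the real data, one option would be to reorder both `data` and `labels` with the same permutation before building the container, keeping the `1:2500` and `2501:4601` ranges unchanged.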
Implementing Supervised Classification Models
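For this project, I will use three supervised classification models: Support Vector Machines (SVM), Random Forest, and Maximum Entropy. Note that the "TREE" algorithm passed to train_model below fits a single classification tree; in RTextTools the random forest algorithm is selected with "RF".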
Train Models
#Support Vector Machines
svm_model <- train_model(container, "SVM")
#Random Forest
tree_model <- train_model(container, "TREE")
#Maximum Entropy
maxent_model <- train_model(container, "MAXENT")
Classify Models
#Support Vector Machines
svm_out <- classify_model(container, svm_model)
#Random Forest
tree_out <- classify_model(container, tree_model)
#Maximum Entropy
maxent_out <- classify_model(container, maxent_model)
Evaluating the Models
In this section, the models are evaluated to determine their performance.
Classification Results
#Support Vector Machines
head(svm_out)
## SVM_LABEL SVM_PROB
## 1 1 0.9751901
## 2 1 0.9271082
## 3 0 0.9872233
## 4 1 0.8134585
## 5 0 0.9999976
## 6 1 0.5110802
#Random Forest
head(tree_out)
## TREE_LABEL TREE_PROB
## 1 1 0.9200000
## 2 0 0.9633803
## 3 0 0.8638298
## 4 0 0.9633803
## 5 0 0.8638298
## 6 0 0.9633803
#Maximum Entropy
head(maxent_out)
## MAXENTROPY_LABEL MAXENTROPY_PROB
## 1 1 0.9992718
## 2 0 0.9793610
## 3 0 0.7651015
## 4 0 0.9999184
## 5 0 0.8499035
## 6 0 0.9554223
Creating a new data frame
In this step, a new data frame labels_out is created to compare the models side-by-side with the correct classifications.
labels_out <- data.frame(labels = labels[2501:4601, 1], svm = svm_out, tree = tree_out, entropy = maxent_out, stringsAsFactors = FALSE)
head(labels_out)
## labels svm.SVM_LABEL svm.SVM_PROB tree.TREE_LABEL tree.TREE_PROB
## 1 0 1 0.9751901 1 0.9200000
## 2 0 1 0.9271082 0 0.9633803
## 3 0 0 0.9872233 0 0.8638298
## 4 0 1 0.8134585 0 0.9633803
## 5 0 0 0.9999976 0 0.8638298
## 6 0 1 0.5110802 0 0.9633803
## entropy.MAXENTROPY_LABEL entropy.MAXENTROPY_PROB
## 1 1 0.9992718
## 2 0 0.9793610
## 3 0 0.7651015
## 4 0 0.9999184
## 5 0 0.8499035
## 6 0 0.9554223
Reviewing Performance
Performance for the Support Vector Machines Model
#Support Vector Machines
#Table
table(labels_out[,1] == labels_out[,2])
##
## FALSE TRUE
## 902 1199
#Proportion Table
prop.table(table(labels_out[,1] == labels_out[,2]))
##
## FALSE TRUE
## 0.4293194 0.5706806
Performance for the Random Forest Model
#Random Forest
#Table
table(labels_out[,1] == labels_out[,4])
##
## FALSE TRUE
## 507 1594
#Proportion Table
prop.table(table(labels_out[,1] == labels_out[,4]))
##
## FALSE TRUE
## 0.2413137 0.7586863
Performance for the Maximum Entropy Model
#Maximum Entropy
#Table
table(labels_out[,1] == labels_out[,6])
##
## FALSE TRUE
## 625 1476
#Proportion Table
prop.table(table(labels_out[,1] == labels_out[,6]))
##
## FALSE TRUE
## 0.2974774 0.7025226
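The TRUE/FALSE tables above fold both kinds of error (ham flagged as spam, spam let through) into a single FALSE count. As a rough sketch of how the two could be separated, the example below cross-tabulates true against predicted labels using base R only. The vectors `actual` and `predicted` are invented for illustration and are not taken from the models above.

```r
# Hypothetical true and predicted labels (1 = spam, 0 = ham)
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 0, 1, 0, 1, 0, 0, 1)

# Confusion matrix: rows are the true class, columns the predicted class
cm <- table(actual = actual, predicted = predicted)
print(cm)

# Accuracy is the share of predictions that match the true label
accuracy <- mean(actual == predicted)

# Precision and recall for the spam class, read off the confusion matrix
precision <- cm["1", "1"] / sum(cm[, "1"])  # of predicted spam, how much is spam
recall    <- cm["1", "1"] / sum(cm["1", ])  # of true spam, how much was caught

c(accuracy = accuracy, precision = precision, recall = recall)
```

With the data frame built earlier, the same call would take `labels_out$labels` and one of the `*_LABEL` columns in place of the toy vectors.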
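In terms of overall accuracy, the Random Forest model (trained here with the "TREE" algorithm) performed best, classifying about 76% of the test documents correctly. The Maximum Entropy model comes second at about 70%. The worst performance, surprisingly, comes from the SVM model at about 57%. These figures come from a split by row position rather than a random split, and the first six true labels shown above are all ham, so the class balance of the test rows is worth checking before drawing firm conclusions.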
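The percentages quoted here can be recomputed from the TRUE/FALSE counts reported in the tables. The sketch below copies those counts into a small data frame (the name `results` and its column names are illustrative choices) and ranks the models by accuracy, labelling each with the algorithm string passed to train_model.

```r
# Correct (TRUE) and incorrect (FALSE) counts copied from the tables above
results <- data.frame(
  model   = c("SVM", "TREE", "MAXENT"),
  correct = c(1199, 1594, 1476),
  wrong   = c(902, 507, 625)
)

# Accuracy = correct / total test documents
results$total    <- results$correct + results$wrong
results$accuracy <- results$correct / results$total

# Rank the models from most to least accurate
results[order(-results$accuracy), ]
```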