Project Objective

The objective of this project is to classify documents using supervised learning techniques.

Loading Required Libraries

library(tm)
library(RTextTools)

Getting the Data

The data used in this project comes from the spambase data set, hosted in the UCI Machine Learning Repository.

#Getting the data from the spambase dataset
data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data", header = FALSE)

Subsetting the Data

In this data set, column 58 denotes whether the email was spam or ham: spam = 1 and ham = 0. For a full explanation of the variables, see the data set's documentation in the UCI repository. In this section, the data frame is prepared so it can be loaded into the training models.

In the first step, a labels data frame is created from column 58. In the second step, a data frame of the features is created from columns 1-57.

#Subsetting the spam-ham labels
labels <- data[58] 

#Removing the labels column
data <- data[,1:57]
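
To make the two subsetting steps concrete, here is a minimal sketch on a toy data frame. The toy object and its values are made up for illustration and are not part of the spambase data. It shows that single-bracket indexing, as in data[58], keeps a one-column data frame, while double-bracket indexing returns a plain vector.

```r
# Toy stand-in for the spambase data frame: three feature columns and one label column
toy <- data.frame(V1 = c(0.1, 0.0, 0.5),
                  V2 = c(0, 1, 2),
                  V3 = c(3, 4, 5),
                  V4 = c(1, 0, 1))

toy_labels_df  <- toy[4]      # single brackets: still a data frame with one column
toy_labels_vec <- toy[[4]]    # double brackets: the underlying numeric vector
toy_features   <- toy[, 1:3]  # feature columns only, labels removed

class(toy_labels_df)
class(toy_labels_vec)
dim(toy_features)
```

Keeping the labels as a data frame is why the container call in this report transposes them with t(labels) before passing them on.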

Training and Classifying

Creating a Container with the Data

In this step, a container is created from the previously subsetted data frames. The trainSize and testSize arguments assign rows 1-2500 to training and rows 2501-4601 to testing.

container <- create_container(data, t(labels), trainSize = 1:2500, testSize = 2501:4601, virgin = FALSE)
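
Because the container takes fixed row ranges, the class mix in each range depends on the row order of the file. The sketch below uses a made-up label vector (toy_labels) rather than the real labels. It shows one way to check the class counts per split and to shuffle the rows once before splitting. Whether the spambase rows are actually ordered by class is an assumption that should be checked against the real data.

```r
# Hypothetical label vector ordered by class, standing in for the real labels
toy_labels <- c(rep(1, 6), rep(0, 4))
train_idx  <- 1:5
test_idx   <- 6:10

# Class counts per split when the rows are taken in file order
table(toy_labels[train_idx])
table(toy_labels[test_idx])

# Shuffle the row order once, then reuse the same index ranges
set.seed(1)                                  # arbitrary seed, for reproducibility only
shuffled     <- sample(length(toy_labels))   # random permutation of the row indices
toy_shuffled <- toy_labels[shuffled]
table(toy_shuffled[train_idx])
table(toy_shuffled[test_idx])
```

With the real objects, the same permutation would need to be applied to both data and labels before calling create_container, so that features and labels stay aligned.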

Implementing Supervised Classification Models

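For this project, I will use three supervised classification models from RTextTools: Support Vector Machines ("SVM"), a classification tree ("TREE") and Maximum Entropy ("MAXENT"). Note that "TREE" fits a single classification tree; Random Forest is a separate RTextTools algorithm, "RF", and is not used here.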

Train Models

#Support Vector Machines
svm_model <- train_model(container, "SVM")

#Classification tree (the "TREE" algorithm; Random Forest would be "RF")
tree_model <- train_model(container, "TREE")

#Maximum Entropy
maxent_model <- train_model(container, "MAXENT")

Classify Models

#Support Vector Machines
svm_out <- classify_model(container, svm_model)

#Classification tree ("TREE")
tree_out <- classify_model(container, tree_model)

#Maximum Entropy
maxent_out <- classify_model(container, maxent_model)

Evaluating the Models

In this section, the models are evaluated to determine their performance.

Classification Results

#Support Vector Machines
head(svm_out)

##   SVM_LABEL  SVM_PROB
## 1         1 0.9751901
## 2         1 0.9271082
## 3         0 0.9872233
## 4         1 0.8134585
## 5         0 0.9999976
## 6         1 0.5110802

#Classification tree ("TREE")
head(tree_out)

##   TREE_LABEL TREE_PROB
## 1          1 0.9200000
## 2          0 0.9633803
## 3          0 0.8638298
## 4          0 0.9633803
## 5          0 0.8638298
## 6          0 0.9633803

#Maximum Entropy
head(maxent_out)

##   MAXENTROPY_LABEL MAXENTROPY_PROB
## 1                1       0.9992718
## 2                0       0.9793610
## 3                0       0.7651015
## 4                0       0.9999184
## 5                0       0.8499035
## 6                0       0.9554223

Creating a New Data Frame

In this step, a new data frame, labels_out, is created to compare each model's predictions side by side with the correct labels.

labels_out <- data.frame (labels = labels[2501:4601,1], svm = svm_out, tree = tree_out, entropy = maxent_out, stringsAsFactors = FALSE)

head(labels_out)

##   labels svm.SVM_LABEL svm.SVM_PROB tree.TREE_LABEL tree.TREE_PROB
## 1      0             1    0.9751901               1      0.9200000
## 2      0             1    0.9271082               0      0.9633803
## 3      0             0    0.9872233               0      0.8638298
## 4      0             1    0.8134585               0      0.9633803
## 5      0             0    0.9999976               0      0.8638298
## 6      0             1    0.5110802               0      0.9633803
##   entropy.MAXENTROPY_LABEL entropy.MAXENTROPY_PROB
## 1                        1               0.9992718
## 2                        0               0.9793610
## 3                        0               0.7651015
## 4                        0               0.9999184
## 5                        0               0.8499035
## 6                        0               0.9554223
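
Each model also returns a probability for its predicted label, which makes it possible to flag predictions the model was unsure about. As a small sketch, the data frame below copies the true labels and the SVM columns of the six rows printed above. The names first_six, cutoff and unsure are mine, and the 0.6 cutoff is an arbitrary value chosen for illustration.

```r
# True labels and SVM output for the first six test rows, copied from the output above
first_six <- data.frame(
  labels    = c(0, 0, 0, 0, 0, 0),
  svm_label = c(1, 1, 0, 1, 0, 1),
  svm_prob  = c(0.9751901, 0.9271082, 0.9872233,
                0.8134585, 0.9999976, 0.5110802)
)

cutoff <- 0.6   # arbitrary threshold for "low confidence"

# Keep only the rows whose predicted-label probability falls below the cutoff
unsure <- first_six[first_six$svm_prob < cutoff, ]
print(unsure)
```

Among these six rows, only the sixth falls below the cutoff; the other SVM misclassifications were made with high reported probability.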

Reviewing Performance

Performance for the Support Vector Machines Model

#Support Vector Machines

#Table
table(labels_out[,1] == labels_out[,2])

## 
## FALSE  TRUE 
##   902  1199

#Proportion table
prop.table(table(labels_out[,1] == labels_out[,2]))

## 
##     FALSE      TRUE 
## 0.4293194 0.5706806
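
An agreement table only counts hits and misses. A confusion matrix also shows which class the errors fall in. The sketch below rebuilds the true labels and SVM labels of the first six rows of labels_out printed earlier and cross-tabulates them. The variable names truth, svm_pred and confusion are mine, and six rows are not representative of the full test set.

```r
# First six test rows, copied from the head(labels_out) output
truth    <- factor(c(0, 0, 0, 0, 0, 0), levels = c(0, 1))
svm_pred <- factor(c(1, 1, 0, 1, 0, 1), levels = c(0, 1))

# Rows are the true labels, columns are the predicted labels
confusion <- table(actual = truth, predicted = svm_pred)
print(confusion)

# Accuracy is the diagonal (matching labels) divided by the total count
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

Fixing the factor levels to 0 and 1 keeps the table square even when one class is missing from a column, as happens with the true labels here.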

Performance for the Classification Tree Model

#Classification tree ("TREE")

#Table
table(labels_out[,1] == labels_out[,4])

## 
## FALSE  TRUE 
##   507  1594

#Proportion table
prop.table(table(labels_out[,1] == labels_out[,4]))

## 
##     FALSE      TRUE 
## 0.2413137 0.7586863

Performance for the Maximum Entropy Model

#Maximum Entropy

#Table
table(labels_out[,1] == labels_out[,6])

## 
## FALSE  TRUE 
##   625  1476

#Proportion table
prop.table(table(labels_out[,1] == labels_out[,6]))

## 
##     FALSE      TRUE 
## 0.2974774 0.7025226

Conclusion

In terms of overall accuracy, the tree model (RTextTools "TREE", a single classification tree rather than a Random Forest) performed best, classifying about 76% of the test documents correctly. The Maximum Entropy model comes second at about 70%. The worst performance, surprisingly, comes from the SVM model, at about 57%.
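
The percentages quoted here follow directly from the TRUE counts in the tables above, divided by the 2101 test rows (2501 through 4601). A quick check in base R, with vector names of my own choosing:

```r
# TRUE counts copied from the three agreement tables above
correct <- c(svm = 1199, tree = 1594, maxent = 1476)

# Number of rows in the test range used by the container
n_test <- length(2501:4601)

# Share of test rows each model labeled correctly
accuracy <- correct / n_test
print(round(accuracy, 3))

# Name of the model with the highest accuracy
print(names(which.max(accuracy)))
```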

Week 11/12 Assignment: Document Classification

Diego Diaz

November 21, 2015