I went with the spam example for this classification exercise.
The first step is to get the data: I wrote the Spambase data to a CSV file and uploaded it to GitHub. We then read it from that URL and load our libraries.
url <- "https://raw.githubusercontent.com/bkreis84/Kreis-Week-11/master/spambase.csv"
spambase <- read.csv(url, header=TRUE)
library(RTextTools)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
##
## The following object is masked from 'package:base':
##
## backsolve
library(tm)
## Loading required package: NLP
library(SnowballC)
##
## Attaching package: 'SnowballC'
##
## The following objects are masked from 'package:RTextTools':
##
## getStemLanguages, wordStem
I decided to randomize the row order, as the original data appeared to be sorted by whether an email was labeled spam or not.
I then extracted the labels to be used and set N to the number of labels, which is the number of observations in the dataset.
spambase <- spambase[sample(nrow(spambase)),]
spam_labels <- unlist(spambase$V58)
head(spam_labels)
## [1] 1 1 0 0 1 0
N <- length(spam_labels)
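A shuffle like this gives a different row order on every run, so the exact counts reported later can vary slightly between runs. As a minimal sketch, assuming a small toy data frame (`toy` is hypothetical and only stands in for spambase), fixing the seed with `set.seed` makes the permutation repeatable:

```r
# Toy stand-in for spambase: one feature column and a sorted 0/1 label (hypothetical data)
toy <- data.frame(x = 1:10, V58 = rep(c(1, 0), each = 5))

set.seed(42)                            # fix the RNG so the shuffle is repeatable
shuffled <- toy[sample(nrow(toy)), ]    # permute the rows

set.seed(42)                            # same seed again
shuffled_again <- toy[sample(nrow(toy)), ]

identical(shuffled, shuffled_again)     # TRUE: same seed, same row order
table(shuffled$V58)                     # shuffling leaves the class counts unchanged
```

The seed value itself is arbitrary; what matters is that it is set once before the call to `sample`.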
Then I created a container, using the first 1000 observations for training and the remaining 3601 for testing.
container <- create_container(spambase, labels = spam_labels, trainSize = 1:1000, testSize = 1001:N, virgin = FALSE)
slotNames(container)
## [1] "training_matrix" "classification_matrix" "training_codes"
## [4] "testing_codes" "column_names" "virgin"
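When a whole data frame is passed as the feature matrix, it is worth checking whether the label column is still among the features. The following is a minimal base-R sketch on hypothetical toy data (the names `toy`, `features`, `train_idx`, and `test_idx` are illustrative, not from the original code), separating the features from the labels before splitting by row index:

```r
# Hypothetical toy data: two feature columns plus the label column V58
toy <- data.frame(f1  = c(0.1, 0.9, 0.4, 0.7, 0.2, 0.8),
                  f2  = c(3, 1, 2, 5, 4, 6),
                  V58 = c(0, 1, 0, 1, 0, 1))

labels   <- toy$V58
# Keep every column except the label as a feature
features <- as.matrix(toy[, setdiff(names(toy), "V58"), drop = FALSE])

n         <- nrow(features)
train_idx <- 1:4      # first rows for training
test_idx  <- 5:n      # remaining rows for testing

"V58" %in% colnames(features)                # FALSE: the label is no longer a feature
length(train_idx) + length(test_idx) == n    # the two index ranges cover every row once
```

A matrix built this way could then be handed to a container in place of the full data frame, so that the models only see the predictors.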
Using the train_model function, I trained three models on the container: SVM, a classification tree, and maximum entropy. We then apply those models to the remaining observations that we want to test.
svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
A data frame is then created to compare the actual label (spam or not) with what each model predicted.
labels_out <- data.frame(
  correct_label = spam_labels[1001:N],
  # SVM predictions; the run shown below filled this column from maxent_out,
  # which is why its SVM figures are identical to the maximum entropy ones
  svm = svm_out[,1],
  tree = tree_out[,1],
  maxent = maxent_out[,1],
  stringsAsFactors = FALSE)
We then create tables to show how many times each model was correct versus wrong.
##SVM Performance
table(labels_out[,1] == labels_out[,2])
##
## FALSE TRUE
## 30 3571
prop.table(table(labels_out[,1] == labels_out[,2]))
##
## FALSE TRUE
## 0.008331019 0.991668981
##Tree Performance
table(labels_out[,1] == labels_out[,3])
##
## TRUE
## 3601
prop.table(table(labels_out[,1] == labels_out[,3]))
##
## TRUE
## 1
##Maximum Entropy Performance
table(labels_out[,1] == labels_out[,4])
##
## FALSE TRUE
## 30 3571
prop.table(table(labels_out[,1] == labels_out[,4]))
##
## FALSE TRUE
## 0.008331019 0.991668981
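All three models predicted whether an email was spam with over 99% accuracy on the test set: the SVM and maximum entropy columns each match the true label for 3,571 of the 3,601 test emails (about 99.2%), and the tree model matches all 3,601. The perfect tree score deserves a second look, since the whole spambase data frame, including the label column V58, was passed to create_container, so the label itself was probably available to the models as a feature.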
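The TRUE/FALSE tables report only overall agreement. A confusion matrix also shows which kind of error a model makes (spam missed versus legitimate mail flagged). Here is a minimal base-R sketch on hypothetical label vectors (`actual` and `predicted` are made-up values, not results from the models above):

```r
# Hypothetical actual and predicted labels (1 = spam, 0 = not spam)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Share of predictions that match the true label
accuracy <- mean(actual == predicted)

# Rows are the true class, columns the predicted class
confusion <- table(actual = actual, predicted = predicted)

accuracy     # 0.75 for these toy vectors
confusion
```

With the real results, the same two lines could be applied to `labels_out$correct_label` and any of the prediction columns.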