Week 11 Assignment

I went with the spam example for my classification analysis.

The first step is to get the data, which I wrote to a CSV file and uploaded to GitHub. Then we load our libraries.

url <- "https://raw.githubusercontent.com/bkreis84/Kreis-Week-11/master/spambase.csv"
spambase <- read.csv(url, header=TRUE)

library(RTextTools)

## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## 
## The following object is masked from 'package:base':
## 
##     backsolve

library(tm)

## Loading required package: NLP

library(SnowballC)

## 
## Attaching package: 'SnowballC'
## 
## The following objects are masked from 'package:RTextTools':
## 
##     getStemLanguages, wordStem

I decided to shuffle the rows because the original data appeared to be sorted by whether each email was spam or not.

I then extracted the labels to be used and set N equal to the number of observations in the dataset.

spambase <- spambase[sample(nrow(spambase)),]
spam_labels <- unlist(spambase$V58)
head(spam_labels)

## [1] 1 1 0 0 1 0

N <- length(spam_labels)
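Because `sample()` draws a different permutation on every run, the shuffle above is not repeatable as written. The following is a minimal sketch, on a small made-up data frame (`toy` and its values are hypothetical, not the spambase data), of fixing the seed before shuffling and confirming that the class balance is unchanged.

```r
# Toy stand-in for spambase: rows sorted by label, as in the original file.
toy <- data.frame(V1  = c(0.1, 0.4, 0.3, 0.9, 0.7, 0.2),
                  V58 = c(1, 1, 1, 0, 0, 0))

set.seed(42)                          # fix the RNG so the shuffle is repeatable
shuffled <- toy[sample(nrow(toy)), ]  # permute the row order
rownames(shuffled) <- NULL            # drop the old row names

# Shuffling reorders rows but should not change the class balance.
prop.table(table(toy$V58))
prop.table(table(shuffled$V58))
```

The seed value itself is arbitrary; any fixed integer gives a repeatable train/test split.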

Then I created a container, using the first 1000 rows as the training set and the remaining 3601 rows as the test set.

container <- create_container(spambase, labels = spam_labels, trainSize = 1:1000, testSize = 1001:N, virgin = FALSE)

slotNames(container)

## [1] "training_matrix"       "classification_matrix" "training_codes"       
## [4] "testing_codes"         "column_names"          "virgin"
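In the call above the full data frame, V58 included, is passed as the feature matrix, so the label may also end up among the predictors. The sketch below, on a hypothetical toy data frame, shows one way to separate predictors from labels and to check that the training and test index ranges are disjoint and cover every row. The names `toy`, `features`, `train_idx` and `test_idx` are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical toy data: two predictors plus the label column V58.
toy <- data.frame(V1 = runif(10), V2 = runif(10), V58 = rep(c(0, 1), 5))

labels   <- toy$V58
features <- toy[, setdiff(names(toy), "V58")]  # predictors only, label removed

N         <- nrow(toy)
train_idx <- 1:4   # plays the role of trainSize = 1:1000
test_idx  <- 5:N   # plays the role of testSize = 1001:N

length(train_idx)                 # rows used for training
length(test_idx)                  # rows held out for testing
intersect(train_idx, test_idx)    # should be empty: no row is in both sets
```

With the same arithmetic, a training range of 1:1000 and a test range of 1001:N hold out N - 1000 rows.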

Using the train_model function, I trained three models on the container. We then apply those models to the remaining observations that we want to test.

svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")

svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)

A data frame is then created to compare the actual label (spam or not) with what each model predicted.

labels_out <- data.frame(
  correct_label = spam_labels[1001:N],  # true labels for the test rows
  svm = svm_out[,1],                    # first column holds the predicted label
  tree = tree_out[,1],
  maxent = maxent_out[,1],
  stringsAsFactors = FALSE)

We then create tables to show how many times each model was correct versus wrong.

##SVM Performance
table(labels_out[,1] == labels_out[,2])

## 
## FALSE  TRUE 
##    30  3571

prop.table(table(labels_out[,1] == labels_out[,2]))

## 
##       FALSE        TRUE 
## 0.008331019 0.991668981

##Classification Tree Performance
table(labels_out[,1] == labels_out[,3])

## 
## TRUE 
## 3601

prop.table(table(labels_out[,1] == labels_out[,3]))

## 
## TRUE 
##    1

##Maximum Entropy Performance
table(labels_out[,1] == labels_out[,4])

## 
## FALSE  TRUE 
##    30  3571

prop.table(table(labels_out[,1] == labels_out[,4]))

## 
##       FALSE        TRUE 
## 0.008331019 0.991668981
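The TRUE/FALSE tables above count correct and incorrect predictions but do not show which kind of error was made. As a minimal sketch, using made-up label vectors (`actual` and `predicted` are hypothetical, not the model output), a cross-tabulation separates false positives from false negatives and yields precision and recall alongside accuracy.

```r
# Hypothetical labels for illustration only (1 = spam, 0 = not spam).
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Cross-tabulate predictions (rows) against the true labels (columns).
confusion <- table(predicted = predicted, actual = actual)
confusion

accuracy  <- sum(diag(confusion)) / sum(confusion)         # share of correct predictions
precision <- confusion["1", "1"] / sum(confusion["1", ])   # predicted spam that is spam
recall    <- confusion["1", "1"] / sum(confusion[, "1"])   # actual spam that was caught
c(accuracy = accuracy, precision = precision, recall = recall)
```

The same pattern could be applied to any prediction column of labels_out, for example table(predicted = labels_out$tree, actual = labels_out$correct_label).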

All three models predicted whether an email was spam with over 99% accuracy on the 3601 test observations: about 99.2% in the SVM and maximum entropy columns and 100% for the tree.

Brian Kreis

November 15, 2015