Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely applied to:
Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).
This assignment takes the idea of sentiment analysis and applies it to a “spam filter”. It takes e-mails that are labeled as spam or ham and trains a model to predict whether future e-mails are spam or ham, based on key words that are present in the training sets.
The following packages are required for this assignment:
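Judging only from the functions called in the code that follows, the likely candidates are tm (`Corpus`, `TermDocumentMatrix`), plyr (`rbind.fill`), class (`knn`), gmodels (`CrossTable`) and RTextTools (`train_model`, `classify_model`). That list is an inference, not a confirmed requirement. A minimal sketch that reports which of these are not yet installed:

```r
# Assumed package list, inferred from the function calls in this write-up.
required <- c("tm", "plyr", "class", "gmodels", "RTextTools")

# requireNamespace() returns FALSE (rather than erroring) when a package is absent.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```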
All of the files used to train and test the model need to be read in first. The process is as follows:
# Read in the files for one class and create its term document matrix
generateTDM <- function(types, pathname) {
  s.dir <- sprintf("%s/%s", pathname, types)
  s.cor <- Corpus(DirSource(directory = s.dir, encoding = "latin1"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  # Drop very sparse terms (those missing from about 90% or more of the documents)
  s.tdm <- removeSparseTerms(s.tdm, 0.9)
  return(list(name = types, tdm = s.tdm))
}
# Clean the corpus to get rid of all the unnecessary information
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # content_transformer() keeps the results as tm documents, so the
  # conversion back with PlainTextDocument is no longer needed
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}
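To make the cleaning steps concrete, here is a rough base-R sketch of the same ideas applied to a single string. It is not the tm implementation: `clean_text` and `toy_stopwords` are hypothetical names, the stop-word list is a tiny stand-in for `stopwords("english")`, and whitespace is collapsed last so that the gaps left by removed words disappear.

```r
# Hypothetical stand-in for tm::stopwords("english"); the real list is much longer.
toy_stopwords <- c("the", "a", "is", "to", "and", "you")

clean_text <- function(x, stop_words) {
  x <- gsub("[[:punct:]]+", "", x)                  # drop punctuation
  x <- tolower(x)                                   # normalise case
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # remove whole-word stop words
  x <- gsub("\\s+", " ", x)                         # collapse runs of whitespace
  trimws(x)
}

clean_text("You have WON a FREE prize!!!  Click to claim.", toy_stopwords)
# returns "have won free prize click claim"
```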
The term document matrix is created for both the spam and ham files. We assign the matrices different names so that each one can be manipulated individually. The last column of each matrix labels the e-mail as either spam or ham, which allows the model's predictions to be checked for correctness later on.
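bindTypeToTDM <- function(tdm) {
  # Transpose so that rows are e-mails and columns are terms
  s.mat <- t(data.matrix(tdm[["tdm"]]))
  s.df <- as.data.frame(s.mat, stringsAsFactors = FALSE)
  # Append the class label (spam or ham) as the last column
  s.df <- cbind(s.df, rep(tdm[["name"]], nrow(s.df)))
  colnames(s.df)[ncol(s.df)] <- "types"
  return(s.df)
}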
Creates the Term Document Matrix for all of the e-mails in both the spam and ham files and combines the two matrices into one. The NA values are filled with 0s to show that the term did not appear in the given e-mail set. The combined matrix holds, for each e-mail, the frequency of every term that appears in either the spam or the ham e-mails.
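# Build one term document matrix per class folder listed in types
tdm <- lapply(types, generateTDM, pathname = pathname)
typeTDM <- lapply(tdm, bindTypeToTDM)
# Stack the per-class data frames; columns missing from one class become NA
tdm.stack <- do.call(rbind.fill, typeTDM)
tdm.stack[is.na(tdm.stack)] <- 0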
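The stacking step can be illustrated on toy data without plyr. The sketch below uses a hypothetical helper, `stack_fill`, as a simplified base-R stand-in for `rbind.fill`. The term counts are made up for illustration.

```r
# Toy per-class term counts (hypothetical values); rows are e-mails, columns are terms.
spam_df <- data.frame(free = c(2, 1), prize = c(1, 0), types = "spam",
                      stringsAsFactors = FALSE)
ham_df  <- data.frame(meeting = c(1, 3), free = c(0, 1), types = "hard_ham",
                      stringsAsFactors = FALSE)

# Simplified stand-in for plyr::rbind.fill: add each missing column as NA, then rbind.
stack_fill <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  rbind(a[all_cols], b[all_cols])
}

stacked <- stack_fill(spam_df, ham_df)
stacked[is.na(stacked)] <- 0   # a term missing from one class never occurred there
stacked
```

The spam rows end up with a `meeting` count of 0 and the ham rows with a `prize` count of 0, which is the role the NA replacement plays in the full pipeline.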
After the Term Document Matrix is created, the model needs to be tested. That is done by splitting the term document matrix into two parts:
The training data set is the data used to train the model. The TDM is fed into the model, and the model either builds a tree or groups the data into sets. In doing so the model learns what type of data belongs to which group and can use that to categorize future data.
The hold-out data is the data used to test the model. This data is fed into the trained model and used to determine its accuracy. Each hold-out e-mail is placed into one of the groups that the model derived from the training data set.
train.idx <- sample(nrow(tdm.stack), ceiling(nrow(tdm.stack) * 0.7))
test.idx <- (1:nrow(tdm.stack)) [-train.idx]
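As a small illustration of the same 70/30 split, the sketch below uses a hypothetical row count of 10 and an assumed seed, which the original code does not set. Without a fixed seed, the split and therefore the reported counts change from run to run.

```r
set.seed(42)   # assumed seed, only to make this sketch repeatable
n <- 10        # hypothetical number of e-mails

train_idx <- sample(n, ceiling(n * 0.7))        # 70% of the rows, drawn at random
test_idx  <- setdiff(seq_len(n), train_idx)     # everything not used for training

length(train_idx)  # 7
length(test_idx)   # 3
```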
The KNN model uses Euclidean distance to decide where each hold-out e-mail belongs relative to the training set. Each group (spam or ham) has a set of points determined by the TDM, and those points can be plotted in n-dimensional space, with one dimension per term. Each new e-mail is compared against those points by Euclidean distance and placed in the group it is closest to. Because knn() is called below without a k argument, it uses its default of k = 1, so each e-mail takes the label of its single nearest training e-mail.
tdm.type <- tdm.stack[, "types"]
tdm.stack.nl <- tdm.stack[,!colnames(tdm.stack) %in% "types"]
knn.prediction <- knn(tdm.stack.nl[train.idx, ], tdm.stack.nl[test.idx, ], tdm.type[train.idx])
test <- CrossTable(x = tdm.type[test.idx], y = knn.prediction, prop.chisq = FALSE)
##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 225
##
##
##                    | knn.prediction
## tdm.type[test.idx] |  hard_ham |      spam | Row Total |
## -------------------|-----------|-----------|-----------|
##           hard_ham |        65 |        18 |        83 |
##                    |     0.783 |     0.217 |     0.369 |
##                    |     1.000 |     0.112 |           |
##                    |     0.289 |     0.080 |           |
## -------------------|-----------|-----------|-----------|
##               spam |         0 |       142 |       142 |
##                    |     0.000 |     1.000 |     0.631 |
##                    |     0.000 |     0.887 |           |
##                    |     0.000 |     0.631 |           |
## -------------------|-----------|-----------|-----------|
##       Column Total |        65 |       160 |       225 |
##                    |     0.289 |     0.711 |           |
## -------------------|-----------|-----------|-----------|
##
##
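The nearest-neighbour idea behind the KNN step can be sketched in a few lines of base R. This is an illustration with made-up two-term counts, not the `class::knn` implementation, and `nearest_label` is a hypothetical helper name.

```r
# Hypothetical term counts: one row per labelled training e-mail, one column per term.
train_x <- rbind(c(3, 0), c(2, 1), c(0, 4), c(1, 3))
train_y <- c("spam", "spam", "hard_ham", "hard_ham")

nearest_label <- function(new_x, train_x, train_y) {
  # Euclidean distance from new_x to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # 1-nearest-neighbour: take the label of the closest training row
  train_y[which.min(d)]
}

nearest_label(c(2, 0), train_x, train_y)  # "spam"
nearest_label(c(0, 3), train_x, train_y)  # "hard_ham"
```

With real data each e-mail has one coordinate per term in the matrix, so the same distance is computed in a much higher-dimensional space.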
svm_model <- train_model(container, "SVM")
svm_out <- classify_model(container, svm_model)
test <- CrossTable(x = tdm.type[testSize], y = svm_out[,1], prop.chisq = FALSE )
##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 225
##
##
##                    | svm_out[, 1]
## tdm.type[testSize] |  hard_ham |      spam | Row Total |
## -------------------|-----------|-----------|-----------|
##           hard_ham |        83 |         0 |        83 |
##                    |     1.000 |     0.000 |     0.369 |
##                    |     0.988 |     0.000 |           |
##                    |     0.369 |     0.000 |           |
## -------------------|-----------|-----------|-----------|
##               spam |         1 |       141 |       142 |
##                    |     0.007 |     0.993 |     0.631 |
##                    |     0.012 |     1.000 |           |
##                    |     0.004 |     0.627 |           |
## -------------------|-----------|-----------|-----------|
##       Column Total |        84 |       141 |       225 |
##                    |     0.373 |     0.627 |           |
## -------------------|-----------|-----------|-----------|
##
##
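A tree-based classifier. An ensemble approach creates multiple decision trees and takes the membership category predicted most frequently across those trees as the classification most likely to be accurate. As far as the RTextTools algorithm names go, the "TREE" option trained below fits a single classification tree, while the ensemble variants are exposed as "RF" and "BAGGING".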
tree_model <- train_model(container, "TREE")
tree_out <- classify_model(container, tree_model)
test <- CrossTable(x = tdm.type[testSize], y = tree_out[,1], prop.chisq = FALSE )
##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 225
##
##
##                    | tree_out[, 1]
## tdm.type[testSize] |  hard_ham |      spam | Row Total |
## -------------------|-----------|-----------|-----------|
##           hard_ham |        83 |         0 |        83 |
##                    |     1.000 |     0.000 |     0.369 |
##                    |     0.976 |     0.000 |           |
##                    |     0.369 |     0.000 |           |
## -------------------|-----------|-----------|-----------|
##               spam |         2 |       140 |       142 |
##                    |     0.014 |     0.986 |     0.631 |
##                    |     0.024 |     1.000 |           |
##                    |     0.009 |     0.622 |           |
## -------------------|-----------|-----------|-----------|
##       Column Total |        85 |       140 |       225 |
##                    |     0.378 |     0.622 |           |
## -------------------|-----------|-----------|-----------|
##
##
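The majority-vote idea described for tree ensembles can be sketched in base R as follows. The per-tree predictions are made up for illustration, and `majority_vote` is a hypothetical helper rather than anything RTextTools provides.

```r
# Hypothetical predictions from three trees for four e-mails (one column per tree).
votes <- cbind(
  tree1 = c("spam",     "hard_ham", "spam", "hard_ham"),
  tree2 = c("spam",     "spam",     "spam", "hard_ham"),
  tree3 = c("hard_ham", "hard_ham", "spam", "hard_ham")
)

# For one e-mail, return the label predicted by the most trees.
majority_vote <- function(v) names(which.max(table(v)))

ensemble <- apply(votes, 1, majority_vote)
ensemble  # "spam" "hard_ham" "spam" "hard_ham"
```

Using an odd number of trees avoids ties in a two-class problem. With an even number, `which.max` would silently pick the first label in alphabetical order.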
maxent_model <- train_model(container, "MAXENT")
maxent_out <- classify_model(container, maxent_model)
test <- CrossTable(x = tdm.type[testSize], y = maxent_out[,1], prop.chisq = FALSE )
##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 225
##
##
##                    | maxent_out[, 1]
## tdm.type[testSize] |  hard_ham |      spam | Row Total |
## -------------------|-----------|-----------|-----------|
##           hard_ham |        83 |         0 |        83 |
##                    |     1.000 |     0.000 |     0.369 |
##                    |     0.976 |     0.000 |           |
##                    |     0.369 |     0.000 |           |
## -------------------|-----------|-----------|-----------|
##               spam |         2 |       140 |       142 |
##                    |     0.014 |     0.986 |     0.631 |
##                    |     0.024 |     1.000 |           |
##                    |     0.009 |     0.622 |           |
## -------------------|-----------|-----------|-----------|
##       Column Total |        85 |       140 |       225 |
##                    |     0.378 |     0.622 |           |
## -------------------|-----------|-----------|-----------|
##
##
There are three ways to measure the accuracy of a model: precision, recall, and the F-measure.
| Relevant/Retrieved | Correct | Not Correct |
|---|---|---|
| Selected | TP | FP |
| Not Selected | FN | TN |
\[ \text{precision} = \frac{TP}{TP + FP} \]
\[ \text{recall} = \frac{TP}{TP + FN} \]
\[ \text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
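As a worked instance of these formulas, the sketch below plugs in the counts from the KNN cross table. It assumes hard_ham is the positive class and that the rows of the cross table hold the actual labels while the columns hold the predictions. That is how `CrossTable(x = actual, y = predicted)` lays them out, but the choice of positive class is an assumption here.

```r
# Counts read from the KNN cross table (hard_ham taken as the positive class).
tp <- 65    # actual hard_ham, predicted hard_ham
fn <- 18    # actual hard_ham, predicted spam
fp <- 0     # actual spam, predicted hard_ham
tn <- 142   # actual spam, predicted spam (not used by these three measures)

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f_measure <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f_measure = f_measure), 4)
```

The F-measure is symmetric in precision and recall, so it comes out the same whichever of the two is attached to the row proportion.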
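Taking hard_ham as the positive class, with the actual labels in the rows of the cross tables and the predictions in the columns, the models score as follows:
##     Model Precision    Recall F-Measure
## 1:    KNN 1.0000000 0.7831325 0.8783784
## 2:    SVM 0.9880952 1.0000000 0.9940120
## 3:   TREE 0.9764706 1.0000000 0.9880952
## 4: MAXENT 0.9764706 1.0000000 0.9880952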