Assignment: It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. Here is one example of such data: http://archive.ics.uci.edu/ml/datasets/Spambase. For this project, you can use the above dataset to predict the class of new documents, either withheld from the training dataset or drawn from another source such as your own spam folder.

Data set information:

The purpose of this project is to classify an email as “Spam” or “Ham” (non-spam) based on a dataset of 4601 emails from the UCI Machine Learning Repository.

The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography. The collection of spam e-mails used here came from the postmaster and individuals who had filed spam at HP. The collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter. http://archive.ics.uci.edu/ml/datasets/Spambase/

Get packages and load data:

library(caret)
## Warning: package 'caret' was built under R version 3.2.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.2
library(rpart)
## Warning: package 'rpart' was built under R version 3.2.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.2.2
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(rattle)
## Warning: package 'rattle' was built under R version 3.2.2
## Rattle: A free graphical interface for data mining with R.
## Version 4.0.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(e1071)
## Warning: package 'e1071' was built under R version 3.2.2
x<-read.table('https://raw.githubusercontent.com/vskrelja/607_DataAcqMgt_Skrelja/master/spambase.data', header = F, sep = ",")

set.seed(1)  # fix the random seed so the results are reproducible
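The file ships without a header row, so the columns come in as V1 through V58. A quick sanity check on the import (a minimal sketch; nothing here is required for the analysis below):

#Sanity-check the import
dim(x)           # expect 4601 rows and 58 columns (57 attributes + class)
table(x$V58)     # class balance: 0 = ham, 1 = spam
anyNA(x)         # check for missing values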

Attribute Information:

The last column of ‘spambase.data’ denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character occurred frequently in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. Here are the definitions of the attributes (a worked example follows the list):

48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A “word” in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail.

1 continuous real [1,…] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters

1 continuous integer [1,…] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters

1 continuous integer [1,…] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
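To make the frequency and run-length definitions concrete, here is a minimal sketch on a made-up one-line e-mail (the text and the word ‘free’ are chosen purely for illustration):

#Worked example of the attribute definitions on a toy e-mail
email <- "Get FREE stuff now and win FREE money today"
words <- unlist(strsplit(tolower(email), "[^[:alnum:]]+"))
words <- words[words != ""]
100 * sum(words == "free") / length(words)   # word_freq_free: 100 * 2 / 9 = 22.2
runs <- nchar(unlist(strsplit(gsub("[^A-Z]+", " ", email), " ")))
runs <- runs[runs > 0]                       # capital runs: G, FREE, FREE
c(average = mean(runs), longest = max(runs), total = sum(runs))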

#Create an exploratory plot to see how frequency of certain words might flag an email as spam
par(mar = c(5, 5, 2, 2))
plot(density(x$V57[x$V58==0 & x$V57<2000]),col='black', main='', xlab='Total number of capital letters in the e-mail')
lines(density(x$V57[x$V58==1 & x$V57<2000]),col='red')
legend("topright",lty=1, col = c("black", "red"),legend = c("non-spam", "spam"))

As we can see, spam e-mails have a higher total number of capital letters; similarly, variations in the occurrence of words like ‘FREE’ and characters like ‘$’ should help classify the emails as spam or ham.
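A quick numeric check of that intuition (a sketch; assuming the UCI attribute order, V16 is word_freq_free and V53 is char_freq_$):

#Class means for two indicative attributes (before any columns are dropped)
aggregate(cbind(V16, V53) ~ V58, data = x, FUN = mean)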

#Drop fields which are not sufficiently different across emails and won't be useful in classification
nzv <- nearZeroVar(x, saveMetrics=TRUE)
x <- x[,nzv$nzv==FALSE]
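nearZeroVar flags predictors whose values are nearly constant across e-mails (by default, those whose most-common-to-second-most-common value ratio exceeds 95/5 and which have few unique values). A sketch for inspecting what was flagged:

#Inspect the near-zero-variance metrics computed by caret
sum(nzv$nzv)                        # number of columns flagged for removal
head(nzv[order(-nzv$freqRatio), ])  # most-skewed columns first

With the default thresholds this is quite aggressive on this data: as the str() output below shows, only 7 of the 57 predictors survive.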

#Convert 'spam' or 'ham' variable V58 to factor
x$V58[x$V58==1]<-'spam'
x$V58[x$V58==0]<-'ham'
x$V58<-factor(x$V58)

#Remaining variables
str(x)
## 'data.frame':    4601 obs. of  8 variables:
##  $ V19: num  1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
##  $ V50: num  0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
##  $ V52: num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ V53: num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ V55: num  3.76 5.11 9.82 3.54 3.54 ...
##  $ V56: int  61 101 485 40 40 15 4 11 445 43 ...
##  $ V57: int  278 1028 2259 191 191 54 112 49 1257 749 ...
##  $ V58: Factor w/ 2 levels "ham","spam": 2 2 2 2 2 2 2 2 2 2 ...
#Partition Data
inTrain <- createDataPartition(x$V58, p = 3/4)[[1]]
training <- x[ inTrain,]
testing <- x[-inTrain,]
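createDataPartition samples within each class, so the spam/ham ratio should be nearly identical in both sets; a quick check (sketch):

#Verify the stratified split preserved class proportions
round(prop.table(table(training$V58)), 3)
round(prop.table(table(testing$V58)), 3)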

#Classification Tree (rpart; with a factor response this fits a classification, not regression, tree)
fit_rpart <- train(V58~.,method='rpart',data=training)

#Fancy Decision Tree Plot
fancyRpartPlot(fit_rpart$finalModel)
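caret tunes rpart's complexity parameter cp by resampling; a sketch for seeing which value it settled on and the tree in text form:

#Inspect the tuned complexity parameter and the final tree
fit_rpart$bestTune            # cp value chosen by resampling
print(fit_rpart$finalModel)   # text form of the tree plotted above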

#Confusion Matrix on Testing set: Classification Tree
pred_rpart <- predict(fit_rpart, testing)
confusionMatrix(pred_rpart, testing$V58, positive = 'spam')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  633  108
##       spam  64  345
##                                           
##                Accuracy : 0.8504          
##                  95% CI : (0.8285, 0.8706)
##     No Information Rate : 0.6061          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6814          
##  Mcnemar's Test P-Value : 0.001043        
##                                           
##             Sensitivity : 0.7616          
##             Specificity : 0.9082          
##          Pos Pred Value : 0.8435          
##          Neg Pred Value : 0.8543          
##              Prevalence : 0.3939          
##          Detection Rate : 0.3000          
##    Detection Prevalence : 0.3557          
##       Balanced Accuracy : 0.8349          
##                                           
##        'Positive' Class : spam            
## 

We get an accuracy of over 85%, which is not bad. But given how undesirable False Positives (marking good mail as spam) are in this context, we need to increase the Specificity. Next we try a Random Forest.

#Random Forest
fit_rf <- randomForest(V58~., data = training)
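Before scoring the test set, it is worth a look at which predictors the forest leans on; a sketch using randomForest's built-in importance measures:

#Variable importance from the fitted forest
importance(fit_rf)    # mean decrease in Gini per predictor
varImpPlot(fit_rf)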

#Confusion Matrix on Testing set: Random Forest
pred_rf <- predict(fit_rf, testing)
confusionMatrix(pred_rf, testing$V58, positive = 'spam')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  665   75
##       spam  32  378
##                                           
##                Accuracy : 0.907           
##                  95% CI : (0.8887, 0.9231)
##     No Information Rate : 0.6061          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8018          
##  Mcnemar's Test P-Value : 4.901e-05       
##                                           
##             Sensitivity : 0.8344          
##             Specificity : 0.9541          
##          Pos Pred Value : 0.9220          
##          Neg Pred Value : 0.8986          
##              Prevalence : 0.3939          
##          Detection Rate : 0.3287          
##    Detection Prevalence : 0.3565          
##       Balanced Accuracy : 0.8943          
##                                           
##        'Positive' Class : spam            
## 

Although the accuracy jumped to over 90%, we still have a lot of False Positives. Next we will change the cutoff probabilities to reduce the number of False Positives.

#Default cutoffs used to classify an observation are taken from the forest$cutoff component of the model object (0.5,0.5)
#The 'winning' class (spam or ham) for an observation is the one with the maximum ratio of proportion of votes to cutoff.
fit_rf$forest$cutoff
## [1] 0.5 0.5
head(fit_rf$votes)
##           ham      spam
## 1 0.021739130 0.9782609
## 2 0.015463918 0.9845361
## 3 0.005649718 0.9943503
## 4 0.088397790 0.9116022
## 5 0.077777778 0.9222222
## 7 0.351351351 0.6486486
#So if we want high Specificity when classifying an email as spam, we want the cutoff for the spam class to be high, e.g. (0.2,0.8)
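To see the rule in action, take observation 7 from the head() output above; a small sketch of the winning-class computation under both cutoffs:

#Winning class = largest votes/cutoff ratio (observation 7 above)
votes <- c(ham = 0.351, spam = 0.649)
votes / c(0.5, 0.5)   # default: spam wins (0.702 vs 1.298)
votes / c(0.2, 0.8)   # stricter: ham wins (1.755 vs 0.811)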

#Predict with high Specificity
pred_rf <- predict(fit_rf, testing,cutoff = c(.2,.8))
confusionMatrix(pred_rf, testing$V58, positive = 'spam')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  688  135
##       spam   9  318
##                                           
##                Accuracy : 0.8748          
##                  95% CI : (0.8543, 0.8934)
##     No Information Rate : 0.6061          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7243          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7020          
##             Specificity : 0.9871          
##          Pos Pred Value : 0.9725          
##          Neg Pred Value : 0.8360          
##              Prevalence : 0.3939          
##          Detection Rate : 0.2765          
##    Detection Prevalence : 0.2843          
##       Balanced Accuracy : 0.8445          
##                                           
##        'Positive' Class : spam            
## 

Given the highly undesirable nature of False Positives, we changed the cutoff probabilities so that good emails are rarely marked as spam. In order to achieve over 98% Specificity we had to trade away the model’s Sensitivity, which was reduced to ~70%. That also meant that roughly 16% of the mail that made it past the filter (135 of the 823 e-mails classified as ham) was actually spam.
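Where exactly to set the cutoff depends on how one weighs missed spam against blocked good mail; a closing sketch that sweeps a few spam cutoffs on the test set makes the trade-off explicit (exact numbers will vary with the seed and the split):

#Sweep spam cutoffs and record Sensitivity/Specificity on the test set
for (co in c(0.5, 0.6, 0.7, 0.8, 0.9)) {
  p  <- predict(fit_rf, testing, cutoff = c(1 - co, co))
  cm <- confusionMatrix(p, testing$V58, positive = 'spam')
  cat(sprintf("spam cutoff %.1f: Sensitivity %.3f, Specificity %.3f\n",
              co, cm$byClass['Sensitivity'], cm$byClass['Specificity']))
}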