Data Loading

Corpus Directory Load

The emails directory contains the spam and ham folders. Both are loaded recursively into a single corpus for the classification task.

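library(tm)     # VCorpus, DirSource, tm_map, DocumentTermMatrix
library(dplyr)  # %>% pipe and data manipulation verbs

training_corpus <- VCorpus(DirSource(directory = "emails/", encoding = "latin1", recursive = TRUE))
#names of the files that are considered spam
spam_filename <- list.files("emails/spam_2/")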

Corpus Summary

Training Corpus

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2798

Corpus Cleaning

This step performs common cleaning tasks on the corpus: removing stop words, punctuation, extra whitespace and numbers, and converting the text to lower case.

Training corpus preparation.

training_corpus <- training_corpus %>% tm_map(removeWords,stopwords("english"))
training_corpus <- training_corpus %>% tm_map(removePunctuation)
training_corpus <- training_corpus %>% tm_map(content_transformer(tolower))
training_corpus <- training_corpus %>% tm_map(stripWhitespace)
training_corpus <- training_corpus %>% tm_map(removeNumbers)
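
The order of the cleaning steps can matter. As far as I know, removeWords matches words case-sensitively and stopwords("english") is all lowercase, so removing stop words before lowercasing can leave capitalized ones such as "The" in the text. The sketch below illustrates the effect with base R gsub standing in for removeWords. The stop list, the sentence and the helper names are made up for the example.

```r
# Hypothetical mini stop list, all lowercase like tm's English stop words
stop_words <- c("the", "is", "a")
pattern <- sprintf("\\b(%s)\\b", paste(stop_words, collapse = "|"))

# Base R stand-ins for removeWords and stripWhitespace (assumed helpers)
remove_stop_words <- function(x) gsub(pattern, "", x, perl = TRUE)
squish <- function(x) trimws(gsub("\\s+", " ", x))

email <- "The offer is FREE and The price is low"

# Stop words removed first, lowercase afterwards
before <- squish(tolower(remove_stop_words(email)))
# Lowercase first, stop words removed afterwards
after <- squish(remove_stop_words(tolower(email)))

print(before)
print(after)
```

In the first variant both occurrences of "The" survive and are only lowercased afterwards, while the second variant removes them.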

dtm_training <- DocumentTermMatrix(training_corpus)
dtm_training <- dtm_training %>%  removeSparseTerms(0.99)
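
To make the 0.99 threshold concrete: removeSparseTerms drops the terms that are absent from too large a share of the documents, so with 0.99 a term has to appear in roughly more than 1% of the emails to be kept. Below is a minimal base R sketch of that idea on a made-up count matrix. It is an approximation of the rule, not the tm implementation, and a lower threshold of 0.7 is used so the effect shows on five documents.

```r
# Toy document-term counts: 5 documents (rows) x 3 terms (columns), values made up
counts <- matrix(
  c(1, 0, 2, 1, 3,   # "free": present in 4 of 5 documents
    0, 0, 1, 0, 0,   # "lottery": present in 1 of 5 documents
    1, 1, 0, 0, 0),  # "meeting": present in 2 of 5 documents
  nrow = 5,
  dimnames = list(paste0("doc", 1:5), c("free", "lottery", "meeting"))
)

# Share of documents in which each term does NOT occur
sparsity <- colMeans(counts == 0)

# Keep only the terms whose sparsity is below the threshold
threshold <- 0.7
filtered <- counts[, sparsity < threshold, drop = FALSE]

print(sparsity)
print(colnames(filtered))
```

Here "lottery" is missing from 80% of the documents and is dropped, while "free" and "meeting" are kept.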

Document Term Matrix Cleaning Summary

## <<DocumentTermMatrix (documents: 2798, terms: 2489)>>
## Non-/sparse entries: 344993/6619229
## Sparsity           : 95%
## Maximal term length: 66
## Weighting          : term frequency (tf)

Dataset Preparation

In this segment, we tidy the document term matrix into a table with one row per email and one column per term, and label each email as spam or ham.

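library(tidyr)     # spread
library(tidytext)  # tidy() method for DocumentTermMatrix

#one row per email, one column per term; every email starts out labeled as ham
emails_dt <- dtm_training %>%
  tidy() %>%
  group_by(document) %>%
  spread(term, count, fill = 0) %>%
  mutate(email_class = "ham") %>%
  ungroup()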

#assigns spam to the documents that were present in the spam folder
emails_dt[emails_dt$document %in% spam_filename, "email_class"] <- "spam"

#the classifier expects the label to be a factor
emails_dt$email_class <- as.factor(emails_dt$email_class)
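
The reshaping and labeling above can be sketched in base R on a tiny made-up example. Here xtabs stands in for the tidy/spread combination, and the file names, terms and counts are hypothetical.

```r
# Long format: one row per (document, term) pair, as tidy() would produce
long <- data.frame(
  document = c("msg_a", "msg_a", "msg_b", "msg_c"),
  term     = c("free", "offer", "meeting", "free"),
  count    = c(2, 1, 1, 3)
)

# Wide format: one row per document, one column per term, 0 where a term is absent
wide <- as.data.frame.matrix(xtabs(count ~ document + term, data = long))

# Every document starts as ham, then the known spam files are relabeled
spam_files <- c("msg_a", "msg_c")  # hypothetical spam file names
wide$email_class <- "ham"
wide[rownames(wide) %in% spam_files, "email_class"] <- "spam"
wide$email_class <- as.factor(wide$email_class)

print(wide)
```

The design choice is the same as in the pipeline: membership in the spam folder is the only source of the label, so any file not listed there stays ham.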

Word Frequency Dataset Overview

This dataset gives an overview of the terms present in the corpus, with one row per document (identified by its file name) and one column per term.

document aaa ability able about absolutely abuse acc accept accepted access according account accounts achieve
00001.1a31cc283af0060967a233d26548a6ce 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00001.317e78fa8ee2f54cd4890fdc09ba8176 0 0 0 0 0 0 0 1 0 0 0 0 0 0
00002.5a587ae61666c5aa097c8e866aedcc59 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00002.9438920e9a55591b18e60d1ed37d992b 0 0 0 0 0 0 0 0 1 0 0 0 0 0
00003.19be8acd739ad589cd00d8425bac7115 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00003.590eff932f8704d8b0fcbe69d023b54d 0 1 0 0 1 0 0 0 0 0 0 0 0 0
00004.b2ed6c3c62bbdfab7683d60e214d1445 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00004.bdcc075fa4beb5157b5dd6cd41d8887b 0 1 0 0 1 0 0 0 0 0 0 0 0 0
00005.07b9d4aa9e6c596440295a5170111392 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00005.ed0aba4d386c5e62bc737cf3f0ed9589 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00006.3ca1f399ccda5d897fecb8c57669a283 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00006.654c4ec7c059531accf388a807064363 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00007.2e086b13730b68a21ee715db145522b9 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00007.acefeee792b5298f8fee175f9f65c453 0 0 0 0 0 0 0 0 0 1 0 0 0 0
00008.6b73027e1e56131377941ff1db17ff12 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00008.ccf927a6aec028f5472ca7b9db9eee20 0 0 0 0 0 0 0 0 0 0 0 1 0 0
00009.13c349859b09264fa131872ed4fb6e4e 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00009.1e1a8cb4b57532ab38aa23287523659d 0 0 0 1 0 0 0 0 0 0 0 12 0 0
00010.2558d935f6439cb40d3acb8b8569aa9b 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00010.d1b4dbbad797c5c0537c5a0670c373fd 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00011.bc1aa4dca14300a8eec8b7658e568f29 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00011.bd8c904d9f7b161a813d222230214d50 0 0 0 0 1 0 0 0 0 0 0 0 0 0
00012.3c1ff7380f10a806321027fc0ad09560 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00012.cb9c9f2a25196f5b16512338625a85b4 0 0 0 0 1 0 0 0 0 0 0 0 0 0
00013.245fc5b9e5719b033d5d740c51af92e0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00013.372ec9dc663418ca71f7d880a76f117a 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00014.13574737e55e51fe6737a475b88b5052 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00014.8e21078a89bd9c57255d302f346551e8 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00015.206d5a5d1d34272ae32fc286788fdf55 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00015.d5c8f360cf052b222819718165db24c6 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Now, let's create the training set and the test set for the classifier.

#row indices to sample from (length() on a data frame would count columns, not rows)
index <- seq_len(nrow(emails_dt))

#uses 35% of the dataset as the test set
samp_size <- (NROW(emails_dt)*0.35) %>% ceiling()
samp_id <-  sample(index,samp_size)


#choose all the records except those present in the test set
spam_train <- emails_dt[-samp_id,]

#choose records that were selected for the testing set
spam_test <- emails_dt[samp_id,]
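
A minimal sketch of the same 65/35 split on a toy data frame, assuming made-up data and an arbitrary seed. It also shows why the row indices should come from nrow(): on a data frame, length() returns the number of columns.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Toy labeled dataset: 10 emails, one term column (values made up)
toy <- data.frame(
  document    = paste0("msg_", 1:10),
  free        = rpois(10, 1),
  email_class = rep(c("ham", "spam"), 5)
)

n_cols <- length(toy)  # number of columns
n_rows <- nrow(toy)    # number of rows

# 35% of the rows, rounded up, go to the test set
test_size <- ceiling(n_rows * 0.35)
test_id <- sample(seq_len(n_rows), test_size)

toy_train <- toy[-test_id, ]
toy_test  <- toy[test_id, ]

print(c(columns = n_cols, rows = n_rows, train = nrow(toy_train), test = nrow(toy_test)))
```

Because the training set is built by excluding the sampled indices, no email can end up in both sets.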

Dataset Training

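The classifier is a support vector machine with a linear kernel, trained through caret with 10-fold cross-validation repeated 3 times. The document column is dropped before training, and the predictors are centered and scaled.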

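library(caret)  # trainControl, train, confusionMatrix; method "svmLinear" relies on the kernlab package
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)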
svm_lineal <- train(email_class~.,
                    data=spam_train[-1],
                    method="svmLinear",
                    trControl = trctrl,
                    tuneLength = 10,
                    preProcess = c("center", "scale")
                    )

Model:

svm_lineal
## Support Vector Machines with Linear Kernel 
## 
## 1817 samples
## 2489 predictors
##    2 classes: 'ham', 'spam' 
## 
## Pre-processing: centered (2489), scaled (2489) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1635, 1636, 1636, 1635, 1635, 1634, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9856869  0.9713619
## 
## Tuning parameter 'C' was held constant at a value of 1

Prediction:

test_pre <- predict(svm_lineal,newdata = spam_test)

Confusion matrix:

This confusion matrix summarizes how the emails in the test set were classified.

confusionMatrix(test_pre,spam_test$email_class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  477    3
##       spam   1  498
##                                           
##                Accuracy : 0.9959          
##                  95% CI : (0.9896, 0.9989)
##     No Information Rate : 0.5117          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9918          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##             Sensitivity : 0.9979          
##             Specificity : 0.9940          
##          Pos Pred Value : 0.9937          
##          Neg Pred Value : 0.9980          
##              Prevalence : 0.4883          
##          Detection Rate : 0.4872          
##    Detection Prevalence : 0.4903          
##       Balanced Accuracy : 0.9960          
##                                           
##        'Positive' Class : ham             
## 

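According to the confusion matrix, the model predicts the class of an email in the test set with 99.59% accuracy (975 of 979 emails), slightly above the 98.57% accuracy estimated by cross-validation during training.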
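
The headline statistics can be recomputed by hand from the four cells of the confusion matrix. The sketch below uses only base R and the counts printed above, with ham as the positive class as in the caret output. The variable names are my own.

```r
# Cells copied from the confusion matrix above (rows: prediction, columns: reference)
cm <- matrix(
  c(477, 1,    # reference ham:  predicted ham, predicted spam
    3, 498),   # reference spam: predicted ham, predicted spam
  nrow = 2,
  dimnames = list(Prediction = c("ham", "spam"), Reference = c("ham", "spam"))
)

# Share of all test emails that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)
# Share of real ham emails that were predicted as ham
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])
# Share of real spam emails that were predicted as spam
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4))
```

The three values match the Accuracy, Sensitivity and Specificity lines reported by confusionMatrix.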

Email Data Visuals

This section offers a brief visual summary of the spam and ham data.

#labeled dataframe
cl_edata <- rbind(spam_train,spam_test)
#long format dataframe
emails_dt <- dtm_training %>% tidy() 
spam_freq <- emails_dt[emails_dt$document %in% spam_filename,] %>% top_n(50)
## Selecting by count

There are about the same number of spam and ham emails in the directory.

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Wordcloud

As shown in the word cloud, the most frequent words in spam emails are those related to HTML markup and font attributes.

Conclusion

In this markdown, I classified spam and ham emails with a trained model and evaluated it against a held-out test set. Because of my limited knowledge of machine learning algorithms, some of the SVM parameters may not have been appropriate for this model, which may have affected the accuracy obtained. However, I am confident that the results are close to what was expected for this assignment.