The emails directory contains both the spam and the ham folders, which are loaded together into a single corpus for the classification task.
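library(tm)        # VCorpus, DirSource, tm_map, DocumentTermMatrix
library(tidyverse) # %>%, spread, mutate, top_n
training_corpus <- VCorpus(DirSource(directory = "emails/", encoding = "latin1", recursive = TRUE))
# names of the files that are considered spam
spam_filename <- list.files("emails/spam_2/")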
Training Corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2798
In this step, common cleaning tasks are performed on the corpus: removing stop words, punctuation, extra whitespace and numbers, and converting the text to lower case.
# the transformations run in the order listed; removeWords() matches case-sensitively
training_corpus <- training_corpus %>% tm_map(removeWords, stopwords("english"))
training_corpus <- training_corpus %>% tm_map(removePunctuation)
training_corpus <- training_corpus %>% tm_map(content_transformer(tolower))
training_corpus <- training_corpus %>% tm_map(stripWhitespace)
training_corpus <- training_corpus %>% tm_map(removeNumbers)
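The order of the cleaning steps can change which words survive. `removeWords()` matches case-sensitively and `stopwords("english")` is all lower case, so a stop word that is still capitalized when the stop-word step runs is kept and only lower-cased afterwards. The following is a minimal sketch on a made-up sentence (the sentence and the variable names are my own, not part of the email data) that compares the order used here with a lower-case-first order:

```r
library(tm)

# One made-up document containing capitalized stop words ("About", "The").
docs <- VCorpus(VectorSource("About this offer: The price is low"))

# Order used in this report: stop words first, then lower-casing.
stop_first <- tm_map(docs, removeWords, stopwords("english"))
stop_first <- tm_map(stop_first, content_transformer(tolower))

# Alternative order: lower-case first, then remove stop words.
lower_first <- tm_map(docs, content_transformer(tolower))
lower_first <- tm_map(lower_first, removeWords, stopwords("english"))

stop_first_text  <- content(stop_first[[1]])
lower_first_text <- content(lower_first[[1]])

stop_first_text   # still contains "about" and "the"
lower_first_text  # "about" and "the" are gone
```

This would explain why a stop word such as "about" can still show up as a column in the document term matrix. Whether to reorder the steps is a judgment call, since doing so changes the term counts reported in the rest of this document.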
dtm_training <- DocumentTermMatrix(training_corpus)
dtm_training <- dtm_training %>% removeSparseTerms(0.99)
Document Term Matrix Cleaning Summary
## <<DocumentTermMatrix (documents: 2798, terms: 2489)>>
## Non-/sparse entries: 344993/6619229
## Sparsity : 95%
## Maximal term length: 66
## Weighting : term frequency (tf)
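The numbers in this summary can be checked with a little arithmetic. The sketch below uses only the figures printed above; the variable names are mine. The last two lines reflect my understanding of `removeSparseTerms()`, namely that a term is kept only if it occurs in more than `(1 - 0.99)` of the documents, so treat that threshold as an assumption rather than a guarantee.

```r
# Figures copied from the DocumentTermMatrix summary above.
n_docs     <- 2798
n_terms    <- 2489
non_sparse <- 344993   # cells with a non-zero count
sparse     <- 6619229  # cells that are zero

# Every document/term pair is one cell of the matrix.
cells <- n_docs * n_terms
stopifnot(cells == non_sparse + sparse)

# Sparsity is the share of zero cells.
sparsity <- sparse / cells
sparsity_pct <- round(100 * sparsity)
sparsity_pct

# Assumed rule for removeSparseTerms(0.99): keep terms found in more than 1% of documents.
min_docs <- floor(n_docs * (1 - 0.99)) + 1
min_docs
```

Under that assumption, a term has to appear in at least 28 of the 2798 emails to remain in the matrix.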
In this segment, we tidy the document term matrix into one row per email and one column per term, and label each email as spam or ham.
library(tidytext)  # provides tidy() for DocumentTermMatrix objects
emails_dt <- dtm_training %>%
  tidy() %>%
  group_by(document) %>%
  spread(term, count, fill = 0) %>%
  mutate(email_class = "ham") %>%
  ungroup()
# assign "spam" to the documents that were present in the spam folder
emails_dt[emails_dt$document %in% spam_filename, "email_class"] <- "spam"
emails_dt["email_class"] <- as.factor(emails_dt[["email_class"]])
| document | aaa | ability | able | about | absolutely | abuse | acc | accept | accepted | access | according | account | accounts | achieve |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 00001.1a31cc283af0060967a233d26548a6ce | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00001.317e78fa8ee2f54cd4890fdc09ba8176 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00002.5a587ae61666c5aa097c8e866aedcc59 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00002.9438920e9a55591b18e60d1ed37d992b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 00003.19be8acd739ad589cd00d8425bac7115 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00003.590eff932f8704d8b0fcbe69d023b54d | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00004.b2ed6c3c62bbdfab7683d60e214d1445 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00004.bdcc075fa4beb5157b5dd6cd41d8887b | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00005.07b9d4aa9e6c596440295a5170111392 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00005.ed0aba4d386c5e62bc737cf3f0ed9589 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00006.3ca1f399ccda5d897fecb8c57669a283 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00006.654c4ec7c059531accf388a807064363 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00007.2e086b13730b68a21ee715db145522b9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00007.acefeee792b5298f8fee175f9f65c453 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 00008.6b73027e1e56131377941ff1db17ff12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00008.ccf927a6aec028f5472ca7b9db9eee20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 00009.13c349859b09264fa131872ed4fb6e4e | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00009.1e1a8cb4b57532ab38aa23287523659d | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 |
| 00010.2558d935f6439cb40d3acb8b8569aa9b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00010.d1b4dbbad797c5c0537c5a0670c373fd | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00011.bc1aa4dca14300a8eec8b7658e568f29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00011.bd8c904d9f7b161a813d222230214d50 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00012.3c1ff7380f10a806321027fc0ad09560 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00012.cb9c9f2a25196f5b16512338625a85b4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00013.245fc5b9e5719b033d5d740c51af92e0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00013.372ec9dc663418ca71f7d880a76f117a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00014.13574737e55e51fe6737a475b88b5052 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00014.8e21078a89bd9c57255d302f346551e8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00015.206d5a5d1d34272ae32fc286788fdf55 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00015.d5c8f360cf052b222819718165db24c6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
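The reshaping and labelling logic can be illustrated on a toy example in base R. The data, file names and variable names below are made up for illustration. The sketch also shows one side effect of going through a long format: a document with no remaining terms has no rows in the long table, so it disappears from the wide table. That would be consistent with the model later reporting 1817 training and 979 test emails (2796 in total) rather than 2798, although I have not verified that this is the cause.

```r
# Hypothetical long-format counts, shaped like the output of tidy() on a DTM.
long <- data.frame(
  document = c("a.txt", "a.txt", "b.txt"),
  term     = c("free", "offer", "meeting"),
  count    = c(2, 1, 1)
)
all_docs   <- c("a.txt", "b.txt", "c.txt")  # c.txt has no surviving terms
spam_files <- c("a.txt")                    # stand-in for spam_filename

# Wide format: one row per document, one column per term, zeros where absent.
wide <- as.data.frame.matrix(xtabs(count ~ document + term, data = long))
wide$document <- rownames(wide)

# Label by membership in the spam file list; everything else is ham.
wide$email_class <- factor(
  ifelse(wide$document %in% spam_files, "spam", "ham"),
  levels = c("ham", "spam")
)

# Documents that silently dropped out of the wide table.
missing_docs <- setdiff(all_docs, wide$document)
wide
missing_docs
```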
Now, let's create the training set and the test set for the classifier.
# row indices of the dataset (length() on a data frame would count columns, not rows)
index <- seq_len(nrow(emails_dt))
# use 35% of the dataset as the test set
samp_size <- (nrow(emails_dt) * 0.35) %>% ceiling()
samp_id <- sample(index, samp_size)
# choose all the records except those present in the test set
spam_train <- emails_dt[-samp_id, ]
# choose the records that were selected for the test set
spam_test <- emails_dt[samp_id, ]
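One detail worth checking in a split like this is how the row index is built: on a data frame, `length()` returns the number of columns while `nrow()` returns the number of rows, so sampling from `1:length(df)` only covers every row when the two happen to match. A minimal sketch on made-up data (the data frame, the seed and the names are assumptions for illustration):

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Toy stand-in for the labelled email table: 30 rows, 3 columns.
toy <- data.frame(
  document    = paste0("doc", 1:30),
  x1          = rnorm(30),
  email_class = factor(rep(c("ham", "spam"), each = 15))
)

n_cols <- length(toy)  # number of columns
n_rows <- nrow(toy)    # number of rows

# 35% of the rows go to the test set, the rest to the training set.
test_size <- ceiling(n_rows * 0.35)
test_id   <- sample(seq_len(n_rows), test_size)
toy_train <- toy[-test_id, ]
toy_test  <- toy[test_id, ]

c(columns = n_cols, rows = n_rows, train = nrow(toy_train), test = nrow(toy_test))
```

Setting a seed before `sample()` also makes the split, and therefore the accuracy figures, repeatable between runs.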
The classifier is a support vector machine (SVM) with a linear kernel, trained with 10-fold cross-validation repeated 3 times.
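library(caret)  # trainControl, train, confusionMatrix
# 10-fold cross-validation, repeated 3 times
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# spam_train[-1] drops the document (file name) column so it is not used as a predictor
svm_lineal <- train(email_class ~ .,
                    data = spam_train[-1],
                    method = "svmLinear",
                    trControl = trctrl,
                    tuneLength = 10,
                    preProcess = c("center", "scale"))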
Model:
svm_lineal
## Support Vector Machines with Linear Kernel
##
## 1817 samples
## 2489 predictors
## 2 classes: 'ham', 'spam'
##
## Pre-processing: centered (2489), scaled (2489)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1635, 1636, 1636, 1635, 1635, 1634, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9856869 0.9713619
##
## Tuning parameter 'C' was held constant at a value of 1
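The last line of the output shows that the cost parameter `C` was not tuned, even though `tuneLength = 10` was passed. As far as I can tell, caret's default grid for `svmLinear` contains only `C = 1`, so candidate values have to be supplied explicitly through `tuneGrid`. The following is a sketch only: it uses two species from the built-in `iris` data as a stand-in for the email matrix, 5-fold cross-validation to keep it quick, and grid values chosen purely for illustration. It assumes the caret and kernlab packages are installed.

```r
library(caret)  # method = "svmLinear" also requires the kernlab package

set.seed(1)  # arbitrary seed for reproducible folds

# Stand-in two-class data set instead of the email term matrix.
toy <- droplevels(subset(iris, Species != "setosa"))

ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(C = c(0.01, 0.1, 1, 10))  # illustrative candidate values

svm_grid <- train(Species ~ .,
                  data = toy,
                  method = "svmLinear",
                  trControl = ctrl,
                  tuneGrid = grid,
                  preProcess = c("center", "scale"))

# One row of resampled accuracy per candidate C, and the value caret selected.
svm_grid$results[, c("C", "Accuracy", "Kappa")]
svm_grid$bestTune
```

With a grid like this, the printed model would list one accuracy per value of `C` instead of reporting that `C` was held constant.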
Prediction:
test_pre <- predict(svm_lineal,newdata = spam_test)
Confusion matrix:
The confusion matrix summarizes how the emails in the test set were classified, compared with their true labels.
confusionMatrix(test_pre,spam_test$email_class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 477 3
## spam 1 498
##
## Accuracy : 0.9959
## 95% CI : (0.9896, 0.9989)
## No Information Rate : 0.5117
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9918
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9979
## Specificity : 0.9940
## Pos Pred Value : 0.9937
## Neg Pred Value : 0.9980
## Prevalence : 0.4883
## Detection Rate : 0.4872
## Detection Prevalence : 0.4903
## Balanced Accuracy : 0.9960
##
## 'Positive' Class : ham
##
According to the confusion matrix, the model predicts an email's class with 99.59% accuracy on the test set (975 of 979 emails classified correctly).
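The headline figures can be recomputed by hand from the four cell counts, which is a quick way to check a quoted percentage. A minimal sketch in base R, with the counts copied from the matrix above (the variable names are mine, and "ham" is treated as the positive class, as in the output):

```r
# Rows are predictions, columns are the true (reference) classes.
cm <- matrix(c(477, 1, 3, 498), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

# Accuracy: correctly classified emails over all test emails.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of true ham that was predicted as ham.
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])

# Specificity: share of true spam that was predicted as spam.
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

In practical terms, 3 spam emails were let through as ham and 1 ham email was flagged as spam.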
This section offers a brief visual summary of the spam and ham data.
#labeled dataframe
cl_edata <- rbind(spam_train,spam_test)
#long format dataframe
emails_dt <- dtm_training %>% tidy()
spam_freq <- emails_dt[emails_dt$document %in% spam_filename,] %>% top_n(50)
## Selecting by count
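Note that `top_n(50)` on the long-format table ranks individual document/term rows by `count`, so the same word can appear more than once in the result. If the goal is the most frequent words across all spam emails, the counts can be summed per term first. A small sketch in base R on made-up data (the rows, file names and variable names are assumptions for illustration):

```r
# Hypothetical long-format counts: one row per document/term pair.
long <- data.frame(
  document = c("s1", "s1", "s2", "h1"),
  term     = c("font", "free", "font", "meeting"),
  count    = c(5, 1, 3, 2)
)
spam_files <- c("s1", "s2")  # stand-in for spam_filename

# Keep the spam rows, then add up the counts of each term across documents.
spam_long   <- long[long$document %in% spam_files, ]
term_totals <- aggregate(count ~ term, data = spam_long, FUN = sum)

# Sort from most to least frequent and keep at most 50 terms.
term_totals <- term_totals[order(-term_totals$count), ]
top_terms   <- head(term_totals, 50)
top_terms
```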
There are about the same number of spam and ham emails in the directory.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Wordcloud
As shown in the wordcloud, the most frequent words in spam emails are those related to HTML coding and font attributes.
In this markdown, I classified spam and ham emails with a model trained on one part of the data and evaluated against a held-out test set. Because of my limited knowledge of machine learning algorithms, some parameters of the SVM may not have been set appropriately, which could have affected the reported accuracy. However, I am confident that the results are close to what was expected for this assignment.