Text Mining

Training corpus preparation.

training_corpus <- training_corpus %>% tm_map(removeWords,stopwords("english"))
training_corpus <- training_corpus %>% tm_map(removePunctuation)
training_corpus <- training_corpus %>% tm_map(content_transformer(tolower))
training_corpus <- training_corpus %>% tm_map(stripWhitespace)
training_corpus <- training_corpus %>% tm_map(removeNumbers)
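
The order of these transformations can matter. If stop words are removed before the text is lower-cased, and assuming the stop word matching is case-sensitive, capitalized stop words such as "The" or "About" survive the filter; this would be consistent with `about` showing up as a term column in the overview table further down. The following is a minimal base R sketch of that effect. The three-word stop list and the sample sentence are hypothetical stand-ins, not part of the original pipeline.

```r
# Hypothetical short stop list standing in for stopwords("english")
stop_words <- c("the", "is", "a")
pattern <- sprintf("\\b(%s)\\b", paste(stop_words, collapse = "|"))

# Collapse repeated whitespace and trim the ends
squish <- function(x) trimws(gsub("\\s+", " ", x))

text <- "The offer is FREE"

# Stop words removed first, lower-casing second: "The" is not matched
removed_first <- squish(tolower(gsub(pattern, "", text, perl = TRUE)))

# Lower-casing first, stop words removed second: "the" is matched
lowered_first <- squish(gsub(pattern, "", tolower(text), perl = TRUE))

print(removed_first)
print(lowered_first)
```

Lower-casing before stop word removal is the safer order whenever the stop list is all lower case.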

dtm_training <- DocumentTermMatrix(training_corpus)
dtm_training <- dtm_training %>%  removeSparseTerms(0.99)
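
As we understand the tm documentation, `removeSparseTerms(0.99)` keeps only terms whose share of empty documents is below 99%, that is, terms that occur in more than 1% of the emails. The base R sketch below imitates that rule on a small dense matrix; the matrix, the term names, and the 0.7 threshold are hypothetical and chosen only to keep the example readable.

```r
# Hypothetical 5-document x 3-term count matrix
m <- matrix(
  c(1, 0, 2, 0, 1,   # "free"  appears in 3 documents
    0, 0, 0, 3, 0,   # "rare"  appears in 1 document
    1, 1, 0, 0, 0),  # "click" appears in 2 documents
  nrow = 5,
  dimnames = list(paste0("doc", 1:5), c("free", "rare", "click"))
)

sparse <- 0.7  # maximum allowed share of empty documents per term

# Number of documents in which each term occurs at least once
doc_freq <- colSums(m > 0)

# Keep terms that occur in more than (1 - sparse) of the documents
keep <- doc_freq > nrow(m) * (1 - sparse)
kept <- m[, keep, drop = FALSE]

print(doc_freq)
print(colnames(kept))
```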

Document Term Matrix Cleaning Summary

## <<DocumentTermMatrix (documents: 2798, terms: 2489)>>
## Non-/sparse entries: 344993/6619229
## Sparsity           : 95%
## Maximal term length: 66
## Weighting          : term frequency (tf)
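
The 95% sparsity figure follows directly from the counts in this summary: it is the share of zero cells among all document-term cells. A short check, using only the numbers printed above:

```r
documents  <- 2798
terms      <- 2489
non_sparse <- 344993   # cells with a count greater than zero
sparse     <- 6619229  # cells with a count of zero

total_cells <- non_sparse + sparse
sparsity    <- sparse / total_cells

print(total_cells == documents * terms)
print(round(100 * sparsity, 2))
```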

Dataset Preparation

In this segment, we tidy the document term matrix into one row per email and one column per term, and label each email as spam or ham.

emails_dt <- dtm_training %>%
  tidy() %>% 
  group_by(document) %>%
  spread(term,count,fill = 0) %>%
  mutate(email_class= "ham")%>%
  ungroup()

#assign "spam" to the documents that were present in the spam folder
emails_dt[emails_dt$document %in% spam_filename,"email_class"] <- "spam" 

#convert the label column to a factor, as required for classification
emails_dt["email_class"] <- as.factor(emails_dt[["email_class"]])
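
To make the reshaping and labeling step concrete, here is a minimal base R sketch of the same idea on a few rows. It uses `xtabs()` instead of `spread()` only to stay dependency-free; the document names, the counts, and `spam_files` (a stand-in for `spam_filename`) are hypothetical.

```r
# Hypothetical long-format counts, shaped like the output of tidy() on a DTM
long <- data.frame(
  document = c("0001.aaa", "0001.aaa", "0002.bbb", "0003.ccc"),
  term     = c("free", "money", "meeting", "free"),
  count    = c(2, 1, 3, 1),
  stringsAsFactors = FALSE
)
spam_files <- c("0001.aaa", "0003.ccc")  # hypothetical stand-in for spam_filename

# One row per document, one column per term; absent pairs become 0
wide <- as.data.frame(unclass(xtabs(count ~ document + term, data = long)))

# Label documents found in the spam list as "spam", everything else as "ham"
wide$email_class <- factor(
  ifelse(rownames(wide) %in% spam_files, "spam", "ham"),
  levels = c("ham", "spam")
)

print(wide)
```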

Word Frequency Dataset Overview

This dataset is an overview of the terms present in the corpus: each row is a document (identified by its file name) and each column holds the number of times a term appears in that document.

document	ability	about	absolutely	accept	accepted	access	account
00001.1a31cc283af0060967a233d26548a6ce	0	0	0	0	0	0	0
00001.317e78fa8ee2f54cd4890fdc09ba8176	0	0	0	1	0	0	0
00002.5a587ae61666c5aa097c8e866aedcc59	0	0	0	0	0	0	0
00002.9438920e9a55591b18e60d1ed37d992b	0	0	0	0	1	0	0
00003.19be8acd739ad589cd00d8425bac7115	0	0	0	0	0	0	0
00003.590eff932f8704d8b0fcbe69d023b54d	1	0	1	0	0	0	0
00004.b2ed6c3c62bbdfab7683d60e214d1445	0	0	0	0	0	0	0
00004.bdcc075fa4beb5157b5dd6cd41d8887b	1	0	1	0	0	0	0
00005.07b9d4aa9e6c596440295a5170111392	0	0	0	0	0	0	0
00005.ed0aba4d386c5e62bc737cf3f0ed9589	0	0	0	0	0	0	0
00006.3ca1f399ccda5d897fecb8c57669a283	0	0	0	0	0	0	0
00006.654c4ec7c059531accf388a807064363	0	0	0	0	0	0	0
00007.2e086b13730b68a21ee715db145522b9	0	0	0	0	0	0	0
00007.acefeee792b5298f8fee175f9f65c453	0	0	0	0	0	1	0
00008.6b73027e1e56131377941ff1db17ff12	0	0	0	0	0	0	0
00008.ccf927a6aec028f5472ca7b9db9eee20	0	0	0	0	0	0	1
00009.13c349859b09264fa131872ed4fb6e4e	0	0	0	0	0	0	0
00009.1e1a8cb4b57532ab38aa23287523659d	0	1	0	0	0	0	12
00010.2558d935f6439cb40d3acb8b8569aa9b	0	0	0	0	0	0	0
00010.d1b4dbbad797c5c0537c5a0670c373fd	0	0	0	0	0	0	0
00011.bc1aa4dca14300a8eec8b7658e568f29	0	0	0	0	0	0	0
00011.bd8c904d9f7b161a813d222230214d50	0	0	1	0	0	0	0
00012.3c1ff7380f10a806321027fc0ad09560	0	0	0	0	0	0	0
00012.cb9c9f2a25196f5b16512338625a85b4	0	0	1	0	0	0	0
00013.245fc5b9e5719b033d5d740c51af92e0	0	0	0	0	0	0	0
00013.372ec9dc663418ca71f7d880a76f117a	0	0	0	0	0	0	0
00014.13574737e55e51fe6737a475b88b5052	0	0	0	0	0	0	0
00014.8e21078a89bd9c57255d302f346551e8	0	0	0	0	0	0	0
00015.206d5a5d1d34272ae32fc286788fdf55	0	0	0	0	0	0	0
00015.d5c8f360cf052b222819718165db24c6	0	0	0	0	0	0	0

Now, let's create the training set and the test set for the classifier.

#row indices to sample from (one per email); length() would count columns instead
index <- seq_len(nrow(emails_dt))

#use 35% of the dataset as the test set
samp_size <- (NROW(emails_dt)*0.35) %>% ceiling()
samp_id <-  sample(index,samp_size)


#choose all the records except those present in the test set
spam_train <- emails_dt[-samp_id,]

#choose records that were selected for the testing set
spam_test <- emails_dt[samp_id,]
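
One detail worth spelling out for this kind of split: on a data frame, `length()` returns the number of columns while `nrow()` returns the number of rows, so the row indices passed to `sample()` should be built from `nrow()`. The sketch below shows the difference and a 35% hold-out on a hypothetical 10-row data frame; the fixed seed is an addition for reproducibility and is not part of the original code.

```r
set.seed(2019)  # assumption: fixed seed so the split can be reproduced

# Hypothetical 10-row, 4-column data frame standing in for emails_dt
toy <- data.frame(
  document    = sprintf("doc%02d", 1:10),
  free        = rpois(10, 1),
  money       = rpois(10, 1),
  email_class = factor(rep(c("ham", "spam"), each = 5))
)

print(length(toy))  # number of columns
print(nrow(toy))    # number of rows

# Hold out 35% of the rows (rounded up) as the test set
test_n  <- ceiling(nrow(toy) * 0.35)
test_id <- sample(seq_len(nrow(toy)), test_n)

toy_test  <- toy[test_id, ]
toy_train <- toy[-test_id, ]
```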

Dataset Training

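The classifier is a support vector machine with a linear kernel, trained through caret with 10-fold cross-validation repeated 3 times. Predictors are centered and scaled before fitting.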

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
svm_lineal <- train(email_class~.,
                    data=spam_train[-1],
                    method="svmLinear",
                    trControl = trctrl,
                    tuneLength = 10,
                    preProcess = c("center", "scale")
                    )

Model:

svm_lineal

## Support Vector Machines with Linear Kernel 
## 
## 1817 samples
## 2489 predictors
##    2 classes: 'ham', 'spam' 
## 
## Pre-processing: centered (2489), scaled (2489) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1635, 1636, 1636, 1635, 1635, 1634, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9856869  0.9713619
## 
## Tuning parameter 'C' was held constant at a value of 1
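
The last line of this output indicates that the cost parameter `C` was not actually tuned, even though `tuneLength = 10` was supplied. One way to compare several values of `C` is to pass an explicit grid through `tuneGrid`. The sketch below does this on hypothetical toy data, with a lighter 5-fold cross-validation to keep it fast; the data, the grid values, and the seed are assumptions, and `method = "svmLinear"` additionally requires the kernlab package to be installed.

```r
library(caret)  # method = "svmLinear" also needs the kernlab package

set.seed(1)

# Hypothetical toy data: two numeric predictors, two well-separated classes
n <- 60
toy <- data.frame(
  x1 = c(rnorm(n / 2, mean = -2), rnorm(n / 2, mean = 2)),
  x2 = c(rnorm(n / 2, mean = -2), rnorm(n / 2, mean = 2)),
  email_class = factor(rep(c("ham", "spam"), each = n / 2))
)

ctrl   <- trainControl(method = "cv", number = 5)
c_grid <- expand.grid(C = c(0.01, 0.1, 1))  # candidate cost values

fit <- train(email_class ~ .,
             data = toy,
             method = "svmLinear",
             trControl = ctrl,
             tuneGrid = c_grid,
             preProcess = c("center", "scale"))

# One row of resampled performance per candidate value of C
print(fit$results[, c("C", "Accuracy", "Kappa")])
print(fit$bestTune)
```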

Prediction:

test_pre <- predict(svm_lineal,newdata = spam_test)

Confusion matrix:

The confusion matrix summarizes how the emails in the test set were classified.

confusionMatrix(test_pre,spam_test$email_class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  477    3
##       spam   1  498
##                                           
##                Accuracy : 0.9959          
##                  95% CI : (0.9896, 0.9989)
##     No Information Rate : 0.5117          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9918          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##             Sensitivity : 0.9979          
##             Specificity : 0.9940          
##          Pos Pred Value : 0.9937          
##          Neg Pred Value : 0.9980          
##              Prevalence : 0.4883          
##          Detection Rate : 0.4872          
##    Detection Prevalence : 0.4903          
##       Balanced Accuracy : 0.9960          
##                                           
##        'Positive' Class : ham             
##

According to the confusion matrix, the model predicts an email's class with 99.59% accuracy on the held-out test set (975 of 979 emails classified correctly).
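
The headline statistics can be recomputed by hand from the four cell counts in the matrix above, which is a useful sanity check. A minimal base R sketch, treating `ham` as the positive class as the output does:

```r
# Cell counts copied from the confusion matrix (rows: prediction, columns: reference)
cm <- matrix(
  c(477, 1,    # reference ham:  predicted ham, predicted spam
    3, 498),   # reference spam: predicted ham, predicted spam
  nrow = 2,
  dimnames = list(Prediction = c("ham", "spam"),
                  Reference  = c("ham", "spam"))
)

accuracy    <- sum(diag(cm)) / sum(cm)               # correct / all
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])   # ham correctly kept as ham
specificity <- cm["spam", "spam"] / sum(cm[, "spam"]) # spam correctly flagged

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```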

Email Data Visuals

This section offers a brief visual summary of the spam and ham data.

#labeled dataframe
cl_edata <- rbind(spam_train,spam_test)
#long format dataframe
emails_dt <- dtm_training %>% tidy() 
spam_freq <- emails_dt[emails_dt$document %in% spam_filename,] %>% top_n(50)

## Selecting by count

There are about the same number of spam and ham emails in the directory.

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Wordcloud

As shown in the wordcloud, the most frequent words in spam emails are those related to HTML markup and font attributes.

Lewris Mota

April 2, 2019