Instructions

It can be useful to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/


Introduction

All of the data used in this project can be found in my GitHub repository.

Load the Required Packages
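The loading chunk itself is not echoed in this output, but judging from the attach messages below, it loads roughly the following packages. This is a minimal sketch; wordcloud and e1071 are assumptions inferred from the RColorBrewer message and from the model named in the conclusion.

```r
library(tidyverse)    # dplyr, ggplot2, readr, stringr, lubridate, etc.
library(tm)           # loads NLP; corpus tools, tm_map(), DocumentTermMatrix()
library(magrittr)     # extra pipe operators
library(data.table)
library(quanteda)     # additional text-analysis utilities
library(wordcloud)    # assumption: would explain the RColorBrewer message below
library(caret)        # loads lattice; provides confusionMatrix()
library(gm)
library(e1071)        # assumption: naiveBayes(), used in the model section
```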

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: NLP
## 
## 
## Attaching package: 'NLP'
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## 
## 
## Attaching package: 'magrittr'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## 
## 
## Attaching package: 'data.table'
## 
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
## 
## 
## Package version: 3.3.1
## Unicode version: 14.0
## ICU version: 70.1
## 
## Parallel computing: 4 of 4 threads used.
## 
## See https://quanteda.io for tutorials and examples.
## 
## 
## Attaching package: 'quanteda'
## 
## 
## The following object is masked from 'package:tm':
## 
##     stopwords
## 
## 
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
## 
## 
## Loading required package: RColorBrewer
## 
## Loading required package: lattice
## 
## 
## Attaching package: 'caret'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## 
## 
## Attaching package: 'gm'
## 
## 
## The following object is masked from 'package:magrittr':
## 
##     add
## 
## 
## The following object is masked from 'package:lubridate':
## 
##     show
## 
## 
## The following object is masked from 'package:methods':
## 
##     show

Loading Files From Desktop
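This chunk is also hidden in the knitted output. Below is a sketch of how the ham and spam messages might be read in, consistent with the file counts and the readr parsing warning shown after it; the directory names easy_ham and spam_2 are assumptions for wherever the unzipped SpamAssassin folders live on disk.

```r
# Directory names are assumptions: point these at the unzipped SpamAssassin folders
ham_dir  <- "easy_ham"
spam_dir <- "spam_2"

ham_files  <- list.files(ham_dir,  full.names = TRUE)
spam_files <- list.files(spam_dir, full.names = TRUE)

length(spam_files)    # one of the counts printed below
length(ham_files)     # the other count printed below

# Read each message into a list-column of raw lines; read_lines() emits the
# parsing warning shown below for messages with unusual encodings
ham <- tibble(file = ham_files) %>%
  mutate(text = lapply(ham_files, read_lines), class = "ham", spam = 0)

spam <- tibble(file = spam_files) %>%
  mutate(text = lapply(spam_files, read_lines), class = "spam", spam = 1)
```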

## [1] 1397
## [1] 2501
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `text = lapply(spam_files, read_lines)`.
## Caused by warning:
## ! One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

Tidying Data and Creating Corpus

In this section, I use the rbind() function to merge the contents of ‘ham’ and ‘spam’; rbind() combines vectors, matrices, or data frames by rows. To organize the data, I applied select() to keep the variables of interest: class, spam, file, and text. Cleaning the combined ‘ham_spam’ data involved using str_replace() to collapse extra whitespace, and a content_transformer() function to replace punctuation with spaces.

I relied heavily on the tm package, using tm_map() to apply each cleaning function to the entire corpus. From the cleaned corpus I then built a document-term matrix, which records term frequencies across the document collection: each row represents a document, each column represents a term, and each value is that term’s frequency in that document. Finally, I used the removeSparseTerms() function to drop infrequently occurring terms, keeping only words that appear in at least 10 documents. A sketch of this pipeline is shown below.
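The sketch assumes the combined data frame is named ham_spam, reconstructs the replacePunctuation transformer from the warning messages below, and uses a sparsity threshold chosen to match the “at least 10 documents” rule; the author’s exact values may differ.

```r
# Combine ham and spam by rows and keep the columns of interest
ham_spam <- rbind(ham, spam) %>%
  select(class, spam, file, text) %>%
  mutate(text = map_chr(text, ~ str_c(.x, collapse = " ")),  # collapse each message to one string
         text = str_replace_all(text, "\\s+", " "))          # squeeze extra whitespace

# Custom transformer that swaps punctuation for spaces
replacePunctuation <- content_transformer(function(x) gsub("[[:punct:]]+", " ", x))

# Build the corpus and apply the cleaning steps (these produce the warnings below)
corpus <- Corpus(VectorSource(ham_spam$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(replacePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace)

# Document-term matrix; drop terms appearing in fewer than ~10 documents
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 1 - 10 / length(corpus))   # exact threshold is an assumption
inspect(dtm)
dim(dtm)
```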

## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., replacePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
## <<DocumentTermMatrix (documents: 3898, terms: 6607)>>
## Non-/sparse entries: 555825/25198261
## Sparsity           : 98%
## Maximal term length: 33
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   com font fork http list localhost net org received spamassassin
##   166  175  198    0  141    2         5  11  16        4           11
##   2529 149 1627    0   80    4         0   8   5        2            0
##   2552 167   41    0   83   12         0  14   0        2            0
##   2578 165   41    0   93    0         0  24   4        2            0
##   3580  19    0    0    9    5         5  25   2        8            3
##   3591 204 1102    0  516    5         6  43   4        7            1
##   3592 204 1102    0  516    2         6  43   4        7            1
##   570   15    0   15    3    7         8   0   9        6            5
##   670   17    0   17    3    7         7   0   8        7            5
##   677   21    0   19    3    8         6   0   9        7            6
## [1] 3898 6607

Creating Training and Testing Sets

First, I transformed the document-term matrix generated above into a data frame and added a new column labeling each document/row as either ham or spam, converting that label into a factor. To divide the data into training and testing sets, I randomly selected 80% of the rows for training and kept the remaining 20% for testing. I then calculated the proportions of ham to spam in each split: the training data is roughly 65% ham and 35% spam, and the testing data roughly 62% ham and 38% spam. A sketch of these steps follows.
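This is a sketch under stated assumptions; the seed value and object names are illustrative rather than the author’s.

```r
# Convert the cleaned DTM to a data frame and attach the class labels
dtm_df <- as.data.frame(as.matrix(dtm))
dtm_df$spam_label <- factor(ham_spam$class, levels = c("ham", "spam"))

# Randomly assign 80% of rows to training and the remaining 20% to testing
set.seed(123)   # seed value is an assumption
train_idx <- sample(nrow(dtm_df), size = floor(0.8 * nrow(dtm_df)))

train <- dtm_df[train_idx, ]
test  <- dtm_df[-train_idx, ]

train_labels <- train$spam_label
test_labels  <- test$spam_label

# Proportion of ham vs. spam in each split (printed below)
prop.table(table(train_labels))
prop.table(table(test_labels))
```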

## train_labels
##       ham      spam 
## 0.6478512 0.3521488
## test_labels
##       ham      spam 
## 0.6166667 0.3833333

Model Training
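The conclusion names Naive Bayes from the e1071 package. The sketch below assumes the common step of converting term counts to Yes/No factors before fitting, and uses caret’s confusionMatrix() with spam as the positive class to produce the summary shown below.

```r
# Convert raw term counts to a categorical Yes/No indicator; Naive Bayes works
# well with categorical predictors (this conversion step is an assumption)
convert_counts <- function(x) factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))

predictor_cols <- setdiff(names(train), "spam_label")
train_x <- as.data.frame(lapply(train[, predictor_cols], convert_counts))
test_x  <- as.data.frame(lapply(test[,  predictor_cols], convert_counts))

# Fit the Naive Bayes classifier and predict the held-out 20%
nb_model <- e1071::naiveBayes(train_x, train_labels)
nb_pred  <- predict(nb_model, test_x)

# Confusion matrix and summary statistics (output below), with spam as the positive class
caret::confusionMatrix(data = nb_pred, reference = test_labels,
                       positive = "spam", dnn = c("Prediction", "Actual"))
```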

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction ham spam
##       ham  472  132
##       spam   9  167
##                                           
##                Accuracy : 0.8192          
##                  95% CI : (0.7904, 0.8456)
##     No Information Rate : 0.6167          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5854          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5585          
##             Specificity : 0.9813          
##          Pos Pred Value : 0.9489          
##          Neg Pred Value : 0.7815          
##              Prevalence : 0.3833          
##          Detection Rate : 0.2141          
##    Detection Prevalence : 0.2256          
##       Balanced Accuracy : 0.7699          
##                                           
##        'Positive' Class : spam            
## 

Conclusion

Using the Naive Bayes model from the e1071 package, we were able to correctly classify roughly 82% of the emails. With spam as the positive class, the sensitivity of about 56% means that 56% of the spam emails were classified correctly, while the specificity of about 98% means that 98% of the ham emails were classified correctly, for a balanced accuracy of about 77%.