It can be useful to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/
Here are two short videos that you may find helpful. The first video shows how to unzip the provided files.
All data can be found in my GitHub repository.
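Before anything else, the corpus has to be on disk. Below is a minimal sketch of one way to download and extract two of the SpamAssassin archives from R; the archive names, the `data/` folder, and the extracted directory names are assumptions, so adjust them to match whichever files you actually pull (or grab everything from the GitHub repository instead).

```r
# Download and extract two SpamAssassin archives (names are assumptions).
base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
archives <- c("20030228_easy_ham.tar.bz2", "20050311_spam_2.tar.bz2")

for (f in archives) {
  download.file(paste0(base_url, f), destfile = f, mode = "wb")
  untar(f, exdir = "data")   # untar() handles .tar.bz2 directly
}

# Assumed extraction paths; check what the archives actually unpack into.
ham_files  <- list.files("data/easy_ham", full.names = TRUE)
spam_files <- list.files("data/spam_2",   full.names = TRUE)
```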
Packages used for this analysis: the tidyverse (2.0.0), tm (with NLP), magrittr, data.table, quanteda (3.3.1), RColorBrewer, caret (with lattice), and gm. Counting the files read in from each corpus folder:
## [1] 1397
## [1] 2501
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `text = lapply(spam_files, read_lines)`.
## Caused by warning:
## ! One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
In this section, I use the rbind() function to merge the contents of ‘ham’ and ‘spam’; rbind() combines vectors, matrices, or data frames by rows. To organize the data, I applied select() to keep the variables of interest: class, spam, file, and text. Cleaning the resulting ‘ham_spam’ data frame involved str_replace() to collapse extra whitespace, and a content_transformer() wrapper to swap punctuation for spaces.
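The block below is a rough sketch of how that combining step could look; `ham_files` and `spam_files` are assumed path vectors (as in the download sketch above), and the exact column construction is illustrative rather than the code actually used.

```r
library(tidyverse)

# Read each message into a single text field and label its class.
ham <- tibble(file = ham_files) %>%
  mutate(class = "ham",  spam = 0,
         text  = map_chr(file, ~ paste(read_lines(.x), collapse = " ")))

spam <- tibble(file = spam_files) %>%
  mutate(class = "spam", spam = 1,
         text  = map_chr(file, ~ paste(read_lines(.x), collapse = " ")))

# Stack the two data frames by rows, keep the variables of interest,
# and collapse runs of whitespace into single spaces.
ham_spam <- rbind(ham, spam) %>%
  select(class, spam, file, text) %>%
  mutate(text = str_replace_all(text, "\\s+", " "))
```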
I relied heavily on the tm package, employing tm_map() to apply cleaning functions to the entire corpus: lowercasing, removing English stop words, replacing punctuation, removing numbers, and stripping whitespace. The cleaned corpus was then converted into a document-term matrix, which records term frequencies across the collection: each row represents a document, each column a term, and each cell the frequency of that term in that document (a term-document matrix is simply the transpose). Finally, I utilized removeSparseTerms() to drop infrequently appearing terms, keeping only words that appear in at least 10 documents.
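Here is a sketch of that cleaning pipeline under the same assumptions as above; `replacePunctuation` is a hypothetical content_transformer() wrapper named to match the warnings that follow, and the sparsity cutoff is back-computed from the "at least 10 documents" rule.

```r
library(tm)
library(magrittr)

# Hypothetical transformer: swap punctuation for spaces rather than deleting it.
replacePunctuation <- content_transformer(function(x) gsub("[[:punct:]]+", " ", x))

corpus <- Corpus(VectorSource(ham_spam$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, tm::stopwords("english")) %>%
  tm_map(replacePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace)

dtm <- DocumentTermMatrix(corpus)

# Keep only terms appearing in at least 10 documents
# (sparsity threshold derived from the document count).
dtm <- removeSparseTerms(dtm, 1 - 10 / nrow(dtm))
dim(dtm)
```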
## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., replacePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
## <<DocumentTermMatrix (documents: 3898, terms: 6607)>>
## Non-/sparse entries: 555825/25198261
## Sparsity : 98%
## Maximal term length: 33
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs com font fork http list localhost net org received spamassassin
## 166 175 198 0 141 2 5 11 16 4 11
## 2529 149 1627 0 80 4 0 8 5 2 0
## 2552 167 41 0 83 12 0 14 0 2 0
## 2578 165 41 0 93 0 0 24 4 2 0
## 3580 19 0 0 9 5 5 25 2 8 3
## 3591 204 1102 0 516 5 6 43 4 7 1
## 3592 204 1102 0 516 2 6 43 4 7 1
## 570 15 0 15 3 7 8 0 9 6 5
## 670 17 0 17 3 7 7 0 8 7 5
## 677 21 0 19 3 8 6 0 9 7 6
## [1] 3898 6607
Next, I transformed the document-term matrix generated above into a data frame and added a new column labelling each document/row as spam or ham; this label column was converted into a factor. The data were then split at random into an 80% training set and a 20% testing set. Checking the class proportions shows the two splits are similar: roughly 65% ham / 35% spam in the training data and 62% ham / 38% spam in the testing data.
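A sketch of the split and the proportion check, assuming the `dtm` and `ham_spam` objects from the earlier sketches; the seed value and object names are placeholders.

```r
set.seed(123)                            # assumed seed, for reproducibility

dtm_df <- as.data.frame(as.matrix(dtm))  # one row per document, one column per term
labels <- factor(ham_spam$class)         # "ham" / "spam" label per document

# 80/20 random split into training and testing sets
train_idx    <- sample(seq_len(nrow(dtm_df)), size = 0.8 * nrow(dtm_df))
train_data   <- dtm_df[train_idx, ];  train_labels <- labels[train_idx]
test_data    <- dtm_df[-train_idx, ]; test_labels  <- labels[-train_idx]

# Class balance in each split
prop.table(table(train_labels))
prop.table(table(test_labels))
```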
## train_labels
## ham spam
## 0.6478512 0.3521488
## test_labels
## ham spam
## 0.6166667 0.3833333
## Confusion Matrix and Statistics
##
## Actual
## Prediction ham spam
## ham 472 132
## spam 9 167
##
## Accuracy : 0.8192
## 95% CI : (0.7904, 0.8456)
## No Information Rate : 0.6167
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5854
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5585
## Specificity : 0.9813
## Pos Pred Value : 0.9489
## Neg Pred Value : 0.7815
## Prevalence : 0.3833
## Detection Rate : 0.2141
## Detection Prevalence : 0.2256
## Balanced Accuracy : 0.7699
##
## 'Positive' Class : spam
##
As tested using the Naive Bayes model from the e1071 package, roughly 82% of the emails were assigned to the proper categories. The 56% sensitivity means that 56% of the spam emails were classified correctly, and the 98% specificity means that 98% of the ham emails were classified correctly. In other words, the model rarely flags ham as spam (only 9 ham messages were misclassified), but it misses a substantial share of spam (132 spam messages slipped through as ham).
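For reference, here is a sketch of how the model fit and the confusion matrix above could be produced from the train/test objects in the previous sketch; the presence/absence recoding of term counts is a common choice for Naive Bayes on a document-term matrix and is an assumption here, not a detail reported above.

```r
library(e1071)
library(caret)

# Assumed preprocessing: recode term counts as a two-level factor (absent/present).
to_yes_no <- function(x) factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))
train_nb  <- as.data.frame(lapply(train_data, to_yes_no))
test_nb   <- as.data.frame(lapply(test_data,  to_yes_no))

# Fit the Naive Bayes classifier and score the held-out 20%
nb_model    <- naiveBayes(train_nb, train_labels)
predictions <- predict(nb_model, test_nb)

confusionMatrix(predictions, test_labels,
                positive = "spam",
                dnn = c("Prediction", "Actual"))
```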