Author: Romerl Elizes

I. Preface

For this project, I used the 20030228_easy_ham.tar.bz2 and 20030228_spam.tar.bz2 files. After each archive was unzipped, the easy_ham directory contained 2,500 files and the spam directory contained 501 files (see the listing in Section II).

A. Summary

This project was difficult, but I nevertheless met its basic requirements. In this project, I completed the following tasks:

  • Downloaded all the files from the spam and easy_ham archives dated 2/28/2003. I placed all the ham files in an easy_ham directory and all the spam files in a spam directory.

  • Downloaded some test files: 3 ham text files and 3 spam text files from my work email application. For ease of testing, I eliminated most header information.

  • For the first ham data frame, I opened all ham files, cleaned them, and produced an intermediate data frame containing all bi-grams found in all of the ham files.

  • For the first spam data frame, I opened all spam files, cleaned them, and produced an intermediate data frame containing all bi-grams found in all of the spam files.

  • Using tidytext and dplyr functions, I sorted the intermediate ham data frame to show the top 30 bi-grams, their frequencies, and total counts. A table displays the top 30 bi-grams minus stop words.

  • Using tidytext and dplyr functions, I sorted the intermediate spam data frame to show the top 30 bi-grams, their frequencies, and total counts. A table displays the top 30 bi-grams minus stop words.

  • Going a little further, I used the dplyr anti_join function to extract the top 30 bi-grams in the intermediate ham data frame that do not exist in the intermediate spam data frame. A table displays the top 30 bi-grams that exist in the ham data frame but not in the spam data frame.

  • Going a little further, I used the dplyr anti_join function to extract the top 30 bi-grams in the intermediate spam data frame that do not exist in the intermediate ham data frame. A table displays the top 30 bi-grams that exist in the spam data frame but not in the ham data frame.

  • I opened each test file and counted the occurrences of the ham and spam bi-grams from each of the filtered data frames. I stored the counts in a final data frame and displayed a table showing the final results.

  • The final results will be discussed at the end of the project.

You will notice that I did an anti_join. Although not asked for, I thought it would be useful for the analysis portion of the project.

B. Caveats and Limitations

Caveats to this Project:

  • I did not use tm and its functions for this project. Instead, I focused on using the tidytext functions described in the book “Text Mining with R” by J. Silge and D. Robinson [SIL].
  • I only tested 6 files from my work email account. I could have tested more, but I do not have the facility to export a large number of files for this project.
  • I only counted the top 30 bi-grams in each of the data frames containing the frequencies of the ham and spam “bag of words” terms.
  • For this project, I used bi-grams (word pairs) to meet the project requirements; I felt they would be more useful than counting all instances of single words.

II. Work to Create Data Frames for Ham and Spam Evaluations

A. List all the files of the ham and spam folders

This is just a test to make sure that I can list both ham and spam files.
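As a quick sketch of this check (the directory names easy_ham and spam are assumptions based on the Preface):

hamfiles <- list.files("easy_ham")
length(hamfiles)        # count of ham files
head(hamfiles, 10)      # first ten ham file names
spamfiles <- list.files("spam")
length(spamfiles)       # count of spam files
head(spamfiles, 10)     # first ten spam file names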

## [1] 2500
##  [1] "00001.7c53336b37003a9286aba55d2945844c"
##  [2] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
##  [3] "00003.860e3c3cee1b42ead714c5c874fe25f7"
##  [4] "00004.864220c5b6930b209cc287c361c99af1"
##  [5] "00005.bf27cdeaf0b8c4647ecd61b1d09da613"
##  [6] "00006.253ea2f9a9cc36fa0b1129b04b806608"
##  [7] "00007.37a8af848caae585af4fe35779656d55"
##  [8] "00008.5891548d921601906337dcf1ed8543cb"
##  [9] "00009.371eca25b0169ce5cb4f71d3e07b9e2d"
## [10] "00010.145d22c053c1a0c410242e46c01635b3"
## [1] 501
##  [1] "00001.7848dde101aa985090474a91ec93fcf0"
##  [2] "00002.d94f1b97e48ed3b553b3508d116e6a09"
##  [3] "00003.2ee33bc6eacdb11f38d052c44819ba6c"
##  [4] "00004.eac8de8d759b7e74154f142194282724"
##  [5] "00005.57696a39d7d84318ce497886896bf90d"
##  [6] "00006.5ab5620d3d7c6c0db76234556a16f6c1"
##  [7] "00007.d8521faf753ff9ee989122f6816f87d7"
##  [8] "00008.dfd941deb10f5eed78b1594b131c9266"
##  [9] "00009.027bf6e0b0c4ab34db3ce0ea4bf2edab"
## [10] "00010.445affef4c70feec58f9198cfbc22997"

B. Load all ham files and put bi-gram contents in an intermediate data frame.

If you look at the source code, I used a for-loop to open each ham file and read its contents, storing each line in a vector. During my experimentation I noticed that the header and the body of each file were separated by an empty line, so I found that first empty line and used its index as the starting point for reading the file contents. I then lower-cased each line, removed some email header strings, and concatenated all of the file contents into one string. For ease of use I removed all strings that begin with < and end with >, in case the body contained XML or HTML tags. It is not a perfectly clean process, but given the limited time I had for this project, it was the best option under the circumstances. For the intermediate data frame, I used the unnest_tokens function to extract all bi-grams from all of the ham files.
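A minimal sketch of this loop, assuming the ham files sit in an easy_ham directory (the directory and variable names here are illustrative assumptions):

library(dplyr)
library(tidytext)

hamfolder <- "easy_ham"
hamtext <- c()
for (f in list.files(hamfolder)) {
  tmp <- readLines(paste(hamfolder, f, sep = "/"))
  # The header and body are separated by the first empty line;
  # keep everything from that line to the end of the file
  body <- tolower(tmp[min(which(tmp == "")):length(tmp)])
  # Crude tag removal: strip anything that begins with < and ends with >
  body <- gsub("<[^>]*>", "", body)
  hamtext <- c(hamtext, paste(body, collapse = " "))
}

# One row per message; unnest_tokens then splits the text into bi-grams
ham_df <- data.frame(text = hamtext, stringsAsFactors = FALSE)
ham_bigrams <- ham_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)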

C. Sort Ham data frame to show count of bi-grams and frequency

I sorted the intermediate ham data frame by descending bi-gram count, filtered out stop words, and added a calculated frequency column. I then kept only the top 30 bi-grams by count, placed them in another data frame, and displayed the results in the table below.
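A minimal sketch of this step, assuming the ham_bigrams frame from the previous sketch. The exact stop-word filter in the source is not shown; the one below (an assumption) drops a bi-gram only when the whole string matches a single stop word, which would explain why common word pairs such as “that the” still appear in the table:

ham_sorted <- ham_bigrams %>%
  count(bigram, sort = TRUE) %>%             # descending count of bi-grams
  filter(!bigram %in% stop_words$word) %>%   # assumed stop-word filter
  mutate(freq = n / sum(n)) %>%              # calculated frequency column
  slice(1:30)                                # keep only the top 30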

bigram n freq
mailing list 787 0.0015007
mailman listinfo 651 0.0012414
url http 628 0.0011975
rpm list 538 0.0010259
that the 498 0.0009496
in a 493 0.0009401
this is 480 0.0009153
it is 473 0.0009020
at the 456 0.0008695
spamassassin talk 425 0.0008104
from the 403 0.0007685
of a 390 0.0007437
i don’t 386 0.0007361
is the 384 0.0007322
i have 381 0.0007265
http www.newsisfree.com 352 0.0006712
the same 349 0.0006655
with a 329 0.0006274
it was 325 0.0006197
i think 322 0.0006140
razor users 319 0.0006083
is not 318 0.0006064
as a 312 0.0005950
to do 307 0.0005854
have a 305 0.0005816
would be 301 0.0005740
to get 299 0.0005702
for a 298 0.0005683
www.newsisfree.com click 298 0.0005683
https lists.sourceforge.net 297 0.0005663

D. Load all spam files and put bi-gram contents in an intermediate data frame.

The spam files were processed the same way as the ham files: a for-loop opens each spam file, reads its lines into a vector, finds the first empty line that separates the header from the body, and keeps everything from that point on. Each line is lower-cased, some email header strings are removed, and all of the contents are concatenated into one string, with anything that begins with < and ends with > stripped out in case the body contained XML or HTML tags. The sketch shown for the ham files applies here with the spam directory substituted. For the intermediate data frame, I again used the unnest_tokens function to extract all bi-grams from all of the spam files.

E. Sort Spam data frame to show count of bi-grams and frequency

I applied the same steps to the intermediate spam data frame: sorted by descending bi-gram count, excluded stop words, added a calculated frequency column, kept the top 30 bi-grams by count, and displayed the results in the table below.

bigram n freq
e mail 360 0.0017939
this is 294 0.0014650
click here 287 0.0014301
you have 256 0.0012756
do not 245 0.0012208
you are 241 0.0012009
to receive 214 0.0010664
of this 192 0.0009567
wish to 187 0.0009318
will be 185 0.0009219
for a 176 0.0008770
you will 169 0.0008421
content type 166 0.0008272
to this 162 0.0008072
be removed 160 0.0007973
content transfer 153 0.0007624
i am 153 0.0007624
transfer encoding 153 0.0007624
this message 150 0.0007474
type text 148 0.0007375
you to 148 0.0007375
of our 146 0.0007275
for your 140 0.0006976
one of 140 0.0006976
this email 140 0.0006976
3d 3d 138 0.0006877
how to 137 0.0006827
that you 134 0.0006677
it is 133 0.0006627
removed from 129 0.0006428

F. Find out which bi-grams do not appear in both data frames

This is an extra step. Instead of relying on stop words alone, would it be possible to display all bi-grams that exist in the ham dataset but not in the spam dataset, and conversely, all bi-grams that exist in the spam dataset but not in the ham dataset? The best method I found in my research was the dplyr anti_join function. Why do this? Perhaps my initial classification was faulty. This additional step helps me programmatically arrive at ham and spam bi-gram count data frames that give a clearer prediction model for my test files.
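A minimal sketch of the two anti_join calls, assuming the top-30 tables are named ham_sorted and spam_sorted as in the earlier sketches; the two tables that follow show the results:

# Ham bi-grams with no counterpart in the spam table, and vice versa
ham_only  <- ham_sorted %>% anti_join(spam_sorted, by = "bigram")
spam_only <- spam_sorted %>% anti_join(ham_sorted, by = "bigram")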

Ham bi-grams not found in the spam data frame:

bigram n freq
mailing list 787 0.0015007
mailman listinfo 651 0.0012414
url http 628 0.0011975
rpm list 538 0.0010259
that the 498 0.0009496
in a 493 0.0009401
at the 456 0.0008695
spamassassin talk 425 0.0008104
from the 403 0.0007685
of a 390 0.0007437
i don’t 386 0.0007361
is the 384 0.0007322
i have 381 0.0007265
http www.newsisfree.com 352 0.0006712
the same 349 0.0006655
with a 329 0.0006274
it was 325 0.0006197
i think 322 0.0006140
razor users 319 0.0006083
is not 318 0.0006064
as a 312 0.0005950
to do 307 0.0005854
have a 305 0.0005816
would be 301 0.0005740
to get 299 0.0005702
www.newsisfree.com click 298 0.0005683
https lists.sourceforge.net 297 0.0005663
Spam bi-grams not found in the ham data frame:

bigram n freq
e mail 360 0.0017939
click here 287 0.0014301
you have 256 0.0012756
do not 245 0.0012208
you are 241 0.0012009
to receive 214 0.0010664
of this 192 0.0009567
wish to 187 0.0009318
will be 185 0.0009219
you will 169 0.0008421
content type 166 0.0008272
to this 162 0.0008072
be removed 160 0.0007973
content transfer 153 0.0007624
i am 153 0.0007624
transfer encoding 153 0.0007624
this message 150 0.0007474
type text 148 0.0007375
you to 148 0.0007375
of our 146 0.0007275
for your 140 0.0006976
one of 140 0.0006976
this email 140 0.0006976
3d 3d 138 0.0006877
how to 137 0.0006827
that you 134 0.0006677
removed from 129 0.0006428

III. Test Some Files

Per the Summary, I tested only six emails: 3 ham and 3 spam, derived from my work email account.

A. Create Vectors for all Bigram Data Frames

I extracted the bi-gram strings from my initial ham and spam count tables, and likewise from my anti_join results.
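A minimal sketch of the extraction, assuming the data frame names from the earlier sketches; the four vector names are the ones the test loop below expects:

ham_bigram_strings   <- ham_sorted$bigram    # original ham top 30
ham_bigram_strings2  <- ham_only$bigram      # ham anti_join result
spam_bigram_strings  <- spam_sorted$bigram   # original spam top 30
spam_bigram_strings2 <- spam_only$bigram     # spam anti_join result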

B. Load all test files and summarize counts for all occurrences of ham and spam bi-gram values

My intention was to create a data frame listing each test file from my work directory, whether spam or ham, together with its match counts against the original ham result, the ham anti_join result, the original spam result, and the spam anti_join result. Using nested for-loops, I iterated over each file and searched for the bi-gram values from each data frame, then stored the counts in a final table for display.

# Packages used below: dplyr for %>%, knitr for kable, kableExtra for styling
library(dplyr)
library(knitr)
library(kableExtra)

dfTestFiles <- data.frame(filename = character(), numhamval = integer(),
                          numhamval2 = integer(), numspamval = integer(),
                          numspamval2 = integer())

testfiles <- list.files(testfolder)
for (i in seq_along(testfiles)) {
  numhamval   <- 0
  numhamval2  <- 0
  numspamval  <- 0
  numspamval2 <- 0

  filename <- testfiles[i]
  filenamepath <- paste(testfolder, filename, sep = "/")

  # Put file contents into a vector, one element per line
  tmp <- readLines(filenamepath)

  # The header and body are separated by the first empty line;
  # keep everything from that line to the end of the file
  minidx <- min(which(tmp == ""))
  tmp <- tmp[minidx:length(tmp)]

  for (j in 1:length(tmp)) {
    strvalue <- tolower(tmp[j])

    # Blank out residual header lines
    if (grepl("date:", strvalue)) strvalue <- ""
    if (grepl("from:", strvalue)) strvalue <- ""
    if (grepl("message-id:", strvalue)) strvalue <- ""

    # Count lines matching each bi-gram list; fixed = TRUE treats the
    # bi-grams as literal strings rather than regular expressions
    for (m in 1:length(ham_bigram_strings)) {
      if (grepl(ham_bigram_strings[m], strvalue, fixed = TRUE))
        numhamval <- numhamval + 1
    }
    for (m in 1:length(ham_bigram_strings2)) {
      if (grepl(ham_bigram_strings2[m], strvalue, fixed = TRUE))
        numhamval2 <- numhamval2 + 1
    }
    for (m in 1:length(spam_bigram_strings)) {
      if (grepl(spam_bigram_strings[m], strvalue, fixed = TRUE))
        numspamval <- numspamval + 1
    }
    for (m in 1:length(spam_bigram_strings2)) {
      if (grepl(spam_bigram_strings2[m], strvalue, fixed = TRUE))
        numspamval2 <- numspamval2 + 1
    }
  }
  dftemp <- data.frame(filename, numhamval, numhamval2, numspamval, numspamval2)
  dfTestFiles <- rbind(dfTestFiles, dftemp)
}

dfTestFiles %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "300px")
filename numhamval numhamval2 numspamval numspamval2
hamfile1.txt 5 5 1 1
hamfile2.txt 2 0 3 1
hamfile3.txt 1 1 0 0
spammail1.txt 5 5 7 7
spammail2.txt 7 5 7 5
spammail3.txt 0 0 2 2

C. Analysis

The biggest limitation of this analysis is the small amount of test data used to verify the classifications produced by the resultant ham and spam data frames. In previous research experiments I have conducted, larger test datasets made it much clearer whether the results aligned with my findings.

Nevertheless, some interesting observations can be gleaned from testing the six files (a short computational sketch follows this list):

  • hamfile1.txt - Summing the original and anti_join counts gives 10 ham bi-gram matches to 2 spam matches (a 5-to-1 ratio). Based on this, the ham data frames correctly predicted the identity of this ham email file.

  • hamfile2.txt - Summing the original and anti_join counts gives 2 ham matches to 4 spam matches. Based on this, the ham data frames did not correctly predict the identity of this ham email file.

  • hamfile3.txt - Summing the original and anti_join counts gives 2 ham matches to 0 spam matches. Based on this, the ham data frames correctly predicted the identity of this ham email file.

  • spammail1.txt - Summing the original and anti_join counts gives 10 ham matches to 14 spam matches. Based on this, the spam data frames correctly predicted the identity of this spam email file.

  • spammail2.txt - Summing the original and anti_join counts gives 12 ham matches to 12 spam matches (a 1-to-1 tie). Based on this, the spam data frames did NOT correctly predict the identity of this spam email file.

  • spammail3.txt - Summing the original and anti_join counts gives 0 ham matches to 4 spam matches. Based on this, the spam data frames correctly predicted the identity of this spam email file.
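These per-file comparisons can also be computed directly from the final data frame; a minimal sketch, where the decision rule (larger summed count wins, ties are inconclusive) restates the comparisons above:

dfTestFiles %>%
  mutate(hamtotal  = numhamval + numhamval2,
         spamtotal = numspamval + numspamval2,
         predicted = case_when(hamtotal > spamtotal ~ "ham",
                               spamtotal > hamtotal ~ "spam",
                               TRUE ~ "tie"))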

It would be useful in a future iteration of this investigation to test hundreds of files to see whether the ham and spam classifications hold up. As it stands, the ham data frame results correctly predicted the identity of a ham file 2 times out of 3 (67%), and the spam data frame results correctly predicted the identity of a spam file 2 times out of 3 (67%).

IV. References

[LIS] List of word frequencies using R. Retrieved from website: https://stackoverflow.com/questions/18101047/list-of-word-frequencies-using-r

[REM] Remove long complex html tags from strings in R. Retrieved from website: https://stackoverflow.com/questions/43572086/remove-long-complex-html-tags-from-strings-in-r

[SIL] Silge, J. and Robinson, D. Text Mining with R. Retrieved from website: https://www.tidytextmining.com/