Author: Romerl Elizes

I. Preface

For this project, I used the 20030228_easy_ham.tar.bz2 and 20030228_spam.tar.bz2 files. After each archive was unzipped, the easy_ham directory contained 2,500 files and the spam directory contained 501 files (see the listing in Section II).

A. Summary

This project was difficult, but I nevertheless met its basic requirements. In this project, I completed the following tasks:

  • Downloaded all the files from the spam and easy_ham archives dated 2/28/2003. I placed all the ham files in an easy_ham directory and all the spam files in a spam directory.

  • Downloaded some test files: 3 ham text files and 3 spam text files from my work email application. For ease of testing, I eliminated most header information.

  • For the first ham data frame, I opened all ham files, cleaned them, and produced an intermediate data frame containing all bi-grams found in all of the ham files.

  • For the first spam data frame, I opened all spam files, cleaned them, and produced an intermediate data frame containing all bi-grams found in all of the spam files.

  • Using tidytext and dplyr functions, I sorted the intermediate ham data frame to show the top 30 bi-grams, their frequencies, and total counts. A table displays the top 30 bi-grams minus stop words.

  • Using tidytext and dplyr functions, I sorted the intermediate spam data frame to show the top 30 bi-grams, their frequencies, and total counts. A table displays the top 30 bi-grams minus stop words.

  • Going a little further, I used the dplyr anti_join function to extract the top 30 bi-grams in the intermediate ham data frame that do not exist in the intermediate spam data frame. A table displays the top 30 bi-grams that exist in the ham data frame but not in the spam data frame.

  • Going a little further, I used the dplyr anti_join function to extract the top 30 bi-grams in the intermediate spam data frame that do not exist in the intermediate ham data frame. A table displays the top 30 bi-grams that exist in the spam data frame but not in the ham data frame.

  • I opened each test file and counted the occurrences of the ham and spam bi-grams from each of the filtered data frames. I stored the counts in a final data frame and displayed a table showing the final results.

  • The final results will be discussed at the end of the project.

You will notice that I did an anti_join. Although not asked for, I thought it would be useful for the analysis portion of the project.

B. Caveats and Limitations

Caveats to this Project:

  • I did not use tm and its functions for this project. Instead, I focused on using the tidytext functions described in the book “Text Mining with R” by J. Silge and D. Robinson [SIL].
  • I only tested 6 files from my work email account. I could have tested more, but I do not have the facility to export a large number of files for this project.
  • I only counted the top 30 bi-grams in each of the data frames containing the frequencies of the ham and spam “bag of words” terms.
  • For this project, I used bi-grams (word pairs) to meet the project requirements; I felt they would be more useful than counting all instances of single words.

II. Work to Create Data Frames for Ham and Spam Evaluations

A. List all the files of the ham and spam folders

This is just a test to make sure that I can list both ham and spam files.
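As a quick sketch of this check (the directory names easy_ham and spam are assumptions based on the Preface):

hamfiles <- list.files("easy_ham")
length(hamfiles)        # count of ham files
head(hamfiles, 10)      # first ten ham file names
spamfiles <- list.files("spam")
length(spamfiles)       # count of spam files
head(spamfiles, 10)     # first ten spam file names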

## [1] 2500
##  [1] "00001.7c53336b37003a9286aba55d2945844c"
##  [2] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
##  [3] "00003.860e3c3cee1b42ead714c5c874fe25f7"
##  [4] "00004.864220c5b6930b209cc287c361c99af1"
##  [5] "00005.bf27cdeaf0b8c4647ecd61b1d09da613"
##  [6] "00006.253ea2f9a9cc36fa0b1129b04b806608"
##  [7] "00007.37a8af848caae585af4fe35779656d55"
##  [8] "00008.5891548d921601906337dcf1ed8543cb"
##  [9] "00009.371eca25b0169ce5cb4f71d3e07b9e2d"
## [10] "00010.145d22c053c1a0c410242e46c01635b3"
## [1] 501
##  [1] "00001.7848dde101aa985090474a91ec93fcf0"
##  [2] "00002.d94f1b97e48ed3b553b3508d116e6a09"
##  [3] "00003.2ee33bc6eacdb11f38d052c44819ba6c"
##  [4] "00004.eac8de8d759b7e74154f142194282724"
##  [5] "00005.57696a39d7d84318ce497886896bf90d"
##  [6] "00006.5ab5620d3d7c6c0db76234556a16f6c1"
##  [7] "00007.d8521faf753ff9ee989122f6816f87d7"
##  [8] "00008.dfd941deb10f5eed78b1594b131c9266"
##  [9] "00009.027bf6e0b0c4ab34db3ce0ea4bf2edab"
## [10] "00010.445affef4c70feec58f9198cfbc22997"

B. Load all ham files and put bi-gram contents in an intermediate data frame.

If you look at the source code, I used a for-loop to open each ham file and read its contents, storing each line in a vector. During my experimentation I noticed that the header and the body of each file were separated by an empty line, so I found that first empty line and used its index as the starting point for reading the file contents. I then lower-cased each line, removed some email header strings, and concatenated all of the file contents into one string. For ease of use I removed all strings that begin with < and end with >, in case the body contained XML or HTML tags. It is not a perfectly clean process, but given the limited time I had for this project, it was the best option under the circumstances. For the intermediate data frame, I used the unnest_tokens function to extract all bi-grams from all of the ham files.
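A minimal sketch of this loop, assuming the ham files sit in an easy_ham directory (the directory and variable names here are illustrative assumptions):

library(dplyr)
library(tidytext)

hamfolder <- "easy_ham"
hamtext <- c()
for (f in list.files(hamfolder)) {
  tmp <- readLines(paste(hamfolder, f, sep = "/"))
  # The header and body are separated by the first empty line;
  # keep everything from that line to the end of the file
  body <- tolower(tmp[min(which(tmp == "")):length(tmp)])
  # Crude tag removal: strip anything that begins with < and ends with >
  body <- gsub("<[^>]*>", "", body)
  hamtext <- c(hamtext, paste(body, collapse = " "))
}

# One row per message; unnest_tokens then splits the text into bi-grams
ham_df <- data.frame(text = hamtext, stringsAsFactors = FALSE)
ham_bigrams <- ham_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)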

C. Sort Ham data frame to show count of bi-grams and frequency

I sorted the intermediate ham data frame by descending bi-gram count, filtered out stop words, and added a calculated frequency column. I then kept only the top 30 bi-grams by count, placed them in another data frame, and displayed the results in the table below.
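A minimal sketch of this step, assuming the ham_bigrams frame from the previous sketch. The exact stop-word filter in the source is not shown; the one below (an assumption) drops a bi-gram only when the whole string matches a single stop word, which would explain why common word pairs such as “that the” still appear in the table:

ham_sorted <- ham_bigrams %>%
  count(bigram, sort = TRUE) %>%             # descending count of bi-grams
  filter(!bigram %in% stop_words$word) %>%   # assumed stop-word filter
  mutate(freq = n / sum(n)) %>%              # calculated frequency column
  slice(1:30)                                # keep only the top 30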

bigram n freq
mailing list 787 0.0015007
mailman listinfo 651 0.0012414
url http 628 0.0011975
rpm list 538 0.0010259
that the 498 0.0009496
in a 493 0.0009401
this is 480 0.0009153
it is 473 0.0009020
at the 456 0.0008695
spamassassin talk 425 0.0008104
from the 403 0.0007685
of a 390 0.0007437
i don’t 386 0.0007361
is the 384 0.0007322
i have 381 0.0007265
http www.newsisfree.com 352 0.0006712
the same 349 0.0006655
with a 329 0.0006274
it was 325 0.0006197
i think 322 0.0006140
razor users 319 0.0006083
is not 318 0.0006064
as a 312 0.0005950
to do 307 0.0005854
have a 305 0.0005816
would be 301 0.0005740
to get 299 0.0005702
for a 298 0.0005683
www.newsisfree.com click 298 0.0005683
https lists.sourceforge.net 297 0.0005663

D. Load all spam files and put bi-gram contents in an intermediate data frame.

The spam files were processed the same way as the ham files: a for-loop opens each spam file, reads its lines into a vector, finds the first empty line that separates the header from the body, and keeps everything from that point on. Each line is lower-cased, some email header strings are removed, and all of the contents are concatenated into one string, with anything that begins with < and ends with > stripped out in case the body contained XML or HTML tags. The sketch shown for the ham files applies here with the spam directory substituted. For the intermediate data frame, I again used the unnest_tokens function to extract all bi-grams from all of the spam files.

E. Sort Spam data frame to show count of bi-grams and frequency

I applied the same steps to the intermediate spam data frame: sorted by descending bi-gram count, excluded stop words, added a calculated frequency column, kept the top 30 bi-grams by count, and displayed the results in the table below.

bigram n freq
e mail 360 0.0017939
this is 294 0.0014650
click here 287 0.0014301
you have 256 0.0012756
do not 245 0.0012208
you are 241 0.0012009
to receive 214 0.0010664
of this 192 0.0009567
wish to 187 0.0009318
will be 185 0.0009219
for a 176 0.0008770
you will 169 0.0008421
content type 166 0.0008272
to this 162 0.0008072
be removed 160 0.0007973
content transfer 153 0.0007624
i am 153 0.0007624
transfer encoding 153 0.0007624
this message 150 0.0007474
type text 148 0.0007375
you to 148 0.0007375
of our 146 0.0007275
for your 140 0.0006976
one of 140 0.0006976
this email 140 0.0006976
3d 3d 138 0.0006877
how to 137 0.0006827
that you 134 0.0006677
it is 133 0.0006627
removed from 129 0.0006428

F. Find out which bi-grams do not appear in both data frames

This is an extra step. Instead of relying on stop words alone, would it be possible to display all bi-grams that exist in the ham dataset but not in the spam dataset, and conversely, all bi-grams that exist in the spam dataset but not in the ham dataset? The best method I found in my research was the dplyr anti_join function. Why do this? Perhaps my initial classification was faulty. This additional step helps me programmatically arrive at ham and spam bi-gram count data frames that give a clearer prediction model for my test files.
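A minimal sketch of the two anti_join calls, assuming the top-30 tables are named ham_sorted and spam_sorted as in the earlier sketches; the two tables that follow show the results:

# Ham bi-grams with no counterpart in the spam table, and vice versa
ham_only  <- ham_sorted %>% anti_join(spam_sorted, by = "bigram")
spam_only <- spam_sorted %>% anti_join(ham_sorted, by = "bigram")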

Ham bi-grams not found in the spam data frame:

bigram n freq
mailing list 787 0.0015007
mailman listinfo 651 0.0012414
url http 628 0.0011975
rpm list 538 0.0010259
that the 498 0.0009496
in a 493 0.0009401
at the 456 0.0008695
spamassassin talk 425 0.0008104
from the 403 0.0007685
of a 390 0.0007437
i don’t 386 0.0007361
is the 384 0.0007322
i have 381 0.0007265
http www.newsisfree.com 352 0.0006712
the same 349 0.0006655
with a 329 0.0006274
it was 325 0.0006197
i think 322 0.0006140
razor users 319 0.0006083
is not 318 0.0006064
as a 312 0.0005950
to do 307 0.0005854
have a 305 0.0005816
would be 301 0.0005740
to get 299 0.0005702
www.newsisfree.com click 298 0.0005683
https lists.sourceforge.net 297 0.0005663
Spam bi-grams not found in the ham data frame:

bigram n freq
e mail 360 0.0017939
click here 287 0.0014301
you have 256 0.0012756
do not 245 0.0012208
you are 241 0.0012009
to receive 214 0.0010664
of this 192 0.0009567
wish to 187 0.0009318
will be 185 0.0009219
you will 169 0.0008421
content type 166 0.0008272
to this 162 0.0008072
be removed 160 0.0007973
content transfer 153 0.0007624
i am 153 0.0007624
transfer encoding 153 0.0007624
this message 150 0.0007474
type text 148 0.0007375
you to 148 0.0007375
of our 146 0.0007275
for your 140 0.0006976
one of 140 0.0006976
this email 140 0.0006976
3d 3d 138 0.0006877
how to 137 0.0006827
that you 134 0.0006677
removed from 129 0.0006428

III. Test Some Files

Per the Summary, I tested only six emails: 3 ham and 3 spam, derived from my work email account.

A. Create Vectors for all Bigram Data Frames

I extracted the bi-gram strings from my initial ham and spam count tables, and likewise from my anti_join results.
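A minimal sketch of the extraction, assuming the data frame names from the earlier sketches; the four vector names are the ones the test loop below expects:

ham_bigram_strings   <- ham_sorted$bigram    # original ham top 30
ham_bigram_strings2  <- ham_only$bigram      # ham anti_join result
spam_bigram_strings  <- spam_sorted$bigram   # original spam top 30
spam_bigram_strings2 <- spam_only$bigram     # spam anti_join result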

B. Load all test files and summarize counts for all occurrences of ham and spam bi-gram values

My intention was to create a data frame listing each test file from my work directory, whether spam or ham, together with its match counts against the original ham result, the ham anti_join result, the original spam result, and the spam anti_join result. Using nested for-loops, I iterated over each file and searched for the bi-gram values from each data frame, then stored the counts in a final table for display.

# Packages used below: dplyr for %>%, knitr for kable, kableExtra for styling
library(dplyr)
library(knitr)
library(kableExtra)

dfTestFiles <- data.frame(filename = character(), numhamval = integer(),
                          numhamval2 = integer(), numspamval = integer(),
                          numspamval2 = integer())

testfiles <- list.files(testfolder)
for (i in seq_along(testfiles)) {
  numhamval   <- 0
  numhamval2  <- 0
  numspamval  <- 0
  numspamval2 <- 0

  filename <- testfiles[i]
  filenamepath <- paste(testfolder, filename, sep = "/")

  # Put file contents into a vector, one element per line
  tmp <- readLines(filenamepath)

  # The header and body are separated by the first empty line;
  # keep everything from that line to the end of the file
  minidx <- min(which(tmp == ""))
  tmp <- tmp[minidx:length(tmp)]

  for (j in 1:length(tmp)) {
    strvalue <- tolower(tmp[j])

    # Blank out residual header lines
    if (grepl("date:", strvalue)) strvalue <- ""
    if (grepl("from:", strvalue)) strvalue <- ""
    if (grepl("message-id:", strvalue)) strvalue <- ""

    # Count lines matching each bi-gram list; fixed = TRUE treats the
    # bi-grams as literal strings rather than regular expressions
    for (m in 1:length(ham_bigram_strings)) {
      if (grepl(ham_bigram_strings[m], strvalue, fixed = TRUE))
        numhamval <- numhamval + 1
    }
    for (m in 1:length(ham_bigram_strings2)) {
      if (grepl(ham_bigram_strings2[m], strvalue, fixed = TRUE))
        numhamval2 <- numhamval2 + 1
    }
    for (m in 1:length(spam_bigram_strings)) {
      if (grepl(spam_bigram_strings[m], strvalue, fixed = TRUE))
        numspamval <- numspamval + 1
    }
    for (m in 1:length(spam_bigram_strings2)) {
      if (grepl(spam_bigram_strings2[m], strvalue, fixed = TRUE))
        numspamval2 <- numspamval2 + 1
    }
  }
  dftemp <- data.frame(filename, numhamval, numhamval2, numspamval, numspamval2)
  dfTestFiles <- rbind(dfTestFiles, dftemp)
}

dfTestFiles %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "300px")
filename numhamval numhamval2 numspamval numspamval2
hamfile1.txt 5 5 1 1
hamfile2.txt 2 0 3 1
hamfile3.txt 1 1 0 0
spammail1.txt 5 5 7 7
spammail2.txt 7 5 7 5
spammail3.txt 0 0 2 2

C. Analysis

The biggest limitation of this analysis is the small amount of test data used to verify the classifications produced by the resultant ham and spam data frames. In previous research experiments I have conducted, larger test datasets made it much clearer whether the results aligned with my findings.

Nevertheless, some interesting observations can be gleaned from testing the six files (a short computational sketch follows this list):

  • hamfile1.txt - Summing the original and anti_join counts gives 10 ham bi-gram matches to 2 spam matches (a 5-to-1 ratio). Based on this, the ham data frames correctly predicted the identity of this ham email file.

  • hamfile2.txt - Summing the original and anti_join counts gives 2 ham matches to 4 spam matches. Based on this, the ham data frames did not correctly predict the identity of this ham email file.

  • hamfile3.txt - Summing the original and anti_join counts gives 2 ham matches to 0 spam matches. Based on this, the ham data frames correctly predicted the identity of this ham email file.

  • spammail1.txt - Summing the original and anti_join counts gives 10 ham matches to 14 spam matches. Based on this, the spam data frames correctly predicted the identity of this spam email file.

  • spammail2.txt - Summing the original and anti_join counts gives 12 ham matches to 12 spam matches (a 1-to-1 tie). Based on this, the spam data frames did NOT correctly predict the identity of this spam email file.

  • spammail3.txt - Summing the original and anti_join counts gives 0 ham matches to 4 spam matches. Based on this, the spam data frames correctly predicted the identity of this spam email file.
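These per-file comparisons can also be computed directly from the final data frame; a minimal sketch, where the decision rule (larger summed count wins, ties are inconclusive) restates the comparisons above:

dfTestFiles %>%
  mutate(hamtotal  = numhamval + numhamval2,
         spamtotal = numspamval + numspamval2,
         predicted = case_when(hamtotal > spamtotal ~ "ham",
                               spamtotal > hamtotal ~ "spam",
                               TRUE ~ "tie"))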

It would be useful in a future iteration of this investigation to test hundreds of files to see whether the ham and spam classifications hold up. As it stands, the ham data frame results correctly predicted the identity of a ham file 2 times out of 3 (67%), and the spam data frame results correctly predicted the identity of a spam file 2 times out of 3 (67%).

IV. References

[LIS] List of word frequencies using R. Retrieved from website: https://stackoverflow.com/questions/18101047/list-of-word-frequencies-using-r

[REM] Remove long complex html tags from strings in R. Retrieved from website: https://stackoverflow.com/questions/43572086/remove-long-complex-html-tags-from-strings-in-r

[SIL] Silge, J. and Robinson, D. Text Mining with R. Retrieved from website: https://www.tidytextmining.com/