For this project, I used the 20030228_easy_ham.tar.bz2 and 20030228_spam.tar.bz2 files. After each archive was unzipped, the easy_ham directory contained 2,500 files and the spam directory contained 501 files.
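The download-and-extract step itself is not part of the R script below; a minimal sketch of the extraction in base R, assuming the two archives have already been downloaded into the working directory:

untar("20030228_easy_ham.tar.bz2", exdir = ".")  # creates ./easy_ham
untar("20030228_spam.tar.bz2", exdir = ".")      # creates ./spam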
This project was difficult, but I nevertheless pulled through and met its basic requirements. In this project, I performed the following tasks:
Downloaded all the files from the easy_ham and spam archives dated 2/28/2003. I placed all the ham files in an easy_ham directory and all the spam files in a spam directory.
Downloaded six test files: 3 ham text files and 3 spam text files from my work email application. For ease of testing, I eliminated most header information.
For the first ham data frame, I opened all of the ham files, cleaned their data, and produced an intermediate data frame containing all bi-grams found in all of the ham files.
For the first spam data frame, I opened all of the spam files, cleaned their data, and produced an intermediate data frame containing all bi-grams found in all of the spam files.
Using tidytext and dplyr functions, I sorted the intermediate ham data frame to show the top 30 bi-grams along with their counts and frequencies. A table displays the top 30 bi-grams, minus stop words.
Using tidytext and dplyr functions, I sorted the intermediate spam data frame to show the top 30 bi-grams along with their counts and frequencies. A table displays the top 30 bi-grams, minus stop words.
I went a little further and used the dplyr anti_join function to extract the top 30 bi-grams in the intermediate ham data frame that do not exist in the intermediate spam data frame. A table displays the top 30 bi-grams that exist in the ham data frame but not in the spam data frame.
I went a little further and used the dplyr anti_join function to extract the top 30 bi-grams in the intermediate spam data frame that do not exist in the intermediate ham data frame. A table displays the top 30 bi-grams that exist in the spam data frame but not in the ham data frame.
I opened each test file and counted the number of ham and spam bi-grams that exist in each of the filtered data frames. I stored the counts in a final data frame and displayed a table showing the final results.
The final results will be discussed at the end of the project.
You will notice that I did an anti_join. Although not asked for, I thought it would be useful for the analysis portion of the project.
Caveats to this Project:
hamfolder <- "c:/per/school/SPS/DATA607/Assignments/Project4/easy_ham"
spamfolder <- "c:/per/school/SPS/DATA607/Assignments/Project4/spam"
testfolder <- "c:/per/school/SPS/DATA607/Assignments/Project4/test"
stop_words <- c('of the', 'in the', 'if you', 'to be', 'to the', 'on the', 'for the', 'and the', 'is a', 'with the', 'you can', '20 20', 'nbsp nbsp') # small sample of stop-word bi-grams
options(knitr.table.format = "html")

This is just a test to make sure that I can list both ham and spam files.
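The chunk that produced the listing below is not echoed in the document; presumably it was something along these lines, printing the number of files in each folder and the first ten file names:

length(list.files(hamfolder))
head(list.files(hamfolder), 10)
length(list.files(spamfolder))
head(list.files(spamfolder), 10)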
## [1] 2500
## [1] "00001.7c53336b37003a9286aba55d2945844c"
## [2] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
## [3] "00003.860e3c3cee1b42ead714c5c874fe25f7"
## [4] "00004.864220c5b6930b209cc287c361c99af1"
## [5] "00005.bf27cdeaf0b8c4647ecd61b1d09da613"
## [6] "00006.253ea2f9a9cc36fa0b1129b04b806608"
## [7] "00007.37a8af848caae585af4fe35779656d55"
## [8] "00008.5891548d921601906337dcf1ed8543cb"
## [9] "00009.371eca25b0169ce5cb4f71d3e07b9e2d"
## [10] "00010.145d22c053c1a0c410242e46c01635b3"
## [1] 501
## [1] "00001.7848dde101aa985090474a91ec93fcf0"
## [2] "00002.d94f1b97e48ed3b553b3508d116e6a09"
## [3] "00003.2ee33bc6eacdb11f38d052c44819ba6c"
## [4] "00004.eac8de8d759b7e74154f142194282724"
## [5] "00005.57696a39d7d84318ce497886896bf90d"
## [6] "00006.5ab5620d3d7c6c0db76234556a16f6c1"
## [7] "00007.d8521faf753ff9ee989122f6816f87d7"
## [8] "00008.dfd941deb10f5eed78b1594b131c9266"
## [9] "00009.027bf6e0b0c4ab34db3ce0ea4bf2edab"
## [10] "00010.445affef4c70feec58f9198cfbc22997"
If you look at the source code, I used a for-loop to open each ham file and read its contents, storing each line of the file in a vector. During my experimentation I noticed that the heading and the body of each file were separated by an empty line, so I found that first empty line and used its index as the starting point for reading the file contents. I then checked each line, lower-cased it, removed some email header strings, and collapsed all of the file contents into one string. For ease of use, I removed all strings that begin with < and end with > in case the body contained XML or HTML tags. It is not a clean process, but with the limited time I had for this project, it was the best I could do under the circumstances. For the intermediate data frame, I used the unnest_tokens function to get all bi-grams from all of the ham files.
hamvector <- c()
hidx <- 1
for (i in 1:length(list.files(hamfolder))) {
  filename <- list.files(hamfolder)[i]
  filename <- paste(hamfolder, filename, sep="/")
  # Put file contents into vector
  tmp <- readLines(filename)
  # Pattern matching - get rid of heading information from file
  minidx <- min(which(tmp == ''))
  maxidx <- length(tmp)
  tmp <- tmp[minidx:maxidx]
  tmpcand <- c()
  for (j in 1:length(tmp)) {
    strvalue <- tolower(tmp[j])
    if (grepl("date:", strvalue) == TRUE)
      strvalue = ""
    if (grepl("from:", strvalue) == TRUE)
      strvalue = ""
    if (grepl("message-id:", strvalue) == TRUE)
      strvalue = ""
    tmpcand[j] <- strvalue
  }
  tmp <- tmpcand
  # make into one string and trim the string
  tmp <- str_c(tmp, collapse = " ")
  tmp <- str_trim(tmp)
  # get rid of HTML or XML tags, if any. ref: [REM]
  tmp <- gsub("<.*?>", " ", tmp)
  hamvector[hidx] <- tmp
  hidx <- hidx + 1
}
hamdf <- data.frame(filename = list.files(hamfolder), text = hamvector)
ham_ngrams <- hamdf %>%
  unnest_tokens(bigram, text, token="ngrams", n=2)

I sorted the intermediate ham data frame by descending count of bi-grams. Next, I filtered the data frame to exclude stop words and added a calculated column for frequency. Finally, I kept only the top 30 bi-grams by count, placed them in another data frame, and displayed the results in a table.
sortedham_ngrams <- ham_ngrams %>%
  count(bigram, sort=TRUE)
filteredham_ngrams <- sortedham_ngrams %>%
  filter(!bigram %in% stop_words) %>%
  mutate(freq = n / sum(n))
filteredham_ngrams <- head(filteredham_ngrams, n=30)
filteredham_ngrams %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="100%", height="300px")

| bigram | n | freq |
|---|---|---|
| mailing list | 787 | 0.0015007 |
| mailman listinfo | 651 | 0.0012414 |
| url http | 628 | 0.0011975 |
| rpm list | 538 | 0.0010259 |
| that the | 498 | 0.0009496 |
| in a | 493 | 0.0009401 |
| this is | 480 | 0.0009153 |
| it is | 473 | 0.0009020 |
| at the | 456 | 0.0008695 |
| spamassassin talk | 425 | 0.0008104 |
| from the | 403 | 0.0007685 |
| of a | 390 | 0.0007437 |
| i don’t | 386 | 0.0007361 |
| is the | 384 | 0.0007322 |
| i have | 381 | 0.0007265 |
| http www.newsisfree.com | 352 | 0.0006712 |
| the same | 349 | 0.0006655 |
| with a | 329 | 0.0006274 |
| it was | 325 | 0.0006197 |
| i think | 322 | 0.0006140 |
| razor users | 319 | 0.0006083 |
| is not | 318 | 0.0006064 |
| as a | 312 | 0.0005950 |
| to do | 307 | 0.0005854 |
| have a | 305 | 0.0005816 |
| would be | 301 | 0.0005740 |
| to get | 299 | 0.0005702 |
| for a | 298 | 0.0005683 |
| www.newsisfree.com click | 298 | 0.0005683 |
| https lists.sourceforge.net | 297 | 0.0005663 |
If you look at the source code, I used a for-loop to open each spam file and read its contents, storing each line of the file in a vector. During my experimentation I noticed that the heading and the body of each file were separated by an empty line, so I found that first empty line and used its index as the starting point for reading the file contents. I then checked each line, lower-cased it, removed some email header strings, and collapsed all of the file contents into one string. For ease of use, I removed all strings that begin with < and end with > in case the body contained XML or HTML tags. It is not a clean process, but with the limited time I had for this project, it was the best I could do under the circumstances. For the intermediate data frame, I used the unnest_tokens function to get all bi-grams from all of the spam files.
spamvector <- c()
hidx <- 1
for (i in 1:length(list.files(spamfolder))) {
  filename <- list.files(spamfolder)[i]
  filename <- paste(spamfolder, filename, sep="/")
  # Put file contents into vector
  tmp <- readLines(filename)
  # Pattern matching - get rid of heading information from file
  minidx <- min(which(tmp == ''))
  maxidx <- length(tmp)
  tmp <- tmp[minidx:maxidx]
  tmpcand <- c()
  for (j in 1:length(tmp)) {
    strvalue <- tolower(tmp[j])
    if (grepl("date:", strvalue) == TRUE)
      strvalue = ""
    if (grepl("from:", strvalue) == TRUE)
      strvalue = ""
    if (grepl("message-id:", strvalue) == TRUE)
      strvalue = ""
    tmpcand[j] <- strvalue
  }
  tmp <- tmpcand
  # make into one string and trim the string
  tmp <- str_c(tmp, collapse = " ")
  tmp <- str_trim(tmp)
  # get rid of HTML or XML tags, if any. ref: [REM]
  tmp <- gsub("<.*?>", " ", tmp)
  spamvector[hidx] <- tmp
  hidx <- hidx + 1
}
spamdf <- data.frame(filename = list.files(spamfolder), text = spamvector)
spam_ngrams <- spamdf %>%
  unnest_tokens(bigram, text, token="ngrams", n=2)

I sorted the intermediate spam data frame by descending count of bi-grams. Next, I filtered the data frame to exclude stop words and added a calculated column for frequency. Finally, I kept only the top 30 bi-grams by count, placed them in another data frame, and displayed the results in a table.
sortedspam_ngrams <- spam_ngrams %>%
  count(bigram, sort=TRUE)
filteredspam_ngrams <- sortedspam_ngrams %>%
  filter(!bigram %in% stop_words) %>%
  mutate(freq = n / sum(n))
filteredspam_ngrams <- head(filteredspam_ngrams, n=30)
filteredspam_ngrams %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="100%", height="300px")

| bigram | n | freq |
|---|---|---|
| e mail | 360 | 0.0017939 |
| this is | 294 | 0.0014650 |
| click here | 287 | 0.0014301 |
| you have | 256 | 0.0012756 |
| do not | 245 | 0.0012208 |
| you are | 241 | 0.0012009 |
| to receive | 214 | 0.0010664 |
| of this | 192 | 0.0009567 |
| wish to | 187 | 0.0009318 |
| will be | 185 | 0.0009219 |
| for a | 176 | 0.0008770 |
| you will | 169 | 0.0008421 |
| content type | 166 | 0.0008272 |
| to this | 162 | 0.0008072 |
| be removed | 160 | 0.0007973 |
| content transfer | 153 | 0.0007624 |
| i am | 153 | 0.0007624 |
| transfer encoding | 153 | 0.0007624 |
| this message | 150 | 0.0007474 |
| type text | 148 | 0.0007375 |
| you to | 148 | 0.0007375 |
| of our | 146 | 0.0007275 |
| for your | 140 | 0.0006976 |
| one of | 140 | 0.0006976 |
| this email | 140 | 0.0006976 |
| 3d 3d | 138 | 0.0006877 |
| how to | 137 | 0.0006827 |
| that you | 134 | 0.0006677 |
| it is | 133 | 0.0006627 |
| removed from | 129 | 0.0006428 |
This is an extra step. Instead of relying on stop words alone, would it be possible to display all bi-grams that exist in the ham dataset but not in the spam dataset? Conversely, would it be possible to display all bi-grams that exist in the spam dataset but not in the ham dataset? The best method I found in my research was the dplyr anti_join function. Why am I doing this? Perhaps my initial classification was faulty. This additional step helps me programmatically arrive at ham and spam bi-gram count data frames, which should give me a clearer prediction model for my test files.
filteredham_ngrams2 <- anti_join(filteredham_ngrams, filteredspam_ngrams, by="bigram")
filteredspam_ngrams2 <- anti_join(filteredspam_ngrams, filteredham_ngrams, by="bigram")
filteredham_ngrams2 <- head(filteredham_ngrams2, n=30)
filteredham_ngrams2 %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="100%", height="300px")

| bigram | n | freq |
|---|---|---|
| mailing list | 787 | 0.0015007 |
| mailman listinfo | 651 | 0.0012414 |
| url http | 628 | 0.0011975 |
| rpm list | 538 | 0.0010259 |
| that the | 498 | 0.0009496 |
| in a | 493 | 0.0009401 |
| at the | 456 | 0.0008695 |
| spamassassin talk | 425 | 0.0008104 |
| from the | 403 | 0.0007685 |
| of a | 390 | 0.0007437 |
| i don’t | 386 | 0.0007361 |
| is the | 384 | 0.0007322 |
| i have | 381 | 0.0007265 |
| http www.newsisfree.com | 352 | 0.0006712 |
| the same | 349 | 0.0006655 |
| with a | 329 | 0.0006274 |
| it was | 325 | 0.0006197 |
| i think | 322 | 0.0006140 |
| razor users | 319 | 0.0006083 |
| is not | 318 | 0.0006064 |
| as a | 312 | 0.0005950 |
| to do | 307 | 0.0005854 |
| have a | 305 | 0.0005816 |
| would be | 301 | 0.0005740 |
| to get | 299 | 0.0005702 |
| www.newsisfree.com click | 298 | 0.0005683 |
| https lists.sourceforge.net | 297 | 0.0005663 |
filteredspam_ngrams2 <- head(filteredspam_ngrams2, n=30)
filteredspam_ngrams2 %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="100%", height="300px")

| bigram | n | freq |
|---|---|---|
| e mail | 360 | 0.0017939 |
| click here | 287 | 0.0014301 |
| you have | 256 | 0.0012756 |
| do not | 245 | 0.0012208 |
| you are | 241 | 0.0012009 |
| to receive | 214 | 0.0010664 |
| of this | 192 | 0.0009567 |
| wish to | 187 | 0.0009318 |
| will be | 185 | 0.0009219 |
| you will | 169 | 0.0008421 |
| content type | 166 | 0.0008272 |
| to this | 162 | 0.0008072 |
| be removed | 160 | 0.0007973 |
| content transfer | 153 | 0.0007624 |
| i am | 153 | 0.0007624 |
| transfer encoding | 153 | 0.0007624 |
| this message | 150 | 0.0007474 |
| type text | 148 | 0.0007375 |
| you to | 148 | 0.0007375 |
| of our | 146 | 0.0007275 |
| for your | 140 | 0.0006976 |
| one of | 140 | 0.0006976 |
| this email | 140 | 0.0006976 |
| 3d 3d | 138 | 0.0006877 |
| how to | 137 | 0.0006827 |
| that you | 134 | 0.0006677 |
| removed from | 129 | 0.0006428 |
Per the Summary, I only tested six emails: 3 ham and 3 spam emails derived from my work email account.
I extracted the bi-gram strings from my initial ham and spam count tables, as well as the bi-gram strings derived from my anti_join calculations.
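The chunk that builds these four vectors is not echoed in the document; given the names used in the loop below, it was presumably just the bigram column of each filtered data frame, along these lines:

ham_bigram_strings <- filteredham_ngrams$bigram    # top-30 ham bi-grams
ham_bigram_strings2 <- filteredham_ngrams2$bigram  # ham-only bi-grams from the anti_join
spam_bigram_strings <- filteredspam_ngrams$bigram  # top-30 spam bi-grams
spam_bigram_strings2 <- filteredspam_ngrams2$bigram # spam-only bi-grams from the anti_join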
My intention was to create a data frame that lists each test file from my work directory, whether ham or spam, along with the count of matching bi-grams from four sources: the original ham result, the ham anti_join result, the original spam result, and the spam anti_join result. Using nested for-loops, I iterated over each file, searched each line for the bi-grams from each data frame, and stored the counts in a final table for display.
dfTestFiles <- data.frame(filename=character(), numhamval=integer(), numhamval2=integer(), numspamval=integer(), numspamval2=integer())
for (i in 1:length(list.files(testfolder))) {
  numhamval <- 0
  numhamval2 <- 0
  numspamval <- 0
  numspamval2 <- 0
  filename <- list.files(testfolder)[i]
  filenamepath <- paste(testfolder, filename, sep="/")
  # Put file contents into vector
  tmp <- readLines(filenamepath)
  # Pattern matching - get rid of heading information from file
  minidx <- min(which(tmp == ''))
  maxidx <- length(tmp)
  tmp <- tmp[minidx:maxidx]
  tmpcand <- c()
  for (j in 1:length(tmp)) {
    strvalue <- tolower(tmp[j])
    if (grepl("date:", strvalue) == TRUE)
      strvalue = ""
    if (grepl("from:", strvalue) == TRUE)
      strvalue = ""
    if (grepl("message-id:", strvalue) == TRUE)
      strvalue = ""
    tmpcand[j] <- strvalue
    for (m in 1:length(ham_bigram_strings)) {
      srchvalue <- ham_bigram_strings[m]
      if (grepl(srchvalue, strvalue))
        numhamval <- numhamval + 1
    }
    for (m in 1:length(ham_bigram_strings2)) {
      srchvalue <- ham_bigram_strings2[m]
      if (grepl(srchvalue, strvalue))
        numhamval2 <- numhamval2 + 1
    }
    for (m in 1:length(spam_bigram_strings)) {
      srchvalue <- spam_bigram_strings[m]
      if (grepl(srchvalue, strvalue))
        numspamval <- numspamval + 1
    }
    for (m in 1:length(spam_bigram_strings2)) {
      srchvalue <- spam_bigram_strings2[m]
      if (grepl(srchvalue, strvalue))
        numspamval2 <- numspamval2 + 1
    }
  }
  dftemp <- data.frame(filename, numhamval, numhamval2, numspamval, numspamval2)
  dfTestFiles <- rbind(dfTestFiles, dftemp)
}
dfTestFiles %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="100%", height="300px")

| filename | numhamval | numhamval2 | numspamval | numspamval2 |
|---|---|---|---|---|
| hamfile1.txt | 5 | 5 | 1 | 1 |
| hamfile2.txt | 2 | 0 | 3 | 1 |
| hamfile3.txt | 1 | 1 | 0 | 0 |
| spammail1.txt | 5 | 5 | 7 | 7 |
| spammail2.txt | 7 | 5 | 7 | 5 |
| spammail3.txt | 0 | 0 | 2 | 2 |
The biggest limitation of this analysis is that I tested only a small amount of data against the classification results of the ham and spam data frames. In previous research experiments I have conducted, larger test datasets made it much clearer whether the results actually aligned with my findings.
Nevertheless, some interesting observations can be gleaned from testing the 6 files:
hamfile1.txt - Adding the original and anti-join ham data frame counts against the original and anti-join spam data frame counts gives 10 ham bi-gram matches to 2 spam matches (a 5-to-1 ratio). Based on this observation, the ham data frames correctly predicted the identity of this ham email file.
hamfile2.txt - By the same comparison, the totals are 2 ham matches to 4 spam matches. Based on this observation, the ham data frames did not correctly predict the identity of this ham email file.
hamfile3.txt - By the same comparison, the totals are 2 ham matches to 0 spam matches. Based on this observation, the ham data frames correctly predicted the identity of this ham email file.
spammail1.txt - By the same comparison, the totals are 10 ham matches to 14 spam matches. Based on this observation, the spam data frames correctly predicted the identity of this spam email file.
spammail2.txt - By the same comparison, the totals are 12 ham matches to 12 spam matches (a 1-to-1 tie). Based on this observation, the spam data frames did NOT correctly predict the identity of this spam email file.
spammail3.txt - By the same comparison, the totals are 0 ham matches to 4 spam matches. Based on this observation, the spam data frames correctly predicted the identity of this spam email file.
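These per-file comparisons amount to a simple decision rule, sketched below with dplyr against the dfTestFiles data frame built earlier (a tie is treated as no prediction; this chunk is illustrative and was not part of the original script):

dfTestFiles %>%
  mutate(hamtotal = numhamval + numhamval2,    # combined ham matches
         spamtotal = numspamval + numspamval2, # combined spam matches
         predicted = case_when(hamtotal > spamtotal ~ "ham",
                               hamtotal < spamtotal ~ "spam",
                               TRUE ~ "tie"))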
It would be useful in a future iteration of this investigation to test hundreds of files to see whether the ham and spam classifications hold up. At the conclusion of this project, the ham data frame results correctly predicted the identity of a ham file 67% of the time (2 of 3 files), and the spam data frame results correctly predicted the identity of a spam file 67% of the time (2 of 3 files).
[LIS] List of word frequencies using R. Stack Overflow. https://stackoverflow.com/questions/18101047/list-of-word-frequencies-using-r
[REM] Remove long complex html tags from strings in R. Stack Overflow. https://stackoverflow.com/questions/43572086/remove-long-complex-html-tags-from-strings-in-r
[SIL] Silge, J. and Robinson, D. Text Mining with R: A Tidy Approach. https://www.tidytextmining.com/