Introduction

For this assignment we have been given two sets of email messages. One set is known to be spam and the other is known to be "ham," i.e. legitimate messages. I downloaded two files from the example corpus at https://spamassassin.apache.org/old/publiccorpus/,1 namely 20021010_easy_ham.tar.bz2 and 20021010_spam.tar.bz2, containing sample ham and spam messages respectively. I have also made my chosen files available on GitHub.

Import Data

Download and Unzip Files

hamURL <- 'https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2'
spamURL <- 'https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2'

dir.create("corpus", showWarnings = FALSE)  # ensure the destination directory exists

download.file(hamURL, "corpus/20021010_easy_ham.tar.bz2")
download.file(spamURL, "corpus/20021010_spam.tar.bz2")

untar("corpus/20021010_easy_ham.tar.bz2", exdir = "corpus/")
untar("corpus/20021010_spam.tar.bz2", exdir = "corpus/")

Text Mining Package

I will be using tm, the Text Mining package for R, to import the spam and ham data sets and to do some analysis, along with the tidyverse, wordcloud, and reactable packages used throughout.

library(tm)
library(tidyverse)  # dplyr, tidyr, stringr and tibble functions are used throughout
library(wordcloud)  # word clouds below
library(reactable)  # interactive tables below

Import Data from Files and Export to Data Frame

The data sets have been imported as Corpus objects. I have converted these Corpus objects to data frames in order to manipulate the data.

spamCorpus <- Corpus(DirSource(directory = "corpus/spam", encoding = "ASCII"))
hamCorpus <- Corpus(DirSource(directory = "corpus/easy_ham/",encoding = "ASCII"))

spam <- data.frame(text = sapply(spamCorpus, as.character), stringsAsFactors = FALSE)
ham <- data.frame(text = sapply(hamCorpus, as.character), stringsAsFactors = FALSE)

Combine Data Sets

spam <- spam %>% 
  rownames_to_column("message-id") %>% 
  rename(message = text) %>% 
  mutate(isSpam = 1)

ham <- ham %>% 
  rownames_to_column("message-id") %>% 
  rename(message = text) %>% 
  mutate(isSpam = 0)

The body of a mail message comes after the header and consists of everything that follows the first blank line.2 I will therefore split the message column on the first occurrence of two consecutive newline characters.

combinedDataSet <- rbind(spam, ham) %>% 
  separate(message, sep = "(\r\n|\r|\n)(\r\n|\r|\n)", into = c("headers", "body"), extra = "merge")
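As a quick sanity check, here is the same split applied to a toy message (the text below is invented for illustration):

data.frame(message = "From: a@example.com\nSubject: Hi\n\nFirst body line.\nSecond body line.",
           stringsAsFactors = FALSE) %>% 
  separate(message, sep = "(\r\n|\r|\n)(\r\n|\r|\n)", into = c("headers", "body"), extra = "merge")
# headers should hold the two header lines and body should hold both body lines,
# since the single newline between them does not match the double-newline separator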

Separate Useful Information into Columns

I will now strip any HTML tags from the message body using a regular expression.

combinedDataSet <- combinedDataSet %>% 
  mutate(body_plaintext = str_replace_all(body, "</?[^>]+>", ""))
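A quick check of the tag-stripping pattern on a made-up snippet:

str_replace_all("<p>Buy <b>now</b>!</p>", "</?[^>]+>", "")
# should return "Buy now!"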

I would also like to look at the originating IP address in the header. Please note that since this data is from 2002, I am only matching IPv4 addresses; if this were used today, IPv6 addresses should also be matched. The regex below was adapted from the following website3 (I have tightened it slightly so that the private-range checks anchor on full octets; the original pattern would also have excluded public addresses such as 100.x.x.x). It excludes private and loopback IP addresses, so the str_extract function on the headers with this regex should give us the first public IP address that the message passed through.

regex <-"\\b(?!(10)|(127)|192\\.168|172\\.(2[0-9]|1[6-9]|3[0-2]))[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}" 

combinedDataSet <- combinedDataSet %>% 
  mutate(originatingIP = str_extract(headers, regex))
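A quick sanity check on some invented header text shows the pattern skipping a private address and returning the first public one:

str_extract("Received: from mail ([192.168.1.5]) by mx1.example.com ([203.0.113.9])", regex)
# should return "203.0.113.9" -- the private 192.168.1.5 is rejected by the lookahead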

Resultant DataFrame

The data frame now looks like this:

combinedDataSet %>% 
  head(30) %>% 
  reactable(wrap = F)

Creating Corpus

Create Corpus

We will now create the corpus based on the data frame created in the previous steps.

combinedDataFrameSource <- combinedDataSet %>% 
  select(`message-id`, body_plaintext) %>% 
  rename(`doc_id` = `message-id`, text = body_plaintext) %>% 
  DataframeSource()

spamCorpus <- Corpus(combinedDataFrameSource)  # n.b. despite the name, this corpus contains both spam and ham

Data Pre-Processing

I will perform some pre-processing on the raw data.

spamCorpus <- spamCorpus %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeNumbers) %>% 
  tm_map(removePunctuation) %>% 
  tm_map(stripWhitespace) %>% 
  tm_map(removeWords, stopwords("SMART")) %>% 
  tm_map(removeWords, c("nbsp", "email")) %>% 
  tm_map(stemDocument) 

Example of One Document

Here is an example of one document after stemming (stemming requires the SnowballC package).

writeLines(as.character(spamCorpus[[20]])) 
## dear friendi mrs seseseko widow late presid mobutu seseseko zair democrat republ congo drc move write letter confid presentcircumst situat escap husband son georg kongolo basher democrat republ congo drc abidjan cote divoir famili settl move settl morroco husband die cancer diseas due situat decid chang husband billion dollar deposit swiss bank countri form money code safe purpos head state dr mr laurent kabila made arrang swiss govern european countri freez late husband treasur deposit european countri children decid lay low africa studi situat till thing presid kabila dead son take joseph kabila late husband chateaux southern franc confisc french govern chang ident invest trace confisc deposit sum eighteen million unit state dollarsus secur compani safekeep fund secur code prevent know content interest assist receiv money behalfacknowledg messag introduc son kongolo modal claim fund assist invest money ident reveal buy properti stock multin compani engag safe nonspecul invest point emphasis high level confidenti busi demand hope betray trust confid repos conclus assist son put pictur busi fund maintain discuss modal includ remunerationfor servic reason kind furnish contact inform person telephon fax number confidenti purposebest regardsmr sese seko
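For reference, the stemmer maps inflected forms of a word to a common stem; a minimal illustration using SnowballC directly:

library(SnowballC)
wordStem(c("running", "runs", "investing"))
# should return "run" "run" "invest"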

Create Document Matrix

I will now create a DocumentTermMatrix.

spamDTM <- DocumentTermMatrix(spamCorpus) %>% removeSparseTerms(sparse = .99)
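We can get a feel for the resulting matrix by checking its dimensions and its most frequent terms (the 500 threshold below is an arbitrary choice):

dim(spamDTM)                           # documents x retained terms
findFreqTerms(spamDTM, lowfreq = 500)  # terms occurring at least 500 times overall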

Create Word Cloud - Spam

I will now create a word cloud containing the frequency of common words in the spam.

spamIndex <- which(combinedDataSet$isSpam == 1)
wordcloud(spamCorpus[spamIndex], min.freq = 200)

Create Word Cloud - Ham

I will now create a word cloud containing the frequency of common words in the ham.

hamIndex <- which(combinedDataSet$isSpam == 0)
wordcloud(spamCorpus[hamIndex], min.freq = 200)

Looking at Originating IP Address

Obtaining IP information

I obtained the function below from a blog post by Andrew Ziem.4 It is a simple recursive function that takes an IP address or a vector of IP addresses and returns data about them using the JSON API at freegeoip.app.

freegeoip <- function(ip, format = ifelse(length(ip) == 1, 'list', 'dataframe'))
{
    if (1 == length(ip))
    {
        # a single IP address: query the JSON API directly
        require(rjson)
        url <- paste(c("https://freegeoip.app/json/", ip), collapse = '')
        ret <- fromJSON(readLines(url, warn = FALSE))
        if (format == 'dataframe')
            ret <- data.frame(t(unlist(ret)))
        return(ret)
    } else {
        # a vector of IP addresses: recurse over each element and row-bind the results
        ret <- data.frame()
        for (i in 1:length(ip))
        {
            r <- freegeoip(ip[i], format = "dataframe")
            ret <- rbind(ret, r)
        }
        return(ret)
    }
}
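As a usage example, a single lookup looks like this (a live API call, so it requires network access and the values returned will reflect current records):

freegeoip("8.8.8.8")
# returns a list with fields such as ip, country_name, region_name, city, ...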

The API limits the number of queries that can be run, so I have obtained the unique IPs from the data set and will pass these to the freegeoip function above. N.B. The sample data is from 2002 while the API provides current information, so the records may have changed since then.

uniqueIPInfo <- combinedDataSet$originatingIP %>% 
  unique() %>% 
  na.omit() %>%  
  freegeoip()
## Loading required package: rjson
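If the number of unique addresses ever approached the API's limit, a simple workaround would be to pause between requests. A minimal sketch (the one-second delay is an arbitrary choice, not a documented requirement of the service):

lookup_throttled <- function(ips, delay = 1) {
  results <- lapply(ips, function(ip) {
    Sys.sleep(delay)  # wait between calls to stay under the rate limit
    freegeoip(ip, format = "dataframe")
  })
  do.call(rbind, results)
}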

I will now join the data with the original data set.

IPDataSet <- combinedDataSet %>% 
  left_join(uniqueIPInfo, by = c("originatingIP" = "ip"))
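As a quick check on the join, we can count how many messages ended up without geolocation data (either no public IP was extracted or the lookup returned nothing):

IPDataSet %>% 
  summarise(missing_geo = sum(is.na(country_name)), total = n())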

Preview of the Resultant Data Set

IPDataSet %>% 
  head(30) %>% 
  reactable(wrap = F)

Looking at the messages based on Location

countryDataSet <- IPDataSet %>% 
  dplyr::select(-`message-id`) %>%  # drop the ID column, keep everything else
  na.omit()

countryDataSet %>%
  head(50) %>% 
  reactable(wrap=F)

Summary Statistics for Region

library(epiDisplay)
spamByCountry <- countryDataSet %>% 
  filter(isSpam == 1)

hamByCountry <- countryDataSet %>% 
  filter(isSpam == 0)

tab1(spamByCountry$region_name, sort.group = "decreasing", cum.percent = TRUE, main = "Spam By Region Name")

## spamByCountry$region_name : 
##                              Frequency Percent Cum. percent
## Leinster                           277    55.4         55.4
##                                    103    20.6         76.0
## New York                            27     5.4         81.4
## Sao Paulo                            7     1.4         82.8
## Shanghai                             5     1.0         83.8
## Guangdong                            5     1.0         84.8
## Vienna                               4     0.8         85.6
## Nevada                               4     0.8         86.4
## Moscow Oblast                        4     0.8         87.2
## England                              4     0.8         88.0
## Hsinchu                              3     0.6         88.6
## Gyeonggi-do                          3     0.6         89.2
## Georgia                              3     0.6         89.8
## Andalusia                            3     0.6         90.4
## Tokyo                                2     0.4         90.8
## Shaanxi                              2     0.4         91.2
## Oklahoma                             2     0.4         91.6
## Ohio                                 2     0.4         92.0
## Île-de-France                        2     0.4         92.4
## Hlavni mesto Praha                   2     0.4         92.8
## Gyeongsangbuk-do                     2     0.4         93.2
## California                           2     0.4         93.6
## Beijing                              2     0.4         94.0
## Zarqa                                1     0.2         94.2
## Washington                           1     0.2         94.4
## Warmia-Masuria                       1     0.2         94.6
## Valais                               1     0.2         94.8
## Utah                                 1     0.2         95.0
## Texas                                1     0.2         95.2
## Tarapacá                             1     0.2         95.4
## Sham Shui Po                         1     0.2         95.6
## Seoul                                1     0.2         95.8
## Santiago Metropolitan                1     0.2         96.0
## Querétaro                            1     0.2         96.2
## Quebec                               1     0.2         96.4
## North Carolina                       1     0.2         96.6
## Minnesota                            1     0.2         96.8
## Mexico City                          1     0.2         97.0
## Massachusetts                        1     0.2         97.2
## Maharashtra                          1     0.2         97.4
## Madrid                               1     0.2         97.6
## Lower Saxony                         1     0.2         97.8
## Lombardy                             1     0.2         98.0
## Liaoning                             1     0.2         98.2
## Iowa                                 1     0.2         98.4
## Illinois                             1     0.2         98.6
## Gauteng                              1     0.2         98.8
## Fukuoka                              1     0.2         99.0
## District of Columbia                 1     0.2         99.2
## Buenos Aires F.D.                    1     0.2         99.4
## Australian Capital Territory         1     0.2         99.6
## Aragon                               1     0.2         99.8
## Anhui                                1     0.2        100.0
##   Total                            500   100.0        100.0
tab1(hamByCountry$region_name, sort.group = "decreasing", cum.percent = TRUE, main = "Ham By Region Name")

## hamByCountry$region_name : 
##                        Frequency Percent Cum. percent
## California                   680    38.4         38.4
##                              469    26.5         64.9
## Gelderland                   258    14.6         79.5
## Massachusetts                180    10.2         89.7
## Leinster                     167     9.4         99.1
## Västerbotten County            4     0.2         99.3
## Alberta                        4     0.2         99.5
## Pennsylvania                   2     0.1         99.7
## North Rhine-Westphalia         1     0.1         99.7
## North Carolina                 1     0.1         99.8
## Mississippi                    1     0.1         99.8
## Kyiv City                      1     0.1         99.9
## England                        1     0.1         99.9
## Connaught                      1     0.1        100.0
##   Total                     1770   100.0        100.0

Summary Statistics for Country (1)

tab1(spamByCountry$country_name, sort.group = "decreasing", cum.percent = TRUE, main = "Spam By Country")

## spamByCountry$country_name : 
##                             Frequency Percent Cum. percent
## Ireland                           277    55.4         55.4
## United States                     105    21.0         76.4
## China                              25     5.0         81.4
## Taiwan                             12     2.4         83.8
## South Korea                        12     2.4         86.2
## Spain                              10     2.0         88.2
## United Kingdom                      7     1.4         89.6
## Brazil                              7     1.4         91.0
## Russia                              6     1.2         92.2
## Japan                               5     1.0         93.2
## Austria                             5     1.0         94.2
## Mexico                              3     0.6         94.8
## India                               3     0.6         95.4
## Hong Kong                           3     0.6         96.0
## Singapore                           2     0.4         96.4
## Poland                              2     0.4         96.8
## France                              2     0.4         97.2
## Czechia                             2     0.4         97.6
## Chile                               2     0.4         98.0
## Thailand                            1     0.2         98.2
## Switzerland                         1     0.2         98.4
## South Africa                        1     0.2         98.6
## Philippines                         1     0.2         98.8
## Italy                               1     0.2         99.0
## Hashemite Kingdom of Jordan         1     0.2         99.2
## Germany                             1     0.2         99.4
## Canada                              1     0.2         99.6
## Australia                           1     0.2         99.8
## Argentina                           1     0.2        100.0
##   Total                           500   100.0        100.0
tab1(hamByCountry$country_name, sort.group = "decreasing", cum.percent = TRUE, main = "Ham By Country")

## hamByCountry$country_name : 
##                Frequency Percent Cum. percent
## United States       1316    74.4         74.4
## Netherlands          258    14.6         88.9
## Ireland              168     9.5         98.4
## Canada                17     1.0         99.4
## Sweden                 4     0.2         99.6
## Australia              4     0.2         99.8
## United Kingdom         1     0.1         99.9
## Ukraine                1     0.1         99.9
## Germany                1     0.1        100.0
##   Total             1770   100.0        100.0

Summary Statistics for Country (2)

tabpct(countryDataSet$country_code, countryDataSet$isSpam, xlab = "Country Code", ylab = "isSpam", main = "Spam by Country")
## 
## Original table 
##                            countryDataSet$isSpam
## countryDataSet$country_code     0     1  Total
##                       AR        0     1      1
##                       AT        0     5      5
##                       AU        4     1      5
##                       BR        0     7      7
##                       CA       17     1     18
##                       CH        0     1      1
##                       CL        0     2      2
##                       CN        0    25     25
##                       CZ        0     2      2
##                       DE        1     1      2
##                       ES        0    10     10
##                       FR        0     2      2
##                       GB        1     7      8
##                       HK        0     3      3
##                       IE      168   277    445
##                       IN        0     3      3
##                       IT        0     1      1
##                       JO        0     1      1
##                       JP        0     5      5
##                       KR        0    12     12
##                       MX        0     3      3
##                       NL      258     0    258
##                       PH        0     1      1
##                       PL        0     2      2
##                       RU        0     6      6
##                       SE        4     0      4
##                       SG        0     2      2
##                       TH        0     1      1
##                       TW        0    12     12
##                       UA        1     0      1
##                       US     1316   105   1421
##                       ZA        0     1      1
##                       Total  1770   500   2270
## 
## Row percent 
##                            countryDataSet$isSpam
## countryDataSet$country_code       0       1  Total
##                          AR       0       1      1
##                                 (0)   (100)  (100)
##                          AT       0       5      5
##                                 (0)   (100)  (100)
##                          AU       4       1      5
##                                (80)    (20)  (100)
##                          BR       0       7      7
##                                 (0)   (100)  (100)
##                          CA      17       1     18
##                              (94.4)   (5.6)  (100)
##                          CH       0       1      1
##                                 (0)   (100)  (100)
##                          CL       0       2      2
##                                 (0)   (100)  (100)
##                          CN       0      25     25
##                                 (0)   (100)  (100)
##                          CZ       0       2      2
##                                 (0)   (100)  (100)
##                          DE       1       1      2
##                                (50)    (50)  (100)
##                          ES       0      10     10
##                                 (0)   (100)  (100)
##                          FR       0       2      2
##                                 (0)   (100)  (100)
##                          GB       1       7      8
##                              (12.5)  (87.5)  (100)
##                          HK       0       3      3
##                                 (0)   (100)  (100)
##                          IE     168     277    445
##                              (37.8)  (62.2)  (100)
##                          IN       0       3      3
##                                 (0)   (100)  (100)
##                          IT       0       1      1
##                                 (0)   (100)  (100)
##                          JO       0       1      1
##                                 (0)   (100)  (100)
##                          JP       0       5      5
##                                 (0)   (100)  (100)
##                          KR       0      12     12
##                                 (0)   (100)  (100)
##                          MX       0       3      3
##                                 (0)   (100)  (100)
##                          NL     258       0    258
##                               (100)     (0)  (100)
##                          PH       0       1      1
##                                 (0)   (100)  (100)
##                          PL       0       2      2
##                                 (0)   (100)  (100)
##                          RU       0       6      6
##                                 (0)   (100)  (100)
##                          SE       4       0      4
##                               (100)     (0)  (100)
##                          SG       0       2      2
##                                 (0)   (100)  (100)
##                          TH       0       1      1
##                                 (0)   (100)  (100)
##                          TW       0      12     12
##                                 (0)   (100)  (100)
##                          UA       1       0      1
##                               (100)     (0)  (100)
##                          US    1316     105   1421
##                              (92.6)   (7.4)  (100)
##                          ZA       0       1      1
##                                 (0)   (100)  (100)
## 
## Column percent 
##                            countryDataSet$isSpam
## countryDataSet$country_code     0       %    1       %
##                       AR        0   (0.0)    1   (0.2)
##                       AT        0   (0.0)    5   (1.0)
##                       AU        4   (0.2)    1   (0.2)
##                       BR        0   (0.0)    7   (1.4)
##                       CA       17   (1.0)    1   (0.2)
##                       CH        0   (0.0)    1   (0.2)
##                       CL        0   (0.0)    2   (0.4)
##                       CN        0   (0.0)   25   (5.0)
##                       CZ        0   (0.0)    2   (0.4)
##                       DE        1   (0.1)    1   (0.2)
##                       ES        0   (0.0)   10   (2.0)
##                       FR        0   (0.0)    2   (0.4)
##                       GB        1   (0.1)    7   (1.4)
##                       HK        0   (0.0)    3   (0.6)
##                       IE      168   (9.5)  277  (55.4)
##                       IN        0   (0.0)    3   (0.6)
##                       IT        0   (0.0)    1   (0.2)
##                       JO        0   (0.0)    1   (0.2)
##                       JP        0   (0.0)    5   (1.0)
##                       KR        0   (0.0)   12   (2.4)
##                       MX        0   (0.0)    3   (0.6)
##                       NL      258  (14.6)    0   (0.0)
##                       PH        0   (0.0)    1   (0.2)
##                       PL        0   (0.0)    2   (0.4)
##                       RU        0   (0.0)    6   (1.2)
##                       SE        4   (0.2)    0   (0.0)
##                       SG        0   (0.0)    2   (0.4)
##                       TH        0   (0.0)    1   (0.2)
##                       TW        0   (0.0)   12   (2.4)
##                       UA        1   (0.1)    0   (0.0)
##                       US     1316  (74.4)  105  (21.0)
##                       ZA        0   (0.0)    1   (0.2)
##                       Total  1770   (100)  500   (100)

Model Using the Tidymodels Package

Load Tidymodels Packages

library(tidymodels)
library(textrecipes)
library(stopwords)
library(naivebayes)
library(discrim)
library(janitor)
library(kernlab)

Split Data

I will begin by using the rsample package to split the data into training and testing sets.

combinedDataSetForModel <- IPDataSet %>%
  mutate(isSpam = factor(isSpam),
         originatingIP = factor(originatingIP),
         country_name = factor(country_name),
         region_name = factor(region_name),
         city = factor(city),
         time_zone = factor(time_zone)) %>% 
  dplyr::select(`message-id`, isSpam, body_plaintext, originatingIP, country_name, region_name, city, time_zone) 

set.seed(1234) 
combinedDataSetSplit <- initial_split(combinedDataSetForModel, strata = "isSpam", prop = 0.75)
train_data <- training(combinedDataSetSplit)
test_data <- testing(combinedDataSetSplit)
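A quick check that the stratified split preserved the spam/ham proportions in both sets:

train_data %>% count(isSpam) %>% mutate(prop = n / sum(n))
test_data %>% count(isSpam) %>% mutate(prop = n / sum(n))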

Update Recipe

This recipe will tokenize the body field, remove stop words, stem the tokens, keep the 100 most frequent tokens, and convert them to tf-idf scores. It has been adapted from the following post5.

spam_rec <-
  recipe(isSpam ~ ., data = train_data) %>% 
  step_naomit(all_predictors()) %>%                       # drop rows with missing predictors
  update_role("message-id", new_role = "ID") %>%          # keep the ID out of the predictors
  step_clean_names(body_plaintext) %>%                    # tidy the selected variable name
  step_tokenize(body_plaintext) %>%                       # split the body into tokens
  step_stopwords(body_plaintext, keep = FALSE) %>%        # remove stop words
  step_stem(body_plaintext) %>%                           # stem each token
  step_tokenfilter(body_plaintext, max_tokens = 100) %>%  # keep the 100 most frequent tokens
  step_tfidf(body_plaintext) %>%                          # convert tokens to tf-idf scores
  prep(training = train_data)

Juice the Train Data and Bake the Test Data

rec_train_data <- juice(spam_rec)
rec_test_data <- bake(spam_rec, test_data)
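To see what the recipe produced, we can check the dimensions of the processed training data and a few of the generated feature names (textrecipes names tf-idf columns along the lines of tfidf_body_plaintext_<token>):

dim(rec_train_data)  # rows x (ID + outcome + tf-idf features)
rec_train_data %>% 
  dplyr::select(starts_with("tfidf")) %>% 
  names() %>% 
  head(5)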

Create and Fit Model

model_nb <- naive_Bayes(Laplace = 1) %>% 
  set_mode("classification") %>%
  set_engine("naivebayes") %>% 
  fit(isSpam ~ ., data = rec_train_data %>% dplyr::select(-`message-id`))  # exclude the ID column from the predictors

Verify Model Accuracy

The confusion matrix for the model is shown below.

testPred_nb <- model_nb %>% 
  predict(rec_test_data) %>% 
  bind_cols(rec_test_data %>% 
              dplyr::select(isSpam))

testPred_nb %>%
  conf_mat(isSpam, .pred_class) %>% 
  autoplot()

The accuracy on the test data is shown below.

testPred_nb %>%
  metrics(isSpam, .pred_class) %>%
  filter(.metric == "accuracy") %>% 
  dplyr::select(-.estimator) %>% 
  reactable()
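Accuracy alone can be misleading when the classes are imbalanced (ham outnumbers spam here), so sensitivity and specificity are worth checking as well. Note that yardstick treats the first factor level (0, i.e. ham) as the event by default:

testPred_nb %>% sens(isSpam, .pred_class)
testPred_nb %>% spec(isSpam, .pred_class)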

References

Costales, Bryan. 2002. “Sendmail, 3rd Edition.” O’Reilly Online Learning. O’Reilly Media, Inc. https://www.oreilly.com/library/view/sendmail-3rd-edition/1565928393/ch01s05.html.
Mark. 2016. “Regexp for Extracting Public IP Address.” BigDataMark. BigDataMark. https://www.bigdatamark.com/regexp-for-extracting-public-ip-address/.
Silge, Julia. 2020. “Sentiment Analysis with Tidymodels and #TidyTuesday Animal Crossing Reviews.” Julia Silge. https://juliasilge.com/blog/animal-crossing/.
The Apache Software Foundation. 2004. Index of /Old/Publiccorpus. The Apache Software Foundation. https://spamassassin.apache.org/old/publiccorpus/.
Ziem, Andrew. 2013. "Geolocate IP Addresses in R." Heuristic Andrew. https://heuristically.wordpress.com/2013/05/20/geolocate-ip-addresses-in-r/.

  1. The Apache Software Foundation (2004)↩︎

  2. Costales (2002)↩︎

  3. Mark (2016)↩︎

  4. Ziem (2013)↩︎

  5. Silge (2020)↩︎