For this assignment we have been given two sets of email messages: one set is known to be spam and the other is known to be “ham,” i.e. legitimate mail. I downloaded two files from the example corpus at https://spamassassin.apache.org/old/publiccorpus/.1 The files, named 20021010_easy_ham.tar.bz2 and 20021010_spam.tar.bz2, contain sample ham and spam messages respectively. I have also made my chosen files available on GitHub.
hamURL <- 'https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2'
spamURL <- 'https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2'
# create the target directory if it does not already exist
dir.create("corpus", showWarnings = FALSE)
download.file(hamURL, "corpus/20021010_easy_ham.tar.bz2")
download.file(spamURL, "corpus/20021010_spam.tar.bz2")
# extract both archives into corpus/
untar("corpus/20021010_easy_ham.tar.bz2", exdir = "corpus/")
untar("corpus/20021010_spam.tar.bz2", exdir = "corpus/")
I will be using the tm (Text Mining) package in R to import the spam and ham data sets and to perform some analysis.
library(tm)
library(tidyverse)  # dplyr, tidyr, tibble and stringr are used throughout
library(reactable)
library(wordcloud)
The data sets have been imported as Corpus objects. I have converted these Corpus objects to data frames in order to manipulate the data.
spamCorpus <- Corpus(DirSource(directory = "corpus/spam", encoding = "ASCII"))
hamCorpus <- Corpus(DirSource(directory = "corpus/easy_ham/",encoding = "ASCII"))
spam <- data.frame(text = sapply(spamCorpus, as.character), stringsAsFactors = FALSE)
ham <- data.frame(text = sapply(hamCorpus, as.character), stringsAsFactors = FALSE)
spam <- spam %>%
  rownames_to_column("message-id") %>%
  rename(message = text) %>%
  mutate(isSpam = 1)
ham <- ham %>%
  rownames_to_column("message-id") %>%
  rename(message = text) %>%
  mutate(isSpam = 0)
The body of a mail message comes after the header and consists of everything that follows the first blank line.2 Accordingly, I will split the message column on the first occurrence of two consecutive newline characters.
combinedDataSet <- rbind(spam,ham) %>%
separate(message,sep = "(\r\n|\r|\n)(\r\n|\r|\n)", into = c("headers","body"), extra = "merge")
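To make the split concrete, here is a minimal illustration on a made-up message; str_split_fixed with n = 2 mirrors the behaviour of extra = "merge":
toyMessage <- "From: a@example.com\nSubject: hi\n\nBody line one.\n\nBody line two."
str_split_fixed(toyMessage, "(\r\n|\r|\n)(\r\n|\r|\n)", n = 2)
# first element is the header block, second is the full body (later blank lines intact)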
I will now strip any HTML tags from the message body using a regular expression.
combinedDataSet <- combinedDataSet %>%
mutate( body_plaintext = str_replace_all(body,"</?[^>]+>","") )
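As a quick illustration on a made-up string, the tag-stripping regex removes the tags and keeps the text between them:
str_replace_all("<p>Buy <b>now</b>!</p>", "</?[^>]+>", "")
# returns "Buy now!"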
I would also like to look at the originating IP address from the header. Note that since this data is from 2002, I only match IPv4 addresses; if this were used today, IPv6 addresses should also be matched. The regex below was adapted from the following website.3 It excludes the private ranges (10/8, 172.16/12, 192.168/16) and loopback (127/8), so applying str_extract to the headers should give us the first public IP address that the message passed through.
regex <- "\\b(?!10\\.|127\\.|192\\.168\\.|172\\.(1[6-9]|2[0-9]|3[01])\\.)[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}"
combinedDataSet <- combinedDataSet %>%
mutate( originatingIP = str_extract(headers,regex) )
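To illustrate on a fabricated header snippet, the regex skips the private relay address and extracts the first public one:
exampleHeader <- "Received: from relay (192.168.1.5) by mx.example.net (203.0.113.7)"
str_extract(exampleHeader, regex)
# skips 192.168.1.5 and returns "203.0.113.7"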
The data frame now looks like this:
combinedDataSet %>%
head(30) %>%
reactable(wrap = F)
I will now create the corpus from the data frame created in the previous steps.
combinedDataFrameSource <- combinedDataSet %>%
  select(`message-id`, body_plaintext) %>%
  rename(doc_id = `message-id`, text = body_plaintext) %>%  # DataframeSource requires doc_id and text columns
  DataframeSource()
# note: despite the name, this corpus now contains both ham and spam messages
spamCorpus <- Corpus(combinedDataFrameSource)
I will perform some pre-processing on the raw data.
spamCorpus <- spamCorpus %>%
  tm_map(content_transformer(tolower)) %>%    # lower-case all text
  tm_map(removeNumbers) %>%                   # drop digits
  tm_map(removePunctuation) %>%               # drop punctuation
  tm_map(stripWhitespace) %>%                 # collapse runs of whitespace
  tm_map(removeWords, stopwords("SMART")) %>% # remove SMART stop words
  tm_map(removeWords, c("nbsp", "email")) %>% # remove corpus-specific noise words
  tm_map(stemDocument)                        # stem with SnowballC
Here is an example of one document after stemming; the stemming step requires the SnowballC package.
writeLines(as.character(spamCorpus[[20]]))
## dear friendi mrs seseseko widow late presid mobutu seseseko zair democrat republ congo drc move write letter confid presentcircumst situat escap husband son georg kongolo basher democrat republ congo drc abidjan cote divoir famili settl move settl morroco husband die cancer diseas due situat decid chang husband billion dollar deposit swiss bank countri form money code safe purpos head state dr mr laurent kabila made arrang swiss govern european countri freez late husband treasur deposit european countri children decid lay low africa studi situat till thing presid kabila dead son take joseph kabila late husband chateaux southern franc confisc french govern chang ident invest trace confisc deposit sum eighteen million unit state dollarsus secur compani safekeep fund secur code prevent know content interest assist receiv money behalfacknowledg messag introduc son kongolo modal claim fund assist invest money ident reveal buy properti stock multin compani engag safe nonspecul invest point emphasis high level confidenti busi demand hope betray trust confid repos conclus assist son put pictur busi fund maintain discuss modal includ remunerationfor servic reason kind furnish contact inform person telephon fax number confidenti purposebest regardsmr sese seko
I will now create a DocumentTermMatrix.
spamDTM <- DocumentTermMatrix(spamCorpus) %>% removeSparseTerms(sparse = .99)
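As a quick look at the result (illustrative; the exact terms depend on the preprocessing above), tm lets us inspect the matrix dimensions and the frequent terms:
dim(spamDTM)                          # documents x retained terms
findFreqTerms(spamDTM, lowfreq = 200) # terms occurring at least 200 times overall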
I will now create a word cloud showing the most frequent words in the spam messages.
spamIndex <- which( combinedDataSet$isSpam == 1 )
wordcloud( spamCorpus[spamIndex], min.freq = 200 )
I will now create a word cloud showing the most frequent words in the ham messages.
hamIndex <- which( combinedDataSet$isSpam == 0 )
wordcloud( spamCorpus[hamIndex], min.freq = 200 )
I obtained the function below from a blog post.4 It is a simple recursive function that takes an IP address, or a vector of IP addresses, and returns geolocation data about them using the JSON API at freegeoip.app.
freegeoip <- function(ip, format = ifelse(length(ip) == 1, 'list', 'dataframe'))
{
  if (1 == length(ip))
  {
    # a single IP address: query the API and parse the JSON response
    require(rjson)
    url <- paste(c("https://freegeoip.app/json/", ip), collapse = '')
    ret <- fromJSON(readLines(url, warn = FALSE))
    if (format == 'dataframe')
      ret <- data.frame(t(unlist(ret)))
    return(ret)
  } else {
    # a vector of IP addresses: geolocate each one and bind the rows together
    ret <- data.frame()
    for (i in 1:length(ip))
    {
      r <- freegeoip(ip[i], format = "dataframe")
      ret <- rbind(ret, r)
    }
    return(ret)
  }
}
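A single-address call looks like the following (the address is illustrative, and the result depends on the freegeoip.app API being reachable):
freegeoip("8.8.8.8")
# returns a list with fields such as ip, country_name, region_name, city and time_zone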
The API limits the number of queries that can be run, so I have extracted the unique IPs from the data set and will pass only those to the freegeoip function above. N.B. The sample data is from 2002, while the API provides current information, so the geolocation records may have changed since the messages were sent.
uniqueIPInfo <- combinedDataSet$originatingIP %>%
unique() %>%
na.omit() %>%
freegeoip()
## Loading required package: rjson
I will now join this data with the original data set.
IPDataSet <- combinedDataSet %>%
left_join(uniqueIPInfo, by=c("originatingIP" = "ip") )
A preview of the resulting data set:
IPDataSet %>%
head(30) %>%
reactable(wrap = F)
countryDataSet <- IPDataSet %>%
  dplyr::select(-`message-id`) %>%  # drop the message id; keep all geolocation fields
  na.omit()
countryDataSet %>%
head(50) %>%
reactable(wrap=F)
library(epiDisplay)
spamByCountry <- countryDataSet %>%
filter(isSpam == 1)
hamByCountry <- countryDataSet %>%
filter(isSpam == 0)
tab1(spamByCountry$region_name, sort.group = "decreasing", cum.percent = TRUE ,main = "Spam By Region Name" )
## spamByCountry$region_name :
## Frequency Percent Cum. percent
## Leinster 277 55.4 55.4
## 103 20.6 76.0
## New York 27 5.4 81.4
## Sao Paulo 7 1.4 82.8
## Shanghai 5 1.0 83.8
## Guangdong 5 1.0 84.8
## Vienna 4 0.8 85.6
## Nevada 4 0.8 86.4
## Moscow Oblast 4 0.8 87.2
## England 4 0.8 88.0
## Hsinchu 3 0.6 88.6
## Gyeonggi-do 3 0.6 89.2
## Georgia 3 0.6 89.8
## Andalusia 3 0.6 90.4
## Tokyo 2 0.4 90.8
## Shaanxi 2 0.4 91.2
## Oklahoma 2 0.4 91.6
## Ohio 2 0.4 92.0
## Île-de-France 2 0.4 92.4
## Hlavni mesto Praha 2 0.4 92.8
## Gyeongsangbuk-do 2 0.4 93.2
## California 2 0.4 93.6
## Beijing 2 0.4 94.0
## Zarqa 1 0.2 94.2
## Washington 1 0.2 94.4
## Warmia-Masuria 1 0.2 94.6
## Valais 1 0.2 94.8
## Utah 1 0.2 95.0
## Texas 1 0.2 95.2
## Tarapacá 1 0.2 95.4
## Sham Shui Po 1 0.2 95.6
## Seoul 1 0.2 95.8
## Santiago Metropolitan 1 0.2 96.0
## Querétaro 1 0.2 96.2
## Quebec 1 0.2 96.4
## North Carolina 1 0.2 96.6
## Minnesota 1 0.2 96.8
## Mexico City 1 0.2 97.0
## Massachusetts 1 0.2 97.2
## Maharashtra 1 0.2 97.4
## Madrid 1 0.2 97.6
## Lower Saxony 1 0.2 97.8
## Lombardy 1 0.2 98.0
## Liaoning 1 0.2 98.2
## Iowa 1 0.2 98.4
## Illinois 1 0.2 98.6
## Gauteng 1 0.2 98.8
## Fukuoka 1 0.2 99.0
## District of Columbia 1 0.2 99.2
## Buenos Aires F.D. 1 0.2 99.4
## Australian Capital Territory 1 0.2 99.6
## Aragon 1 0.2 99.8
## Anhui 1 0.2 100.0
## Total 500 100.0 100.0
tab1(hamByCountry$region_name, sort.group = "decreasing", cum.percent = TRUE ,main = "Ham By Region Name" )
## hamByCountry$region_name :
## Frequency Percent Cum. percent
## California 680 38.4 38.4
## 469 26.5 64.9
## Gelderland 258 14.6 79.5
## Massachusetts 180 10.2 89.7
## Leinster 167 9.4 99.1
## Västerbotten County 4 0.2 99.3
## Alberta 4 0.2 99.5
## Pennsylvania 2 0.1 99.7
## North Rhine-Westphalia 1 0.1 99.7
## North Carolina 1 0.1 99.8
## Mississippi 1 0.1 99.8
## Kyiv City 1 0.1 99.9
## England 1 0.1 99.9
## Connaught 1 0.1 100.0
## Total 1770 100.0 100.0
tab1(spamByCountry$country_name, sort.group = "decreasing", cum.percent = TRUE ,main = "Spam By Country" )
## spamByCountry$country_name :
## Frequency Percent Cum. percent
## Ireland 277 55.4 55.4
## United States 105 21.0 76.4
## China 25 5.0 81.4
## Taiwan 12 2.4 83.8
## South Korea 12 2.4 86.2
## Spain 10 2.0 88.2
## United Kingdom 7 1.4 89.6
## Brazil 7 1.4 91.0
## Russia 6 1.2 92.2
## Japan 5 1.0 93.2
## Austria 5 1.0 94.2
## Mexico 3 0.6 94.8
## India 3 0.6 95.4
## Hong Kong 3 0.6 96.0
## Singapore 2 0.4 96.4
## Poland 2 0.4 96.8
## France 2 0.4 97.2
## Czechia 2 0.4 97.6
## Chile 2 0.4 98.0
## Thailand 1 0.2 98.2
## Switzerland 1 0.2 98.4
## South Africa 1 0.2 98.6
## Philippines 1 0.2 98.8
## Italy 1 0.2 99.0
## Hashemite Kingdom of Jordan 1 0.2 99.2
## Germany 1 0.2 99.4
## Canada 1 0.2 99.6
## Australia 1 0.2 99.8
## Argentina 1 0.2 100.0
## Total 500 100.0 100.0
tab1(hamByCountry$country_name, sort.group = "decreasing", cum.percent = TRUE ,main = "Ham By Country" )
## hamByCountry$country_name :
## Frequency Percent Cum. percent
## United States 1316 74.4 74.4
## Netherlands 258 14.6 88.9
## Ireland 168 9.5 98.4
## Canada 17 1.0 99.4
## Sweden 4 0.2 99.6
## Australia 4 0.2 99.8
## United Kingdom 1 0.1 99.9
## Ukraine 1 0.1 99.9
## Germany 1 0.1 100.0
## Total 1770 100.0 100.0
tabpct(countryDataSet$country_code,countryDataSet$isSpam , xlab = "Country Code", ylab = "isSpam" , main = "Spam by Country")
##
## Original table
## countryDataSet$isSpam
## countryDataSet$country_code 0 1 Total
## AR 0 1 1
## AT 0 5 5
## AU 4 1 5
## BR 0 7 7
## CA 17 1 18
## CH 0 1 1
## CL 0 2 2
## CN 0 25 25
## CZ 0 2 2
## DE 1 1 2
## ES 0 10 10
## FR 0 2 2
## GB 1 7 8
## HK 0 3 3
## IE 168 277 445
## IN 0 3 3
## IT 0 1 1
## JO 0 1 1
## JP 0 5 5
## KR 0 12 12
## MX 0 3 3
## NL 258 0 258
## PH 0 1 1
## PL 0 2 2
## RU 0 6 6
## SE 4 0 4
## SG 0 2 2
## TH 0 1 1
## TW 0 12 12
## UA 1 0 1
## US 1316 105 1421
## ZA 0 1 1
## Total 1770 500 2270
##
## Row percent
## countryDataSet$isSpam
## countryDataSet$country_code 0 1 Total
## AR 0 1 1
## (0) (100) (100)
## AT 0 5 5
## (0) (100) (100)
## AU 4 1 5
## (80) (20) (100)
## BR 0 7 7
## (0) (100) (100)
## CA 17 1 18
## (94.4) (5.6) (100)
## CH 0 1 1
## (0) (100) (100)
## CL 0 2 2
## (0) (100) (100)
## CN 0 25 25
## (0) (100) (100)
## CZ 0 2 2
## (0) (100) (100)
## DE 1 1 2
## (50) (50) (100)
## ES 0 10 10
## (0) (100) (100)
## FR 0 2 2
## (0) (100) (100)
## GB 1 7 8
## (12.5) (87.5) (100)
## HK 0 3 3
## (0) (100) (100)
## IE 168 277 445
## (37.8) (62.2) (100)
## IN 0 3 3
## (0) (100) (100)
## IT 0 1 1
## (0) (100) (100)
## JO 0 1 1
## (0) (100) (100)
## JP 0 5 5
## (0) (100) (100)
## KR 0 12 12
## (0) (100) (100)
## MX 0 3 3
## (0) (100) (100)
## NL 258 0 258
## (100) (0) (100)
## PH 0 1 1
## (0) (100) (100)
## PL 0 2 2
## (0) (100) (100)
## RU 0 6 6
## (0) (100) (100)
## SE 4 0 4
## (100) (0) (100)
## SG 0 2 2
## (0) (100) (100)
## TH 0 1 1
## (0) (100) (100)
## TW 0 12 12
## (0) (100) (100)
## UA 1 0 1
## (100) (0) (100)
## US 1316 105 1421
## (92.6) (7.4) (100)
## ZA 0 1 1
## (0) (100) (100)
##
## Column percent
## countryDataSet$isSpam
## countryDataSet$country_code 0 % 1 %
## AR 0 (0.0) 1 (0.2)
## AT 0 (0.0) 5 (1.0)
## AU 4 (0.2) 1 (0.2)
## BR 0 (0.0) 7 (1.4)
## CA 17 (1.0) 1 (0.2)
## CH 0 (0.0) 1 (0.2)
## CL 0 (0.0) 2 (0.4)
## CN 0 (0.0) 25 (5.0)
## CZ 0 (0.0) 2 (0.4)
## DE 1 (0.1) 1 (0.2)
## ES 0 (0.0) 10 (2.0)
## FR 0 (0.0) 2 (0.4)
## GB 1 (0.1) 7 (1.4)
## HK 0 (0.0) 3 (0.6)
## IE 168 (9.5) 277 (55.4)
## IN 0 (0.0) 3 (0.6)
## IT 0 (0.0) 1 (0.2)
## JO 0 (0.0) 1 (0.2)
## JP 0 (0.0) 5 (1.0)
## KR 0 (0.0) 12 (2.4)
## MX 0 (0.0) 3 (0.6)
## NL 258 (14.6) 0 (0.0)
## PH 0 (0.0) 1 (0.2)
## PL 0 (0.0) 2 (0.4)
## RU 0 (0.0) 6 (1.2)
## SE 4 (0.2) 0 (0.0)
## SG 0 (0.0) 2 (0.4)
## TH 0 (0.0) 1 (0.2)
## TW 0 (0.0) 12 (2.4)
## UA 1 (0.1) 0 (0.0)
## US 1316 (74.4) 105 (21.0)
## ZA 0 (0.0) 1 (0.2)
## Total 1770 (100) 500 (100)
library(tidymodels)
library(textrecipes)
library(stopwords)
library(naivebayes)
library(discrim)
library(janitor)
library(kernlab)
I will begin by using the rsample package to split the data into training and testing sets.
combinedDataSetForModel <- IPDataSet %>%
  mutate(isSpam = factor(isSpam),
         originatingIP = factor(originatingIP),
         country_name = factor(country_name),
         region_name = factor(region_name),
         city = factor(city),
         time_zone = factor(time_zone)) %>%
  dplyr::select(`message-id`, isSpam, body_plaintext, originatingIP, country_name, region_name, city, time_zone)
set.seed(1234)
combinedDataSetSplit <- initial_split(combinedDataSetForModel, strata = "isSpam", prop = 0.75)
train_data <- training(combinedDataSetSplit)
test_data <- testing(combinedDataSetSplit)
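As a quick check (not part of the original analysis), we can confirm that the stratified split preserved the spam proportion in both partitions:
prop.table(table(train_data$isSpam))
prop.table(table(test_data$isSpam))
# both should show roughly the same spam fraction as the full data set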
This recipe will tokenize the body field, remove stop words, stem the tokens, and keep only the 100 most frequent tokens, which are then converted to tf-idf features. This recipe has been adapted from the following post.5
spam_rec <-
  recipe(isSpam ~ ., data = train_data) %>%
  step_naomit(all_predictors()) %>%                      # drop rows with missing predictors
  update_role("message-id", new_role = "ID") %>%         # keep the id but exclude it from modelling
  step_clean_names(body_plaintext) %>%
  step_tokenize(body_plaintext) %>%                      # split the body into word tokens
  step_stopwords(body_plaintext, keep = FALSE) %>%       # remove stop words
  step_stem(body_plaintext) %>%                          # stem each token
  step_tokenfilter(body_plaintext, max_tokens = 100) %>% # keep the 100 most frequent tokens
  step_tfidf(body_plaintext) %>%                         # convert tokens to tf-idf features
  prep(training = train_data)
rec_train_data <- juice(spam_rec)           # processed training data
rec_test_data <- bake(spam_rec, test_data)  # apply the same steps to the test data
model_nb <- naive_Bayes(Laplace = 1) %>%
  set_mode("classification") %>%
  set_engine("naivebayes") %>%
  fit(isSpam ~ ., data = rec_train_data)
The confusion matrix for the model is shown below.
testPred_nb <- model_nb %>%
predict( rec_test_data ) %>%
bind_cols(rec_test_data %>%
dplyr::select(isSpam))
testPred_nb %>%
conf_mat(isSpam, .pred_class) %>%
autoplot()
Please find the accuracy for the test data below.
testPred_nb %>%
metrics(isSpam, .pred_class) %>%
filter(.metric == "accuracy") %>%
dplyr::select(-.estimator) %>%
reactable()
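Accuracy alone can be misleading when the classes are imbalanced, as they are here. As a follow-up sketch on the same predictions, yardstick's sens() and spec() report sensitivity and specificity (computed for the first factor level, isSpam = 0, i.e. ham):
testPred_nb %>% sens(isSpam, .pred_class)
testPred_nb %>% spec(isSpam, .pred_class)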