This assignment starts with a spam/ham dataset, builds a document classifier, and then predicts the class of new documents withheld from the training dataset.
The corpus for this analysis is located here: https://spamassassin.apache.org/publiccorpus/
The code for this assignment requires the following R packages:
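A sketch of the likely library() calls, inferred from the functions used later in the analysis:
library(downloader)  # download(): retrieve the corpus archives
library(R.utils)     # bunzip2(): decompress the .bz2 files
library(stringi)     # stri_replace_all_regex()
library(tm)          # Corpus(), DirSource()
library(quanteda)    # corpus(), dfm(), tfidf(), topfeatures()
library(plyr)        # rbind.fill()
library(class)       # knn()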
This analysis heavily uses the quanteda package in R. Information on the quanteda package can be found here: https://cran.r-project.org/web/packages/quanteda/vignettes/quickstart.html
quanteda Package Introduction
“quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, by performing the most common natural language processing tasks simply and quickly, such as tokenizing, stemming, or forming ngrams. quanteda’s functions for tokenizing texts and forming multiple tokenized documents into a document-feature matrix are both extremely fast and extremely simple to use. quanteda can segment texts easily by words, paragraphs, sentences, or even user-supplied delimiters and tags.”
Functions used throughout the analysis:
# =========================================================================
# Function: download_and_untar
# =========================================================================
# Description:
#   Downloads the specified bz2 spam or ham file from
#   https://spamassassin.apache.org/publiccorpus.
#   Once downloaded to the local computer, the file is
#   bunzipped and untarred.
#
# Parameters:
#   1. name of the file to download from the public corpus
#   2. boolean indicating whether the file should only be downloaded;
#      default = FALSE
#
# Return: N/A
# =========================================================================
download_and_untar <- function(filename, downloadOnly = FALSE) {
  # download the specified file from
  # https://spamassassin.apache.org/publiccorpus
  downloader::download(url = paste0(URL, filename), filename)
  tar.file <- stri_replace_all_regex(filename, "\\.bz2", "")
  if (!downloadOnly) {
    # bunzip2 the file
    bunzip2(filename, tar.file, remove = FALSE, skip = TRUE)
    # untar the file
    untar(tar.file, exdir = ".")
    # remove the tar file
    if (file.exists(tar.file)) file.remove(tar.file)
  }
}
# =========================================================================
# Function: createCorpus
# =========================================================================
# Description:
#   Uses the tm package VCorpus object to convert a directory of
#   email files into a quanteda corpus.
#
# Parameters:
#   1. directory location of the files to be used in the corpus
#   2. type of email - spam or ham. This value is set as a
#      docvar on the corpus
#
# Return: corpus (quanteda)
# =========================================================================
createCorpus <- function(directory, emailType) {
  quantCorpus <- corpus(Corpus(DirSource(directory = directory, encoding = "UTF-8"),
                               readerControl = list(language = "en_US")),
                        notes = emailType)
  docvars(quantCorpus, "email_type") <- emailType
  docvars(quantCorpus, "source") <- stri_replace_all_regex(directory, "^\\./", "")
  return(quantCorpus)
}
# =========================================================================
# Function: buildDFM
# =========================================================================
# Description:
#   Accepts a corpus object and converts it to a document-feature
#   matrix (dfm), removing English stopwords, stemming, and trimming
#   rare features.
#
# Parameters:
#   1. the corpus to convert to a dfm
#   2. minDoc: minimum number of documents a feature must appear in
#   3. minCount: minimum total count a feature must have
#
# Return: dfm (document-feature matrix)
# =========================================================================
buildDFM <- function(corpus, minDoc, minCount) {
  # create the document-feature matrix
  # dfm = document-feature matrix
  dfm <- dfm(corpus, ignoredFeatures = stopwords("english"), stem = TRUE)
  dfm <- trim(dfm, minDoc = minDoc, minCount = minCount)
  return(dfm)
}
# =========================================================================
# Function: plotDFM - plots a wordcloud of the features in a dfm
# =========================================================================
plotDFM <- function(dfm) {
  # plot in colors with some additional options passed to wordcloud
  plot(dfm, random.color = TRUE, rot.per = .25, colors = sample(colors()[2:128], 5))
}
# =========================================================================
# Function: create_df_matrix
# =========================================================================
# Description:
#   Accepts a dfm object, applies the tf-idf weighting, and
#   returns a dataframe.
#
#   tfidf computes term frequency-inverse document frequency weighting.
#   The default is not to normalize term frequency (by computing relative
#   term frequency within each document), but this will be performed if
#   normalize = TRUE.
#
# Parameters:
#   1. dfm to process
#   2. type of email - spam or ham
#
# Return: dataframe
# =========================================================================
create_df_matrix <- function(dfm, emailType) {
  # apply the tfidf function
  mat <- data.matrix(tfidf(dfm))
  # convert to a dataframe
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  df$Source <- emailType
  return(df)
}
The following sets of files are used as input to the document classification. Each file is classified as either (1) ham, email that the recipient generally wants to receive, or (2) spam, unsolicited email, typically generated in bulk, that the recipient generally does not want.
Filename | Type |
---|---|
20021010_easy_ham.tar.bz2 | Ham |
20021010_spam.tar.bz2 | Spam |
20021010_hard_ham.tar.bz2 | Ham |
20030228_spam_2.tar.bz2 | Spam |
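The download code below references a base URL and a vector of archive filenames; a minimal sketch of those definitions, with the names URL and files assumed from their use in download_and_untar and the lapply call:
# base URL of the SpamAssassin public corpus
URL <- "https://spamassassin.apache.org/publiccorpus/"
# the four archives listed in the table above
files <- c("20021010_easy_ham.tar.bz2",
           "20021010_spam.tar.bz2",
           "20021010_hard_ham.tar.bz2",
           "20030228_spam_2.tar.bz2")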
# use lapply to download and untar all files specified
lapply(files, download_and_untar)
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
Create the Spam Corpus by combining the files found in the spam and spam_2 compressed file downloads from the SpamAssassin public corpus.
########### SPAM ###############
spamCorpus <- createCorpus("./spam", "spam")
spam2Corpus <- createCorpus("./spam_2", "spam")
# combine the 2 spam corpora
spamCorpusCombined <- spamCorpus + spam2Corpus
Let’s look at the combined Spam corpus using the summary function:
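The listing below was presumably produced by a call along these lines (the second argument matches the “showing 20 documents” header):
summary(spamCorpusCombined, 20)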
## Corpus consisting of 1899 documents, showing 20 documents.
##
## Text Types Tokens Sentences author
## 0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1 1170 1835 1 <NA>
## 0001.bfc8d64d12b325ff385cca8d07b84288 341 1495 18 <NA>
## 0002.24b47bb3ce90708ae29d0aec1da08610 237 637 10 <NA>
## 0003.4b3d943b8df71af248d12f8b2e7a224a 201 474 9 <NA>
## 0004.1874ab60c71f0b31b580f313a3f6e777 364 1112 46 <NA>
## 0005.1f42bb885de0ef7fc5cd09d34dc2ba54 227 602 7 <NA>
## 0006.7a32642f8c22bbeb85d6c3b5f3890a2c 378 821 27 <NA>
## 0007.859c901719011d56f8b652ea071c1f8b 189 423 10 <NA>
## 0008.9562918b57e044abfbce260cc875acde 613 5937 22 <NA>
## 0009.c05e264fbf18783099b53dbc9a9aacda 424 951 40 <NA>
## 0010.7f5fb525755c45eb78efc18d7c9ea5aa 231 825 5 <NA>
## 0011.2a1247254a535bac29c476b86c708901 199 469 9 <NA>
## 0012.7bc8e619ad0264979edce15083e70a02 166 535 7 <NA>
## 0013.9034ac0917f6fdb82c5ee6a7509029ed 199 470 9 <NA>
## 0014.ed99ffe0f452b91be11684cbfe8d349c 308 1876 38 <NA>
## 0015.1b871d654560011a0aaa29bb4e9054f7 182 501 7 <NA>
## 0016.f9c349935955e1ccc7626270da898445 314 1699 10 <NA>
## 0017.49ab70c7a4042cb1c695a0e59a6ede54 357 784 40 <NA>
## 0018.259154a52bc55dcae491cfded60a5cd2 186 417 11 <NA>
## 0019.939e70d8367f315193e4bc5be80dc262 326 723 19 <NA>
## datetimestamp description heading
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## 2016-04-10 23:26:28 <NA> <NA>
## id language origin email_type source
## 0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1 en_US <NA> spam spam
## 0001.bfc8d64d12b325ff385cca8d07b84288 en_US <NA> spam spam
## 0002.24b47bb3ce90708ae29d0aec1da08610 en_US <NA> spam spam
## 0003.4b3d943b8df71af248d12f8b2e7a224a en_US <NA> spam spam
## 0004.1874ab60c71f0b31b580f313a3f6e777 en_US <NA> spam spam
## 0005.1f42bb885de0ef7fc5cd09d34dc2ba54 en_US <NA> spam spam
## 0006.7a32642f8c22bbeb85d6c3b5f3890a2c en_US <NA> spam spam
## 0007.859c901719011d56f8b652ea071c1f8b en_US <NA> spam spam
## 0008.9562918b57e044abfbce260cc875acde en_US <NA> spam spam
## 0009.c05e264fbf18783099b53dbc9a9aacda en_US <NA> spam spam
## 0010.7f5fb525755c45eb78efc18d7c9ea5aa en_US <NA> spam spam
## 0011.2a1247254a535bac29c476b86c708901 en_US <NA> spam spam
## 0012.7bc8e619ad0264979edce15083e70a02 en_US <NA> spam spam
## 0013.9034ac0917f6fdb82c5ee6a7509029ed en_US <NA> spam spam
## 0014.ed99ffe0f452b91be11684cbfe8d349c en_US <NA> spam spam
## 0015.1b871d654560011a0aaa29bb4e9054f7 en_US <NA> spam spam
## 0016.f9c349935955e1ccc7626270da898445 en_US <NA> spam spam
## 0017.49ab70c7a4042cb1c695a0e59a6ede54 en_US <NA> spam spam
## 0018.259154a52bc55dcae491cfded60a5cd2 en_US <NA> spam spam
## 0019.939e70d8367f315193e4bc5be80dc262 en_US <NA> spam spam
##
## Source: Combination of corpuses spamCorpus and spam2Corpus
## Created: Sun Apr 10 19:26:35 2016
## Notes: spam
dfmSpam <- buildDFM(spamCorpusCombined, round(length(docnames(spamCorpusCombined))/10), 50)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 1,899 documents
## ... indexing features: 85,477 feature types
## ... removed 163 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 5537 feature variants
## ... created a 1899 x 79777 sparse dfm
## ... complete.
## Elapsed time: 14.94 seconds.
## Removing features occurring fewer than 50 times: 77866
## Removing features occurring in fewer than 190 documents: 79440
dim(dfmSpam) # basic dimensions of the dfm
## [1] 1899 337
topfeatures(dfmSpam, 20) # top features of the spam dfm
## 3d font td br b size tr nbsp p face
## 40927 40668 21170 20148 14866 14770 12640 11868 11858 11563
## http width color receiv align arial id center height tabl
## 11393 11123 10769 10356 7949 7076 6887 5980 5569 5509
plot(topfeatures(dfmSpam, 100), log = "y", cex = .6, ylab = "Term frequency", main = "Top Features of Spam")
Wordcloud of the top 100 Spam features or words:
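The wordcloud was presumably generated with the plotDFM helper defined above, which passes its options through to wordcloud:
plotDFM(dfmSpam)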
Create the Ham Corpus by combining the files found in the easy_ham and hard_ham compressed file downloads from the SpamAssassin public corpus.
########### HAM ###############
hamCorpus <- createCorpus("./easy_ham", "ham")
ham2Corpus <- createCorpus("./hard_ham", "ham")
# combine the 2 ham corpora
hamCorpusCombined <- hamCorpus + ham2Corpus
The summary of the Ham Corpus:
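As with the spam corpus, the listing was presumably produced by a call along these lines:
summary(hamCorpusCombined, 20)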
## Corpus consisting of 2801 documents, showing 20 documents.
##
## Text Types Tokens Sentences author
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e 300 1080 25 <NA>
## 0002.b3120c4bcbf3101e661161ee7efcb8bf 250 802 5 <NA>
## 0003.acfc5ad94bbd27118a0d8685d18c89dd 326 904 11 <NA>
## 0004.e8d5727378ddde5c3be181df593f1712 275 742 9 <NA>
## 0005.8c3b9e9c0f3f183ddaf7592a11b99957 365 1141 23 <NA>
## 0006.ee8b0dba12856155222be180ba122058 272 802 10 <NA>
## 0007.c75188382f64b090022fa3b095b020b0 246 792 7 <NA>
## 0008.20bc0b4ba2d99aae1c7098069f611a9b 306 928 9 <NA>
## 0009.435ae292d75abb1ca492dcc2d5cf1570 291 850 14 <NA>
## 0010.4996141de3f21e858c22f88231a9f463 688 1904 42 <NA>
## 0011.07b11073b53634cff892a7988289a72e 324 1191 30 <NA>
## 0012.d354b2d2f24d1036caf1374dd94f4c94 268 812 11 <NA>
## 0013.ff597adee000d073ae72200b0af00cd1 242 787 14 <NA>
## 0014.532e0a17d0674ba7a9baa7b0afe5fb52 363 1149 34 <NA>
## 0015.a9ff8d7550759f6ab62cc200bdf156e7 261 802 10 <NA>
## 0016.d82758030e304d41fb3f4ebbb7d9dd91 308 920 17 <NA>
## 0017.d81093a2182fc9135df6d9158a8ebfd6 271 757 15 <NA>
## 0018.ba70ecbeea6f427b951067f34e23bae6 400 1425 45 <NA>
## 0019.a8a1b2767e83b3be653e4af0148e1897 541 1543 36 <NA>
## 0020.ef397cef16f8041242e3b6560e168053 223 589 6 <NA>
## datetimestamp description heading
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## 2016-04-10 23:26:58 <NA> <NA>
## id language origin email_type source
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e en_US <NA> ham easy_ham
## 0002.b3120c4bcbf3101e661161ee7efcb8bf en_US <NA> ham easy_ham
## 0003.acfc5ad94bbd27118a0d8685d18c89dd en_US <NA> ham easy_ham
## 0004.e8d5727378ddde5c3be181df593f1712 en_US <NA> ham easy_ham
## 0005.8c3b9e9c0f3f183ddaf7592a11b99957 en_US <NA> ham easy_ham
## 0006.ee8b0dba12856155222be180ba122058 en_US <NA> ham easy_ham
## 0007.c75188382f64b090022fa3b095b020b0 en_US <NA> ham easy_ham
## 0008.20bc0b4ba2d99aae1c7098069f611a9b en_US <NA> ham easy_ham
## 0009.435ae292d75abb1ca492dcc2d5cf1570 en_US <NA> ham easy_ham
## 0010.4996141de3f21e858c22f88231a9f463 en_US <NA> ham easy_ham
## 0011.07b11073b53634cff892a7988289a72e en_US <NA> ham easy_ham
## 0012.d354b2d2f24d1036caf1374dd94f4c94 en_US <NA> ham easy_ham
## 0013.ff597adee000d073ae72200b0af00cd1 en_US <NA> ham easy_ham
## 0014.532e0a17d0674ba7a9baa7b0afe5fb52 en_US <NA> ham easy_ham
## 0015.a9ff8d7550759f6ab62cc200bdf156e7 en_US <NA> ham easy_ham
## 0016.d82758030e304d41fb3f4ebbb7d9dd91 en_US <NA> ham easy_ham
## 0017.d81093a2182fc9135df6d9158a8ebfd6 en_US <NA> ham easy_ham
## 0018.ba70ecbeea6f427b951067f34e23bae6 en_US <NA> ham easy_ham
## 0019.a8a1b2767e83b3be653e4af0148e1897 en_US <NA> ham easy_ham
## 0020.ef397cef16f8041242e3b6560e168053 en_US <NA> ham easy_ham
##
## Source: Combination of corpuses hamCorpus and ham2Corpus
## Created: Sun Apr 10 19:27:05 2016
## Notes: ham
dfmHam <- buildDFM(hamCorpusCombined, round(length(docnames(hamCorpusCombined))/10), 50)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2,801 documents
## ... indexing features: 63,673 feature types
## ... removed 171 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 10241 feature variants
## ... created a 2801 x 53261 sparse dfm
## ... complete.
## Elapsed time: 13.42 seconds.
## Removing features occurring fewer than 50 times: 50796
## Removing features occurring in fewer than 280 documents: 52997
dim(dfmHam)
## [1] 2801 264
plot(topfeatures(dfmHam, 100), log = "y", cex = .6, ylab = "Term frequency", main = "Top Features of Ham")
Wordcloud of the top 100 Ham features or words:
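Again, presumably generated with the plotDFM helper:
plotDFM(dfmHam)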
Apply the tfidf function to the Spam and Ham dfm objects to create matrices of tf-idf weighted word frequencies. These two matrices are combined using rbind.fill from the plyr package.
dfSpam <- create_df_matrix(dfmSpam, "spam")
dfHam <- create_df_matrix(dfmHam, "ham")
stacked.df <- rbind.fill(dfSpam, dfHam)
# features absent from one class are NA after rbind.fill; set them to 0
stacked.df[is.na(stacked.df)] <- 0
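As a quick sanity check, stacked.df should contain one row per document (1899 spam + 2801 ham = 4700) and one column for each feature in the union of the two dfms, plus the Source label:
dim(stacked.df)  # 4700 rows; columns = union of spam and ham features + Source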
This script is based on Timothy D’Auria’s YouTube tutorial “How to Build a Text Mining, Machine Learning Document Classification System in R!” (https://www.youtube.com/watch?v=j1V2McKbkLo).
## Create the training and test datasets
train.idx <- sample(nrow(stacked.df), ceiling(nrow(stacked.df) * 0.7))
test.idx <- (1:nrow(stacked.df)) [-train.idx]
length(train.idx)
## [1] 3290
length(test.idx)
## [1] 1410
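Note that the sample() call above is not seeded, so the membership of the 70/30 split changes between runs. For a reproducible split, the generator could be seeded before sampling — a minimal sketch with an arbitrary seed value:
set.seed(42)  # arbitrary seed chosen for illustration
train.idx <- sample(nrow(stacked.df), ceiling(nrow(stacked.df) * 0.7))
test.idx <- (1:nrow(stacked.df))[-train.idx]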
# labels (spam/ham) for each document
tdm.email <- stacked.df[, "Source"]
# feature columns only, with the Source label removed
stacked.nl <- stacked.df[, !colnames(stacked.df) %in% "Source"]
Run the kNN prediction using the training and test datasets. The knn function (from the class package) is used with its default of k = 1, so each test document is classified by its single nearest training neighbor.
knn.pred <- knn(stacked.nl[train.idx, ], stacked.nl[test.idx, ], tdm.email[train.idx])
The resulting Confusion Matrix:
conf.mat <- table("Predictions" = knn.pred, Actual = tdm.email[test.idx])
## Actual
## Predictions ham spam
## ham 802 16
## spam 6 586
The accuracy of the model = 98.4397163%.
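The accuracy follows directly from the confusion matrix: correct predictions (the diagonal) divided by the total number of test documents. A one-line computation, assuming the conf.mat and test.idx objects above:
sum(diag(conf.mat)) / length(test.idx) * 100  # (802 + 586) / 1410 * 100 = 98.44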
# To output the predictions
df.pred <- cbind(knn.pred, stacked.nl[test.idx, ])
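The predictions could then be written to disk for review — a sketch, with a hypothetical filename:
write.csv(df.pred, "knn_predictions.csv")  # hypothetical output file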