DATA 607: Week 11 Assignment - Document Classification

Document Classification using quanteda and k-Nearest Neighbor (kNN)

This assignment starts with a labeled spam/ham dataset, trains a classifier on part of it, and then predicts the class of documents withheld from the training data.

The corpus for this analysis is located here: https://spamassassin.apache.org/publiccorpus/

The code for this assignment requires the following R packages:

downloader
R.utils
quanteda
tm
plyr
class
stringi
knitr

This analysis heavily uses the quanteda package in R. Information on the quanteda package can be found here:

https://cran.r-project.org/web/packages/quanteda/vignettes/quickstart.html

quanteda Package Introduction

“quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, by performing the most common natural language processing tasks simply and quickly, such as tokenizing, stemming, or forming ngrams. quanteda’s functions for tokenizing texts and forming multiple tokenized documents into a document-feature matrix are both extremely fast and extremely simple to use. quanteda can segment texts easily by words, paragraphs, sentences, or even user-supplied delimiters and tags.”

Functions used throughout the analysis:

# =========================================================================
# Function: download_and_untar
# =========================================================================
# Description: 
#              Downloads the specified bz2 spam or ham file from
#              https://spamassassin.apache.org/publiccorpus.
#              Once downloaded to the local computer, the file is
#              bunzipped and untarred.
# 
# Parameters: 
#            1. name of the file to download from the public corpus
#            2. boolean whether the file should be downloaded only; 
#               default = FALSE
#
# Return: N/A
# =========================================================================
download_and_untar <- function(filename, downloadOnly = FALSE) {
    
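        # download the specified file from 
        # https://spamassassin.apache.org/publiccorpus 
        # (URL is a global base URL defined outside this function)
        downloader::download(url = paste0(URL, filename), filename )
        
        # strip the literal ".bz2" suffix to get the tar file name
        tar.file <- stri_replace_all_regex(filename, "\\.bz2$", "")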
    
        if (!downloadOnly) {
            
           # bunzip2 the file    
           bunzip2(filename, tar.file, remove = FALSE, skip = TRUE)
           
           # untar the file     
           untar(tar.file, exdir = ".")
        
           # remove the tar file
           if (file.exists(tar.file)) file.remove(tar.file)
        
        }
}

# =========================================================================
# Function: createCorpus
# =========================================================================
# Description: 
#              Reads the files in a directory into a tm VCorpus and
#              converts it into a quanteda corpus.
# 
# Parameters: 
#            1. directory location of the files to be used in the corpus
#            2. type of email - spam or ham.  This value is set as a 
#               docvar on the corpus
# Return: corpus (quanteda)
# =========================================================================
createCorpus <- function(directory, emailType) {
    
    quantCorpus <- corpus(Corpus(DirSource(directory = directory, encoding = "UTF-8"), 
                                    readerControl = list(language="en_US")),
                      notes=emailType)
    
    docvars(quantCorpus, "email_type") <- emailType
    # strip only a leading "./" so the source is the bare directory name
    docvars(quantCorpus, "source")     <- stri_replace_all_regex(directory, "^\\./", "")
    
    return(quantCorpus)
    
}

# =========================================================================
# Function: buildDFM
# =========================================================================
# Description: 
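#              Accepts a corpus object and converts it to a
#              document-feature matrix (dfm).
# 
# Parameters: 
#            1. the corpus to convert to a dfm
#            2. minDoc: minimum number of documents a feature must
#               occur in to be kept
#            3. minCount: minimum number of times a feature must
#               occur overall to be kept
#
# Return: dfm (document-feature matrix)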
# =========================================================================
buildDFM <- function(corpus, minDoc, minCount) {
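    # create the document-feature matrix (dfm), dropping English
    # stopwords and stemming the remaining features
    dfm <- dfm(corpus, ignoredFeatures = stopwords("english"), stem = TRUE)

    # drop rare features that fall below the minDoc / minCount thresholds
    dfm <- trim(dfm, minDoc = minDoc, minCount = minCount)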
    
    return(dfm)
    
}

plotDFM <- function(dfm) {
    
    # plot in colors with some additional options passed to wordcloud
    plot(dfm, random.color = TRUE, rot.per = .25, colors = sample(colors()[2:128], 5))
    
}

# =========================================================================
# Function: create_df_matrix
# =========================================================================
# Description: 
#              Accepts a dfm object, applies the tf-idf function, and
#              returns a dataframe
#
#    tfidf computes term frequency-inverse document frequency weighting. 
#    The default is not to normalize term frequency (by computing
#    relative term frequency within document) but this will be
#    performed if normalize = TRUE.
# 
# Parameters: 
#            1. dfm to process
#            2. type of email - spam or ham
#
# Return: dataframe
# =========================================================================
create_df_matrix <- function(dfm, emailType) {
    
    # apply the tfidf function
    mat <- data.matrix(tfidf(dfm))
 
    # convert to a dataframe
    df <- as.data.frame(mat, stringsAsFactors =  FALSE)
    df$Source <- emailType
    
    return(df)
}
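
To make the weighting concrete, here is a small base-R sketch of tf-idf on a made-up count matrix. It assumes the common unnormalized formulation, raw term count times log10(N / document frequency); the base of the logarithm, the toy data, and the variable names are assumptions for illustration, not a statement of what the quanteda function computes internally.

```r
# Toy document-feature counts (hypothetical data): rows = documents, columns = terms
counts <- matrix(c(2, 0, 1,
                   0, 1, 3,
                   0, 2, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("free", "meeting", "http")))

n_docs   <- nrow(counts)           # N: number of documents
doc_freq <- colSums(counts > 0)    # number of documents containing each term
idf      <- log10(n_docs / doc_freq)

# multiply every column (term) by its inverse document frequency
tfidf_mat <- sweep(counts, 2, idf, `*`)

round(tfidf_mat, 3)
```

In this sketch "http" occurs in every document, so its weight drops to zero, while "free", which occurs in only one document, keeps the largest weight per occurrence.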

1. Download and Create the Spam and Ham Corpuses

The following files are the input to the document classification. Each file is labeled as either (1) Ham, email that the recipient generally wants to receive, or (2) Spam, typically unsolicited email that is generated in bulk and generally unwanted by the recipient.

Filename	Type
20021010_easy_ham.tar.bz2	Ham
20021010_spam.tar.bz2	Spam
20021010_hard_ham.tar.bz2	Ham
20030228_spam_2.tar.bz2	Spam
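
The download call that follows relies on two objects, URL and files, whose definitions are not shown in this write-up. A minimal sketch, assuming they were built from the corpus location and the table above (the exact definitions are an assumption):

```r
# Assumed definitions (not shown in the original write-up).
# Base URL of the SpamAssassin public corpus; the trailing slash matters
# because download_and_untar builds the address with paste0(URL, filename).
URL <- "https://spamassassin.apache.org/publiccorpus/"

# The four archives listed in the table above
files <- c("20021010_easy_ham.tar.bz2",
           "20021010_spam.tar.bz2",
           "20021010_hard_ham.tar.bz2",
           "20030228_spam_2.tar.bz2")

# Full download addresses, as paste0(URL, filename) would build them
paste0(URL, files)
```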

# use lapply to download and untar every file named in the files vector
lapply(files, download_and_untar)

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE

2. Create the Spam Corpus

Create the Spam corpus by combining the files extracted from the spam and spam_2 archives of the SpamAssassin public corpus.

########### SPAM ###############

spamCorpus <- createCorpus("./spam", "spam")
spam2Corpus <- createCorpus("./spam_2", "spam")

#combine the 2 Spam corpora 
spamCorpusCombined <- spamCorpus + spam2Corpus

Let’s look at the combined Spam corpus using the summary function:

## Corpus consisting of 1899 documents, showing 20 documents.
## 
##                                   Text Types Tokens Sentences author
##  0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1  1170   1835         1   <NA>
##  0001.bfc8d64d12b325ff385cca8d07b84288   341   1495        18   <NA>
##  0002.24b47bb3ce90708ae29d0aec1da08610   237    637        10   <NA>
##  0003.4b3d943b8df71af248d12f8b2e7a224a   201    474         9   <NA>
##  0004.1874ab60c71f0b31b580f313a3f6e777   364   1112        46   <NA>
##  0005.1f42bb885de0ef7fc5cd09d34dc2ba54   227    602         7   <NA>
##  0006.7a32642f8c22bbeb85d6c3b5f3890a2c   378    821        27   <NA>
##  0007.859c901719011d56f8b652ea071c1f8b   189    423        10   <NA>
##  0008.9562918b57e044abfbce260cc875acde   613   5937        22   <NA>
##  0009.c05e264fbf18783099b53dbc9a9aacda   424    951        40   <NA>
##  0010.7f5fb525755c45eb78efc18d7c9ea5aa   231    825         5   <NA>
##  0011.2a1247254a535bac29c476b86c708901   199    469         9   <NA>
##  0012.7bc8e619ad0264979edce15083e70a02   166    535         7   <NA>
##  0013.9034ac0917f6fdb82c5ee6a7509029ed   199    470         9   <NA>
##  0014.ed99ffe0f452b91be11684cbfe8d349c   308   1876        38   <NA>
##  0015.1b871d654560011a0aaa29bb4e9054f7   182    501         7   <NA>
##  0016.f9c349935955e1ccc7626270da898445   314   1699        10   <NA>
##  0017.49ab70c7a4042cb1c695a0e59a6ede54   357    784        40   <NA>
##  0018.259154a52bc55dcae491cfded60a5cd2   186    417        11   <NA>
##  0019.939e70d8367f315193e4bc5be80dc262   326    723        19   <NA>
##        datetimestamp description heading
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##  2016-04-10 23:26:28        <NA>    <NA>
##                                     id language origin email_type source
##  0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1    en_US   <NA>       spam   spam
##  0001.bfc8d64d12b325ff385cca8d07b84288    en_US   <NA>       spam   spam
##  0002.24b47bb3ce90708ae29d0aec1da08610    en_US   <NA>       spam   spam
##  0003.4b3d943b8df71af248d12f8b2e7a224a    en_US   <NA>       spam   spam
##  0004.1874ab60c71f0b31b580f313a3f6e777    en_US   <NA>       spam   spam
##  0005.1f42bb885de0ef7fc5cd09d34dc2ba54    en_US   <NA>       spam   spam
##  0006.7a32642f8c22bbeb85d6c3b5f3890a2c    en_US   <NA>       spam   spam
##  0007.859c901719011d56f8b652ea071c1f8b    en_US   <NA>       spam   spam
##  0008.9562918b57e044abfbce260cc875acde    en_US   <NA>       spam   spam
##  0009.c05e264fbf18783099b53dbc9a9aacda    en_US   <NA>       spam   spam
##  0010.7f5fb525755c45eb78efc18d7c9ea5aa    en_US   <NA>       spam   spam
##  0011.2a1247254a535bac29c476b86c708901    en_US   <NA>       spam   spam
##  0012.7bc8e619ad0264979edce15083e70a02    en_US   <NA>       spam   spam
##  0013.9034ac0917f6fdb82c5ee6a7509029ed    en_US   <NA>       spam   spam
##  0014.ed99ffe0f452b91be11684cbfe8d349c    en_US   <NA>       spam   spam
##  0015.1b871d654560011a0aaa29bb4e9054f7    en_US   <NA>       spam   spam
##  0016.f9c349935955e1ccc7626270da898445    en_US   <NA>       spam   spam
##  0017.49ab70c7a4042cb1c695a0e59a6ede54    en_US   <NA>       spam   spam
##  0018.259154a52bc55dcae491cfded60a5cd2    en_US   <NA>       spam   spam
##  0019.939e70d8367f315193e4bc5be80dc262    en_US   <NA>       spam   spam
## 
## Source:  Combination of corpuses spamCorpus and spam2Corpus
## Created: Sun Apr 10 19:26:35 2016
## Notes:   spam

2.1 Build the document-feature matrix using the Spam corpus

dfmSpam <- buildDFM(spamCorpusCombined, round(length(docnames(spamCorpusCombined))/10), 50)

## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 1,899 documents
##    ... indexing features: 85,477 feature types
##    ... removed 163 features, from 174 supplied (glob) feature types
##    ... stemming features (English), trimmed 5537 feature variants
##    ... created a 1899 x 79777 sparse dfm
##    ... complete. 
## Elapsed time: 14.94 seconds.
## Removing features occurring fewer than 50 times: 77866
## Removing features occurring in fewer than 190 documents: 79440
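
The last two log lines come from the trim step in buildDFM. With minCount = 50 and minDoc = round(1899 / 10) = 190, the log suggests a feature is kept only if it clears both thresholds. As a rough illustration of that filter, here is a base-R sketch on a small made-up count matrix (the matrix, thresholds, and variable names are hypothetical, and this is not quanteda's implementation):

```r
# Toy document-feature counts (hypothetical data)
counts <- matrix(c(5, 0, 1,
                   4, 1, 0,
                   6, 0, 0,
                   3, 2, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4),
                                 c("font", "meeting", "rare")))

min_count <- 3   # minimum total occurrences across all documents
min_doc   <- 2   # minimum number of documents containing the feature

total_count <- colSums(counts)        # how often each feature occurs overall
doc_count   <- colSums(counts > 0)    # in how many documents it occurs

# keep a feature only if it passes both thresholds
keep    <- total_count >= min_count & doc_count >= min_doc
trimmed <- counts[, keep, drop = FALSE]

colnames(trimmed)
```

Here "rare" is dropped because it occurs once in a single document, while "font" and "meeting" pass both thresholds.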

dim(dfmSpam)              # basic dimensions of the dfm

## [1] 1899  337

topfeatures(dfmSpam, 20)  # top features of the spam dfm

##     3d   font     td     br      b   size     tr   nbsp      p   face 
##  40927  40668  21170  20148  14866  14770  12640  11868  11858  11563 
##   http  width  color receiv  align  arial     id center height   tabl 
##  11393  11123  10769  10356   7949   7076   6887   5980   5569   5509

plot(topfeatures(dfmSpam, 100), log = "y", cex = .6, ylab = "Term frequency", main = "Top Features of Spam")

Wordcloud of the top 100 Spam features (words):

3. Create the Ham Corpus

Create the Ham corpus by combining the files extracted from the easy_ham and hard_ham archives of the SpamAssassin public corpus.

########### HAM ###############

hamCorpus <- createCorpus("./easy_ham", "ham")
ham2Corpus <- createCorpus("./hard_ham", "ham")


#combine the 2 Ham corpora 
hamCorpusCombined <- hamCorpus + ham2Corpus

The summary of the Ham Corpus:

## Corpus consisting of 2801 documents, showing 20 documents.
## 
##                                   Text Types Tokens Sentences author
##  0001.ea7e79d3153e7469e7a9c3e0af6a357e   300   1080        25   <NA>
##  0002.b3120c4bcbf3101e661161ee7efcb8bf   250    802         5   <NA>
##  0003.acfc5ad94bbd27118a0d8685d18c89dd   326    904        11   <NA>
##  0004.e8d5727378ddde5c3be181df593f1712   275    742         9   <NA>
##  0005.8c3b9e9c0f3f183ddaf7592a11b99957   365   1141        23   <NA>
##  0006.ee8b0dba12856155222be180ba122058   272    802        10   <NA>
##  0007.c75188382f64b090022fa3b095b020b0   246    792         7   <NA>
##  0008.20bc0b4ba2d99aae1c7098069f611a9b   306    928         9   <NA>
##  0009.435ae292d75abb1ca492dcc2d5cf1570   291    850        14   <NA>
##  0010.4996141de3f21e858c22f88231a9f463   688   1904        42   <NA>
##  0011.07b11073b53634cff892a7988289a72e   324   1191        30   <NA>
##  0012.d354b2d2f24d1036caf1374dd94f4c94   268    812        11   <NA>
##  0013.ff597adee000d073ae72200b0af00cd1   242    787        14   <NA>
##  0014.532e0a17d0674ba7a9baa7b0afe5fb52   363   1149        34   <NA>
##  0015.a9ff8d7550759f6ab62cc200bdf156e7   261    802        10   <NA>
##  0016.d82758030e304d41fb3f4ebbb7d9dd91   308    920        17   <NA>
##  0017.d81093a2182fc9135df6d9158a8ebfd6   271    757        15   <NA>
##  0018.ba70ecbeea6f427b951067f34e23bae6   400   1425        45   <NA>
##  0019.a8a1b2767e83b3be653e4af0148e1897   541   1543        36   <NA>
##  0020.ef397cef16f8041242e3b6560e168053   223    589         6   <NA>
##        datetimestamp description heading
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##  2016-04-10 23:26:58        <NA>    <NA>
##                                     id language origin email_type   source
##  0001.ea7e79d3153e7469e7a9c3e0af6a357e    en_US   <NA>        ham easy_ham
##  0002.b3120c4bcbf3101e661161ee7efcb8bf    en_US   <NA>        ham easy_ham
##  0003.acfc5ad94bbd27118a0d8685d18c89dd    en_US   <NA>        ham easy_ham
##  0004.e8d5727378ddde5c3be181df593f1712    en_US   <NA>        ham easy_ham
##  0005.8c3b9e9c0f3f183ddaf7592a11b99957    en_US   <NA>        ham easy_ham
##  0006.ee8b0dba12856155222be180ba122058    en_US   <NA>        ham easy_ham
##  0007.c75188382f64b090022fa3b095b020b0    en_US   <NA>        ham easy_ham
##  0008.20bc0b4ba2d99aae1c7098069f611a9b    en_US   <NA>        ham easy_ham
##  0009.435ae292d75abb1ca492dcc2d5cf1570    en_US   <NA>        ham easy_ham
##  0010.4996141de3f21e858c22f88231a9f463    en_US   <NA>        ham easy_ham
##  0011.07b11073b53634cff892a7988289a72e    en_US   <NA>        ham easy_ham
##  0012.d354b2d2f24d1036caf1374dd94f4c94    en_US   <NA>        ham easy_ham
##  0013.ff597adee000d073ae72200b0af00cd1    en_US   <NA>        ham easy_ham
##  0014.532e0a17d0674ba7a9baa7b0afe5fb52    en_US   <NA>        ham easy_ham
##  0015.a9ff8d7550759f6ab62cc200bdf156e7    en_US   <NA>        ham easy_ham
##  0016.d82758030e304d41fb3f4ebbb7d9dd91    en_US   <NA>        ham easy_ham
##  0017.d81093a2182fc9135df6d9158a8ebfd6    en_US   <NA>        ham easy_ham
##  0018.ba70ecbeea6f427b951067f34e23bae6    en_US   <NA>        ham easy_ham
##  0019.a8a1b2767e83b3be653e4af0148e1897    en_US   <NA>        ham easy_ham
##  0020.ef397cef16f8041242e3b6560e168053    en_US   <NA>        ham easy_ham
## 
## Source:  Combination of corpuses hamCorpus and ham2Corpus
## Created: Sun Apr 10 19:27:05 2016
## Notes:   ham

3.1 Build the document-feature matrix (dfm) using the Ham corpus.

dfmHam <- buildDFM(hamCorpusCombined, round(length(docnames(hamCorpusCombined))/10), 50)

## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 2,801 documents
##    ... indexing features: 63,673 feature types
##    ... removed 171 features, from 174 supplied (glob) feature types
##    ... stemming features (English), trimmed 10241 feature variants
##    ... created a 2801 x 53261 sparse dfm
##    ... complete. 
## Elapsed time: 13.42 seconds.
## Removing features occurring fewer than 50 times: 50796
## Removing features occurring in fewer than 280 documents: 52997

dim(dfmHam)

## [1] 2801  264

plot(topfeatures(dfmHam, 100), log = "y", cex = .6, ylab = "Term frequency", main = "Top Features of Ham")

Wordcloud of the top 100 Ham features (words):

4. Build the k-Nearest Neighbor Model for Document Classification

Apply the tfidf function to the Spam and Ham dfm objects to create two matrices of tf-idf-weighted term frequencies. These two matrices are then combined using rbind.fill from the plyr package, which fills the columns missing from either matrix with NA.

dfSpam <- create_df_matrix(dfmSpam, "spam")  

dfHam <- create_df_matrix(dfmHam, "ham")  

stacked.df <- rbind.fill(dfSpam, dfHam)

# set NA values to 0
stacked.df[is.na(stacked.df)] <- 0
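
The Spam and Ham data frames do not share the same set of term columns, which is why a plain rbind would not work here. To show what the stacking step does, here is a base-R sketch that mimics the behaviour on two tiny made-up data frames (the data, the pad_columns helper, and the variable names are hypothetical; the analysis itself uses plyr's rbind.fill):

```r
# Toy tf-idf data frames with only partly overlapping term columns (hypothetical data)
df_spam <- data.frame(free = c(0.9, 0.4), click = c(0.2, 0.0),
                      Source = "spam", stringsAsFactors = FALSE)
df_ham  <- data.frame(meeting = 0.5, click = 0.1,
                      Source = "ham", stringsAsFactors = FALSE)

# Hypothetical helper: add any missing columns as NA, then order columns consistently
pad_columns <- function(df, cols) {
    for (col in setdiff(cols, names(df))) df[[col]] <- NA
    df[cols]
}

all_cols <- union(names(df_spam), names(df_ham))
stacked  <- rbind(pad_columns(df_spam, all_cols),
                  pad_columns(df_ham, all_cols))

# a term absent from one class has weight 0 there, so replace NA with 0
stacked[is.na(stacked)] <- 0

stacked
```

The result has one row per document and one column per term seen in either class, with zeros where a term never occurred in that class.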

This script is based on Timothy D’Auria’s YouTube tutorial “How to Build a Text Mining, Machine Learning Document Classification System in R!” (https://www.youtube.com/watch?v=j1V2McKbkLo).

## Create the training and test datasets 

train.idx <- sample(nrow(stacked.df), ceiling(nrow(stacked.df) * 0.7))
test.idx <- (1:nrow(stacked.df)) [-train.idx]

length(train.idx)  # number of training documents

## [1] 3290

length(test.idx)

## [1] 1410
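
Because sample draws at random, the split (and so the confusion matrix further down) changes from run to run. A minimal sketch of a reproducible version of the same 70/30 split, assuming an arbitrary seed value (the seed and the variable names are illustrative, not from the original script):

```r
# Fix the random number generator so the same split is drawn on every run
# (607 is an arbitrary, illustrative seed)
set.seed(607)

n_docs <- 4700   # 1,899 spam + 2,801 ham documents

# 70% of the row indices for training, the remainder for testing
train_idx <- sample(n_docs, ceiling(n_docs * 0.7))
test_idx  <- setdiff(seq_len(n_docs), train_idx)

length(train_idx)
length(test_idx)
```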

# class labels (spam/ham) for every document
tdm.email <- stacked.df[, "Source"]
# feature columns only, with the label column dropped
stacked.nl <- stacked.df[, !colnames(stacked.df) %in% "Source"]

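Run the kNN prediction using the training and test datasets. Because no k is supplied, knn uses its default of k = 1, so each test document takes the label of its single nearest training document.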

knn.pred <- knn(stacked.nl[train.idx, ], stacked.nl[test.idx, ], tdm.email[train.idx])
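
To show the mechanics of the call on something small enough to check by eye, here is a sketch using knn from the class package on made-up two-term feature vectors (the data and variable names are hypothetical):

```r
library(class)

# Toy tf-idf-like features for four labeled training documents (hypothetical data)
train_x <- matrix(c(0.9, 0.1,
                    0.8, 0.2,
                    0.1, 0.9,
                    0.2, 0.7),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("free", "meeting")))
train_y <- factor(c("spam", "spam", "ham", "ham"))

# Two unlabeled documents to classify
test_x <- matrix(c(0.88, 0.12,
                   0.12, 0.85),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("free", "meeting")))

# k = 1: each test row gets the label of its nearest training row
# (Euclidean distance)
pred <- knn(train_x, test_x, cl = train_y, k = 1)

pred
```

The first test document sits next to the two spam points and the second next to the ham points, so the predictions are spam and ham respectively.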

The resulting Confusion Matrix:

conf.mat <- table("Predictions" = knn.pred, Actual = tdm.email[test.idx])

##            Actual
## Predictions ham spam
##        ham  802   16
##        spam   6  586

The accuracy of the model is 98.4397163%: 1,388 of the 1,410 test documents (802 ham and 586 spam) were classified correctly.
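
The accuracy figure can be recomputed from the confusion matrix as the share of test documents on the diagonal. A short sketch that rebuilds the matrix from the counts printed above and also derives precision and recall for the spam class (these two extra metrics and the variable names are additions for illustration):

```r
# Confusion matrix rebuilt from the printed counts
# (rows = predictions, columns = actual classes; matrix() fills column by column)
conf_mat <- matrix(c(802, 6,
                     16, 586),
                   nrow = 2,
                   dimnames = list(Predictions = c("ham", "spam"),
                                   Actual      = c("ham", "spam")))

# accuracy: correctly classified documents (diagonal) over all test documents
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) * 100

# precision for spam: of the documents predicted spam, how many really are spam
spam_precision <- conf_mat["spam", "spam"] / sum(conf_mat["spam", ])

# recall for spam: of the actual spam documents, how many were caught
spam_recall <- conf_mat["spam", "spam"] / sum(conf_mat[, "spam"])

accuracy
spam_precision
spam_recall
```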

# To output the predictions 

df.pred <- cbind(knn.pred, stacked.nl[test.idx, ])