Natural Language Processing - Week 2 Milestone Report

Introduction

The goal of this milestone report is to explore the major features of the text data given to us for the Coursera Data Science Capstone through Johns Hopkins University. The project is sponsored by SwiftKey. The end-goal is to create text-prediction application with R’s Shiny package that predicts words using a natural language processing model.

The first step here is to get an idea of what kinds of pre-processing will be necessary to prepare the data for creating the model. Specifically, certain kinds of characters and words need to be removed and/or modified to aid in prediction accuracy. Finally, I need to create list of single words, and two/three word phrases to see which occur most frequently.

Downloading and Splitting the Raw Data

Load Libraries

The following R packages are necessary for this analysis.

library(tm)
library(quanteda)
library(dplyr)
library(ggplot2)
library(stringr)
library(pander)

Download

The following downloads the dataset and unzips it into the current working directory. Next, the files of interest are read into R with the readLines function. The data is available here.

fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists(basename(fileURL))){
    download.file(fileURL, basename(fileURL))
    unzip(basename(fileURL))
}

blog <- readLines(con = "./final/en_US/en_US.blogs.txt", encoding= "UTF-8", skipNul = T)
news <- readLines(con = "./final/en_US/en_US.news.txt", encoding= "UTF-8", skipNul = T)
twit <- readLines(con = "./final/en_US/en_US.twitter.txt", encoding= "UTF-8", skipNul = T)

Split Into Separate Files

Now that the dataset has been obtained and the Corpus created, it needs to be split into a training set, development set, and test set. Since the files contain a large number of documents, a small sample of the training set will be created for exploratory analysis. The ratio will be 60% training, 20% development, and 20% test. The small training dataset for exploratory analysis will be 20% of the training set, or 12% of the total dataset.

#Randomly permute the order of lines in the files for splitting
set.seed(310)
blog <- blog[sample(seq(length(blog)))]
news <- news[sample(seq(length(news)))]
twit <- twit[sample(seq(length(twit)))]

#Split the blog text
n <- length(blog)
train.blog <- blog[1:floor(n*0.6)]
dev.blog <- blog[(floor(n*0.6)+1):floor(n*0.8)]
test.blog <- blog[(floor(n*0.8)+1):n]
#Split the news text
n <- length(news)
train.news <- news[1:floor(n*0.6)]
dev.news <- news[(floor(n*0.6)+1):floor(n*0.8)]
test.news <- news[(floor(n*0.8)+1):n]
#Split the twitter text
n <- length(twit)
train.twit <- twit[1:floor(n*0.6)]
dev.twit <- twit[(floor(n*0.6)+1):floor(n*0.8)]
test.twit <- twit[(floor(n*0.8)+1):n]


#Take a smaller sample of the training sets for exploratory analysis
set.seed(424)
n <- sample(seq(length(train.blog)))
small.blog <- train.blog[n[1:floor(length(n)*0.2)]]
n <- sample(seq(length(train.news)))
small.news <- train.news[n[1:floor(length(n)*0.2)]]
n <- sample(seq(length(train.twit)))
small.twit <- train.twit[n[1:floor(length(n)*0.2)]]

Note: At this point, all the datasets are written to local files so that they can be loaded later. Code is not shown.

Basic File Information

Here we can see the list of files created, their file size, and a rough estimate of the number of words contained in each file. To get the word estimate, the str_count function from the stringer package was used to count the number of character sequences separated by a space. Specifically, with the call: str_count(x, "\\S+").

As you can see, the total number of words in the three “small” files totals approximately 12 million, which should provide enough data for a representative exploratory analysis.

fileName	fileSize	nLines	numWords	avgNumWords
en_US.blogs	200.4 Mb	899,288	37,334,131	41.5
en_US.news	196.3 Mb	1,010,243	34,372,529	34
en_US.twitter	159.4 Mb	2,360,148	30,373,583	12.9
train.blog	119.4 Mb	539,572	22,387,983	41.5
train.news	117.6 Mb	606,145	20,635,720	34
train.twit	96 Mb	1,416,088	18,230,256	12.9
dev.blog	39.9 Mb	179,858	7,484,601	41.6
dev.news	39.2 Mb	202,049	6,881,963	34.1
dev.twit	32 Mb	472,030	6,072,570	12.9
test.blog	39.8 Mb	179,858	7,461,547	41.5
test.news	39.1 Mb	202,049	6,854,846	33.9
test.twit	32 Mb	472,030	6,070,757	12.9
small.blog	24 Mb	107,914	4,504,186	41.7
small.news	23.5 Mb	121,229	4,126,578	34
small.twit	19.2 Mb	283,217	3,647,260	12.9

Transformations

I will briefly describe the transformations I did to the Corpus and my reasons or doing so.

Check the types of characters and transform to ASCII to eliminate foreign letters and symbols.
Remove profanity words found on this list so that the prediction app will not produce obscene words.
Remove all unicode tags “<>”, Twitter-style hashtag “#” words, and urls because they are not actual words.
Remove all remaining punctuation except for end-of-sentence (EOS) punctuation and apostrophes. In some cases, it is useful to remove all punctuation, but in this case, the EOS punctuation provides meaningful context to bigrams and trigrams. Apostrophes are preserved because eliminating them could destroy context as well.
Create an , end-of-sentence, tag to replace all “. ! ?” for context clarity.
Clean up various irregularities caused by the previous transformations.
Remove all numbers since they are not words. I will later explore using a tag instead.
Strip all excess whitespace to clean up the mess created by all the other transformations.

NOTE: Other common pre-processing steps such as stopword removal and stemming will not be done. Stopwords are the most commonly used words in a language and removing them will negatively affect prediction accuracy. Similarly, stemming, or removal of suffixes could have detrimental effects on prediction.

Check for Character Types

corp <- VCorpus(DirSource("data/small"))
myChars <- function(x, n=seq(x)) {
    # x: a Corpus
    # n: the elements of x for which characters will be returned
    require(dplyr)
    t <- character()
    for(i in n){
        t <- c(t, x[[i]][[1]])
    }
    t %>%
    str_split("") %>%
    sapply(function(x) x[-1]) %>%
    unlist %>%
    unique %>%
    sort(dec=F)
}
chars <- myChars(corp)
print(chars, quote = F)

##   [1]       \003   \b     \020   \035   '      -                        
##  [11] !      "      #      $      %      &      (      )      *      ,     
##  [21] .      /      :      ;      ?      @      [      \\     ]      ^     
##  [31] _      `      {      |      }      ~      ¡      ¦      ¨      ¯     
##  [41] ´      ¸      ¿                                               
##  [51]             ¢      £      ¤      ¥            +      <      =     
##  [61] >      ±      «      »      ×      ÷      §      ©      ¬      ®     
##  [71] °      µ      ¶      ·                                         
##  [81]       0      ¼      ½      ¾      1      ¹      2      ²      3     
##  [91] ³      4      5      6      7      8      9      a      A      á     
## [101] Á      à      À      â      Â      ä      Ä      ã      Ã      å     
## [111] Å      æ      Æ      b      B      c      C      ç      Ç      d     
## [121] D      ð      Ð      e      E      é      É      è      È      ê     
## [131] Ê      ë      f      F      g      G      h      H      i      I     
## [141] í      Í      ì      î      Î      ï      j      J      k      K     
## [151] l      L      m      M      n      N      ñ      o      O      º     
## [161] ó      Ó      ò      ô      ö      Ö      õ      ø      Ø           
## [171] p      P      q      Q      r      R      s      S            ß     
## [181] t      T      þ      Þ            u      U      ú      Ú      ù     
## [191] û      ü      Ü      v      V      w      W      x      X      y     
## [201] Y      ý      ÿ      z      Z

There are lots of foreign characters and symbols that are unnecessary or harmful to prediction, so they need to be converted or deleted.

dat <- sapply(corp, function(row) iconv(row, "latin1", "ASCII", sub=""))
corp <- VCorpus(VectorSource(dat)); rm(dat)
chars <- myChars(corp)
print(chars, quote = F)

##  [1] \003 \b   \020 \035 '    -         !    "    #    $    %    &    (   
## [15] )    *    ,    .    /    :    ;    ?    @    [    \\   ]    ^    _   
## [29] `    {    |    }    ~    +    <    =    >    0    1    2    3    4   
## [43] 5    6    7    8    9    a    A    b    B    c    C    d    D    e   
## [57] E    f    F    g    G    h    H    i    I    j    J    k    K    l   
## [71] L    m    M    n    N    o    O    p    P    q    Q    r    R    s   
## [85] S    t    T    u    U    v    V    w    W    x    X    y    Y    z   
## [99] Z

Now, the number of characters has been drastically reduced and can be more easily dealt with in the following transformations.

swap <- content_transformer(function(x, from, to) gsub(from, to, x))
corp <- tm_map(corp, content_transformer(tolower))
# Remove profanity words
profanityWords <- readLines(con="data/profanityWords.txt", skipNul = T)
corp <- tm_map(corp, removeWords, profanityWords)
# Replace all foreign unicode character codes with a space
corp <- tm_map(corp, swap, "<.*>", " ")
# Delete all twitter-style hashtag references
corp <- tm_map(corp, swap, "#.*", " ")
# Delete website names
corp <- tm_map(corp, swap, "www\\..*", " ")
corp <- tm_map(corp, swap, ".*\\.com", " ")
# Replace all punctuation except EOS punctuation and apostrophe with a space
corp <- tm_map(corp, swap, "[^[:alnum:][:space:]\'\\.\\?!]", " ")
# Delete numbers with decimal places
corp <- tm_map(corp, swap, "[0-9]+\\.[0-9]+", "")
# Replace all instances of multiple EOS punctuation with one instance
corp <- tm_map(corp, swap, "([\\.\\?!]){2,}", ". ")
# Replace . ? ! with <EOS> tag
corp <- tm_map(corp, swap, "\\. |\\.$", " <EOS> ")
corp <- tm_map(corp, swap, "\\? |\\?$", " <EOS> ")
corp <- tm_map(corp, swap, "! |!$", " <EOS> ")
# Fix instances of probable accidental typo with EOS punctuation
corp <- tm_map(corp, swap, "[[:alnum:]]+\\?[[:alnum:]]+", " <EOS> ")
corp <- tm_map(corp, swap, "[[:alnum:]]+![[:alnum:]]+", " <EOS> ")
# Remove any extra ? !
corp <- tm_map(corp, swap, "!", " ")
corp <- tm_map(corp, swap, "\\?", " ")
# Convert very common occurence of u.s to US
corp <- tm_map(corp, swap, "u\\.s", "US")
corp <- tm_map(corp, swap, "\\.", "")
# Clean up leftover punctuation artifacts
corp <- tm_map(corp, swap, " 's", " ")
corp <- tm_map(corp, swap, " ' ", " ")
corp <- tm_map(corp, swap, "\\\\", " ")

corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)
if(!dir.exists("./data/corp")) {dir.create("./data/corp")}
if(!dir.exists("./data/corp/small")) {dir.create("./data/corp/small")}
writeCorpus(corp, "./data/corp/small",
            filenames = c("cleanSmallBlog", "cleanSmallNews", "cleanSmallTwit"))

Data Analysis

After the transformations, I am ready to do some exploratory analysis to determine the most frequent unigrams, bigrams, and trigrams (sets of 1, 2, and 3 words that occur together).

Let’s first look and the list of characters in the corpus and a small sample of text from one of the documents to get a feel for how it looks.

rm(corp) # Reload the corpus from the new file to ensure changes are set
corp <- VCorpus(DirSource("data/corp/small"))
print(myChars(corp), quote = F); print(strwrap(corp[[2]]$content[c(4,6)]), quote=F)

##  [1] '   < > a b c d e E f g h i j k l m n o O p q r s S t u U v w x y z

## [1] the storm prediction center has placed chicago in a severe weather
## [2] risk area wednesday and wednesday night and again thursday and    
## [3] thursday night <EOS> this indicates thunderstorms may at times    
## [4] grow severe <EOS>                                                 
## [5] vascular technologist lauren sherrill presses a hand held         
## [6] ultrasound device which looks like a sturdy silver pen on the top 
## [7] of his foot near his big toe <EOS> the ultrasound called a doppler
## [8] produces a static sound <EOS> then theres the thumping of         
## [9] granthams pulse <EOS>

As you can see, the text has far fewer characters than the original document and only apostrophes for punctuation. This will make it so that all instances of the same word are matched regardless of accent marks and capitalization.

Frequency of Terms

At this point, I realized the limitations of my computer’s ability to create document-term matrices using the tm package’s DocumentTermMatrix() function. Luckily I found the quanteda package, which performs many of the same functions as tm, but much faster, as much of its backend code is done in C++. (More on this in the Next Steps section below)

corp <- quanteda::corpus(corp)
freq_df <- function(x){
    # This helper function takes a token output and outputs a sorted N-gram frequency table
    fr <- sort(colSums(as.matrix(x)),decreasing = TRUE)
    df <- data.frame(n_gram = names(fr), freq=fr, row.names = NULL)
    return(df)
}

# Create N-grams and dataframes
uni <- dfm(tokens(corp, removeSymbols=TRUE), tolower=FALSE)
uni_freq <- freq_df(uni)
rm(uni)
uni_freq <- uni_freq[-1,]

biToks <- tokens_ngrams(tokens(corp, removeSymbols=TRUE), n=2L)
bi <- dfm(biToks, tolower=FALSE); rm(biToks)
bi_freq <- freq_df(bi)
rm(bi)
bi_freq <- bi_freq[-grep("EOS", bi_freq$n_gram),]

triToks <- tokens_ngrams(tokens(corp, removeSymbols=TRUE), n=3L)
tri <- dfm(triToks, tolower=FALSE); rm(triToks)
tri_freq <- freq_df(tri)
rm(tri)
tri_freq <- tri_freq[-grep("EOS", tri_freq$n_gram),]

Visualization

Let’s take a quick look at some plots of the most frequent unigrams, bigrams, and trigrams. I removed all instances of phrases that included the tag, since it was by far the most common and I am only interested in intra-sentence phrases at this point.

top40 <- function(df, title) {
    df <- df[1:40,]
    df$n_gram <- factor(df$n_gram, levels = df$n_gram[order(-df$freq)])
  ggplot(df, aes(x = n_gram, y = freq)) +
    geom_bar(stat = "identity", fill = "dodgerblue3", colour = "gray40") +
    labs(title = title, x="N-Gram", y="Count") +
    theme(axis.text.x = element_text(angle=60, size=12, hjust = 1),
          axis.title = element_text(size=14, face="bold"),
          plot.title = element_text(size=16, face="bold"))
}
top40(uni_freq, "40 Most Common Unigrams")

top40(bi_freq, "40 Most Common Bigrams")

top40(tri_freq, "40 Most Common Trigrams")

Next Steps

As I mentioned earlier, the quanteda package’s functions ran much faster on my machine. I will definitely use it for the remainder of the project.

Building The Models

I need to do the following:
- Carefully reconsider all the pre-processing steps I used to see if there is anything important I missed.
- Create dataframes of 1-,2-,3-, and possibly 4-grams based on the larger training dataset including word-relation frequencies.
- Look into using the filehash package to load parts of the data at a time to get around my machine’s RAM limitations.
- Create and test several prediction algorithms, with and without and tags and apply them to the development dataset to determine their efficacy and speed.
- Find a way to use either Katz backoff or Kneser-Ney smoothing to deal with unknown words.
- The final goal is to create a Shiny app, with a simple user interface that provides reactive predictions as quickly and accurately as possible.