Natural Language Processing - Milestone Report

Author: Hon Yung Ho

Executive Summary

This is the milestone report for the Natural Language Processing project of the Coursera Data Science Capstone. We have been given a data set with three US English text files from a corpus called HC Corpora (www.corpora.heliohost.org). The ultimate goal of the capstone project is to create a predictive algorithm that fills in a missing word for a given phrase. Once the predictive algorithm is created, we will build a Shiny app that lets a user interact with the prediction engine. For this milestone report, we explore the data to understand the size of the files we are working with, as well as the most frequent individual words and two- and three-word combinations.

Data File Loadings

First of all, we download the HC Corpora data from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. From the folder “en_US”, we read in all lines of three text files for further exploration and analysis: “en_US.blogs.txt”, “en_US.news.txt”, and “en_US.twitter.txt”.

# Defines the unzipped files
blogFile <- "./data/raw/final/en_US/en_US.blogs.txt"
newsFile <- "./data/raw/final/en_US/en_US.news.txt"
twitterFile <- "./data/raw/final/en_US/en_US.twitter.txt"

# Skips the download and unzip steps if the files are already there
if(!file.exists(blogFile) || !file.exists(newsFile) || !file.exists(twitterFile)) {
    if(!dir.exists("./data/raw")) dir.create("./data/raw", recursive = TRUE)
    sourcefile <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    download.file(sourcefile, "./data/raw/Coursera-SwiftKey.zip")
    unzip("./data/raw/Coursera-SwiftKey.zip", exdir = "./data/raw")
}
# Lists the unzipped files
list.files("./data/raw/final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
# Opens the three files in binary mode (to avoid issues with embedded
# control characters) and reads all lines into memory
blogsCon <- file(blogFile, "rb")
newsCon <- file(newsFile, "rb")
twitterCon <- file(twitterFile, "rb")

blogs <- readLines(blogsCon, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(newsCon, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitterCon, encoding = "UTF-8", skipNul = TRUE)

# Closes file connections
close(blogsCon)
close(newsCon)
close(twitterCon)

# Removes the connection objects from the workspace
rm(blogsCon, newsCon, twitterCon)

Exploratory Analyses and Data Cleaning

In this section, we want to get an idea of how big each file is, how many words it contains, and how many lines it has.

library(stringi)

# Gets file size
blogsSize <- round(file.info(blogFile)$size / 1024^2, 0)
newsSize <- round(file.info(newsFile)$size / 1024^2, 0)
twitterSize <- round(file.info(twitterFile)$size / 1024^2, 0)

# Gets number of words in a file
blogsWords <- sum(stri_count_words(blogs))
newsWords <- sum(stri_count_words(news))
twitterWords <- sum(stri_count_words(twitter))

# Gets number of lines in a file
blogsLines <- length(blogs)
newsLines <- length(news)
twitterLines <- length(twitter)

# Builds a data frame for file summary
fileInfo <- data.frame(
                c("blogs.txt", "news.txt", "twitter.txt"),
                c(blogsSize, newsSize, twitterSize),
                c(blogsWords, newsWords, twitterWords),
                c(blogsLines, newsLines, twitterLines)
            )
colnames(fileInfo) <- c("File", "Size (MB)", "Words", "Lines")

Summary of the three files:

##          File Size (MB)    Words   Lines
## 1   blogs.txt       200 37546246  899288
## 2    news.txt       196 34762395 1010242
## 3 twitter.txt       159 30093410 2360148

As we can see, blog entries have the most words per line on average, news articles somewhat fewer, and tweets by far the fewest.
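
For instance, the per-line averages can be computed directly from the fileInfo summary built above (a quick sketch, not part of the original pipeline):

# Average words per line for each file, computed from the fileInfo summary
data.frame(File = fileInfo$File, WordsPerLine = round(fileInfo$Words / fileInfo$Lines, 1))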

Since the files contain many non-ASCII characters, we convert the text to ASCII and replace any character that cannot be converted with a space; these extra spaces will be removed later:

blogs <- iconv(blogs, "UTF-8", "ascii", sub = " ")
news <- iconv(news, "UTF-8", "ascii", sub = " ")
twitter <- iconv(twitter, "UTF-8", "ascii", sub = " ")

Sampling

For exploratory analysis, we use a random 1% of the lines from each file as a sample. The samples are saved to disk for building the corpus in the next step.

set.seed(888)

# Uses 1% of the lines as sample
sampleBlogs <- sample(blogs, length(blogs) * 0.01)
sampleNews <- sample(news, length(news) * 0.01)
sampleTwitter <- sample(twitter, length(twitter) * 0.01)

# Writes samples to files (creating the sample directory if needed)
if(!dir.exists("./data/sample")) dir.create("./data/sample", recursive = TRUE)
writeLines(sampleBlogs, "./data/sample/sampleBlogs.txt")
writeLines(sampleNews, "./data/sample/sampleNews.txt")
writeLines(sampleTwitter, "./data/sample/sampleTweets.txt")

# Clears out the memory
rm(blogs, news, twitter, sampleBlogs, sampleNews, sampleTwitter)

Corpus Creation and Cleaning

Now we define a corpus using the sample files that we saved in the previous step:

library(NLP)
library(tm)

myCorpus <- VCorpus(DirSource("data/sample"), readerControl = list(reader = readPlain, language = "en"))

We then clean up the corpus. We are only interested in English words, so we also remove numbers, punctuation, profanity, and common English stopwords.

library(bitops)
library(RCurl)

# Downloads and defines the profanity word list
profanity <- c(t(read.csv(text = getURL("http://www.bannedwordlist.com/lists/swearWords.csv"), header = FALSE)))

# A function to remove a given string that matches a pattern
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

myCorpus <- tm_map(myCorpus, content_transformer(tolower)) # Converts all letters to lower case
myCorpus <- tm_map(myCorpus, toSpace, "/|@|\\|") # Replaces /, @, and | with spaces
myCorpus <- tm_map(myCorpus, removeNumbers) # Removes numbers
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) # Removes English stopwords
myCorpus <- tm_map(myCorpus, removeWords, profanity) # Removes profanity
myCorpus <- tm_map(myCorpus, removePunctuation) # Removes punctuation
myCorpus <- tm_map(myCorpus, toSpace, "[^A-Za-z ]") # Keeps only letters and spaces
myCorpus <- tm_map(myCorpus, removeWords, letters) # Removes single letters
myCorpus <- tm_map(myCorpus, stripWhitespace) # Collapses extra whitespace
myCorpus <- tm_map(myCorpus, stemDocument, language = "english") # Stems the words
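
To verify that the cleaning steps worked as intended, we can peek at a few lines of the cleaned corpus (a quick check; the first document is the blogs sample, since DirSource reads the files in alphabetical order):

# Inspects the first few lines of the cleaned blogs sample
head(content(myCorpus[[1]]), 3)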

Corpus Exploratory Analysis

In this section, we compute a term-document matrix that records the occurrences of each term in each document, and then perform some basic exploratory analysis.

# Computes a term-document matrix that records the occurrences of terms in each document
tdm <- TermDocumentMatrix(myCorpus)

# Randomly inspects 10 terms
inspect(tdm[sample(1:tdm$nrow, 10), 1:tdm$ncol])
## <<TermDocumentMatrix (terms: 10, documents: 3)>>
## Non-/sparse entries: 14/16
## Sparsity           : 53%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## 
##             Docs
## Terms        sampleBlogs.txt sampleNews.txt sampleTweets.txt
##   laschet                  0              0                1
##   couldn                  36              9                1
##   enquiri                  1              0                0
##   trumbo                   0              2                0
##   dessert                 18             14                7
##   vikinga                  1              0                0
##   meadowfoam               1              0                0
##   stopit                   0              0                1
##   demot                    0              4                0
##   psg                      0              0                1
# Lists the first 10 terms that appear at least 200 times
head(findFreqTerms(tdm, lowfreq = 200), 10)
##  [1] "abl"    "accord" "act"    "actual" "add"    "age"    "ago"   
##  [8] "agre"   "allow"  "almost"
# Displays a wordcloud
library(RColorBrewer)
library(wordcloud)
wordcloud(myCorpus, scale = c(5, 0.5), max.words = 50, random.order = FALSE, 
           rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
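
Beyond findFreqTerms, the overall term frequencies can also be ranked directly from the term-document matrix (a small sketch using base R on the tdm object built above):

# Ranks terms by their total frequency across the three sample documents
termFreq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(termFreq, 10)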

N-gram Creation and Analysis

In this section, the corpus is tokenized into unigrams, bigrams, and trigrams so that their frequencies can be analysed and plotted.

library(RWeka)

# Flattens the corpus into a data frame and tokenizes it into unigrams, bigrams, and trigrams
myCorpus_df <- data.frame(text = unlist(sapply(myCorpus, '[', "content")), stringsAsFactors = FALSE)
token_delim <- " \\t\\r\\n.!?,;\"()"
UnigramTokenizer <- NGramTokenizer(myCorpus_df, Weka_control(min = 1, max = 1))
BigramTokenizer <- NGramTokenizer(myCorpus_df, Weka_control(min = 2, max = 2, delimiters = token_delim))
TrigramTokenizer <- NGramTokenizer(myCorpus_df, Weka_control(min = 3, max = 3, delimiters = token_delim))

# Tabulates n-gram frequencies
unigramTable <- data.frame(table(UnigramTokenizer))
bigramTable <- data.frame(table(BigramTokenizer))
trigramTable <- data.frame(table(TrigramTokenizer))

unigramTable <- unigramTable[order(unigramTable$Freq,decreasing = TRUE),]
bigramTable <- bigramTable[order(bigramTable$Freq,decreasing = TRUE),]
trigramTable <- trigramTable[order(trigramTable$Freq,decreasing = TRUE),]

library(ggplot2)

g1 <- ggplot(unigramTable[1:10,], aes(x = reorder(UnigramTokenizer, -Freq, sum), y = Freq)) + 
        geom_bar(stat = "identity", fill = "lightcyan2", colour = "blue") + geom_text(aes(label = Freq)) +
        labs(title = "Top 10 Unigrams", x = "Unigrams", y = "Frequency") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
g2 <- ggplot(bigramTable[1:10,], aes(x = reorder(BigramTokenizer, -Freq, sum), y = Freq)) + 
        geom_bar(stat = "identity", fill = "lightcyan2", colour = "blue") + geom_text(aes(label = Freq)) +
        labs(title = "Top 10 Bigrams", x = "Bigrams", y = "Frequency") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
g3 <- ggplot(trigramTable[1:10,], aes(x = reorder(TrigramTokenizer, -Freq, sum), y = Freq)) + 
        geom_bar(stat = "identity", fill = "lightcyan2", colour = "blue") + geom_text(aes(label = Freq)) +
        labs(title = "Top 10 Trigrams", x = "Trigrams", y = "Frequency") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Displays the plots
g1
g2
g3

Next Steps

The next step is to develop a predictive algorithm and use it as the prediction engine of a Shiny app. The algorithm will use an n-gram model with a frequency lookup, similar to the exploratory analysis performed above.
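
As a rough illustration of the intended frequency-lookup approach, the sketch below reuses the bigramTable and trigramTable built above; predictNextWord is a hypothetical helper written for this report, not the final algorithm:

# A minimal backoff-style lookup: try the trigram table first,
# then fall back to the bigram table (illustrative sketch only)
predictNextWord <- function(phrase, bigrams = bigramTable, trigrams = trigramTable) {
    words <- tolower(unlist(strsplit(phrase, "\\s+")))
    n <- length(words)
    if (n >= 2) {
        # Match the last two words against the start of each trigram
        prefix <- paste(words[n - 1], words[n])
        hits <- trigrams[grepl(paste0("^", prefix, " "), trigrams[, 1]), ]
        if (nrow(hits) > 0) {
            best <- as.character(hits[which.max(hits$Freq), 1])
            return(tail(unlist(strsplit(best, " ")), 1))
        }
    }
    # Back off: match the last word against the start of each bigram
    hits <- bigrams[grepl(paste0("^", words[n], " "), bigrams[, 1]), ]
    if (nrow(hits) > 0) {
        best <- as.character(hits[which.max(hits$Freq), 1])
        return(tail(unlist(strsplit(best, " ")), 1))
    }
    NA_character_
}

predictNextWord("happy mothers") # Returns the most frequent continuation found in the sample, if any

The final model will need smoothing and a much larger training sample than the 1% used here, but the lookup structure will follow this pattern.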