Summary

The current project focuses on building a predictive language model based on data from HC Corpora. Exploratory analyses were conducted on blog, news, and twitter feeds in the English language. The data can be downloaded from here. This report summarizes the preliminary exploration of the data. Because blogs, news, and tweets have very different linguistic styles, the exploratory analysis is done separately for each corpus to better understand it. The data may be combined later for predictive modeling. For brevity, most code is not shown, or is shown only for the blog corpus as an example. The complete code can be found here.

Load libraries

# Import libraries
libs <- c('tm', 'ggplot2', 'openNLP', 'RWeka', 'slam', 'knitr')
lapply(libs, require, character.only = TRUE)
## Loading required package: tm
## Loading required package: NLP
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
## 
## Loading required package: openNLP
## Loading required package: RWeka
## Loading required package: slam
## Loading required package: knitr

Task 0: Understanding the Problem

# read file
con1 <- file("~/Online_Classes/data_science_coursera/capstone/swiftkey_train_data/en_US/en_US.blogs.txt", "r")
# con1 <- file("~/Documents/data_science_coursera/capstone/swiftkey_train_data/en_US/en_US.blogs.txt", "r")
blog <- readLines(con1)
close(con1)
format(object.size(blog), units="MB")
length(blog)
## Warning in readLines(con2): incomplete final line found on
## '~/Online_Classes/data_science_coursera/capstone/swiftkey_train_data/en_US/en_US.news.txt'

The blog, news, and twitter corpora have sizes of 248.5 Mb, 19.2 Mb, and 301.4 Mb, and contain 899288, 77259, and 2360148 lines, respectively. Note that the news data has unexpectedly few lines. Running the code on a Mac and the command line wc -l en_US.news.txt both confirm that the actual number of lines is 1010242. The discrepancy is likely caused by the Windows OS mishandling some characters/lines.

Brief summaries of the entries in each corpus are given below by simply counting characters.

blog

summary(nchar(blog))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    47.0   157.0   231.7   331.0 40840.0

news

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2     111     186     203     270    5760

twitter

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.8   100.0   213.0

The results are largely unsurprising, with blogs and news having much longer entries than tweets. However, the longest twitter entry exceeds the known maximum of 140 characters. Running the same code on a Mac yields the expected 140 characters. Again, this suggests that the Windows OS misinterprets some special characters; the same may be true of the Mac for other characters.

Task 1: Data Acquisition and Cleaning

The warning messages while reading the files and the unexpected results above are likely produced by unprintable characters such as control characters. Therefore, these characters are removed using the command line: tr -cd '\11\12\15\40-\176' < en_US.blogs.txt > en_US.blogs.filt.txt.
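
For reference, a rough R-level equivalent of this filter would look like the sketch below. This is an assumption for illustration; the report actually used the tr command above.

# Keep only tab, LF, CR, and printable ASCII bytes (assumed equivalent of the tr command)
blog <- gsub("[^\x20-\x7E\t\r\n]", "", blog, useBytes = TRUE)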

Sampling the data: Given the large size of the original files, the remaining analyses are done on a sample of 10% of the original data. Larger samples may be used for later predictions. The sampling can be achieved by creating a vector of uniformly distributed random numbers of the length of a corpus, e.g. runif(length(blog)), as sketched below.
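
A minimal sketch of the 10% sampling step (the seed value is an assumption, chosen only for reproducibility):

# Keep roughly 10% of the lines in the blog corpus
set.seed(1234)
blog <- blog[runif(length(blog)) <= 0.1]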

Non-alphanumeric characters other than apostrophes are replaced with a space (" "). While the tm package provides a removePunctuation function, simply removing punctuation can sometimes produce non-words, e.g. coursera.org becomes courseraorg. Apostrophes are retained for the later removal of stop words like "don't".

blog2 <- gsub("[^[:alnum:][:space:]']", " ", blog)

The sampled character objects are converted into so-called volatile corpora for further processing.

# system runtime in seconds is recorded to estimate scalability
system.time(blogs_sample <- VCorpus(VectorSource(blog2))) #19.44 sec
## [1] "blog.sample10 is "
## <<VCorpus (documents: 90027, metadata (corpus/indexed): 0/0)>>

Clean Corpus: A cleaning function, modeled after the wonderful video by Timothy D'Auria, is created to remove certain elements of the corpus, i.e. extra white space, punctuation, numbers, stop words, and sparse terms, to convert text to lower case, and to perform stemming if desired. Some of these procedures sacrifice accuracy for "cleaner" and easier-to-handle data. For example, converting all characters to lower case makes "windows" and "Windows" (the operating system) indistinguishable. Code for the function can be found here.
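
A minimal sketch of such a function, built from standard tm transformations; the actual cleanCorpus used in this report is linked above and may differ.

# Sketch of a corpus-cleaning function using standard tm transformations
cleanCorpusSketch <- function(corpus, remove_numbers = FALSE, remove_stopwords = FALSE) {
    corpus <- tm_map(corpus, content_transformer(tolower))   # convert to lower case
    corpus <- tm_map(corpus, removePunctuation)               # drop remaining punctuation
    if (remove_numbers) corpus <- tm_map(corpus, removeNumbers)
    if (remove_stopwords) corpus <- tm_map(corpus, removeWords, stopwords("english"))
    tm_map(corpus, stripWhitespace)                            # collapse extra white space
}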

Furthermore, simple profanity filtering is done by removing a modified list of George Carlin's seven dirty words. The amount of profanity in the corpus is relatively small, e.g. 0.57 percent in the blog data.
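
The filtering step itself can be as simple as the following sketch, where profanity_words is an assumed character vector holding the modified word list.

# Remove profanity from the corpus (profanity_words is an assumed character vector)
blogs_sample <- tm_map(blogs_sample, removeWords, profanity_words)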

The cleaned corpus is then converted to a term-document matrix, which describes the frequency of terms (words/phrases) occurring in a collection of documents (here, our corpora).

system.time(bs_clean <- cleanCorpus(blogs_sample, remove_numbers = T)) # 14.95 sec
system.time(bs_tdm <- TermDocumentMatrix(bs_clean)) # 49.67 sec

Certain words are very common in the English language but are often functional words that do not convey much content, e.g. "the". Such words are commonly treated as stop words to be filtered out before further analysis of a corpus. Otherwise, frequency analysis of the corpus is dominated by stop words, as shown below in the top 20 most frequent words in the blog data. Thus, in the current report, stop words from this list are removed for a better assessment of content. However, for building the predictive model later, the stop words should probably be kept.

##    the    and   that    for    you   with    was   this   have    but 
## 187571 110089  46443  36538  29470  29028  28267  25800  22230  20688 
##    are    not   from    all   they    one  about   will   what    out 
##  19355  17555  14886  14759  13955  12851  11516  11467  11252  11227
system.time(bs_clean <- cleanCorpus(blogs_sample, remove_numbers = T, remove_stopwords = T)) # 86.04 sec
system.time(bs_tdm <- TermDocumentMatrix(bs_clean)) # 42.25 sec

Task 2: Exploratory Data Analysis

Build n-grams: n-grams are contiguous sequences of terms/words in a given text; e.g. unigrams are single terms. Note the content-related difference in the top 20 most frequent words below after removing the stop words.

bs_uni_row_total <- row_sums(bs_tdm)
summary(bs_uni_row_total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00   16.63    5.00 9199.00
bs_uni_sort <- sort(bs_uni_row_total, decreasing = T)
# head(bs_uni_sort, 20) 

With the stop words, there are a total of 2924841 word instances; without them, there are 1520999.

Here are the terms with the highest frequency in the news data.

Compare them with the ones in the tweets.

Compare word frequency distributions in blogs, news, and tweets: As the summary data above show, the frequency distributions of terms are extremely skewed, so frequencies are log10 transformed for plotting. In the plot below, red represents blogs, blue represents news, and green represents tweets. The distributions are quite similar and largely overlap. However, the twitter data have more words that occur only once, probably due to stylized words like "aahhh" and "ahahahaha".
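
A minimal sketch of how the overlaid histograms can be produced with ggplot2; news_uni_row_total and tw_uni_row_total are assumed to exist, built the same way as bs_uni_row_total.

# Overlay the log10-transformed term frequency distributions of the three corpora
ggplot() +
    geom_histogram(aes(x = log10(bs_uni_row_total)), fill = "red", alpha = 0.3) +
    geom_histogram(aes(x = log10(news_uni_row_total)), fill = "blue", alpha = 0.3) +
    geom_histogram(aes(x = log10(tw_uni_row_total)), fill = "green", alpha = 0.3) +
    xlab("log10(term frequency)")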

[Figure: overlaid log10 term-frequency histograms for blogs (red), news (blue), and tweets (green)]

Bi-grams and tri-grams: Next, bi-grams and tri-grams are built. Here are the top 20 most frequent bi-grams and tri-grams sampled from the blog data.

# set the default number of threads to use
options(mc.cores=1) # needed for n-gram function, works better with single thread

# create bigrams
# !!! consider experimenting with the delimiters for future analysis
# default should be ' \r\n\t.,;:'"()?!'
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
system.time(bs_bi_tdm <- TermDocumentMatrix(bs_clean, control = list(tokenize = BigramTokenizer))) # 141.43 sec

# trigram
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
system.time(bs_tri_tdm <- TermDocumentMatrix(bs_clean, control = list(tokenize = TrigramTokenizer))) # 147.46 sec
# str(bs_tri_tdm)
bs_bi_row_total <- row_sums(bs_bi_tdm)
bs_bi_sort <- sort(bs_bi_row_total, decreasing = T)
head(bs_bi_sort, 20)
##     years ago united states     long time   high school     weeks ago 
##           522           294           272           262           165 
##     ice cream    spend time    th century     time time       day day 
##           161           153           147           147           137 
##  couple weeks     long term   pretty good    good thing    past years 
##           132           131           127           126           126 
##  south africa     time year     back home   black white  social media 
##           126           122           121           121           121
bs_tri_row_total <- row_sums(bs_tri_tdm)
bs_tri_sort <- sort(bs_tri_row_total, decreasing = T)
head(bs_tri_sort, 20)
##          incorporated item pp              couple weeks ago 
##                            43                            39 
##           amazon services llc                 llc amazon eu 
##                            38                            38 
##           services llc amazon                  world war ii 
##                            38                            32 
##            bmw service center     service center california 
##                            30                            30 
##                love love love              medium high heat 
##                            28                            25 
##    illinois incorporated item                 level mp cost 
##                            24                            24 
##              couple years ago                spent lot time 
##                            23                            23 
## chicago illinois incorporated              long story short 
##                            22                            22 
##          preheat oven degrees           high blood pressure 
##                            22                            21 
##                spend lot time              amazon uk amazon 
##                            21                            20
# total word instances 
total <- sum(bs_uni_row_total)

Efficiency and accuracy assessments: The number of unique words needed to cover a given proportion of all word instances is estimated. The total number of word instances in 10% of the cleaned blog corpus is 1520999.

To cover 50% of the word instances, 1371 unique words are needed. To cover 90%, 17287 unique words are needed.
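
A minimal sketch of this coverage calculation, based on the sorted unigram counts from above:

# Cumulative coverage of word instances by the most frequent unique words
coverage <- cumsum(bs_uni_sort) / sum(bs_uni_sort)
min(which(coverage >= 0.5))  # unique words needed to cover 50% of instances
min(which(coverage >= 0.9))  # unique words needed to cover 90% of instances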

Future Tasks

Build predictive language model: The general strategy is as follows; a sketch of the intended backoff lookup is given after the list. 1. Rebuild the n-gram models, up to 4-grams, pooling the blog, news, and twitter data and including stop words. The probability of a term is modeled under the Markov chain assumption that the occurrence of a term depends on the preceding terms. 2. Remove sparse terms, with the threshold to be determined. 3. Use a backoff strategy (http://en.wikipedia.org/wiki/Katz's_back-off_model) so that if the probability of a quad-gram is very low, the tri-gram is used to predict, and so on.
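
A minimal sketch of such a backoff lookup; the frequency tables quad_freq, tri_freq, and bi_freq are assumptions, built like bs_bi_sort as named count vectors keyed by space-separated n-grams.

# Back off from quad-grams to bi-grams until a match for the typed prefix is found
predict_next <- function(prefix, quad_freq, tri_freq, bi_freq) {
    tables <- list(quad_freq, tri_freq, bi_freq)
    context <- c(3, 2, 1)  # number of preceding words used by each table
    for (i in seq_along(tables)) {
        key <- paste(tail(prefix, context[i]), collapse = " ")
        hits <- tables[[i]][grepl(paste0("^", key, " "), names(tables[[i]]))]
        if (length(hits) > 0) {
            best <- names(hits)[which.max(hits)]       # most frequent matching n-gram
            return(tail(strsplit(best, " ")[[1]], 1))  # last word of that n-gram
        }
    }
    NA_character_  # no match at any n-gram order
}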

Create interactive Shiny App: This interactive web app will take text input and return the predicted upcoming terms. It is still being considered whether there will be a separate predictor for each of blogs, news, and twitter feeds, and whether the user will have a choice of unigram, bigram, or trigram output.