The current project focuses on building a predictive language model based on data from HC Corpora. Exploratory analyses were conducted on blog, news, and twitter feeds in the English language. The data can be downloaded from here. The current report summarizes the preliminary exploration of the data. Because blogs, news, and tweets have very different linguistic styles, exploratory analysis is done separately for each corpus to better understand it. The data may be combined later for predictive modeling. For brevity, most code is not shown, or is shown only for the blog corpus as an example. The complete code can be found here.
Load libraries
# Import libraries
libs <- c('tm', 'ggplot2', 'openNLP', 'RWeka', 'slam', 'knitr')
lapply(libs, require, character.only = TRUE)
## Loading required package: tm
## Loading required package: NLP
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
##
## Loading required package: openNLP
## Loading required package: RWeka
## Loading required package: slam
## Loading required package: knitr
# read file
con1 <- file("~/Online_Classes/data_science_coursera/capstone/swiftkey_train_data/en_US/en_US.blogs.txt", "r")
# con1 <- file("~/Documents/data_science_coursera/capstone/swiftkey_train_data/en_US/en_US.blogs.txt", "r")
blog <- readLines(con1)
close(con1)
format(object.size(blog), units="MB")
length(blog)
## Warning in readLines(con2): incomplete final line found on
## '~/Online_Classes/data_science_coursera/capstone/swiftkey_train_data/en_US/en_US.news.txt'
The blog, news, and twitter corpora have sizes of 248.5 Mb, 19.2 Mb, and 301.4 Mb, and contain 899288, 77259, and 2360148 lines, respectively. Note that the news data has unexpectedly few lines. Running the code on a Mac and the command line wc -l en_US.news.txt both confirm that the actual number of lines is 1010242. The discrepancy is likely caused by the Windows OS mishandling some characters/lines.
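One common workaround (shown here only as a sketch and not used in this report) is to open the connection in binary mode and skip embedded nul characters when reading:
# Sketch: read the news file via a binary connection and skip embedded nuls,
# which otherwise truncate the file on some platforms
con2 <- file("~/Online_Classes/data_science_coursera/capstone/swiftkey_train_data/en_US/en_US.news.txt", "rb")
news <- readLines(con2, skipNul = TRUE)
close(con2)
length(news) # should be much closer to the 1010242 lines reported by wc -l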
Brief summaries of the entries in each corpus are given by simply counting characters.
blog
summary(nchar(blog))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 47.0 157.0 231.7 331.0 40840.0
news
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 111 186 203 270 5760
twitter
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 37.0 64.0 68.8 100.0 213.0
The results are largely unsurprising, with blog and news entries being much longer than tweets. However, the longest twitter entry exceeds the known maximum of 140 characters. Running the same code on a Mac, however, yields the expected maximum of 140 characters. Again, this suggests that the Windows OS misinterprets some special characters; the same may be true on the Mac for other characters.
The warning messages while reading the files and the unexpected results are likely produced by unprintable characters such as control characters. Therefore, these characters are removed using the command line: tr -cd '\11\12\15\40-\176' < en_US.blogs.txt > en_US.blogs.filt.txt.
Sampling the data: Given the large size of the original files, the remaining analyses are done on a sample of 10% of the original data. Larger samples may be used for later predictions. Sampling can be achieved by creating a vector of uniformly distributed random numbers of the length of a corpus, e.g. runif(length(blog)), and keeping the lines whose values fall below 0.1, as sketched below.
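For example, a minimal sketch of the sampling step (the seed value is an arbitrary assumption):
# Sketch: keep roughly 10% of the blog lines; the seed is an arbitrary choice
set.seed(123)
keep <- runif(length(blog)) <= 0.1
blog <- blog[keep]
length(blog) # roughly 10% of the original 899288 lines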
Non-alphanumeric characters except apostrophes are replaced with a space " ". While there is a removePunctuation function in the tm package, simply removing punctuation can sometimes lead to non-words, e.g. coursera.org becomes courseraorg. Apostrophes are retained for the later removal of stop words like "don't".
blog2 <- gsub("[^[:alnum:][:space:]']", " ", blog)
The sampled character objects are converted into so-called volatile corpora for further processing.
# system runtime in seconds is recorded to estimate scalability
system.time(blogs_sample <- VCorpus(VectorSource(blog2))) #19.44 sec
## [1] "blog.sample10 is "
## <<VCorpus (documents: 90027, metadata (corpus/indexed): 0/0)>>
Clean Corpus: A function is created, modeled after the wonderful video by Timothy DAuria, to remove certain elements of the corpus, i.e. extra white space, punctuation, numbers, stop words, and sparse terms, to convert text to lower case, and to perform stemming, if desired. Some of these procedures sacrifice accuracy for "cleaner" and easier-to-handle data. For example, converting all characters to lower case makes "windows" and "Windows" (the operating system) indistinguishable. Code for the function can be found here.
Furthermore, simple profanity filtering is done by removing a modified list of George Carlin's seven dirty words. The amount of profanity in the corpus is relatively small, e.g. 0.57% in the blog data.
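A partial sketch of what such a cleaning function might look like, built from standard tm transformations; the actual cleanCorpus implementation is linked above, and profanity_words stands in for the modified dirty-words list:
# Partial sketch of a corpus-cleaning function using standard tm transformations
cleanCorpus <- function(corpus, remove_numbers = FALSE, remove_stopwords = FALSE, do_stem = FALSE) {
  profanity_words <- c("...") # placeholder for the modified seven-dirty-words list
  corpus <- tm_map(corpus, content_transformer(tolower))
  if (remove_numbers) corpus <- tm_map(corpus, removeNumbers)
  if (remove_stopwords) corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords, profanity_words) # simple profanity filtering
  if (do_stem) corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}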
The cleaned corpus is then converted to a Term-Document Matrix, which describes the frequency of terms (words/phrases) occurring in a collection of documents (our corpora).
system.time(bs_clean <- cleanCorpus(blogs_sample, remove_numbers = T)) # 14.95 sec
system.time(bs_tdm <- TermDocumentMatrix(bs_clean)) # 49.67 sec
Certain words are very common in the English language, but they are often function words that do not convey much content, e.g. "the". Such words are often considered stop words and are filtered out before further analysis of a corpus. Otherwise, frequency analysis of the corpus is dominated by stop words, as shown below in the top 20 most frequent words in the blog data. Thus, in the current report, for a better assessment of content, stop words from this list are removed. However, for building a predictive model later, the stop words should probably be kept.
## the and that for you with was this have but
## 187571 110089 46443 36538 29470 29028 28267 25800 22230 20688
## are not from all they one about will what out
## 19355 17555 14886 14759 13955 12851 11516 11467 11252 11227
system.time(bs_clean <- cleanCorpus(blogs_sample, remove_numbers = T, remove_stopwords = T)) # 86.04 sec
system.time(bs_tdm <- TermDocumentMatrix(bs_clean)) # 42.25 sec
Build n-grams: n-grams are contiguous sequences of terms/words in a given text, e.g. unigrams are single terms. Note the content-related difference in the top 20 most frequent words below after removing the stop words.
bs_uni_row_total <- row_sums(bs_tdm)
summary(bs_uni_row_total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 16.63 5.00 9199.00
bs_uni_sort <- sort(bs_uni_row_total, decreasing = T)
# head(bs_uni_sort, 20)
With the stop words there are a total of 2924841 word instances; without them there are 1520999.
Here are the terms with the highest frequency in news.
Compare them with the ones in tweets.
Compare word frequency distributions in blogs, news, and tweets: From the summary data above, the frequency distributions of terms are extremely skewed. Frequencies are log10-transformed for plotting. In the plot below, red represents blogs, blue represents news, and green represents tweets. The distributions are quite similar, as they largely overlap. However, the twitter data have more words that occur only once, probably due to stylized words like "aahhh" and "ahahahaha".
(Figure: overlaid histograms of log10 term frequencies for the blog, news, and twitter samples.)
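A minimal sketch of how such an overlaid plot might be produced with ggplot2; ns_uni_row_total and ts_uni_row_total are assumed analogues of bs_uni_row_total for the news and twitter samples:
# Sketch: overlay log10 term-frequency distributions for the three corpora
# (ns_uni_row_total and ts_uni_row_total are assumed news/twitter analogues)
freq_df <- rbind(data.frame(corpus = "blog",    log10_freq = log10(bs_uni_row_total)),
                 data.frame(corpus = "news",    log10_freq = log10(ns_uni_row_total)),
                 data.frame(corpus = "twitter", log10_freq = log10(ts_uni_row_total)))
ggplot(freq_df, aes(x = log10_freq, fill = corpus)) +
  geom_histogram(alpha = 0.4, position = "identity") +
  scale_fill_manual(values = c(blog = "red", news = "blue", twitter = "green")) +
  labs(x = "log10(term frequency)", y = "number of unique terms")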
Bi-grams and tri-grams: Next, bi-grams and tri-grams are built. Here are the top 20 most frequent bi-grams and tri-grams sampled from the blog data.
# set the default number of threads to use
options(mc.cores=1) # needed for n-gram function, works better with single thread
# create bigrams
# !!! consider experimenting with the delimiters for future analysis
# default should be ' \r\n\t.,;:'"()?!'
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
system.time(bs_bi_tdm <- TermDocumentMatrix(bs_clean, control = list(tokenize = BigramTokenizer))) # 141.43 sec
# trigram
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
system.time(bs_tri_tdm <- TermDocumentMatrix(bs_clean, control = list(tokenize = TrigramTokenizer))) # 147.46 sec
# str(bs_tri_tdm)
bs_bi_row_total <- row_sums(bs_bi_tdm)
bs_bi_sort <- sort(bs_bi_row_total, decreasing = T)
head(bs_bi_sort, 20)
## years ago united states long time high school weeks ago
## 522 294 272 262 165
## ice cream spend time th century time time day day
## 161 153 147 147 137
## couple weeks long term pretty good good thing past years
## 132 131 127 126 126
## south africa time year back home black white social media
## 126 122 121 121 121
bs_tri_row_total <- row_sums(bs_tri_tdm)
bs_tri_sort <- sort(bs_tri_row_total, decreasing = T)
head(bs_tri_sort, 20)
## incorporated item pp couple weeks ago
## 43 39
## amazon services llc llc amazon eu
## 38 38
## services llc amazon world war ii
## 38 32
## bmw service center service center california
## 30 30
## love love love medium high heat
## 28 25
## illinois incorporated item level mp cost
## 24 24
## couple years ago spent lot time
## 23 23
## chicago illinois incorporated long story short
## 22 22
## preheat oven degrees high blood pressure
## 22 21
## spend lot time amazon uk amazon
## 21 20
# total word instances
total <- sum(bs_uni_row_total)
Efficiency and accuracy assessments: The number of unique words needed to cover a certain proportion of all word instances is estimated. The total number of word instances in 10% of the cleaned blog corpus is 1520999.
To cover 50% of the word instances, 1371 unique words are needed. To cover 90%, 17287 unique words are needed.
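These counts can be derived from the sorted unigram frequencies, for example along the lines of the following sketch:
# Sketch: cumulative coverage of word instances by the most frequent unique words
coverage <- cumsum(bs_uni_sort) / total
min(which(coverage >= 0.5)) # unique words needed to cover 50% of instances
min(which(coverage >= 0.9)) # unique words needed to cover 90% of instances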
Build predictive language model: The general strategy is as follows.
1. Rebuild the n-gram models, up to 4-grams, pooling the blog, news, and twitter data and including stop words. The probability of a term is modeled under the Markov chain assumption that the occurrence of a term depends on the preceding terms.
2. Remove sparse terms, with a threshold to be determined.
3. Use a [backoff strategy](http://en.wikipedia.org/wiki/Katz's_back-off_model) for prediction, so that if the probability of a quad-gram is very low, the tri-gram is used instead, and so on; a rough sketch of such a backoff lookup is given below.
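A minimal, hypothetical sketch of a frequency-based backoff lookup; ngram_freq is an assumed list of named count vectors (one element per n-gram order), not an object built in this report:
# Hypothetical sketch of a simple frequency-based backoff lookup.
# ngram_freq is assumed to be a list keyed by n-gram order, e.g. ngram_freq[["4"]],
# where each element is a named vector of n-gram counts (names are space-separated terms).
predict_next <- function(prefix_words, ngram_freq) {
  for (n in rev(seq_len(min(length(prefix_words), 3))) + 1) {
    prefix <- paste(tail(prefix_words, n - 1), collapse = " ")
    cand <- ngram_freq[[as.character(n)]]
    hits <- cand[startsWith(names(cand), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(sort(hits, decreasing = TRUE))[1]
      return(tail(strsplit(best, " ")[[1]], 1)) # last word of the best-matching n-gram
    }
  }
  names(sort(ngram_freq[["1"]], decreasing = TRUE))[1] # back off to the most frequent unigram
}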
Create interactive Shiny App: This interactive web app will take text input and return the predicted upcoming terms. It is still being decided whether there will be a separate predictor for each of the blog, news, and twitter feeds, and whether the user will have the option to output unigram, bigram, or trigram predictions.