Introduction

The goal of this project is to build a web application that predicts the next word as text is being entered. The predictive model will be based on an N-gram language model, which uses the immediately preceding N-1 words to predict the most likely next words. In this report, I will explore the data that will be used to train the model. Specifically, I will perform data pre-processing, n-gram generation, and n-gram frequency characterization.

Data Loading

The English version of the source data, provided by the course team, includes three files, en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt, which were originally acquired by crawling distinct web sites. In this analysis, I will keep the three data sets separate and compare their results at the end.

### loading required libraries
require(qdap)       ### for data pre-processing and sentence detection
require(tm)         ### for data pre-processing and other text mining tasks
require(RWeka)      ### for n-gram generation
require(wordcloud)  ### for wordcloud visualization
require(ggplot2)    ### for plotting
### loading data
setwd(wd_path)
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
tweets <- readLines("en_US.twitter.txt")
b_line_cnt <- length(blogs)
n_line_cnt <- length(news)
t_line_cnt <- length(tweets)

Because the text processing R libraries all perform poorly on very large data sets on a typical personal computer, I will use only 20000 lines from each of the three sets to generate this report.

### sampling 20000 lines from the middle part of each set
b_sample <- blogs [450001:470000]
n_sample <- news [500001:520000]
t_sample <- tweets [1180001:1200000]

Sentence Detection

The beginning of a sentence is important in building an N-gram language model. I will first split lines containing multiple sentences into separate lines, one sentence per line, using the qdap package.

b_sent <- sent_detect(b_sample, language = "en", model = NULL)
n_sent <- sent_detect(n_sample, language = "en", model = NULL)
### Tweets are full of incomplete sentences and excessive use of "!", ">", and "<".
### The two additional pre-processing steps below are applied to Tweets before sentence detection.
t_sample <- gsub("[<>]+","",t_sample)   # remove "<" and ">", because they tend to combine lines.
t_sample <- gsub("!{2,}","!",t_sample)  # collapse repeated "!", because "!!!" would produce 3 lines.
t_sent <- sent_detect(t_sample, incomplete.sub = TRUE, language = "en", model = NULL)
b_sent_cnt <- length(b_sent)
n_sent_cnt <- length(n_sent)
t_sent_cnt <- length(t_sent)

Data Pre-Processing

I will use the tm package to perform the following pre-processing: convert everything to lower case, remove profanity, remove numbers, remove punctuation, and remove extra white space. Stemming and stop-word removal will not be done here, because the prediction task needs complete words and common words.

### In order to use tm package functions, we need to convert vectors to corpora.
b_corpus <- VCorpus(VectorSource(b_sent))
n_corpus <- VCorpus(VectorSource(n_sent))
t_corpus <- VCorpus(VectorSource(t_sent))
### change to all lower cases
b_corpus <- tm_map(b_corpus, content_transformer(tolower))
n_corpus <- tm_map(n_corpus, content_transformer(tolower))
t_corpus <- tm_map(t_corpus, content_transformer(tolower))
### The profanity file contains 452 profanity words, based on the Google bad words list.
profanity <- readLines(profanity_file_path)
b_corpus <- tm_map(b_corpus, removeWords, profanity)
n_corpus <- tm_map(n_corpus, removeWords, profanity)
t_corpus <- tm_map(t_corpus, removeWords, profanity)
### remove numbers
b_corpus <- tm_map(b_corpus, removeNumbers)
n_corpus <- tm_map(n_corpus, removeNumbers)
t_corpus <- tm_map(t_corpus, removeNumbers)
### remove punctuation
b_corpus <- tm_map(b_corpus, removePunctuation)
n_corpus <- tm_map(n_corpus, removePunctuation)
t_corpus <- tm_map(t_corpus, removePunctuation)
### remove extra white spaces
b_corpus <- tm_map(b_corpus, stripWhitespace)
n_corpus <- tm_map(n_corpus, stripWhitespace)
t_corpus <- tm_map(t_corpus, stripWhitespace)
### Convert the corpora back to data frames so that other R packages can continue the processing pipeline.
b_corpus_df <- data.frame(sentence=unlist(sapply(b_corpus, '[',"content")),stringsAsFactors=F)
n_corpus_df <- data.frame(sentence=unlist(sapply(n_corpus, '[',"content")),stringsAsFactors=F)
t_corpus_df <- data.frame(sentence=unlist(sapply(t_corpus, '[',"content")),stringsAsFactors=F)

N-Gram Generation and Characterization

The RWeka package will be used to generate n-grams.

delimiter_set <- " \\t\\r\\n.!?,;\"()"
### 1-gram vector 
b_one_gram <- NGramTokenizer(b_corpus_df, Weka_control(min=1, max=1, delimiters = delimiter_set))
n_one_gram <- NGramTokenizer(n_corpus_df, Weka_control(min=1, max=1, delimiters = delimiter_set))
t_one_gram <- NGramTokenizer(t_corpus_df, Weka_control(min=1, max=1, delimiters = delimiter_set))
b_one_gram_cnt <- length(b_one_gram)
n_one_gram_cnt <- length(n_one_gram)
t_one_gram_cnt <- length(t_one_gram)
### generate word-frequency tables in descending frequency order
b_one_gram_wf <- data.frame(table(b_one_gram))
n_one_gram_wf <- data.frame(table(n_one_gram))
t_one_gram_wf <- data.frame(table(t_one_gram))
b_one_gram_wf <- b_one_gram_wf[order(b_one_gram_wf$Freq,decreasing=TRUE),]
n_one_gram_wf <- n_one_gram_wf[order(n_one_gram_wf$Freq,decreasing=TRUE),]
t_one_gram_wf <- t_one_gram_wf[order(t_one_gram_wf$Freq,decreasing=TRUE),]
colnames(b_one_gram_wf) <- c("bWord", "bFreq")
colnames(n_one_gram_wf) <- c("nWord", "nFreq")
colnames(t_one_gram_wf) <- c("tWord", "tFreq")
b_word_cnt <- nrow(b_one_gram_wf)
n_word_cnt <- nrow(n_one_gram_wf)
t_word_cnt <- nrow(t_one_gram_wf)

Now, let’s summarize and compare the numbers from the three data sources.

##                                        Blogs      News     Tweets
## lines in data file                  899288.0 1010242.0 2360148.00
## % of lines used by 20K samples           2.2       2.0       0.85
## sentences in sample                  49673.0   42846.0   17010.00
## average sentences/line in sample         2.5       2.1       0.85
## 1-gram tokens in sample             784920.0  654869.0  149549.00
## average tokens/sentence in sample       15.8      15.3       8.79
## Unique words (vocabulary) in sample  47039.0   45081.0   16953.00
## average frequency/word                  16.7      14.5       8.82

The numbers in the summary table show that Blogs and News are very similar in average sentences per line, average tokens per sentence, vocabulary size, and average frequency per word. Tweets, however, are quite different: they contain many informal expressions, with a far smaller vocabulary, much shorter and often incomplete sentences, and a much lower average frequency per word. Even with the additional pre-processing applied before sentence detection, about 3000 of the 20000 lines still got combined into other lines to form sentences. These observations suggest that the Tweets data set may not be a good source for training the language model.

Word-frequency plots for the three data sets

par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(b_one_gram_wf[1:40,]$bFreq,
        col="blue",
        main="Top 40 Most Frequently Used Words in Blogs Samples",
        ylab="Frequency",
        names.arg=b_one_gram_wf$bWord[1:40])

par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(n_one_gram_wf[1:40,]$nFreq,
        col="blue",
        main="Top 40 Most Frequently Used Words in News Samples",
        ylab="Frequency",
        names.arg=n_one_gram_wf$nWord[1:40])

par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(t_one_gram_wf[1:40,]$tFreq,
        col="blue",
        main="Top 40 Most Frequently Used Words in Tweets Samples",
        ylab="Frequency",
        names.arg=t_one_gram_wf$tWord[1:40])

The top 40 words and their frequency distribution pattern are very similar between Blogs and News. One interesting difference is that “said” is the 12th most frequently used word in News, but not in the top 40 in Blogs. This is understandable because news writers often quote what others said, whereas blog writers usually express their own opinions. In contrast, the top 40 words in Tweets and their frequency distribution are significantly different from those of Blogs and News. For example, “good”, “love”, and “like” are among the top 40 in Tweets, but not in Blogs or News. This suggests that many tweets express personal positive sentiments. Another interesting difference is that “I” ranks much higher in Tweets and Blogs than in News, suggesting that Tweets and Blogs are written in the first person more often than News.

Words needed for different levels of coverage

Next, I will use the Blogs data to determine the number of words needed to cover 50%, 90%, 95%, and 98% of word usages (i.e., 1-gram tokens).

b_one_gram_wf_pct <- cbind(b_one_gram_wf,b_one_gram_wf$bFreq/sum(b_one_gram_wf$bFreq))
colnames(b_one_gram_wf_pct) <- c("bWord", "bFreq", "bPct_Freq")
### cumulative usage coverage: the number of top words whose cumulative frequency first reaches each threshold
cum_pct <- cumsum(b_one_gram_wf_pct$bPct_Freq)
words_counted_50 <- which(cum_pct >= 0.50)[1]
words_counted_90 <- which(cum_pct >= 0.90)[1]
words_counted_95 <- which(cum_pct >= 0.95)[1]
words_counted_98 <- which(cum_pct >= 0.98)[1]
words_counted_all = c(nrow(b_one_gram_wf), words_counted_50, words_counted_90, words_counted_95, words_counted_98)
words_counted_all_pct = c(100, words_counted_50*100/nrow(b_one_gram_wf), words_counted_90*100/nrow(b_one_gram_wf), words_counted_95*100/nrow(b_one_gram_wf), words_counted_98*100/nrow(b_one_gram_wf))
m2 <- cbind(words_counted_all,words_counted_all_pct)
colnames(m2) <- c("Vocabulary", "% Vocabulary")
rownames(m2) <- c("Total","50% Usage Coverage","90% Usage Coverage","95% Usage Coverage","98% Usage Coverage")
m2
##                    Vocabulary % Vocabulary
## Total                   47039       100.00
## 50% Usage Coverage        108         0.23
## 90% Usage Coverage       6619        14.07
## 95% Usage Coverage      15114        32.13
## 98% Usage Coverage      31341        66.63

It is striking that only about one-third (32%) of the vocabulary covers 95% of the usage, which means that the great majority of the vocabulary is used very infrequently. This pattern of distribution is typically described as a “long tail”, and word usage has a very long tail indeed. Even to cover 98% of the usage, we need only two-thirds (66.6%) of the vocabulary. This knowledge may be utilized to build a less sparse frequency matrix for the language model.
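
As a rough sketch of how this could be exploited, the code below keeps only the words needed for 98% coverage and maps every other word in the Blogs sentences to a single placeholder token. The token name "unk" and the helper replace_rare are my own illustrative choices, not part of the processing above.

### a minimal sketch (not run above): keep the top words covering 98% of usage
### and map all remaining words to a hypothetical placeholder token "unk"
b_keep_words <- as.character(b_one_gram_wf_pct$bWord[1:words_counted_98])
replace_rare <- function(sentence, keep) {
   tokens <- unlist(strsplit(sentence, " "))
   tokens[!(tokens %in% keep)] <- "unk"
   paste(tokens, collapse = " ")
}
b_corpus_df_pruned <- data.frame(
   sentence = sapply(b_corpus_df$sentence, replace_rare, keep = b_keep_words),
   stringsAsFactors = FALSE)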

2-gram, 3-gram, and 4-gram generation for Blogs only

Next, I will do 2-gram, 3-gram, and 4-gram generation and characterization using the Blogs sample.

b_two_gram <- NGramTokenizer(b_corpus_df, Weka_control(min=2, max=2, delimiters = delimiter_set))
b_two_gram_wf <- data.frame(table(b_two_gram))
b_two_gram_wf <- b_two_gram_wf[order(b_two_gram_wf$Freq,decreasing=TRUE),]
colnames(b_two_gram_wf) <- c("bTwoGram", "bFreq")
par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(b_two_gram_wf[1:40,]$bFreq,
        col="purple",
        main="Top 40 Most Frequently Used 2-Gram in Blogs Samples",
        ylab="Frequency",
        names.arg=b_two_gram_wf$bTwoGram[1:40])

b_three_gram <- NGramTokenizer(b_corpus_df, Weka_control(min=3, max=3, delimiters = delimiter_set))
b_three_gram_wf <- data.frame(table(b_three_gram))
b_three_gram_wf <- b_three_gram_wf[order(b_three_gram_wf$Freq,decreasing=TRUE),]
colnames(b_three_gram_wf) <- c("bThreeGram", "bFreq")
par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(b_three_gram_wf[1:40,]$bFreq,
        col="purple",
        main="Top 40 Most Frequently Used 3-Gram in Blogs Samples",
        ylab="Frequency",
        names.arg=b_three_gram_wf$bThreeGram[1:40])

b_four_gram <- NGramTokenizer(b_corpus_df, Weka_control(min=4, max=4, delimiters = delimiter_set))
b_four_gram_wf <- data.frame(table(b_four_gram))
b_four_gram_wf <- b_four_gram_wf[order(b_four_gram_wf$Freq,decreasing=TRUE),]
colnames(b_four_gram_wf) <- c("bFourGram", "bFreq")
par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(b_four_gram_wf[1:40,]$bFreq,
        col="purple",
        main="Top 40 Most Frequently Used 4-Gram in Blogs Samples",
        ylab="Frequency",
        names.arg=b_four_gram_wf$bFourGram[1:40])

A word cloud is a popular way to visualize the most frequently used words and n-grams, albeit not a particularly informative one.

pal <- brewer.pal(9,"YlOrRd")
pal <- pal[-(1:4)]
wordcloud(b_two_gram_wf[1:30,1], b_two_gram_wf[1:30,2], scale=c(8,.3), min.freq=2,
          random.order=FALSE, rot.per=.15, colors=pal)

pal2 <- brewer.pal(9,"YlGnBu")
pal2 <- pal2[-(1:4)]
wordcloud(b_three_gram_wf[1:30,1], b_three_gram_wf[1:30,2], scale=c(8,.3), min.freq=2,
          random.order=FALSE, rot.per=.15, colors=pal2)

pal3 <- brewer.pal(9,"YlGn")
pal3 <- pal3[-(1:4)]
wordcloud(b_four_gram_wf[1:30,1], b_four_gram_wf[1:30,2], scale=c(8,.3), min.freq=2,
          random.order=FALSE, rot.per=.15, colors=pal3)

Some further thoughts about pre-processing

There are ways to improve the pre-processing steps presented in this report, in the context of building a better next-word prediction language model.

Numbers Handling

I removed all numbers during data pre-processing, but this leaves the model incapable of predicting the word that follows a number. For example, when a user enters “The allowed baggage size is 30”, the model should predict “inches” as the next word. But the model only has “size is inches”, not “is 30 inches”; therefore, the chance of selecting “inches” after “is 30” will not be as high as it should be. To cope with this, we can replace every number with a special token, such as [NUM], so that all numbers are handled uniformly. For example, “Your total cost is [NUM] dollars” can correctly lead to a prediction of “dollars”, whether the cost is 5.99 or 1999.99.
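
A minimal sketch of this idea is shown below, assuming the replacement takes the place of the removeNumbers step in the pipeline above; I use a bracket-free token spelling ("NUM") of my own choosing so that the later removePunctuation step does not strip the brackets, and the uppercase spelling cannot collide with the lower-cased real words.

### a minimal sketch: replace each number with a placeholder token instead of deleting it
### (this would take the place of the removeNumbers step in the pipeline above)
replace_numbers <- function(x) gsub("[0-9]+([.,][0-9]+)*", " NUM ", x)
b_corpus <- tm_map(b_corpus, content_transformer(replace_numbers))
### "the allowed baggage size is 30 inches" becomes "the allowed baggage size is NUM inches"
### after stripWhitespace, so the 3-gram "is NUM inches" is available to the model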

Informal language in Tweets

As mentioned above, the Tweets data set contains a lot of informal, non-standard language. The check_text function in the qdap package may be useful in guiding additional cleaning steps. On the other hand, it seems reasonable to avoid the Tweets data source altogether if the goal is to build a model for formal, standard language prediction.
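
For instance, a quick scan of the raw tweet sample along these lines can flag non-ASCII characters, missing end marks, and similar issues before deciding on further cleaning steps (the exact report depends on the installed qdap version):

### scan the raw tweet sample for potential text problems
t_check <- check_text(t_sample)
t_check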

Text Processing Performance

Both the qdap and tm packages perform poorly on larger data sets. On my personal computer, a function that takes 30 seconds to process 20K lines takes about 2 minutes to process 40K lines, i.e., the run time grows much faster than the data size. Processing an entire file (about 900K lines for Blogs) would take many hours; I had to stop it prematurely after waiting more than 2 hours. During the entire time, neither the CPU nor the memory reached full utilization. One potential way to improve performance is to process partitioned data in parallel and combine the results at the end. The doMC package for Linux and the doParallel package for Windows are worth looking into if one wants to move beyond 2% sampled data.
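
A minimal sketch of this partition-and-combine idea using foreach and doParallel is shown below; the chunk count of 4 and the choice to parallelize the sentence-detection step are my own assumptions for illustration.

### a minimal sketch: split the Blogs sample into chunks, run sentence detection
### on the chunks in parallel, and recombine the results
library(doParallel)
cl <- makeCluster(4)                       ### assumes 4 worker cores; adjust as needed
registerDoParallel(cl)
chunks <- split(b_sample, cut(seq_along(b_sample), 4, labels = FALSE))
b_sent_par <- foreach(chunk = chunks, .combine = c, .packages = "qdap") %dopar% {
   sent_detect(chunk, language = "en", model = NULL)
}
stopCluster(cl)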

Brief plans for creating the prediction algorithm and Shiny app

Plan for the Predictive Model

I will follow Jurafsky & Martin’s article on N-grams to develop the probability matrices. Special attention will be paid to smoothing methods for unknown words and the zero-probability problem in general. I will also use perplexity as a measure to evaluate the models.
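
As a rough preview of the direction, and not a finalized design, the sketch below computes add-one (Laplace) smoothed bigram probabilities from the 1-gram and 2-gram tables already built for Blogs, following the standard formulation P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V):

### a minimal sketch of add-one (Laplace) smoothed bigram probabilities
### built from the existing b_one_gram_wf and b_two_gram_wf tables
V <- nrow(b_one_gram_wf)                    ### vocabulary size
uni_freq <- setNames(b_one_gram_wf$bFreq, as.character(b_one_gram_wf$bWord))
b_bigram_prob <- b_two_gram_wf
b_bigram_prob$first_word <- sapply(strsplit(as.character(b_bigram_prob$bTwoGram), " "), `[`, 1)
b_bigram_prob$prob <- (b_bigram_prob$bFreq + 1) / (uni_freq[b_bigram_prob$first_word] + V)
### the most probable continuations of "the" under this simple estimate
the_rows <- b_bigram_prob[b_bigram_prob$first_word == "the", ]
head(the_rows[order(-the_rows$prob), c("bTwoGram", "prob")])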

Plan for the Shiny App

I plan to have a text input area in the UI that detects the end of a word, such as a space or a punctuation mark, and sends that word, together with its immediately preceding words, to the server for next-word prediction. The server will send the most likely next words back to the UI for display. Users can then select the desired word from the predictions; if none is suitable, they can keep typing the next word and the prediction cycle will continue.
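
A minimal Shiny sketch of this interaction is given below; the helper predict_next_words(text, n) is a hypothetical placeholder for the model call, and the layout is illustrative only, not the final design.

### a minimal Shiny sketch of the planned interaction (predict_next_words is hypothetical)
library(shiny)

ui <- fluidPage(
   textAreaInput("user_text", "Type your text:", rows = 3),
   uiOutput("suggestions")
)

server <- function(input, output, session) {
   output$suggestions <- renderUI({
      txt <- input$user_text
      ### predict only when the current word has just ended (space or punctuation at the end)
      if (grepl("[[:space:][:punct:]]$", txt)) {
         candidates <- predict_next_words(txt, n = 3)   ### hypothetical model call on the server
         radioButtons("pick", "Predicted next words:", choices = candidates, inline = TRUE)
      }
   })
}

shinyApp(ui = ui, server = server)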