This report summarizes text mining activities performed with the goal of building a word prediction algorithm.

Objectives

The objective of this project is to build a simple word prediction algorithm that can be deployed as a Shiny app (http://shiny.rstudio.com) in several weeks. The application must be built using the R programming language. It needs to be computationally efficient and reside within the infrastructure provided by the Shiny environment. Deploying an app using Shiny requires working with datasets that can be easily managed in working memory and code that ensures timely responses to user input.

Predicting the next word when given a one, two, or three word phrase involves at least two kinds of patterns in language. One pattern to understand is that certain words frequently go together in a specific order. An example of this is “the United ______.” Words that frequently complete this phrase are “Kingdom” or “States”. Another example is the phrase “a couple of _____”. The next word is very likely to be a noun, an adjective, or an adverb (e.g., “years”, “happy”, “barking”) and almost certainly not a verb (e.g., “are”, “will”).

The other language pattern that is relevant to understanding how to build this prediction algorithm is that certain words tend to be found in proximity to other words in written and spoken language, although they do not have to appear in a specific order or phrase when found together. One example of this is the word “good”, which is often found in combination with the words “time” or “bad” in a sentence.

We will use text mining techniques to analyze large data sets of text and identify sets of common English phrases and lists of words that frequently go together. These lists of words and words patterns will be used to make a prediction model that guesses the next word when provided with a one, two, or three word input. We anticipate the algorithm will employ a combination of exact phrase matching and related word lists.

Data Sources

The documented requirements for this project state that text data sources from HC Corpora (http://www.corpora.heliohost.org) should be used for building the prediction model. We have limited this project to leverage the English language data taken from blogs, newspapers, and Twitter specified by the project requirements.

News text from these sources look like this:

## [1] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [2] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"

Examples of blog text look like this:

## [1] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [2] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"

Examples of twitter text look like this:

## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [2] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"                                  
## [3] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"          
## [4] "I'm coo... Jus at work hella tired r u ever in cali"

The text data files are 150-200 MB in size, so working with these files in R is a major consideration in designing the application.

File ApproxMegabytes Lines
./final/en_US/en_US.news.txt 206 1010242
./final/en_US/en_US.blogs.txt 210 899288
./final/en_US/en_US.twitter.txt 167 2360148

Text Mining Approaches

There are several packages available for the R programming language to perform common text mining procedures. This project will rely on the tm package for many of these tasks. This package is well-documented and frequently used for this type of analysis.

In order to generate a list of common two, three, and four words phrases, we will use the following workflow:

  1. Make a corpus from the text data sources provided, employing sampling to manage the size of data sets
  2. Pre-process the text sources for analysis (remove punctuation, capitalization, non-text characters, etc.)
  3. Tokenize the text: break down the corpus into a list of words or phrases
  4. Build a table that contains the count of the words/phrases in each “document” in the corpus

Here are some results from preliminary analysis. Approximately 42,000 lines of the Twitter text data were sampled at random to create a corpus. This corpus was then tokenized into single “words”.

## [1] "Number of tweets:  42400"
## [1] "Number of terms:  37777"

The resulting term-document matrix looks like this. This is only part of the matrix, with 11 words and 6 tweets.

inspect(one_gram_tweets[11105:11115,11105:11110])
## <<TermDocumentMatrix (terms: 11, documents: 6)>>
## Non-/sparse entries: 0/66
## Sparsity           : 100%
## Maximal term length: 15
## Weighting          : term frequency (tf)
## 
##                       Docs
## Terms                  11105 11106 11107 11108 11109 11110
##   expectations             0     0     0     0     0     0
##   expectations”          0     0     0     0     0     0
##   expected                 0     0     0     0     0     0
##   expecting                0     0     0     0     0     0
##   expects                  0     0     0     0     0     0
##   expedias                 0     0     0     0     0     0
##   expedient                0     0     0     0     0     0
##   expel                    0     0     0     0     0     0
##   expendables              0     0     0     0     0     0
##   expenditures             0     0     0     0     0     0
##   expense                  0     0     0     0     0     0

Note that none of the words in this set appear in the documents selected for display. This is one of the typical characteristics of a large term-document matrix - it is sparse. We will find that there are a few terms that appear hundreds of times in this corpus, while most occur less frequently. Here are the words that appear at least 1,000 times in this corpus:

findFreqTerms(one_gram_tweets, lowfreq = 1000)
##  [1] "about"  "all"    "and"    "are"    "back"   "but"    "can"   
##  [8] "cant"   "day"    "dont"   "for"    "from"   "get"    "good"  
## [15] "got"    "great"  "have"   "how"    "its"    "just"   "know"  
## [22] "like"   "lol"    "love"   "more"   "new"    "not"    "now"   
## [29] "one"    "our"    "out"    "see"    "some"   "thanks" "that"  
## [36] "the"    "there"  "they"   "this"   "time"   "today"  "too"   
## [43] "was"    "what"   "when"   "will"   "with"   "you"    "your"

We can plot the frequency distribution of the terms in this corpus to get a more complete picture.

One can see from this figure that of the 37777 terms tokenized in the corpus sampled, 23115 appeared only once in the corpus, which is 61.2 percent.

This indicates that if we eliminate the words that only appear once, we can reduce the size of the term document matrix by more than half, which may help with processing efficiency without significantly impacting the accuracy of the prediction model being built. Of course, this requires the sample corpus used to large enough and have enough diversity to assure that the dictionary we build for prediction is sufficiently robust.

Another technique that we will use is relating which terms are frequently found with others. Here is an analysis performed using the same 42,000 line sample of Twitter data. We can investigate which terms are related to the word “taylor”.

findAssocs(one_gram_tweets, c("taylor"), .15)
## $taylor
##       swift    barnette  cheapquick   dribbling entertainer  hemorrhage 
##        0.26        0.18        0.18        0.18        0.18        0.18 
##      hookin      joanna  livingston        loko     midcity      newsom 
##        0.18        0.18        0.18        0.18        0.18        0.18 
##      sencio    stockett      swifts        tate     tyshawn   wuuuddupp 
##        0.18        0.18        0.18        0.18        0.18        0.18 
##        yhur 
##        0.18

Another way to illustrate word relationships is by building a term by term matrix. The resulting table shows the number of times within the sample that a particular word was found in the same tweet as another.

##         Terms
## Terms    get getting going good got great had haha happy has have
##   being   26       7     5   20   5     9   9    6     4  10   28
##   best    28       8     7   31  21    21  16    3    17  22   47
##   better  49      29    16   33  17    12   6   10     6  11   52
##   but    162      38    63  131  71    67  74   37    29  63  240
##   can    170      11    30   48  31    36  16   20    18  25  138
##   cant    86      12    18   28  22    30   9    8     9  16   75
##   come    44       8    12   24  12    20   7    4    13  13   56
##   could   32       5     4   13   8    14  13    7     3   6   61
##   day     86      17    53  111  29   106  52    9   193  29  147
##   did     58       5    12   27  19    30  16   11     8   6   40
##   dont   114      13    35   46  25    15   9   17    16  22  182

These are just two of the approaches used for quantifying the relationship of terms in a corpus.

Considerations and Next Steps

We have completed initial efforts at text mining and building prototypes of some of the building blocks of the prediction algorithm. Some of the challenges that need to be addressed are discussed below.

First, there is the matter of designing a prediction algorithm itself. To start, our approach will be to design a simple model that works, but may not be very accurate. We will improve upon the first version as time allows. We anticipate the first version of the prediction model will accept the user input, compare it to lists of two, three and four words phrases we derive from term-document matrices, and suggest the answer based on the most probable word based on look-up table results. For example, if the user inputs “orange” and our analysis indicates that “juice” was the most frequent two-word phrase starting with “orange” (as opposed to “peel” or “shirt”), that will be the answer provided. We may build improvements as time allows on this model by leveraging some machine learning prediction models for analyzing our corpora, particularly clustering algorithms.

Processing time and resources have a huge impact on the design of the project. We have built two-gram, three-gram, and four-gram term-document matrices from sampled news, blog and text data, eliminated the most infrequent terms, and then made hash tables of these terms (which are much faster to use in a Shiny app). Constructing a single four-gram lookup table can take many hours using the corpora provided for this project, so we have built batch sampling and term-document matrix combination procedures to break down this task into more manageable steps. Likewise, performing associated word look-ups using the tools in the tm package take too long with large term-document matrices, so the project will need to design faster, more computationally efficient ways to perform such functions in the Shiny app.

Additionally, for the term adjacency (related term) aspect of the prediction model, we may use publicly available data sets rather than trying to make one from scratch from the corpora. The R programming language is not ideal for performing this work on large corpora and time/resources are limited. An example resource is http://clic.cimec.unitn.it/composes/semantic-vectors.html.

Other work that has yet to be completed includes designing the user interface and server model for the Shiny app, and dealing with issues such as eliminating profanity terms from the prediction model, and making final determinations on how to handle user input where the terms are not in the prediction model and whether to produce one result or multiple results (e.g., top five) for the prediction.

===