Introduction

The goal of this project is to build a predictive text application that takes a phrase of one or more words as input and predicts the next word as output. For example, if the user types "I went to the", the application might predict that the most likely next word is "store".

Data

The data available for this project is a set of text documents in four languages: English, German, Finnish, and Russian. Each of the four language sets includes text obtained from Twitter, news articles, and blogs. However, it is not known specifically where the text came from (which Twitter handles, which news sources, which blogs, etc.).

Data Exploration

A brief summary of the data is presented below. So far, only the English files have been analyzed, so the analysis below is limited to those documents.

Here are plots of the 10 most frequent words for each of the document types (using a 10% sample of the documents):

[Figure: three plots of the 10 most frequent words, one per document type]

Notice that the Twitter data has the most personal language ("you", "i"), the news data has more formal language, and the blogs data is somewhere in between.
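
As a rough illustration of how word counts like those plotted above could be produced, here is a minimal Python sketch. It is not the code used for this report: the function names, the tokenization rule, and the toy corpus are all assumptions made for the example, and the 10% sample is drawn by keeping each line with probability 0.1.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.1, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes.

    This is an assumed, deliberately simple tokenization rule.
    """
    return re.findall(r"[a-z']+", text.lower())


def top_words(lines, k=10):
    """Return the k most frequent words across the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common(k)


# Hypothetical toy corpus standing in for one of the English files.
toy_lines = ["I love you", "you and I went to the store", "I was there"]
print(top_words(toy_lines, k=2))
```

On a real file, the lines would first be passed through sample_lines and the result handed to top_words with k=10.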

Here are plots of the 10 most frequent "bigrams" (word pairs) for each of the document types (using a 10% sample of the documents):

[Figure: three plots of the 10 most frequent bigrams, one per document type]

Again, the Twitter data has the most personal language ("i was", "i have"), the news data has the most formal language, and the blogs data is somewhere in between.

Here is a list of the most frequent "trigrams" (3-word phrases) for each document type (using a 10% sample of the documents):

Here is a list of the most frequent "quadgrams" (4-word phrases) for each document type (using a 10% sample of the documents):
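
Bigrams, trigrams, and quadgrams can all be counted with the same routine by varying the phrase length n. The sketch below is one possible way to do this in Python, not the method used for this report. It assumes whitespace tokenization and that phrases never span two lines of the input; the names ngrams and top_ngrams and the toy corpus are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(lines, n, k=10):
    """Return the k most frequent n-grams, counted within each line."""
    counts = Counter()
    for line in lines:
        # Assumed tokenization: lowercase, split on whitespace.
        counts.update(ngrams(line.lower().split(), n))
    return counts.most_common(k)


# Hypothetical toy corpus.
toy_lines = ["i went to the store", "i went to the park"]
print(top_ngrams(toy_lines, n=3, k=2))  # trigrams
print(top_ngrams(toy_lines, n=4, k=1))  # quadgrams
```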

Predictive Modeling Approach

Here is a brief summary of how the predictive model will work:

Here is a brief summary of how the data will be treated:

Because the model will have to make predictions in near real time, all predictions will be pre-calculated and stored in a lookup table, so that making a prediction merely involves locating a match.
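
To make the lookup-table idea concrete, here is a minimal Python sketch under stated assumptions. It pre-calculates the most frequent next word for every prefix of one to three words, so that prediction is a dictionary lookup. Two details are assumptions of this example rather than part of the plan described above: the maximum phrase length of four words, and falling back to a shorter prefix when the longer one has no match. The names build_lookup and predict and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_lookup(lines, max_n=4):
    """Map each prefix of 1 to max_n - 1 words to its most frequent next word."""
    followers = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                prefix = tuple(tokens[i:i + n - 1])
                followers[prefix][tokens[i + n - 1]] += 1
    # Pre-calculate the single best prediction for each prefix.
    return {prefix: c.most_common(1)[0][0] for prefix, c in followers.items()}


def predict(table, phrase, max_n=4):
    """Return the stored prediction for the longest matching trailing prefix."""
    tokens = phrase.lower().split()
    # Assumption: try the longest prefix first, then progressively shorter ones.
    for length in range(min(max_n - 1, len(tokens)), 0, -1):
        word = table.get(tuple(tokens[-length:]))
        if word is not None:
            return word
    return None  # no match at any prefix length


# Hypothetical toy corpus.
toy_lines = ["i went to the store", "she went to the store", "he went to the park"]
table = build_lookup(toy_lines)
print(predict(table, "I went to the"))
```

Storing only the top word per prefix keeps the table small and each lookup constant-time, at the cost of not being able to offer ranked alternatives.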
