Data Science Specialization: Capstone Project (Milestone)

Introduction

The goal of this project is to build a predictive text application, which takes a phrase of one or more words as input and predicts the next word as output. So for example, if the user types "I went to the", the application might predict that the most likely next word is "store".

Data

The data available for this project is a set of text documents in four languages: English, German, Finnish, and Russian. Each of the four language sets includes text obtained from Twitter, news articles, and blogs. However, the specific origin of the text (which Twitter handles, which news sources, which blogs, etc.) is not known.

Data Exploration

A brief summary of the data is presented below. So far, only the English files have been analyzed, so the analysis below is limited to those documents.

The Twitter document contains 2,360,148 lines, 30,373,583 words, and 167,105,338 characters.
The news document contains 1,010,242 lines, 34,372,530 words, and 205,811,889 characters.
The blogs document contains 899,288 lines, 37,334,131 words, and 210,160,014 characters.
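
Counts like the ones above can be reproduced with a single pass over each file. The following is a minimal sketch, not the method actually used for this report: the function name text_stats is hypothetical, words are assumed to be whitespace-delimited, and characters are counted as decoded characters including newlines (a byte count would differ for non-ASCII text).

```python
import io

def text_stats(stream):
    """Return (lines, words, characters) for a text stream, read line by line."""
    lines = words = chars = 0
    for line in stream:
        lines += 1
        words += len(line.split())  # assumption: a word is a whitespace-delimited token
        chars += len(line)          # includes the trailing newline, if any
    return lines, words, chars

# Small in-memory example; a real file could be passed via open(path, encoding="utf-8")
sample = io.StringIO("I went to the store\nthanks for the follow\n")
print(text_stats(sample))  # (2, 9, 42)
```

Reading line by line keeps memory use flat, which matters for files of roughly 200 million characters each.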

Here are plots of the 10 most frequent words for each of the document types (using a 10% sample of the documents):

[Figure: plots of the 10 most frequent words in the Twitter, news, and blogs documents]

Notice that the Twitter data has the most personal language ("you", "i"), the news data has more formal language, and the blogs data is somewhere in between.

Here are plots of the 10 most frequent "bigrams" (word pairs) for each of the document types (using a 10% sample of the documents):

[Figure: plots of the 10 most frequent bigrams in the Twitter, news, and blogs documents]

Again, the Twitter data has the most personal language ("i was", "i have"), the news data has the most formal language, and the blogs data is somewhere in between.

Here is the most frequent "trigram" (3-word phrase) for each document type (using a 10% sample of the documents):

Twitter: "thanks for the"
news: "one of the"
blogs: "one of the"

Here is the most frequent "quadgram" (4-word phrase) for each document type (using a 10% sample of the documents):

Twitter: "thanks for the follow"
news: "for the first time"
blogs: "the end of the"
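
The sampling and n-gram counting behind these summaries can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the report: the helper names (tokenize, sample_lines, ngram_counts), the fixed random seed, and the choice not to let n-grams span line boundaries are all hypothetical.

```python
import random
import re
from collections import Counter

def tokenize(text):
    """Lowercase, drop everything except letters, digits, apostrophes, and whitespace."""
    return re.sub(r"[^a-z0-9'\s]", "", text.lower()).split()

def sample_lines(lines, fraction=0.10, seed=1234):
    """Keep each line independently with probability `fraction` (seed is arbitrary)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def ngram_counts(lines, n):
    """Count n-grams within each line; n-grams never cross a line boundary."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

lines = ["Thanks for the follow!", "thanks for the love", "one of the best"]
print(ngram_counts(lines, 3).most_common(1))  # [('thanks for the', 2)]
```

On the real data, the list passed to ngram_counts would be the output of sample_lines applied to one document, with n set to 1 through 4.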

Predictive Modeling Approach

Here is a brief summary of how the predictive model will work:

Given a phrase such as "I went to the", the model will locate all occurrences of the previous three words ("went to the") in the data, and will determine the most common next word (and its frequency).
The model will also locate all occurrences of the previous two words ("to the"), and will determine the most common next word (and its frequency).
The model will also locate all occurrences of the previous word ("the"), and will determine the most common next word (and its frequency).
A prediction will be chosen from among those three top words, with more predictive "weight" given to longer phrases and to higher frequencies.
If the predicted word is "offensive", the next most likely word will be predicted instead.
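
The steps above leave the exact weighting open, so the following is only one possible reading. It assumes each candidate is scored as a context-length weight times the word's relative frequency after that context; the WEIGHTS values, the BLOCKLIST contents, and all function names are hypothetical placeholders.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, drop everything except letters, digits, apostrophes, and whitespace."""
    return re.sub(r"[^a-z0-9'\s]", "", text.lower()).split()

def build_tables(lines, max_context=3):
    """tables[k][context] is a Counter of the words seen after that k-word context."""
    tables = {k: defaultdict(Counter) for k in range(1, max_context + 1)}
    for line in lines:
        tokens = tokenize(line)
        for k in tables:
            for i in range(len(tokens) - k):
                tables[k][tuple(tokens[i:i + k])][tokens[i + k]] += 1
    return tables

WEIGHTS = {3: 3.0, 2: 2.0, 1: 1.0}  # assumed: longer contexts count for more
BLOCKLIST = {"darn"}                # placeholder for a real list of offensive words

def predict(phrase, tables):
    """Pick the best top word across the 3-, 2-, and 1-word contexts, or None."""
    tokens = tokenize(phrase)
    best_word, best_score = None, 0.0
    for k, weight in WEIGHTS.items():
        if len(tokens) < k:
            continue
        followers = tables[k].get(tuple(tokens[-k:]))
        if not followers:
            continue
        total = sum(followers.values())
        for word, count in followers.most_common():
            if word in BLOCKLIST:
                continue  # skip to the next most likely word
            score = weight * count / total
            if score > best_score:
                best_word, best_score = word, score
            break  # only the top acceptable word per context length competes
    return best_word

corpus = ["I went to the store", "She went to the store", "We went to the darn gym"]
print(predict("I went to the", build_tables(corpus)))  # store
```

Using relative rather than raw frequency keeps the very common one-word contexts (such as "the") from overwhelming the longer ones; a different scoring rule would be equally consistent with the description above.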

Here is a brief summary of how the data will be treated:

A "word" is defined as a contiguous sequence of characters containing only letters, numbers, and apostrophes. All other characters are removed.
Common words ("stopwords") such as "a", "an", and "the" are kept in the data.
Word case (uppercase/lowercase) is ignored.
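
These three rules amount to a small tokenizer. The sketch below is one way to express them, assuming ASCII letters only (the English files) and treating "removed" literally, so that a hyphenated word collapses into a single token; the function name tokenize is hypothetical.

```python
import re

def tokenize(text):
    """Apply the cleaning rules: ignore case, keep letters/digits/apostrophes, keep stopwords."""
    lowered = text.lower()                          # word case is ignored
    cleaned = re.sub(r"[^a-z0-9'\s]", "", lowered)  # remove all other characters
    return cleaned.split()                          # no stopword filtering is applied

print(tokenize("I went to THE store, didn't I?"))
# ['i', 'went', 'to', 'the', 'store', "didn't", 'i']
```

Keeping stopwords matters here because the most frequent n-grams found earlier ("one of the", "for the first time") consist almost entirely of them.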

Because the model will have to make predictions in near real-time, all predictions will be pre-calculated and stored in a lookup table format so that the predictive process merely involves locating a match.
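
A lookup table of this kind could be as simple as a dictionary from context string to predicted word. The sketch below is a simplification under stated assumptions: it stores only the single most frequent next word per context and resolves a query by longest matching context, rather than by the weighted choice described earlier. The names build_lookup and lookup are hypothetical.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, drop everything except letters, digits, apostrophes, and whitespace."""
    return re.sub(r"[^a-z0-9'\s]", "", text.lower()).split()

def build_lookup(lines, max_context=3):
    """Pre-calculate one prediction for every 1- to 3-word context seen in the data."""
    followers = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for k in range(1, max_context + 1):
            for i in range(len(tokens) - k):
                followers[" ".join(tokens[i:i + k])][tokens[i + k]] += 1
    # Collapse each Counter to its most frequent next word
    return {context: c.most_common(1)[0][0] for context, c in followers.items()}

def lookup(phrase, table, max_context=3):
    """Prediction time is just dictionary lookups, longest context first."""
    tokens = tokenize(phrase)
    for k in range(min(max_context, len(tokens)), 0, -1):
        word = table.get(" ".join(tokens[-k:]))
        if word is not None:
            return word
    return None

table = build_lookup(["I went to the store", "We went to the store", "Back to the future"])
print(lookup("I went to the", table))  # store
```

All the counting happens once in build_lookup; each prediction then costs at most three dictionary lookups, which fits the near real-time requirement.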
