The goal of this project is to create a predictive text model that assists a user in typing on a constrained mobile device. The model will consume a set of text culled from multiple sources to learn the style of written language and provide contextually relevant suggestions for next words as the user types.

Data

The model will consume data from three sources. Each source has significant stylistic differences, which will give the model exposure to the varied contexts in which written language is used.

Blogs

The first source of text is pulled from multiple internet blogs. This data set likely has a diverse set of authors, from professionals to amateurs, with varied backgrounds and education. There is likely some level of formality in this writing, but the degree will also vary greatly.

The data file contains a single blog document on each line. There are 899,288 blog documents contained within the data. The longest blog entry contains 40,833 characters and the shortest contains 1 character.

News

The second source of text is pulled from news stories written by professional journalists. This source is likely to contain text with the greatest degree of formality and professionalism. The authors of this text all likely have a similar background and education.

The data file contains a single news article on each line. There are 1,010,242 news documents contained within the data. The longest news article contains 11,384 characters and the shortest contains 1 character.

Twitter

The last source of text is pulled from Twitter. This source is likely to contain text with the lowest level of formality. The authorship and content are likely to be extremely diverse.

The data file contains a single tweet on each line. There are 2,360,148 tweets contained within the data. The longest tweet contains 11,384 characters and the shortest contains 1 character.
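
As a rough illustration of how the document counts and length extremes above could be produced, the sketch below reads each file line by line and tracks the number of documents along with the longest and shortest entries. The file names are placeholders, not the actual corpus paths.

```python
# Minimal sketch: count documents (one per line) and track the longest and
# shortest entry lengths for each source. File names are assumed placeholders.

def summarize(path):
    count, longest, shortest = 0, 0, None
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            n = len(line.rstrip("\n"))
            count += 1
            longest = max(longest, n)
            shortest = n if shortest is None else min(shortest, n)
    return count, longest, shortest

for name in ("blogs.txt", "news.txt", "twitter.txt"):  # assumed file names
    docs, longest, shortest = summarize(name)
    print(f"{name}: {docs:,} documents, longest {longest:,} chars, shortest {shortest:,} chars")
```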

Exploration

Common Terms

A word cloud is an image composed of words in which the size of each word indicates its frequency. It is a relatively simple technique for understanding the types of words used in each data source. Common stop words such as ‘the’ and ‘of’ have been removed to give a clearer picture of the remaining text.
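
The word clouds below visualize these term frequencies. As a sketch of how such frequencies could be computed, the snippet below tokenizes each line, drops the stop words, and tallies the remaining terms; the stop-word list shown is a small illustrative subset and the file path is an assumption, not the exact setup used to build the figures.

```python
# Sketch: term frequencies with common stop words removed -- the counts a
# word cloud visualizes through word size. The stop-word list here is a
# small illustrative subset, and the file path is assumed.
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "it", "that", "for"}

def term_frequencies(path):
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            tokens = re.findall(r"[a-z']+", line.lower())
            counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

# Most frequent remaining terms for a source, e.g. the Blogs file:
print(term_frequencies("blogs.txt").most_common(20))
```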

Wordcloud containing words from the Blogs text.

The Blogs text contains a diverse vocabulary. The relative similarity in size of the words in the cloud shows this broad vocabulary. This seems to support the theory that a diverse authorship was responsible for this text.

Wordcloud containing words from the News text.

The News word cloud shows a less diverse vocabulary, with a few words such as ‘said’, ‘year’ and ‘time’ used more frequently. The similar colors and sizes of many of the words around the fringe also indicate a less diverse vocabulary. This supports the theory that the News text authorship is more homogeneous.

Wordcloud containing words from the Twitter text.

The Twitter word cloud has a diverse set of colors and sizes. Unlike the News text, this indicates a diverse vocabulary that was likely created by a diverse authorship.

Frequency of Term Occurrence

The frequency of term occurrence counts the number of terms that appear a fixed number of times within the text. For example, there are roughly 80,000 terms that appear once in the blog text, while roughly 1,000 terms appear 10 times, a far lower number. As would be expected, this count decreases rapidly, as shown in the figure.
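
A minimal sketch of how this occurrence spectrum could be computed is shown below; the helper name and the tiny input are illustrative only, and in practice the term counts would come from a full source.

```python
# Sketch: the frequency-of-occurrence spectrum -- how many distinct terms
# appear exactly k times in a source.
from collections import Counter

def occurrence_spectrum(term_counts, max_k=10):
    # term_counts maps each term to its total number of occurrences.
    spectrum = Counter(term_counts.values())
    return {k: spectrum.get(k, 0) for k in range(1, max_k + 1)}

# Tiny illustrative input; a real run would use counts from the full corpus.
term_counts = Counter({"ham": 1, "eggs": 1, "green": 2, "said": 10})
print(occurrence_spectrum(term_counts))
# {1: 2, 2: 1, 3: 0, ..., 10: 1}: two terms appear once, one twice, one ten times.
```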

The number of terms that have a fixed number of occurrences in the text.

It should be noted that the News text contains significantly fewer terms that appear only once when compared to the Blog and Twitter sources. The News vocabulary is less diverse than that of the other sources, which makes sense given that most news articles are written by journalists with similar backgrounds and education. The other data sources do not share this similarity of authorship.

N-Grams

Breaking phrases into n-grams is a common method for analyzing text, and an n-gram model is a simple and common approach to predicting text. For example, consider the sentence “I eat green eggs and ham.” The 1-grams of this sentence are “I”, “eat”, “green”, “eggs”, “and”, “ham”. The 2-grams are “I eat”, “eat green”, “green eggs”, “eggs and”, “and ham”.
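
As a brief sketch of this idea, the snippet below extracts the 1-grams and 2-grams of the example sentence with a simple sliding window; the function name is illustrative and not part of the eventual model.

```python
# Sketch: extracting 1-grams and 2-grams from the example sentence.
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I eat green eggs and ham".split()
print(ngrams(tokens, 1))  # ['I', 'eat', 'green', 'eggs', 'and', 'ham']
print(ngrams(tokens, 2))  # ['I eat', 'eat green', 'green eggs', 'eggs and', 'and ham']
```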

Conclusion

The predictive text model requires a diverse training data set in order to be broadly applicable and generalizable across multiple contexts. The Blogs, News and Twitter data have been shown to vary significantly from one another, and together they will provide a sufficiently diverse input for the predictive text model.