Introduction

This report gives a brief overview of the exploratory analyses conducted in preparation for building a text prediction model and a text prediction app. Both are being built to fulfill the requirements of the JHDSS Capstone Project course. A text prediction app assists a user in typing by offering meaningful suggestions for completing the text. For example, if a user types “baa baa black”, then “sheep” would be the suggestion, or one of the suggestions, for the next word.
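To make the idea of a next-word suggestion concrete, here is a minimal sketch of a frequency-based lookup. It is purely illustrative and is not the model built for this project: the function names `build_model` and `suggest`, the two-word context, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_model(sentences, context_size=2):
    """Map each run of `context_size` words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - context_size):
            context = tuple(words[i:i + context_size])
            model[context][words[i + context_size]] += 1
    return model

def suggest(model, text, context_size=2, k=3):
    """Return up to k candidate next words for the last `context_size` words of text."""
    context = tuple(text.lower().split()[-context_size:])
    followers = model.get(context)  # .get avoids adding unseen contexts to the model
    if not followers:
        return []
    return [word for word, _ in followers.most_common(k)]

# Toy corpus (hypothetical); the real model would be trained on the project data.
corpus = [
    "baa baa black sheep have you any wool",
    "the black sheep ran away",
    "baa baa black sheep",
]
model = build_model(corpus)
suggestions = suggest(model, "baa baa black")
print(suggestions)
```

With this toy corpus, the only word ever seen after "baa black" is "sheep", so it is the single suggestion returned. An unseen context simply yields an empty list.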

Data

Building a prediction model and an app for a given language requires text data in that language from which features of the language can be learned. The data provided for the project can be downloaded from: here. The data contains text corpora gathered from 3 types of sources {news, blogs, Twitter} in 4 different languages {English, German, Finnish, Russian}. The zipped data source is approximately 562 MB in size. Only the English sections of the corpora are explored and addressed in this report.

The English section of the data comes in 3 files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt (in the en_US subfolder). The filenames indicate the source of the text data.

Data Exploration

The line counts and the word counts for the English data are summarized in the following table. Note that the word counts are approximate and depend on the tokenization scheme employed.

Data Source    Line Count    Total Word Count    Unique Word Count
Blogs             899,288          36,636,565              471,357
News            1,010,242          33,256,428              343,319
Twitter         2,360,148          28,821,930              490,359

These figures include stopwords (words such as a, an, the, at, be, etc.) and profanities.
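Because the counts depend on how the text is tokenized, a small sketch may help show where that dependence comes from. The tokenization rule below (lowercase the text, then take runs of letters with optional internal apostrophes) is an assumption chosen for illustration, not the scheme used to produce the table. The names `tokenize` and `summarize` are likewise hypothetical.

```python
import re
from collections import Counter

# Assumed rule: lowercase letters, allowing internal apostrophes as in "don't".
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(line):
    """Split one line of text into lowercase word tokens."""
    return TOKEN_RE.findall(line.lower())

def summarize(lines):
    """Return (line count, total word count, unique word count) for an iterable of lines."""
    n_lines = 0
    counts = Counter()
    for line in lines:
        n_lines += 1
        counts.update(tokenize(line))
    return n_lines, sum(counts.values()), len(counts)

# Small in-memory sample; stopwords are kept, as in the table.
sample = ["The cat sat on the mat.", "Don't stop the music!"]
summary = summarize(sample)
print(summary)
```

The same function accepts an open file handle, so one of the corpus files could be passed in directly, for example `summarize(open("en_US.blogs.txt", encoding="utf-8"))`, where the UTF-8 encoding is an assumption. A different rule, such as splitting on whitespace only, would treat "mat." and "mat" as distinct words and change both word counts.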

The following figure illustrates the same data graphically.

[Figure: line counts and word counts by data source]

The following graphs show the word frequencies for the 30 most frequently occurring words in the data.

[Figures: word frequencies for the 30 most frequent words (three plots)]
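The ranking behind such frequency plots can be sketched as follows. This is a minimal example under assumptions: the helper name `top_words` and the simple letter-based tokenization are illustrative, and the plotting step is omitted because the report does not show how the graphs were drawn.

```python
import re
from collections import Counter

# Assumed tokenization: lowercase runs of letters with optional internal apostrophes.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def top_words(lines, n=30):
    """Return the n most frequent words as (word, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts.most_common(n)

# Small in-memory sample standing in for one of the corpus files.
sample = ["the cat and the dog", "the dog and the bird"]
top = top_words(sample, n=3)
print(top)
```

Since stopwords are not removed, words such as "the" and "and" can be expected to dominate a ranking like this on real text.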

Next steps

The next steps will involve: