JHDSS Capstone Project – Milestone Report

Introduction

This report gives a brief overview of the exploratory analysis carried out in preparation for building a text prediction model and a text prediction app. The model and app are being built to fulfill the requirements of the JHDSS Capstone Project course. A text prediction app helps a user type by suggesting meaningful completions. For example, if the user types “baa baa black”, then “sheep” would be the suggestion (or one of the suggestions) for the next word.

Data

Building a prediction model and an app for a given language requires data in that language, from which features of the language can be discovered and learned. The data provided for the project can be downloaded here. It contains text corpora gathered from 3 types of sources {news, blogs, Twitter} in 4 different languages {English, German, Finnish, Russian}. The zipped data is approximately 562 MB in size. Only the English portion of the corpora is explored in this report.

The English portion of the data comes in 3 files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt (in the en_US subfolder). The filenames indicate the source of the text data.

Data Exploration

The following table summarizes the line counts and word counts for the English data. Note that the word counts are approximate, since they depend on the tokenization scheme used.

Data Source	Line Count	Total Word Count	Unique Word Count
Blogs	899,288	36,636,565	471,357
News	1,010,242	33,256,428	343,319
Twitter	2,360,148	28,821,930	490,359

These figures include stopwords (words such as a, an, the, at, be, etc.) and profanities.
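As an illustration of how counts like those in the table depend on tokenization, the following is a minimal sketch in Python. The tokenization rule (lowercased runs of letters, with internal apostrophes allowed) and the names tokenize and summarize are assumptions made for this example; they are not the scheme used to produce the figures above.

```python
import re
from collections import Counter

# Assumed tokenization rule: a token is a run of letters, optionally with
# internal apostrophes (e.g. "don't"), compared case-insensitively.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(line):
    """Lowercase a line and return its word tokens."""
    return TOKEN_RE.findall(line.lower())

def summarize(lines):
    """Return (line count, total word count, unique word count)."""
    counts = Counter()
    n_lines = 0
    for line in lines:
        n_lines += 1
        counts.update(tokenize(line))
    return n_lines, sum(counts.values()), len(counts)

sample = ["The quick brown fox.", "Don't stop; the fox won't."]
print(summarize(sample))  # (2, 9, 7)
```

Because summarize accepts any iterable of lines, an open file handle (for example one opened on en_US.blogs.txt with UTF-8 encoding) can be passed directly, so the file is read one line at a time. A different tokenization rule would change the word counts but not the line counts.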

The following figure illustrates the same data graphically.

[Figure: line counts and word counts by data source]

The following graphs show the frequencies of the 30 most frequently occurring words in the data.

[Figures: three plots of the frequencies of the top 30 words]
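The word frequencies behind graphs of this kind can be computed along the following lines. This is a sketch in Python that assumes simple whitespace tokenization and a hypothetical helper named top_words; it keeps stopwords, as the counts in this report do, and it is not the code that produced the plots.

```python
from collections import Counter

def top_words(lines, n=30):
    """Return the n most frequent words as (word, count) pairs.

    Assumption: words are lowercased and split on whitespace only.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts.most_common(n)

sample = ["the cat sat on the mat", "the dog sat"]

# Print a small text bar chart of the two most frequent words.
for word, count in top_words(sample, n=2):
    print(f"{word:<5} {'#' * count} {count}")
```

On a full corpus the same function would be called with n=30 on each source file, and the resulting pairs handed to a plotting library.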

Next steps

The next steps will involve:

splitting the data into training, testing, and validation datasets;
constructing n-gram models of up to 4-grams with a smart back-off strategy, possibly using frequently associated terms;
implementing a profanity filter to prevent any profane word suggestions from being offered to the user;
developing and deploying a Shiny app that provides a proof-of-concept.
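The first three steps above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the planned implementation: the 60/20/20 split, the whitespace tokenization, and the names split_lines, build_model and predict are all hypothetical. The back-off shown is the plainest possible one (try the longest context first, then shorter ones, with no smoothing or discounting), and it does not use frequently associated terms.

```python
import random
from collections import Counter, defaultdict

def split_lines(lines, train=0.6, test=0.2, seed=42):
    """Shuffle lines and split them into train / test / validation lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    a = int(len(shuffled) * train)
    b = a + int(len(shuffled) * test)
    return shuffled[:a], shuffled[a:b], shuffled[b:]

def build_model(lines, max_n=4):
    """Map every context of 0 to max_n - 1 words to counts of the next word."""
    model = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        for i, word in enumerate(words):
            for k in range(max_n):      # k = number of context words
                if i - k < 0:
                    break
                model[tuple(words[i - k:i])][word] += 1
    return model

def predict(model, text, banned=frozenset(), max_n=4, top=3):
    """Suggest next words, backing off from the longest context to shorter ones.

    Words in `banned` (e.g. a profanity list) are never suggested.
    """
    words = text.lower().split()
    for k in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])
        ranked = model.get(context, Counter()).most_common()
        candidates = [w for w, _ in ranked if w not in banned]
        if candidates:
            return candidates[:top]
    return []

# Toy corpus for demonstration only.
corpus = ["baa baa black sheep have you any wool"]
model = build_model(corpus)
print(predict(model, "baa baa black"))  # ['sheep']
```

Passing a set of words as banned shows the filter at work: if the best match for a context is banned, the function falls back to shorter contexts until it finds an allowed suggestion. In practice the model would be built from the training split only, with the test and validation splits held out for evaluation.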

November 16, 2014.
