Synopsis

Due to the explosive popularity of smartphones, many people spend a lot of time on their mobile devices for email, social networking, shopping, banking, and many other activities. Several companies have built smart keyboards that make it easier for users to type efficiently on their mobile devices. SwiftKey is one of the pioneering companies in the field of predictive text modeling. Their keyboard presents three options for what the next word might be.
The goal of this capstone project is to analyze a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model.
The first step in building a predictive model for text is understanding the distribution of, and the relationships between, the words, tokens, and phrases in the text. This report presents the basic relationships in the data to help build our linguistic models.

Data and Basic Summaries

The data provided for this capstone project comes from a corpus called HC Corpora (www.corpora.heliohost.org). Only the three files listed below are used in the exploratory analysis and model building. The following table shows their basic summaries.

File name            Size (MB)  Line Count  Word Count  Character Count
en_US.blogs.txt            200      899288    37334131        210160014
en_US.news.txt             196     1010242    34372529        205811888
en_US.twitter.txt          159     2360148    30373583        167105331
Total                      555     4269678   102080243        583077233
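
Summaries of this kind could be computed along the following lines. This is a minimal Python sketch, not the code used for the report: the function name summarize is hypothetical, words are assumed to be whitespace-delimited tokens, and characters are counted after decoding as UTF-8, so the results may differ slightly from the table depending on how the original counts were produced.

```python
import os


def summarize(path):
    """Return (size in MB, line count, word count, character count) for a text file."""
    lines = words = chars = 0
    # Stream the file line by line so that large files are never held in memory.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())  # assumption: whitespace-delimited tokens
            chars += len(line)          # includes the trailing newline
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb, lines, words, chars
```

Calling summarize("en_US.blogs.txt") would return the four values for one row of the table.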

Sampling and Cleaning Data

Due to limited computing resources, a 5% random sample was drawn from each file. The samples were then cleaned by removing numbers, punctuation, extra whitespace, and strange characters, and by converting all text to lower case.
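
The sampling and cleaning steps above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the report's own code: the names sample_lines and clean, the fixed seed, and the choice to treat everything outside a-z as a "strange character" are all assumptions made for the example.

```python
import random
import re


def sample_lines(lines, fraction=0.05, seed=42):
    """Draw a reproducible random sample containing `fraction` of the lines."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sample repeatable
    k = round(len(lines) * fraction)
    return rng.sample(lines, k)


def clean(text):
    """Lower-case the text and keep only the letters a-z separated by single spaces."""
    text = text.lower()
    # Replace digits, punctuation, and any other non a-z character with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()
```

Replacing punctuation with a space splits contractions such as "it's" into "it" and "s". Whether that is acceptable depends on the model, so this choice should be revisited before the final application is built.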

Analyzing Documents

The frequencies of 1-gram, 2-gram, and 3-gram phrases were calculated. The following figures show the distributions of word frequencies in each n-gram model.

[Figures: word frequency distributions for the 1-gram, 2-gram, and 3-gram models]
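
The n-gram frequencies described above could be computed along these lines. This is a minimal Python sketch that assumes the text has already been cleaned and split into a list of tokens. The function name ngram_counts is hypothetical and is not taken from the report.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the token list."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


tokens = "the cat sat on the cat".split()
bigrams = ngram_counts(tokens, 2)
# most_common(k) returns the k most frequent n-grams with their counts,
# which is the data a frequency plot would be drawn from.
top_bigrams = bigrams.most_common(3)
```

Running the same function with n set to 1, 2, and 3 gives the three frequency tables that the figures summarize.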

Further Analysis and Predictive Model Approach

Further experiments with data sampling will be conducted to find an adequate sample ratio for the unigram, 2-gram, 3-gram, and 4-gram models. The sample size is an important factor in the trade-off between limited computing resources and prediction accuracy. Backoff techniques and the fusion of the unigram, 2-gram, 3-gram, and 4-gram models will be explored to optimize the accuracy of our prediction application.
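
One simple form of backoff could look like the following Python sketch. It is only an illustration of the idea, not the planned implementation: the names build_model and predict are hypothetical, and the rule used here (take the longest context that has been seen, otherwise drop the oldest word and retry) is a plain backoff without the discounting or weighted combination of models that a tuned system would probably need.

```python
from collections import Counter, defaultdict


def build_model(tokens, max_n=4):
    """Map each context (0 to max_n - 1 preceding words) to counts of the next word."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            model[tuple(context)][word] += 1
    return model


def predict(model, history, k=3, max_n=4):
    """Return up to k candidate next words, backing off to shorter contexts."""
    for size in range(min(max_n - 1, len(history)), -1, -1):
        # The last `size` words of the history; size 0 gives the unigram context ().
        context = tuple(history[len(history) - size:])
        if context in model:
            return [word for word, _ in model[context].most_common(k)]
    return []


tokens = "i like green eggs and i like ham and i like green tea".split()
model = build_model(tokens)
suggestions = predict(model, ["i", "like"])
```

With k set to 3, the function returns at most three candidates, matching the three options a keyboard such as SwiftKey presents.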