Data Science Capstone - Exploratory Analysis Report

Synopsis

Due to the explosive popularity of smartphones, many people spend a lot of time on their mobile devices for email, social networking, shopping, banking, and many other activities. Several companies have built smart keyboards that make it easier for users to type efficiently on their mobile devices. SwiftKey is one of the pioneering companies in the field of predictive text modeling. Their keyboard presents three options for what the next word might be.
The goal of this capstone project is to analyze a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model.
The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. This report presents the basic relationships in the data to help build our linguistic models.

Data and Basic Summaries

The data provided for this capstone project is from a corpus called HC Corpora (www.corpora.heliohost.org). Only three files (shown below) are used in the exploratory analysis and model building. The following table shows their basic summaries.

File name	Size (MB)	Line Count	Word Count	Character Count
en_US.blogs.txt	200	899288	37334131	210160014
en_US.news.txt	196	1010242	34372529	205811888
en_US.twitter.txt	159	2360148	30373583	167105331
Total	555	4269678	102080243	583077233
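
Summaries of this kind can be computed in a single pass over each file. The following is a minimal Python sketch, not the code used to produce the table above. The function name summarize_text_file is hypothetical, and it assumes UTF-8 input and whitespace-delimited words, so its counts may differ slightly from the table if a different tool or word definition was used.

```python
import os


def summarize_text_file(path, encoding="utf-8"):
    """Return size (MB), line count, word count, and character count.

    Assumption: words are whitespace-delimited tokens, and characters
    include the newline at the end of each line.
    """
    lines = words = chars = 0
    # Stream the file line by line so large files are never held in memory.
    with open(path, encoding=encoding, errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
            chars += len(line)
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return {"size_mb": size_mb, "lines": lines, "words": words, "chars": chars}
```

Calling summarize_text_file("en_US.blogs.txt") would return one dictionary per file, and summing the fields over the three files gives the Total row.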

Sampling and Cleaning Data

Due to limited computing resources, a 5% random sample was drawn from each file. The samples were then cleaned in several steps: removing numbers, punctuation, extra whitespace, and strange characters, and converting all text to lower case.
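
The sampling and cleaning steps can be sketched as follows. This is an illustrative Python version under stated assumptions, not the code used for this report. The names sample_lines and clean_line are hypothetical, the seed is arbitrary, and "strange characters" is interpreted here as any non-ASCII character.

```python
import random
import re
import string


def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction`.

    The result is approximately, not exactly, `fraction` of the input.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def clean_line(text):
    """Lower-case the text and strip numbers, punctuation, and extra spaces."""
    text = text.lower()
    # Assumption: "strange characters" means anything outside ASCII.
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

Note that removing punctuation also removes apostrophes, so a contraction such as "it's" becomes "its". Whether that is acceptable depends on the prediction model that follows.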

Analyzing documents

The frequencies of 1-gram, 2-gram, and 3-gram phrases were calculated. The following figures show the distributions of word frequencies in each n-gram model.

[Figure: distributions of word frequencies for the 1-gram, 2-gram, and 3-gram models]
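
Counting n-gram frequencies amounts to sliding a window of n tokens over each cleaned line. The following Python sketch illustrates the idea. The function name ngram_counts is hypothetical, and it assumes the lines have already been cleaned and that tokens are separated by single spaces. It is not the code that produced the figures in this report.

```python
from collections import Counter


def ngram_counts(lines, n):
    """Count n-grams (as space-joined strings) across cleaned lines.

    N-grams never span two lines, so a line with fewer than n tokens
    contributes nothing.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Usage: the most frequent phrases for each model order.
sample = ["the cat sat on the mat", "the cat ran"]
for n in (1, 2, 3):
    print(n, ngram_counts(sample, n).most_common(3))
```

The most_common method returns phrases sorted by frequency, which is the ordering needed for frequency plots of this kind.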

Further Analysis and Predictive Model Approach