Due to the explosive popularity of smartphones, many people spend a lot of time on their mobile devices for email, social networking, shopping, banking, and many other activities. Several companies have built smart keyboards that make it easier for users to type efficiently on their mobile devices. SwiftKey is one of the pioneering companies in the field of predictive text modeling. Their keyboard presents three options for what the next word might be.
The goal of this capstone project is to analyze a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model.
The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. This report presents the basic relationships in the data to help build our linguistic models.
The data provided for this capstone project is from a corpus called HC Corpora (www.corpora.heliohost.org). Only three files (shown below) are used in the exploratory analysis and model building. The following table shows their basic summaries.
File name | Size (MB) | Line Count | Word Count | Character Count |
---|---|---|---|---|
en_US.blogs.txt | 200 | 899288 | 37334131 | 210160014 |
en_US.news.txt | 196 | 1010242 | 34372529 | 205811888 |
en_US.twitter.txt | 159 | 2360148 | 30373583 | 167105331 |
Total | 555 | 4269678 | 102080243 | 583077233 |
Due to limited computing resources, a 5% random sample was drawn from each file. Several steps were then performed to clean the sampled data: removing numbers, punctuation, extra whitespace, and strange characters, and converting all text to lower case.
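The sampling and cleaning steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for this report: the function names `sample_lines` and `clean_line`, the fixed seed, and the choice to treat any non-ASCII character as a "strange character" are all assumptions made for the example.

```python
import random
import re


def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def clean_line(text):
    """Lowercase a line and strip numbers, punctuation, and non-ASCII characters."""
    text = text.lower()
    # Assumption: "strange characters" means anything outside ASCII.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace digits and punctuation with spaces (keeps only a-z and whitespace).
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    raw = ["Hello, World! 123", "Tweets  have   odd spacing..."]
    print([clean_line(line) for line in raw])
```

Note that this sketch replaces apostrophes with spaces, so contractions are split into two tokens; a real pipeline might prefer to handle them separately.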
The frequencies of 1-gram, 2-gram, and 3-gram phrases were calculated. The following figures show the distributions of word frequencies in each n-gram model.
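As a rough illustration of how such n-gram frequencies can be computed, the sketch below counts n-grams over already-cleaned lines. It assumes whitespace tokenization and does not let n-grams span line boundaries; `ngram_counts` is a hypothetical helper name, not a function from this project.

```python
from collections import Counter


def ngram_counts(lines, n):
    """Count every run of n consecutive tokens within each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # assumption: lines are already cleaned
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the cat ran"]
    for n in (1, 2, 3):
        print(n, ngram_counts(corpus, n).most_common(3))
```

Sorting each `Counter` with `most_common` gives the ranked frequency lists from which distribution plots can be drawn.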
Further experiments with data sampling will be conducted to find an adequate sample ratio for the unigram, 2-gram, 3-gram, and 4-gram models. The sampling ratio is an important factor in the trade-off between limited resources and prediction accuracy. Back-off techniques and the fusion of unigram, 2-gram, 3-gram, and 4-gram models will be explored to optimize the accuracy of our prediction application.
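To make the back-off idea concrete, here is a minimal sketch of one possible approach: look up the longest available context first and fall back to shorter contexts when nothing is found. It uses raw counts with no discounting or smoothing, so it is a simplified stand-in rather than Katz back-off or the method this project will ultimately adopt. The names `build_model` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(lines, max_n=4):
    """Map each context of 0 to max_n - 1 words to a Counter of following words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i, word in enumerate(tokens):
            for k in range(max_n):  # k = number of context words
                if i - k < 0:
                    break
                model[tuple(tokens[i - k:i])][word] += 1
    return model


def predict(model, text, max_n=4, top=3):
    """Return up to `top` next-word candidates, backing off to shorter contexts."""
    tokens = text.lower().split()
    for k in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        candidates = model.get(context)  # .get avoids creating empty entries
        if candidates:
            return [word for word, _ in candidates.most_common(top)]
    return []


if __name__ == "__main__":
    model = build_model(["i love you", "i love cats", "i love you too"])
    print(predict(model, "i love"))
    print(predict(model, "unknown love"))
```

Returning three candidates mirrors the three suggestions a keyboard such as SwiftKey presents. The empty context acts as the unigram fallback, so an unseen phrase still yields the most frequent words overall.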