Introduction

Portable office work means work done on cellphones and tablets, where typing is slow, so an input system that saves typing time is needed. This calls for a smart, efficient keyboard, and the core of such an input system is a predictive text model. This milestone report focuses on that model and covers the early stages, from data collection through exploratory analysis of the data set.

Data Collection

The data were downloaded from the course website (from HC Corpora) and unzipped to extract the English-language files as a corpus. The corpus consists of three text documents, from Twitter, blogs, and news, in which each line represents one message.

Load Data

Summary

A basic summary of the original data set is shown below:

Summary of the datasets
Dataset        Lines      Chars        Words
blogsdoc       899288     206824382    37570839
newsdoc        1010242    203223154    34494539
twittersdoc    2360148    162096241    30451170
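
As an illustration of how line, character, and word counts like those above could be produced, here is a minimal sketch in Python. It is not the code used for this report: the names `CorpusSummary` and `summarize` are hypothetical, and words are assumed to be whitespace-delimited, so the numbers it yields on the real files may differ from the table depending on how words and characters were counted originally.

```python
from typing import Iterable, NamedTuple


class CorpusSummary(NamedTuple):
    """Hypothetical container for the three summary statistics."""
    lines: int
    chars: int
    words: int


def summarize(lines: Iterable[str]) -> CorpusSummary:
    """Count lines, characters, and whitespace-delimited words.

    Assumption: the trailing newline of each line is not counted
    as a character.
    """
    n_lines = n_chars = n_words = 0
    for line in lines:
        text = line.rstrip("\n")
        n_lines += 1
        n_chars += len(text)
        n_words += len(text.split())
    return CorpusSummary(n_lines, n_chars, n_words)


# Tiny in-memory sample standing in for one of the text documents.
sample = ["Hello world\n", "This is a short message\n"]
summary = summarize(sample)
```

Because `summarize` accepts any iterable of strings, an open file handle could be passed directly, which would avoid loading a whole document into memory at once.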

Data Cleansing

The data will be filtered by

Tokenizer

Tokenization aims to remove meaningless characters and words that occur with low frequency in the corpus. The final corpus retains the high-frequency words and n-grams, which helps in exploring the relationships between words and in building a meaningful statistical model.
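
The idea of tokenizing, counting n-grams, and dropping the low-frequency ones can be sketched as follows. This is an illustrative Python example under stated assumptions, not the tokenizer used for this report: the token pattern (lowercase letters and apostrophes only), the function names `tokenize` and `ngram_counts`, and the `min_count` threshold are all hypothetical choices.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumption: a token is a run of lowercase letters or apostrophes;
# digits, punctuation, and emoticons are treated as meaningless characters.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def tokenize(line: str) -> List[str]:
    """Lowercase a line and split it into word tokens."""
    return TOKEN_PATTERN.findall(line.lower())


def ngram_counts(lines: Iterable[str], n: int, min_count: int = 1) -> Counter:
    """Count n-grams within each line, keeping those seen at least min_count times."""
    counts: Counter = Counter()
    for line in lines:
        tokens = tokenize(line)
        # Slide a window of n tokens across the line; n-grams never span lines.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


sample = ["I love data science!", "I love coffee :)", "Data science is fun."]
bigrams = ngram_counts(sample, n=2, min_count=2)
top_bigrams = bigrams.most_common(10)
```

On this small sample only the bigrams "i love" and "data science" occur twice, so they are the only ones that survive the `min_count=2` filter. Calling `most_common(10)` on the result gives the kind of top-10 list that a histogram or word cloud would be drawn from.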

Exploratory analysis

Figure 1. Histogram of n-grams (top 10)

Figure 2. Word cloud of n-grams (top 10)

Interesting Findings

Next Steps for the Prediction Application