A portable office means work done on cellphones and tablets, where typing is slow, so an input system that saves typing time is needed. Such a system requires a smart, efficient keyboard, and its core is a predictive text model. This milestone report focuses on that model and covers the first steps, from data collection to exploratory analysis of the data set.
The data were downloaded from the course website (from HC Corpora) and unzipped to extract the English data set, which serves as the corpus. It contains three text documents, drawn from Twitter, blogs, and news, in which each line represents one message.
A basic summary of the original data set is shown below:
| Dataset | Lines | Chars | Words |
|---|---|---|---|
| blogsdoc | 899288 | 206824382 | 37570839 |
| newsdoc | 1010242 | 203223154 | 34494539 |
| twittersdoc | 2360148 | 162096241 | 30451170 |
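Counts like those in the table can be reproduced with a short script. The following is a minimal sketch in Python, not the code used for this report: the function name `summarize` and the sample text are hypothetical, and it assumes that a "word" is any whitespace-separated token and that line terminators are not counted as characters. Different conventions would give slightly different numbers.

```python
import io


def summarize(lines):
    """Return (line count, character count, word count) for an iterable of text lines.

    Assumptions (not taken from the report): trailing newlines are excluded
    from the character count, and words are whitespace-separated tokens.
    """
    n_lines = n_chars = n_words = 0
    for line in lines:
        text = line.rstrip("\n")
        n_lines += 1
        n_chars += len(text)
        n_words += len(text.split())
    return n_lines, n_chars, n_words


# Tiny in-memory stand-in for one corpus file (hypothetical content).
sample = io.StringIO("Hello world\nThis is a tweet\n")
print(summarize(sample))
```

For a real file, the same function can be given an open file handle, for example `summarize(open(path, encoding="utf-8"))`, where `path` points to one of the three documents.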
The data will be filtered by
The tokenization as a whole aims to remove meaningless characters and low-frequency words from the corpus. The final corpus will retain the words and n-grams with high frequency, which will help in exploring the relationships between words and in building a meaningful statistical model.
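The idea of dropping meaningless characters and low-frequency terms can be illustrated with a small Python sketch. Everything here is an assumption made for illustration rather than the pipeline used in this report: the helper names `tokenize` and `ngram_counts`, the rule that only letters and apostrophes are kept, and the `min_count` threshold are all hypothetical choices.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and keep only runs of letters and apostrophes.

    Digits, punctuation, and emoticons are dropped (an assumed filter).
    """
    return re.findall(r"[a-z']+", line.lower())


def ngram_counts(lines, n=2, min_count=2):
    """Count n-grams within each line and keep those seen at least min_count times."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    # Discard low-frequency n-grams (threshold is a hypothetical choice).
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


# Hypothetical messages standing in for lines of the corpus.
docs = [
    "I love data science!!",
    "I love text mining :)",
    "Data science is fun 123",
]
bigrams = ngram_counts(docs, n=2, min_count=2)
print(bigrams.most_common())
```

With `n=1` the same function gives word frequencies, so one helper covers both the word-level and the n-gram-level view of the corpus. N-grams are counted within a line only, so they never span two separate messages.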