Guanglai Li
July 25, 2015
In this project we will build a model that predicts the next word after one or more words have been typed. Models of this kind are widely used for text input on mobile devices such as cell phones and tablets.
Data: we will use three text files, 'en_US.blogs.txt', 'en_US.news.txt', and 'en_US.twitter.txt', to build models for the English language.
Tools: R is the main tool, combined with Linux commands.
This milestone report summarizes the work done so far toward the final goal of the project.
The Linux command wc reports the number of lines, words, and bytes in a file directly, and it can be called from R. An example is shown below.
system("wc en_US.blogs.txt")
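To use the counts in R rather than just print them, the output of wc can be captured and parsed. The following is a minimal sketch, not the code used for this report: the helper name count_with_wc is hypothetical, and it assumes a Unix-like system where wc is on the PATH. It runs on a small temporary file so that it is self-contained.

```r
# Hypothetical helper (not from the original report): run `wc` on one file
# and return the line and word counts as a one-row data frame.
count_with_wc <- function(path) {
  # -l counts newline characters, -w counts words;
  # stdout = TRUE captures the output as a character string.
  out <- system2("wc", args = c("-l", "-w", shQuote(path)), stdout = TRUE)
  # wc prints "<lines> <words> <file name>", possibly with leading spaces.
  fields <- strsplit(trimws(out), "\\s+")[[1]]
  data.frame(file  = basename(path),
             lines = as.numeric(fields[1]),
             words = as.numeric(fields[2]))
}

# Small demonstration file: 2 lines, 6 words in total.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over"), demo_file)

counts <- count_with_wc(demo_file)
print(counts)
```

The same helper could be applied to each of the three corpus files in turn, for example with lapply, and the rows combined with rbind.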
The following table summarizes the number of lines and words in each file.
| File | Lines (millions) | Words (millions) |
|---|---|---|
| en_US.blogs.txt | 0.90 | 37.3 |
| en_US.news.txt | 1.01 | 34.3 |
| en_US.twitter.txt | 2.36 | 30.4 |
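As a rough illustration of how these counts can be compared, the sketch below computes the average number of words per line for each file. It assumes that all counts in the table are in millions; the data frame name corpus_summary and the column names are hypothetical, and the values are simply the rounded figures from the table, so the ratios are approximate.

```r
# Counts copied from the table above, assumed to be in millions.
corpus_summary <- data.frame(
  file          = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
  lines_million = c(0.90, 1.01, 2.36),
  words_million = c(37.3, 34.3, 30.4)
)

# Average words per line; the "millions" factor cancels in the ratio.
corpus_summary$words_per_line <-
  corpus_summary$words_million / corpus_summary$lines_million

print(corpus_summary)
```

On these rounded figures the Twitter file has by far the shortest lines, which is what one would expect from short messages.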