A predictive text application was developed using a corpus of English text drawn from blog, news, and Twitter sources.
Using a 5-gram dictionary paired with a 'Stupid Backoff' model, the application predicts the next word of a user-entered sentence, with a top-1 prediction rate of 11.51% and a top-3 rate of 21.31%.
The final application can be found here.
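To make the backoff idea concrete, here is a minimal Python sketch of Stupid Backoff scoring over n-gram counts. It is not the application's actual code: the function names `build_counts` and `predict` are hypothetical, and the backoff factor of 0.4 is the value suggested in the original Stupid Backoff paper, assumed here because this report does not state which factor was used. A candidate word is scored by its relative frequency after the longest matching context, and each step down to a shorter context multiplies the score by the backoff factor.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the report does not state the value used


def build_counts(tokens, max_n=5):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, context, total_tokens, max_n=5, top_k=3):
    """Rank next-word candidates by Stupid Backoff score."""
    # Keep at most max_n - 1 words of context.
    context = tuple(context[-(max_n - 1):]) if max_n > 1 else ()
    scores = {}
    penalty = 1.0
    while True:
        # Denominator: count of the context, or corpus size for unigrams.
        denom = counts.get(context, 0) if context else total_tokens
        if denom > 0:
            order = len(context) + 1
            # Linear scan for clarity; a real system would index by context.
            for gram, c in counts.items():
                if len(gram) == order and gram[:-1] == context:
                    word = gram[-1]
                    # Keep the score from the longest context that saw the word.
                    if word not in scores:
                        scores[word] = penalty * c / denom
        if not context:
            break
        context = context[1:]  # back off to a shorter context
        penalty *= ALPHA
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]
```

For example, with counts built from a tokenized corpus, `predict(counts, ["thanks", "for", "the"], total_tokens=len(tokens))` returns up to three candidate words, which corresponds to the top-3 setting reported above. Ties are broken alphabetically in this sketch.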
Below are simple descriptive statistics of the source files used.
Source   Size (MB)  Lines    Min chars/line  Mean chars/line  Max chars/line
blogs    200.42     899288   1               231.69601        40835
news     196.28     77259    2               203.00243        5760
twitter  159.36     2360148  2               68.80281         213
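As a rough illustration of how per-file statistics like these can be computed, the sketch below streams a text file once and reports its size, line count, and minimum, mean, and maximum characters per line. It is an assumed Python equivalent, not the code that produced the table; the function name `describe` and the choice of UTF-8 decoding are assumptions, and counts may differ slightly from the table depending on encoding and line-ending handling.

```python
import os


def describe(path, encoding="utf-8"):
    """Return size in MB, line count, and min/mean/max characters per line."""
    n_lines = 0
    total_chars = 0
    min_chars = None
    max_chars = 0
    # Stream the file so that large corpora need not fit in memory.
    with open(path, encoding=encoding, errors="replace") as fh:
        for line in fh:
            length = len(line.rstrip("\r\n"))  # exclude the line terminator
            n_lines += 1
            total_chars += length
            max_chars = max(max_chars, length)
            min_chars = length if min_chars is None else min(min_chars, length)
    return {
        "size_mb": os.path.getsize(path) / 2**20,
        "lines": n_lines,
        "min_chars": min_chars if min_chars is not None else 0,
        "mean_chars": total_chars / n_lines if n_lines else 0.0,
        "max_chars": max_chars,
    }
```

Calling `describe` on each of the three source files and collecting the results gives one row per source, in the same layout as the table above.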