A predictive text application was developed using a corpus of English text from blog, news, and Twitter sources.
Using a 5-gram dictionary paired with a 'Stupid Backoff' model, the application predicts the next word of a sentence entered by the user. The correct word is the top prediction 11.51% of the time and appears among the top three predictions 21.31% of the time.
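To illustrate how a Stupid Backoff lookup over an n-gram dictionary can work, here is a minimal sketch in Python. It is not the application's actual code: the function names `build_counts` and `predict`, the toy corpus, and the backoff factor of 0.4 (the value commonly quoted for Stupid Backoff) are all assumptions made for illustration. A candidate word is scored by its relative frequency after the longest matching context, and each step down to a shorter context multiplies the score by the backoff factor.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the report does not state the value used


def build_counts(sentences, max_n=5):
    """Count every n-gram of order 1..max_n in a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, context, max_n=5, top_k=3):
    """Return the top_k (word, score) pairs that may follow `context`."""
    total_unigrams = sum(c for gram, c in counts.items() if len(gram) == 1)
    context = tuple(context[-(max_n - 1):])  # keep at most max_n - 1 words
    scores = {}
    penalty = 1.0
    while True:
        k = len(context)
        # Denominator: count of the context itself, or corpus size for unigrams.
        denom = counts.get(context, 0) if k else total_unigrams
        if denom > 0:
            for gram, c in counts.items():
                if len(gram) == k + 1 and gram[:k] == context:
                    # A score from a longer context is never overwritten.
                    scores.setdefault(gram[-1], penalty * c / denom)
        if k == 0:
            break
        context = context[1:]  # back off: drop the oldest context word
        penalty *= ALPHA
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical toy corpus, only to show the call pattern.
toy = [["i", "love", "you"], ["i", "love", "cats"],
       ["i", "love", "you"], ["you", "love", "me"]]
toy_counts = build_counts(toy)
print(predict(toy_counts, ["i", "love"]))
```

The linear scan over all n-grams keeps the sketch short. A real dictionary built from a corpus of this size would more likely index n-grams by their context so that each lookup is a single hash access.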
The final application can be found here.
Below are simple descriptive statistics of the source files used.
Source    Size (MB)   Number of Lines   Min Characters   Mean Characters   Max Characters
                                        (by line)        (by line)         (by line)
blogs     200.42      899288            1                231.69601         40835
news      196.28      77259             2                203.00243         5760
twitter   159.36      2360148           2                68.80281          213
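Statistics of this kind can be gathered in a single pass over each file. The sketch below is one possible way to do it in Python, not the method used to produce the figures above: the function names `line_stats` and `file_stats` are hypothetical, and it assumes UTF-8 input and that a megabyte means 2^20 bytes.

```python
import os


def line_stats(lines):
    """Return line count and min/mean/max characters per line for an iterable of lines."""
    n = total = 0
    shortest = longest = None
    for line in lines:
        length = len(line.rstrip("\r\n"))  # ignore the line terminator
        n += 1
        total += length
        shortest = length if shortest is None else min(shortest, length)
        longest = length if longest is None else max(longest, length)
    return {
        "num_lines": n,
        "min_chars": shortest,
        "mean_chars": total / n if n else None,
        "max_chars": longest,
    }


def file_stats(path, encoding="utf-8"):
    """Line statistics plus file size in MB (assumed here to be 2**20 bytes)."""
    with open(path, encoding=encoding, errors="replace") as fh:
        stats = line_stats(fh)  # streams the file instead of loading it whole
    stats["size_mb"] = os.path.getsize(path) / 2**20
    return stats
```

Calling `file_stats` on each of the three source files would yield one row per file in the shape of the table above.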