Word predicting application
Stefan Loska
4/24/2015
Processing the data
n-gram sets obtained from http://www.ngrams.info/
for 4-grams: grouped by the initial 3-gram and sorted by frequency
up to 5 most frequent following words kept per 3-gram; dictionary: 3-gram | word set
analogous procedure for 3- and 2-grams
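The grouping step above can be sketched as follows. This is only an illustration under assumptions: the input format (a list of n-gram string and frequency pairs), the function name `build_dict`, and the toy counts are hypothetical and not taken from the original application.

```python
from collections import defaultdict

def build_dict(ngram_counts, top_k=5):
    """Map each (n-1)-word prefix to its top_k most frequent next words.

    ngram_counts: iterable of (ngram_string, frequency) pairs (assumed format).
    """
    grouped = defaultdict(list)
    for ngram, freq in ngram_counts:
        *prefix, word = ngram.split()
        grouped[" ".join(prefix)].append((freq, word))
    # Sort each group by descending frequency and keep at most top_k words.
    return {
        prefix: [w for _, w in sorted(cands, key=lambda c: -c[0])[:top_k]]
        for prefix, cands in grouped.items()
    }

# Hypothetical toy 4-gram counts, for illustration only.
four_grams = [
    ("at the end of", 120),
    ("at the end and", 3),
    ("at the end the", 7),
    ("one of the most", 90),
]
dict_3 = build_dict(four_grams)
```

Calling the same function on 3-gram and 2-gram counts would give the dictionaries keyed by 2-grams and single words.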
Algorithm
dictionary (dict) searched for the last 3-gram of the input
if >=5 words found (res_3), return them; else search dict for the last 2-gram
combine all search results and check again for >=5 words; if not, keep searching with shorter n-grams
once >=5 words collected, return the top 5
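A minimal sketch of this back-off search, under stated assumptions: the function name `predict`, the lower-casing of the input, the removal of duplicates when results are combined, and the toy dictionaries are all hypothetical. The slides do not say what happens when fewer than 5 words are found in total; here the shorter list is simply returned.

```python
def predict(text, dicts, top_k=5):
    """Back off from the longest prefix to shorter ones until top_k words are found.

    dicts: dictionaries ordered longest prefix first, e.g. [dict_3, dict_2, dict_1].
    """
    words = text.lower().split()
    results = []
    longest = len(dicts)
    for i, d in enumerate(dicts):
        k = longest - i  # prefix length used by this dictionary
        if len(words) < k:
            continue  # not enough input words for this prefix length
        prefix = " ".join(words[-k:])
        for w in d.get(prefix, []):
            if w not in results:  # assumed: drop duplicates when combining
                results.append(w)
        if len(results) >= top_k:
            break
    return results[:top_k]

# Hypothetical toy dictionaries, for illustration only.
toy_3 = {"at the end": ["of", "the", "and"]}
toy_2 = {"the end": ["of", "is", "was", "result"]}
toy_1 = {"end": ["of", "up", "the"]}
suggestions = predict("We met at the end", [toy_3, toy_2, toy_1])
```

Words found with the longer context are listed first, so they take priority in the final top 5.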
Testing accuracy
data source: http://www.corpora.heliohost.org/, downloaded at link
3 sets: Blogs, News, Twitter
1000 sentences per set, each broken at a random space character; the word following the break is predicted
criteria: the actual word is the top predicted word (top 1) or one of the 5 predicted words (top 5)
% of hits:
criteria   Blogs   News   Twitter
top 1      15%     14%    11%
top 5      30%     29%    24%
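The test procedure can be sketched roughly as below. The helper names `split_at_random_space` and `hit_rates`, the fixed random seed, and the skipping of one-word sentences are assumptions made for illustration; `predictor` stands for any function that takes the text before the break and returns a ranked list of words.

```python
import random

def split_at_random_space(sentence, rng):
    """Break a sentence at a random space; return (context, next word)."""
    words = sentence.split()
    i = rng.randrange(1, len(words))  # requires at least 2 words
    return " ".join(words[:i]), words[i]

def hit_rates(sentences, predictor, seed=0):
    """Return (top-1 hit rate, top-5 hit rate) as fractions of tested sentences."""
    rng = random.Random(seed)
    top1 = top5 = total = 0
    for s in sentences:
        if len(s.split()) < 2:
            continue  # assumed: sentences without a space are skipped
        context, actual = split_at_random_space(s, rng)
        preds = predictor(context)[:5]
        total += 1
        if preds and preds[0] == actual:
            top1 += 1
        if actual in preds:
            top5 += 1
    if total == 0:
        return 0.0, 0.0
    return top1 / total, top5 / total
```

Multiplying the returned fractions by 100 gives percentages comparable to the table above.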
On-line app
https://stefanloska.shinyapps.io/words