Veronika Nuretdinova
14.12.2014
Prediction of the next word of a phrase
The “Word prediction” program suggests the most probable words to follow an input phrase of up to 3 words.
The program is based on an analysis of 10'000 random lines from 3 different text sources: blogs, Twitter records and news.
Choosing the size of the sample text used to build the dictionary is a compromise: a larger text covers more of the frequently used words and expressions, but it also increases the memory used by the program.
The number of lines, 10'000, was chosen to stay within the application memory limit (<100MB).
The following operations were performed on the source files to build the dictionaries:
     X        Terms   count
1    21053    the     20367
2    1521     and     11913
3    17202    rare    10156
4    21049    that    5147
5    8669     for     3962
     X         Terms       count
1    118056    of the      2044
2    85345     in the      1651
3    170858    the rare    1100
4    180172    to the      936
5    119872    on the      860
     X         Terms          count
1    208024    one of the     153
2    292242    the rare of    132
3    8059      a lot of       128
4    134674    i don t        127
5    201978    of the rare    108
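As a rough illustration of how count tables like the ones above could be produced, here is a minimal Python sketch. It is not the program's actual code: the function names `tokenize` and `count_ngrams` and the sample lines are hypothetical, and the tokenization rule (lowercase, every non-letter treated as a separator) is an assumption chosen only because it agrees with entries such as "i don t".

```python
import re
from collections import Counter

def tokenize(line):
    # Assumption: lowercase the text and treat every non-letter as a
    # separator, so "I don't" becomes ["i", "don", "t"].
    return re.findall(r"[a-z]+", line.lower())

def count_ngrams(lines, n):
    """Count how often each n-word sequence occurs in the given lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        # Slide a window of n tokens over the line.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample lines, used only to show the shape of the result.
sample = ["One of the best days.", "I don't know one of the rules."]
trigrams = count_ngrams(sample, 3)
top = trigrams.most_common(1)
```

Calling `count_ngrams` with `n` set to 1, 2 and 3 gives one dictionary per phrase length, and `most_common(5)` lists the five most frequent terms of each.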
The user enters an English phrase in the text input box. The program checks that the phrase is no longer than 3 words and does not contain profanity.
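The input check could look like the following minimal Python sketch. The names `MAX_WORDS`, `BANNED_WORDS` and `validate_phrase`, the placeholder word list and the messages are all hypothetical; the original program may implement this check differently.

```python
MAX_WORDS = 3
# Placeholder entries only (assumption); a real list would be loaded
# from a profanity file.
BANNED_WORDS = {"badword", "worseword"}

def validate_phrase(phrase):
    """Return (ok, message) for a phrase typed by the user."""
    words = phrase.lower().split()
    if not words:
        return False, "Please enter a phrase."
    if len(words) > MAX_WORDS:
        return False, "Please enter at most 3 words."
    if any(word in BANNED_WORDS for word in words):
        return False, "Please avoid profanity."
    return True, ""
```

For example, `validate_phrase("the man")` passes, while a four-word phrase or a phrase containing a listed word is rejected with a message.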
Once the user enters a phrase of length k, the program looks up all phrases of length k+1 that start with the input phrase. Because the vocabulary of blogs, news and Twitter differs, the user is asked to choose the source.
Input phrase: “the man”
First 5 n-grams starting with “the man” in the blog dictionary:
          X         Terms                count
8603      288652    the man in           3
8604      288654    the man of           3
21538     288665    the man who          2
295009    288644    the man at           1
295010    288645    the man confessed    1
Then the program selects the phrases with the maximum number of occurrences and returns their last words as the most probable next words.
next word
1 in
2 of
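The lookup and selection steps for the “the man” example can be sketched in Python as follows. This is an illustration under assumptions, not the program's own code: the function `predict_next` is hypothetical, and the dictionary holds only the five blog n-grams listed above.

```python
def predict_next(ngram_counts, phrase):
    """Return the last words of the most frequent (k+1)-grams that
    start with the k-word input phrase."""
    prefix = phrase.lower().split()
    k = len(prefix)
    candidates = {}
    for ngram, count in ngram_counts.items():
        words = ngram.split()
        # Keep only phrases of length k+1 that begin with the input phrase.
        if len(words) == k + 1 and words[:k] == prefix:
            candidates[words[-1]] = count
    if not candidates:
        return []
    best = max(candidates.values())
    # Several words can share the maximum count, so return all of them.
    return [word for word, count in candidates.items() if count == best]

# Counts copied from the blog dictionary excerpt for "the man".
blog_trigrams = {
    "the man in": 3,
    "the man of": 3,
    "the man who": 2,
    "the man at": 1,
    "the man confessed": 1,
}

print(predict_next(blog_trigrams, "the man"))  # ['in', 'of']
```

Both “in” and “of” are returned because they share the highest count of 3, which matches the two-row result above.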