Word prediction

Veronika Nuretdinova
14.12.2014

Prediction of the next word of a phrase

Word prediction: program description

The “Word prediction” program suggests the most probable words to follow an input phrase of up to 3 words.
The program is built on an analysis of 10'000 random lines from 3 different text sources: blogs, Twitter records and news.
Choosing the size of the sample used to build the dictionary is a compromise: a larger sample covers more of the frequently used words and expressions, but it also increases the memory used by the program.
The number of lines, 10'000, was chosen to keep the application within its memory limit (<100MB).
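
As an illustration of the sampling step, here is a minimal sketch in Python (not the R code behind this program). The function name `sample_lines`, the fixed seed and the toy corpus are assumptions made for the example; only the sample size of 10'000 comes from the description above.

```python
import random

def sample_lines(lines, k, seed=0):
    """Return k lines chosen uniformly at random, without replacement.

    If the source has k lines or fewer, all of them are returned.
    """
    lines = list(lines)
    if len(lines) <= k:
        return lines
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(lines, k)

# Toy corpus standing in for one of the source files (assumption).
corpus = [f"line {i}" for i in range(50_000)]
sample = sample_lines(corpus, 10_000)
print(len(sample))
```

A smaller `k` lowers memory use at the cost of coverage, which is the trade-off described above.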

Building n-gram dictionaries

The following operations were performed on the source files to build the dictionaries:

Profanity has been removed, using a list of common profanity words.
Symbols have been removed.
Words that appear only once in the text are tagged “rare”. This serves two purposes: it keeps the dictionary compact, and when a user enters a rare word it is also tagged “rare”, so the phrase containing “rare” still has a probability greater than 0.
N-gram statistics have been built, i.e. the frequencies of phrases 1, 2, 3 and 4 words long. R provides dedicated packages for text analysis: the “tm” and “RWeka” packages allow automatic n-gram analysis.
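
The steps above can be sketched as follows. This is a plain-Python illustration, not the “tm”/“RWeka” pipeline used to build the real dictionaries. The names `tokenize` and `build_ngram_counts`, the placeholder profanity list and the three sample lines are assumptions. Replacing symbols with spaces is also an assumption, chosen because it matches entries such as “i don t” in the tables.

```python
import re
from collections import Counter

# Hypothetical placeholder; the real program uses a list of common profanity words.
PROFANITY = {"darn"}

def tokenize(line):
    """Lowercase, replace every non-letter run with a space, split, drop profanity."""
    words = re.sub(r"[^a-z]+", " ", line.lower()).split()
    return [w for w in words if w not in PROFANITY]

def build_ngram_counts(lines, max_n=4):
    """Return {n: Counter of n-gram strings} for n = 1..max_n.

    Words seen only once in the whole sample are replaced by the tag "rare"
    before the n-grams are counted.
    """
    token_lines = [tokenize(line) for line in lines]
    word_counts = Counter(w for toks in token_lines for w in toks)
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for toks in token_lines:
        toks = [w if word_counts[w] > 1 else "rare" for w in toks]
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[n][" ".join(toks[i:i + n])] += 1
    return counts

# Toy input (assumption), far smaller than the 10'000-line sample.
lines = ["The man in the hat.", "The man of the hour!", "I don't know the man."]
counts = build_ngram_counts(lines)
print(counts[2].most_common(3))
```

In such a tiny sample most words occur once, so “rare” dominates the counts; in the real sample it ranks third among single words.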

Top n-grams in Blog sample texts

      X Terms count
1 21053   the 20367
2  1521   and 11913
3 17202  rare 10156
4 21049  that  5147
5  8669   for  3962

       X    Terms count
1 118056   of the  2044
2  85345   in the  1651
3 170858 the rare  1100
4 180172   to the   936
5 119872   on the   860

       X       Terms count
1 208024  one of the   153
2 292242 the rare of   132
3   8059    a lot of   128
4 134674     i don t   127
5 201978 of the rare   108

How the application works

The user enters an English phrase into the text input box. The program checks that the phrase is no longer than 3 words and does not include profanity.
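
A minimal sketch of these two input checks, written in Python for illustration. The function name `validate_phrase`, the error messages and the placeholder profanity list are assumptions; only the 3-word limit and the profanity check come from the description above.

```python
import re

MAX_WORDS = 3
# Hypothetical placeholder; the real program uses a list of common profanity words.
PROFANITY = {"darn"}

def validate_phrase(phrase):
    """Return (ok, message) for a phrase typed by the user."""
    # Same cleaning as assumed for the dictionaries: lowercase, letters only.
    words = re.sub(r"[^a-z]+", " ", phrase.lower()).split()
    if not words:
        return False, "Please enter a phrase."
    if len(words) > MAX_WORDS:
        return False, f"Please enter at most {MAX_WORDS} words."
    if any(w in PROFANITY for w in words):
        return False, "Please avoid profanity."
    return True, ""

print(validate_phrase("the man"))
print(validate_phrase("one of the best"))
```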

Once the user enters a phrase of length k, the program looks up all phrases of length k+1 that start with the input phrase. Because the vocabulary of blogs, news and Twitter differs, the user is asked to choose the source.
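
The lookup can be sketched as a prefix match over the (k+1)-gram table of the chosen source. This is an illustrative Python version, not the program's own code. The name `candidates` and the layout of `dictionaries` are assumptions, and the entries are toy data: the blog counts for “the man in”, “the man of” and “the man who” are taken from the example in this document, the rest is invented to keep the example small.

```python
def candidates(ngram_counts, phrase):
    """Return (ngram, count) pairs of length k+1 that start with the k-word phrase."""
    words = phrase.lower().split()
    k = len(words)
    # The trailing space makes the match stop at a word boundary,
    # so "the man" does not match "the mango is".
    prefix = " ".join(words) + " "
    table = ngram_counts.get(k + 1, {})
    return [(gram, count) for gram, count in table.items() if gram.startswith(prefix)]

# One table per source, keyed by n-gram length (toy data, see above).
dictionaries = {
    "blogs": {3: {"the man in": 3, "the man of": 3, "the man who": 2, "the mango is": 1}},
    "news": {3: {"the man said": 4}},
}

source = "blogs"  # the source chosen by the user
print(candidates(dictionaries[source], "the man"))
```

Keeping a separate table per source means the same input phrase can produce different suggestions for blogs, news and Twitter.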

Example

Input phrase: “the man”. The first 5 n-grams that start with “the man” in the blog dictionary:

            X             Terms count
8603   288652        the man in     3
8604   288654        the man of     3
21538  288665       the man who     2
295009 288644        the man at     1
295010 288645 the man confessed     1

The program then selects the phrases with the highest count and returns their last words as the most probable next words.

  next word
1        in
2        of
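
The selection step for this example can be sketched in Python as follows. The function name `next_words` is an assumption; the n-grams and counts are the five rows shown in the table above.

```python
def next_words(matches):
    """Return the last word of every n-gram tied for the highest count."""
    if not matches:
        return []
    best = max(count for _, count in matches)
    return [gram.split()[-1] for gram, count in matches if count == best]

# Rows from the "the man" example in the blog dictionary.
matches = [
    ("the man in", 3),
    ("the man of", 3),
    ("the man who", 2),
    ("the man at", 1),
    ("the man confessed", 1),
]
print(next_words(matches))  # ['in', 'of']
```

Ties are kept rather than broken, which is why the example returns two words.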