Introduction

Main goal of the project was to create application for next word prediction. Data for the project are huge word corpuses that came from different sources: tweets, posts in blogs and news. Main area of application for such kind of application can be in mobile phones for prediction of next word in texting applicattions. There are many approaches for text modelling and word prediction and for this project modified Kneser-Ney algorithm was used.

Model

Application uses n-gram language model for next word prediction. In order to get probabilities of the next word appearance after certain sequence of words Kneser-Ney algorithm was used. According to this algorithm raw text was divided into series of n-grams: bigrams, trigrams, fourgrams and fivegrams. For each group of n-grams probability of appearance of certain type of n-gram was calculated with regard to probability of lower n-gram.

Application

Application works on simple algorithm. After typing some text in text field and pressing “Next word” button it does simple processing of entered text: convert all letters to lowercase, remove all punctuation and all words except last four. After that it takes these words and looks for the same fourgram with maximum probability in previously prepared table. If there is no fourgram it takes last three words and looks for trigram with maximum probability. If there is no trigram it looks for bigram or for the last word in the end.

Strengths and Enhancements

Kneser Ney algorithm and n-gram models of language have several strengths and weakneses. It’s good for prediction of frequently used word combination and quite fast for prediction. However this approaches is quite inaccurate for real language prediction because it can understand only relations between last 4-5 words while in real language words can be connected through several sentences. Much better results can be achieved with neural nets, particularly nets with LSTM (long short term memory) structure, but this approach requires huge computational powers for training of such nets.