Ng Bo Lin
08 September 2017
The idea behind developing an application to predict the next word lies in its added efficiency: we spend less time typing because the application predicts the most probable next word(s) based on what the user has entered previously.
The prediction algorithm behind the application merges two conventional language models, the Stupid Backoff Algorithm and the Kneser-Ney Smoothing Algorithm.
In order to build something which was fast yet accurate, I opted for a mixed language model, making use of the speed that the Stupid Backoff Algorithm is known for, as well as the accuracy that the Kneser-Ney smoothing algorithm is known for.
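For reference, the two models are usually written as follows. These are the textbook forms (Stupid Backoff as given by Brants et al., 2007, and interpolated Kneser-Ney for bigrams), not necessarily the exact variants used in the application. Here c(·) is an n-gram count, N is the total number of tokens, α is the back-off factor (0.4 is the value suggested in that paper), and d is a discount between 0 and 1.

```latex
% Stupid Backoff: relative frequency if the n-gram was seen, otherwise a
% discounted score from the shorter context (scores, not true probabilities).
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{c(w_{i-k+1}^{i})}{c(w_{i-k+1}^{i-1})} & \text{if } c(w_{i-k+1}^{i}) > 0 \\[1ex]
\alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{c(w_i)}{N}

% Interpolated Kneser-Ney (bigram case): discounted count plus a weighted
% continuation probability.
P_{KN}(w_i \mid w_{i-1}) =
\frac{\max\bigl(c(w_{i-1} w_i) - d,\, 0\bigr)}{c(w_{i-1})}
+ \lambda(w_{i-1}) \, P_{cont}(w_i)

\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \, \bigl|\{ w : c(w_{i-1} w) > 0 \}\bigr|
\qquad
P_{cont}(w_i) = \frac{\bigl|\{ w' : c(w' w_i) > 0 \}\bigr|}{\bigl|\{ (w', w'') : c(w' w'') > 0 \}\bigr|}
```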
When the user enters a text input (n) of length 1 (e.g. I, You, We):
For text inputs of length 2 or more (e.g. You are, We have, It is):
The prediction algorithm first tries to predict the next word by finding the most common n-gram, conditional on the first n-1 words provided, using a Kneser-Ney Smoothing Algorithm.
If no such n-grams are found, it takes the last n-2 words of the text input ('am not') and attempts to find the most common (n-1)-gram whose first n-2 words match them.
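The back-off steps above can be sketched in a few lines. This is a minimal illustration in Python, not the author's R code: it ranks continuations by raw counts rather than Kneser-Ney smoothed probabilities, and the function names `build_ngram_counts` and `predict_next`, the `max_order` parameter, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_ngram_counts(tokens, max_order=3):
    """Count every n-gram of order 1..max_order, keyed by context tuple."""
    counts = defaultdict(Counter)
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            # The last token is the predicted word; the rest is the context.
            *context, word = tokens[i:i + order]
            counts[tuple(context)][word] += 1
    return counts


def predict_next(counts, words, max_order=3):
    """Return (word, context_length), backing off to shorter contexts.

    Simplification: candidates are ranked by raw count, not by a
    Kneser-Ney smoothed probability.
    """
    context = tuple(words[-(max_order - 1):]) if max_order > 1 else ()
    while True:
        followers = counts.get(context)
        if followers:
            return followers.most_common(1)[0][0], len(context)
        if not context:
            return None, 0  # nothing was seen at all, not even unigrams
        context = context[1:]  # drop the earliest word and retry


# Toy corpus, made up for illustration only.
tokens = "i am not sure . i am not sure . you are not here .".split()
counts = build_ngram_counts(tokens)
print(predict_next(counts, ["i", "am", "not"]))      # ('sure', 2)
print(predict_next(counts, ["we", "were", "not"]))   # ('sure', 1)
```

In the second call the context 'were not' was never seen, so the sketch falls back to 'not' alone, which mirrors the step from n-grams to (n-1)-grams described above.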
You can find more information on the methodology as well as the source code written in R here.
To use the application, all you have to do is type into the text box ('You' is set by default). The input is then cleaned and analysed to find the most probable next terms. The prediction algorithm then outputs the most likely word and the probability of it occurring, along with other probable words that fit the input.
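As a rough illustration of the clean-then-rank flow described above, the sketch below lowercases the input, keeps only alphabetic tokens, and converts candidate counts into relative frequencies. The cleaning rules, the names `clean_input` and `rank_candidates`, and the follower counts are assumptions for the example; the application's actual cleaning steps and probability estimates may differ.

```python
import re
from collections import Counter


def clean_input(text):
    """Lowercase the text and keep only alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def rank_candidates(followers, top_k=3):
    """Turn follower counts into (word, relative frequency) pairs, best first."""
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(top_k)]


# Hypothetical follower counts for some context, for illustration only.
followers = Counter({"welcome": 6, "not": 3, "the": 1})
print(clean_input("You ARE, 2 late!"))        # ['you', 'are', 'late']
print(rank_candidates(followers, top_k=2))    # [('welcome', 0.6), ('not', 0.3)]
```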
To access the web application, you can click here.
To test the performance of my word prediction model, I took 3 different tweets (see Tweet 1, Tweet 2, Tweet 3) from my favourite tweeter of all time and tried to see if my word prediction model could predict the next word in his tweet.
The prediction algorithm predicted the next word correctly for 1 of the 3 tweets (Tweet 3), an accuracy of 33.3%. You can find out more here, under the real examples segment.