16/08/2015
The goal of the Coursera Data Science Capstone project is to create a Shiny application for next-word prediction.
The basis of the project was a corpus from [www.corpora.heliohost.org] made up of blog posts, news articles and tweets.
From this dataset we were to create an algorithm to predict the next word of a given sentence or phrase.
The model was built using an N-gram language model approach. The Markov assumption was used to limit the context to a fixed window - i.e. the model looks at most at the last three words of the sentence, so the longest N-grams it uses are quadgrams.
A simplified back-off approach was used. First the model checks for full matches of the last 3 words and selects the most frequently occurring quadgram starting with those 3 words. If no quadgram contains the phrase, the algorithm backs off to tri-grams using the last 2 words. If there are still no matches, it backs off to bi-grams using the last word, and if there are still no matches it reverts to the most common uni-gram (“the”).
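The back-off steps above can be sketched in a few lines. This is a minimal illustration in Python and not the code behind the app: the function names `build_tables` and `predict_next`, the toy corpus and the whitespace tokenisation are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_tables(tokens, max_order=4):
    """Count, for every context of 0 to max_order - 1 words, the words that follow it."""
    tables = defaultdict(Counter)
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[tuple(context)][word] += 1
    return tables


def predict_next(phrase, tables, max_context=3):
    """Try the last 3 words, then 2, then 1, then fall back to the top uni-gram."""
    words = phrase.lower().split()
    for k in range(min(max_context, len(words)), -1, -1):
        # k == 0 gives the empty context, i.e. plain uni-gram counts
        context = tuple(words[len(words) - k:])
        candidates = tables.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the tables are empty


# Hypothetical toy corpus, standing in for the blog/news/tweet data
tokens = "the cat sat on the mat and the cat ran to the door".split()
tables = build_tables(tokens)

print(predict_next("sat on the", tables))         # quadgram match: mat
print(predict_next("unseen words here", tables))  # no match at any level: the
```

No discounting or smoothing is applied here: the first level that has any match wins outright, which mirrors the simplified approach described above.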
The application is simple to operate - it has a single input in the side panel where the user enters their phrase. The main panel contains two outputs: the first echoes back the user's input, and the second shows their phrase with the predicted word added. The app can be found at [https://sleol.shinyapps.io/NextWordApp].
Due to time constraints the algorithm developed was kept simple. The following are ideas that I would like to explore in the future to improve the application: