Alan C. Bonnici
March 2015
The application stored at https://chribonn.shinyapps.io/dsscapstone-003/ allows a person to type a phrase and have a number of suggestions returned. The user can control the number of words that the application suggests.
When the user hits the submit button, the string is loaded into the prediction engine.
The dictionary is built on data from a corpus called HC Corpora. The English-language data consists of blog, news and Twitter text. An initial exploration of the source text can be found in the online Text Exploration Report.
Computer processing resources, time, application response time and the limits imposed by shinyapps.io all dictated how large the final dictionary could be. After many trials, the optimal size was determined to be a dictionary based on a random sample of 3000 sentences (1000 from each source) processed into 8 tokens.
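The sampling step described above can be sketched as follows. This is a minimal Python illustration, not the code behind the application: the function name `sample_sentences`, the `seed` value and the placeholder sentences are all assumptions made for the example, and only the figure of 1000 sentences per source comes from the text.

```python
import random

def sample_sentences(sources, per_source=1000, seed=2015):
    """Draw up to `per_source` sentences at random from each source.

    `sources` maps a source name to a list of sentences. The seed is a
    hypothetical choice that only makes the sample reproducible.
    """
    rng = random.Random(seed)
    sample = []
    for name, sentences in sources.items():
        # Never ask for more sentences than the source contains.
        k = min(per_source, len(sentences))
        sample.extend(rng.sample(sentences, k))
    return sample

# Placeholder data standing in for the blog, news and Twitter files.
sources = {
    "blogs": [f"blog sentence {i}" for i in range(5000)],
    "news": [f"news sentence {i}" for i in range(5000)],
    "twitter": [f"twitter sentence {i}" for i in range(5000)],
}

training = sample_sentences(sources)
```

Sampling the same number of sentences from each source keeps any one source from dominating the dictionary, which matters when the total is as small as 3000.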
This solution is statistical rather than rule-based. Instead of attempting to build a complex rule-based engine that defines mechanistically what should come next, the solution analyses existing conversations and, from just 3000 sentences, attempts to provide an auto-prediction function.
Information on this approach can be found in the section "NLP using machine learning" in Wikipedia.
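To make the statistical idea concrete, here is a minimal Python sketch of one common way to predict the next word from counted word sequences. The text does not describe the application's internals, so treat this as an illustration under assumptions: the names `build_model` and `suggest`, the `max_context` parameter and the back-off from longer to shorter contexts are all hypothetical. The `how_many` parameter mirrors the user-controlled number of suggestions.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_context=3):
    """Count which word follows each context of 1..max_context words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            for n in range(1, max_context + 1):
                if i - n < 0:
                    break  # not enough preceding words for this context length
                context = tuple(words[i - n:i])
                model[context][words[i]] += 1
    return model

def suggest(model, phrase, how_many=3, max_context=3):
    """Return up to `how_many` next-word suggestions for `phrase`.

    The longest matching context is tried first; shorter contexts are
    used only to fill any remaining slots.
    """
    words = phrase.lower().split()
    suggestions = []
    for n in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-n:])
        for word, _count in model.get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == how_many:
                return suggestions
    return suggestions

# Toy training data (invented for the example).
sentences = [
    "I want to go home",
    "I want to go out",
    "I want to sleep",
    "we want to go home",
]
model = build_model(sentences)
print(suggest(model, "I want to", how_many=2))  # ['go', 'sleep']
```

Nothing in the sketch is specific to English: the model only counts which words follow which, so a different language needs only different training sentences.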
The promise behind such a model is that, given a population-representative database written in any language, the prediction algorithm should be able to predict in that language.