Paul van der Kooy
17 December 2016
The objective of the project is to develop a text prediction application like those used on mobile phones and in apps such as WhatsApp.
Exploratory data analysis
Use different sources (Twitter, News & Blogs) to find common text sequences
The approach is described in the following article: http://rpubs.com/paulkooy/229890
Build a text prediction model
A model based on N-grams was developed and tested to find the optimal settings for performance. Words are predicted using tables of frequently used word combinations (N-grams), where N is the number of words in a sequence. Preference is given to the highest-order N-gram that matches the entered text and, within that order, to the word with the highest frequency in a representative set of sample texts (the corpus).
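The lookup described above can be sketched as follows. This is a minimal illustration in Python, not the project's own code (the linked repository is written in R); the function names build_ngram_tables and predict_next, the toy corpus, and the simple "fall back to a shorter N-gram when the longer one is unseen" rule are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_ngram_tables(corpus, max_n=3):
    """For each n in 2..max_n, map a context of n-1 words to next-word counts."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in corpus:
        words = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                next_word = words[i + n - 1]
                tables[n][context][next_word] += 1
    return tables


def predict_next(tables, text):
    """Return the most frequent next word for the longest matching context."""
    words = text.lower().split()
    # Try the highest-order table first, then fall back to shorter contexts.
    for n in sorted(tables, reverse=True):
        if len(words) < n - 1:
            continue
        context = tuple(words[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # no table contains a matching context


# Hypothetical toy corpus, only for illustration.
corpus = [
    "a lot of food",
    "a lot of food here",
    "a lot of different things",
    "i love you",
    "i love you",
    "i love it",
]
tables = build_ngram_tables(corpus, max_n=3)
print(predict_next(tables, "there is a lot of"))  # food
print(predict_next(tables, "because i love"))     # you
```

A real corpus would need tokenisation, cleaning and pruning of rare N-grams, which this sketch leaves out.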
Build a text prediction application on the Web
The Web-based application makes the model available through a public user interface, which suggests the next word for the entered text and maintains a prediction score.
Use the following link to start the end product: https://paulkooy.shinyapps.io/capstoneNLP/
Code for this work can be found on GitHub: https://github.com/paulkooy/Data-science-capstone
Further improvements were considered but not implemented, either because the results were not significant or because of time pressure to deliver the product in the allocated time:
agrep or Levenshtein distance to overcome small differences in spelling caused by typos or grammar
Most drawbacks of the N-gram method are related to its statistical nature and to the missing context of the provided text.
"lot of food" versus "lot of different"
"Often I go running, because I love ???"
"love you" before "love it"
predictScore = The percentage of correctly predicted words
maxNgrams = The highest level of N-grams used for the test
nGramsUsed = The percentage of the N-grams data used for this test
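One way to compute a score like predictScore is sketched below in Python. The source only defines the metric as the percentage of correctly predicted words, so the evaluation procedure here (predict each word from the words before it, skipping the first word of each sentence) is an assumption, as are the names predict_score and toy_predict and the test sentences. maxNgrams and nGramsUsed would be settings of the model under test and are not modelled.

```python
def predict_score(predict, sentences):
    """Percentage of words predicted correctly from the words preceding them."""
    correct = 0
    total = 0
    for sentence in sentences:
        words = sentence.lower().split()
        # The first word has no preceding text, so scoring starts at the second.
        for i in range(1, len(words)):
            total += 1
            if predict(" ".join(words[:i])) == words[i]:
                correct += 1
    return 100.0 * correct / total if total else 0.0


def toy_predict(text):
    """Hypothetical stand-in predictor: looks only at the last word."""
    lookup = {"i": "love", "love": "you", "lot": "of"}
    return lookup.get(text.split()[-1])


print(predict_score(toy_predict, ["I love you", "I love it"]))  # 75.0
```

In this toy run three of the four scored words are predicted correctly; "love it" is missed because the stand-in predictor always proposes "you" after "love".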