Barbara M
January 2018
Word prediction is a useful tool for text entry, e.g. when typing on a mobile phone. The data for building the model were downloaded from http://www.corpora.heliohost.org/aboutcorpus.html
The cmscu package was used to build 1- to 4-gram models with a training dataset of 50% of the combined Twitter, news and blogs text files, containing a total of 2.1 million text messages. The cmscu package (count-min sketch with conservative update) enables building very large and rapidly queried n-gram models that would normally be too large for R's memory. It creates hash-table indexes for the n-grams which do not store the words themselves, which makes it memory efficient. The package is not on CRAN, but can be installed using devtools::install_github("jasonkdavis/r-cmscu", subdir = "cmscu"). See the tutorial at http://davevinson.com/cmscu-tutorial.html
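To illustrate the idea behind a count-min sketch with conservative update, here is a minimal toy version. It is written in Python purely as an illustration and is not the cmscu package or its API; the class name, the table sizes and the use of SHA-1 for hashing are all assumptions made for this sketch.

```python
import hashlib


class CountMinSketchCU:
    """Toy count-min sketch with conservative update (illustrative only)."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function; no n-gram text is stored.
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, ngram):
        # Derive one column index per row by salting the hash with the row number.
        cols = []
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{ngram}".encode("utf-8")).digest()
            cols.append(int.from_bytes(digest[:8], "big") % self.width)
        return cols

    def query(self, ngram):
        # The estimate is the smallest counter the n-gram maps to.
        return min(self.rows[r][c] for r, c in enumerate(self._cells(ngram)))

    def add(self, ngram, count=1):
        cols = self._cells(ngram)
        target = min(self.rows[r][c] for r, c in enumerate(cols)) + count
        # Conservative update: raise a counter only if it is below the new estimate.
        for r, c in enumerate(cols):
            if self.rows[r][c] < target:
                self.rows[r][c] = target


# Hypothetical example phrases, not taken from the training data.
sketch = CountMinSketchCU()
for phrase in ["thanks for the follow", "thanks for the follow", "thanks for the help"]:
    sketch.add(phrase)

print(sketch.query("thanks for the follow"))
print(sketch.query("thanks for the help"))
```

Because only counters are kept, memory use is fixed by the table size rather than by the number of distinct n-grams. The trade-off is that hash collisions can make a count too high, but never too low.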
The method for text cleaning and n-gram building was adapted from the code in Dave Vinson's cmscu tutorial.
Stopwords were not removed as they are likely to be present in the input phrase.
A separate 1-gram dictionary was created to supply candidate completion words for the phrase. This dictionary was additionally cleaned with hunspell and had stopwords removed, to give a cleaner set of words for selecting the next word. It contains 144K words.
More information about hunspell can be found at https://cran.r-project.org/web/packages/hunspell/hunspell.pdf
The model itself is very simple. The last 3 words of the input phrase are prepended to each term in the 1-gram dictionary to create a set of 4-word phrases. A cmscu query function is then used to find which of these phrases has the highest count in the 4-gram model. If no match is found using the last 3 words, the model falls back to the last 2 words and checks the 3-gram model.
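The lookup described above can be sketched as follows. This is a Python illustration under stated assumptions, not the app's R code: the function name predict_next is hypothetical, plain Counter objects stand in for the cmscu 4-gram and 3-gram models, and the toy counts and dictionary are invented for the example.

```python
from collections import Counter


def predict_next(phrase, dictionary, counts_by_order):
    """Return the dictionary word that most often follows the phrase, or None."""
    words = phrase.lower().split()
    # Try the last 3 words against the 4-gram counts, then the last 2
    # words against the 3-gram counts.
    for context_len in (3, 2):
        if len(words) < context_len:
            continue
        context = " ".join(words[-context_len:])
        counts = counts_by_order[context_len + 1]
        # Append every candidate word to the context and look up its count.
        scored = {w: counts[f"{context} {w}"] for w in dictionary}
        best = max(scored, key=scored.get)
        if scored[best] > 0:
            return best
    return None


# Hypothetical toy counts standing in for the 4-gram and 3-gram models.
counts_by_order = {
    4: Counter({"thanks for the follow": 5, "thanks for the help": 2}),
    3: Counter({"for the follow": 5, "for the help": 2, "of the day": 7}),
}
dictionary = ["follow", "help", "day"]

print(predict_next("Thanks for the", dictionary, counts_by_order))   # follow
print(predict_next("the word of the", dictionary, counts_by_order))  # day
```

The second call finds no 4-gram match for "word of the", so it falls back to "of the" in the 3-gram counts. Scoring every dictionary term on each query is simple, and it relies on the count lookups being fast.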
Model Performance: I attempted to run the benchmark.R program but could not get it to work. Instead, I measured the accuracy of my model on the course quizzes. In Quiz 2 the model achieved 40% correct answers (questions 1, 2, 3 and 7). In Quiz 3 the model achieved 30% correct answers (questions 5, 6 and 8).
This suggests an average accuracy of about 35% across the two quizzes.
The user enters an input phrase, which is subjected to the same cleaning process that was used in building the n-grams. The user then clicks the “Submit” button. At least 3 words must be entered.
The app takes up to the last 3 words of the input phrase (an n-gram), appends each dictionary term in turn, and looks up how many times the resulting phrase occurs in the (n+1)-gram model. It returns the dictionary term with the maximum count.
The App can be found at: https://moloneb.shinyapps.io/CapstoneProject2/