T Fanselow
26 June 2019
This project addresses the problem of word prediction: Guessing which word will come next in a sequence.
Word prediction systems facilitate quicker and easier typing, and hence improve user productivity and application accessibility.
For example, see SwiftKey's predictive keyboard on smartphones: https://www.microsoft.com/en-us/swiftkey
Algorithmically, the problem is addressed by analysing corpora of natural language text, and capturing statistical relationships between common words and phrases.
The language model is represented as an n-gram tree. For example, given the corpus:
A big dog
A big cat
The following tree would be created for use in prediction:
Node          Level   Frequency
-------------------------------
|-a             1         2
| |-big         2         2
| | |-dog       3         1
| | |-cat       3         1
|-big           1         2
| |-dog         2         1
| |-cat         2         1
|-dog           1         1
|-cat           1         1
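For illustration, a tree of this shape could be built with the data.tree package roughly as sketched below. This is a minimal sketch, not the project's actual code; the function name, parameters, and the choice to lowercase tokens are assumptions. Every suffix of each sentence contributes one path, truncated at a maximum n-gram length:

# Minimal sketch (not the project's actual code): build an n-gram count tree
# with data.tree. Every suffix of each sentence contributes one path,
# truncated at max_n words, matching the layout of the example above.
library(data.tree)

build_ngram_tree <- function(sentences, max_n = 3) {
  root <- Node$new("root")
  for (sentence in sentences) {
    words <- tolower(strsplit(sentence, "\\s+")[[1]])
    for (start in seq_along(words)) {
      node <- root
      last <- min(start + max_n - 1, length(words))
      for (word in words[start:last]) {
        child <- node$children[[word]]
        if (is.null(child)) {
          child <- node$AddChild(word)
          child$frequency <- 0
        }
        child$frequency <- child$frequency + 1
        node <- child
      }
    }
  }
  root
}

ngram_tree <- build_ngram_tree(c("A big dog", "A big cat"))
print(ngram_tree, "frequency")

Printing the result with the frequency field reproduces the Node/Level/Frequency view shown above.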
The prototype prediction app is hosted on shinyapps.io:
https://tim-fan.shinyapps.io/word_prediction/
Try it out!
Type a few words in the text input box, and hit enter. A prediction for the next word will be displayed.
The page also shows the matched sequence (n-gram) from the user input, and a view of which words follow that sequence in the prediction tree.
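One way such a lookup could work (an illustrative sketch using the build_ngram_tree output above, not the app's actual source) is to match the longest available suffix of the input against the tree, backing off to shorter suffixes until a node with children is found, and then return the most frequent continuation:

# Illustrative back-off lookup (assumed behaviour, not the app's actual code):
# match the longest suffix of the input found in the tree, then return the
# most frequent word that follows it.
predict_next_word <- function(tree, input, max_context = 2) {
  words <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  if (length(words) == 0) return(NA_character_)
  for (n in seq(min(max_context, length(words)), 1)) {
    node <- tree
    for (word in tail(words, n)) {
      node <- node$children[[word]]
      if (is.null(node)) break
    }
    if (!is.null(node) && !node$isLeaf) {
      freqs <- sapply(node$children, function(child) child$frequency)
      return(names(which.max(freqs)))
    }
  }
  NA_character_
}

predict_next_word(ngram_tree, "a big")  # returns "dog" (ties broken by child order)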
The predictor uses a tree data structure, as this was expected to provide compact storage of overlapping n-gram prefixes and fast lookup of matching sequences.
In practice, the tree size, and hence the n-gram length and predictive accuracy, was limited by the memory allowance on the shinyapps server (1 GB), combined with the fairly high memory usage of the data.tree library (https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html#memory).
The model currently hosted on shinyapps showed predictive accuracy of 8% on a held-out test set of 2,000 tweets.
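As a rough illustration of how such a held-out check could be run (an assumption, not the project's documented procedure), each test tweet's final word can be predicted from its preceding words and compared against the truth:

# Rough sketch of a held-out accuracy check (assumed procedure, reusing the
# predict_next_word sketch above): predict each tweet's final word from the
# preceding words and count exact matches.
evaluate_accuracy <- function(tree, tweets) {
  hits <- 0
  for (tweet in tweets) {
    words <- tolower(strsplit(trimws(tweet), "\\s+")[[1]])
    if (length(words) < 2) next
    context <- paste(head(words, -1), collapse = " ")
    predicted <- predict_next_word(tree, context)
    if (!is.na(predicted) && predicted == tail(words, 1)) hits <- hits + 1
  }
  hits / length(tweets)
}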
Future work will be directed towards more efficient use of memory, in order to make predictions based on a much more extensive language model.
For full source code, see https://github.com/tim-fan/coursera_datascience_capstone