Lyle McMillin
9/29/16
Overall objective of this project: build a Shiny app that takes a phrase (multiple words) from a text box and outputs a prediction of the next word.
To build this project we were provided data sets in four languages (Russian, Finnish, German and English). As English is my native language, only the English datasets (Twitter, blogs, news) were used for this project.
Statistics for the English datasets are as follows:
The datasets were downloaded from the following link: Capstone Dataset
My algorithm uses a set of bigram, trigram and 4-gram data frames that were built with the quanteda package from 100% of the data.
The n-gram tables were then filtered to keep only entries with a frequency higher than 4 (i.e. 5 or greater).
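The counting and filtering step can be sketched roughly as follows. This is an illustration in Python, not the R/quanteda code used for the app; the function names `count_ngrams` and `filter_by_frequency` and the toy corpus are assumptions made up for the example.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def filter_by_frequency(counts, min_count=5):
    """Keep only n-grams seen at least min_count times (frequency higher than 4)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus, invented for illustration; not the capstone data.
tokens = ("thanks for the follow " * 5 + "thanks for the help " * 2).split()

# Trigram table after dropping everything seen fewer than 5 times.
trigrams = filter_by_frequency(count_ngrams(tokens, 3))
```

Dropping the rare n-grams is what keeps the lookup tables small: in this toy corpus, the trigram "for the help" occurs only twice and is filtered out, while "thanks for the" survives.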
The algorithm in the Shiny app takes the last three words entered and tries to match them against a 4-gram.
If that fails, it steps down and takes the highest-frequency trigram match against the last two words.
If that fails, it steps down again and looks for a bigram match against the last word in the string.
If that also fails, the default word “the” (the most common unigram) is returned.
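The step-down logic described above can be sketched as follows. This is a minimal Python illustration, not the app's R code. The function name `predict_next`, the lookup-table layout (context words mapped to next-word counts), the whitespace tokenization and the toy counts are all assumptions made for the example.

```python
def predict_next(phrase, fourgrams, trigrams, bigrams, default="the"):
    """Return the most frequent next word, stepping down from 4-grams to bigrams."""
    words = phrase.split()  # assumption: simple whitespace tokenization
    # Try the longest context first: 3 words (4-gram), then 2 (trigram), then 1 (bigram).
    for context_len, table in ((3, fourgrams), (2, trigrams), (1, bigrams)):
        if len(words) < context_len:
            continue  # not enough words typed for this n-gram order
        candidates = table.get(tuple(words[-context_len:]))
        if candidates:
            # Highest-frequency continuation for this context.
            return max(candidates, key=candidates.get)
    return default  # nothing matched: fall back to the most common unigram

# Toy lookup tables, invented for illustration: context -> {next word: count}.
fourgrams = {("thanks", "for", "the"): {"follow": 9, "help": 6}}
trigrams = {("one", "of"): {"those": 12, "my": 7}}
bigrams = {("happy",): {"birthday": 20, "new": 8}}

print(predict_next("thanks for the", fourgrams, trigrams, bigrams))  # 4-gram match
print(predict_next("she is one of", fourgrams, trigrams, bigrams))   # trigram step-down
print(predict_next("so happy", fourgrams, trigrams, bigrams))        # bigram step-down
print(predict_next("xyzzy", fourgrams, trigrams, bigrams))           # default word
```

Keying each table by its context words makes every step a single dictionary lookup, so at most three lookups are needed before falling back to the default.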
To make the application run as fast as possible, I implemented two design ideas that were critical to its success:
In tests, the algorithm returned a result in under 4 milliseconds on average. Since the algorithm is executed twice for each input (once for the single-word result and once for the data table), the average total execution time should be under 10 milliseconds for most uses.