Kristen Wedel
9-12-2017
The data used in this project comes from an archived copy of the corpus website: https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html
Files include:
75,000 samples were taken from the combined data set.
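As a rough illustration of the sampling step, the sketch below draws 75,000 lines uniformly at random, without replacement, from a combined list of lines. The function name `sample_lines`, the fixed seed, and the toy data are assumptions made for the example; the report does not state how the sample was actually drawn.

```python
import random

def sample_lines(lines, k=75_000, seed=42):
    """Return up to k lines drawn uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sample is reproducible
    if len(lines) <= k:
        return list(lines)     # fewer lines than requested: keep them all
    return rng.sample(lines, k)

# Toy stand-in for the combined data set (hypothetical content).
combined = [f"line {i}" for i in range(200_000)]
subset = sample_lines(combined)
print(len(subset))
```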
Cleaning the data consisted of:
The data was then converted to term-document matrices for n-gram models of order 1 through 5 (unigrams through 5-grams). When a sixth (6-gram) model was added, prediction speed decreased.
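To make the n-gram step concrete, here is a minimal sketch that counts n-grams of order 1 through 5 over a toy corpus. It uses plain count tables rather than term-document matrices, and the whitespace tokenizer, the `ngram_counts` helper, and the example sentences are all assumptions for illustration, not the project's actual code.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every run of n consecutive tokens across the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus (hypothetical); the real project used the sampled, cleaned data.
corpus = ["the cat sat on the mat", "the cat ate"]

# One count table per model order, 1 through 5.
tables = {n: ngram_counts(corpus, n) for n in range(1, 6)}
print(tables[2][("the", "cat")])  # 2
```

Each additional order adds another table to build and search, which is one plausible reason a sixth model would slow things down.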
A backoff method was then used to predict the next word. The model first looks for a match in the 5-gram model; if none is found, it backs off to the 4-gram model, then the 3-gram, then the 2-gram, and finally the 1-gram model.
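The backoff order described above can be sketched as follows. This is a simplified version under stated assumptions: the helpers `build_tables` and `predict_next` are hypothetical names, tokenization is a plain whitespace split, and returning the single most frequent continuation is a choice made for the example, since the report does not say how candidates are ranked.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=5):
    """Map n -> {context of n-1 words -> Counter of the words that follow}."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenizer (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(text, tables, max_n=5):
    """Try the highest-order model first, backing off until a match is found."""
    tokens = text.lower().split()
    for n in range(max_n, 0, -1):
        if n - 1 > len(tokens):
            continue  # not enough preceding words for this order
        context = tuple(tokens[len(tokens) - (n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            # Most frequent continuation (ranking rule is an assumption).
            return candidates.most_common(1)[0][0]
    return None  # nothing matched, e.g. the tables are empty

# Toy corpus (hypothetical).
corpus = ["the cat sat on the mat", "the cat ate the fish", "the cat sat down"]
tables = build_tables(corpus)
print(predict_next("the cat", tables))    # matched by the 3-gram model
print(predict_next("a dog ate", tables))  # backs off to the 2-gram model
print(predict_next("zebra", tables))      # backs off to the 1-gram model
```

With an unseen word such as "zebra", every higher-order lookup fails and the 1-gram model supplies the overall most frequent word.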
The application is located at: https://kristywedel1.shinyapps.io/TextPred/
Please note: The application may take a minute to load. Future enhancements will focus primarily on improving the application's response time.