Aliza Shoop
July 11, 2018
This project focused on building a library of text from which a word prediction app can make educated guesses. This project involved the following sources:
– tm, dplyr, stringr, stylo
Breakdown of the lines for each dataset
Summary of the top 3-grams in combined sampled data
The final model uses a backoff approach: it matches the last words of the input string against entries in an n-gram library and returns the next word with the highest frequency. If no match is found among the longest n-grams (6-grams), the function moves to the 5-gram library, and so on down to the 2-gram library, until a match is found. Accuracy is higher with a larger dataset; however, in the interest of saving time and space, a smaller dataset is used for the app. There are times when the app is unable to find a prediction, most likely because of insufficient data in the n-gram libraries. The next step would be to add more data and drop the n-gram entries with very low frequencies.
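The backoff lookup described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's own code (the project relied on the R packages listed earlier). The function names, the toy corpus, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_ngram_tables(lines, max_n=6):
    """For n = 2..max_n, map each (n-1)-word prefix to counts of the next word."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        words = line.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                prefix = tuple(words[i:i + n - 1])
                tables[n][prefix][words[i + n - 1]] += 1
    return tables

def predict_next(text, tables, max_n=6):
    """Try the longest n-gram first, then back off to shorter ones."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # input too short to form this prefix
        prefix = tuple(words[-(n - 1):])
        candidates = tables[n].get(prefix)
        if candidates:
            # Highest-frequency continuation for this prefix
            return candidates.most_common(1)[0][0]
    return None  # no match in any library

def prune(tables, min_count=2):
    """Drop continuations seen fewer than min_count times (hypothetical threshold)."""
    pruned = {}
    for n, table in tables.items():
        pruned[n] = {}
        for prefix, counts in table.items():
            kept = Counter({w: c for w, c in counts.items() if c >= min_count})
            if kept:
                pruned[n][prefix] = kept
    return pruned

# Toy corpus, invented for this example
corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
    "see you at the game",
]
tables = build_ngram_tables(corpus)
print(predict_next("thanks for the", tables))   # matched in the 4-gram table
print(predict_next("meet me at the", tables))   # backs off to the 3-gram table
print(predict_next("hello", tables))            # no match anywhere
```

In this sketch a missing prediction shows up as `None`, which corresponds to the cases where the app found no match. Pruning with `prune` shrinks the tables but also makes such misses more likely on a small corpus, so it is best paired with adding more data.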
Resources