Matt Dancho
2016-12-27
Predicting the next word in a phrase…
tm, RWeka, and multidplyr.
Two models developed:
1. Simple n-gram backoff and pick highest frequency: Less accuracy, but no internal computation.
2. Stupid backoff with n-gram scoring: Better accuracy, but requires internal computation.
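To make the "internal computation" of the second model concrete, here is a minimal Python sketch of stupid backoff scoring. It is not the author's R code. The helper names (`count_ngrams`, `stupid_backoff_score`, `rank_candidates`), the toy sentence, and the penalty of 0.4 (the value commonly used with this method) are assumptions for illustration; the slides do not say which penalty was used.

```python
from collections import Counter

ALPHA = 0.4  # back-off penalty; assumed here, not stated in the slides


def count_ngrams(tokens, max_n):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff_score(word, context, counts, total_unigrams):
    """Score `word` following `context` (a tuple of preceding words).

    Use the relative frequency at the longest context that was seen with
    `word`; each time the context is shortened, multiply by ALPHA.
    """
    penalty = 1.0
    context = tuple(context)
    while context:
        seen = counts.get(context + (word,), 0)
        if seen > 0:
            return penalty * seen / counts[context]
        context = context[1:]  # drop the oldest word and back off
        penalty *= ALPHA
    # No context left: fall back to the unigram relative frequency.
    return penalty * counts.get((word,), 0) / total_unigrams


def rank_candidates(context, counts, total_unigrams, top_k=3):
    """Score every known word after `context` and return the top_k."""
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    scored = [(w, stupid_backoff_score(w, context, counts, total_unigrams))
              for w in vocab]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat the cat ran".split()
counts = count_ngrams(tokens, max_n=3)
print(rank_candidates(("the",), counts, len(tokens)))
```

Every candidate word has to be scored at prediction time, which is the extra cost the second model pays for its better accuracy.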
Model 1 was selected based on the best combination of accuracy and speed.
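A minimal Python sketch of how a simple backoff model of this kind can work: the next words for each context are ranked by frequency ahead of time, so prediction is only a table lookup that shortens the context until a match is found. This is not the author's R implementation. The names `build_lookup` and `predict`, the toy sentence, and the defaults `max_n=4` and `top_k=3` are assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_lookup(tokens, max_n=4, top_k=3):
    """Map each context (1..max_n-1 words) to its most frequent next words."""
    followers = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, next_word = tokens[i:i + n]
            followers[tuple(context)][next_word] += 1
    # Keep only the top_k next words per context, highest frequency first.
    return {ctx: [w for w, _ in ctr.most_common(top_k)]
            for ctx, ctr in followers.items()}


def predict(phrase, lookup, max_n=4):
    """Try the longest usable context first, then back off word by word."""
    words = phrase.lower().split()
    for size in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-size:])
        if context in lookup:
            return lookup[context]
    return []  # nothing matched, even at the bigram level


# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat the cat ran on the mat".split()
lookup = build_lookup(tokens)
print(predict("sat on the", lookup))  # matched by a 4-gram
print(predict("dog on the", lookup))  # backs off to the 3-gram "on the ..."
```

Because the ranking is done when the table is built, nothing is scored at prediction time, which is where the speed advantage comes from.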
1000 randomly sampled n-grams from a holdout set were tested. The final model could analyze 1000 samples in about 11 seconds. The overall accuracy was 12.8%. The model tended to perform best on 4-grams (15.3% accuracy) versus 2-grams (9.3% accuracy).
   user  system elapsed
   0.72    0.13   10.89
# A tibble: 3 × 4
      n samples correct   acc
  <dbl>   <int>   <dbl> <dbl>
1     2     323      30   9.3
2     3     369      51  13.8
3     4     308      47  15.3
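The accuracy figures follow directly from the sample and correct counts in the table. A short Python check, using only those reported counts (the variable and function names are illustrative):

```python
# Counts copied from the results table: n-gram order, samples, correct.
results = [
    {"n": 2, "samples": 323, "correct": 30},
    {"n": 3, "samples": 369, "correct": 51},
    {"n": 4, "samples": 308, "correct": 47},
]


def accuracy_pct(correct, samples):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / samples, 1)


per_order = {r["n"]: accuracy_pct(r["correct"], r["samples"]) for r in results}
total_samples = sum(r["samples"] for r in results)
total_correct = sum(r["correct"] for r in results)
overall = accuracy_pct(total_correct, total_samples)

print(per_order)  # {2: 9.3, 3: 13.8, 4: 15.3}
print(total_correct, total_samples, overall)  # 128 1000 12.8
```

The overall 12.8% is 128 correct predictions out of the 1000 sampled n-grams.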
Enter a word or phrase in the prediction field
Watch as the top predictions are presented in terms of n-gram frequency