Leo Yang
2018-06-16
We mostly use Hadley Wickham's tidyverse together with the tidytext package for tokenization, filtering, and aggregation. For pairwise word correlation within a sentence, we use pairwise_cor from the widyr package. The tm package is used for word stemming.
Due to limited computer memory, we sample only about 40% of the complete data for modeling. Ngrams are kept only if they consist of letters and apostrophes (') and occur above a minimum frequency; the frequency threshold varies with the size of the model. For pairwise correlation, we stem the words with tm::stemDocument to reduce the word space.
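The sampling and filtering described above can be sketched in a few lines. This is a dependency-free Python illustration, not the R code behind the report: the function names, the regular expression, the random seed, and the `min_count` threshold are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Assumed pattern: words made of letters and apostrophes, separated by single spaces.
NGRAM_PATTERN = re.compile(r"^[a-z']+( [a-z']+)*$")

def sample_lines(lines, fraction=0.4, seed=42):
    """Keep each line independently with probability `fraction` (about 40% here)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def ngrams(text, n):
    """Lowercase the text, split on whitespace, and return space-joined n-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def filtered_counts(lines, n, min_count):
    """Count n-grams, keeping those that match the pattern and occur at least min_count times."""
    counts = Counter(g for line in lines for g in ngrams(line, n))
    return {g: c for g, c in counts.items()
            if c >= min_count and NGRAM_PATTERN.match(g)}

# Hypothetical toy corpus; the real data set is far larger and is sampled first.
corpus = ["I don't know", "I don't care", "call me at 5pm", "I don't know"]
bigram_counts = filtered_counts(corpus, n=2, min_count=2)
```

On the full data set, `sample_lines` would be applied before counting, and `min_count` would be raised for larger models to keep the tables within memory.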
NGram Models:
We use tidytext::unnest_tokens for ngram tokenization. For each of the three ngram models, we apply the filters described in the data preprocessing step to remove undesired ngrams, then count the remaining ngrams and normalize the counts into prediction probabilities. The probabilities from the three models are combined as a weighted sum, using the weights shown below:
| ngrams | weight |
|---|---|
| bigram | 1 |
| trigram | 5 |
| fourgram | 10 |
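To make the weighting concrete, here is a minimal Python sketch of how the three models could be combined. It is an illustration under stated assumptions rather than the report's R implementation: it assumes each model's probability is the count of a next word divided by the total count for its context, and the names `train`, `predict`, and `WEIGHTS` are hypothetical.

```python
from collections import Counter, defaultdict

# Weights from the table above, keyed by ngram order (2 = bigram, 3 = trigram, 4 = fourgram).
WEIGHTS = {2: 1, 3: 5, 4: 10}

def train(sentences, orders=(2, 3, 4)):
    """For each order n, map an (n-1)-word context to a Counter of next words."""
    models = {n: defaultdict(Counter) for n in orders}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in orders:
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                models[n][context][words[i + n - 1]] += 1
    return models

def predict(models, phrase, top=10):
    """Rank candidate next words by the weighted sum of per-model probabilities."""
    words = phrase.lower().split()
    scores = Counter()
    for n, by_context in models.items():
        if len(words) < n - 1:
            continue  # phrase is too short to form this model's context
        followers = by_context.get(tuple(words[-(n - 1):]))
        if not followers:
            continue  # context never seen by this model
        total = sum(followers.values())
        for word, count in followers.items():
            # Assumed normalization: probability conditional on the context.
            scores[word] += WEIGHTS[n] * count / total
    return scores.most_common(top)

# Hypothetical toy corpus for illustration only.
toy = ["thank you so much", "thank you so much", "thank you for coming", "love you so"]
models = train(toy)
suggestions = predict(models, "thank you")
```

Because longer contexts carry larger weights, a word supported by the fourgram model outranks one supported only by the bigram model, while the bigram model still supplies candidates when the longer contexts were never observed.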
Pairwise Correlation Model:
To calculate meaningful word correlations within a sentence, we go through the following steps: 1) tokenize the text into sentences; 2) remove stop words using the tidytext::stop_words collection; 3) stem the words with tm::stemDocument; 4) calculate the pairwise correlation using widyr::pairwise_cor.
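The four steps above can be sketched as follows. This is a pure-Python stand-in for illustration, not the tidytext / tm / widyr pipeline itself. It assumes the correlation is the phi coefficient computed on whether each word appears in each sentence, uses a tiny hypothetical stop-word list, and leaves stemming as a pluggable `stem` function (identity by default) instead of reproducing tm::stemDocument.

```python
import math
import re
from itertools import combinations

STOP_WORDS = {"the", "a", "is", "and", "on", "in"}  # tiny stand-in, not tidytext::stop_words

def sentence_word_sets(text, stem=lambda w: w):
    """Split into sentences, drop stop words, stem, and return one word set per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    return [{stem(w) for w in re.findall(r"[a-z']+", s) if w not in STOP_WORDS}
            for s in sentences]

def phi(word_sets, x, y):
    """Phi coefficient between the presence of x and the presence of y across sentences."""
    n = len(word_sets)
    n_xy = sum(1 for s in word_sets if x in s and y in s)
    n_x = sum(1 for s in word_sets if x in s)
    n_y = sum(1 for s in word_sets if y in s)
    denom = math.sqrt(n_x * (n - n_x) * n_y * (n - n_y))
    return (n * n_xy - n_x * n_y) / denom if denom else 0.0

def pairwise_phi(word_sets):
    """Return all word pairs with their phi coefficient, highest correlation first."""
    vocab = sorted(set().union(*word_sets))
    pairs = {(x, y): phi(word_sets, x, y) for x, y in combinations(vocab, 2)}
    return sorted(pairs.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical example text.
sets = sentence_word_sets("The cat sat on the mat. The cat is asleep. The dog sat.")
ranked = pairwise_phi(sets)
```

Words that tend to share a sentence get a positive score and words that never co-occur get a negative one, which is what lets the correlation model suggest words related to the input phrase even when they are not adjacent to it.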
The app contains four sections, mostly as shown on the left: an input field for entering the phrase; a predict button that triggers the model; and an output section that shows the next-word prediction, i.e. the word with the highest probability, together with the top 10 suggestions and their corresponding probabilities. The model calculation typically takes from less than a second to a few seconds.