Trained on a large corpus of English language text, this highly accurate text prediction app takes just 2.25 seconds to load and then responds within 50 milliseconds of user input to suggest the next word.
The app is at https://katbailey.shinyapps.io/nlp_project/
As you type, each time you hit the space bar the app takes up to 4 of the previously entered words and uses them to predict the next one.
Contractions: words like “don't” and “shouldn't” are stored in the model without their apostrophes, so the app puts the apostrophe back before outputting them as suggestions.
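The apostrophe step could be as simple as a lookup table. The sketch below is only an illustration: the `contractions` table and the `restore_apostrophe` function are hypothetical names, and the app's actual mechanism may differ.

```r
# Hypothetical lookup from apostrophe-free model tokens to display forms.
# The entries here are illustrative, not the app's actual list.
contractions <- c(dont = "don't", shouldnt = "shouldn't", isnt = "isn't")

restore_apostrophe <- function(word) {
  # Look each word up by name; unname() drops the lookup keys
  fixed <- unname(contractions[word])
  # Words that are not in the table come back as NA, so keep them unchanged
  ifelse(is.na(fixed), word, fixed)
}

restore_apostrophe(c("dont", "the", "shouldnt"))
```

A fixed table keeps this step cheap, which matters when every suggestion passes through it.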
If it has no prediction based on 4 words, it backs off to 3, then 2, then 1. As a last resort it will predict the word “the”.
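The back-off behaviour can be sketched in a few lines of R. This is a minimal illustration under assumptions, not the app's code: the `models` list, its toy entries, the `predict_next` function and the lower-casing of input are all hypothetical.

```r
# Hypothetical lookup tables, one per context length (1 to 4 words).
# Each maps a space-separated context to its most likely next word.
models <- list(
  c("the" = "first", "of" = "the"),
  c("one of" = "the", "thanks for" = "the"),
  c("one of the" = "most"),
  c("is one of the" = "best")
)

predict_next <- function(text, models, fallback = "the") {
  # Lower-case the input and split it on runs of whitespace
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Try the longest available context first, then back off one word at a time
  max_n <- min(length(words), length(models))
  for (n in rev(seq_len(max_n))) {
    context <- paste(tail(words, n), collapse = " ")
    hit <- models[[n]][context]
    if (!is.na(hit)) return(unname(hit))
  }
  # No context matched at any length: fall back to the last-resort word
  fallback
}

predict_next("This is one of the", models)  # matched on 4 words
predict_next("She was one of the", models)  # backs off to 3 words
predict_next("zebra quux", models)          # nothing matches: last resort
```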
The prediction model was built on a corpus of English language texts from blogs, news stories and tweets. N-gram frequencies were extracted and then some decisions had to be made to establish a balance between prediction accuracy and speed.
Ten models were created: one for each n-gram order from unigrams up to 5-grams, each built both with and without filtering. The filtering removed n-grams of very low frequency to reduce object size and, for 4-grams and 5-grams, also removed those starting with the definite or indefinite article. Every model except the unigram model backed off to the next model down.
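As a rough illustration of the two filters described above, the sketch below uses a toy table of 4-gram counts. The `filter_ngrams` function, its argument names and the frequency threshold of 2 are assumptions made for the example; the thresholds actually used are not stated here.

```r
# Toy 4-gram counts; the real tables came from the training corpus
ngrams <- data.frame(
  ngram = c("the end of the", "at the end of", "a lot of people",
            "thanks for the follow", "one of the most"),
  freq = c(40, 35, 12, 1, 25),
  stringsAsFactors = FALSE
)

filter_ngrams <- function(ngrams, min_freq = 2, drop_article_start = FALSE) {
  # Keep only n-grams seen at least min_freq times (threshold is an assumption)
  keep <- ngrams$freq >= min_freq
  if (drop_article_start) {
    # Drop n-grams whose first word is "the", "a" or "an"
    keep <- keep & !grepl("^(the|a|an) ", ngrams$ngram)
  }
  ngrams[keep, , drop = FALSE]
}

filtered <- filter_ngrams(ngrams, min_freq = 2, drop_article_start = TRUE)
filtered$ngram
```

On the toy data this keeps "at the end of" and "one of the most": one row is dropped for low frequency and two for starting with an article.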
Perplexity: a function was written to calculate the perplexity of each model on a test set that had been held out from the original corpus, i.e. was not part of the training corpus.
The 10 models were evaluated based on this perplexity calculation. Because perplexity is inversely related to the probabilities the model assigns to words, lower perplexity means the model assigned a higher probability to the words in the test set.
Filtering to remove infrequent n-grams did affect perplexity, but increasing the n-gram order made up for this, and the file size of the final models was greatly reduced.
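For reference, the standard definition of perplexity is the exponential of the average negative log-probability of the test words. The sketch below implements that textbook formula; the `perplexity` function is hypothetical and is not necessarily how the original evaluation function was written.

```r
perplexity <- function(probs) {
  # probs: the probability the model assigned to each word of the test set.
  # Perplexity = exp(-mean(log p)), the inverse geometric mean of the probabilities.
  exp(-mean(log(probs)))
}

# A model that assigns probability 0.25 to every test word has perplexity 4
# (up to floating point): it is as uncertain as a uniform choice among 4 words.
perplexity(c(0.25, 0.25, 0.25, 0.25))

# Higher probabilities on the test words give lower perplexity
perplexity(c(0.5, 0.5)) < perplexity(c(0.1, 0.1))
```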
Given that the app had to predict just a single word given the input, it was possible to greatly reduce the model data needed. For a given n-gram, we just needed the single most probable next word.
library(dplyr)
ngrams$w1 <- prev_words # Context: the n-gram string with its last word removed
ngrams$w2 <- last_word # Last word only
# Sort by descending frequency; arrange() ignores grouping by default, so each w1 group ends up most frequent first
ngrams <- ngrams %>% group_by(w1) %>% arrange(desc(freq))
# For each context, keep the top `cutoff` next words (cutoff is set elsewhere; 1 keeps only the best)
ngram_list <- tapply(ngrams$w2, as.factor(ngrams$w1), function(x) {
  n <- min(cutoff, length(x))
  as.character(x)[1:n]
})
This snippet shows how, for a given set of n-grams (say 4-grams), we grouped them by the context words (in this case the 3 previous words) and ordered them by frequency. We then took the first “last word” in each group and used it as the prediction for those previous words.