pguillemi
9/11/2022
Writing takes time
Predicting words saves time
– People usually write using a large but finite set of expressions
– A large corpus of texts can be broken down into short, manageable chunks of consecutive words (n-grams)
– The most common n-grams can be used for prediction: given the first n words, we can identify the few words that usually come next
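The idea in the bullets above can be sketched in a few lines. This is a minimal Python illustration only (the app itself is built in R); the toy corpus and the names `build_model` and `predict` are hypothetical and not taken from the app.

```python
from collections import Counter, defaultdict

def build_model(text, n=2):
    """Count which word follows each run of n consecutive words."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for i in range(len(words) - n):
        prefix = tuple(words[i:i + n])
        followers[prefix][words[i + n]] += 1
    return followers

def predict(followers, prefix_words, k=3):
    """Return up to k words most often seen after the given prefix."""
    key = tuple(w.lower() for w in prefix_words)
    counts = followers.get(key, Counter())
    return [word for word, _ in counts.most_common(k)]

# Toy corpus, invented for illustration.
toy_corpus = "thanks for the follow thanks for the help thanks for the follow"
model = build_model(toy_corpus, n=2)
suggestions = predict(model, ["for", "the"])
print(suggestions)
```

With this toy corpus, "follow" is seen twice after "for the" and "help" once, so "follow" is suggested first.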
About the app
This app aims at speed without sacrificing accuracy
– Thorough input data pre-processing
– Simple yet effective scoring system and prediction functions
– App with neat interface and two modes to select from
The text corpus consisted of ~70 M words from Twitter, blogs and news
– A custom function was created to capture 95% of the most used words in each source
– Only words that appeared in all three sources were kept, which yields a dictionary of ~10,000 words
– Profanity was removed
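A rough Python sketch of this dictionary-building step follows (the app's own custom function is written in R and is not shown in these slides). It assumes that "95%" means the most frequent words whose counts add up to 95% of all word occurrences in a source; that reading, the toy data, the 0.8 coverage used in the demo and the names `top_coverage_words` and `build_dictionary` are all assumptions.

```python
from collections import Counter

def top_coverage_words(tokens, coverage=0.95):
    """Most frequent words whose counts add up to `coverage` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    kept, running = set(), 0
    for word, count in counts.most_common():
        if running >= coverage * total:
            break
        kept.add(word)
        running += count
    return kept

def build_dictionary(sources, banned, coverage=0.95):
    """Keep words that pass the coverage cut in every source, minus banned ones."""
    per_source = [top_coverage_words(t, coverage) for t in sources.values()]
    shared = set.intersection(*per_source)
    return shared - banned

# Toy sources and a toy banned list, invented for illustration.
sources = {
    "twitter": "the the the cat cat darn darn lol".split(),
    "blogs": "the the cat cat cat darn darn however".split(),
    "news": "the the the the cat darn darn officials".split(),
}
dictionary = build_dictionary(sources, banned={"darn"}, coverage=0.8)
print(sorted(dictionary))
```

Rare, source-specific words ("lol", "however", "officials") fall below the coverage cut, and the banned word is dropped after the intersection.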
Extensive n-gram creation… filtered
– The whole corpus was used to create prediction n-grams with custom functions and file parsing, using tidytext and the tidyverse packages
– 2- to 6-word sequences were extracted and split into n-gram ~ result combos: the leading 1 to 5 words (n-gram) and the word that follows them (result)
– Only those that appeared more than once were kept
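The extraction and filtering steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the app's tidytext-based R code; the function names and the toy sentence are hypothetical.

```python
from collections import Counter

def ngram_result_combos(tokens, min_n=2, max_n=6):
    """Count (leading words, following word) pairs for every window size."""
    counts = Counter()
    for size in range(min_n, max_n + 1):
        for i in range(len(tokens) - size + 1):
            window = tokens[i:i + size]
            # The last word is the "result"; the words before it are the n-gram.
            counts[(tuple(window[:-1]), window[-1])] += 1
    return counts

def keep_repeated(counts):
    """Keep only combos seen more than once."""
    return {combo: c for combo, c in counts.items() if c > 1}

# Toy token stream, invented for illustration.
tokens = "i want to go i want to stay".split()
repeated = keep_repeated(ngram_result_combos(tokens))
for (prefix, result), count in repeated.items():
    print(" ".join(prefix), "~", result, count)
```

Dropping combos that appear only once removes most of the table, since the majority of long word sequences in any corpus are unique.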
Simple scoring - robust prediction
– Inside the functions, words were replaced by numbers to reduce data file sizes
– Each n-gram ~ result combo was given a unique numerical score based on the number of words in the n-gram (1 to 5), its frequency in the corpus (a proportion between 0 and 1) and a 1-point deduction for common stop-words
– The prediction function uses the data.table package’s fast joins and filters, then adds up the scores for each result to provide the most probable next words
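One possible reading of the scoring and prediction steps is sketched below in Python. The slides do not give the exact formula, so several things here are assumptions: the score is taken as n-gram length plus frequency proportion, minus 1 when the result is a stop-word; the proportion is the combo count divided by the total count of all combos; and a dictionary lookup stands in for data.table's keyed joins. All names, counts and the stop-word list are hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of"}  # assumption: tiny illustrative list

def encode_vocab(combos):
    """Assign each distinct word an integer id, in order of first appearance."""
    ids = {}
    for prefix, result in combos:
        for word in (*prefix, result):
            ids.setdefault(word, len(ids))
    return ids

def score_table(combos, ids):
    """Map encoded prefix -> {encoded result: score}."""
    total = sum(combos.values())
    table = defaultdict(dict)
    for (prefix, result), count in combos.items():
        score = len(prefix) + count / total   # assumed formula
        if result in STOP_WORDS:
            score -= 1                        # stop-word deduction
        table[tuple(ids[w] for w in prefix)][ids[result]] = score
    return table

def predict(text, table, ids, k=3, max_prefix=5):
    """Add up scores per result over every matching suffix of the input."""
    words = [ids.get(w, -1) for w in text.lower().split()]  # -1 = unknown word
    totals = defaultdict(float)
    for size in range(1, min(max_prefix, len(words)) + 1):
        for result, score in table.get(tuple(words[-size:]), {}).items():
            totals[result] += score
    by_id = {i: w for w, i in ids.items()}
    ranked = sorted(totals, key=totals.get, reverse=True)
    return [by_id[r] for r in ranked[:k]]

# Toy counts, invented for illustration.
combos = {
    (("want",), "to"): 4,
    (("want",), "more"): 2,
    (("i", "want"), "more"): 1,
    (("i", "want"), "to"): 3,
}
ids = encode_vocab(combos)
table = score_table(combos, ids)
print(predict("I want", table, ids))
```

In this toy table "to" is more frequent than "more", but the stop-word deduction pushes it down, so "more" ranks first. Because the n-gram length dominates the frequency term, a match on a longer n-gram always outweighs a match on a shorter one for the same kind of result.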
A Shiny app was created, with two modes
Trigger mode: a text input is entered, prediction runs after pressing the “Go!” button
Real-time mode: text input is evaluated in real time, and predicts in two ways
– When the last character is a space, the full string is used to predict a whole word
– When the last characters are letters or numbers, they are treated as the beginning of the next word, and the results, ordered by score, are updated accordingly
– A short debounce had to be added to avoid overflowing server memory. That limitation isn’t needed when running the app locally
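The two real-time behaviours and the debounce can be sketched as below. This is a Python illustration only: in the app the logic lives in Shiny reactives, and how the author implemented the debounce is not stated in these slides. The names `realtime_suggestions`, `Debouncer` and `toy_predict_next` are hypothetical, and the predictor is a stub that returns candidates already ordered by score.

```python
def realtime_suggestions(text, predict_next, k=3):
    """Whole-word prediction after a space, prefix completion otherwise."""
    if not text.strip():
        return []
    if text.endswith(" "):
        return predict_next(text.strip())[:k]
    # Split off the partially typed word and predict from what precedes it.
    context, _, partial = text.rpartition(" ")
    candidates = predict_next(context.strip())
    return [w for w in candidates if w.startswith(partial.lower())][:k]

class Debouncer:
    """Let an input through only after it has been stable for `wait` seconds."""
    def __init__(self, wait):
        self.wait = wait
        self.last_change = None

    def changed(self, now):
        self.last_change = now

    def ready(self, now):
        return self.last_change is not None and now - self.last_change >= self.wait

def toy_predict_next(context):
    """Stub predictor: candidates already ordered by score (invented data)."""
    table = {"i want": ["to", "more", "that", "this"]}
    return table.get(context.lower(), [])

whole_word = realtime_suggestions("I want ", toy_predict_next)
completion = realtime_suggestions("I want t", toy_predict_next)
print(whole_word, completion)

debouncer = Debouncer(wait=0.5)
debouncer.changed(now=0.0)   # first keystroke
debouncer.changed(now=0.3)   # another keystroke resets the wait
print(debouncer.ready(now=0.6), debouncer.ready(now=0.9))
```

The debouncer runs the prediction only once typing has paused, which is what keeps a hosted server from recomputing on every keystroke.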
Press predicted words to step through predictions
– Predictions are shown as buttons, which can be pressed to update the input directly!
Easy text input reset
– Just press the reset button to clear the text input
App: https://pguillemi.shinyapps.io/Word_predictor_pguillemi
Milestone report: https://rpubs.com/pguillemi/957851
(Note: this report analysed 90% of the most common words; afterwards, the app’s dictionary was upgraded to cover 95% of the most common words)
Questions and comments: pablo.guillemi@gmail.com