PERFECTLY CROMULENT TECHNOLOGIES

Totally Accurate Mindreader!

David J. Bauer
2019-09-29

Note:
PERFECTLY CROMULENT TECHNOLOGIES is not a real company and the text prediction tool is neither a mindreader nor totally accurate. Apologies for the misrepresentations; please enjoy the presentation and the Shiny tool.

OVERVIEW

I developed a predictive text Shiny tool that suggests the next word based on user input. Such tools are routinely employed in cell phone messaging applications, website search fields, e-mail clients, word processors, and similar environments. The development process involved the following steps:

  1. Construct and tokenize a corpus of English-language text to identify common words and short phrases.
  2. Clean the tokens and organize them into a data frame.
  3. Construct a Shiny app that accepts and cleans user text input.
  4. Compare this input with the data frame of tokens and output suggested words.

PREDICTION ALGORITHM DETAILS

About 3.5 million samples of English text sourced from blogs, Twitter, and news sites were used to generate tokens with an emphasis on “ngrams”: short phrases of 2 to 5 words.
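
To illustrate the tokenization step, ngrams can be generated from a cleaned sentence with a few lines of base R. This is a minimal sketch, not the production code; the function name make_ngrams is illustrative:

# Split a cleaned sentence into words, then slide an n-word window
# across the vector and paste each window into a single ngram string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("the rest of the day", 2)
# [1] "the rest" "rest of"  "of the"   "the day"
make_ngrams("the rest of the day", 5)
# [1] "the rest of the day"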

I compiled all of the ngrams into a data frame organized such that I can filter the data on the beginning word(s) and thereby identify the last word. The predicted words are returned in order of frequency, with the most common option first.
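
A minimal sketch of this organization using data.table and stringr (the packages the lookup code in the Shiny section relies on) might look as follows. The Start and End column names match that lookup code; the ngrams input vector is assumed to hold the tokenized phrases:

library(data.table)
library(stringr)

# ngrams: a character vector such as c("the rest of the day", ...).
tok <- data.table(ngram = ngrams)
tok[, Start := word(ngram, 1, -2)]  # everything except the last word
tok[, End := word(ngram, -1)]       # the last word
# Count each (Start, End) pair and sort so that, within each Start,
# the most frequent ending comes first; End[[1]] is then the top pick.
tok <- tok[, .N, by = .(Start, End)][order(Start, -N)]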

For example, the 4-word phrase “The rest of the” constitutes the beginning portion of 5-grams that end with the options “day”, “season”, and “time”, in order of commonality. The prediction algorithm takes a phrase of any length, uses its final 1-4 words as the beginning of an ngram, and identifies the most common ending words.
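
With a tok table organized as sketched above, this example lookup reduces to a one-line filter (the returned values mirror the example in the text):

tok[Start == "the rest of the", End]
# [1] "day"    "season" "time"   (most common first)
tok[Start == "the rest of the", End[[1]]]
# [1] "day"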

SHINY TOOL DETAILS

The primary challenge in the Shiny tool development involved generating code to quickly and continually clean the text input provided by a user and compare the input to the ngram data frame. An example of the heart of this process appears below:

output$top10 <- renderText({
  tok[Start == word(trimws(tolower(gsub("[^[:alnum:] ']", "", input$txt))), -4, -1),
      End[[1]]]
})

This code takes the text input and removes any characters that are not alphanumeric, spaces, or apostrophes. The result is converted to lower case and the leading and trailing whitespace is trimmed. The final words are then selected, beginning at the fourth word from the end. This string is matched against the Start column of the ngram data frame, which is already ordered by frequency, and the corresponding End values are retrieved. The first End value is the most frequently occurring subsequent word for the given string. That output is provided to the user, and the entire process takes only a few milliseconds.
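
For context, a minimal self-contained version of the app might look like the sketch below. The tiny in-memory tok table and the UI labels are illustrative assumptions; the real app loads the full ngram table built from the corpus:

library(shiny)
library(data.table)
library(stringr)

# Illustrative stand-in for the full ngram table.
tok <- data.table(
  Start = rep("the rest of the", 3),
  End = c("day", "season", "time")
)

ui <- fluidPage(
  textInput("txt", "Type a phrase:"),
  textOutput("top10")
)

server <- function(input, output) {
  output$top10 <- renderText({
    cleaned <- trimws(tolower(gsub("[^[:alnum:] ']", "", input$txt)))
    key <- word(cleaned, -4, -1)       # final four words of the input
    if (is.na(key)) return("")         # fewer than four words typed so far
    hits <- tok[Start == key, End]
    if (length(hits) == 0) return("")  # no matching ngram
    hits[[1]]                          # most frequent next word
  })
}

shinyApp(ui, server)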

FUTURE CONSIDERATIONS

Further development of this tool is possible and ongoing refinement is recommended with an eye towards the following:

  1. Language use changes constantly, so the tokens should receive regular updates by scraping text from popular contemporary sources and generating new ngrams.
  2. More complex tokens are also possible, such as skipgrams, which pair words that are near one another but not adjacent (a rough sketch follows this list). Inclusion of these should enhance the prediction algorithm.
  3. Additional prediction models exist and should be evaluated against the model employed here. These often involve vectorizing tokens to quantify relationships based on co-occurrences within a corpus.
  4. Speed, accuracy, and memory use should be balanced further depending on the deployment context. The current tool is fast and uses little memory, but the tradeoff for these qualities is reduced accuracy.
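
As one concrete direction for item 2, pairs of words separated by one intervening word (one simple flavor of skipgram) could be generated with a rough base-R sketch like this; the function name is illustrative:

# Pair each word with the word two positions ahead, skipping one word.
make_skipgrams <- function(text, skip = 1) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  n <- length(words)
  if (n < skip + 2) return(character(0))
  paste(words[1:(n - skip - 1)], words[(skip + 2):n])
}

make_skipgrams("the rest of the day")
# [1] "the of"   "rest the" "of day"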

Link to the tool: Totally Accurate Mindreader!