PERFECTLY CROMULENT TECHNOLOGIES
David J. Bauer
2019-09-29
Note:
PERFECTLY CROMULENT TECHNOLOGIES is not a real company, and the text prediction tool is neither a mindreader nor totally accurate. Apologies for the misrepresentations; please enjoy the presentation and the Shiny tool.
I developed a predictive text Shiny tool that provides text suggestions following user input. Such tools are routinely employed in cell phone messaging applications, website search fields, e-mail clients, word processors, and other such environments. The development process involved the following steps:
About 3.5 million samples of English text sourced from blogs, Twitter, and news sites were used to generate tokens with an emphasis on “ngrams”: short phrases of 2 to 5 words.
I compiled all of the ngrams into a data frame and organized it so that it can be filtered by the beginning word(s) of a phrase to identify the last word. The predicted words are returned in order of frequency, with the most common option first.
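The two steps above can be sketched in base R. This is a minimal illustration, not the code used for the tool: the three-sentence `corpus`, the helper `make_ngrams`, and the `Freq` column are assumptions made for the example, while the `tok`, `Start`, and `End` names mirror those used in the Shiny code.

```r
# Hypothetical toy corpus standing in for the blog, Twitter, and news samples.
corpus <- c("the rest of the day",
            "the rest of the season",
            "the rest of the day was quiet")

# Split a lower-cased sentence into words and return every n-word sequence.
make_ngrams <- function(sentence, n) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Collect all 2- to 5-grams across the corpus.
ngrams <- unlist(lapply(2:5, function(n) {
  unlist(lapply(corpus, make_ngrams, n = n))
}))

# Count each ngram, then split it into its leading words (Start)
# and its final word (End).
counts <- as.data.frame(table(ngrams), stringsAsFactors = FALSE)
names(counts) <- c("ngram", "Freq")
counts$Start <- sub(" [^ ]+$", "", counts$ngram)
counts$End   <- sub("^.* ", "", counts$ngram)

# Order by descending frequency so the most common ending comes first.
tok <- counts[order(-counts$Freq), c("Start", "End", "Freq")]

# Endings observed after "the rest of the", most frequent first.
tok$End[tok$Start == "the rest of the"]
```

With the table sorted once up front, a lookup only has to filter on `Start` and read the `End` values in the order they appear.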
For example, the 4-word phrase “The rest of the” constitutes the beginning portion of 5-grams that end with the options “day”, “season”, and “time”, in descending order of frequency. The prediction algorithm simply takes a phrase of any length and identifies the most common ending word choices for the final 1 to 4 words in the phrase.
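One way to handle "the final 1 to 4 words" is to try the longest match first and fall back to shorter ones. The sketch below assumes that longest-first order, which the text does not spell out; the miniature `tok` table and the `predict_next` function are hypothetical and exist only for this example.

```r
# Hypothetical miniature ngram table, already ordered by frequency
# within each Start value.
tok <- data.frame(
  Start = c("the rest of the", "the rest of the", "the rest of the",
            "of the", "of the", "the"),
  End   = c("day", "season", "time", "year", "world", "first"),
  stringsAsFactors = FALSE
)

# Try the last 4 words, then 3, 2, and 1, and return up to `k` endings
# from the longest Start that has any match.
predict_next <- function(phrase, tok, k = 3) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  for (n in seq(min(4, length(words)), 1)) {
    start <- paste(tail(words, n), collapse = " ")
    hits <- tok$End[tok$Start == start]
    if (length(hits) > 0) return(head(hits, k))
  }
  character(0)  # no ngram matched at any length
}

predict_next("I slept for the rest of the", tok)  # matches the 4-word Start
predict_next("at the end of the", tok)            # falls back to "of the"
```

Preferring the longest match keeps the most context when it is available, while the fallback means short or unusual inputs can still produce a suggestion.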
The primary challenge in the Shiny tool development involved generating code to quickly and continually clean the text input provided by a user and compare the input to the ngram data frame. An example of the heart of this process appears below:
output$top10 <- renderText({tok[Start == word(trimws(tolower(gsub("[^[:alnum:] ']", "", input$txt))), -4, -1), End[[1]]]})
This code takes the text input and removes any characters that are not alphanumeric, a space, or an apostrophe. The result is converted to lower case and leading and trailing whitespace is trimmed. The final words are then selected, beginning at the fourth word from the end. This string is matched against the Start column of the ngrams data frame, which is already ordered by frequency, and the corresponding End values are identified. The first of these End values (End[[1]]) is the most frequently occurring subsequent word for the given string. This output is then provided to the user. The whole process takes only a few milliseconds.
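To make the cleaning steps easier to follow, here they are applied one at a time to a made-up input string. This is a base R sketch: the `raw` value and the intermediate variable names are assumptions, and the `strsplit()` plus `tail()` combination is only a stand-in for the `word(x, -4, -1)` call in the one-liner.

```r
# Hypothetical user input with stray punctuation, mixed case, and padding.
raw <- "  Hello!! The REST of the... "

# 1. Drop everything except letters, digits, spaces, and apostrophes.
no_punct <- gsub("[^[:alnum:] ']", "", raw)

# 2. Convert to lower case.
lower <- tolower(no_punct)

# 3. Trim leading and trailing whitespace.
trimmed <- trimws(lower)

# 4. Keep the final four words (fewer if the input is shorter).
words <- strsplit(trimmed, " +")[[1]]
last4 <- paste(tail(words, 4), collapse = " ")

last4
# [1] "the rest of the"
```

The resulting string is what gets compared against the Start column of the ngram table.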
Further development of this tool is possible and ongoing refinement is recommended with an eye towards the following:
Link to the tool: Totally Accurate Mindreader!