Lisa Seghers
Jan 12, 2019
This is a fairly basic app that predicts the next word based on text entered into a textbox. The algorithm is a Katz backoff strategy that draws on quadgrams, trigrams, bigrams, and unigrams, and their respective frequencies. The ngrams were created from a dataset comprising blog posts, tweets, and news articles: 75% of the dataset was sampled, tokenized, cleaned, and formed into ngrams up to four terms long.
The cleaning process involved removing punctuation and numbers and converting all letters to lowercase. Both stemming (reducing a word to its root) and stop-word removal (removing common words) were considered but ultimately not used. After conversion to ngrams, the least frequently appearing ngrams were omitted from the datasets (criteria: an ngram must appear at least 10 times and in at least 2 documents). This was intended to reduce the incidence of typos and foreign words and to shrink the ngram files for faster app performance.
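A minimal sketch of that preprocessing pipeline in R, assuming the quanteda package (the post does not name its libraries, so the package choice and the placeholder `docs` vector are assumptions):

```r
library(quanteda)

# Stand-in for the real corpus: a character vector of blog, tweet,
# and news documents.
docs <- c("here we go again and again", "here we go now")

# Tokenize, dropping punctuation and numbers, then lowercase everything.
toks <- tokens(docs, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# Build ngrams of length 1 through 4 and count them in a
# document-feature matrix.
ngrams <- tokens_ngrams(toks, n = 1:4, concatenator = " ")
counts <- dfm(ngrams)

# Drop rare ngrams: keep only those appearing at least 10 times and in
# at least 2 documents, mirroring the filtering described above.
counts <- dfm_trim(counts, min_termfreq = 10, min_docfreq = 2)

# Flatten to a sorted frequency table for fast lookup in the app.
freqs <- sort(colSums(counts), decreasing = TRUE)
```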
The Katz backoff strategy finds the probable next word by examining progressively shorter ngrams until a best fit is determined. A database of ngrams of varying lengths and their respective frequencies is built in advance, and the suggested next word is the most commonly appearing word at the end of a matching ngram.
In other words, the last three words of the entered text are taken as the first three words of a quadgram, with the fourth word of the quadgram representing the proposed next word. A search of the quadgram database determines whether any quadgram starts with those three words; if so, the fourth word of the most frequently appearing match is the prediction. If no such quadgram exists, the program truncates to the last two words and repeats the search against the trigram dataset, and so on down to bigrams and, finally, the most frequent unigram.
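A simplified base-R sketch of this lookup, assuming each ngram table is a data frame with `prefix`, `next_word`, and `freq` columns (the post does not describe its actual data structures, and the sample rows are invented for illustration):

```r
# Assumed structure: one data frame per ngram order, each with columns
# prefix (the first n-1 words), next_word (the final word), and freq (count).
quadgrams <- data.frame(prefix = "here we go", next_word = "again", freq = 52)
trigrams  <- data.frame(prefix = "we go",      next_word = "again", freq = 97)
bigrams   <- data.frame(prefix = "go",         next_word = "to",    freq = 310)

predict_next <- function(text) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  # Try a 3-word prefix (quadgrams), then 2 (trigrams), then 1 (bigrams).
  for (n in 3:1) {
    if (length(words) < n) next
    prefix  <- paste(tail(words, n), collapse = " ")
    table   <- list(bigrams, trigrams, quadgrams)[[n]]
    matches <- table[table$prefix == prefix, ]
    if (nrow(matches) > 0) {
      # Return the completion of the most frequent matching ngram.
      return(matches$next_word[which.max(matches$freq)])
    }
  }
  "the"  # back off all the way to the most frequent unigram
}

predict_next("Now here we go")  # "again"
```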
The Word_Predict1 Shiny App has a simple design: Users enter text in the box provided, and the next predicted word shows up in blue. The default response to a blank or non-text entry is the message “Please Enter Text.” The textbox is reactive and will produce a predicted word as soon as a pause in typing occurs. The predicted word updates as more or new text is entered. One fun application is to add the predicted word to the end of the text in the Text Input box and see where it takes you.
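A minimal Shiny skeleton along these lines; the app's source is not shown in the post, so the layout, the `predict_next()` helper from the sketch above, and the 500 ms debounce interval are assumptions:

```r
library(shiny)

ui <- fluidPage(
  textInput("text", "Text Input"),
  # Predicted word rendered in blue, as in the app's display.
  span(textOutput("prediction", inline = TRUE), style = "color: blue;")
)

server <- function(input, output) {
  # Debounce so the prediction fires after a pause in typing.
  typed <- debounce(reactive(input$text), 500)
  output$prediction <- renderText({
    if (!nzchar(trimws(typed()))) "Please Enter Text"
    else predict_next(typed())
  })
}

shinyApp(ui, server)
```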
A limitation of the created ngram datasets is that some entries may produce a predicted-word loop because of insufficient variation in available ngrams and the display of only a single predicted word. To illustrate, if the user types the phrase “Here we go”, the predicted word is “again”; adding it to the entry string (“Here we go again”) yields the predicted word “and”, and adding “and” to the string (“Here we go again and”) yields “again” once more, and so forth.
This is one of the simplest methods of obtaining a predicted word, and it does not take into account any of the content of the entered text. It merely takes the most frequently occurring ngram matching the longest string (up to three words) from the entered text and returns the last word. It always provides a predicted word for any text entry longer than one character, even if that word is the most frequently occurring unigram (“the”), and it operates very quickly. Like most simple text prediction strategies, its accuracy is extremely poor for text entries that require a content-matched result or a proper noun.
A better, though slower, method would be to integrate some level of semantic analysis into the program in order to narrow the set of matching ngrams to those that are relevant by content to the entered text. Another strategy is to allow users to upload a personalized database of ngrams built from their own texts, tweets, posts, etc., so that the prediction is more tailored to an individual's likely word usage.