Lisa Seghers
Jan 12, 2019
This is a fairly basic app that predicts the next word based on text entered into a textbox. The algorithm is a Katz Backoff strategy that draws on quadgrams, trigrams, bigrams, and unigrams and their respective frequencies. The ngrams were created from a dataset comprising blog posts, tweets, and news articles.
Of the provided dataset, 75% was read, tokenized, cleaned, and formed into ngrams of up to 4 terms. The cleaning process involved removing punctuation and numbers and converting all letters to lowercase. Both stemming (reducing a word to its root) and stop-word removal (removing common words) were considered but ultimately not used. After conversion to ngrams, the least frequently appearing ngrams were omitted from the datasets (criteria: an ngram must appear at least 10 times and in at least 2 documents). This was intended to reduce the incidence of typos and foreign words and to shrink the ngram files for faster app performance.
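The cleaning and pruning steps above can be sketched roughly as follows. This is a minimal Python illustration, not the app's actual code; the function names `clean_tokens` and `build_ngrams` are hypothetical, and the choice to keep only the ASCII letters a-z is an assumption made to keep the example short.

```python
import re
from collections import Counter

def clean_tokens(text):
    """Lowercase the text, delete digits and punctuation, split on whitespace."""
    text = text.lower()
    # Assumption: anything that is not a-z or whitespace is simply deleted.
    text = re.sub(r"[^a-z\s]", "", text)
    return text.split()

def build_ngrams(documents, n, min_count=10, min_docs=2):
    """Count n-grams of length n, keeping only those that appear at least
    min_count times overall and in at least min_docs documents."""
    total = Counter()     # total occurrences of each n-gram
    doc_freq = Counter()  # number of documents containing each n-gram
    for doc in documents:
        tokens = clean_tokens(doc)
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total.update(grams)
        doc_freq.update(set(grams))
    return {g: c for g, c in total.items()
            if c >= min_count and doc_freq[g] >= min_docs}
```

Calling `build_ngrams(documents, n)` once for each n from 1 to 4 would give the four frequency tables the prediction step relies on.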
The Katz Backoff strategy determines the probable next word by examining progressively shorter ngrams until a best fit is determined. Specifically, the last three words of the entered text are taken as the first three words of a quadgram, with the fourth word of the quadgram representing the proposed next word. A search of the quadgram database determines whether any quadgram starts with those three words; if so, the fourth word of the most frequently appearing match is the prediction. If no matching quadgram exists, the program truncates the input to the last two words and repeats the search against the trigrams, then backs off in the same way to the bigrams and, finally, the unigrams.
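The lookup described above can be sketched in a few lines. This is an illustrative Python version under stated assumptions, not the app's implementation: `index_by_prefix` and `predict_next` are hypothetical names, and the frequency tables are assumed to be dictionaries mapping word tuples to counts. It returns the most frequent continuation, as the text describes; it does not apply the count discounting that a full Katz backoff model would add.

```python
from collections import defaultdict

def index_by_prefix(ngram_counts):
    """Group n-gram counts by their first n-1 words: prefix -> {next_word: count}."""
    index = defaultdict(dict)
    for gram, count in ngram_counts.items():
        index[gram[:-1]][gram[-1]] = count
    return index

def predict_next(text, indexes, max_order=4):
    """Try the longest available prefix first, then back off to shorter ones.
    indexes maps n-gram order (4, 3, 2, 1) to the output of index_by_prefix."""
    words = text.lower().split()
    if not words:
        return None  # the app shows a prompt message instead
    for order in range(max_order, 0, -1):
        k = order - 1  # number of prefix words needed for this order
        if len(words) < k:
            continue
        prefix = tuple(words[len(words) - k:])  # empty tuple when k == 0
        candidates = indexes.get(order, {}).get(prefix)
        if candidates:
            # Most frequent continuation wins.
            return max(candidates, key=candidates.get)
    return None

# Toy frequency tables, invented purely for illustration.
counts = {
    4: {("one", "of", "the", "most"): 12, ("one", "of", "the", "best"): 20},
    3: {("of", "the", "year"): 30},
    2: {("the", "same"): 50},
    1: {("the",): 100, ("and",): 80},
}
indexes = {n: index_by_prefix(c) for n, c in counts.items()}
print(predict_next("She is one of the", indexes))  # quadgram match
print(predict_next("end of the", indexes))         # backs off to trigrams
```

Indexing each table by prefix keeps every lookup to a single dictionary access, which is one way to get the fast response the app aims for.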
The Word_Predict1 Shiny App has a simple design: Users enter text in the box provided, and the next predicted word shows up in blue. The default response to a blank or non-text entry is the message “Please Enter Text.” The textbox is reactive and will produce a predicted word as soon as a pause in typing occurs. The predicted word updates as more or new text is entered. One fun application is to add the predicted word to the end of the text in the Text Input box and see where it takes you. (Note that this may result in a predicted word loop.)
This is one of the simplest methods to obtain a predicted word. It produces a predicted word for any entered text and is very fast. However, it merely takes the most frequently occurring ngram matching the longest string (up to three words) from the entered text and returns the last word. Therefore, like most simple text prediction strategies, its accuracy is extremely poor for text entries that require a content-matched result or proper noun.
A better, though slower, method would be to integrate some level of semantic analysis into the program in order to obtain more relevant predictions. Another strategy is to allow users to upload a personalized database of ngrams built from their own texts, tweets, posts, etc., so that the prediction is more tailored to an individual's likely word usage.