2026-05-16

Overview of the App

This App takes an input phrase (up to three words) and predicts the next word in the sequence.

The App was built in several major steps:

  • Data cleaning and preparation
  • Exploratory analysis
  • N-gram creation
  • Model construction
  • App development

Descriptions of the training data, modeling technique, and an example of the App output for the input string “I like your” can be found in the next several slides.

Training Data

The data used to build the prediction model come from the (SwiftKey) HC Corpora data set. This data set is drawn from three main sources (Blogs, News, and Twitter) and is available in four languages (English, German, Finnish, and Russian). Only the English version was used for this analysis.

Due to the large size of the data sets, a random sample of lines was drawn from each of the three sources. To keep the relative weight of each source similar to the original data, the same proportion (12%) of lines was sampled from each source.
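
The sampling step can be sketched as follows. This is a minimal Python illustration, not the App's own R code; the function name `sample_lines`, the seed, and the toy line counts are assumptions made for the example. Taking the same fraction from every source keeps the sources in their original ratio.

```python
import random

def sample_lines(lines, fraction=0.12, seed=42):
    """Return a random sample holding `fraction` of the given lines."""
    rng = random.Random(seed)          # fixed seed (assumed) for reproducibility
    k = round(len(lines) * fraction)   # number of lines to keep
    return rng.sample(lines, k)        # sampling without replacement

# Toy stand-ins for the three sources (sizes are invented for illustration)
sources = {
    "blogs": [f"blog line {i}" for i in range(900)],
    "news": [f"news line {i}" for i in range(100)],
    "twitter": [f"tweet {i}" for i in range(2400)],
}

# The same 12% fraction is applied to every source
samples = {name: sample_lines(lines) for name, lines in sources.items()}
```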

Before modeling could begin, the data were cleaned to remove numbers, punctuation, URLs, separators, symbols, and profanity. Stop words were retained in the final model to preserve context and improve predictive accuracy.
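
A cleaning pass along these lines might look like the sketch below. It is a Python illustration under stated assumptions, not the App's R code: the `PROFANITY` set is a placeholder (the actual list is not given here), and lowercasing and keeping apostrophes inside words are choices made for the example. Stop words such as "i" and "your" pass through untouched.

```python
import re

# Placeholder list; the App's actual profanity list is not given in the source.
PROFANITY = {"darn", "heck"}

def clean_line(text):
    """Return the cleaned tokens of one line of raw text."""
    text = text.lower()                                   # assumption: case-folded
    text = re.sub(r"(https?://|www\.)\S+", " ", text)     # drop URLs
    text = re.sub(r"[^a-z'\s]", " ", text)                # drop numbers, punctuation, symbols
    tokens = [t.strip("'") for t in text.split()]         # keep only word-internal apostrophes
    return [t for t in tokens if t and t not in PROFANITY]

tokens = clean_line("Visit https://example.com NOW, I like your 2 darn cats!")
```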

Modeling

The prediction model uses an N-gram language model to predict the next word in the submitted phrase based on 4-, 3-, and 2-gram matches.

The N-gram comparison sets were created with the Quanteda package in R. To minimize memory use and computation time within the App itself, the N-gram sets were computed outside of the App: the tokenized data were converted to document-feature matrices (DFMs) and saved as RDS files for use in the App. To limit the number of rare N-grams, each of the three files was trimmed to keep only N-grams that occur at least 2 times (4- and 3-grams) or 3 times (2-grams).
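
The counting and trimming logic can be illustrated with the sketch below. It is written in plain Python rather than with Quanteda, so the helper names `ngram_counts` and `trim` are hypothetical and the three-line corpus is invented; only the minimum counts (2 for 4- and 3-grams, 3 for 2-grams) come from the description above.

```python
from collections import Counter

def ngram_counts(token_lists, n):
    """Count every run of n consecutive tokens across all lines."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def trim(counts, min_count):
    """Keep only the N-grams seen at least `min_count` times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Minimum occurrences per N-gram order, as described in the text
MIN_COUNT = {4: 2, 3: 2, 2: 3}

# Toy corpus of already-cleaned token lists
corpus = [
    "i like your style".split(),
    "i like your hat".split(),
    "i like your style".split(),
]

trimmed = {n: trim(ngram_counts(corpus, n), k) for n, k in MIN_COUNT.items()}
```

On this toy corpus, trimming drops every N-gram that contains "hat", since each of those occurs only once.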

Probabilities for the selection of final suggestions were based on the Stupid Backoff method, which scores unseen N-grams by backing off to shorter N-grams and discounting the score at each back-off step (the results are relative scores rather than normalized probabilities).
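
A minimal sketch of Stupid Backoff scoring is shown below, in Python rather than the App's R code. Several details are assumptions: the back-off factor `ALPHA = 0.4` is the value commonly used with this method and is not stated for the App, the final fall-back to single-word frequencies is included for completeness, and the function names and toy corpus are hypothetical.

```python
from collections import Counter

ALPHA = 0.4  # assumed back-off factor; the App's value is not stated

def build_counts(token_lists, max_n=4):
    """Count all N-grams of order 1..max_n, keyed by tuple of words."""
    counts = Counter()
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(counts, context, word, alpha=ALPHA):
    """Score `word` as the continuation of `context` (up to three words)."""
    context = tuple(context)
    weight = 1.0
    while context:
        seen = counts.get(context + (word,), 0)
        if seen > 0:
            # Relative frequency of the longest matching N-gram
            return weight * seen / counts[context]
        context = context[1:]   # back off: drop the earliest word
        weight *= alpha         # discount once per back-off step
    total = sum(c for gram, c in counts.items() if len(gram) == 1)
    return weight * counts.get((word,), 0) / total

def predict(counts, phrase, top=3):
    """Rank every known word as a continuation of the last three words."""
    context = tuple(phrase.lower().split()[-3:])
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    scored = [(w, stupid_backoff(counts, context, w)) for w in vocab]
    return sorted(scored, key=lambda pair: -pair[1])[:top]

corpus = [
    "i like your style".split(),
    "i like your hat".split(),
    "i like your style".split(),
]
counts = build_counts(corpus)
best = predict(counts, "I like your", top=2)
```

With this toy corpus, "style" outranks "hat" for the input "I like your" because it follows that phrase in two of the three lines. A phrase with no 4-gram match, such as "we love your", is scored from the 2-gram "your style" with two discount steps applied.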

Sample App Output

The App has three main components. In the upper panel, the user enters a phrase and the App outputs a 4-tab summary showing the top 10 predictions and the best N-gram matches for the original phrase.

The second panel completes the input phrase with the best prediction.

The third panel shows a graphical representation of the Stupid Backoff (SB) probabilities for the 20 best predictions.