Katie Martins
February 4, 2018
The goal of this project was to create a Shiny app that takes as input the first few words of a sentence and predicts the next word.
The training dataset was built by sampling 50% of the blog corpus and 40% each of the news and twitter corpora. The sampled text was then tokenized into n-grams, from unigrams up to 5-grams.
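As a rough illustration of the tokenization step, the sketch below builds an n-gram frequency table in base R. The function name, the splitting regex, and the column names are my assumptions for illustration, not the app's actual code.

```r
# Sketch: build a frequency table of n-grams from a character vector of lines.
# All function and column names here are illustrative, not from the original app.
build_ngrams <- function(lines, n) {
  grams <- lapply(lines, function(line) {
    words <- unlist(strsplit(tolower(line), "[^a-z']+"))
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    # Slide a window of length n over the words of this line.
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  counts <- table(unlist(grams))
  data.frame(ngram = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}
```

The 5-gram table, for example, would be built with `build_ngrams(training_lines, 5)`; each n-gram can then be split into a prefix and a final word for fast lookup.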
The algorithm starts by taking the last four words of the user input and searching for a match among the 5-grams in the model. If a match is found, the score for each candidate next word is the number of times that word follows the 4-gram input divided by the total number of times the 4-gram input occurs. The algorithm then backs off to 4-grams and computes the scores analogously, multiplied by a discount factor of 0.4. The backoff continues through trigrams and bigrams, with a further factor of 0.4 applied at each lower n-gram level, so candidates found only at lower orders are progressively penalized (the "stupid backoff" scheme). The predicted next word is the candidate with the highest score.
If no 5-gram match is found, the algorithm searches 4-grams, then trigrams, then bigrams. If no match is found at any n-gram level, the predicted next word is simply the most frequent unigram.
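The scoring and backoff just described might look roughly like the sketch below. It follows one reasonable reading of the procedure, in which candidates from all levels are pooled and each candidate keeps its score from the highest-order match. It also assumes each n-gram table has been split into a `prefix` column (the first n−1 words) and a `word` column (the final word), stored in a list `tables` indexed by prefix length; these names are assumptions, not the app's actual code.

```r
# Sketch of the backoff scoring. `tables[[4]]` holds the 5-grams split into
# prefix/word/count, `tables[[3]]` the 4-grams, and so on; `unigrams` is a
# data frame of single-word counts. All names are illustrative.
predict_next <- function(input_words, tables, unigrams, top_n = 5) {
  scores <- numeric(0)
  discount <- 1
  for (n in 4:1) {  # prefix length: 4 words down to 1 word
    if (length(input_words) >= n) {
      prefix <- paste(tail(input_words, n), collapse = " ")
      matches <- tables[[n]][tables[[n]]$prefix == prefix, ]
      if (nrow(matches) > 0) {
        # Score = discounted relative frequency of each continuation.
        s <- discount * matches$count / sum(matches$count)
        names(s) <- matches$word
        # Keep each candidate's score from the highest-order match only.
        s <- s[!(names(s) %in% names(scores))]
        scores <- c(scores, s)
      }
    }
    discount <- discount * 0.4  # apply the 0.4 discount per backoff level
  }
  if (length(scores) == 0) {
    # No match at any level: fall back to the most frequent unigram.
    return(unigrams$word[which.max(unigrams$count)])
  }
  names(sort(scores, decreasing = TRUE))[seq_len(min(top_n, length(scores)))]
}
```

Returning the top five ranked candidates, rather than just the best one, also supports the top-five accuracy check described below.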
Model accuracy was assessed on 1,000 lines from each corpus (news, twitter, and blogs); these lines were held out and not used in building the model. The first four words of each line were used as input, and the predicted next word was compared to the actual fifth word.
In about 15% of cases, the model's top prediction matched the actual next word.
In about 30% of cases, the actual next word was among the top five predictions.
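A minimal sketch of this evaluation loop, reusing the hypothetical `predict_next` from the previous sketch (none of these names come from the original test code):

```r
# Sketch: compute top-1 and top-5 accuracy over held-out lines, assuming
# predict_next() returns up to five ranked candidate words.
evaluate <- function(test_lines, tables, unigrams) {
  top1 <- 0; top5 <- 0; n_tested <- 0
  for (line in test_lines) {
    words <- unlist(strsplit(tolower(line), "[^a-z']+"))
    words <- words[words != ""]
    if (length(words) < 5) next  # need four input words plus the true next word
    preds <- predict_next(words[1:4], tables, unigrams)
    actual <- words[5]
    n_tested <- n_tested + 1
    if (preds[1] == actual) top1 <- top1 + 1
    if (actual %in% preds) top5 <- top5 + 1
  }
  c(top1_accuracy = top1 / n_tested, top5_accuracy = top5 / n_tested)
}
```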