Capstone: Word Prediction

2024-09-11

Summary of Data Set

The three text files in our data set contain blog posts, news posts, and tweets.

The blog file contains 899,288 posts and 37,546,806 words,
of which 319,546 are unique.

The news file contains 77,259 posts and 2,674,561 words,
of which 86,601 are unique.

The Twitter file contains 2,360,148 tweets and 30,096,649 words,
of which 367,972 are unique.
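
As a rough illustration of how counts like these can be produced, here is a minimal Python sketch. It is not the code used for this report: the function name summarize is hypothetical, and it assumes one post per line, words separated by whitespace, and unique words compared after lowercasing. A different definition of "word" would give different totals.

```python
def summarize(lines):
    """Count non-empty lines, total words, and unique lowercased words."""
    n_lines = 0
    n_words = 0
    vocabulary = set()
    for line in lines:
        words = line.lower().split()  # assumption: whitespace-separated words
        if not words:
            continue  # skip blank lines
        n_lines += 1
        n_words += len(words)
        vocabulary.update(words)
    return n_lines, n_words, len(vocabulary)


# Tiny stand-in for one of the text files (one post per line).
sample_lines = ["The cat sat", "the dog sat down"]
line_count, word_count, unique_count = summarize(sample_lines)
```

Because the function only iterates over lines, an open file handle can be passed in place of the list, so a large file never has to be held in memory all at once.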

N-Gram Text Analysis

In text analysis, a token is the smallest unit of text you are trying to analyze. Usually that is a single word, but it can also be a pair of words or a phrase.

When text is tokenized into groupings of more than one word, each grouping is called an n-gram: a sequence of n words that appear consecutively in the text. For instance, two-word sequences are called 2-grams, or bigrams. Similarly, 3-grams, or trigrams, are three-word sequences, and 4-grams, or quadrigrams, are four-word sequences.
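
To make the definition concrete, here is a small Python sketch of n-gram tokenization. The helper names tokenize and ngrams are hypothetical, and the word pattern (lowercase letters and apostrophes) is an assumption rather than the cleaning rule used for this project.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    # Assumption: a word is a run of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The quick brown fox jumps")
bigrams = ngrams(tokens, 2)    # two-word sequences
trigrams = ngrams(tokens, 3)   # three-word sequences
```

A text of k words yields k - n + 1 n-grams, so a text shorter than n words yields none.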

Many recommender systems, like the one I built, use n-grams to suggest the next word in a sequence.

N-Gram Predictive Modeling

After my initial analysis, I decided to implement a simple n-gram model for text prediction. After combining the data into a single corpus, I pre-calculated the frequencies of all of the bigrams, trigrams, and quadrigrams. Using these frequencies, I made three n-gram prediction models, one for each n-gram size.
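
The idea of pre-calculating frequencies and predicting from them can be sketched in Python as follows. This is an illustration under stated assumptions, not the application's code: build_model and predict are hypothetical names, the corpus is a toy list of already-tokenized posts, and each model simply returns the word that most often followed the last n - 1 words.

```python
from collections import Counter, defaultdict


def build_model(posts, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for tokens in posts:
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            model[context][next_word] += 1
    return model


def predict(model, tokens, n):
    """Return the most frequent follower of the last n-1 tokens, or None."""
    if len(tokens) < n - 1:
        return None  # not enough context for this model
    context = tuple(tokens[-(n - 1):])
    followers = model.get(context)
    if not followers:
        return None  # context never seen in the corpus
    return followers.most_common(1)[0][0]


# Toy corpus of tokenized posts (hypothetical data).
corpus = [
    ["thanks", "for", "the", "follow"],
    ["thanks", "for", "the", "help"],
    ["thanks", "for", "the", "follow"],
]
bigram_model = build_model(corpus, 2)
trigram_model = build_model(corpus, 3)
quadrigram_model = build_model(corpus, 4)

bigram_guess = predict(bigram_model, ["for", "the"], 2)
quadrigram_guess = predict(quadrigram_model, ["thanks", "for", "the"], 4)
unseen_guess = predict(trigram_model, ["no", "match"], 3)
```

Counting once up front means that a prediction is only a dictionary lookup, which is what makes pre-calculation attractive for an interactive application.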

To predict the most likely next word in a given sequence, each model makes its own prediction. If two or more of the n-gram models agree, I nominate that word as the most likely next word. If the models don’t agree, I use a normalized, weighted likelihood, which I calculated with the validation data.
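
One way to express this voting rule in Python is sketched below. The details are assumptions made for illustration: the function name combine is hypothetical, each model is assumed to report its top word together with a likelihood, and the weights are made-up numbers standing in for values derived from validation data.

```python
from collections import Counter


def combine(predictions, weights):
    """Pick a word from per-model predictions.

    predictions: {model_name: (word, likelihood)}
    weights:     {model_name: positive weight}
    """
    if not predictions:
        return None
    # Majority vote: if at least two models name the same word, use it.
    votes = Counter(word for word, _ in predictions.values())
    word, count = votes.most_common(1)[0]
    if count >= 2:
        return word
    # No agreement: normalize the weights and score each model's word.
    total = sum(weights[name] for name in predictions)
    scores = {
        candidate: weights[name] / total * likelihood
        for name, (candidate, likelihood) in predictions.items()
    }
    return max(scores, key=scores.get)


# Hypothetical weights, e.g. derived from each model's validation performance.
weights = {"bigram": 0.2, "trigram": 0.5, "quadrigram": 0.3}

agreed = combine(
    {"bigram": ("you", 0.4), "trigram": ("you", 0.3), "quadrigram": ("me", 0.9)},
    weights,
)
fallback = combine(
    {"bigram": ("you", 0.6), "trigram": ("day", 0.3), "quadrigram": ("me", 0.4)},
    weights,
)
```

In the first call two models agree on "you", so the vote decides even though the third model is more confident. In the second call no two models agree, so the trigram model's word wins because its weighted likelihood (0.5 × 0.3) is the largest.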

Here’s a link to my application:

http://mhthom2.shinyapps.io/WordPredictionApp

Prediction Results

To evaluate my prediction model, I randomly selected phrases from the test data and made predictions with all of the models.
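
An evaluation of this kind can be sketched in Python as follows. The sketch assumes that accuracy means the share of sampled phrases whose final word is predicted exactly from the words before it. The name accuracy, the fixed random seed, and the placeholder predictor are hypothetical.

```python
import random


def accuracy(predict_next, phrases, sample_size, seed=0):
    """Fraction of sampled phrases whose last word is predicted correctly."""
    if not phrases:
        return 0.0
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rng.sample(phrases, min(sample_size, len(phrases)))
    hits = sum(1 for phrase in sample if predict_next(phrase[:-1]) == phrase[-1])
    return hits / len(sample)


def always_the(context):
    """Placeholder predictor that ignores its context (illustration only)."""
    return "the"


# Toy test phrases (hypothetical data); the last word is the target.
test_phrases = [
    ["in", "the"],
    ["of", "the"],
    ["thank", "you"],
    ["on", "the"],
]
score = accuracy(always_the, test_phrases, sample_size=4)
```

Any of the n-gram models can be scored the same way by passing a function that takes the context words and returns a single predicted word.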

The best-performing predictor was the trigram model, but the accuracy of every model was below 25%. As I noted in the exploratory analysis, the scarcity of data makes accurate prediction difficult.