2024-07-23

Summary of Data Set

The three text files in our data set contain blog posts, news posts, and tweets.

The blog file contains 899,288 posts and 37,546,806 words,
of which 319,546 are unique. The 115 most frequent unique words give 50% coverage
(that is, they account for half of all word occurrences), and 6,778 words give 90% coverage of the blog posts.
The news file contains 77,259 posts and 2,674,561 words,
of which 86,601 are unique. It takes 220 unique words for 50% coverage
and 8,440 words for 90% coverage of the news posts.
The Twitter file contains 2,360,148 tweets and 30,096,649 words,
of which 367,972 are unique. It takes 131 unique words for 50% coverage
and 5,555 words for 90% coverage of the tweets.
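
As a rough illustration of how coverage figures like these can be computed, here is a minimal Python sketch. The function name `words_for_coverage` and the whitespace tokenization are assumptions made for this example; this is not the code that produced the numbers above.

```python
from collections import Counter


def words_for_coverage(tokens, coverage):
    """Return how many of the most frequent unique words are needed to
    account for at least `coverage` (between 0 and 1) of all occurrences."""
    counts = Counter(tokens)
    target = coverage * sum(counts.values())
    running = 0
    # Walk the vocabulary from most to least frequent, accumulating counts.
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= target:
            return rank
    return len(counts)


# Toy text (hypothetical, not taken from the corpora).
tokens = "the cat sat on the mat and the dog sat too".split()
print(len(tokens), len(set(tokens)))    # 11 8  (total words, unique words)
print(words_for_coverage(tokens, 0.5))  # 3
```

Because the most frequent words are counted first, the result is the smallest vocabulary that reaches the requested coverage.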

Twenty Most Common Words

The top words common to all three corpora are:
 the and to a of i in that is it for with on

So 13 of the 20 most common words appear in the top-20 lists of all three corpora. All of them are so-called stop words: very frequent words with little analytical value.
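
The comparison of top-word lists can be sketched as a set intersection. The helper names `top_words` and `common_top_words` and the toy corpora below are hypothetical and only meant to show the idea.

```python
from collections import Counter


def top_words(tokens, k=20):
    """Return the k most frequent words, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(k)]


def common_top_words(corpora, k=20):
    """Return the words found in the top-k list of every corpus,
    in the order they appear in the first corpus's list."""
    top_lists = [top_words(tokens, k) for tokens in corpora.values()]
    shared = set(top_lists[0]).intersection(*top_lists[1:])
    return [word for word in top_lists[0] if word in shared]


# Toy corpora (hypothetical), with k=3 to keep the example small.
corpora = {
    "blogs": "the the the and and to to i".split(),
    "news": "the the the to to to said said of".split(),
    "twitter": "i i i the the to to you".split(),
}
print(common_top_words(corpora, k=3))  # ['the', 'to']
```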

Most Common Non-Stop Words

The most common words in each corpus without stop words:

The top words common to all three corpora are:
 one just like can time get now new

So 8 of the 20 most common non-stop words appear in the top-20 lists of all three corpora.
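
Removing stop words before counting can be sketched as follows. The stop-word list here is just the 13 words reported earlier in this summary, used as a stand-in; a real analysis would use a fuller list, and the function name `top_content_words` is an assumption for illustration.

```python
from collections import Counter

# Stand-in stop-word list: the 13 top words shared by all three corpora.
STOP_WORDS = {"the", "and", "to", "a", "of", "i", "in",
              "that", "is", "it", "for", "with", "on"}


def top_content_words(tokens, stop_words=STOP_WORDS, k=20):
    """Return the k most frequent words after dropping stop words."""
    kept = [t.lower() for t in tokens if t.lower() not in stop_words]
    return [word for word, _ in Counter(kept).most_common(k)]


# Toy sentence (hypothetical).
tokens = "it is time to get a new one and i like the new one".split()
print(top_content_words(tokens, k=3))  # ['new', 'one', 'time']
```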

Distribution of Word Proportions

For the non-stop words that together cover 95% of the texts, we plotted a histogram of the proportion of times that each word appears in its respective corpus.

The challenge of creating a text prediction model is apparent from the distributions. Most of the words appear infrequently, so the training data will be sparse.
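
To make the sparsity concrete, word proportions can be bucketed by order of magnitude, which is roughly what a histogram on a log scale shows. This is a minimal sketch with a hypothetical function name (`proportion_histogram`) and toy data; it is not the plotting code used for this report.

```python
from collections import Counter
import math


def proportion_histogram(tokens):
    """Bucket words by the order of magnitude of their proportion.

    Returns {exponent: number_of_words}, where exponent e means the
    word's proportion p satisfies 10**e <= p < 10**(e + 1).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    histogram = Counter()
    for count in counts.values():
        proportion = count / total
        histogram[math.floor(math.log10(proportion))] += 1
    return dict(sorted(histogram.items()))


# Toy data (hypothetical): one frequent word and ten words seen once each.
tokens = ["the"] * 10 + [f"w{i}" for i in range(10)]
print(proportion_histogram(tokens))  # {-2: 10, -1: 1}
```

In the toy data, ten of the eleven words fall in the lowest bucket, mirroring the long tail of rare words described above.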

N-Gram Analysis

In text analysis, the term token is used to mean the smallest unit of text you are trying to analyze. Usually, that would be a single word, but it could be a phrase or a pair of words.

When text is tokenized into phrases of more than one word, each sequence of words is called an n-gram: n words that appear consecutively in the text. For instance, two-word phrases are called 2-grams, or bigrams. Similarly, three-word phrases are called 3-grams, or trigrams.
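
A short Python sketch of n-gram tokenization, assuming the text has already been split into words; the function name `ngrams` is chosen for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "we love san francisco".split()
print(ngrams(tokens, 2))
# [('we', 'love'), ('love', 'san'), ('san', 'francisco')]
print(ngrams(tokens, 3))
# [('we', 'love', 'san'), ('love', 'san', 'francisco')]
```

A text of m words yields m - n + 1 n-grams, and none if it is shorter than n words.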

Many text prediction systems, like the one we’re planning to build, use n-grams to suggest the next word in a sequence.

Identifying Most Common Bigrams

For our initial analysis of the corpora, we found the most common bigrams.

A couple of the bigrams, like los angeles and san francisco, appear in two of the lists, but none appear in all three.

Distribution of Bigram Proportions

As we did for the single words, we plotted a histogram of the distribution of the proportion of times that the bigrams with 95% coverage appear in their respective corpora.

As with the individual words, the distribution of the bigrams is sparse, which will make training a prediction model challenging.

Identifying Most Common Trigrams

In our analysis, we also found the most common trigrams.

In this case, none of the top trigrams appear in more than one corpus.

Distribution of Trigram Proportions

We also plotted a histogram of the distribution of the proportion of times that the given trigrams with 95% coverage appear in their respective corpora.

The distributions again indicate that the training data is sparse.

N-Gram Predictive Modeling

After our initial analysis, we decided that we will implement a simple n-gram model for text prediction. After combining the data in all three corpora, we’ll pre-calculate the frequencies of all of the words as well as the bigrams and trigrams. We’ll then determine the proportion of times that each occurs in our combined corpus and store the values in a simple look-up table.

In our predictions, these proportions will be treated as the probability that a given combination of words will occur. To save time and space, combinations of words with a probability below a yet-to-be-determined threshold will be removed from the table.
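
The look-up tables described above could be built along these lines. This is only a sketch: the names `ngrams` and `build_table` are hypothetical, and the threshold value of 0.15 is a placeholder chosen to suit the toy data, since the real threshold has not been determined.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_table(tokens, n, min_proportion=0.0):
    """Map each n-gram to its proportion among all n-grams of that order,
    dropping entries whose proportion is below `min_proportion`."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    return {
        gram: count / total
        for gram, count in counts.items()
        if count / total >= min_proportion
    }


# Toy combined corpus (hypothetical); 0.15 is a placeholder threshold.
tokens = "i like new york and i like san francisco".split()
tables = {n: build_table(tokens, n, min_proportion=0.15) for n in (1, 2, 3)}
print(tables[2])  # {('i', 'like'): 0.25}
print(tables[3])  # {}
```

With this toy threshold every trigram is pruned, which shows the trade-off: a higher threshold saves space but leaves more inputs without a matching entry.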

Predictive Modeling Steps

The prediction model will work as follows:

  • The user will type one or more words into a text box and click the Make Prediction button.

  • If the user entered a single word, the most probable bigram starting with that word will be returned. If no such bigram exists in the table, a word will be selected at random, weighted by its probability of appearing in the corpus.

  • For phrases of two or more words, the most probable trigram starting with the last two words will be returned. If there is no entry for this combination, the most probable bigram starting with the last word will be returned. If there is no such bigram, a word will be selected at random, weighted by its probability of appearing in the corpus.
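
The steps above can be sketched in Python as follows. This is one possible reading, not a finished implementation: it assumes the trigram step matches the last two typed words and the bigram step matches the last word, and the names `build_model` and `predict_next` are hypothetical. It stores raw counts rather than proportions, which gives the same ranking among n-grams that share a starting context.

```python
import random
from collections import Counter, defaultdict


def build_model(tokens):
    """Collect word counts plus next-word counts for 1- and 2-word contexts."""
    unigrams = Counter(tokens)
    after_one = defaultdict(Counter)  # (w1,) -> counts of the following word
    after_two = defaultdict(Counter)  # (w1, w2) -> counts of the following word
    for i in range(len(tokens) - 1):
        after_one[(tokens[i],)][tokens[i + 1]] += 1
    for i in range(len(tokens) - 2):
        after_two[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return unigrams, after_one, after_two


def predict_next(text, model, rng=random):
    """Predict the next word, backing off from trigrams to bigrams to
    a frequency-weighted random word."""
    unigrams, after_one, after_two = model
    words = text.lower().split()
    # Trigram step: needs two words of context.
    if len(words) >= 2 and tuple(words[-2:]) in after_two:
        return after_two[tuple(words[-2:])].most_common(1)[0][0]
    # Bigram step: use only the last word.
    if words and (words[-1],) in after_one:
        return after_one[(words[-1],)].most_common(1)[0][0]
    # Last resort: random word weighted by corpus frequency.
    vocab = list(unigrams)
    return rng.choices(vocab, weights=[unigrams[w] for w in vocab], k=1)[0]


# Toy corpus (hypothetical).
corpus = "i like new york and i like new shoes and i like san francisco".split()
model = build_model(corpus)
print(predict_next("i like", model))   # new
print(predict_next("new york", model)) # and
print(predict_next("york", model))     # and
```

An unseen input such as "zzz" falls through to the weighted random choice, so its result varies from run to run.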