This report presents an exploratory analysis of a text data set made up of three compiled English texts from different sources (blogs, news and Twitter). The analysis aims to understand the frequencies and distributions of words and the relationships between them, through the frequencies of word pairs and triplets. The report also builds a basic n-gram model that predicts the next word from the previous one, two or three words.
First, let’s look at word frequencies across the full corpora and then grouped by text type. Next, we will do the same for word pairs and triplets, though not on the full corpora: to save time and computing resources, we take a sample of the corpora to visualize the frequencies of word pairs and triplets.
Because analysing how often words combine into pairs and triplets requires a large amount of memory, we sample 10% of the lines from each text type.
##
## Loaded texts sampled with 10 % proportion
##
## Summary:
## text_id total_lines sampled_lines prop
## 1 blogs 899288 89929 0.1
## 2 news 1010206 101021 0.1
## 3 twitter 2360148 236015 0.1
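The sampling step can be sketched in a few lines. This is a minimal Python illustration, not the code behind the report: the function name `sample_lines`, the fixed seed and the toy data are all assumptions made for the example.

```python
import random

def sample_lines(lines, prop=0.1, seed=42):
    """Return a random subset holding round(len(lines) * prop) lines."""
    k = round(len(lines) * prop)
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample is reproducible
    return rng.sample(lines, k)  # sampling without replacement

# Toy stand-in for one text file (hypothetical data, not the real corpus).
toy_blog = [f"line {i}" for i in range(1000)]
sampled = sample_lines(toy_blog, prop=0.1)
print(len(toy_blog), len(sampled), len(sampled) / len(toy_blog))
```

Rounding `total_lines * 0.1` to the nearest integer gives the same sampled-line counts as the summary table above (for example, 899288 lines give 89929), although the report does not state how its counts were derived.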
As seen above, the entire corpora contain 915,645 unique words. Now let’s see how many of the most frequent unique words we need to cover 50% and 90% of the corpora, using the cumulative sum of word proportions sorted in descending order of frequency.
The result is interesting: we need only 4,267 unique words to cover 50% of the corpora, a small share of the total vocabulary (0.47%). To cover 90%, we need 55,984 unique words, or 6.11%.
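The coverage calculation can be illustrated on a toy example. The sketch below is a Python approximation of the idea, assuming word counts are held in a `Counter`. The helper name `words_needed` and the sample sentence are hypothetical and not taken from the report.

```python
from collections import Counter

def words_needed(counts, coverage):
    """Smallest number of most-frequent words whose counts reach `coverage` of all tokens."""
    total = sum(counts.values())
    running = 0
    # Walk the vocabulary from most to least frequent, accumulating token counts.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= coverage * total:
            return rank
    return len(counts)

# Toy corpus (hypothetical): 13 tokens, 8 unique words.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(tokens)
print(words_needed(counts, 0.5), words_needed(counts, 0.9), len(counts))
# prints: 3 7 8
```

Even in this tiny example, a few frequent words account for half of the tokens, while reaching 90% requires most of the vocabulary.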
Based on the previous exploratory analysis, an n-gram model for next-word prediction was implemented. It is structured as follows:
For prediction, the model considers up to three previous words. For example, given the input “I love you”, it looks for 4-grams (tetragrams) starting with “I love you” and predicts their fourth word.
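One way to picture this lookup is a table that maps each context to the words seen after it. The Python sketch below is only an illustration under that assumption. The names `build_ngram_table` and `top_next_words` and the toy sentence are hypothetical, and the report's actual data structures are not shown.

```python
from collections import Counter, defaultdict

def build_ngram_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def top_next_words(table, context, k=3):
    """Return up to k most frequent continuations of a context (empty list if unseen)."""
    followers = table.get(tuple(context), Counter())
    return [word for word, _ in followers.most_common(k)]

# Toy training text (hypothetical data).
toy = "i love you so much and i love you so well and i love you too".split()
tetragrams = build_ngram_table(toy, 4)

print(top_next_words(tetragrams, ["i", "love", "you"]))
# prints: ['so', 'too']
print(top_next_words(tetragrams, ["could", "you", "please"]))
# prints: []
```

An empty result for an unseen context is the case a backoff strategy has to handle.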
Markov chain assumption: The model implicitly relies on the Markov property. It assumes that the next word depends only on the few words immediately before it (here, at most three), not on the full text history. This is the fundamental assumption behind n-gram models.
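Written out for a 4-gram model, the assumption and the usual count-based estimate look as follows. The notation is introduced here for illustration: $w_i$ is the word at position $i$ and $C(\cdot)$ is the number of times a word sequence occurs in the training sample. The report does not say whether its model stores probabilities or raw counts. For a fixed context, both give the same ranking of candidate words.

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})

\hat{P}(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})
  = \frac{C(w_{i-3}\, w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-3}\, w_{i-2}\, w_{i-1})}
```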
Backoff strategy for unobserved n-grams: When a specific n-gram isn’t found, the model backs off. It tries the highest-order model that the input length allows and, if that finds no match, drops the earliest context word and tries the next lower order, repeating until it finds a match. This ensures the model always returns a prediction, even for completely unseen word combinations.
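The backoff idea can be sketched as a loop over model orders. This Python example is an illustration and not the report's implementation. The names `build_tables` and `predict` are hypothetical, and the final fallback to the most frequent single words (the 1-gram table) is an assumption, since the report does not say what the model returns when no context matches.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=4):
    """tables[n] maps an (n-1)-word context to a Counter of next words, for n = 1..max_n."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])  # empty tuple when n == 1
            tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict(tables, text, k=3, max_n=4):
    """Try the highest usable order first, then back off to shorter contexts."""
    words = text.lower().split()
    for n in range(min(len(words) + 1, max_n), 0, -1):
        # Keep only the last n-1 input words as context.
        context = tuple(words[len(words) - (n - 1):])
        followers = tables[n].get(context)
        if followers:
            return n, [w for w, _ in followers.most_common(k)]
    return 0, []  # only reached when the training data is empty

# Toy training text (hypothetical data).
toy = "could you help me please follow me and you please help me".split()
tables = build_tables(toy)

print(predict(tables, "could you please"))
# prints: (3, ['help'])
```

Here the 4-gram context “could you please” never occurs in the toy text, so the function falls back to the trigram context “you please”, the same pattern as the last test input in the report.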
The model returns three predictions for every input. Below are some test inputs of different lengths with their three predicted next words. For the last input, which is three words long, the model first looks for a 4-gram and fails, then falls back to a trigram built on the last two words of the input. That’s the backoff strategy in action.
##
## Predicting next word for: 'my'
## Input words: my
## Number of input words: 1
##
## Trying bigram model with last word: 'my'
##
## Using 2 -gram model
## Found 3 predictions
## 1. life
## 2. favorite
## 3. own
##
## ----------------------------------------
##
##
## Predicting next word for: 'I love'
## Input words: i, love
## Number of input words: 2
##
## Trying trigram model with context: 'i love'
##
## Using 3 -gram model
## Found 3 predictions
## 1. you
## 2. the
## 3. it
##
## ----------------------------------------
##
##
## Predicting next word for: 'the best'
## Input words: the, best
## Number of input words: 2
##
## Trying trigram model with context: 'the best'
##
## Using 3 -gram model
## Found 3 predictions
## 1. of
## 2. way
## 3. thing
##
## ----------------------------------------
##
##
## Predicting next word for: 'going to'
## Input words: going, to
## Number of input words: 2
##
## Trying trigram model with context: 'going to'
##
## Using 3 -gram model
## Found 3 predictions
## 1. be
## 2. have
## 3. the
##
## ----------------------------------------
##
##
## Predicting next word for: 'could you please'
## Input words: could, you, please
## Number of input words: 3
##
## Trying tetragram model with context: 'could you please'
## Trying trigram model with context: 'you please'
##
## Using 3 -gram model
## Found 3 predictions
## 1. follow
## 2. help
## 3. do
##
## ----------------------------------------