Milestone Report

Executive summary

This report presents an exploratory analysis of a text data set made up of three English corpora compiled from different sources (blogs, news and Twitter). The analysis aims to understand the frequencies and distributions of words and the relationships between them, as captured by the frequencies of word pairs and triplets. The report also describes a basic n-gram model that predicts the next word from the previous one, two or three words.

Basic summaries

First, we look at word frequencies across the entire corpora and then grouped by text type. Next, we do the same for word pairs and triplets, but on a sample rather than the full corpora, because of time and computing constraints.

Quick overview of word frequencies

Total number of lines and sampling

Analysing how words combine into pairs and triplets requires a large amount of memory, so we sample 10% of the lines from each text type.

## 
##  Loaded texts sampled with 10 % proportion
## 
##  Summary:
##   text_id total_lines sampled_lines prop
## 1   blogs      899288         89929  0.1
## 2    news     1010206        101021  0.1
## 3 twitter     2360148        236015  0.1
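
As a rough illustration of the sampling step, here is a minimal Python sketch. It is not the code used for this report: the function name `sample_lines`, the fixed seed and the toy data are all assumptions made for the example, and the sample size is assumed to be the rounded proportion of the line count, which is consistent with the table above.

```python
import random

def sample_lines(lines, prop=0.1, seed=42):
    """Return a random sample of about `prop` of the lines, without replacement.

    `sample_lines`, `prop` and `seed` are illustrative names, not taken from the report.
    """
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    k = round(len(lines) * prop)       # assumed rounding rule for the sample size
    return rng.sample(lines, k)

# Toy stand-in for one of the three text files (hypothetical data).
blogs = [f"line {i}" for i in range(1000)]
sampled = sample_lines(blogs, prop=0.1)
print(len(blogs), len(sampled))
```

Sampling whole lines rather than individual words keeps each sampled line intact, so word pairs and triplets can still be counted within it.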

Quick overview of word pairs and triplets

Coverage

As seen above, there are 915,645 unique words in the entire corpora. Now let’s see how many of the most frequent words are needed to cover 50% and 90% of all word occurrences, using the cumulative sum of word proportions sorted by descending frequency.

The result is striking: just 4,267 unique words cover 50% of the corpora, a small share of the total vocabulary (0.47%). Covering 90% requires 55,984 unique words, or 6.11%.
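
The coverage calculation can be sketched in Python as follows. This is an illustration under stated assumptions, not the code behind the figures above: `words_needed` is a hypothetical helper, and the toy sentence stands in for the real corpora.

```python
from collections import Counter
from itertools import accumulate

def words_needed(counts, target):
    """Smallest number of most-frequent words whose cumulative share reaches `target`."""
    freqs = sorted(counts.values(), reverse=True)   # descending frequency order
    total = sum(freqs)
    for n, running in enumerate(accumulate(freqs), start=1):
        if running >= target * total:
            return n
    return len(freqs)

# Toy corpus (hypothetical); the real analysis would use the tokenized texts.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(tokens)
print(words_needed(counts, 0.5), words_needed(counts, 0.9))
```

Even in this tiny example, a few frequent words cover half of the text, while the rarer words are needed to reach 90%.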

Basic n-gram model

Based on the previous exploratory analysis, an n-gram model for next-word prediction was implemented. It was conceived and structured as follows:

N-gram orders. The model builds and uses four orders of n-grams, together with their frequencies:

Unigrams (1-word): single words
Bigrams (2-word): pairs like “love you”
Trigrams (3-word): triplets like “I love you”
Tetragrams (4-word): four-word sequences like “I love you very”

For prediction, it considers up to three previous words to predict the next word. Example: with input “I love you”, it looks for tetragrams starting with “I love you” to predict the fourth word.
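
To make the four n-gram orders concrete, here is a minimal Python sketch of how such frequency tables could be built. The helper names `ngrams` and `build_counts`, and the simple lowercase and whitespace tokenization, are assumptions for illustration; the report's actual preprocessing may differ.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word sequences in `tokens`, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(sentences, max_order=4):
    """Count n-grams of order 1..max_order over a list of sentences."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()   # assumed tokenization: lowercase + whitespace
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts

counts = build_counts(["I love you very much", "I love you"])
print(counts[3][("i", "love", "you")])          # trigram frequency
print(counts[4][("i", "love", "you", "very")])  # tetragram frequency
```

Counting within each sentence separately avoids creating n-grams that span two unrelated lines.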

Markov chain assumption. The model implicitly relies on the Markov property: it assumes that the next word depends only on the few words immediately before it (at most three here), not on the full text history. This is the fundamental assumption behind n-gram models.
Backoff strategy for unobserved n-grams. When a specific n-gram isn’t found, the model backs off: it tries the highest order first and falls back to progressively lower orders until it finds a match. This ensures the model always returns a prediction, even for completely unseen word combinations. Here is the prediction flow:

Try tetragram (4-gram): if found, use it
If not found, back off to trigram (3-gram)
If not found, back off to bigram (2-gram)
If not found, back off to unigram (1-gram)
If still nothing, use most common words
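
The prediction flow above can be sketched in Python as shown below. This is a simplified illustration, not the report's implementation: `train` and `predict` are hypothetical names, the training sentences are made up, and candidates are ranked by raw counts with no smoothing or discounting. In this sketch the unigram table doubles as the "most common words" fallback.

```python
from collections import Counter, defaultdict

def train(sentences, max_order=4):
    """Map each context (tuple of 0..max_order-1 words) to a Counter of next words."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_order):            # k = number of context words
                if i - k >= 0:
                    table[tuple(tokens[i - k:i])][word] += 1
    return table

def predict(table, text, top=3, max_order=4):
    """Return (predictions, n-gram order used), trying the longest context first."""
    words = text.lower().split()
    for k in range(min(len(words), max_order - 1), -1, -1):
        context = tuple(words[len(words) - k:])   # last k words; () means unigram
        if context in table:
            return [w for w, _ in table[context].most_common(top)], k + 1
    return [], 0                                  # only reached with an empty table

# Hypothetical training data for the example.
table = train([
    "could you please help me",
    "will you please help",
    "you please follow me",
    "i love you",
])
print(predict(table, "could you please"))   # context seen: tetragram is used
print(predict(table, "would you please"))   # unseen 3-word context: backs off
```

Storing every context length in one table keeps the lookup loop short, at the cost of more memory than separate, pruned tables per order.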

Testing the model

The model was built to return three predictions for every input. Below are some test inputs of different lengths with their three predicted next words. In the last input, which is three words long, the model looks for a 4-gram and fails, so it falls back to a trigram based on the last two words of the input. That’s the backoff strategy in action.

## 
## Predicting next word for: 'my'
## Input words: my 
## Number of input words: 1 
## 
## Trying bigram model with last word: 'my'
## 
## Using 2 -gram model
## Found 3 predictions
## 1. life
## 2. favorite
## 3. own
## 
## ----------------------------------------
## 
## 
## Predicting next word for: 'I love'
## Input words: i, love 
## Number of input words: 2 
## 
## Trying trigram model with context: 'i love'
## 
## Using 3 -gram model
## Found 3 predictions
## 1. you
## 2. the
## 3. it
## 
## ----------------------------------------
## 
## 
## Predicting next word for: 'the best'
## Input words: the, best 
## Number of input words: 2 
## 
## Trying trigram model with context: 'the best'
## 
## Using 3 -gram model
## Found 3 predictions
## 1. of
## 2. way
## 3. thing
## 
## ----------------------------------------
## 
## 
## Predicting next word for: 'going to'
## Input words: going, to 
## Number of input words: 2 
## 
## Trying trigram model with context: 'going to'
## 
## Using 3 -gram model
## Found 3 predictions
## 1. be
## 2. have
## 3. the
## 
## ----------------------------------------
## 
## 
## Predicting next word for: 'could you please'
## Input words: could, you, please 
## Number of input words: 3 
## 
## Trying tetragram model with context: 'could you please'
## Trying trigram model with context: 'you please'
## 
## Using 3 -gram model
## Found 3 predictions
## 1. follow
## 2. help
## 3. do
## 
## ----------------------------------------