2023-10-28

Source data

For this project, the course provided three datasets (news, Twitter, and blogs) in text format. I joined the three datasets and extracted a sample of 20% of the whole data; this sample was then divided into train and test sets. The train set was used to find bi-grams, three-grams, four-grams and five-grams. Here we can see the most frequent bi-grams and three-grams.
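
The sampling and n-gram counting described above can be sketched as follows. This is a minimal Python illustration, not the code used in the project: the function names, the whitespace tokenizer, the random seed and the 20% test share are all assumptions (the report only states the 20% sample).

```python
import random
from collections import Counter

def sample_and_split(lines, sample_frac=0.2, test_frac=0.2, seed=42):
    """Draw a random sample of the lines, then split it into train and test.

    test_frac and seed are assumed values; the report does not give them.
    """
    rng = random.Random(seed)
    sample = rng.sample(lines, int(len(lines) * sample_frac))
    cut = len(sample) - int(len(sample) * test_frac)
    return sample[:cut], sample[cut:]

def count_ngrams(lines, n):
    """Count n-grams (tuples of lowercase tokens) inside each line."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()  # naive whitespace tokenizer (assumption)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus standing in for the joined news, Twitter and blogs data.
corpus = ["the cat sat on the mat", "the cat ate the fish"] * 50
train, test = sample_and_split(corpus)
bigrams = count_ngrams(train, 2)
top_bigrams = bigrams.most_common(3)  # most frequent bi-grams in the train set
```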

Cleaning and trimming data

To make the model faster and reduce memory consumption, some decisions were taken during the development of the model. The final decisions are listed below.

  1. The words that appear once and are present in the English dictionary were considered unknown words. These were replaced by “UNK”, and their probability was computed in the same way as for the rest of the words.
  2. From the remaining words I removed all the words with a frequency lower than 220.
  3. Some n-grams were also deleted, as follows:
    • bi-grams with a frequency less than or equal to 60.
    • three-grams with a frequency less than or equal to 18.

Four-grams and five-grams were also reduced in the process; however, they were not used at all in the final model.
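
As an illustration of the trimming rules listed above, here is a small Python sketch. The function names and the toy data are hypothetical, and the handling of words that appear once but are not in the dictionary (they are simply dropped by the frequency threshold) is an assumption about how the two word rules combine.

```python
from collections import Counter

UNK = "UNK"

def trim_vocabulary(word_counts, dictionary, min_count=220):
    """Apply the two word-level rules to a table of word counts."""
    trimmed = Counter()
    for word, count in word_counts.items():
        if count == 1 and word in dictionary:
            trimmed[UNK] += 1        # rule 1: once-seen dictionary words become UNK
        elif count >= min_count:
            trimmed[word] = count    # rule 2: keep words with frequency >= 220
    return trimmed

def prune_ngrams(ngram_counts, max_dropped):
    """Drop every n-gram whose count is less than or equal to max_dropped."""
    return Counter({g: c for g, c in ngram_counts.items() if c > max_dropped})

# Toy counts; the thresholds 220, 60 and 18 come from the list above.
words = Counter({"the": 500, "zebra": 1, "xqz": 1, "cat": 100})
vocab = trim_vocabulary(words, dictionary={"the", "zebra", "cat"})
kept_bigrams = prune_ngrams(Counter({("of", "the"): 61, ("a", "cat"): 60}), 60)
kept_trigrams = prune_ngrams(Counter({("one", "of", "the"): 19}), 18)
```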

Model creation

I used the Katz back-off approach with Good-Turing as the smoothing technique. First, I created a function to perform Good-Turing smoothing, which redistributes the counts of the n-grams. After calling the function we have a table with the n-gram count, the new estimated probability, and a variable d, the amount of discounting. I did this for bi-grams, three-grams, four-grams and five-grams.
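
A minimal sketch of how the discount d might be computed is given below. It uses the basic Good-Turing estimate c* = (c + 1) N(c+1) / N(c), where N(c) is the number of n-grams seen exactly c times, and d = c* / c. The cutoff k = 5 (counts above it are left undiscounted, as is common with Katz back-off) and the cap of d at 1 are assumptions, and the function name is hypothetical; the project's own function may differ.

```python
from collections import Counter

def good_turing_discounts(ngram_counts, k=5):
    """Return {count: d}, with d = c*/c for counts up to k and d = 1 otherwise."""
    freq_of_freq = Counter(ngram_counts.values())  # N(c): n-grams seen c times
    discounts = {}
    for c in freq_of_freq:
        n_c, n_next = freq_of_freq[c], freq_of_freq[c + 1]
        if c <= k and n_next > 0:
            c_star = (c + 1) * n_next / n_c        # adjusted count c*
            discounts[c] = min(c_star / c, 1.0)    # cap at 1 (assumption)
        else:
            discounts[c] = 1.0                     # no discount for this count
    return discounts

# Toy table: four bi-grams seen once and one bi-gram seen twice.
toy_bigrams = Counter({("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
                       ("d", "e"): 1, ("e", "f"): 2})
d = good_turing_discounts(toy_bigrams)
```

In this toy table N(1) = 4 and N(2) = 1, so the adjusted count for c = 1 is 2 × 1 / 4 = 0.5 and d = 0.5; no bi-gram is seen three times, so the count 2 keeps d = 1.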

In the next step, a function was created to predict the next word given a phrase. This function takes the phrase and the n-grams previously created and assigns a probability to every word present in the corpus: the probability of being the next word given its history, using the Katz back-off approach. In the Shiny app, the model checks for the history of the phrase in the three-grams; when it is not present, it then checks the bi-grams.
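
The lookup order used in the app (three-grams first, then bi-grams) can be sketched in Python as below. This is a simplified illustration, not the project's function: it only ranks words that were actually seen after the history, using the discounted relative frequency d(c) · c / total, and it omits the back-off weight that a full Katz model uses to score unseen continuations. The name predict_next and its parameters are hypothetical.

```python
from collections import Counter

def predict_next(phrase, trigrams, bigrams, discount=lambda c: 1.0, top=3):
    """Return up to `top` candidate next words for the phrase."""
    tokens = phrase.lower().split()
    for n, table in ((3, trigrams), (2, bigrams)):
        history = tuple(tokens[-(n - 1):])
        if len(history) < n - 1:
            continue                      # phrase too short for this order
        followers = {g[-1]: c for g, c in table.items() if g[:-1] == history}
        if followers:                     # history found: do not back off
            total = sum(followers.values())
            scored = {w: discount(c) * c / total for w, c in followers.items()}
            # Highest score first; ties broken alphabetically.
            return sorted(scored, key=lambda w: (-scored[w], w))[:top]
    return []

# Toy n-gram tables.
toy_trigrams = Counter({("i", "love", "you"): 5, ("i", "love", "it"): 3,
                        ("i", "love", "cake"): 1, ("i", "love", "them"): 1})
toy_bigrams = Counter({("love", "you"): 7, ("hate", "it"): 2})

from_trigram = predict_next("I love", toy_trigrams, toy_bigrams)
from_bigram = predict_next("we hate", toy_trigrams, toy_bigrams)
```

The first call finds the history "i love" in the three-grams; the second does not find "we hate" there and falls back to the bi-gram history "hate".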

Accuracy and perplexity

Using the same Katz back-off approach, I found the probability of the bi-grams, three-grams, four-grams and five-grams generated from a small portion of the test set. I then estimated the perplexity and obtained the values given in the table below.

n-grams       perplexity
two-gram      1351.5693
three-gram    1000.1566
four-gram      984.8729
five-gram      979.0458
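
For reference, perplexity is usually computed as the exponential of the average negative log-probability that the model assigns to the test items. The Python sketch below assumes this standard definition; the report does not state the exact formula used, and the function name is hypothetical.

```python
import math

def perplexity(probabilities):
    """Perplexity of a list of model probabilities, one per test n-gram."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))

# A model that gives every test item probability 0.25 is as uncertain
# as a uniform choice among four words.
toy_perplexity = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower values mean the model is less surprised by the test data, which matches the downward trend in the table as the n-gram order grows.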


To check the accuracy, I separated the final word from each test n-gram and used the model to find the three words with the highest probability given the first part of the phrase. The original words and predicted words were used to compute the accuracy of the model. Considering the first three predicted words, these values ranged from 0.21 to 0.27. The final model has an accuracy of 0.26.
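
The accuracy check described above amounts to a top-3 accuracy, which can be sketched in Python as follows. The function name, the toy predictor and the toy test n-grams are hypothetical; in the project the predictor would be the Katz back-off model.

```python
def top_k_accuracy(test_ngrams, predict, k=3):
    """Share of test n-grams whose final word is among the first k predictions."""
    hits = 0
    for ngram in test_ngrams:
        history, target = ngram[:-1], ngram[-1]   # split off the final word
        if target in predict(history)[:k]:
            hits += 1
    return hits / len(test_ngrams)

def toy_predict(history):
    """Stand-in predictor that always returns the same three words."""
    return ["the", "you", "it"]

toy_test = [("i", "love", "you"), ("on", "top", "of"),
            ("i", "like", "it"), ("see", "you", "soon")]
toy_accuracy = top_k_accuracy(toy_test, toy_predict)
```

Two of the four toy n-grams end in a word from the predicted list, so the toy accuracy is 0.5.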

Final product

I developed a simple Shiny app where you can predict a word given a phrase; you can access the Shiny app here. In the left panel, type a phrase, checking that the spelling is correct. After a few seconds, the right panel will show the three words predicted by the model.
