Introduction

The goal of the capstone project is to create an algorithm that predicts the three most likely words to follow a user’s input. Most of the code is hidden to keep this report concise. More info: https://www.coursera.org/learn/data-science-project/home/welcome

Data

I started by loading the data and the necessary libraries.

In the above code, I loaded the data, combined the three files, and, for this assignment, kept only a sample of 1/1000th of the entire dataset so that the code runs faster. I split the data into a training set (75% of the data) and a testing set (25% of the data). As can be seen below, the combined file has approximately 1,000 times as many lines as the sample data.
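Since the code is hidden, here is a minimal Python sketch of the sampling and splitting step described above. It is not the code used for this report: the function name `sample_and_split`, the fixed seed, and the toy corpus are assumptions made for illustration only.

```python
import random

def sample_and_split(lines, sample_frac=0.001, train_frac=0.75, seed=42):
    """Draw a random sample of the lines, then split it into train and test."""
    rng = random.Random(seed)                 # fixed seed so the split is reproducible
    k = max(1, round(len(lines) * sample_frac))
    sample = rng.sample(lines, k)             # random order, without replacement
    cut = int(len(sample) * train_frac)
    return sample[:cut], sample[cut:]

# Toy stand-in for the combined corpus (hypothetical data, not the real files).
corpus = [f"line {i}" for i in range(100_000)]
train, test = sample_and_split(corpus)
print(len(corpus), len(train), len(test))
```

Because `rng.sample` already returns the lines in random order, slicing the sample at the 75% mark is enough to get a random split.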

I continued by removing numbers, symbols, etc., and split both the training and testing sets into unigrams, bigrams, trigrams, and quadgrams. The plots below show the four n-grams.
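The cleaning and n-gram step can be sketched as follows. This is an illustration under stated assumptions, not the hidden code: the exact cleaning rules are not given in the report, so the lowercasing and the regular expression below (keep letters, apostrophes, and spaces) are my own choices, and `clean` and `ngram_counts` are hypothetical helper names.

```python
import re
from collections import Counter

def clean(line):
    """Lowercase a line and drop digits and symbols (assumed cleaning rules)."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)    # anything but letters, ' and space
    return line.split()

def ngram_counts(lines, n):
    """Count n-grams of order n; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

example = ["I am 25, and I am happy!"]
unigrams = ngram_counts(example, 1)
bigrams = ngram_counts(example, 2)
print(bigrams.most_common(1))
```

One simplification to be aware of: once a number is removed, the words on either side of it become neighbours, so "am 25, and" contributes the bigram ("am", "and").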

Algorithm

I then built the algorithm using Kneser-Ney (KN) smoothing. KN has its roots in absolute discounting: a fixed discount is subtracted from the count of every observed n-gram in order to save some probability mass for the smoothing algorithm to distribute to the unseen n-grams. I used a discount value of 0.75. For its lower-order estimates, KN focuses on how likely a word is to appear as a novel continuation (in how many distinct contexts it has been seen), which gives better estimates for word sequences that are unseen in our corpus.
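To make the idea concrete, here is a minimal sketch of interpolated Kneser-Ney for bigrams only, with the discount of 0.75. It is a simplified illustration, not the implementation behind this report (which also covers trigrams and quadgrams); `kn_bigram_model` and the toy counts are hypothetical.

```python
from collections import Counter, defaultdict

def kn_bigram_model(bigram_counts, discount=0.75):
    """Return prob(w, v), the interpolated Kneser-Ney estimate of P(w | v)."""
    context_total = Counter()         # c(v): how often v occurs as a context
    followers = defaultdict(set)      # distinct words seen after v
    preceders = defaultdict(set)      # distinct words seen before w
    for (v, w), c in bigram_counts.items():
        context_total[v] += c
        followers[v].add(w)
        preceders[w].add(v)
    n_types = len(bigram_counts)      # number of distinct bigram types

    def prob(w, v):
        # Continuation probability: share of bigram types that end in w.
        p_cont = len(preceders.get(w, ())) / n_types
        if context_total[v] == 0:     # unseen context: continuation term only
            return p_cont
        discounted = max(bigram_counts.get((v, w), 0) - discount, 0) / context_total[v]
        # Mass freed by discounting, redistributed through p_cont.
        backoff_weight = discount * len(followers[v]) / context_total[v]
        return discounted + backoff_weight * p_cont

    return prob

# Toy counts (made up for illustration).
toy = Counter({("i", "am"): 3, ("i", "was"): 1, ("you", "are"): 2, ("we", "are"): 1})
p = kn_bigram_model(toy)
for word in ("am", "was", "are"):
    print(word, p(word, "i"))
```

In this toy example "are" never follows "i", yet it receives more probability than "was", which does: "are" has been seen after two different words, so it is a more likely novel continuation. The three probabilities also sum to 1, which is a quick sanity check on the discounting.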

Words

I then predict the three most likely words using the probabilities for each n-gram, calculated with the KN algorithm.

Input          Output
“of”           “the”, “a”, “my”
“i am”         “not”, “the”, “a”
“one of the”   “most”, “many”, “last”
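One possible way to turn the n-gram probabilities into three predictions is sketched below. The report does not say how the four n-gram models are combined, so the strategy here (try the longest matching context first, then fill any remaining slots from shorter contexts) is an assumption, and `predict_next`, the table layout, and every probability in `tables` are made up for illustration.

```python
def predict_next(text, tables, k=3, max_context=3):
    """Return up to k candidate next words, preferring longer contexts."""
    tokens = text.lower().split()
    picks = []
    # n = number of context words: 3 (quadgram table) down to 0 (unigram table).
    for n in range(min(max_context, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - n:])
        candidates = tables.get(context, {})
        for word in sorted(candidates, key=candidates.get, reverse=True):
            if word not in picks:
                picks.append(word)
            if len(picks) == k:
                return picks
    return picks

# Hypothetical probability tables: context tuple -> {next word: probability}.
tables = {
    ("i", "am"): {"not": 0.40, "so": 0.20},
    ("am",): {"not": 0.30, "a": 0.25, "the": 0.10},
    (): {"the": 0.05, "to": 0.03, "and": 0.02},
}
print(predict_next("I am", tables))
print(predict_next("hello", tables))
```

With these toy tables, the first call takes "not" and "so" from the two-word context and "a" from the one-word context; the second call finds no match for "hello" and falls back to the unigram table.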

Perplexity

I finish by calculating the perplexity of my model for all four n-grams. Perplexity measures how well the model predicts text, and lower values are better. I removed from the calculation probabilities that are extremely low (less than 0.05%; I also removed those values from my KN probabilities). As can be seen below, increasing the n-gram order reduces the perplexity, but the difference between trigram and quadgram is small.

N-gram     Perplexity
Unigram    783.5
Bigram     82.6
Trigram    53.5
Quadgram   50.1
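For reference, perplexity is the exponential of the average negative log-probability the model assigns to each word. The sketch below shows that calculation with the 0.05% cut-off mentioned above; it is an illustration rather than the hidden code, and the function name `perplexity` and the input probabilities are hypothetical.

```python
import math

def perplexity(probs, floor=0.0005):
    """Perplexity of a list of per-word probabilities, ignoring those below floor."""
    kept = [p for p in probs if p >= floor]   # 0.05% = 0.0005
    if not kept:
        raise ValueError("no probabilities at or above the floor")
    avg_log = sum(math.log(p) for p in kept) / len(kept)
    return math.exp(-avg_log)

# Made-up probabilities; the last one is below the floor and is dropped.
print(perplexity([0.5, 0.125, 0.0001]))
```

Dropping the very low probabilities lowers the resulting perplexity, so values computed this way are best compared with each other rather than with perplexities computed without a cut-off.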