Heyang Gong
4/26/2016
This is a prediction algorithm instruction. The algorithm gives the predicted last word of a sentence. The core idea of this algorithm is the smoothed trigrams model for natural language processing.
Trigrams model is to predict the words with counting of 3-grams, \( P(w_1,w_2,...,w_n) \) is the probability of the words ocurring in sequence. Then the word after \( w_1, w_2 \) would be: \[ P(w|w_1,w_2) = \frac{count(w_1,w_2,w)}{count(w_1,w_2)} \]
We use a method of backoff to bigrams and ungrams to smooth those probabilities. which means:
\[ P'(w|w_1,w_2) = \lambda_1 P(w|w_1,w_2) + \lambda_2 P(w|w_1) + \lambda_3 P(w) \]
In my algorithm, we set \( \lambda_1 = 0.6 \), \( \lambda_1 = 0.3 \), \( \lambda_1 = 0.1 \). Since we are going to predict the last word, so I set \( P(w) \) be the probability among the last words in the training data.
We would choose the word with the highest probility, and the caculation of the probability is very easy, but the process of find highest probability is very computational.
So we choose to do some resample to get a list of words, and get the most frequent words as the prediction. And the large number's law ensures it reasonable.
The following are steps of my algorithm:
There is always a tradeoff between the computation and accuracy, good model should have both low computation and high accuracy.
In this project, the idea of resample is so beautiful! Make the model both low computation and high accuracy.
Also, using the basic functions to build this function is very good.