prediction_algorithm

Heyang Gong
4/26/2016

Introduction

This is a prediction algorithm instruction. The algorithm gives the predicted last word of a sentence. The core idea of this algorithm is the smoothed trigrams model for natural language processing.

trigrams model

Trigrams model is to predict the words with counting of 3-grams, \( P(w_1,w_2,...,w_n) \) is the probability of the words ocurring in sequence. Then the word after \( w_1, w_2 \) would be: \[ P(w|w_1,w_2) = \frac{count(w_1,w_2,w)}{count(w_1,w_2)} \]

Smoothing

We use a method of backoff to bigrams and ungrams to smooth those probabilities. which means:

\[ P'(w|w_1,w_2) = \lambda_1 P(w|w_1,w_2) + \lambda_2 P(w|w_1) + \lambda_3 P(w) \]

In my algorithm, we set \( \lambda_1 = 0.6 \), \( \lambda_1 = 0.3 \), \( \lambda_1 = 0.1 \). Since we are going to predict the last word, so I set \( P(w) \) be the probability among the last words in the training data.

Bootstramp

We would choose the word with the highest probility, and the caculation of the probability is very easy, but the process of find highest probability is very computational.

So we choose to do some resample to get a list of words, and get the most frequent words as the prediction. And the large number's law ensures it reasonable.

Realization

The following are steps of my algorithm:

find a proper size of 3-grams started with \( w_1,w_2 \), get the words, resample it.
find a proper size of 2-grams started with \( w_2 \), get the words, resample it.
find a proper size of words in the end of a sentence, resample it.
union those words, find the most frequent word as the prediction

Conclusion

There is always a tradeoff between the computation and accuracy, good model should have both low computation and high accuracy.

In this project, the idea of resample is so beautiful! Make the model both low computation and high accuracy.

Also, using the basic functions to build this function is very good.