12/03/14
For prediction I used Katz's back-off model with Good-Turing smoothing. The model goes through the following steps:
1) Calculate the total frequency of words appearing after the highest-order N-gram (a trigram in this application). If the total frequency is zero, calculate the same frequency for the (N-1)-gram. The process continues until a non-zero frequency is found for some \( i \)-gram. If the context is out-of-vocabulary, the application generates a random stopword;
2) Estimate the conditional probabilities \( P(\omega_n|\omega_{n-1},...,\omega_{n-i}) \), where \( i \) is the N-gram order found in step 1.
The probability \( P(\omega_n|\omega_{n-1},...,\omega_{n-i}) \) is estimated with the Good-Turing formula: \[ P(\omega_n|\omega_{n-1},...,\omega_{n-i})=\frac{(c+1)}{K}\frac{N_{c+1}}{N_c}, \] where \( c \) is the frequency of the word appearing after the N-gram, \( N_j \) is the frequency of frequencies (the number of (N+1)-grams that appear \( j \) times), and \( K \) is the total (N+1)-gram count;
3) Multiply the probability from step 2 by the discounting function: \[ D=\prod_{k=i}^{j}{d_k}=\prod_{k=i}^{j}\left(1-\sum_{\omega_n}{\alpha(\omega_n|\omega_{n-1},...,\omega_{n-k})}\right), \] where \( j \) is the highest N-gram order and \( \alpha(\omega_n|\omega_{n-1},...,\omega_{n-k})=(c+1)\frac{N_{c+1}}{N_c} \) is the discounting rate, which can be interpreted as the scaling of Good-Turing frequencies between N-gram orders;
4) Among the candidate words, choose the one with the highest probability from step 3. This word serves as the prediction for the next word (a minimal Python sketch of the procedure is given after this list).
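Below is a minimal Python sketch of steps 1, 2 and 4 under assumed data structures: `continuation_counts[k]` maps a k-word context tuple to a counter of continuation words, and `STOPWORDS` is an assumed fallback list. The discounting factor \( D \) from step 3 is constant for a fixed context, so it does not change which word ranks highest and is omitted here; this is an illustration of the technique, not the app's actual code.

```python
import random
from collections import Counter, defaultdict

# Assumed data layout: continuation_counts[k][context] is a Counter of words
# seen after that k-word context; for a trigram model the longest context is
# two words. STOPWORDS is an assumed fallback list.
STOPWORDS = ["the", "and", "to", "of", "a"]
continuation_counts = {1: defaultdict(Counter), 2: defaultdict(Counter)}

def good_turing_prob(c, freq_of_freq, total):
    """Good-Turing estimate P = (c + 1) / K * N_{c+1} / N_c (step 2)."""
    n_c = freq_of_freq.get(c, 0)
    n_c1 = freq_of_freq.get(c + 1, 0)
    if n_c == 0 or n_c1 == 0:
        # fall back to the maximum-likelihood estimate when the
        # frequency-of-frequencies table has gaps
        return c / total if total else 0.0
    return (c + 1) / total * n_c1 / n_c

def predict(context, longest_context=2):
    """Back off from the longest context until a non-empty continuation table
    is found (step 1), score every candidate (step 2), pick the best (step 4)."""
    for k in range(min(longest_context, len(context)), 0, -1):
        continuations = continuation_counts[k].get(tuple(context[-k:]))
        if continuations:
            total = sum(continuations.values())
            freq_of_freq = Counter(continuations.values())
            scored = {w: good_turing_prob(c, freq_of_freq, total)
                      for w, c in continuations.items()}
            return max(scored, key=scored.get)
    # out-of-vocabulary context: generate a random stopword, as in step 1
    return random.choice(STOPWORDS)
```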
The app predicts the top 3 most probable words based on the N-gram model and shows their probabilities alongside the words. If there are not enough words to base a prediction on, it falls back to a random frequent word, which is denoted by *. A button places the chosen word into the text area.
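A top-3 variant in the same spirit, reusing `good_turing_prob`, `continuation_counts`, and `STOPWORDS` from the sketch above; the `*` marking of the random fallback word mirrors the app's description and is an assumption about its code.

```python
def predict_top3(context, longest_context=2):
    """Return up to three (word, probability) pairs, like the app's display."""
    for k in range(min(longest_context, len(context)), 0, -1):
        continuations = continuation_counts[k].get(tuple(context[-k:]))
        if continuations:
            total = sum(continuations.values())
            freq_of_freq = Counter(continuations.values())
            scored = [(good_turing_prob(c, freq_of_freq, total), w)
                      for w, c in continuations.items()]
            scored.sort(reverse=True)
            return [(w, p) for p, w in scored[:3]]
    # not enough context: a random frequent word, marked with * as in the app
    return [("*" + random.choice(STOPWORDS), 0.0)]
```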
The text input accepts English text only. When typing a sentence manually, there may be a lag of several seconds between new predictions due to an optimization issue.
Shortcomings: