WordPredict - Capstone Project

Wendy Sarrett
06/15/17

Introduction

WordPredict's purpose is to use NLP methods (in particular, a "Stupid Backoff" algorithm) to take a phrase and predict the next word.

  • A key part of the work was creating the n-gram tables; we were limited to 1 GB due to shinyapps.io's limitations.
  • We used SQLite to store our tables (see the sketch after this list). The table-building work was done offline, so the actual prediction is very fast: our tests indicate under 2 seconds on average.
  • Experimentation showed that using the most common phrases for the n-gram tables led to better results, trading off what percentage of the data is initially used to create the n-grams against what percentage of the most common features is selected for the final tables.
  • We also tried the "Katz" backoff method, but the results were slightly worse in terms of percent correct, sum of log probabilities, etc.
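
Below is a minimal R sketch of the offline table-building step. The database file, the ngram_3 table, its columns, and the prefix-keyed schema (one table per n-gram order) are hypothetical illustrations, not the project's actual layout.

    # Minimal sketch of building one n-gram table offline; names and schema
    # are assumptions for illustration.
    library(DBI)
    library(RSQLite)

    con <- dbConnect(RSQLite::SQLite(), "ngrams.db")

    # Hypothetical 3-gram rows: a two-word prefix, the following word, and
    # how often that 3-gram occurred in the training data.
    trigrams <- data.frame(
      prefix    = c("of the", "of the"),
      next_word = c("best", "most"),
      count     = c(120L, 95L)
    )
    dbWriteTable(con, "ngram_3", trigrams, overwrite = TRUE)

    # Indexing the prefix keeps prediction-time lookups fast.
    dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_ngram_3 ON ngram_3(prefix)")

    dbDisconnect(con)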

The Model

  • Our model is based on Stupid Backoff: we start with 5-grams and back off from there.
  • If the n-gram is matched, calculate a score for each candidate as (count / total count) × discount and select the highest-scoring (most common) word. The discount is initially 1 (i.e., no discount).
  • If no match is found, back off to the (n-1)-gram table, multiply the discount by 0.4, and again score candidates as (count / total count) × discount. If even the 1-gram lookup fails, return "not found" (a sketch of this backoff logic appears after this list).
  • Selecting the right training set was key: we settled on 80% of the news, 80% of the blogs, and 70% of the tweets. From that, we took the top 30% most frequent n-grams (features) at each level.
  • Accuracy was strongly affected by the size of the database: in tests with databases over 1 GB, accuracy improved greatly (some tests were 70-80% correct). Our final database was 924 MB. This makes sense given that the algorithm depends on matching an n-gram in the database.
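
Here is a short R sketch of the backoff logic described above, assuming the hypothetical ngram_<n>(prefix, next_word, count) tables from the earlier sketch. The 0.4 discount and the "not found" fallback come from the bullets; the function name and schema are illustrative, not the project's actual code.

    # Stupid Backoff sketch: score candidates as (count / total count) *
    # discount, multiplying the discount by 0.4 on each backoff step.
    library(DBI)
    library(RSQLite)

    predict_next <- function(con, phrase, max_n = 5) {
      words <- strsplit(tolower(phrase), "\\s+")[[1]]
      if (length(words) == 0) return("not found")
      discount <- 1
      for (n in seq(min(max_n, length(words) + 1), 2)) {
        prefix <- paste(tail(words, n - 1), collapse = " ")
        rows <- dbGetQuery(
          con,
          sprintf("SELECT next_word, count FROM ngram_%d WHERE prefix = ?", n),
          params = list(prefix)
        )
        if (nrow(rows) > 0) {
          # Score every candidate and return the highest-scoring word.
          scores <- (rows$count / sum(rows$count)) * discount
          return(rows$next_word[which.max(scores)])
        }
        discount <- discount * 0.4  # back off to the (n - 1)-gram table
      }
      # Unigram fallback: the most frequent word overall; if even that
      # (hypothetical) table is empty, report "not found".
      uni <- dbGetQuery(con, "SELECT next_word FROM ngram_1
                              ORDER BY count DESC LIMIT 1")
      if (nrow(uni) > 0) uni$next_word[1] else "not found"
    }

For example, predict_next(con, "one of the") would try the 4-gram table with prefix "one of the" first and back off from there.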

Additional Information

Correct   Incorrect   Percent Correct   Not Found   Correct Sum Ln/N   Incorrect Sum Ln/N
837       2110        28.40             53          1.60               3.70
  • The data was divided into a training set (75%) and a test set (25%, used for creating 5-grams for testing). The training set was used to build the n-gram database our algorithm relies on.
  • We ran a test with 3,000 random 5-grams selected from the test set. We tracked percent correct and the log sum (cost function) for correct versus incorrect answers (see the evaluation sketch after this list). Note that we left out the tests where nothing was found (53 of the 3,000 samples). This was 1.77% of the input, mostly due to bad data (misspellings, odd characters at the end of a word, etc.).
  • The fact that the cost function for correct answers (1.60) is much lower than for incorrect answers (3.70) indicates that the model tended to predict correctly when it assigned a higher probability to its answer.
  • As mentioned, we tried a version of “Katz” backoff but the results were slightly worse.
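
To make the table above concrete, here is a small R sketch of the evaluation summary, assuming "Sum Ln/N" denotes the negative mean natural-log score of the predicted word; the results data frame and its columns are hypothetical.

    # Hypothetical `results` data frame: one row per test 5-gram, with the
    # predicted word's score, whether the prediction matched (correct), and
    # whether nothing was found (not_found).
    evaluate <- function(results) {
      found <- results[!results$not_found, ]  # drop the "not found" cases
      data.frame(
        percent_correct    = 100 * mean(found$correct),
        correct_sum_ln_n   = -mean(log(found$score[found$correct])),
        incorrect_sum_ln_n = -mean(log(found$score[!found$correct]))
      )
    }

Under this reading, lower cost means higher assigned probability, matching the interpretation in the bullet above.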

Usage Information

Here is what WordPredict looks like:

[Screenshot of the WordPredict app]

The app is available at https://cepwin.shinyapps.io/WordPredict/

To use:
  • Enter a phrase without punctuation or double quotes
  • Click on Submit
  • The predicted word will appear under "Next Word"

Outwardly it appears very simple, but the hard work was in creating the n-gram tables and writing the backoff algorithm.