Coursera Data Science Capstone presentation

Arshad Ullah
Oct 17th, 2019

Predict Word App

Presenting the Predict Word App, an app that predicts the next word in a sentence as you type. The app shows a choice of five words; click any of them to append it to your sentence. The app is available at: https://azullah.shinyapps.io/predictWord/

Some of the unique features of the app are as follows.

  • The top five predicted words are displayed as buttons; clicking a button appends that word to the input text.
  • The input text field accepts a sentence or phrase of any length. Words are predicted as you type.
  • A CLEAR button is included to empty the input text field.
  • Performs better than Microsoft Windows 10 prediction (based on testing with the canned sentences provided in a similar prediction app created by Phil Ferriere).

Note: on first use the app may take about 20-25 seconds to load the data tables into the global environment. Once the data is loaded, the predicted words for the initial test sentence appear on the screen. The loaded tables are then available across sessions, and the response time thereafter is around 20 ms.

The Model

The underlying language model is the N-gram language model, as explained in Chapter 3 of Jurafsky and Martin's book Speech and Language Processing.

The model implemented is a 5-gram language model that uses the “stupid backoff” method to back off to lower-order N-grams when a given N-gram does not exist in the model corpus. A fixed alpha (backoff factor) of 0.4 is used to calculate the predicted probability.

The normalized probability of each N-gram, from uni-grams up to penta-grams, was pre-calculated during model building, using uni-grams, bi-grams and tri-grams as the basic building blocks. The predicted probability is calculated at runtime, when the prediction function executes.
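
The backoff rule described above can be sketched as follows. This is an illustrative Python sketch, not the app's R code: the names `build_counts` and `stupid_backoff`, and the use of raw counts rather than pre-calculated tables, are assumptions made for the example. Only the 5-gram order and alpha = 0.4 come from the text.

```python
from collections import Counter

ALPHA = 0.4      # fixed backoff factor, as stated above
MAX_ORDER = 5    # highest N-gram order in the model


def build_counts(tokens, max_order=MAX_ORDER):
    """Count every N-gram of order 1..max_order in a list of tokens."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(context, word, counts, total_tokens,
                   alpha=ALPHA, max_order=MAX_ORDER):
    """Score `word` given the preceding tokens in `context`."""
    # Keep at most max_order - 1 words of context.
    context = tuple(context)[-(max_order - 1):] if max_order > 1 else ()
    penalty = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # Relative frequency at this order, discounted once per backoff.
            return penalty * counts[ngram] / counts[context]
        context = context[1:]   # drop the oldest word and back off
        penalty *= alpha
    # No context left: fall back to the unigram relative frequency.
    return penalty * counts[(word,)] / total_tokens
```

Each backoff step multiplies the score by alpha, so a match found two orders lower is discounted by 0.4 × 0.4 = 0.16.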

To reduce the model size (because of restrictions on the Shiny hosting server and limitations of the development machine), N-grams of order higher than 3 (quad-grams and penta-grams) were limited to terms/features with a minimum frequency of 2.
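
A minimal sketch of this pruning step, in Python for illustration (the app itself is written in R). The function name `prune_high_order` and the dict-of-tuples representation are assumptions; the thresholds mirror the text.

```python
def prune_high_order(counts, min_count=2, min_order=4):
    """Drop N-grams of order >= min_order that occur fewer than min_count times.

    `counts` maps token tuples to frequencies; lower-order N-grams are kept as-is.
    """
    return {
        ngram: count
        for ngram, count in counts.items()
        if len(ngram) < min_order or count >= min_count
    }
```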

The final model was built from a random sample of about 20% of the Blogs and News training data and about 10% of the Twitter training data.
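
One simple way to draw such a sample is to keep each line independently with the given probability. The Python sketch below is illustrative only: the function name, the seed, and the per-line sampling strategy are assumptions, not necessarily how the app's data was sampled.

```python
import random

# Approximate sampling rates stated above.
SAMPLE_RATES = {"blogs": 0.20, "news": 0.20, "twitter": 0.10}


def sample_lines(lines, rate, seed=1234):
    """Keep each line with probability `rate`; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]
```

The result holds roughly `rate * len(lines)` lines rather than an exact count, which is usually acceptable for building a training corpus.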

The Data

The corpus for the model was built from the data provided by SwiftKey as part of the project.

File                 Lines
en_US.blogs.txt      899288
en_US.news.txt       1010242
en_US.twitter.txt    2360148

The above data was split into roughly 80% for training, 15% for evaluation and 5% for testing. An in-memory corpus (tm::VCorpus) was created, and the data was pre-processed and cleaned (using the tm and quanteda packages and regular expressions). The steps included:

  • Stripping extra whitespace
  • Converting the text to lower case
  • Removing “non-printable” and non-ASCII characters
  • Removing punctuation and numbers
  • Removing profane or swear words using the lexicon::profanity_banned word list
  • Removing URLs, Twitter handles and symbols
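
The cleaning steps above can be sketched with regular expressions. This Python version is illustrative only (the app uses tm, quanteda and R regex). The patterns, the order of the steps, the decision to keep apostrophes inside words, and the `banned` parameter standing in for the profanity list are all assumptions.

```python
import re

URL_RE = re.compile(r"(https?://|www\.)\S+")
HANDLE_RE = re.compile(r"[@#]\w+")              # Twitter handles and hashtags
NON_PRINTABLE_RE = re.compile(r"[^\x20-\x7e]")  # outside printable ASCII
NOT_LETTER_RE = re.compile(r"[^a-z' ]")         # punctuation, numbers, symbols


def clean_line(text, banned=frozenset()):
    """Return a lower-cased, ASCII-only, punctuation-free version of `text`."""
    text = text.lower()
    # URLs and handles go first, while their punctuation is still intact.
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = NON_PRINTABLE_RE.sub(" ", text)
    text = NOT_LETTER_RE.sub(" ", text)
    # split() also collapses runs of whitespace.
    words = (word.strip("'") for word in text.split())
    return " ".join(w for w in words if w and w not in banned)
```

URLs are removed before punctuation because stripping punctuation first would break them into ordinary-looking words.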

Building and testing the model

The corpus was then tokenized into N-grams using quanteda::tokens. The document-feature matrix was converted to a data.table, using a tidytext data frame as an intermediate step. The rest of the model building process uses data.table structures, one for each N-gram order, and the final tables in the app are also data.tables. data.table provides fast searches on indexed keys, which helps the model's performance. The final tables were stored as .RDS files, which use gzip compression for the object data. The final disk space occupied by the model files was 124.5 MB.
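
The idea of one keyed table per N-gram order can be sketched with a dictionary keyed by the N-1 preceding words. This Python sketch is an analogy only: the app uses keyed data.tables in R, and the function name and the stored top-k lists are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_prefix_table(sentences, order, top_k=5):
    """Map each (order - 1)-word prefix to its most frequent next words.

    Returns {prefix_tuple: [(next_word, count), ...]}, sorted by count.
    """
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - order + 1):
            prefix = tuple(tokens[i:i + order - 1])
            followers[prefix][tokens[i + order - 1]] += 1
    return {prefix: nxt.most_common(top_k) for prefix, nxt in followers.items()}
```

A lookup such as `table.get(("i", "like"), [])` is then a single hash access, which plays the same role as a keyed search on an indexed table.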

Testing

Extensive testing and benchmarking were done on the model using the test data extracted from the original files. Three types of testing and benchmarking were done:

  • Comparing last word in each sentence with the predicted words.

Using 60,000 news, blogs and tweets, the accuracy achieved was 22% on the top-5 words.

  • Creating N-grams in the same way as for the model, passing the first N-1 words of each N-gram to the prediction function, and comparing the output to the last word of that N-gram. Only N-grams with frequency greater than 1 were tested.
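
Both checks above reduce to the same measurement: given some context words, is the true next word among the top k predictions? A minimal Python sketch follows, for illustration only. The `predict` callable is a hypothetical stand-in for the app's prediction function, and both function names are assumptions.

```python
def last_word_pairs(sentences):
    """Turn each sentence into (context_words, last_word); skip one-word lines."""
    pairs = []
    for sentence in sentences:
        tokens = sentence.split()
        if len(tokens) >= 2:
            pairs.append((tokens[:-1], tokens[-1]))
    return pairs


def top_k_accuracy(pairs, predict, k=5):
    """Fraction of pairs whose target is among the first k predictions.

    `predict(context)` must return candidate words ordered best first.
    """
    if not pairs:
        return 0.0
    hits = sum(1 for context, target in pairs if target in predict(context)[:k])
    return hits / len(pairs)
```

For the N-gram check, the pairs would instead be built from each test N-gram: its first N-1 words as context and its last word as target.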

Testing cont'd

Using 60,000 news, blogs and tweets from the test dataset, the following accuracy was achieved:

N-gram         Accuracy
Bi-grams       19.35%
Tri-grams      49.74%
Quad-grams     68.46%
Penta-grams    69.24%

Metric                     Result
Overall top-3 score        16.94%
Overall top-1 precision    13.21%
Overall top-3 precision    20.09%
Average runtime            19.99 msec
Number of predictions      63237
Total memory used          809.01 MB
File size on disk          124.5 MB