Word Prediction Presentation

Word Prediction

Data Science Capstone Project, Johns Hopkins / Coursera

Mick Guy
March 24, 2017

Introduction

The purpose of the capstone is to become familiar with natural language processing and text mining techniques to build language models and prediction algorithms.

The final project is to develop an application that predicts the next word for a given input text, using models built from U.S. English blog and Twitter corpora. The data comes from a corpus called HC Corpora and can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Word Prediction Algorithm

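The algorithm uses the Katz back-off method, working down the chain from quadgrams to unigrams. The data consists of four tables, one per n-gram order. The counts were discounted and the probabilities were then calculated with Katz's back-off model, where C(x) is the number of times x appears in the training data, d is the discount factor, α is the back-off weight and k is the count threshold: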

\[
P_{bo}(w_{i}\mid w_{i-n+1}\cdots w_{i-1}) =
\begin{cases}
d_{w_{i-n+1}\cdots w_{i}}\dfrac{C(w_{i-n+1}\cdots w_{i-1}w_{i})}{C(w_{i-n+1}\cdots w_{i-1})} & \text{if } C(w_{i-n+1}\cdots w_{i})>k \\[10pt]
\alpha_{w_{i-n+1}\cdots w_{i-1}}P_{bo}(w_{i}\mid w_{i-n+2}\cdots w_{i-1}) & \text{otherwise}
\end{cases}
\]
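As a rough illustration of the equation, the following Python sketch computes a back-off probability from raw n-gram counts. It is not the code behind the app (which was built in R). The names `count_ngrams` and `p_bo`, the fixed absolute discount `DISCOUNT` (standing in for Katz's Good-Turing discount ratios) and the threshold `K = 0` are all assumptions made for the example.

```python
from collections import Counter

DISCOUNT = 0.5  # assumed absolute discount; Katz's method uses Good-Turing ratios
K = 0           # n-grams seen more than K times are estimated directly


def count_ngrams(tokens, max_n):
    """Count every n-gram of order 1..max_n, keyed by a tuple of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def p_bo(word, context, counts, vocab):
    """Back-off probability of `word` given a tuple of preceding words."""
    if not context:
        # Base case: plain relative frequency of the unigram.
        total = sum(counts[(w,)] for w in vocab)
        return counts[(word,)] / total

    # C(context), taken as the sum over all observed continuations.
    context_total = sum(counts[context + (w,)] for w in vocab)
    if context_total == 0:
        # Context never seen: rely entirely on the shorter context.
        return p_bo(word, context[1:], counts, vocab)

    seen = {w: counts[context + (w,)] for w in vocab
            if counts[context + (w,)] > K}
    if word in seen:
        # First case of the equation, with d = (c - DISCOUNT) / c.
        return (seen[word] - DISCOUNT) / context_total

    # Second case: alpha spreads the mass freed by discounting over the
    # words not seen after this context, in proportion to P_bo(w | shorter).
    leftover = 1.0 - sum((c - DISCOUNT) / context_total for c in seen.values())
    denom = sum(p_bo(w, context[1:], counts, vocab)
                for w in vocab if w not in seen)
    if denom == 0:
        return 0.0
    return leftover * p_bo(word, context[1:], counts, vocab) / denom


tokens = "the cat sat on the mat the cat ate".split()
counts = count_ngrams(tokens, 3)
vocab = sorted(set(tokens))
print(p_bo("cat", ("the",), counts, vocab))
print(sum(p_bo(w, ("the", "cat"), counts, vocab) for w in vocab))
```

For any context, the probabilities over the vocabulary sum to one. This is a quick sanity check that the back-off weight has been computed consistently.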

The input text is processed by converting it to lowercase, removing extra spaces, converting the encoding to Windows-1252 (which sometimes fails) or ASCII, and replacing punctuation and other characters/symbols. See the next slide for more details.
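The cleaning steps can be sketched in Python as follows. This is an illustration only (the app itself was written in R). The function name `clean_text`, the choice to keep apostrophes and the exact order of the steps are assumptions.

```python
import re
import unicodedata


def clean_text(text):
    """Lowercase, force to ASCII, replace punctuation, collapse spaces."""
    text = text.lower()
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace everything except letters, digits, apostrophes and spaces.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("  Hello,   WORLD!! Café… "))
```

Any step that relies on punctuation, such as recognising email addresses or URLs, would have to run before the punctuation is replaced.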

Word Prediction Algorithm cont'd

The entities “num, tel, pct, pctile, obit, year, Sc, email, url, addr” were created using gsub: pct = percent, pctile = percentile, obit = a year range (year - year), and Sc = currency symbol.

The logic is that most numbers, addresses and emails within the corpora will be unique. Classifying them gives a better representation of the corpora and therefore improves the prediction capabilities.
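A minimal Python sketch of this kind of entity tagging is shown below, with `re.sub` playing the role of R's gsub. The patterns and the `tag_entities` helper are hypothetical and cover only some of the entities. The rules actually used in the app are not shown in this presentation.

```python
import re

# Hypothetical patterns. Order matters: specific rules (email, year range)
# must run before general ones (year, num) that would match parts of them.
ENTITY_RULES = [
    (r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b", "email"),
    (r"\b(?:https?://|www\.)\S+", "url"),
    (r"\b(?:1[89]|20)\d{2}\s*-\s*(?:1[89]|20)\d{2}\b", "obit"),
    (r"\b\d+(?:\.\d+)?\s*%", "pct"),
    (r"\b(?:1[89]|20)\d{2}\b", "year"),
    (r"[$£€]", "sc"),
    (r"\b\d+(?:[.,]\d+)*\b", "num"),
]


def tag_entities(text):
    """Replace each matched entity with its tag, then tidy the spacing."""
    for pattern, tag in ENTITY_RULES:
        text = re.sub(pattern, f" {tag} ", text)
    return re.sub(r"\s+", " ", text).strip()


print(tag_entities("write to bob@example.com about the 5% rise in 2016"))
print(tag_entities("he lived 1950 - 2010 and left $3,000"))
```

After tagging, thousands of distinct numbers collapse into a single token, so an n-gram such as "rise in year" accumulates a useful count instead of being split across many rare variants.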

The qdap package is used to replace contractions.
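For readers unfamiliar with the idea, contraction replacement amounts to a dictionary lookup. The Python sketch below is an illustration, not qdap's implementation. The small `CONTRACTIONS` table and the `expand_contractions` helper are assumptions, and the input is assumed to be lowercase already.

```python
import re

# A tiny illustrative table; a real one would have many more entries.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",  # ambiguous in general ("it has"); one reading is chosen
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)], text)


print(expand_contractions("i'm sure it's fine, don't worry"))
```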

The length of the input string is checked and the input is sent to the appropriate chain, i.e. if only 3 words are input the chain begins at trigrams; otherwise it checks quadgrams first. If the quadgram is not found it backs off to trigrams, and so forth.
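The back-off chain can be sketched in Python as a lookup that starts at the highest n-gram order the input supports and steps down until a match is found. The layout of `tables` (order n mapped to a dictionary from an (n-1)-word context to candidates sorted by probability) and the `predict_next` function are assumptions. How the original maps input length to a starting table may differ, and the back-off weights are left out for brevity.

```python
def predict_next(text, tables, max_order=4, top_n=5):
    """Return up to `top_n` candidate next words for `text`.

    `tables[n]` maps a tuple of n-1 context words to a list of
    (word, probability) pairs sorted from most to least likely.
    """
    words = text.split()
    # An order-n table needs n-1 context words, so short inputs start lower.
    start = min(len(words), max_order - 1)
    for ctx_len in range(start, 0, -1):
        context = tuple(words[-ctx_len:])
        candidates = tables.get(ctx_len + 1, {}).get(context)
        if candidates:
            return [word for word, _ in candidates[:top_n]]
    # No context matched: fall back to the most frequent unigrams.
    return [word for word, _ in tables.get(1, {}).get((), [])[:top_n]]


tables = {
    4: {("a", "b", "c"): [("d", 0.9)]},
    3: {("b", "c"): [("e", 0.5)]},
    2: {("c",): [("f", 0.4), ("g", 0.3)]},
    1: {(): [("the", 0.1)]},
}
print(predict_next("a b c", tables))  # found in the quadgram table
print(predict_next("x b c", tables))  # backs off to the trigram table
print(predict_next("x y c", tables))  # backs off to the bigram table
```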

Word Prediction Application

The application predicts the next word when a phrase is entered.

Enter some text into the text area and click the predict button. The predicted word will be appended to the entered text. A drop-down will list up to five of the most likely words. Selecting a word from the drop-down replaces the predicted word in the text area.

You may continue to click the predict button to generate sentences. Some predictions may return tagged entities such as sc (currency symbol), num, tel and so forth. These entity predictions are placeholders that tell the user a currency, number, telephone number, etc. is the most likely prediction.

Note: The accuracy of the shinyapps.io version is somewhat reduced because the tables were scaled down to address performance and memory issues.

The app can be found on shinyapps.io.