Proxipense

Next Word Analytics

Soren Heitmann
Data Science Specialization Capstone Project
Johns Hopkins Department of Biostatistics through Coursera

Overview

This presentation introduces Proxipense, a service offering the next generation in predictive next word analytics!

Proxipense is free to try at: https://neros.shinyapps.io/proxipense

What is predictive next word analytics?

Predictive next word analytics is a specific application of Natural Language Processing (NLP), which uses data science and statistical methods to model the probable next word(s) following a given input text. For example, "what is the next word that I will..." say? write? shout?

Today, smartphones, tablets and similar devices represent a growing area in which this type of analysis is used: write the start of a sentence and receive next-word prompts that help you save time, lower typo rates and reduce finger strain.

What is Proxipense?

Proxipense is an interactive online application that demonstrates next word predictive analytics.

Type a word or phrase into the input box and Proxipense will automatically respond with suggested words: the words it predicts you're most likely to type next. Proxipense suggests the top-5 words and highlights the single top-1 predicted word, displaying it in a large, underlined font.

If you continue your phrase, Proxipense will register the next word you've typed and match it against the top-5 words it previously suggested: if there's a match, Proxipense counts this as a correct guess and notes the word in a list of "correct predictions". Otherwise, the word is noted in a list of "incorrect predictions". To use this feature, pause after the word you'd like to receive predictions for; once predictions are given, type the next word to see if there's a match.
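
Conceptually, the scoring logic amounts to a simple membership test. The sketch below is illustrative only; the function name and the state handling are assumptions, not the deployed app code:

    # Compare the word the user actually typed against the top-5 suggestions
    score_prediction <- function(typed_word, suggestions,
                                 correct = character(0), incorrect = character(0)) {
      if (tolower(typed_word) %in% tolower(suggestions)) {
        correct <- c(correct, typed_word)      # counted as a correct guess
      } else {
        incorrect <- c(incorrect, typed_word)  # counted as an incorrect guess
      }
      list(correct = correct, incorrect = incorrect)
    }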

Proxipense boasts a top-3 accuracy rate of about 20% on average! Its top-1 accuracy rate averages around 12%.

How does it work?

Proxipense demonstrates the application of data science methodology by turning raw data into a usable data product. Here, approximately 3 million lines of text are extracted from online resources, including various blogs, Reuters news feeds and Twitter.
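
As a rough sketch, sampling such a corpus in R might look like the following (the file names and the 10% sampling rate are illustrative assumptions, not the project's actual values):

    # Read the raw corpus files and keep a random sample of lines from each
    set.seed(42)
    files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
    corpus <- unlist(lapply(files, function(f) {
      lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, floor(length(lines) * 0.1))  # keep ~10% of each source
    }))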

The application normalizes input text to Latin characters (for English input), and handles punctuation, numeric text and phrase/sentence boundaries to identify individual words. It also attempts to obfuscate profanity, using common dictionaries, to provide user-friendly predictions.
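
A minimal normalization sketch, assuming a simple regex-based approach (the app's full pipeline is more thorough):

    # Normalize raw text into clean, lower-case word vectors, one per sentence
    normalize_text <- function(x, profanity = character(0)) {
      x <- iconv(x, to = "ASCII//TRANSLIT", sub = "")  # force Latin characters
      x <- tolower(x)
      sentences <- unlist(strsplit(x, "[.!?]+"))       # sentence differentiation
      lapply(sentences, function(s) {
        s <- gsub("[0-9]+", " ", s)                    # drop numeric text
        s <- gsub("[^a-z' ]", " ", s)                  # strip remaining punctuation
        words <- unlist(strsplit(s, "\\s+"))
        words <- words[nchar(words) > 0]
        words[words %in% profanity] <- "****"          # obfuscate profanity
        words
      })
    }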

Normalized words are placed into a 5-column matrix to construct n-grams, a term used in NLP to describe a sequence of n observed words; here, up to 5 words. The frequencies of the n-grams observed in the raw text data are the basis for the predictive analytics.
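
In R, the base embed() function makes this construction compact; a small sketch:

    # Build the 5-column n-gram matrix from a vector of normalized words;
    # embed() returns columns in reverse order, so flip them back
    words  <- c("i", "go", "to", "the", "store", "every", "day")
    ngrams <- embed(words, 5)[, 5:1]
    # Count how often each distinct 5-gram occurs
    freqs  <- table(apply(ngrams, 1, paste, collapse = " "))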

Proxipense uses two methods to predict the next word: Kneser-Ney smoothing and Part-of-Speech (POS) tagging, both employed within a back-off model.

What methods does it use?

Kneser-Ney Smoothing

The n-gram probabilities have two sources of bias: (1) the number of potential n-gram word/sentence combinations is effectively infinite, so the sample may not be representative and the number of rare, unobserved n-gram word pairs may be high; (2) word frequencies are not evenly distributed, but are affected by word context.

Kneser-Ney smoothing uses a technique of conditional probability and absolute discounting to address these issues. The problem is often exemplified with the phrase, "I can't see without my reading ...". A good prediction might be "glasses." But the word "Francisco" is more commonly observed in overall language use, and "I can't see without my reading Francisco" is a terrible prediction. In context, while "Francisco" may be more frequent overall, it is generally seen only following the word "San", as in "San Francisco", and not, for example, following the word "reading", which would more commonly precede "glasses".

Proxipense uses the Kneser-Ney smoothing algorithm to help select the highest-probability next word. This paper by Martin Christian Körner was helpful in implementing this approach. A general discussion is also available here.
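
To make the idea concrete, here is a simplified interpolated Kneser-Ney sketch for bigrams only (Proxipense works with higher-order n-grams; the data layout below, one row per distinct word pair plus a count column, is an assumption for illustration):

    # bigrams: data.frame with one row per distinct (w1, w2) pair and a count
    # d: the absolute discount; assumes w1 was observed at least once
    kn_bigram_prob <- function(w1, w2, bigrams, d = 0.75) {
      c_w1  <- sum(bigrams$count[bigrams$w1 == w1])
      c_w12 <- sum(bigrams$count[bigrams$w1 == w1 & bigrams$w2 == w2])
      # continuation probability: in how many distinct contexts does w2 appear?
      p_cont <- sum(bigrams$w2 == w2) / nrow(bigrams)
      # back-off weight: the discounted mass redistributed to continuations
      lambda <- d * sum(bigrams$w1 == w1) / c_w1
      max(c_w12 - d, 0) / c_w1 + lambda * p_cont
    }

Under this scheme "Francisco" scores poorly after "reading": despite its high raw frequency, it completes very few distinct contexts, so its continuation probability is low.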

Part of Speech (POS) Probabilities

Although word/sentence combinations can be effectively infinite, overall grammar pairings are far more finite. Part-of-Speech (POS) tagging, using the openNLP package, tags the sampled words as nouns, verbs, adjectives, pronouns and so on. An input text's grammar n-gram is thus less biased by unseen grammar combinations. Proxipense applies the next-POS grammar probability to the Kneser-Ney probabilities to select words that are both likely given the word's conditional probability and consistent with the next word's most likely grammatical construction.
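
Tagging with openNLP follows the pattern below (a standard usage sketch; it requires the NLP package and the pre-trained English maxent models that ship with openNLP's data package):

    library(NLP)
    library(openNLP)
    s <- as.String("I can't see without my reading glasses")
    # Annotate sentences, then words, then POS tags, in that order
    anns <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                             Maxent_Word_Token_Annotator(),
                             Maxent_POS_Tag_Annotator()))
    word_anns <- subset(anns, type == "word")
    sapply(word_anns$features, `[[`, "POS")  # one POS tag per token, e.g. "PRP", "NNS"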

Back-Off Models

The more input words, the less likely that input phrase was observed in the sample. If a 4-word input (used to predict the 5th word of the n-gram) is not found, the model will "back off" and seek more common 3-word n-grams, and so on. Suppose the 4-gram input "I go to the [store]" is not observed: the 3-gram "go to the [...]", the 2-gram "to the [...]" and the 1-gram "the [...]" are queried for likely next words using Kneser-Ney. Meanwhile, the POS probability is absolute and is not backed off, since a given grammar n-gram will almost always be observed. In this example, a noun most likely follows the 4-gram input "I go to the [...]"; while an unfound word may back off to a pairing like "the [first]", the combined Kneser-Ney and POS score helps select a noun, so "I go to {the [store]}" is predicted over another common pairing with less common grammar, like "I go to {the [first]}".
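
A sketch of the back-off loop, where predict_from_context is a hypothetical lookup into the Kneser-Ney scored n-gram tables (not an actual Proxipense function):

    # Try the longest available context first, then progressively shorten it
    predict_next <- function(context, n_max = 4) {
      context <- tail(context, n_max)
      while (length(context) > 0) {
        hits <- predict_from_context(context)  # hypothetical n-gram table lookup
        if (nrow(hits) > 0) return(hits)       # match found: stop backing off
        context <- context[-1]                 # drop the leftmost word, back off
      }
      data.frame()                             # nothing found at any order
    }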