Word-O-Matic 3000

JNabonne
04/02/2019

Context / About

  • Capstone project for JHU Data Science Specialization
  • Goal: create a predictive model to guess the next word
  • Dataset provided by SwiftKey, with English text from:

    1. Blogs (~210MB)
    2. News (~205MB)
    3. Twitter feed (~170MB)
  • Very little guidance from professors
    (just the main objective, a couple of links about NLP, n-grams and backoff algorithms, and some warnings about memory challenges!)

  • Very challenging and rewarding: I learned a lot!

Result: a nice Shiny app

  • The application is available on a Shiny server here
  • Usage is quite straightforward:
    1. start typing a sentence in the left section
    2. wait a few seconds to get some suggestions (in blue on the right side)
  • The app will offer you:
    • the best answer from my algo (cf. next slide for more details)
    • as a bonus, a text-to-speech button to hear it
    • other alternative answers found by the model, if any
  • There is also a special tab briefly explaining the app

The GUI

Below is a screenshot of the Shiny app GUI with the two sections:

  1. left: where to type in your words
  2. right: the result tab with the answer in blue
    tip: you can switch tabs to gather some more info

(screenshot: wom3k-gui)

The model and the algorithm behind it

Following the EDA (cf. milestone report), quanteda is used to create various n-gram tables (from 1-grams up to 5-grams), which are then manually enriched with frequency statistics and probability calculations; this is the basis of the model.
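As a rough illustration (not the project's actual code), the n-gram tables could be built along these lines with quanteda; the variable txt and the helper name ngram_freq are assumptions for the sketch:

    # Minimal sketch, assuming `txt` is a character vector holding the cleaned
    # blog / news / Twitter samples (names here are hypothetical).
    library(quanteda)

    toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)

    # frequency table for one n-gram order
    ngram_freq <- function(toks, n) {
      ng     <- tokens_ngrams(toks, n = n, concatenator = " ")
      counts <- colSums(dfm(ng))
      data.frame(ngram = names(counts), count = as.numeric(counts),
                 stringsAsFactors = FALSE)
    }

    # tables from 1-grams up to 5-grams, later enriched with probabilities
    tables <- lapply(1:5, function(n) ngram_freq(toks, n))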
The algorithm used is a stupid backoff (a simplified version of Katz's backoff with much better performance) that, when given a sentence, will:

  1. count the number of words n in it (only the last 4 are kept, since the largest table holds 5-grams)
  2. look in the (n+1)-gram data to see if it can find the next word
  3. if not, it will 'back off': take the last (n-1) words as input & look in the n-grams
  4. if not, back off again and again until either finding an answer
    or ending up at the unigram level (1-gram)
  5. the algorithm then returns the best candidate found

Example: if you type in the 5 words “please father I want to”,
it will only keep “father I want to” and look in the 5-grams for an answer; if nothing is found…
it will look in the 4-grams for “I want to”, the 3-grams for “want to” & the 2-grams for “to”;
if nothing works, it will dumbly return the most probable word from the corpus
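
To make the backoff concrete, here is a minimal sketch of the lookup (hypothetical names, not the app's actual code), assuming each tables[[n]] has been reshaped to hold the n-gram data with prefix, word and score columns:

    # Minimal sketch of the stupid-backoff lookup described above.
    predict_next <- function(sentence, tables, max_order = 5) {
      words   <- tolower(unlist(strsplit(trimws(sentence), "\\s+")))
      context <- tail(words, max_order - 1)   # keep at most 4 words of context

      while (length(context) > 0) {
        tab  <- tables[[length(context) + 1]]                     # (k+1)-gram table
        hits <- tab[tab$prefix == paste(context, collapse = " "), ]
        if (nrow(hits) > 0) {
          return(hits$word[which.max(hits$score)])                # best candidate
        }
        context <- context[-1]                                    # back off: drop first word
      }
      tables[[1]]$word[which.max(tables[[1]]$score)]              # unigram fallback
    }

    # e.g. predict_next("please father I want to", tables)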