JNabonne
04/02/2019
Dataset provided by SwiftKey, with English text from:
Very little guidance from the professors
(just the main objective, a couple of links about NLP, n-grams and the backoff algorithm, and some warnings about memory challenges!)
Very challenging and rewarding: I learned a lot!
Below is a screenshot of the Shiny app GUI with the two sections:
Following the EDA (cf. milestone report), quanteda is used to create n-grams (from 1-grams up to 5-grams), which are manually enriched with statistics and probability calculations; this is the basis of the model.
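To make the idea of "n-grams enriched with statistics" concrete, here is a minimal sketch in plain Python. It is not the quanteda code used in the project: the function name `ngram_table`, the toy corpus and the choice of relative frequency as the "probability" are all assumptions made for illustration.

```python
from collections import Counter

def ngram_table(tokens, n):
    """Count every n-gram and attach its relative frequency (hypothetical helper)."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return {g: {"count": c, "prob": c / total} for g, c in grams.items()}

# Toy corpus, assumed for illustration only
tokens = "i want to go i want to eat i want a nap".split()

# One table per order, from 1-grams up to 5-grams
tables = {n: ngram_table(tokens, n) for n in range(1, 6)}

print(tables[2][("i", "want")])  # {'count': 3, 'prob': 0.2727...}
```

On a real corpus the tables would be far larger, which is where the memory warnings mentioned earlier come in.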
The algorithm used is a version of Stupid Backoff (a simplified version of Katz's backoff with much better performance) that, when given a sentence, will:
Example: if you type in 5 words, “please father I want to”:
it will keep only “father I want to” and look in the 5-grams for an answer; if nothing…
it will look in the 4-grams for “I want to”, the 3-grams for “want to” and the 2-grams for “to”
if nothing works, it will dumbly return the most probable word from the corpus
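The lookup steps above can be sketched as follows. This is a simplified illustration in Python, not the app's actual implementation: `build_tables`, `predict_next` and the toy corpus are hypothetical. It returns the most frequent continuation from the highest-order n-gram that matches, and does not apply any per-level discount.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=5):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            tables[n][tuple(context)][nxt] += 1
    return tables, Counter(tokens)

def predict_next(sentence, tables, unigrams, max_n=5):
    """Try the longest context first, then back off to shorter ones."""
    words = sentence.lower().split()
    for n in range(max_n, 1, -1):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # sentence too short for this order
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    # Nothing matched: fall back to the most frequent word in the corpus
    return unigrams.most_common(1)[0][0]

# Toy corpus, assumed for illustration only
corpus = "father i want to sleep now i want to eat now you want to eat to live".split()
tables, unigrams = build_tables(corpus)

print(predict_next("please father I want to", tables, unigrams))  # sleep (5-gram match)
print(predict_next("they want to", tables, unigrams))             # eat (backs off to 3-grams)
print(predict_next("hello there", tables, unigrams))              # to (most frequent word)
```

Keying each table by its context keeps every lookup to a single dictionary access, so the cost of a prediction does not grow with the size of the corpus.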