Word Prediction App

Nikolay Dobrinov
02.09.2018

Word Prediction Algorithm and App

Objective.

- Create a Shiny app and publish it
- Provide user-guide documentation for the app; attach it to the app
- Write a 5-page presentation to pitch your app

Data.

- Blogs/News/Tweets corpora from SwiftKey
- 4.3 million lines of text; over 100 million words

The app lets the user

- Input a phrase of any length and obtain a prediction
- View 'text-message'-style predictions of the 5 most likely next words
- View up to 1000 top predicted words, sorted by Katz probability

Data Pre-processing Algorithm

Sub-sample the data

- 70% train, 15% validation, 15% test samples
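A minimal Python sketch of such a split is shown below. It is only an illustration: the function name `split_corpus`, the fixed seed, and the choice to sample whole lines are assumptions, not details taken from the original scripts.

```python
import random

def split_corpus(lines, train=0.70, validation=0.15, seed=42):
    """Shuffle the lines and split them into train / validation / test lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])      # the remainder becomes the test set

# Toy input: 100 placeholder lines standing in for the real corpus.
lines = [f"line {i}" for i in range(100)]
train_set, val_set, test_set = split_corpus(lines)
```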

Prepare the data for the Katz Back-off approach.

Data cleaning

- Typical NLP pre-processing steps: remove duplicates, profanity, punctuation, email addresses, and HTML links; convert all words to lower case; remove extra white space; etc.
- For more detail, see the links provided on the second-to-last slide
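The cleaning steps listed above might look roughly like the following Python sketch. The regular expressions, the two-word `PROFANITY` placeholder list, and the decision to drop digits are all assumptions made for illustration; the actual rules are in the linked scripts.

```python
import re

# Placeholder list (hypothetical); a real run would load a full profanity list.
PROFANITY = {"darn", "heck"}

def clean_line(text):
    """Apply the cleaning steps to one line of text."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)               # email addresses
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # web links
    text = re.sub(r"[^a-z' ]+", " ", text)             # punctuation, digits, other symbols
    words = [w for w in text.split() if w not in PROFANITY]
    return " ".join(words)                             # also collapses extra white space

def clean_corpus(lines):
    """Clean every line, dropping empty results and duplicates (first copy kept)."""
    seen, cleaned = set(), []
    for line in lines:
        c = clean_line(line)
        if c and c not in seen:
            seen.add(c)
            cleaned.append(c)
    return cleaned
```

The same `clean_line` function can be reused on the user's input phrase, so that the input is cleaned the same way as the training corpora.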

Tokenization and nGram Generation

- Generate 1-, 2-, 3-, and 4-grams; sort each by descending frequency
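As a small illustration of this step, the Python sketch below counts n-grams with whitespace tokenization and sorts each table by frequency. The helper name `ngram_counts` and the two-line toy corpus are assumptions; the original tokenizer may differ.

```python
from collections import Counter

def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of words) in already-cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

corpus = ["the cat sat on the mat", "the cat ate"]

# One table per n-gram order, each sorted from most to least frequent.
tables = {n: ngram_counts(corpus, n).most_common() for n in range(1, 5)}
```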

Calculate Good-Turing (GT) counts for k <= 5, then calculate GT conditional probabilities
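One common reading of this step is sketched below in Python: an n-gram seen c times (c <= k) gets the adjusted count c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of distinct n-grams seen exactly c times, and larger counts are left unchanged. This is the plain Good-Turing estimate; Katz's renormalized discount is not shown, and the function names and toy counts are assumptions.

```python
from collections import Counter

def good_turing_counts(counts, k=5):
    """Return Good-Turing adjusted counts for n-grams seen at most k times."""
    freq_of_freq = Counter(counts.values())     # N(c): how many n-grams occur c times
    adjusted = {}
    for ngram, c in counts.items():
        n_c, n_next = freq_of_freq[c], freq_of_freq[c + 1]
        if c <= k and n_next > 0:
            adjusted[ngram] = (c + 1) * n_next / n_c
        else:
            adjusted[ngram] = float(c)          # no usable N(c+1), or c > k: keep raw count
    return adjusted

def gt_conditional_probs(counts, prefix_counts, k=5):
    """P_GT(word | prefix) = adjusted count of (prefix, word) / raw count of prefix."""
    adjusted = good_turing_counts(counts, k)
    return {ng: c_star / prefix_counts[ng[:-1]] for ng, c_star in adjusted.items()}

# Toy bigram counts that all share the prefix "a".
bigrams = {("a", "b"): 1, ("a", "c"): 1, ("a", "d"): 1, ("a", "e"): 2}
adjusted = good_turing_counts(bigrams)
probs = gt_conditional_probs(bigrams, {("a",): 5})
```

In this toy table the probabilities sum to less than 1; the remaining mass is what the back-off step later redistributes to unseen words.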

Model Algorithm

Katz Back-off approach

- The user inputs a text/phrase of any length, and the algorithm cleans the phrase the same way it cleaned the training corpora
- The last few words of the phrase are matched against the n-grams, using Katz back-off from the highest possible n-gram order downwards
- Katz alpha and Katz probability (GT prob * Katz alpha) are calculated for each predicted word, depending on the n-gram order at which it was matched
- The predictions generated from all n-gram orders are sorted in descending order by Katz probability

Link to the Shiny App
Link to detailed model description and scripts to replicate the model
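The back-off steps above can be sketched in Python as follows. This is a simplified illustration, not the app's code: it uses a fixed discount of 0.5 in place of the Good-Turing counts to keep the toy example well behaved, and the names `count_ngrams`, `katz_probs`, and `predict` are hypothetical.

```python
from collections import Counter

def count_ngrams(lines, max_n=4):
    """Build one Counter of n-gram tuples for each order 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        t = line.split()
        for n in counts:
            for i in range(len(t) - n + 1):
                counts[n][tuple(t[i:i + n])] += 1
    return counts

def katz_probs(context, counts, vocab, discount=0.5):
    """Return {word: Katz probability} for the word following `context` (a tuple)."""
    if not context:                               # unigram level: plain relative frequency
        total = sum(counts[1].values())
        return {w: counts[1][(w,)] / total for w in vocab}
    n = len(context) + 1
    lower = katz_probs(context[1:], counts, vocab, discount)
    prefix_count = counts[n - 1][context]
    if prefix_count == 0:                         # unseen context: back off entirely
        return lower
    # Discounted probabilities for words observed after this context.
    seen = {w: (counts[n][context + (w,)] - discount) / prefix_count
            for w in vocab if counts[n][context + (w,)] > 0}
    beta = 1.0 - sum(seen.values())               # probability mass left over
    unseen_mass = sum(p for w, p in lower.items() if w not in seen)
    alpha = beta / unseen_mass if unseen_mass > 0 else 0.0
    probs = {w: alpha * p for w, p in lower.items() if w not in seen}
    probs.update(seen)
    return probs

def predict(phrase, counts, top=5, max_n=4):
    """Rank candidate next words by Katz probability, highest first."""
    vocab = [w for (w,) in counts[1]]
    context = tuple(phrase.lower().split()[-(max_n - 1):])
    probs = katz_probs(context, counts, vocab)
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top]

corpus = ["i love new york", "i love new shoes",
          "i love new york city", "we love new ideas"]
counts = count_ngrams(corpus)
top5 = predict("I really love new", counts)
```

Here the context "really love new" was never seen, so the sketch backs off to "love new" and ranks "york" first. Raising `top` to 1000 would give the longer ranked list mentioned earlier.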

App view of front page

Image: Shiny app view

Image: PK tweet