Word Prediction App

Nikolay Dobrinov
02.09.2018

Word Prediction Algorithm and App

  • Objective.

    - Create a Shiny app and publish it
    - Provide user-guide documentation for the app; attach it to the app
    - Write a 5-page presentation to pitch the app
    
  • Data.

    - Blogs/News/Tweets corpora from SwiftKey
    - 4.3 million lines of text; over 100 million words
    
  • The app lets the user

    - Input a phrase of any length and obtain a prediction
    - View text-message-style predictions of the 5 most likely words
    - View up to 1000 top predicted words, sorted by Katz probability
    

Data Pre-processing Algorithm

  • Sub-sample the data

    - 70% train, 15% validation, 15% test samples
    
  • Prepare the data for the Katz back-off approach.

  • Data cleaning

    - Steps typical for NLP data pre-processing: remove duplicates, profanity, punctuation, email addresses, and HTML links; convert all words to lower case; remove extra white space; etc.
    - For more detail see the links provided on the second to last slide
    
  • Tokenization and nGram Generation

    - Generate 1-, 2-, 3-, and 4-grams; sort each by descending frequency
    
  • Calculate Good-Turing counts for k<=5, calculate GT conditional probabilities
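
The cleaning, n-gram counting, and Good-Turing steps above can be sketched roughly as follows. This is a minimal Python illustration, not the app's own scripts: the function names (`clean`, `ngram_counts`, `good_turing_counts`) and the regular expressions are assumptions made for this example, and duplicate and profanity removal are left out. The adjusted count uses the standard Good-Turing estimate r* = (r + 1) * N(r+1) / N(r), where N(r) is the number of distinct n-grams seen exactly r times; when a needed N(r) is zero (common on tiny samples), the raw count is kept.

```python
import re
from collections import Counter

def clean(line):
    """Lower-case a line and strip links, e-mail addresses and punctuation."""
    line = line.lower()
    line = re.sub(r"https?://\S+|www\.\S+", " ", line)  # links
    line = re.sub(r"\S+@\S+", " ", line)                # e-mail addresses
    line = re.sub(r"[^a-z' ]+", " ", line)              # punctuation, digits
    return re.sub(r"\s+", " ", line).strip()            # extra white space

def ngram_counts(lines, n):
    """Count n-grams (as tuples of words) over the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line).split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def good_turing_counts(counts, k=5):
    """Map each raw count r <= k to its Good-Turing adjusted count r*."""
    n_r = Counter(counts.values())  # N(r): number of n-grams seen r times
    adjusted = {}
    for r in range(1, k + 1):
        if n_r[r] and n_r[r + 1]:
            adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[r] = float(r)  # not enough data: keep the raw count
    return adjusted

# Toy corpus, for illustration only
lines = ["The cat sat.", "The cat ran!", "A dog sat"]
bigrams = ngram_counts(lines, 2)
print(bigrams.most_common())        # sorted by descending frequency
print(good_turing_counts(bigrams))
```

On a real corpus the same functions would be called with n = 1, 2, 3, 4, and the adjusted counts would replace the raw counts when computing conditional probabilities.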

Model Algorithm

  • Katz Back-off approach

  • The user inputs a phrase of any length, and the algorithm cleans it the same way it cleaned the training corpora

  • Match the last few words of the phrase against the n-grams, backing off from the highest possible n-gram order to lower orders

  • Katz alpha and Katz probability (GT prob * Katz alpha) are calculated for each predicted word, depending on the n-gram order at which the word was matched

  • The generated predictions from all ngrams are sorted in descending order by Katz Probability

  • Link to the Shiny App

  • Link to detailed model description and scripts to replicate the model
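
The back-off logic described on this slide can be illustrated with a much-reduced Python sketch. It is not the app's implementation, and several simplifications are assumptions made for brevity: it backs off only from bigrams to unigrams (the app uses up to 4-grams), it uses a fixed absolute discount of 0.5 in place of Good-Turing discounts, and the names `train` and `katz_predictions` are hypothetical. Words seen after the last input word get a discounted bigram probability; all other words share the left-over probability mass, weighted by their unigram probability and scaled by alpha.

```python
from collections import Counter, defaultdict

def train(sentences):
    """Collect unigram counts and, per word, counts of the words that follow it."""
    unigrams = Counter()
    followers = defaultdict(Counter)  # followers[prev][word] = bigram count
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            followers[prev][word] += 1
    return unigrams, followers

def katz_predictions(phrase, unigrams, followers, discount=0.5, top=5):
    """Return up to `top` (word, probability) pairs, most likely first."""
    tokens = phrase.lower().split()
    prev = tokens[-1] if tokens else None
    seen = followers.get(prev, {})          # words observed after `prev`
    context_count = sum(seen.values())
    probs = {}
    if context_count:
        # Observed bigrams: discounted relative frequency
        for word, count in seen.items():
            probs[word] = (count - discount) / context_count
        left_over = discount * len(seen) / context_count
    else:
        left_over = 1.0                      # nothing matched: back off fully
    # Unobserved words: alpha * unigram probability
    total = sum(unigrams.values())
    unseen = {w: c / total for w, c in unigrams.items() if w not in seen}
    unseen_total = sum(unseen.values())
    alpha = left_over / unseen_total if unseen_total else 0.0
    for word, p in unseen.items():
        probs[word] = alpha * p
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]

# Toy corpus, for illustration only
unigrams, followers = train(["the cat sat", "the cat ran", "the dog sat"])
for word, prob in katz_predictions("I saw the", unigrams, followers):
    print(f"{word}: {prob:.3f}")
```

Because the discounted mass is redistributed exactly, the probabilities over the whole vocabulary sum to 1, which is a convenient sanity check when extending the sketch to higher-order n-grams.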

App view of front page

shiny app view

PK tweet