Nikolay Dobrinov
02.09.2018
Objective.
- Create a Shiny app and publish it
- Provide user-guide documentation for the app; attach it to the app
- Write a 5-page presentation to pitch your app
Data.
- Blogs/News/Tweets corpora from SwiftKey
- 4.3 million lines of text; over 100 million words
The app lets the user
- Input a phrase of any length and obtain a prediction
- View text-message-style predictions of the 5 most likely next words
- View up to 1000 top predicted words sorted by Katz Probability
Sub-sample the data
- 70% train, 15% validation, 15% test samples
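
A minimal sketch of the 70/15/15 split, assuming the corpus is held as a list of text lines. The function name `split_corpus`, the fixed seed, and the toy input are illustrative assumptions, not the scripts used for the app.

```python
import random

def split_corpus(lines, train=0.70, validation=0.15, seed=42):
    """Shuffle the lines and split them into train/validation/test lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # the remainder is the test sample

# Toy stand-in for the SwiftKey lines.
train_set, val_set, test_set = split_corpus([f"line {i}" for i in range(100)])
```

Splitting by line rather than by word keeps each sentence whole, so n-grams are never cut across samples.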
Prepare the data for the Katz back-off approach.
Data cleaning
- Typical NLP pre-processing steps: remove duplicates, profanity, punctuation, email addresses, and HTML links; convert all words to lower case; remove extra white space; etc.
- For more detail see the links provided on the second to last slide
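
The cleaning steps listed above could look roughly like the following sketch. The regular expressions, the placeholder `PROFANITY` set, and the choice to drop duplicates after cleaning are assumptions made for illustration; the actual pipeline is described in the linked material.

```python
import re

# Placeholder word list (assumption); a real profanity list would be loaded from a file.
PROFANITY = {"darn", "heck"}

def clean_line(text):
    """Apply the cleaning steps to a single line of text."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)               # email addresses
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # web links
    text = re.sub(r"[^a-z' ]+", " ", text)             # punctuation, digits, other symbols
    words = [w for w in text.split() if w not in PROFANITY]
    return " ".join(words)                             # also collapses extra white space

def clean_corpus(lines):
    """Clean every line, dropping empty results and duplicates."""
    seen, cleaned = set(), []
    for line in lines:
        c = clean_line(line)
        if c and c not in seen:
            seen.add(c)
            cleaned.append(c)
    return cleaned
```

For example, `clean_line("Hello,   WORLD! Mail me@x.com or see http://a.b/c")` returns `"hello world mail or see"`.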
Tokenization and nGram Generation
- Generate 1-, 2-, 3-, and 4-grams; sort each by descending frequency
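
As an illustration of this step, the sketch below counts n-grams with `collections.Counter` and sorts them by descending frequency. It assumes simple whitespace tokenization of already-cleaned lines; the function name `ngram_counts` and the two-line corpus are hypothetical.

```python
from collections import Counter

def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of words) within each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

corpus = ["the cat sat", "the cat ran"]
# One frequency table per order; most_common() sorts by descending count.
tables = {n: ngram_counts(corpus, n).most_common() for n in range(1, 5)}
```

Here `tables[2][0]` is `(("the", "cat"), 2)`, and `tables[4]` is empty because no line has four words.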
Calculate Good-Turing (GT) adjusted counts for n-grams with raw count k<=5, then calculate GT conditional probabilities
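
A rough sketch of the basic Good-Turing adjustment, r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of distinct n-grams seen exactly r times. Leaving counts above k unchanged, and falling back to the raw count when N_{r+1} is zero, are assumptions of this sketch; the function name and the toy counts are hypothetical.

```python
from collections import Counter

def good_turing_counts(counts, k=5):
    """Map each raw count r to its Good-Turing adjusted count r*."""
    freq_of_freq = Counter(counts.values())  # N_r: how many n-grams have count r
    adjusted = {}
    for r in sorted(set(counts.values())):
        n_r, n_r1 = freq_of_freq[r], freq_of_freq[r + 1]
        if r <= k and n_r1 > 0:
            adjusted[r] = (r + 1) * n_r1 / n_r
        else:
            adjusted[r] = float(r)  # high or unsupported counts stay as they are
    return adjusted

# Toy bigram counts: six singletons, two pairs, one triple.
example_counts = {("of", "the"): 3, ("in", "the"): 2, ("on", "the"): 2,
                  ("a", "cat"): 1, ("a", "dog"): 1, ("a", "rug"): 1,
                  ("a", "mat"): 1, ("a", "hat"): 1, ("a", "bat"): 1}
adjusted = good_turing_counts(example_counts)
```

Dividing an n-gram's adjusted count by the raw count of its history then gives a GT conditional probability; the mass removed from the low counts is what back-off redistributes.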
Katz Back-off approach
The user enters a phrase of any length, and the algorithm cleans it with the same steps used on the training corpora
The last few words of the phrase are matched against the stored n-grams, starting from the highest possible order and backing off to lower orders
Katz alpha and Katz probability (GT prob * Katz alpha) are calculated for each predicted word according to the n-gram order at which it was matched
The generated predictions from all ngrams are sorted in descending order by Katz Probability
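
The steps above can be sketched as follows. This is a simplified illustration, not the app's code: it swaps in a fixed discount of 0.5 for the Good-Turing discounted counts, only lower-cases the input instead of running the full cleaning, and uses hypothetical names (`build_counts`, `katz_prob`, `predict`) and a toy corpus.

```python
from collections import Counter

DISCOUNT = 0.5  # assumption: fixed discount standing in for Good-Turing

def build_counts(lines, max_n=4):
    """Count all 1- to max_n-grams in one table keyed by word tuples."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def katz_prob(word, history, counts, vocab, total):
    """Katz probability of `word` following the tuple `history`."""
    if not history:
        return counts[(word,)] / total  # unigram relative frequency
    h_count = counts[history]
    if h_count == 0:  # history never seen: back off with alpha = 1
        return katz_prob(word, history[1:], counts, vocab, total)
    seen = [w for w in vocab if counts[history + (w,)] > 0]
    if word in seen:  # matched at this order: discounted estimate
        return (counts[history + (word,)] - DISCOUNT) / h_count
    # alpha spreads the discounted mass over words unseen after this history
    left_over = 1.0 - sum((counts[history + (w,)] - DISCOUNT) / h_count for w in seen)
    lower_seen = sum(katz_prob(w, history[1:], counts, vocab, total) for w in seen)
    alpha = left_over / (1.0 - lower_seen)
    return alpha * katz_prob(word, history[1:], counts, vocab, total)

def predict(phrase, counts, top=5, max_n=4):
    """Return the `top` words sorted by descending Katz probability."""
    vocab = [g[0] for g in counts if len(g) == 1]
    total = sum(counts[(w,)] for w in vocab)
    history = tuple(phrase.lower().split()[-(max_n - 1):])  # last few words only
    scored = [(w, katz_prob(w, history, counts, vocab, total)) for w in vocab]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top]

toy_counts = build_counts(["the cat sat on the mat",
                           "the cat ran to the door",
                           "the dog sat on the rug"])
predictions = predict("The cat", toy_counts)
```

With this toy corpus the two words observed after "the cat" ("sat" and "ran") rank first, and the probabilities over the whole vocabulary sum to 1, which is the property the alpha weights are meant to preserve.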
Link to detailed model description and scripts to replicate the model