Next Word Prediction

Yanhua Hou
03/30/17

Presentation for Coursera Data Science Capstone Project

Overview

These slides present the predictive text model built for the Data Science Capstone project.

  • Aim: build a Shiny app that predicts the next word a user will type, given the word sequence entered so far.
  • Content
    • Data source and processing
    • Algorithm for the next-word prediction
    • ShinyApp
  • Link

Data Preparation

  • The data, downloaded from the HC Corpora corpus, consist of English documents from three sources: Twitter, blogs, and news articles. A random 5% sample is drawn from each file and the samples are combined; 80% of the combined data is used for training, 10% for testing, and 10% for validation.

  • Data cleaning involves

    • eliminating emojis and URLs
    • replacing abbreviations and contractions with their full forms
    • converting words to lower case
    • removing profanity and words containing numbers
    • replacing punctuation with spaces
    • removing numbers and meaningless one-, two-, or three-letter words
    • removing extra white space
  • Building an n-gram dictionary

    • create word combinations from 1-grams (covering 98% of the vocabulary) up to 5-grams; for the higher-order grams, drop those that appear only once, then save the tables (see the sketch below).
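
A minimal sketch of this preparation pipeline, assuming base R plus the data.table package; the file names, regular expressions, and helper names here are illustrative assumptions, not the app's actual code:

    # Sample, split, clean, and count n-grams (illustrative sketch)
    library(data.table)
    set.seed(1234)

    sample_lines <- function(path, rate = 0.05) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[runif(length(lines)) < rate]           # keep ~5% of the lines
    }
    corpus <- c(sample_lines("en_US.twitter.txt"),
                sample_lines("en_US.blogs.txt"),
                sample_lines("en_US.news.txt"))

    # 80/10/10 split into training / testing / validation
    idx   <- sample(seq_along(corpus))
    n     <- length(corpus)
    train <- corpus[idx[1:floor(0.8 * n)]]
    test  <- corpus[idx[(floor(0.8 * n) + 1):floor(0.9 * n)]]
    valid <- corpus[idx[(floor(0.9 * n) + 1):n]]

    clean_text <- function(x) {
      x <- tolower(x)                              # lower case
      x <- gsub("http\\S+|www\\.\\S+", " ", x)     # drop URLs
      x <- gsub("\\S*[0-9]+\\S*", " ", x)          # words containing numbers
      x <- gsub("[[:punct:]]+", " ", x)            # punctuation -> spaces
      gsub("\\s+", " ", trimws(x))                 # squeeze extra white space
    }
    train <- clean_text(train)

    # Count n-grams; higher-order grams that occur only once are dropped
    ngram_counts <- function(lines, n) {
      grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
        if (length(w) < n) return(character(0))
        vapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "), "")
      }))
      dt <- data.table(gram = grams)[, .(freq = .N), by = gram]
      if (n > 1) dt <- dt[freq > 1]
      setorder(dt, -freq)
      dt
    }
    ngrams <- lapply(1:5, function(n) ngram_counts(train, n))  # saved with saveRDS()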

Algorithm for the Next-word Prediction

  • A Stupid Backoff scheme drives the next-word prediction.
  • Inputs from the user are a partial phrase and the number of matches to show ('n_res').
  • The phrase is cleaned and its last 'nlastW' words are extracted ('nlastW' is the size of the context).
  • Search for 'n_res' candidates, backing off from the min(nlastW + 1, 5)-gram table down to the bigram table, and remove duplicates. If too few matches are found, or the words entered are out of vocabulary, supplement the list with the most frequent unigrams.
  • Rank the candidates by their Stupid Backoff scores.
  • Return a data frame containing the next word, its score, and the order of the n-gram used for the prediction (sketched below).
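
Stupid Backoff scores a candidate word as count(context + word) / count(context), multiplying the score by a fixed factor each time the model backs off to a shorter context (0.4 in Brants et al., 2007; the app's exact factor is an assumption). A minimal sketch of the lookup, reusing clean_text and the ngrams tables from the earlier sketch:

    lambda <- 0.4                                  # backoff factor (assumed)

    predict_next <- function(phrase, n_res = 5) {
      words  <- strsplit(clean_text(phrase), " ", fixed = TRUE)[[1]]
      nlastW <- min(length(words), 4)              # at most a 4-word context
      out <- data.table(nextword = character(0), score = numeric(0),
                        ngram = integer(0))
      for (k in rev(seq_len(nlastW))) {            # (k+1)-gram down to bigram
        ctx     <- paste(tail(words, k), collapse = " ")
        ctx_cnt <- ngrams[[k]][gram == ctx, sum(freq)]
        if (ctx_cnt == 0) next                     # unseen context: back off
        hits <- ngrams[[k + 1]][startsWith(gram, paste0(ctx, " "))]
        if (nrow(hits) == 0) next
        out <- rbind(out, hits[, .(nextword = sub(".* ", "", gram),
                                   score = lambda^(nlastW - k) * freq / ctx_cnt,
                                   ngram = k + 1L)])
      }
      out <- out[!duplicated(nextword)]            # keep highest-order match
      if (nrow(out) < n_res) {                     # too few matches or OOV:
        uni <- ngrams[[1]][!gram %in% out$nextword]  # top unigrams as filler
        uni <- head(uni, n_res - nrow(out))
        out <- rbind(out, uni[, .(nextword = gram,
                                  score = lambda^nlastW * freq / sum(ngrams[[1]]$freq),
                                  ngram = 1L)])
      }
      setorder(out, -score)
      head(out, n_res)                             # nextword, score, n-gram order
    }

For example, predict_next("thanks for the", 3) would return the three highest-scoring candidate continuations.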

ShinyApp