Bowen Zhang
June 30, 2020
This is the Capstone project for the Data Science Specialization Track offered by Johns Hopkins University. This project involves building a predictive text app that uses predictive analytics to suggest the next word a user is likely to type. The data for this project was provided by SwiftKey, a leading text suggestion application for mobile phones.
Capstone Deliverables:
The model used was a simple n-gram back-off model. The n-grams were created using a 5% random sample of the HC Corpora dataset from SwiftKey.
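As a rough illustration of the sampling step, the sketch below draws a 5% random sample of lines from a character vector. The function name sample_lines, the seed value, and the toy input are assumptions made for this example; the original sampling code is not shown here.

```r
# Hypothetical helper: keep a random fraction of the lines of a corpus.
sample_lines <- function(lines, fraction = 0.05, seed = 2020) {
  set.seed(seed)  # assumed seed, only so the sample is reproducible
  n_keep <- round(length(lines) * fraction)
  keep <- sample.int(length(lines), size = n_keep)
  lines[sort(keep)]  # preserve the original line order
}

# Toy stand-in for the real corpus files
corpus <- paste("line", 1:1000)
sampled <- sample_lines(corpus)
length(sampled)  # 50
```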
The data was cleaned and tokenized into n-grams of orders 1 through 4 (unigrams, bigrams, trigrams, and four-grams). Each of these datasets was transformed into a data frame with one column per word of the n-gram, plus a column holding the frequency of that word combination.
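A minimal sketch of such a table, written in base R so that it runs on its own. The function name build_ngram_table, the column names (word1, word2, ..., freq), and the very simple cleaning rule are assumptions for illustration, not the project's actual code.

```r
# Hypothetical helper: build a frequency table of n-grams,
# with one column per word and a freq column.
build_ngram_table <- function(texts, n) {
  grams <- unlist(lapply(texts, function(txt) {
    # Assumed cleaning: lower-case, keep only letters, apostrophes, spaces
    txt <- gsub("[^a-z' ]", " ", tolower(txt))
    words <- strsplit(txt, "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) return(data.frame())
  counts <- sort(table(grams), decreasing = TRUE)
  # Split each n-gram back into its n words, one column per word
  parts <- do.call(rbind, strsplit(names(counts), " ", fixed = TRUE))
  out <- data.frame(parts, freq = as.integer(counts),
                    stringsAsFactors = FALSE)
  names(out) <- c(paste0("word", seq_len(n)), "freq")
  out
}

texts <- c("The cat sat on the mat.", "The cat ran!")
bigrams <- build_ngram_table(texts, 2)
bigrams[1, ]  # most frequent bigram: "the" "cat", freq 2
```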
To make a prediction, the input is matched word by word against the n-gram data. Using back-off, we first limit the input to 3 words by taking the tail of the phrase. For an input of n words, we then search the (n+1)-gram data for entries whose first n words match the input, and the final word of a matching entry is the candidate prediction. If there are no matches, we drop the first word of the input and repeat the search in the next lower-order n-gram data with the shortened input.
This was done with the quanteda package in R.
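The back-off lookup described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the app's code: predict_next and the toy tables are hypothetical, the tables are assumed to be a list indexed by n-gram order with word columns followed by a freq column, and the final fallback to the most frequent unigram is an assumption, since the write-up does not say what happens when nothing matches.

```r
# Hypothetical back-off lookup. tables[[k]] holds the k-gram data frame:
# k word columns followed by a freq column.
predict_next <- function(input, tables, max_context = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  context <- utils::tail(words, max_context)  # keep the last 3 words
  while (length(context) > 0) {
    n <- length(context)
    tab <- tables[[n + 1]]  # search the (n+1)-gram table
    hit <- rep(TRUE, nrow(tab))
    for (i in seq_len(n)) hit <- hit & tab[[i]] == context[i]
    if (any(hit)) {
      matches <- tab[hit, ]
      return(matches[[n + 1]][which.max(matches$freq)])
    }
    context <- context[-1]  # back off: drop the first word
  }
  # Assumed fallback: most frequent unigram
  uni <- tables[[1]]
  uni[[1]][which.max(uni$freq)]
}

# Toy tables, invented for this example
tables <- list(
  data.frame(word1 = c("the", "cat"), freq = c(10L, 4L),
             stringsAsFactors = FALSE),
  data.frame(word1 = c("the", "the", "cat"),
             word2 = c("cat", "mat", "sat"),
             freq = c(3L, 1L, 2L), stringsAsFactors = FALSE),
  data.frame(word1 = "on", word2 = "the", word3 = "mat",
             freq = 2L, stringsAsFactors = FALSE),
  data.frame(word1 = "sat", word2 = "on", word3 = "the", word4 = "mat",
             freq = 1L, stringsAsFactors = FALSE)
)

predict_next("it sat on the", tables)  # four-gram match: "mat"
predict_next("hello the", tables)      # backs off to bigrams: "cat"
```

Returning only the most frequent match keeps the sketch short; an app that shows several suggestions would return the top few rows of the matching table instead.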
The Shiny App: (http://bzhang93.shinyapps.io/Ngrams-Text-Predictor/)
Breakdown: