Chris Woods
30th August 2019
Completed for the Capstone Project of the Johns Hopkins University Data Science Specialisation.
This presentation describes the Text Predictor App I created for the Data Science Specialisation run by Johns Hopkins University via Coursera, which partnered with SwiftKey to apply data science to natural language processing.
The objective of this project was to build a working model that predicts the next word from the words already typed. The data used in the model came from a corpus called HC Corpora, made up of three files (tweets, news articles and blog posts).
The model is delivered via a Shiny App which can be accessed via https://chriswoodssays.shinyapps.io/TextPredictorApp/.
I developed the algorithm by creating n-grams (sequences of n consecutive words) found within the training text, e.g. 'I live in'. I initially planned to use 2-, 3- and 4-grams but, for reasons of performance within Shiny, limited it to 2- and 3-grams. I also used a 20% sample of each of the 3 files (Twitter, News and Blogs).
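The sampling step can be illustrated with a short sketch. It is written in Python purely for illustration and is not the code used in the App; the function name `sample_lines`, the fixed seed and the choice to sample whole lines at random are all assumptions, since the slides do not say how the 20% sample was drawn.

```python
import random

def sample_lines(lines, fraction=0.2, seed=42):
    """Return a reproducible random sample of the given fraction of lines.

    Assumption: the sample is drawn line by line; the seed is arbitrary.
    """
    rng = random.Random(seed)           # local generator, so results repeat
    k = round(len(lines) * fraction)    # number of lines to keep
    return rng.sample(lines, k)

# Hypothetical usage with stand-in data rather than the real corpus files
tweets = [f"tweet number {i}" for i in range(100)]
subset = sample_lines(tweets)
```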
I used the quanteda package to create the 2- and 3-grams. For performance reasons, I did this in advance, saving them to files which are included as part of the App. Profanities (the list is not exhaustive), punctuation, hyphens, Twitter handles, symbols and n-grams that appear only once were removed. Dropping the n-grams that appear only once removes a very high volume of entries, allowing a greater focus on those that appear more consistently.
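The cleaning and counting steps can be sketched as follows. This is a minimal Python illustration, not the quanteda code used to build the App's files. The names `tokenize` and `ngram_counts`, the placeholder profanity list and the exact regular expressions are assumptions, as is the choice to filter tokens before forming n-grams.

```python
import re
from collections import Counter

# Hypothetical placeholder; the real list is not given in the slides
PROFANITIES = {"darn"}

def tokenize(text):
    """Lower-case the text and strip handles, punctuation, hyphens and symbols."""
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)        # drop Twitter handles
    text = re.sub(r"[^a-z' ]+", " ", text)   # keep only letters, apostrophes, spaces
    tokens = (t.strip("'") for t in text.split())
    return [t for t in tokens if t and t not in PROFANITIES]

def ngram_counts(lines, n, min_count=2):
    """Count n-grams across lines, discarding those seen fewer than min_count times."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    # min_count=2 removes every n-gram that appears only once
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Hypothetical usage on a toy corpus
corpus = ["I live in London", "I live in Paris, @bob!", "We live in-between"]
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)
```

Even on this toy corpus most n-grams occur once and are discarded, which mirrors the large reduction described above.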
The App takes the last 1 or 2 words input and uses the Stupid Backoff Method to score entries in the n-gram files and select those with the highest probability. See final slide for an example.
Once the App is loaded, the user simply types a word or phrase into the text box. Whenever one or more matches are found, the top 10 are shown on the chart. More than 10 may be shown if several words share the same score within the top 10.
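The "top 10, plus ties" behaviour can be sketched like this. It is an illustrative Python version, not the App's own code; the function name `top_with_ties` and the alphabetical ordering of tied words are assumptions.

```python
def top_with_ties(scores, n=10):
    """Return (word, score) pairs for the n best scores, keeping any ties at the cutoff."""
    # Sort by score descending, then alphabetically so the order is stable
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]   # score of the n-th ranked word
    return [(word, s) for word, s in ranked if s >= cutoff]

# Hypothetical usage: asking for the top 2 returns 3 words because of a tie
example = top_with_ties({"the": 0.5, "a": 0.3, "my": 0.3, "our": 0.1}, n=2)
```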
Stupid Backoff Example
If 'I live in' is input, the App looks at the last two words, 'live in'. If the 'live in the' 3-gram appears more times than any other 'live in…' 3-gram in the training data, then 'the' will have the highest probability. If there are no 'live in…' 3-grams in the data, it will 'back off' to 2-grams that start with 'in'.
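The example above can be sketched in code. This is a minimal Python illustration under stated assumptions, not the App's implementation. The function name `score_candidates` is hypothetical. The back-off factor of 0.4 is the value commonly quoted for Stupid Backoff, but the slides do not say which value the App uses. Each score is the n-gram count divided by the total count of stored n-grams sharing the same context, which is one possible choice of denominator when only the 2- and 3-gram files are available.

```python
ALPHA = 0.4  # assumed back-off factor; not stated in the slides

def score_candidates(context, trigrams, bigrams, alpha=ALPHA):
    """Return {next_word: score} using the last one or two words of context."""
    words = context.lower().split()
    if len(words) >= 2:
        w1, w2 = words[-2:]
        matches = {g[2]: c for g, c in trigrams.items() if g[:2] == (w1, w2)}
        if matches:
            total = sum(matches.values())
            return {w: c / total for w, c in matches.items()}
    if words:
        # No matching 3-gram (or only one word typed): use 2-grams on the last word
        last = words[-1]
        matches = {g[1]: c for g, c in bigrams.items() if g[0] == last}
        total = sum(matches.values())
        # Discount only when we actually backed off from the 3-gram level
        factor = alpha if len(words) >= 2 else 1.0
        return {w: factor * c / total for w, c in matches.items()}
    return {}

# Hypothetical counts for illustration only
trigrams = {("live", "in", "the"): 5, ("live", "in", "a"): 3}
bigrams = {("in", "the"): 10, ("in", "a"): 6, ("in", "my"): 4}

direct = score_candidates("I live in", trigrams, bigrams)      # 3-gram match
backed = score_candidates("I reside in", trigrams, bigrams)    # backs off to 'in'
```

With these made-up counts, 'the' ranks first in both cases, but the backed-off scores are scaled down by the factor so that a direct 3-gram match always outranks a 2-gram fallback of similar relative frequency.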