Text Predictor App

Chris Woods
30th August 2019

Completed for the Capstone Project of the Johns Hopkins University Data Science Specialisation.

Project Summary

This presentation describes the Text Predictor App I created for the Data Science Specialisation run by Johns Hopkins University via Coursera. For the capstone, the course partnered with SwiftKey to apply data science to natural language processing.

The objective of this project was to build a working model that predicts the next word from the words already entered. The data used in the model came from a corpus called HC Corpora, made up of three files (tweets, news articles and blog posts).

The model is delivered via a Shiny App which can be accessed via https://chriswoodssays.shinyapps.io/TextPredictorApp/.

Developing the Algorithm

I developed the algorithm by creating n-grams (sequences of n consecutive words) found within the training text, e.g. 'I live in'. I initially planned to use 2-, 3- and 4-grams but, for reasons of performance within Shiny, I limited the model to 2- and 3-grams. I also used a 20% sample of each of the 3 files (Twitter, News and Blogs) rather than the full text.
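The sampling step can be illustrated with a minimal Python sketch. This is not the App's own code (the App was built in R); the function name `sample_lines`, the seed and the toy corpus are assumptions made purely for illustration.

```python
import random

def sample_lines(lines, fraction=0.2, seed=2019):
    """Return a random sample of `fraction` of the lines, kept in original order.

    The fixed seed (an arbitrary, assumed value) makes the sample reproducible.
    """
    rng = random.Random(seed)
    k = round(len(lines) * fraction)
    chosen = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in chosen]

# Toy stand-ins for the three source files (hypothetical data).
corpus = {
    "twitter": [f"tweet {i}" for i in range(10)],
    "news": [f"news item {i}" for i in range(20)],
    "blogs": [f"blog post {i}" for i in range(5)],
}

# Sample each file separately so every source contributes 20% of its lines.
sampled = {name: sample_lines(lines) for name, lines in corpus.items()}
print({name: len(lines) for name, lines in sampled.items()})
# → {'twitter': 2, 'news': 4, 'blogs': 1}
```

Sampling each file separately, rather than pooling them first, keeps the three sources in the same proportion as in the full corpus.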

I used the quanteda package to create the 2- and 3-grams. For performance reasons, I did this in advance and saved them to files that are included as part of the App. Profanities (from a non-exhaustive list), punctuation, hyphens, Twitter handles and symbols are removed, as are n-grams that appear only once. Dropping these single-occurrence n-grams removes a very high volume of entries and puts the focus on those that appear more consistently.
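To make the cleaning and pruning concrete, here is a rough sketch in plain Python. It does not reproduce the quanteda calls used for the App. The names `PROFANITIES`, `tokenize`, `count_ngrams` and `prune_singletons` are hypothetical, the profanity list is a placeholder, and treating a hyphen as a word break is an assumption.

```python
import re
from collections import Counter

PROFANITIES = {"darn"}  # hypothetical placeholder, not the App's list

def tokenize(text):
    """Lower-case the text and strip handles, hyphens, punctuation and symbols."""
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)   # drop Twitter handles
    text = text.replace("-", " ")       # assumption: a hyphen splits a word in two
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)  # keep words and contractions
    return [t for t in tokens if t not in PROFANITIES]

def count_ngrams(texts, n):
    """Count every run of n consecutive tokens across all texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune_singletons(counts):
    """Discard n-grams that appear only once."""
    return Counter({gram: c for gram, c in counts.items() if c > 1})

texts = [
    "I live in the city, @alice!",
    "We live in the north-east.",
    "They live in a small town.",
]
trigrams = prune_singletons(count_ngrams(texts, 3))
bigrams = prune_singletons(count_ngrams(texts, 2))
print(dict(trigrams))  # → {('live', 'in', 'the'): 2}
print(dict(bigrams))   # → {('live', 'in'): 3, ('in', 'the'): 2}
```

Even in this tiny example, pruning cuts eleven 3-gram occurrences (ten distinct 3-grams) down to a single entry, which shows why the saved files stay small.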

The App takes the last one or two words entered and uses the Stupid Backoff method to score the matching entries in the n-gram files, selecting those with the highest score. See the final slide for an example.
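The lookup can be sketched in Python as follows. This is an illustration under stated assumptions, not the App's R code: the back-off factor of 0.4 is the value conventionally used with Stupid Backoff and is not confirmed for this App, scores are taken as counts relative to the other entries with the same leading words, and the function names and toy counts are hypothetical.

```python
from collections import Counter

ALPHA = 0.4  # conventional Stupid Backoff factor; assumed, not taken from the App

def scores_for_context(ngrams, context):
    """Relative frequency of each word that follows `context` in an n-gram table."""
    matches = {gram[-1]: c for gram, c in ngrams.items() if gram[:-1] == context}
    total = sum(matches.values())
    return {word: c / total for word, c in matches.items()}

def predict(text, trigrams, bigrams):
    """Score candidate next words, backing off from 3-grams to 2-grams."""
    words = tuple(text.lower().split())
    if len(words) >= 2:
        scores = scores_for_context(trigrams, words[-2:])
        if scores:
            return scores
    if words:
        scores = scores_for_context(bigrams, words[-1:])
        # Apply the discount only when we actually backed off from a 3-gram lookup.
        factor = ALPHA if len(words) >= 2 else 1.0
        return {word: factor * s for word, s in scores.items()}
    return {}

# Toy counts (hypothetical data).
trigrams = Counter({("live", "in", "the"): 6, ("live", "in", "a"): 2})
bigrams = Counter({("in", "the"): 5, ("in", "a"): 3, ("in", "my"): 2, ("live", "in"): 9})

print(predict("I live in", trigrams, bigrams))
# → {'the': 0.75, 'a': 0.25}

# No 3-gram starts with 'stay in', so the 2-grams starting with 'in' are used
# and their scores (0.5, 0.3, 0.2) are multiplied by ALPHA.
print(predict("We stay in", trigrams, bigrams))
```

Because the same factor is applied to every backed-off candidate, it does not change their ranking. It only matters when scores from different n-gram orders are compared.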

Using the App

Once the App has loaded, the user simply types a word or phrase into the text box. Whenever one or more matches are found, the top 10 are shown on the chart. More than 10 may be shown if several words tie on score within the top 10.
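One way the tie behaviour could arise is sketched below in Python. It is an assumption about the selection rule rather than the App's actual code: every word whose score is at least as high as the n-th best is kept, so ties at the cut-off extend the list. The function name `top_with_ties` and the scores are hypothetical, and n is set to 3 to keep the example short.

```python
def top_with_ties(scores, n=10):
    """Return the n highest-scoring words, plus any that tie with the n-th."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]  # score of the n-th best word
    return [(word, s) for word, s in ranked if s >= cutoff]

scores = {"the": 0.5, "a": 0.2, "my": 0.1, "our": 0.1, "this": 0.05}
print(top_with_ties(scores, n=3))
# → [('the', 0.5), ('a', 0.2), ('my', 0.1), ('our', 0.1)]
```

Here 'my' and 'our' share third place, so four words are returned even though only three were requested.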

Text Prediction App

What I would do next

Look at options for improving accuracy by increasing the sample size and adding 4-grams
Provide a choice of which training set to use, as they will be quite different (especially Twitter vs the others)
Improve the profanity checking

Stupid Backoff Example

If 'I live in' is input and 'live in the' appears more often than any other 3-gram starting with 'live in' in the training data, then 'the' will have the highest score. If no 3-gram starting with 'live in' appears in the data, the App will 'back off' to 2-grams that start with 'in'.
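Written as a formula, the general Stupid Backoff score for a candidate word w after the two words w_1 w_2, limited to 3-grams and 2-grams, looks like this. Here c(·) is the number of times a word sequence appears in the training data and α is the back-off factor. The value α = 0.4 is the one conventionally used with this method; the factor used in the App is not stated, so treat it as an assumption.

```latex
S(w \mid w_1 w_2) =
\begin{cases}
\dfrac{c(w_1\, w_2\, w)}{c(w_1\, w_2)} & \text{if } c(w_1\, w_2\, w) > 0 \\[1.5ex]
\alpha \, \dfrac{c(w_2\, w)}{c(w_2)} & \text{otherwise}
\end{cases}
```

For the example, w_1 w_2 is 'live in', so the score for 'the' is c('live in the') / c('live in') when that 3-gram was seen, and α · c('in the') / c('in') when it was not.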