Text Prediction Using N-Grams and Katz Back-off

C Innes
June 2020

Objective

The objective of the Data Science Capstone Project in the Data Science Specialisation provided by Johns Hopkins University on Coursera is as follows:

To create a Shiny application which will predict the next word given a string of text as an input.

On the following slides I will detail the steps taken to build the final application and outline how to use the finished product.

The App

For a functioning version of the app, please visit https://caffles90.shinyapps.io/SwiftkeyPrediction/

The above Shiny web app uses a combination of n-gram tokenisation and a Katz back-off model to predict the next word, given a string of length n.
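
For reference, the textbook form of the Katz back-off estimate is reproduced below. This is the general definition, not a statement of what the app computes: the slides do not say whether the app applies the discount d or the back-off weight α, and the symbols C (n-gram count), k (count threshold), d and α are standard notation introduced here rather than taken from the app.

```latex
P_{bo}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
  d_{w_{i-n+1}^{i}} \, \dfrac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})}
    & \text{if } C(w_{i-n+1}^{i}) > k \\[2ex]
  \alpha_{w_{i-n+1}^{i-1}} \, P_{bo}(w_i \mid w_{i-n+2}^{i-1})
    & \text{otherwise}
\end{cases}
```

In words: if the full n-gram was seen often enough, use its discounted relative frequency; otherwise drop the earliest context word and try again with the shorter context, scaled by α.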

The model takes the last 3 words of the input string and identifies the most probable next word. If the string is shorter than 3 words, or those 3 words do not occur in the training data set, the model backs off to the last 2 words, and then to the last word. If no match is found even for the last word, it returns the most frequent single word in the training data.
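
The back-off order described above can be sketched as follows. This is a minimal illustration in Python, not the app's own code: the function name predict_next, the layout of the tables dictionary and the toy counts are all assumptions made for the example, and the sketch picks the most frequent continuation without any Katz discounting.

```python
from collections import Counter


def predict_next(text, tables, fallback):
    """Return the most frequent next word for the longest matching context.

    tables maps a context length n (3, 2 or 1) to a dict that maps a tuple
    of n words to a Counter of the words observed after that context.
    fallback is the most frequent single word in the training data.
    """
    words = text.lower().split()
    # Try the last 3 words, then the last 2, then the last word.
    for n in (3, 2, 1):
        if len(words) < n:
            continue  # input too short for this context length
        context = tuple(words[-n:])
        candidates = tables.get(n, {}).get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    # Nothing matched at any length: return the most popular single word.
    return fallback


# Toy counts, invented purely for illustration.
tables = {
    3: {("one", "of", "the"): Counter({"most": 5, "best": 3})},
    2: {("of", "the"): Counter({"year": 4, "day": 2})},
    1: {("the",): Counter({"same": 7, "first": 6})},
}

print(predict_next("this is one of the", tables, "the"))  # 3-word match
print(predict_next("end of the", tables, "the"))          # backs off to 2 words
print(predict_next("zzz", tables, "the"))                 # falls back to "the"
```

Checking the longest context first means the prediction always uses as much of the input as the training data supports.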

Data Exploration

Prior to the production of this app, a deep dive of the data was completed, with frequency plots outlining the most common 1-word, 2-word, and 3-word phrases (n-grams). The documentation for this earlier stage can be found here: https://rpubs.com/Caffles90/DSC_Milestone

Once this was completed, the data sets were transformed into document-feature matrices (DFMs), and the resulting data frames of n-gram frequencies were used to predict the next word based on how often each phrase occurs in the DFM.
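
To make the idea of n-gram frequency tables concrete, here is a minimal sketch in plain Python. It is not the app's pipeline: the function name build_tables, the whitespace tokenisation and the two-sentence corpus are assumptions made for the example, and it uses a simple dictionary of counters rather than a DFM.

```python
from collections import Counter, defaultdict


def build_tables(sentences, max_context=3):
    """Count which word follows each 1-, 2- and 3-word context.

    Returns (tables, unigrams): tables[n][context][next_word] is the number
    of times next_word followed the n-word tuple context, and unigrams
    holds the single-word counts used as the final fallback.
    """
    tables = {n: defaultdict(Counter) for n in range(1, max_context + 1)}
    unigrams = Counter()
    for sentence in sentences:
        words = sentence.lower().split()  # naive whitespace tokenisation
        unigrams.update(words)
        for n in range(1, max_context + 1):
            # Slide a window of n context words plus one following word.
            for i in range(len(words) - n):
                context = tuple(words[i:i + n])
                tables[n][context][words[i + n]] += 1
    return tables, unigrams


# Tiny made-up corpus, for illustration only.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
tables, unigrams = build_tables(corpus)

print(tables[1][("the",)])           # words seen after "the", with counts
print(tables[3][("the", "cat", "sat")])
print(unigrams.most_common(1))       # most frequent single word
```

Storing counts by context length keeps each lookup a single dictionary access, which matters for an app that has to respond as the user types.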

Instructions for Usage

The application is designed to be straightforward to use, on the assumption that it will be used primarily on mobile devices.

At the top left there is a text box captioned “Type here:”. Simply overtype the existing text in the box with the string for which you would like the next word predicted.

The word with the highest probability score will be displayed on the right-hand side of the app.

Please note that if you enter non-English words, the app will return an error.