Data Science Capstone: Ngrams Text Prediction

Bowen Zhang
June 30, 2020

Introduction

This is the Capstone project for the Data Science Specialization Track offered by Johns Hopkins University. The project involves building a text app that uses predictive analytics to suggest the next word as a user types. The data for this project was provided by SwiftKey, a leading text suggestion application for mobile phones.

Capstone Deliverables:

  • Create a web application with Shiny in R
  • The application must allow users to input phrases and obtain predictions for the next word
  • Develop a prediction algorithm that takes text and outputs a single word or a short list of top candidate words

Model and Algorithm

The model used was a simple n-gram back-off model. The n-grams were created using a 5% random sample of the HC Corpora dataset from SwiftKey.
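As a rough illustration of the sampling step, the sketch below keeps a reproducible random 5% of the lines of a text vector. It uses base R only; the helper name `sample_lines`, the seed, and the toy `corpus_lines` vector are assumptions made for this example and are not taken from the project code.

```r
# Hypothetical helper: keep a random fraction of the lines of a text vector.
sample_lines <- function(lines, fraction = 0.05, seed = 1234) {
  set.seed(seed)                                   # make the sample reproducible
  n_keep <- max(1, round(length(lines) * fraction))
  lines[sort(sample.int(length(lines), n_keep))]   # keep original line order
}

# Toy stand-in for the corpus (the real data would be read from the text files).
corpus_lines <- paste("line", seq_len(200))
sampled <- sample_lines(corpus_lines)
length(sampled)
```

Fixing the seed means the same subset is drawn on every run, which keeps the resulting n-gram counts comparable between runs.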

The data was cleaned and tokenized into n-grams of orders 1 through 4 (unigrams, bigrams, trigrams, fourgrams). Each of these datasets was transformed into a data frame with one column per word, plus a column holding the frequency of that word combination.
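To make the table layout concrete, here is a minimal sketch that builds such a data frame from a few toy sentences. It is written in base R so that it runs on its own, whereas the project itself used quanteda; the function names (`clean_text`, `tokenize`, `ngram_table`), the column names (`w1`, `w2`, ..., `freq`), and the cleaning rules are all assumptions for illustration.

```r
# Lower-case the text and keep only letters, apostrophes, and single spaces.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  trimws(gsub("\\s+", " ", x))
}

# Split each cleaned line into a vector of words.
tokenize <- function(x) strsplit(clean_text(x), " ", fixed = TRUE)

# Count n-grams and return a data frame with columns w1..wn and freq.
ngram_table <- function(lines, n) {
  grams <- unlist(lapply(tokenize(lines), function(words) {
    if (length(words) < n) return(character(0))   # line too short for an n-gram
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)  # most frequent first
  parts <- do.call(rbind, strsplit(names(counts), " ", fixed = TRUE))
  df <- data.frame(parts, freq = as.integer(counts), stringsAsFactors = FALSE)
  names(df) <- c(paste0("w", seq_len(n)), "freq")
  df
}

# Toy corpus (made up for this example).
toy_lines <- c("I love data science", "I love R", "Data science is fun")
bigrams <- ngram_table(toy_lines, 2)
bigrams[bigrams$w1 == "i", ]
```

Keeping one word per column means a lookup can compare the input words against the leading columns and read the prediction from the last word column.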

The input can then be matched word by word against each of the n-gram tables. Using back-off, we first limit the input to at most 3 words by taking the tail of the phrase. For an input of n words, we search the (n+1)-gram data for rows whose first n words match the input. If there are no matches, we drop the first word of the input and search the next lower-order table with the shortened input.

This was done with the quanteda package in R.
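The back-off search described above can be sketched as follows. This is a simplified base R illustration under stated assumptions, not the project's quanteda-based code: the function name `predict_next`, the `top` parameter, the `w1`..`wn`/`freq` column layout, and the tiny hand-made tables are all hypothetical.

```r
# Toy n-gram tables (made-up counts); columns w1..wn plus freq.
bigrams <- data.frame(w1 = c("of", "of"), w2 = c("the", "a"),
                      freq = c(50L, 20L), stringsAsFactors = FALSE)
trigrams <- data.frame(w1 = c("one", "one"), w2 = c("of", "of"),
                       w3 = c("the", "my"), freq = c(12L, 3L),
                       stringsAsFactors = FALSE)
fourgrams <- data.frame(w1 = "is", w2 = "one", w3 = "of", w4 = "those",
                        freq = 2L, stringsAsFactors = FALSE)
tables <- list("2" = bigrams, "3" = trigrams, "4" = fourgrams)  # keyed by order

# Search the (n+1)-gram table for an n-word input, backing off on a miss.
predict_next <- function(words, tables, top = 3) {
  words <- tail(words, 3)                      # keep at most the last 3 words
  while (length(words) > 0) {
    tbl <- tables[[as.character(length(words) + 1)]]
    keep <- rep(TRUE, nrow(tbl))
    for (i in seq_along(words)) {
      keep <- keep & tbl[[i]] == words[i]      # i-th column must equal i-th word
    }
    hits <- tbl[keep, , drop = FALSE]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$freq), , drop = FALSE]
      return(head(hits[[length(words) + 1]], top))  # last word column
    }
    words <- words[-1]                         # back off: drop the first word
  }
  NA_character_                                # no match, even in the bigrams
}

predict_next(c("this", "is", "one", "of"), tables)  # matched in the fourgrams
predict_next(c("was", "one", "of"), tables)         # backs off to the trigrams
predict_next(c("kind", "of"), tables)               # backs off to the bigrams
```

This sketch ranks candidates by raw frequency within the first table that produces a match. It applies no discounting or smoothing across orders, which is consistent with the "simple" back-off described here but is still an assumption about the details.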

Shiny App

The Shiny App: (http://bzhang93.shinyapps.io/Ngrams-Text-Predictor/)

Breakdown:

  1. Enter your phrase in the input textbox and hit the “Predict” button.
  2. The input is cleaned, tokenized, and truncated to at most 3 words, then returned as a vector of words.
  3. The predict function is called on the cleaned input.
  4. Based on the length of the input, the (n+1)-gram table is searched (e.g., fourgrams for a 3-word input).
  5. If no predictions are found, the input is truncated by dropping its first word and the next lower-order table is searched (e.g., if fourgrams returned NA, search trigrams).
  6. If no predictions are found in trigrams, the input is truncated to its last word and bigrams are searched.
  7. If no predictions are found in bigrams, "no matches found" is returned.
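Step 2 and the table choice in step 4 can be illustrated with a short base R sketch. The function name `clean_input`, its `max_words` parameter, and the exact cleaning rules are assumptions for this example; the deployed app may clean its input differently.

```r
# Hypothetical input cleaner: lower-case, strip non-letters, keep the last words.
clean_input <- function(phrase, max_words = 3) {
  phrase <- tolower(phrase)
  phrase <- gsub("[^a-z' ]", " ", phrase)          # drop digits and punctuation
  words <- strsplit(trimws(phrase), "\\s+")[[1]]
  words <- words[nzchar(words)]                    # guard against empty tokens
  tail(words, max_words)                           # truncate to the last words
}

words <- clean_input("This is ONE of,  the")
words                      # vector of at most 3 cleaned words
ngram_order <- length(words) + 1
ngram_order                # order of the n-gram table searched first
```

A 3-word result leads to the fourgram table, a 2-word result to the trigram table, and a single word to the bigram table.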