1. Coursera Data Science Capstone: Word Prediction

David Larue, david_m_larue@compuserve.com
August 16th, 2019

2. Introduction

3. Algorithm

  • The developed code uses the Backoff algorithm.
  • Uses n-grams from the training set to produce sample inputs (up to 3 words) and their possible predictions.
  • Ranks each (input plus single-word prediction) pair by frequency, keeping the top 5 per input.
  • A given input is tokenized, cleaned, and searched for among the n-grams, and the top choice is returned.
  • Retrieval is simply a dictionary lookup: the input phrase, given as a sequence of words, is the key to the (large) dictionary.
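
The steps above can be sketched roughly as follows. This is a minimal illustration, not the capstone code: the names `tokenize`, `build_table`, and `predict`, the whitespace tokenizer, and the toy corpus are all assumptions made for the example. The original uses NLTK for tokenization and also drops low-frequency n-grams, which this sketch omits.

```python
from collections import Counter, defaultdict

MAX_CONTEXT = 3   # up to 3 input words, as described above
TOP_K = 5         # number of predictions kept per input


def tokenize(text):
    # Stand-in for the NLTK tokenization and cleaning step (assumption).
    return text.lower().split()


def build_table(tokens, max_context=MAX_CONTEXT, top_k=TOP_K):
    """Map each context tuple (1..max_context words) to its top_k next words."""
    counts = defaultdict(Counter)
    for n in range(1, max_context + 1):
        for i in range(len(tokens) - n):
            context = tuple(tokens[i:i + n])
            counts[context][tokens[i + n]] += 1
    # Keep only the most frequent continuations, as (word, frequency) tuples.
    return {ctx: tuple(c.most_common(top_k)) for ctx, c in counts.items()}


def predict(table, phrase, max_context=MAX_CONTEXT):
    """Try the longest context first, then back off to shorter ones."""
    words = tokenize(phrase)[-max_context:]
    while words:
        hit = table.get(tuple(words))
        if hit:
            return [word for word, _ in hit]
        words = words[1:]   # drop the oldest word and retry
    return []


corpus = "the cat sat on the mat and the cat sat on the sofa"
table = build_table(tokenize(corpus))
print(predict(table, "the cat sat"))   # full 3-word context is found
print(predict(table, "a dog sat"))     # backs off to the 1-word context ("sat",)
```

Keys are tuples rather than lists because Python dictionary keys must be hashable.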

4. Implementation

  • Goal was a deployment on shinyapps.io
    • So outer framework is R in a Shiny App environment.
  • Inner implementation was in Python.
    • During the month between course 9 and the capstone, Python was studied.
    • Main data structures were dictionaries, which proved efficient for lookups in large tables.
    • Python package NLTK (Natural Language Toolkit) was used for tokenization.
    • R package reticulate was used to run Python code inside an R session.
  • Sample Code.
    • Inserts an additional tuple of a word and a frequency into a list of predictions
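    def ins(y, v, ng, mp):
        # v is (ngram, frequency); the first ng words of the ngram form the lookup key
        ind = v[0][:ng]
        # Append only if the key exists, holds fewer than mp predictions, and frequency >= 3
        if ind in y and len(y[ind]) < mp and v[1] >= 3:
            y[ind] = y[ind] + ((v[0][ng], v[1]),)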
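
One possible way to use `ins` is sketched below. The surrounding setup is an assumption for illustration only: the toy counts are invented, the dictionary is pre-seeded with an empty tuple for each input, and n-grams are fed in descending frequency order so that the first `mp` entries kept per input are the most frequent ones. The function is repeated so the example runs on its own.

```python
from collections import Counter


def ins(y, v, ng, mp):
    ind = v[0][0:ng]
    if ind in y and len(y[ind]) < mp and v[1] >= 3:
        y[ind] = y[ind] + ((v[0][ng], v[1]),)


# Hypothetical toy counts of 3-word n-grams (2 input words + 1 predicted word).
counts = Counter({
    ("thank", "you", "for"): 7,
    ("thank", "you", "so"): 5,
    ("thank", "you", "very"): 4,
    ("thank", "you", "all"): 2,   # below the frequency threshold of 3
})

# Seed every input with an empty tuple, since ins only updates existing keys.
predictions = {ngram[:2]: () for ngram in counts}

# Visit n-grams from most to least frequent, keeping at most 2 per input.
for item in counts.most_common():
    ins(predictions, item, ng=2, mp=2)

print(predictions)   # {('thank', 'you'): (('for', 7), ('so', 5))}
```

With `mp=2`, "very" is rejected because the list is already full, and "all" is rejected by the frequency threshold.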

5. Evaluation

  • User Interface.
    • Minimal but functional.
    • 5 seconds to initial display; another 10 seconds until ready; sub-second response to inputs.
  • Command Line Interface.
    • Able to generate and save a database of input/prediction n-grams from a text file.
    • Able to run the prediction algorithm with the database against phrases and against large input files.
  • Summary of full run against trial dataset is below.
    Minimum n-gram frequency kept in database        3
    3-word n-grams from 4/5 of sample text           78,922,609
    Correct predictions on first try                 12,417,704
    Percentage correct on first try                  15.7%
    Correct predictions within five tries            22,611,598
    Percentage correct within first five             28.7%
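
The two percentages follow directly from the counts in the summary. A short check using only the reported figures (the variable names are illustrative):

```python
total_ngrams = 78_922_609   # 3-word n-grams in the run
first_try = 12_417_704      # correct on the first prediction
within_five = 22_611_598    # correct within the first five predictions

first_pct = 100 * first_try / total_ngrams
five_pct = 100 * within_five / total_ngrams

print(f"{first_pct:.1f}%")   # 15.7%
print(f"{five_pct:.1f}%")    # 28.7%
```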