Coursera Data Science Capstone Presentation

Author: Jeremiah Lowhorn

Date: 1/13/2016

Synopsis

This slide show describes the methodology I used to complete the Coursera Data Science Capstone Project. We were tasked with creating a Shiny application that takes user text input and predicts the next word in a sentence. We were provided with large text files, which we were required to analyze and, using natural language processing, turn into a word prediction algorithm. More information can be found below.

Methodologies for My Algorithm

  • The algorithm I created begins by parsing the sentence entered by the user into ngrams
  • Ngrams are groups of consecutive words that can be used to predict the next word in a sentence
  • The algorithm examines the final four words, three words, two words, and single word of the provided sentence
  • Data tables in the Shiny environment contain fivegrams, fourgrams, threegrams, and twograms, with the last word of each ngram stored in a separate column from the preceding words
  • The ngram taken from the sentence is then used to find the highest-frequency matching ngram in the tables
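The parsing step above can be sketched as follows. This is a minimal Python illustration rather than the R code behind the Shiny application; the function names `tokenize` and `trailing_ngrams` and the simple regular-expression tokenizer are assumptions made for the example.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", sentence.lower())

def trailing_ngrams(sentence, max_n=4):
    """Return the final max_n, ..., 1 words of the sentence, longest first.

    Each result is a tuple that can be looked up as the "preceding words"
    part of a fivegram, fourgram, threegram, or twogram table.
    """
    words = tokenize(sentence)
    longest = min(max_n, len(words))
    return [tuple(words[-n:]) for n in range(longest, 0, -1)]

print(trailing_ngrams("I will see you at the end of"))
```

Sentences shorter than four words simply produce fewer prefixes, and an empty sentence produces an empty list.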

Methodologies for My Algorithm

  • If the ngram does not exist in the fivegram table, the model works its way down through the fourgram, threegram, and twogram tables until a match is found. An example of a fivegram table is shown to the right
  • Example: If you were to type “at the end of”, the algorithm would output the word “the”
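The back-off behavior described above might look like the sketch below. It is written in Python for illustration and is not the application's actual code: the nested-dictionary layout of `toy_tables`, the function name `predict_with_backoff`, and all frequency counts are invented for the example.

```python
def predict_with_backoff(words, tables, max_prefix=4):
    """Try the longest prefix first, then back off to shorter ones.

    words  : list of tokens from the user sentence
    tables : {prefix_length: {prefix_tuple: {next_word: frequency}}}
    Returns the highest-frequency next word, or None if nothing matches.
    """
    for n in range(min(max_prefix, len(words)), 0, -1):
        prefix = tuple(words[-n:])
        candidates = tables.get(n, {}).get(prefix)
        if candidates:
            return max(candidates, key=candidates.get)
    return None

# Hypothetical tables: prefix length 4 corresponds to the fivegram table,
# prefix length 1 to the twogram table. Counts are made up.
toy_tables = {
    4: {("at", "the", "end", "of"): {"the": 12, "a": 3}},
    1: {("of",): {"the": 40, "a": 9}},
}

print(predict_with_backoff(["at", "the", "end", "of"], toy_tables))
print(predict_with_backoff(["in", "the", "middle", "of"], toy_tables))
```

The first call matches in the fivegram table directly. The second finds no match for the longer prefixes and falls back to the twogram entry for “of”.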

[Figure: example of a fivegram table]

Methodologies for My Algorithm

  • If no match is found in any of the tables, the word “and” is used
  • If the user sentence contains no words, the word “the” is used
  • If the preceding word in the sentence is a stop word, all rows with a stop word in the final column are removed from the tables so that two stop words cannot be predicted consecutively
  • During preprocessing, all profane words are removed from the text documents
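The default words and the stop-word rule above can be sketched as follows. This is an illustrative Python example, not the application's code: the `STOP_WORDS` set is a small hypothetical list, `choose_word` is an invented name, and returning “and” when filtering leaves no candidates is an assumption, since the slides do not say what happens in that case.

```python
# Small illustrative stop-word list; the real list is not given in the slides.
STOP_WORDS = {"the", "and", "of", "a", "to", "in"}

def choose_word(words, candidates):
    """Apply the fallback rules to the candidates from a matched table row.

    words      : tokens of the user sentence
    candidates : {next_word: frequency} for the matched ngram, or None
    """
    if not words:
        return "the"  # empty input: default to "the"
    if candidates and words[-1] in STOP_WORDS:
        # Drop stop-word predictions so two stop words are not consecutive.
        candidates = {w: c for w, c in candidates.items() if w not in STOP_WORDS}
    if not candidates:
        return "and"  # no usable match: default to "and" (assumed behavior)
    return max(candidates, key=candidates.get)

print(choose_word(["end", "of"], {"the": 12, "time": 5}))
print(choose_word(["happy"], {"the": 12, "time": 5}))
```

In the first call the preceding word “of” is a stop word, so “the” is filtered out and the lower-frequency “time” is chosen instead.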

The Application

  • The application takes user input in the box under the “Next Word Predictor Input” title and displays a predicted word and a sentence in the boxes below
  • The initial load may take a few seconds, but once the application has loaded it should respond very quickly unless a stop word is used. An example of the application is shown to the right

[Figure: example of the application]