November 27, 2019

Coursera Data Science Capstone - Final Project Submission

  • This is the final, peer-graded project for the Data Science Specialization at Coursera.
  • The primary assignment is to develop a Shiny app that takes a phrase (multiple words) typed into a text box and outputs a prediction of the next word.
  • This capability can be applied to several use cases, such as simplifying texting on a smartphone, assisting disabled users, or developing an AI chat-bot to reduce customer support costs.
  • The training data set included text from news websites, Twitter feeds, and blog posts. Using R-based text mining and natural language processing, an algorithm was created to predict the next word of the input text.

Learning Algorithm

  • We used the N-Gram function to develop our algorithm. N-Grams are contiguous subsequences of length n of a given sequence.
  • The N-Gram function takes a sequence (vector) as input, which is text in this case.
  • It also takes a positive integer n giving the length of the contiguous subsequences to be computed, and returns those subsequences.
  • For example, 2-grams for the sentence “The cow jumps over the moon” are: “the cow”, “cow jumps”, “jumps over”, “over the”, “the moon”.
  • The N-Gram models were cleansed and tabulated using text from news articles, Twitter posts, and blogs.
  • The resulting data set (corpus) comprises 1-grams, 2-grams, 3-grams, and so on through 6-grams.
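
The n-gram extraction and tabulation described above can be sketched as follows. This is a minimal illustration in Python, not the project's R code; the helper names `tokenize` and `ngrams` are hypothetical, and the cleaning step is assumed to be plain lowercasing and whitespace splitting.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple cleaning step)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every contiguous subsequence of length n from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The cow jumps over the moon")

# 2-grams of the example sentence, in order of appearance
bigrams = ngrams(tokens, 2)

# Tabulate how often each n-gram occurs, for every order from 1 to 6
counts = {n: Counter(ngrams(tokens, n)) for n in range(1, 7)}
```

Here `bigrams` holds the five pairs listed in the example above, and `counts[1]` records that "the" occurs twice in the sentence. A real corpus would be tabulated the same way, one document at a time.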

Prediction Model

  • We used Katz’s back-off model as our next-word prediction model.
  • This model first searches the 6-grams in the corpus for a prediction, then “backs off” to the 5-grams if the first search is unsuccessful.
  • The process continues backwards to the 4-grams, 3-grams, and 2-grams.
  • If the 2-gram search is unsuccessful, then the most frequent 1-gram in the corpus is output as the predicted word.
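
The search order described above can be sketched as follows. This is a minimal illustration in Python, not the project's R code; `build_tables`, `predict_next`, and the toy sentences are hypothetical. It shows only the back-off search from 6-grams down to 1-grams. A full Katz back-off model also applies discounting and back-off weights to produce probabilities, which this sketch omits.

```python
from collections import Counter, defaultdict

MAX_N = 6  # highest n-gram order, matching the 1-gram to 6-gram tables


def build_tables(sentences, max_n=MAX_N):
    """Map each order n to {context of n-1 words: Counter of following words}."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                tables[n][tuple(context)][word] += 1
    return tables


def predict_next(phrase, tables, max_n=MAX_N):
    """Search the highest-order table first, backing off one order at a time."""
    tokens = phrase.lower().split()
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # the phrase is too short to form an (n-1)-word context
        context = tuple(tokens[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    # No match at any order: fall back to the most frequent 1-gram
    return tables[1][()].most_common(1)[0][0]


# Toy corpus, assumed for illustration only
tables = build_tables([
    "the cow jumps over the moon",
    "the cow eats grass",
    "the cow jumps high",
    "the dog sleeps",
])

prediction = predict_next("the cow", tables)
```

With this toy corpus, "the cow" is matched in the 3-gram table, where "jumps" follows it twice and "eats" once. An unseen phrase such as "zebra" finds no match at any order and falls through to the most frequent single word.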

App Usability

  • The screenshot below illustrates the Shiny next-word prediction application.
  • Use of the app is intuitive. Simply type or paste text into the input box, and the predicted word will be displayed.

References