Capstone Summary

Data Science Specialization

Johns Hopkins University and Coursera

Di Yang

April 21, 2016

Background

Goal The Capstone is the final project in the Data Science Specialization series offered by Johns Hopkins University through Coursera. In partnership with SwiftKey, the main goal of the Capstone is to build an application that predicts the next word a user will type, drawing on skills from all of the courses in the specialization.

Deliverables The main deliverable of the Capstone is the application, built using Shiny. The secondary deliverable is this presentation, built using R Presentations.

How to use the Shiny app

  1. Open the Shiny app at http://diyaaang.shinyapps.io/top3nlp
  2. Enter a word, phrase, or sentence
  3. The application displays the top three predictions for the next word
  4. Type one of the three predictions into the text box, or any other word of your choice

No need to click a button or a link to get results. All the user needs to do is type and hit the space bar! The user can pick one of the three predictions or ignore them and type any word; either way, the suggestions update automatically.
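Under the hood, this button-free behavior is a natural fit for Shiny's reactive model: an output that depends on the text input re-computes on every change to it, with no submit button. The snippet below is a minimal sketch of that pattern, not the app's actual source; predict_top3() is a hypothetical stand-in for the real prediction routine.

    library(shiny)

    # Hypothetical stand-in for the app's real prediction routine.
    predict_top3 <- function(text) {
      c("FIRST", "SAME", "BEST")  # placeholder suggestions
    }

    ui <- fluidPage(
      textInput("phrase", "Enter a word, phrase, or sentence:"),
      tableOutput("top3")
    )

    server <- function(input, output) {
      # renderTable re-runs whenever input$phrase changes,
      # so no submit button is needed.
      output$top3 <- renderTable({
        req(nzchar(trimws(input$phrase)))
        data.frame(prediction = predict_top3(input$phrase))
      })
    }

    shinyApp(ui, server)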

How it works

The application uses a combination of decision trees and trigram Markov chains to predict the next word by probability. On a scale of 1 to 10, the accuracy of the application is about a 7: it does an adequate job of predicting the next word based on raw probability, but is much weaker when semantics or popular phrases come into play.
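To make the trigram part concrete, it can be pictured as a lookup table keyed on the previous two words, with candidate next words ranked by how often they followed that pair in the training corpus. The sketch below is an assumed, simplified version of that idea; the table and its counts are made up for illustration and are not the app's actual data.

    # Illustrative trigram table: previous two words -> next word -> count.
    trigram_counts <- data.frame(
      w1        = c("winnie", "winnie", "winnie"),
      w2        = c("the", "the", "the"),
      next_word = c("first", "same", "best"),
      count     = c(42, 31, 27)  # made-up frequencies
    )

    # Return the top-n next words given the last two words typed.
    # (A real model would back off to bigrams/unigrams for short input.)
    predict_next <- function(phrase, n = 3) {
      words <- tolower(strsplit(trimws(phrase), "\\s+")[[1]])
      last2 <- tail(words, 2)
      hits  <- subset(trigram_counts, w1 == last2[1] & w2 == last2[2])
      hits  <- hits[order(-hits$count), ]
      head(toupper(hits$next_word), n)
    }

    predict_next("Winnie the")
    # [1] "FIRST" "SAME"  "BEST"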

For example, if the user types in “Winnie”, the top three results for the next predicted word are:

  • TIME
  • ROUND
  • AND

None of these results are what a typical user would expect (THE). So if the user types in THE, the top three results are automatically updated:

  • FIRST
  • SAME
  • BEST

Interestingly, if the user forgoes the suggestions and types in POOH, the top three results do contain the word a human would expect next (AND):

  • TIME
  • ROUND
  • AND

At this point, it might just be luck, because if the user types in AND, the top three results are:

  • THE
  • THEN
  • YOU

Ideas for improvement

More n-grams Since a trigram Markov chain was used in the first model, a natural improvement would be to move up to 4-grams or even higher-order n-grams, backing off to shorter contexts when a longer one has not been seen.
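As a rough illustration, higher-order models are usually paired with backoff: try the longest available context first, and fall back to shorter ones when it has never been seen. The sketch below assumes pre-built lookup tables (tables[[1]] for bigrams through tables[[3]] for 4-grams); none of this is the app's actual code.

    # Assumed helper: candidate next words for a context, best first,
    # or character(0) if the context was never seen in the corpus.
    lookup_ngrams <- function(context, table) {
      key  <- paste(context, collapse = " ")
      hits <- table[table$context == key, ]
      hits$next_word[order(-hits$count)]
    }

    # Back off from 4-grams (context of 3 words) down to bigrams.
    predict_with_backoff <- function(words, tables, n = 3) {
      for (k in 3:1) {
        if (length(words) >= k) {
          cands <- lookup_ngrams(tail(words, k), tables[[k]])
          if (length(cands) > 0) return(head(cands, n))
        }
      }
      character(0)  # no prediction available
    }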

Store user input and build it into the model Since the human brain is infinitely more intelligent than a rudimentary application like this one, it would be interesting to store user input. To take it a step further, feeding that input back into the model would help improve results, hopefully drastically.
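A speculative sketch of the logging half of this idea, assuming the app keeps a local input_log.csv (the file name, the log_phrase() helper, and the update_counts() step are all hypothetical):

    library(shiny)

    # Append each submitted phrase to a local CSV log.
    log_phrase <- function(phrase, path = "input_log.csv") {
      entry <- data.frame(time = as.character(Sys.time()), text = phrase)
      write.table(entry, path, append = TRUE, sep = ",",
                  col.names = !file.exists(path), row.names = FALSE)
    }

    # Inside the app's server function: log whenever the input changes.
    server <- function(input, output) {
      observeEvent(input$phrase, {
        if (nzchar(trimws(input$phrase))) log_phrase(input$phrase)
      })
    }

    # Later, offline, the logged phrases could be re-tokenized and folded
    # back into the n-gram counts with a hypothetical update_counts():
    # trigram_counts <- update_counts(trigram_counts,
    #                                 read.csv("input_log.csv")$text)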