Data Science Specialization
Johns Hopkins University and Coursera
Di Yang
April 21, 2016
Goal
The Capstone is the final project in the Data Science Specialization offered by Johns Hopkins University through Coursera. Run in partnership with SwiftKey, its main goal is to build a model that predicts the next word a user will type, drawing on the skills taught across all the courses in the specialization.
Deliverables
The main deliverable of the Capstone is the application, built using Shiny. The secondary deliverable is this presentation, built using R Presentations.
There is no need to click a button or a link to get results. All the user needs to do is type and hit the space bar! The user can then pick one of the three suggested words or type any word of their choice.
The application combines decision trees and trigram Markov chains to predict the next word based on probability. On a scale of 1 to 10, its accuracy rates about a 7: it does an adequate job of predicting the statistically likely next word, but it is weaker at predictions that depend on semantics or popular terms.
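To make the trigram idea concrete, here is a minimal Python sketch of a count-based trigram predictor. It is not the application's actual code: the decision-tree component is left out, the function names `build_trigram_model` and `predict_next` are hypothetical, and the toy corpus is invented purely for illustration.

```python
from collections import Counter, defaultdict

def build_trigram_model(sentences):
    """Count which word follows each pair of consecutive words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            model[(w1, w2)][w3] += 1
    return model

def predict_next(model, text, k=3):
    """Return up to k of the most frequent words seen after the last two words."""
    words = text.lower().split()
    if len(words) < 2:
        return []  # a trigram model needs two words of context
    followers = model.get((words[-2], words[-1]), Counter())
    return [word for word, _ in followers.most_common(k)]

# Toy corpus, invented for illustration only
corpus = [
    "winnie the pooh and piglet",
    "winnie the pooh and tigger",
    "winnie the pooh loves honey",
]
model = build_trigram_model(corpus)
print(predict_next(model, "Winnie the"))  # ['pooh']
print(predict_next(model, "the pooh"))    # ['and', 'loves']
```

Ranking by raw counts is equivalent to ranking by the estimated conditional probability of the next word given the previous two, since every candidate for a given context shares the same denominator.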
For example, if the user typed in “Winnie”, the top three results for the next predicted words are:
None of these results is what a typical user would expect (THE). If the user then types in THE, the top three results are automatically updated:
Interestingly, if the user forgoes the results and types in POOH, the top three results contain the next human-expected word:
At this point, it might just be luck, because if the user types in AND, the top three results are:
More n-grams
Since the first model used a trigram Markov chain, a natural next step would be to try 4-grams or even higher-order n-grams.
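One way this could look is sketched below in Python: counts are kept for every n-gram order up to a chosen maximum, and the predictor falls back to a shorter context whenever the longer one was never seen. This simple backoff scheme is an assumption made for illustration, not something the application implements, and the names `build_ngram_counts` and `predict_with_backoff` are hypothetical.

```python
from collections import Counter, defaultdict

def build_ngram_counts(sentences, max_n=4):
    """Map each context (0 to max_n - 1 preceding words) to next-word counts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                counts[context][words[i + n - 1]] += 1
    return counts

def predict_with_backoff(counts, text, max_n=4, k=3):
    """Try the longest available context first, then back off to shorter ones."""
    words = text.lower().split()
    longest = min(max_n - 1, len(words))
    for context_len in range(longest, -1, -1):
        context = tuple(words[len(words) - context_len:])
        if context in counts:
            return [w for w, _ in counts[context].most_common(k)]
    return []

# Toy corpus, invented for illustration only
corpus = [
    "winnie the pooh and piglet",
    "winnie the pooh and tigger too",
]
counts = build_ngram_counts(corpus, max_n=4)
print(predict_with_backoff(counts, "winnie the pooh"))  # uses the 3-word context
print(predict_with_backoff(counts, "hello pooh"))       # backs off to the 1-word context
```

Longer contexts give more specific predictions but are seen far less often in the training text, which is why some form of fallback to shorter contexts is usually needed.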
Store user input and build it into the model
Since the human brain is far more intelligent than a rudimentary application like this one, it would be interesting to store user input. Taking it a step further, feeding that input back into the model should help improve the results, hopefully drastically.
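A minimal Python sketch of that idea follows, assuming a plain count-based trigram model in which text typed by the user is added to the counts with extra weight. The class name `AdaptiveTrigramModel`, its methods, and the `user_weight` parameter are all hypothetical, and the right weighting would have to be found by experiment.

```python
from collections import Counter, defaultdict

class AdaptiveTrigramModel:
    """Trigram counts that can be updated with text typed by the user."""

    def __init__(self, user_weight=2):
        self.counts = defaultdict(Counter)
        # Hypothetical knob: how much more a user's phrase counts than corpus text
        self.user_weight = user_weight

    def learn(self, text, weight=1):
        """Add every trigram in the text to the counts."""
        words = text.lower().split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            self.counts[(w1, w2)][w3] += weight

    def learn_from_user(self, text):
        """Store what the user actually typed, weighted more heavily."""
        self.learn(text, weight=self.user_weight)

    def predict(self, text, k=3):
        """Return up to k of the most frequent words after the last two words."""
        words = text.lower().split()
        if len(words) < 2:
            return []
        followers = self.counts.get((words[-2], words[-1]), Counter())
        return [w for w, _ in followers.most_common(k)]

model = AdaptiveTrigramModel()
model.learn("winnie the bear")            # stands in for the training corpus
print(model.predict("winnie the"))        # ['bear']
model.learn_from_user("winnie the pooh")  # the user ignores the suggestion
print(model.predict("winnie the"))        # ['pooh', 'bear']
```

In a deployed application the stored input would also need to be persisted between sessions, which this sketch does not attempt.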