Data Science Capstone Project - Shiny App: Next Word Input Prediction

Igor Hut
March 26, 2017

This is a pitch deck for a Shiny application that was developed as the capstone project for the Data Science Specialization. The specialization was created and taught by professors from Johns Hopkins University with the help of industry partners SwiftKey and Yelp, and delivered through the Coursera MOOC platform.

Overview

  • The idea behind the capstone project is to develop a fully functional Shiny application that enables basic text prediction for the English language
  • The data used for algorithm development is a subset of a corpus called HC Corpora
  • The full version of the data set used in developing this application can be found HERE. It contains data in four languages: English, German, Russian and Finnish. There are three corpora per language, containing data from Twitter, blogs and news feeds. Only the English-language corpora were used
  • Next word prediction is based on n-gram frequencies

The App

  • You can check how the app works here

  • The application GUI and usage are simple: enter a word, a phrase, or a sentence into the input text box on the right, and you'll get a prediction for the word that should follow your train of thought

  • You'll obtain better results by inputting more than one word

GUI and How to Use It

App GUI

Technical Details

  • The algorithm behind the app is based on n-gram modeling
  • The given corpora were first sampled and then cleaned (punctuation removal, lowercasing, whitespace stripping, removal of numbers, removal of URLs, profanity filtering…)
  • The final corpus was tokenized into n-grams
  • 2-, 3- and 4-gram term-frequency matrices were aggregated into frequency dictionaries, which are used to predict the next word from the user's text input and the corresponding n-gram frequencies
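
The steps above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the app's actual R code. The function names, the tiny `PROFANITY` set, and the strategy of trying the longest context first and falling back to shorter ones are all assumptions made for this example; the deck itself only states that prediction relies on 2-, 3- and 4-gram frequencies.

```python
import re
from collections import Counter, defaultdict

# Hypothetical placeholder; a real filter would use a curated word list.
PROFANITY = {"darn"}


def clean(text):
    """Lowercase, drop URLs, numbers and punctuation, filter profanity."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs first
    text = re.sub(r"[^a-z'\s]", " ", text)  # remove digits and punctuation (apostrophes kept)
    # split() also collapses any repeated white space
    return [w for w in text.split() if w not in PROFANITY]


def build_tables(lines, orders=(2, 3, 4)):
    """For each n, map an (n-1)-word context to counts of the words that followed it."""
    tables = {n: defaultdict(Counter) for n in orders}
    for line in lines:
        tokens = clean(line)
        for n in orders:
            for i in range(len(tokens) - n + 1):
                *context, nxt = tokens[i:i + n]
                tables[n][tuple(context)][nxt] += 1
    return tables


def predict(tables, text):
    """Return the most frequent continuation, trying the longest context first.

    Falling back from 4-grams to 3-grams to 2-grams is an assumption of this
    sketch, not something the deck specifies.
    """
    tokens = clean(text)
    for n in sorted(tables, reverse=True):
        context = tuple(tokens[-(n - 1):])
        if len(context) == n - 1 and context in tables[n]:
            return tables[n][context].most_common(1)[0][0]
    return None  # no n-gram in the tables matches the input


corpus = [
    "I love New York City",
    "I love New York pizza",
    "We love New York City",
]
tables = build_tables(corpus)
print(predict(tables, "They love new york"))  # uses the 4-gram table
print(predict(tables, "visit old York"))      # falls back to the 2-gram table
```

With a longer input, the sketch can match a 3-word context in the 4-gram table; with a single word it can only consult the 2-gram table, which is consistent with the earlier remark that entering more than one word tends to give better results.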