Datascience_Capstone

Amit Shinde
December 2016


Overview

  • The purpose of this Capstone project was to create a Shiny app that takes a word or phrase as input and predicts the next word.
  • The predictive text model is built on the HC Corpora corpus, which can be downloaded from this website.
  • Next-word prediction is a common Natural Language Processing (NLP) task: given an input word or phrase, the model suggests the word most likely to follow. For example, phone keyboards and web search boxes do this as you type.

Prediction Model

  • Reads data from the blogs, Twitter and news datasets.
  • Cleans the text by converting it to lowercase and removing special characters, extra whitespace, numbers, punctuation and stopwords, among other steps.
  • Generates unigram, bigram, trigram and quadgram files from the frequencies of the words and word sequences.
  • Creates smaller datasets to improve performance.
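
The cleaning and n-gram steps above can be sketched as follows. This is a minimal illustration in Python, not the code behind the app: the stopword list, the sample sentences, and the function names `clean`, `ngram_counts` and `prune` are all assumptions made for the example, and dropping rare n-grams is only one possible way to obtain smaller datasets.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def clean(text):
    """Lowercase the text, keep only letters, and drop stopwords."""
    text = text.lower()
    # Replace numbers, punctuation and special characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() also collapses any run of whitespace.
    return [w for w in text.split() if w not in STOPWORDS]

def ngram_counts(lines, n):
    """Count every run of n consecutive cleaned words in each line."""
    counts = Counter()
    for line in lines:
        words = clean(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def prune(counts, min_count):
    """Keep only n-grams seen at least min_count times (assumed size reduction)."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

# Hypothetical sample standing in for the blogs, Twitter and news text.
sample = [
    "The quick brown fox jumps over the lazy dog!",
    "A quick brown fox is quick, 2 times.",
]
unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
small_bigrams = prune(bigrams, 2)
```

Here each table maps a tuple of words to its frequency, so the same structure serves for unigrams through quadgrams by changing `n`.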

How Does This App Work?

  • It is as simple as typing a word or phrase and then clicking the “Submit” button.
  • The input text is cleaned.
  • The algorithm looks for the next word in the quadgram, trigram, bigram and unigram data frames, in that order.
  • Execution time is calculated in milliseconds.
  • Candidate next words are listed in order of their prediction score. At most 5 words or phrases are shown.
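
The lookup order described above can be sketched as a simple back-off search. This is a hedged Python illustration rather than the app's own code: the frequency tables are made up, the name `predict` is hypothetical, the raw count is assumed to serve as the prediction score, and the search is assumed to stop at the first n-gram level that yields any match.

```python
import time
from collections import Counter

# Hypothetical frequency tables keyed by word tuples; real tables would be
# built from the corpus.
TABLES = {
    4: Counter({("happy", "mothers", "day", "everyone"): 5,
                ("happy", "mothers", "day", "mom"): 3}),
    3: Counter({("mothers", "day", "gift"): 4,
                ("mothers", "day", "everyone"): 2}),
    2: Counter({("day", "long"): 7, ("day", "today"): 6}),
    1: Counter({("love",): 50, ("day",): 40, ("good",): 30}),
}

def predict(words, tables=TABLES, limit=5):
    """Try quadgrams, then trigrams, then bigrams; fall back to top unigrams."""
    for n in (4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough input words for this n-gram order
        # Score each candidate by the count of the n-gram that ends with it.
        matches = Counter({gram[-1]: count
                           for gram, count in tables[n].items()
                           if gram[:-1] == context})
        if matches:
            return [word for word, _ in matches.most_common(limit)]
    # No context matched: suggest the most frequent single words.
    return [gram[0] for gram, _ in tables[1].most_common(limit)]

# Time one prediction in milliseconds.
start = time.perf_counter()
suggestions = predict(["happy", "mothers", "day"])
elapsed_ms = (time.perf_counter() - start) * 1000
print(suggestions, f"{elapsed_ms:.3f} ms")
```

With these made-up tables, a three-word input is answered from the quadgram table, while an input such as `["rainy", "day"]` finds no trigram match and falls through to the bigram table.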

Conclusion

    The course was long but totally worth it; credit goes to Roger D. Peng, Brian Caffo and Jeff Leek from JHU.