Data Science Capstone Project

Venkat Sri (vesr)
Feb 2018


Predicting Next Word(s) Project

Peer Graded Assignment

Summary

Scope and objective of this project for the Coursera Data Science Specialization.

The objective of the application is to implement a model that prompts hints (the next set of words) related to the phrase or text entered by the user. The input for this program consists of three datasets (Twitter, news and blogs) from HC Corpora. The data has been cleaned, and a subset is used as sample data in R data frames. A back-off algorithm is used for prediction, complemented by NLP techniques to create the n-grams. The UI layer has been developed with the Shiny package, with additional libraries (such as DT, JavaScript and HTML rendering) to enhance the user experience.
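To make the n-gram step concrete, here is a minimal sketch of counting n-grams from a text sample. It is written in Python purely as an illustration; the app itself is built in R, and the function name `ngram_counts` and the two sample sentences are assumptions made for this example, not part of the project.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count every run of n consecutive words across the given sentences."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenization (lowercase, split on whitespace); the real
        # cleaning pipeline is not shown in the report.
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Hypothetical sample text, not taken from the HC Corpora data.
sample = ["the cat sat on the mat", "the cat ate the fish"]
bigrams = ngram_counts(sample, 2)
print(bigrams.most_common(1))  # [(('the', 'cat'), 2)]
```

Running the same function with n = 1, 3 and 4 would give the other frequency tables that the prediction step relies on.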

How the app works

Just type a word, phrase or sentence. The app shows what the user entered, followed by its cleansed form. As the main result, up to the top five (most probable) n-gram predictions are displayed in a list control. The user can review or change the input, and the app will present new hints in response. Another tab offers more extensive documentation.

Main steps to achieve next word(s) predictions:

  1. Loading 4 data frames containing previously generated n-gram combinations of 4 words, 3 words, 2 words, and 1 word.
  2. Reading user input (a word or sentence)
  3. Cleansing of user input (lowercasing, profanity removal, tokenization of the input words: the last four)
  4. Call to the prediction model function, basically the Stupid Backoff algorithm (a simplified approach to Katz Backoff):
    • search the fourgram data frame; if a match is found, show the top 5 most probable matches. Otherwise,
    • search the trigram data frame in the same way. Otherwise,
    • search the bigram data frame in the same way.
    • Finally, if nothing matches, display the most frequent words in the unigram data frame.
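Steps 3 and 4 can be sketched as follows. This is an illustrative Python version under stated assumptions, not the app's R code: the lookup tables, the placeholder profanity list, and the function names `clean_input`, `top` and `predict` are all hypothetical. It also assumes each table is keyed by the preceding words, so a fourgram lookup uses the last three words as context. Because it returns results from the first level that matches, it leaves out the fixed discount (0.4 in the original Stupid Backoff formulation) that is applied when scores from different levels are compared.

```python
import re

# Hypothetical stand-ins for the four data frames: each table maps a
# context (tuple of preceding words) to {next_word: count}.
FOURGRAMS = {("thanks", "for", "the"): {"follow": 9, "support": 4}}
TRIGRAMS = {("for", "the"): {"first": 12, "follow": 9}}
BIGRAMS = {("the",): {"same": 20, "first": 15}}
UNIGRAMS = {"the": 100, "to": 80, "and": 70}
PROFANITY = {"darn"}  # placeholder list, assumed for illustration


def clean_input(text, keep=4):
    """Lowercase, tokenize, drop profanities, keep the last `keep` words."""
    words = re.findall(r"[a-z']+", text.lower())
    words = [w for w in words if w not in PROFANITY]
    return words[-keep:]


def top(counts, k):
    """Return the k words with the highest counts, most frequent first."""
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return [word for word, _ in ranked[:k]]


def predict(text, k=5):
    """Try the longest context first, then back off to shorter ones."""
    words = clean_input(text)
    for table, n in ((FOURGRAMS, 3), (TRIGRAMS, 2), (BIGRAMS, 1)):
        context = tuple(words[-n:])
        if len(context) == n and context in table:
            return top(table[context], k)
    # No n-gram matched: fall back to the most frequent single words.
    return top(UNIGRAMS, k)


print(predict("Thanks for the"))  # ['follow', 'support']
print(predict("wait for the"))    # ['first', 'follow']
print(predict("zzz"))             # ['the', 'to', 'and']
```

The second call shows the back-off in action: no fourgram context matches "wait for the", so the lookup drops to the trigram table with the context "for the".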

N-grams excerpts

See 5 lines of the “bigrams” and “trigrams” data frames, which are loaded by the Shiny app.

Error in gzfile(file, "rb") : cannot open the connection