Data Science Capstone

Barbara Mastrandrea
April 23rd, 2020

Data Preparation (Get & Clean Data)

The goal of this exercise is to create a Shiny app that accepts a sentence (multiple words) in a text box and predicts the next word.

  • A subset of the data in the three files provided for the exercise (blogs, Twitter and news) was sampled and merged into a single sample.
  • The sample is cleaned by converting it to lowercase, stripping white space, and removing punctuation and numbers.
  • The corresponding n-grams are then created (pentagrams, quadgrams, trigrams and bigrams).
  • Finally, all the n-grams generated are saved as CSV files.
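
The preparation steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the app: the function names (clean, ngram_counts, write_csv), the three sample sentences and the "ngram,count" column layout are all assumptions made for the example, and the cleaning keeps only ASCII letters as a simplification.

```python
import csv
import io
import re
from collections import Counter

def clean(text):
    """Lowercase, remove punctuation and numbers, and collapse white space."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # assumption: keep ASCII letters only
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(lines, n):
    """Count every run of n consecutive words across the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = clean(line).split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

def write_csv(counts, handle):
    """Write n-grams and their counts, most frequent first (hypothetical layout)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ngram", "count"])
    for ngram, count in counts.most_common():
        writer.writerow([ngram, count])

# Hypothetical stand-in for the merged blogs/Twitter/news sample.
sample = ["I love New York!", "I love 2 cats.", "New York is big"]
bigrams = ngram_counts(sample, 2)

# An in-memory buffer stands in for the CSV file on disk.
buffer = io.StringIO()
write_csv(bigrams, buffer)
```

Calling ngram_counts with n set to 3, 4 or 5 would give the trigram, quadgram and pentagram tables in the same way.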

Algorithm

The algorithm used is quite simple:

  • The program searches each n-gram table for the word (or words) entered and saves the matches in a vector.
  • The vector thus generated is processed to extract the best match.
  • The original sentence and the predicted word appear in the text box.
  • Below the text box, a correlation plot is generated of all the n-grams related to the searched word (or phrase).
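
One plausible reading of the lookup step is sketched below in Python. It rests on two assumptions that the description above does not spell out: the tables are searched from the longest n-gram to the shortest, and the "best match" is the most frequent candidate. The toy tables and the name predict_next are hypothetical.

```python
from collections import Counter

# Hypothetical toy tables: n-gram size -> counts of each n-gram.
tables = {
    3: Counter({"i love new": 1, "love new york": 3, "love new shoes": 1}),
    2: Counter({"i love": 2, "love new": 2, "new york": 4, "new shoes": 1}),
}

def predict_next(sentence, tables):
    """Return the most frequent continuation of the last words, or None."""
    words = sentence.lower().split()
    # Assumption: try the longest n-grams first, then fall back to shorter ones.
    for n in sorted(tables, reverse=True):
        context = words[-(n - 1):]
        if len(context) < n - 1:
            continue  # the sentence is too short for this n-gram size
        prefix = " ".join(context) + " "
        # Collect every n-gram that starts with the entered words.
        matches = {g: c for g, c in tables[n].items() if g.startswith(prefix)}
        if matches:
            # Assumption: the "best match" is the candidate with the highest count.
            best = max(matches, key=matches.get)
            return best.split()[-1]
    return None  # no n-gram starts with the entered words

prediction = predict_next("I love new", tables)
```

When no table contains the entered words, the function returns None, which corresponds to the case where the app finds no match.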

Expected outcome

Note that the application created here is a small prototype, built on a data set that, although fairly substantial, is still limited. It is therefore possible that some inputs will return no match. Also, for very common words or phrases, the correlation plot can become somewhat crowded.

The application can be found here.