Data Science Capstone: NLP

3/22/2021

Overview of the App

This application demonstrates the word prediction algorithm developed for the Data Science Capstone. The application can be accessed at https://ydavydenko.shinyapps.io/DataScienceCapstone/

To use the app:

  • Open the URL above
  • Type a sequence of words in the left field under “Text Input”
  • The predicted words will appear in the field “Predicted Word” automatically

App development process:

  1) Cleaning the corpus of textual data and creating n-grams for n = 1 to 6
  2) Creating frequency tables for the n-grams
  3) Developing a prediction algorithm that relies on the conditional probability rules
  4) Developing and deploying the app

Steps 1-2: Cleaning the Corpus

The text corpus was created by merging the blogs, news, and Twitter datasets, removing all non-ASCII symbols, replacing contractions, and converting a random 50 percent sample of the data into a corpus.
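The merge, clean, and sample sequence described above can be sketched in a few lines. This is a language-neutral illustration in Python, not the R code used for the app. The contraction map, the `build_sample` helper, the seed, and the toy input lines are all assumptions made for the example.

```python
import random

# Hypothetical, deliberately tiny contraction map; the list actually used is not given.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}

def strip_non_ascii(line):
    """Drop every character outside the 7-bit ASCII range."""
    return line.encode("ascii", errors="ignore").decode("ascii")

def expand_contractions(line):
    """Replace known contractions word by word (case-insensitive lookup)."""
    return " ".join(CONTRACTIONS.get(word.lower(), word) for word in line.split())

def build_sample(sources, fraction=0.5, seed=2021):
    """Merge the sources, clean each line, and keep a random fraction of the lines."""
    merged = [line for source in sources for line in source]
    cleaned = [expand_contractions(strip_non_ascii(line)) for line in merged]
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sample reproducible
    return rng.sample(cleaned, int(len(cleaned) * fraction))

# Toy stand-ins for the blogs, news, and Twitter datasets.
blogs = ["I can't wait for lunch", "Café culture is great"]
news = ["It's raining today", "Markets rose on Monday"]
twitter = ["won't be late", "see you soon"]

sample = build_sample([blogs, news, twitter])
print(len(sample))  # 3 of the 6 merged lines
```

Sampling without replacement at a fixed seed means the same half of the data is selected on every run, which keeps the downstream frequency tables stable.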

Next, the corpus was tokenized and cleaned further. The cleaning steps removed punctuation, symbols, numbers, URLs, separators, hyphens, and space padding, and converted the tokens to lowercase. Finally, a collection of stopwords that includes articles and profanity was removed from the corpus. The quanteda package was chosen for its convenience and its high-performance processing of large amounts of data.
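As a rough illustration of the tokenize-and-clean stage, the sketch below applies the same kinds of rules with plain regular expressions in Python. It does not reproduce quanteda's behavior exactly. The `clean_tokens` helper, the two regular expressions, and the stopword list (articles plus mild placeholder words standing in for profanity) are assumptions.

```python
import re

# Hypothetical stopword list: articles plus placeholder entries for profanity.
STOPWORDS = {"a", "an", "the", "darn", "heck"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # punctuation, symbols, digits, hyphens

def clean_tokens(text):
    """Lowercase the text, strip URLs and non-letters, split, and drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove URLs before punctuation is stripped
    text = NON_LETTER_RE.sub(" ", text)  # everything except letters and whitespace
    return [token for token in text.split() if token not in STOPWORDS]

print(clean_tokens("The well-known site https://example.com had 3 GREAT posts!"))
# ['well', 'known', 'site', 'had', 'great', 'posts']
```

URLs are removed first on purpose: once punctuation is stripped, a URL would fall apart into ordinary-looking word fragments that are hard to filter out.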

The next step was to convert the tokenized data into n-grams for n = 1 to 6 and compute the frequencies for each n-gram dataset. N-grams with a frequency of one or two were removed.
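A minimal sketch of the n-gram and frequency-table step, again in Python rather than the app's R code. The function names and the `min_count=3` parameter (which encodes "remove frequencies of one or two") are assumptions, and the toy documents are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequency_tables(documents, max_n=6, min_count=3):
    """Build one frequency table per order n, dropping n-grams seen fewer than min_count times."""
    tables = {}
    for n in range(1, max_n + 1):
        counts = Counter()
        for tokens in documents:
            counts.update(ngrams(tokens, n))
        tables[n] = {gram: c for gram, c in counts.items() if c >= min_count}
    return tables

# Toy tokenized corpus: one document repeated three times plus one rare document.
docs = [["eat", "lunch", "now"]] * 3 + [["eat", "dinner"]]
tables = frequency_tables(docs, max_n=3)

print(tables[2])  # {('eat', 'lunch'): 3, ('lunch', 'now'): 3}
```

Pruning rare n-grams shrinks the tables considerably, at the cost of never predicting a continuation that was seen only once or twice.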

Step 3: Prediction Algorithm

The prediction algorithm is based on probabilistic language modeling, which relies on conditional probability:

P(Wn | W1…Wn-1) = P(W1…Wn) / P(W1…Wn-1)

For example: P(lunch | eat) = P(eat lunch) / P(eat)
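In practice the probabilities are estimated from the frequency tables, so the ratio becomes count(eat lunch) / count(eat). The Python sketch below works through that arithmetic. The counts are invented for illustration and `conditional_probability` is a hypothetical helper, not a function from the app.

```python
def conditional_probability(ngram_counts, context_counts, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context_count = context_counts.get(context, 0)
    if context_count == 0:
        return 0.0  # unseen context: no estimate is available
    return ngram_counts.get(context + (word,), 0) / context_count

# Made-up counts, for illustration only.
unigrams = {("eat",): 40}
bigrams = {("eat", "lunch"): 10, ("eat", "dinner"): 6}

p_lunch = conditional_probability(bigrams, unigrams, ("eat",), "lunch")
p_dinner = conditional_probability(bigrams, unigrams, ("eat",), "dinner")
print(p_lunch, p_dinner)  # 0.25 0.15
```

Because every candidate for the same context shares the same denominator, ranking candidates by raw frequency gives the same winner as ranking them by conditional probability.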

Technically, given the (n-1)-gram W1…Wn-1, the algorithm finds the n-gram W1…Wn with the highest probability and returns Wn.

The algorithm solves the problem in the following steps:

  1) It transforms the user input (W1…Wn-1) into an n-gram and determines its order (n-1).
  2) It finds all matching n-grams of the next higher order and chooses the one with the highest probability (frequency).
  3) If no higher-order n-gram is found, it drops the first word from the user-provided n-gram and repeats the search using the shorter n-gram.
  4) It returns Wn from the first highest-probability n-gram found.
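The backoff search in these steps can be sketched as follows. This is a simplified Python illustration, not the app's implementation. The `predict_next` name, the table layout (a dict keyed by n-gram order), the toy counts, and the choice to return None when nothing matches are assumptions.

```python
def predict_next(tables, words):
    """Return the most frequent continuation of `words`, backing off to shorter contexts."""
    context = tuple(words)
    while context:
        # Look one order above the context: an (n-1)-word context matches n-grams.
        table = tables.get(len(context) + 1, {})
        candidates = {gram: c for gram, c in table.items() if gram[:-1] == context}
        if candidates:
            best = max(candidates, key=candidates.get)  # highest frequency wins
            return best[-1]
        context = context[1:]  # drop the first word and retry with a shorter context
    return None  # assumption: no prediction when even a one-word context fails

# Toy frequency tables keyed by n-gram order.
tables = {
    2: {("eat", "lunch"): 10, ("eat", "dinner"): 6},
    3: {("to", "eat", "dinner"): 4},
}

print(predict_next(tables, ["want", "to", "eat"]))  # dinner (trigram match after one backoff)
print(predict_next(tables, ["we", "eat"]))          # lunch  (falls back to the bigram table)
```

The linear scan over each table keeps the sketch short. With tables of realistic size, indexing each table by its context would avoid rescanning every n-gram on each keystroke.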

Step 4: Shiny App

Because of memory limitations, the Shiny app relies on a 50 percent sample of the original text data.

The n-gram frequency data files are uploaded to the Shiny server as a part of the application.

For the purposes of this project, the app uses a simple, clean interface that reruns the algorithm and returns the results whenever the user updates the input. This is accomplished with the observeEvent() and updateTextInput() functions.

To make the user input consistent with the n-gram datasets, it is cleaned with the same procedures used for the original text corpus.

Thank you for your time and effort!