3/22/2021
This application demonstrates the word prediction algorithm developed for the Data Science Capstone. The application can be accessed at https://ydavydenko.shinyapps.io/DataScienceCapstone/
To use the app:
App development process:
1) Cleaning the corpus of textual data and creating n-grams for n = 1 to 6
2) Creating frequency tables for the n-grams
3) Developing a prediction algorithm that relies on the conditional probability rules
4) Developing and deploying the app
The text corpus was created by merging the blogs, news, and Twitter datasets, removing all non-ASCII symbols, replacing contractions, and converting a random sample of 50 percent of the data into a corpus.
Next, the corpus was tokenized and further cleaned. The cleaning procedures included removing punctuation, symbols, numbers, URLs, separators, hyphens, and space padding, and converting the tokens to lowercase. Finally, a collection of stopwords that includes articles and profanity was removed from the corpus. The quanteda package was chosen for its convenience and its high-performance processing of large amounts of data.
The next step was to convert the tokenized data into n-grams for n = 1 to 6 and to compute frequencies for each n-gram dataset. N-grams with a frequency of one or two were removed.
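The cleaning steps can be illustrated with a minimal sketch. The app itself was built in R with quanteda; the Python below is only a language-neutral approximation of the described procedure, not the author's code. The function name clean_tokens and the tiny STOPWORDS set are hypothetical stand-ins, since the actual stopword and profanity lists are not given here, and contraction replacement is assumed to have happened earlier.

```python
import re

# Hypothetical stand-in: the real stopword and profanity lists are not given in the text.
STOPWORDS = {"a", "an", "the"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # crude URL matcher, for illustration only
NON_LETTER_RE = re.compile(r"[^a-z\s]")          # punctuation, symbols, and digits

def clean_tokens(text):
    """Return lowercase word tokens with URLs, symbols, numbers, and stopwords removed."""
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII symbols
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove URLs before stripping punctuation
    text = text.replace("-", " ")        # split hyphenated words
    text = NON_LETTER_RE.sub(" ", text)  # remove punctuation, symbols, and numbers
    # split() with no argument also collapses repeated whitespace (space padding)
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(clean_tokens("The well-known site https://example.com has 3 GREAT tips!"))
```

Keeping the whole procedure in one function makes it easy to apply exactly the same steps to any other text that must match the n-gram tables.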
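As a rough illustration of this step, the sketch below builds frequency tables for n = 1 to 6 from a single list of tokens and discards n-grams seen only once or twice. It is written in Python rather than the R/quanteda code used for the app, and the name ngram_counts and its parameters are assumptions made for the example. It also treats the input as one continuous token sequence, which simplifies how a real corpus of separate documents would be handled.

```python
from collections import Counter

def ngram_counts(tokens, max_n=6, min_count=3):
    """Map each order n to a dict of {n-gram string: frequency}.

    N-grams with a frequency below min_count (that is, one or two) are dropped.
    """
    tables = {}
    for n in range(1, max_n + 1):
        counts = Counter(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        tables[n] = {gram: c for gram, c in counts.items() if c >= min_count}
    return tables

tokens = "eat lunch now eat lunch now eat lunch now eat dinner".split()
tables = ngram_counts(tokens)
print(tables[2])  # bigrams that occur at least three times
```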
The prediction algorithm is based on probabilistic language modeling, which relies on conditional probability:
P(Wn | W1…Wn-1) = P(W1…Wn) / P(W1…Wn-1)
For example: P(lunch | eat) = P(eat lunch) / P(eat)
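In practice, such probabilities are commonly estimated from the frequency tables by dividing the count of the full n-gram by the count of its prefix. The short Python sketch below works through the lunch example with made-up counts. The numbers and the function name conditional_probability are purely illustrative and do not come from the project's data.

```python
def conditional_probability(ngram_count, prefix_count):
    """Estimate P(Wn | W1...Wn-1) as count(W1...Wn) / count(W1...Wn-1)."""
    if prefix_count == 0:
        return 0.0  # assumption: an unseen prefix gives probability zero
    return ngram_count / prefix_count

# Made-up counts, for illustration only
counts = {"eat": 200, "eat lunch": 30, "eat dinner": 50}

p_lunch = conditional_probability(counts["eat lunch"], counts["eat"])
p_dinner = conditional_probability(counts["eat dinner"], counts["eat"])
print(p_lunch, p_dinner)
```

Because every candidate that follows the same prefix shares the same denominator, ranking candidates by raw frequency gives the same order as ranking them by this estimated probability.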
Technically, the algorithm finds the n-gram W1…Wn with the highest probability given the (n-1)-gram W1…Wn-1 and returns Wn.
The algorithm solves the problem in the following steps:
(1) It transforms the user input (W1…Wn-1) into an n-gram and determines its order (n-1).
(2) It finds all matching n-grams of a higher order and chooses the one with the highest probability (frequency).
(3) If no n-gram of a higher order is found, it drops the first word from the user-provided n-gram and repeats the search using shorter n-grams.
(4) It returns Wn from the first highest-probability n-gram found.
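The steps above can be sketched as a simple back-off search. This is a minimal Python illustration under stated assumptions, not the app's actual R implementation: the name predict_next, the layout of tables (order n mapped to a dict of n-gram string and frequency), and the choice to return None when nothing matches are all hypothetical.

```python
def predict_next(words, tables, max_order=6):
    """Return the most frequent next word for the given context, backing off
    to shorter contexts when no higher-order n-gram matches."""
    # Step 1: keep at most max_order - 1 words of cleaned user input as context
    context = list(words)[-(max_order - 1):]
    while context:
        prefix = " ".join(context) + " "  # trailing space enforces a word boundary
        table = tables.get(len(context) + 1, {})
        # Step 2: collect n-grams one order higher that start with the context
        matches = {g: c for g, c in table.items() if g.startswith(prefix)}
        if matches:
            best = max(matches, key=matches.get)
            return best.split()[-1]  # Step 4: return the final word, Wn
        context = context[1:]        # Step 3: drop the first word and retry
    return None  # assumption: no prediction when nothing matches at any order

# Made-up frequency tables, for illustration only
tables = {
    2: {"eat lunch": 30, "eat dinner": 50, "for lunch": 12},
    3: {"to eat lunch": 9, "to eat dinner": 4},
}
print(predict_next(["want", "to", "eat"], tables))
print(predict_next(["they", "eat"], tables))
```

In the first call the three-word context has no matching 4-gram, so the search backs off to "to eat" and picks the most frequent trigram. In the second call it backs off all the way to the bigram table.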
Due to memory limitations, the Shiny app relies on 50 percent of the original text data.
The n-gram frequency data files are uploaded to the Shiny server as a part of the application.
For the purposes of this project, the app uses a simple, clean interface that reruns the algorithm and returns the results as the user updates the input data. This is accomplished by using the observeEvent() and updateTextInput() functions.
To make the user input consistent with the n-gram datasets, it is cleaned using the same series of procedures used for the original corpus of text data.