Darwin Lajoie
01/09/2018
The goal of this exercise is to create a product to highlight the prediction algorithm that you have built and to provide an interface that can be accessed by others. For this project you must submit:
A Shiny app that takes as input a phrase (multiple words) in a text box and outputs a prediction of the next word.
A slide deck of no more than 5 slides, created with RStudio Presenter (https://support.rstudio.com/hc/en-us/articles/200486468-Authoring-R-Presentations), pitching your algorithm and app as if you were presenting to your boss or an investor.
Step 1: Download the dataset and unzip the folder, first checking whether the directory already exists. Then open file connections to the Twitter, blog, and news data sets.
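Step 1 can be sketched roughly as follows. This is an illustration in Python, not the R code behind the app; the function names `fetch_and_unzip` and `open_corpus`, and the parameters `url`, `zip_path`, and `data_dir`, are hypothetical and chosen only for this example.

```python
import os
import urllib.request
import zipfile


def fetch_and_unzip(url, zip_path, data_dir):
    """Download and extract the archive only if data_dir is not already there.

    Returns True if extraction happened, False if the directory already existed.
    """
    if os.path.isdir(data_dir):
        return False  # directory already exists, nothing to do
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)  # download the archive once
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
    return True


def open_corpus(data_dir, filename):
    """Open one corpus file as a text connection (UTF-8, bad bytes replaced)."""
    path = os.path.join(data_dir, filename)
    return open(path, encoding="utf-8", errors="replace")
```

Checking for the directory first means the download and extraction run only once, so the later steps can be rerun cheaply.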
Step 2: Explore the data sets.
Count the words in each file, get the file sizes, and summarize the data sets.
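A per-file summary of this kind might look like the sketch below. It is written in Python for illustration and is not the app's R code; `summarize_file` is a hypothetical helper, and counting words by splitting on whitespace is an assumption.

```python
import os


def summarize_file(path):
    """Return the size in MB, line count, and whitespace-separated word count."""
    lines = 0
    words = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:  # stream line by line to keep memory use low
            lines += 1
            words += len(line.split())
    return {
        "size_mb": os.path.getsize(path) / (1024 * 1024),
        "lines": lines,
        "words": words,
    }
```

Calling `summarize_file` once per data set gives one row each for a small summary table.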
Step 3: Clean and sample (preprocessing).
The sample text was tokenized into n-grams to construct the predictive models. Tokenization is the process of breaking a stream of text up into words or phrases; an n-gram is a contiguous sequence of n items from a given sequence of text.
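To make sampling, tokenization, and n-grams concrete, here is a minimal sketch in Python (an illustration, not the app's R code). The helper names, the sampling fraction, and the cleaning rule (lowercase, keep only letters and apostrophes) are assumptions.

```python
import random
import re


def sample_lines(lines, fraction, seed=0):
    """Draw a reproducible random sample containing `fraction` of the lines."""
    rng = random.Random(seed)
    return rng.sample(lines, int(len(lines) * fraction))


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For example, `tokenize("The quick, brown fox!")` gives `['the', 'quick', 'brown', 'fox']`, and `ngrams` of that list with `n=2` gives the three bigrams `('the', 'quick')`, `('quick', 'brown')`, and `('brown', 'fox')`.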
Step 4: Compute the word frequencies, build the n-gram frequency tables, and find the most common n-grams in the data sample.
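Counting n-gram frequencies can be sketched as below, in Python for illustration (not the app's R code). `ngram_frequencies` is a hypothetical helper built on the standard library's `collections.Counter`.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count how often each n-gram (as a tuple of words) occurs in tokens."""
    # zip over n shifted copies of the token list yields every contiguous n-gram
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


tokens = "the cat sat on the cat".split()
bigram_counts = ngram_frequencies(tokens, 2)
top_bigrams = bigram_counts.most_common(1)  # most frequent bigram with its count
```

Here `('the', 'cat')` occurs twice and every other bigram once, so it is the single most common bigram.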
The n-gram files, or data.frames (unigram, bigram, trigram and quadgram), are tables of n-gram frequencies. The algorithm uses them to predict the next word based on the text entered by the user.
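One common way to turn such frequency tables into a prediction is a simple backoff: look up the last three words in the quadgram table, and fall back to shorter contexts if nothing matches. The sketch below assumes that strategy; the source does not say which lookup rule the app uses, and the Python code and the names `build_model` and `predict_next` are illustrative only.

```python
from collections import Counter, defaultdict


def build_model(tokens, max_n=4):
    """Map each context (tuple of up to max_n - 1 words) to next-word counts."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            model[tuple(context)][word] += 1
    return model


def predict_next(model, phrase, max_n=4):
    """Try the longest available context first, then back off to shorter ones."""
    words = phrase.lower().split()
    for size in range(max_n - 1, -1, -1):
        if size > len(words):
            continue  # the phrase is too short for this context length
        context = tuple(words[len(words) - size:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # only reached when the model is empty


tokens = "i love new york i love new york i love you".split()
model = build_model(tokens)
suggestion = predict_next(model, "I love new")
```

With this toy corpus, the context `('i', 'love', 'new')` is always followed by "york", so that is the suggestion. An unseen phrase falls all the way back to the empty context, which returns the most frequent single word.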
Link to a Shiny app with a text input box that is running on shinyapps.io: