Pao Ying Heng
January 27, 2020
The objective of the Coursera Data Science Specialization Capstone project was to build a predictive text model and then incorporate it into a Shiny app. The app should take a phrase entered in a text box as input and output one or more predictions for the next word.
This slide deck will briefly describe the algorithm used for the text model and how to operate the Shiny app.
The dataset used for this project may be obtained from here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. For this project, only the three documents found in the en_US folder were utilised.
The prediction algorithm of this app is based on an N-gram model. After sampling a portion of each text, a corpus was created, and the text was tokenised and cleaned (including a profanity filter). Unigrams, bigrams and trigrams were then computed.
The frequency of each n-gram was aggregated and tabulated into a data table. The resulting data table was then used to predict the next word based on the input entered by a user in the Shiny app.
To account for phrases that the prediction model was not exposed to during training, any resulting NA predictions are replaced with the most common unigrams ('the', 'to', 'and').
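The frequency-table idea can be illustrated with a minimal Python sketch. It is not the code behind the app: the function names (tokenize, build_ngram_table, predict_next), the toy corpus and the choice to look up the trigram table before the bigram table are all assumptions made for illustration, and the cleaning step is reduced to lowercasing and whitespace splitting.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Simplified cleaning (assumption): lowercase and split on whitespace.
    # A fuller pipeline would also strip punctuation and filter profanity.
    return text.lower().split()

def build_ngram_table(sentences, n):
    # Map each (n-1)-word prefix to a Counter of the words that follow it.
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            table[prefix][tokens[i + n - 1]] += 1
    return table

def predict_next(text, trigrams, bigrams, k=3):
    # Try the longest available context first (assumed lookup order),
    # then fall back to the shorter one.
    tokens = tokenize(text)
    for table, size in ((trigrams, 2), (bigrams, 1)):
        if len(tokens) >= size:
            prefix = tuple(tokens[-size:])
            if prefix in table:
                return [word for word, _ in table[prefix].most_common(k)]
    return []  # no n-gram matched the input

# Hypothetical toy corpus standing in for the sampled en_US texts.
corpus = [
    "i love new york",
    "i love new shoes",
    "i love new york city",
    "new ideas are good",
]
trigrams = build_ngram_table(corpus, 3)
bigrams = build_ngram_table(corpus, 2)
```

With this toy corpus, predict_next("I love new", trigrams, bigrams) ranks "york" ahead of "shoes" because the trigram "love new york" occurs twice, while an input ending in an unseen word pair such as "brand new" is answered from the bigram table instead.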
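The replacement of missing predictions can be sketched in Python as follows. This is only an illustration under stated assumptions: None stands in for R's NA, fill_predictions is a hypothetical helper name, and skipping a default word that is already among the predictions is a design choice of this sketch rather than something the app is known to do.

```python
# Most common unigrams, as named in the text.
DEFAULT_WORDS = ["the", "to", "and"]

def fill_predictions(candidates, k=3, defaults=DEFAULT_WORDS):
    # Keep real predictions in order, dropping missing values (None)
    # and duplicates.
    result = []
    for word in candidates:
        if word is not None and word not in result:
            result.append(word)
    # Pad with the default unigrams until k predictions are available.
    for word in defaults:
        if len(result) >= k:
            break
        if word not in result:
            result.append(word)
    return result[:k]
```

For example, fill_predictions(["york", None, None]) keeps the one real prediction and pads it with the two most common unigrams, so the user always sees three suggestions.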
Click here to use the Next Word Prediction Shiny app
How to use the app:
Start typing in the text input box on the left side of the page (the box under 'Type your sentence here:'). After a brief wait, the predictions for the next word appear under the 'Output' tab. You will be given three single-word predictions for the next word. To reset, simply erase the typed text and type your new text in the input box. Have fun!