HengTai Ann
09-Sep-2017
This slide deck presents the motivation, methodology and user manual for a next-word prediction app built with Shiny. It was developed as part of the data science specialisation. The purpose of the app is to predict the next word based on one or more previous words.
The task was to analyse and use pre-existing corpora to build an app in Shiny. The three given corpora were taken from blogs, news and Twitter. (Source link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)
After cleansing the data of special characters such as $, ! and *, the corpora were used to create repositories of n-grams. Four orders of n-gram were built so that enough distinctive data was available: 1-gram, 2-gram, 3-gram and 4-gram.
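The cleansing and n-gram counting step can be sketched as follows. This is a minimal illustration in Python, not the code used for the app; the function names `tokenize` and `count_ngrams`, the cleaning rule (keep only letters and apostrophes) and the two sample sentences are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed cleaning rule: lowercase the text and keep only runs of
    # letters and apostrophes, which drops characters such as $, ! and *.
    return re.findall(r"[a-z']+", text.lower())

def count_ngrams(lines, n):
    # Count every run of n consecutive tokens, one line at a time,
    # so that n-grams never span two different lines.
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample lines standing in for the blog, news and Twitter corpora.
sample = ["I love data science!", "I love $$$ Shiny apps"]

# One frequency table per n-gram order, from 1-gram to 4-gram.
tables = {n: count_ngrams(sample, n) for n in range(1, 5)}
```

With this sample, `tables[2][("i", "love")]` is 2, because the bigram appears once in each line.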
Due to the enormous size of the resulting tables, rare n-grams were discarded, using a cut-off of 10 occurrences. This ensured a sensible compromise between accuracy, runtime and memory usage.
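The pruning step might look like the sketch below. It is written in Python for illustration only; the helper name `prune`, the example counts and the choice to keep n-grams whose count is at least the cut-off are assumptions, since the slide does not state on which side of the threshold an n-gram with exactly 10 occurrences falls.

```python
from collections import Counter

def prune(counts, min_count):
    # Keep only the n-grams seen at least min_count times (assumed rule).
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

# Hypothetical bigram counts, not taken from the real corpora.
bigrams = Counter({("of", "the"): 25, ("the", "app"): 12, ("app", "shiny"): 3})

pruned = prune(bigrams, min_count=10)
```

Here the rare bigram `("app", "shiny")` is dropped, while the two frequent bigrams are kept. Raising `min_count` shrinks the tables further at the cost of fewer known contexts.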
Libraries
Algorithm
Repository
The app is hosted here. After a short loading time it shows the following GUI:

Algorithm : Katz's back-off is a very simple algorithm to predict the next word. For further enhancement, other more sophisticated algorithms should be applied.
Memory : Throughout the task, a lot of struggles were due to the high amount of memory used for the tables. Therefore it might be sensible to think about further pruning or hashing of word sequences to save space.
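The back-off idea mentioned above can be illustrated with a short Python sketch. It is a simplified back-off lookup, not full Katz back-off, which additionally discounts counts and redistributes the freed probability mass to lower orders. The function name `predict_next`, the layout of `tables` and the example counts are assumptions made for the illustration.

```python
def predict_next(words, tables, max_order=4):
    # tables[n] maps an n-gram (a tuple of n words) to its count.
    words = [w.lower() for w in words]
    # Try the longest available context first, then back off to shorter ones.
    for n in range(min(max_order, len(words) + 1), 1, -1):
        context = tuple(words[-(n - 1):])
        # Linear scan for clarity; a real app would index tables by context.
        candidates = {
            gram[-1]: count
            for gram, count in tables.get(n, {}).items()
            if gram[:-1] == context
        }
        if candidates:
            return max(candidates, key=candidates.get)
    # No context matched: fall back to the most frequent single word.
    unigrams = tables.get(1, {})
    if unigrams:
        return max(unigrams, key=unigrams.get)[0]
    return None

# Hypothetical counts, not taken from the real corpora.
tables = {
    1: {("the",): 5, ("day",): 2},
    2: {("happy", "birthday"): 3, ("happy", "new"): 4},
    3: {("a", "happy", "birthday"): 2},
}

first = predict_next(["a", "happy"], tables)      # matched by the 3-gram table
second = predict_next(["very", "happy"], tables)  # backs off to the 2-gram table
third = predict_next(["unknown"], tables)         # backs off to the 1-gram table
```

In this example the three calls return "birthday", "new" and "the" respectively, showing how each miss at a higher order falls through to the next lower one.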
Thank you for your attention!