"Azlena Haron"
"April 24, 2016"
Executive Summary
The Capstone Project for the Coursera Data Science Specialization delivers an auto-complete predictive text model that returns a list of the words or phrases most likely to follow a given input string. The dataset is Coursera-SwiftKey; the model is based on its English corpus, which contains en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
Objective
i) To develop a Shiny app that takes a phrase (multiple words) in a text box as input and outputs a prediction of the next word; and ii) to explain the Shiny app in no more than 5 slides, created with RStudio Presenter, that pitch the algorithm and the app.
A data sample was created from the corpus and cleaned by converting it to lowercase and removing punctuation, links, extra white space, numbers and all kinds of special characters. The cleaned sample was then tokenized into so-called n-grams.
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
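The cleaning and tokenization steps can be illustrated with a minimal Python sketch. This is not the code behind the app; the function names `clean_text` and `ngrams` and the regular expressions are assumptions chosen for illustration, and the simple character filter also splits contractions such as "don't".

```python
import re

def clean_text(text):
    """Lowercase the text and strip links, numbers, punctuation and extra white space."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z\s]", " ", text)              # remove numbers, punctuation, special characters
    return re.sub(r"\s+", " ", text).strip()           # collapse white space

def ngrams(text, n):
    """Return every contiguous sequence of n words from the cleaned text."""
    words = clean_text(text).split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "Visit http://example.com NOW!! I have 2 cats, and I love them."
print(clean_text(sample))
print(ngrams(sample, 2)[:3])
```

Calling `ngrams(sample, 3)` or `ngrams(sample, 4)` gives the tri- and quadgrams in the same way.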
The aggregated bi-, tri- and quadgram term frequency matrices were then converted into frequency dictionaries.
Good-Turing smoothing was used to estimate the probabilities of the most likely completion words. The resulting data.frames are used to predict the next word from the text entered by the user.
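One way to picture such a frequency dictionary is a lookup from the leading words of an n-gram to counts of the word that follows. The Python sketch below is an assumption about the general shape, not the structure used in the app; `build_frequency_dict` and the toy sentence are hypothetical.

```python
from collections import Counter, defaultdict

def build_frequency_dict(tokens, n):
    """Map each (n-1)-word prefix to a Counter of the words observed after it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        table[prefix][next_word] += 1
    return table

tokens = "i love you and i love cats and i love you".split()
bigrams = build_frequency_dict(tokens, 2)
trigrams = build_frequency_dict(tokens, 3)

# Most frequent words seen after "love" and after "i love"
print(bigrams[("love",)].most_common(2))
print(trigrams[("i", "love")].most_common(2))
```

Keying on the prefix keeps prediction to a single dictionary lookup per n-gram order.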
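As a rough illustration of the smoothing step, the basic Good-Turing estimate replaces a raw count c with c* = (c + 1) * N_{c+1} / N_c, where N_c is the number of distinct items seen exactly c times. The Python sketch below is an assumption about how this could look, not the app's implementation: the function names and toy counts are hypothetical, the fallback to the raw count when N_{c+1} is zero is a simplification, and the resulting scores are not renormalized.

```python
from collections import Counter

def good_turing_counts(counts):
    """Adjusted counts c* = (c + 1) * N_{c+1} / N_c; keeps the raw count when N_{c+1} is 0."""
    freq_of_freq = Counter(counts.values())  # N_c: how many words occur exactly c times
    adjusted = {}
    for word, c in counts.items():
        n_c = freq_of_freq[c]
        n_c1 = freq_of_freq.get(c + 1, 0)
        adjusted[word] = (c + 1) * n_c1 / n_c if n_c1 > 0 else float(c)
    return adjusted

def predict_next(counts, top=3):
    """Rank candidate next words by adjusted count divided by the total raw count."""
    adjusted = good_turing_counts(counts)
    total = sum(counts.values())
    ranked = sorted(adjusted.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, score / total) for word, score in ranked[:top]]

# Hypothetical counts of words observed after some prefix
counts = {"you": 3, "cats": 2, "dogs": 2, "pizza": 1, "tea": 1, "rain": 1}
for word, score in predict_next(counts):
    print(word, round(score, 3))
```

The share of probability reserved for unseen words under this estimate is N_1 / N, which is 3/10 for the toy counts above.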
This app lets you enter a custom word or phrase. Once you click “Submit”, the app displays your input before and after processing. The app is available at https://ana68.shinyapps.io/final_projek/
Note: This Shiny app prototype runs on only a sample of the data (0.01%) because of RAM constraints.