Data Science Capstone Project

"Azlena Haron"
"April 24, 2016"

INTRODUCTION:

Executive Summary

The Capstone Project for the Coursera Data Science Specialization delivers an auto-complete predictive text model that suggests a list of the words or phrases most likely to follow a given input string. The dataset is the Coursera-SwiftKey corpus, and the model is built on its English (en_US) files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.

Objective

i) To develop a Shiny app that takes a phrase (multiple words) as input in a text box and outputs a prediction of the next word; and ii) To pitch the algorithm and the app in a presentation of no more than 5 slides created with RStudio Presenter.

WHY SWIFTKEY?

SwiftKey is one of the most popular smartphone keyboard apps, available for both Android and iOS devices.
SwiftKey has been installed on more than 300 million devices.
SwiftKey estimates that its users have saved nearly 10 trillion keystrokes across 100 languages, or more than 100,000 years of combined typing time.

THE ALGORITHM PROCESS:

A data sample was drawn from the corpus and cleaned by converting it to lowercase and removing punctuation, links, extra white space, numbers and all kinds of special characters. The cleaned sample was then tokenized into so-called n-grams.
In computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
The aggregated bi-, tri- and quadgram term frequency matrices were then converted into frequency dictionaries.
Good-Turing smoothing was applied to the n-gram frequencies used to rank the most likely completions. The resulting data.frames are used to predict the next word from the text the user enters.
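
The steps above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the app's actual R code: every function name here (clean, ngrams, build_tables, good_turing, predict) is hypothetical, the back-off from quadgrams to trigrams to bigrams is an assumed lookup order, and the sketch ranks candidates by raw counts. The good_turing helper only shows the basic adjusted-count formula; how the original app applies its smoothing is not stated in this presentation.

```python
import re
from collections import Counter, defaultdict

def clean(text):
    """Lowercase, then strip links, digits, punctuation and extra white space."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[^a-z\s]", " ", text)  # numbers, punctuation, special characters
    return text.split()  # split() also collapses repeated white space

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_tables(lines, max_n=4):
    """Frequency dictionaries: (n-1)-word prefix -> Counter of following words."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in tables:
            for gram in ngrams(tokens, n):
                tables[n][gram[:-1]][gram[-1]] += 1
    return tables

def good_turing(counts):
    """Basic adjusted count c* = (c + 1) * N[c+1] / N[c].

    N[c] is the number of items seen exactly c times. When N[c+1] is 0 the
    raw count is kept (an assumption made here to avoid zeroing out items).
    """
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for item, c in counts.items():
        n_next = freq_of_freq.get(c + 1, 0)
        adjusted[item] = (c + 1) * n_next / freq_of_freq[c] if n_next else float(c)
    return adjusted

def predict(tables, phrase, k=3):
    """Try the longest matching prefix first, then back off to shorter ones."""
    tokens = clean(phrase)
    for n in sorted(tables, reverse=True):
        prefix = tuple(tokens[-(n - 1):])
        if len(prefix) == n - 1 and prefix in tables[n]:
            return [word for word, _ in tables[n][prefix].most_common(k)]
    return []

# Toy corpus, for illustration only.
sample = ["I love data science.", "I love data analysis!", "I love data science a lot"]
tables = build_tables(sample)
print(predict(tables, "We LOVE data"))
```

With the toy corpus, no quadgram prefix matches "we love data", so the lookup backs off to the trigram prefix "love data" and returns its followers ordered by frequency.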

HOW THE APP WORKS:


This app lets you enter a custom word or phrase. Once you click “Submit”, the app displays your input before and after processing. The app is available at https://ana68.shinyapps.io/final_projek/

Note: This Shiny app prototype runs on only a small sample of the data (0.01%) because of RAM constraints.
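
One way to draw such a sample without loading a whole file into memory is to keep each line with a fixed probability while streaming through it. The sketch below is in Python and is only an assumption about how sampling could be done, not the app's actual code: the function name sample_lines and the seed value are hypothetical, and a fraction of 0.0001 corresponds to 0.01%.

```python
import random

def sample_lines(lines, fraction, seed=2016):
    """Keep each line independently with probability `fraction`.

    `lines` can be any iterable, including an open file handle, so the
    full corpus never has to sit in memory at once.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [line for line in lines if rng.random() < fraction]

# Hypothetical usage on one of the corpus files:
# with open("en_US.blogs.txt", encoding="utf-8", errors="ignore") as fh:
#     blog_sample = sample_lines(fh, 0.0001)
```

Because each line is kept or dropped independently, the sample size varies slightly from run to run unless the seed is fixed.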