DS Capstone Project Presentation

AF
05/02/2018

Introduction: "So you're saying" app


  • “So you're saying” is an app that predicts the next word given a phrase entered by the user
  • The prediction makes use of an N-gram database; an N-gram is a contiguous sequence of n items from a given sequence of text or speech (illustrated in the sketch below)
  • The database contains 3 lists: Tri-gram, Bi-gram and Uni-gram
  • The background algorithm is based on a text set extracted from news, blogs and Twitter, containing more than 3 million posts
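As a small illustration of what the three lists contain, the sketch below extracts Uni-, Bi- and Tri-grams from a single sentence with base R. The `make_ngrams()` helper is written only for this presentation; the app's actual tokenization pipeline is not shown here.

```r
# Minimal sketch of N-gram extraction from one sentence (illustrative only)
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

words <- strsplit(tolower("so you're saying the next word"), " ")[[1]]
make_ngrams(words, 1)  # Uni-grams: "so", "you're", "saying", ...
make_ngrams(words, 2)  # Bi-grams:  "so you're", "you're saying", ...
make_ngrams(words, 3)  # Tri-grams: "so you're saying", "you're saying the", ...
```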

The training set

  • The training corpus is based on a random sample of 10% of the original text set. From this corpus, about 300k Uni-grams, around 3 million Bi-grams and roughly 7 million Tri-grams were created (see the sketch below)
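The sketch below shows one way the training sample and the three frequency tables could be built, reusing the hypothetical `make_ngrams()` helper from the previous slide. The file names, the 10% sampling rate and the simple regex tokenizer are assumptions for illustration; the real pipeline likely used a dedicated text-mining package and handled sentence boundaries more carefully.

```r
# Hypothetical sketch: sample 10% of the raw text and count N-grams
set.seed(1234)
lines <- c(readLines("en_US.news.txt"),
           readLines("en_US.blogs.txt"),
           readLines("en_US.twitter.txt"))
sample_lines <- sample(lines, round(length(lines) * 0.10))

# Very simple tokenization: lower-case, keep letters and apostrophes only
tokens <- unlist(strsplit(tolower(sample_lines), "[^a-z']+"))
tokens <- tokens[tokens != ""]

# Frequency tables, sorted so the most frequent N-grams come first
# (note: this simplification lets N-grams span line boundaries)
unigrams <- sort(table(make_ngrams(tokens, 1)), decreasing = TRUE)
bigrams  <- sort(table(make_ngrams(tokens, 2)), decreasing = TRUE)
trigrams <- sort(table(make_ngrams(tokens, 3)), decreasing = TRUE)
```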

How it works...

  • The app uses a simple back-off algorithm, starting from the Tri-gram list: it matches the last 2 words of the input text against the first 2 words of each Tri-gram, and when a match is found the 3rd word is returned
  • If fewer than 5 matches are found in the Tri-gram list, the Bi-gram list is searched next, and finally the Uni-gram list, in that order (a sketch of this lookup follows)
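The following sketch shows how such a back-off lookup could be written. It assumes the `unigrams`, `bigrams` and `trigrams` tables built in the previous sketch; the function name `predict_next()` and the details of the matching are illustrative, not the app's exact code.

```r
# Sketch of the back-off lookup described above
predict_next <- function(phrase, max_hits = 5) {
  words <- strsplit(tolower(phrase), "[^a-z']+")[[1]]
  words <- words[words != ""]
  hits <- character(0)

  # 1. Match the last 2 input words against the first 2 words of each Tri-gram
  if (length(words) >= 2) {
    key   <- paste(tail(words, 2), collapse = " ")
    found <- names(trigrams)[startsWith(names(trigrams), paste0(key, " "))]
    hits  <- c(hits, vapply(strsplit(found, " "), function(x) x[3], character(1)))
  }

  # 2. If fewer than max_hits, back off to Bi-grams using the last input word
  if (length(hits) < max_hits && length(words) >= 1) {
    key   <- tail(words, 1)
    found <- names(bigrams)[startsWith(names(bigrams), paste0(key, " "))]
    hits  <- c(hits, vapply(strsplit(found, " "), function(x) x[2], character(1)))
  }

  # 3. Finally fall back to the most frequent Uni-grams
  if (length(hits) < max_hits) {
    hits <- c(hits, names(unigrams))
  }

  head(unique(hits), max_hits)
}

predict_next("so you're saying")
```

Because the tables are sorted by frequency, the first matches returned at each level are also the most frequent candidates.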

Conclusion

The corpus covers a wide range of words and word frequencies, including the important high-frequency words. The sample size was large enough to satisfy two laws from linguistics concerning large corpora, and the analysis ran quickly both on a personal computer and on shinyapps.io. As the running app shows, every phrase submitted to the Katz back-off model produced a prediction in the form of a single returned word.