Data Science Capstone Final Project

Spiro Kolokithas
20th September 2019

The Task at Hand

This presentation outlines the approach taken to build a Shiny app that predicts the next word, developed for the Data Science Specialization Capstone on Coursera.

The application used a corpus of three documents containing English text from blogs, news articles and tweets.

A predictive algorithm was developed to predict the next word as you type in the Shiny app.

Methodology and Preprocessing

The entire corpus (news, blogs and tweets) was in excess of 350 MB, so a sample was taken to keep processing time and memory use manageable.
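As a rough illustration of the sampling step, the sketch below keeps each line of a corpus with a fixed probability. It is written in Python purely for illustration and is not the author's code; the function name `sample_lines`, the 5% fraction and the seed are all assumptions, since the presentation does not state how the sample was drawn or how large it was.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    `fraction` and `seed` are illustrative assumptions, not values
    taken from the presentation.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [line for line in lines if rng.random() < fraction]

# Stand-in corpus; the real input would be the blog, news and tweet files.
corpus = [f"line {i}" for i in range(1000)]
sample = sample_lines(corpus, fraction=0.05)
```

Fixing the seed means the same subset is drawn on every run, which makes later n-gram counts repeatable.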

Subsequently, the following preprocessing was completed:

- Words were converted to lower case
- Punctuation and stop words were removed
- Words were stemmed
- White space was removed
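The steps above can be sketched roughly as follows. This is a minimal Python illustration, not the author's code: the `STOP_WORDS` set is a tiny hypothetical sample, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as Porter's.

```python
import string

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}

def crude_stem(word):
    """Naive stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                               # lower case
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.split()                                             # collapses white space
    tokens = [t for t in tokens if t not in STOP_WORDS]               # stop words
    return [crude_stem(t) for t in tokens]                            # stemming

tokens = preprocess("The cats are   running, and the dog jumped!")
```

Splitting on white space handles the last bullet implicitly, because repeated spaces never produce empty tokens.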

The end result was a collection of n-grams: bigrams (2 words), trigrams (3 words) and quadgrams (4 words).
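To make the n-gram tables concrete, here is a small Python sketch that counts bigrams, trigrams and quadgrams from a token list. The helper names `ngrams` and `build_tables` are hypothetical, and the example sentence is made up; the presentation does not describe how the tables were stored.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_tables(tokens, orders=(2, 3, 4)):
    """Count how often each bigram, trigram and quadgram occurs."""
    return {n: Counter(ngrams(tokens, n)) for n in orders}

tokens = "i want to go to the park".split()
tables = build_tables(tokens)
# tables[2] holds bigram counts, tables[3] trigrams, tables[4] quadgrams.
```

Keeping frequencies rather than a plain list of n-grams is what later allows the most common continuation to be picked.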

The Application

The application is simple and intuitive. The user enters text and, as it is typed, the next word is predicted dynamically.

Text prediction uses an n-gram back-off approach. In simple terms, it first searches the quadgrams for the next word; if no match is found, it falls back to the trigrams and ultimately the bigrams.
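The back-off idea can be sketched as below. This is a minimal Python illustration under stated assumptions, not the app's actual code: `build_model` and `predict_next` are hypothetical names, the training sentence is made up, and the sketch simply returns the most frequent follower at the highest order that matches, without any probability discounting.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_order=4):
    """Map each (n-1)-word context to a Counter of following words, n = 2..max_order."""
    model = {n: defaultdict(Counter) for n in range(2, max_order + 1)}
    for n in model:
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[n][context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, typed_tokens):
    """Try the longest context first, then back off to shorter ones."""
    for n in sorted(model, reverse=True):          # 4, then 3, then 2
        context = tuple(typed_tokens[-(n - 1):])
        if len(context) < n - 1:                   # not enough words typed yet
            continue
        followers = model[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None                                    # no n-gram matched at any order

training = "see you next week see you next time see you next week".split()
model = build_model(training)
```

With this toy model, `predict_next(model, ["hello", "you", "next"])` finds no quadgram match and falls back to the trigram context `("you", "next")`, while an unseen word yields `None`.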


The Application

The application built can be found at:

https://spiro-kolokithas.shinyapps.io/WordPredictor/

The GitHub repository can be found at:

https://github.com/koloks/Data-Science-Capstone

Thank you for reviewing my work.