Capstone project

Mikael Herve
11 Nov. 2020

Background

The capstone project encompasses all aspects of the material learned during this data science program.
For this last project, we were tasked with developing an algorithm that best predicts the next word when presented with a sentence or a single word.
My project was restricted to US English, but the same can be done for other languages, with a range of additional difficulties.

How the product works

Training data are a mixture of Twitter feeds, news articles and blogs; only a subset of each is sampled, via a binomial function
Data are cleaned of punctuation, numbers, obscenities, etc. and merged into a final text, then stored as groups of 2 to 5 words (n-grams) in a database
The algorithm starts with the last 4 words of the sentence and looks for the 5-word group that most probably begins with those words. If a match is found, the 5th word of that group is returned as the prediction. If not, the same process falls back to the last 3 words of the sentence, then the last 2 words, then the last word.
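
The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the application: the function names (tokenize, build_ngram_table, predict_next), the in-memory dictionary standing in for the database, and the choice of "most frequent follower" as the probability criterion are all hypothetical.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and keep only letters and apostrophes,
    # which drops punctuation and numbers.
    return re.findall(r"[a-z']+", text.lower())

def build_ngram_table(tokens, min_n=2, max_n=5):
    # Map each (n-1)-word prefix to a Counter of the words that follow it.
    # A plain dict stands in for the database (assumption).
    table = defaultdict(Counter)
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            table[prefix][tokens[i + n - 1]] += 1
    return table

def predict_next(table, sentence, max_context=4):
    # Try the last 4 words first, then fall back to 3, 2 and 1.
    words = tokenize(sentence)
    for k in range(min(max_context, len(words)), 0, -1):
        followers = table.get(tuple(words[-k:]))
        if followers:
            # Most frequent follower = highest relative frequency
            # for this prefix.
            return followers.most_common(1)[0][0]
    return None  # no n-gram matches any suffix of the input

corpus = ("thank you so much for the follow "
          "thank you so much for the support "
          "thank you so much for the follow")
table = build_ngram_table(tokenize(corpus))
print(predict_next(table, "Thank you so much for the"))
print(predict_next(table, "I am waiting for the"))
```

In the second call none of the 4-word or 3-word contexts exist in the toy corpus, so the lookup only succeeds once it has backed off to the last 2 words ("for the").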

Shinyapps.io Application

The Shiny application can be found at https://mrcherve.shinyapps.io/Capstone_final/

User is expected to enter a sentence or a word
User is provided with a predicted word, together with the time it took to retrieve that word

Performance / Accuracy trade-off

The key trade-off of this application is performance vs. accuracy. Performance matters when 1) loading the database into the Shiny app, 2) creating the database from a sample of the input data and 3) retrieving a word based on the input sentence.

We opted to build the reference database from 2- to 5-grams, so at most the last 4 words of the sentence are matched.
We also opted to feed the database with a 25% sample of the initial Twitter/news/blogs data and kept only n-grams with a frequency above 1. This maximizes retrieval speed, with the understanding that some words may not be predicted accurately.
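
As a rough illustration of these two size reductions, the sketch below keeps each input line with probability 0.25 (one Bernoulli draw per line, i.e. a binomial sample overall) and then drops n-grams seen only once. The function names, the seed and the toy counts are assumptions made for the example, not values taken from the application.

```python
import random
from collections import Counter

def sample_lines(lines, rate=0.25, seed=2020):
    # Keep each line independently with probability `rate`.
    # The seed is arbitrary and only makes the sample reproducible.
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]

def prune(counts, min_count=2):
    # Keep only n-grams whose frequency is above 1.
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

lines = [f"line {i}" for i in range(10000)]
subset = sample_lines(lines)
print(len(subset) / len(lines))  # close to 0.25

counts = Counter({"for the": 3, "the follow": 2, "the support": 1})
print(prune(counts))
```

Pruning singletons removes the long tail of rare n-grams, which typically makes up a large share of the table, at the cost of never predicting a continuation that appeared only once in the sample.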