Capstone project

Mikael Herve
11 Nov. 2020

Background

  • The capstone project encompasses all aspects of the material learned during this data science program.
  • For this last project, we were tasked with developing an algorithm that best predicts the next word when presented with a sentence or a word.
  • My project was restricted to US English, but the same approach can be applied to other languages, though with a range of additional difficulties.

How the product works

  • Training data are drawn from a mixture of Twitter feeds, news articles and blogs; only a subset of each is sampled, via a binomial function
  • Data are cleaned of punctuation, numbers, obscenities, etc., merged into a final text, then stored in a database as groups of 2 to 5 words (n-grams)
  • Starting with the last 4 words of the sentence, the algorithm looks for the 5-word groups in the database that begin with those words and picks the most probable one. If a match is found, the 5th word of that group is returned as the prediction. If not, the same process falls back to the last 3 words of the sentence, then the last 2 words, then the last word.
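
The fallback lookup described above can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the code behind the app: the function names `build_ngram_table` and `predict_next`, the in-memory dictionary standing in for the database, and the toy corpus are all hypothetical and chosen only for the example.

```python
from collections import Counter, defaultdict

def build_ngram_table(tokens, min_n=2, max_n=5):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            table[prefix][tokens[i + n - 1]] += 1
    return table

def predict_next(table, sentence, max_context=4):
    """Try the last 4 words, then 3, 2 and 1, until a stored prefix matches."""
    words = sentence.lower().split()
    for k in range(min(max_context, len(words)), 0, -1):
        followers = table.get(tuple(words[-k:]))
        if followers:
            # The most frequent follower has the highest relative frequency
            # for this prefix, which stands in for "most probable" here.
            return followers.most_common(1)[0][0]
    return None  # no match at any context length

# Toy corpus (hypothetical), already lower-cased and stripped of punctuation.
corpus = ("i want to go home and i want to go home now "
          "but i want to go out").split()
table = build_ngram_table(corpus)

# "they want to go" has no 4-word match, so the lookup falls back to "want to go".
print(predict_next(table, "they want to go"))  # prints "home"
```

Storing every prefix length in one table keeps the fallback a simple loop; a real database would more likely keep one table per n-gram size.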

Shinyapps.io Application

The Shiny application can be found at https://mrcherve.shinyapps.io/Capstone_final/

  • User is expected to enter a sentence or a word
  • User is provided with a predicted word, together with the time it took to retrieve that word

Performance / Accuracy trade-off

The key trade-off of this application is performance vs. accuracy. Performance matters when 1) loading the database into the Shiny app, 2) creating the database from a sample of the input data and 3) retrieving a word based on the input sentence.

  • We opted to construct the reference database from 2- to 5-grams, so at most the last 4 words of the sentence are matched.
  • We also opted to feed the database with a 25% sample of the initial Twitter/news/blogs data and kept only n-grams with a frequency above 1. This maximizes retrieval speed, with the understanding that some words may not be predicted accurately.
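
The two size-reduction steps above can be illustrated as follows. This is a hedged Python sketch, not the project's own code: it assumes the binomial sampling amounts to one independent keep/drop draw per line (a binomial draw of size 1), and the names `sample_lines` and `prune_counts`, the seed and the toy inputs are hypothetical.

```python
import random
from collections import Counter

def sample_lines(lines, keep_prob=0.25, seed=42):
    """Keep each line independently with probability keep_prob."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < keep_prob]

def prune_counts(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times (keep frequencies above 1)."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

# Toy inputs (hypothetical) standing in for the Twitter/news/blogs files.
lines = [f"line {i}" for i in range(10_000)]
sampled = sample_lines(lines)
print(len(sampled) / len(lines))  # roughly 0.25

counts = Counter({("i", "want"): 3, ("want", "to"): 2, ("go", "out"): 1})
print(prune_counts(counts))  # the singleton ("go", "out") is removed
```

Dropping singletons shrinks the table considerably on real text, since most n-grams occur only once; the cost is that rare but valid continuations can no longer be predicted.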