Word Predictor Capstone Project

James C. Birk
31 MAY 2020

Coursera Data Science Capstone

  • Culminating project to test multiple techniques learned in the Data Science discipline
  • Partnership between Coursera and SwiftKey
  • Can it predict your next word?

The app is found here: https://jamescbirk.shinyapps.io/NgramProcess/

Background

SwiftKey provided the HC Corpora, a collection of text drawn from Twitter, blogs, and news articles. The three sources were loaded into memory, combined, and cleaned by converting the text to lowercase and removing punctuation, symbols, and curse words.
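The cleaning steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the R code behind the app; the function name `clean_text` and the tiny `BLOCKED_WORDS` list are hypothetical stand-ins, since the deck does not say which profanity list or cleaning functions were used.

```python
import re

# Hypothetical placeholder list; the actual profanity list is not given in the deck.
BLOCKED_WORDS = {"darn", "heck"}

def clean_text(line, blocked=BLOCKED_WORDS):
    """Lowercase a line, strip punctuation and symbols, and drop blocked words."""
    line = line.lower()
    # Replace everything except letters, digits, apostrophes, and whitespace.
    line = re.sub(r"[^a-z0-9'\s]", " ", line)
    # Splitting on whitespace also collapses the extra spaces left behind.
    words = [w for w in line.split() if w not in blocked]
    return " ".join(words)

print(clean_text("Well, DARN it! #NLP rocks :)"))
```

Keeping apostrophes is a design choice made here so that contractions such as "I'm" stay a single token; the original pipeline may have handled them differently.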

The corpus was then tokenized into n-grams, a common practice in the field of Natural Language Processing (NLP).
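For readers new to the term, an n-gram is simply a run of n consecutive words. A minimal Python sketch, assuming plain whitespace tokenization (the helper name `ngrams` is hypothetical, and the app itself was built in R):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i love data science".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sentence shorter than n words yields no n-grams at all, which is why short tweets contribute mostly to the lower-order tables.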

Unigram, bigram, trigram, and quadgram frequency matrices were built from the tokenized corpus, and a predictive backoff model was developed from those term frequencies.
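The deck does not say which backoff variant the app uses, so the sketch below shows only the simplest form of the idea: look up the longest available context (three words, for quadgrams) and, if it was never seen, drop the earliest word and try again, ending with the most frequent single word. It is written in Python for illustration; `build_model` and `predict_next` are hypothetical names, and no probability discounting is applied.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=4):
    """Map each context (0 to max_n - 1 words) to a Counter of following words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model

def predict_next(model, text, max_n=4):
    """Return the most frequent next word, backing off to shorter contexts."""
    tokens = text.lower().split()
    # Start with the longest usable context, then drop the earliest word each round.
    for k in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the model is empty

corpus = [
    "i love data science",
    "i love data analysis",
    "i love data science a lot",
    "we love pizza",
]
model = build_model(corpus)
print(predict_next(model, "i love data"))       # matched on a three-word context
print(predict_next(model, "they really love"))  # backs off to the one-word context
```

Storing counts keyed by context keeps each lookup to a single dictionary access per backoff level, at the cost of holding every observed context in memory.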

Shiny App

Besides testing our application of data science techniques through modeling, the capstone also tested our ability to use R to create a viable Shiny App which an everyday user could easily access.

Here is a screenshot of my app:

[Screenshot of the Shiny app]

Further Work

The App, while easy to use, is not as fast as it could be.

Sampling a larger share of the original corpus while eliminating uncommon words could lead to greater accuracy and speed.
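One way to eliminate uncommon words, sketched here in Python as an assumption rather than as a description of the app, is to replace every word seen fewer than `min_count` times with a single placeholder token before building the n-gram tables. Substituting a placeholder instead of deleting the word is a choice made for this sketch: it avoids creating n-grams out of words that were never actually adjacent. The function name, threshold, and `<unk>` token are all hypothetical.

```python
from collections import Counter

def prune_rare_words(sentences, min_count=2, unk="<unk>"):
    """Replace words seen fewer than min_count times with a placeholder token."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {w for w, c in counts.items() if c >= min_count}
    pruned = [
        " ".join(w if w in vocab else unk for w in s.split())
        for s in sentences
    ]
    return pruned, vocab

pruned, vocab = prune_rare_words(["the cat sat", "the dog sat", "the cat ran"])
print(pruned)
```

A smaller vocabulary shrinks every n-gram table, which is where the potential speed gain would come from; contexts containing the placeholder would normally be skipped when predicting.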

Additionally, the model was built within the hardware limits of an at-home laptop with 16GB of RAM; a more robust machine with greater memory capacity could accommodate a larger sample.

Furthermore, AI and Natural Language Processing are advancing on a near-daily basis. The underlying backoff model used in my App should be updated regularly to reflect improvements in widely used algorithms.