James C. Birk
31 MAY 2020
The app is found here: https://jamescbirk.shinyapps.io/NgramProcess/
SwiftKey provided the HC Corpora, a collection of text drawn from Twitter, blogs, and news articles. The corpora were loaded into memory, combined, and cleaned: punctuation, symbols, and profanity were removed, and all text was converted to lowercase.
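The cleaning steps described above could look roughly like the following. This is a minimal sketch in Python rather than the R code behind the app; the function name clean_text and the contents of BLOCKED_WORDS are placeholders, since the actual profanity list and cleaning rules are not given here.

```python
import re

# Placeholder word list (assumption): the real profanity list is not shown in the source.
BLOCKED_WORDS = {"darn", "heck"}

def clean_text(text, blocked=BLOCKED_WORDS):
    """Lowercase the text, strip punctuation, digits, and symbols, and drop blocked words."""
    text = text.lower()
    # Replace anything that is not a letter, apostrophe, or whitespace with a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Splitting on whitespace also collapses the runs of spaces left by the substitution.
    tokens = [t for t in text.split() if t not in blocked]
    return " ".join(tokens)

print(clean_text("Hello, World! Darn #NLP 2020"))
```

Keeping apostrophes is a design choice in this sketch so that contractions such as "don't" survive as single tokens.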
The corpus was then tokenized into n-grams, a common practice in the field of Natural Language Processing (NLP).
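For readers new to the idea, an n-gram is a run of n consecutive words. A small illustrative sketch in Python (the helper name ngrams is an assumption, not taken from the app):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the cat sat on the mat".split()
print(ngrams(words, 2))  # bigrams
print(ngrams(words, 3))  # trigrams
```

If the text has fewer than n words, the function returns an empty list.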
Specifically, unigram, bigram, trigram, and quadgram frequency matrices were built from the corpus, and a predictive backoff model was developed from those term frequencies.
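The exact backoff variant used in the app is not specified here, so the following Python sketch shows only the general idea under a simplifying assumption: try the longest matching context first (three words, for a quadgram), and fall back to shorter contexts whenever no match is found, with no discounting or smoothing. The names build_tables and predict_next are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=4):
    """tables[k] maps a k-word context tuple to a Counter of the words that follow it.
    k = 0 is the unigram table, k = 3 the quadgram table."""
    tables = {k: defaultdict(Counter) for k in range(max_n)}
    for k in range(max_n):
        for i in range(len(tokens) - k):
            context = tuple(tokens[i:i + k])
            tables[k][context][tokens[i + k]] += 1
    return tables

def predict_next(tables, history, max_n=4):
    """Return the most frequent next word, backing off from long contexts to short ones."""
    for k in range(max_n - 1, -1, -1):
        if k > len(history):
            continue  # not enough words typed for this context length
        context = tuple(history[-k:]) if k else ()
        candidates = tables[k].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the tables are empty

corpus = "i love data science and i love data analysis and i love r".split()
tables = build_tables(corpus)
print(predict_next(tables, ["i", "love"]))
print(predict_next(tables, ["unseen", "love"]))  # backs off to the one-word context
```

A production model would typically weight or discount the lower-order estimates rather than simply taking the first hit, which this sketch does not attempt.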
Beyond applying data science techniques through modeling, the capstone also tested our ability to use R to build a working Shiny app that an everyday user could easily access.
Here is a screenshot of my app:
The App, while easy to use, is not as fast as it could be.
Sampling a larger share of the original corpus while pruning uncommon words could improve both accuracy and speed.
Additionally, I recognize the hardware limitations of an at-home laptop with 16GB of RAM vs. a more robust machine with greater memory capacity.
Furthermore, AI and Natural Language Processing are advancing on a near-daily basis. The backoff model underlying my App should be updated regularly to reflect improvements in widely used algorithms.
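As one illustration of the idea of eliminating uncommon words mentioned above, rare words can be mapped to a single placeholder token before the n-gram tables are built, which shrinks the tables. This is a hedged Python sketch, not part of the app; the threshold min_count and the token "<unk>" are assumptions chosen for the example.

```python
from collections import Counter

def prune_rare_words(tokens, min_count=2, unk="<unk>"):
    """Replace every word seen fewer than min_count times with a placeholder token."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else unk for t in tokens]

print(prune_rare_words("a b a c a b".split()))
```

Whether this helps accuracy in practice would depend on the threshold and the sample size, which would need to be tested.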