Data Science Capstone Project

Author: Ramirez
Date: February 5, 2023

Capstone Project Overview

For a demo of the app, follow the Shiny App link listed under References.

  • This project involves Natural Language Processing. The critical task is to take a user's input phrase (a group of words) and output a predicted next word.
  • The app predicts upcoming words as the user types a sentence.
  • The app works much like the predictive text on smartphone keyboards such as SwiftKey.

Project deliverables:

  • Next Word Prediction Model, as basis for an app
  • Next Word Prediction App hosted at shinyapps.io
  • This presentation, hosted on RPubs

Retrieving & Cleaning the Data

  • A subset of the original data is sampled from three sources (blogs, Twitter, and news) and merged into one corpus.
  • The data is then cleaned by converting to lowercase, stripping extra white space, and removing punctuation and numbers.
  • The corresponding n-grams are then created (bigrams, trigrams, quadgrams, and quintgrams).
  • Term-count tables are extracted from the n-grams and sorted by frequency in descending order.
  • Lastly, the n-gram objects are saved as compressed R data files (.RData).
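
The cleaning and counting steps above can be sketched as follows. This is a minimal illustration in Python, not the project's R code: the function names `clean` and `ngram_counts` and the sample sentences are hypothetical, and it assumes that only the letters a to z are kept after lowercasing.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, remove punctuation and numbers, and strip extra white space."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # assumption: keep only a-z and white space
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(lines, n):
    """Count n-grams per cleaned line; return (ngram, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        tokens = clean(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common()  # sorted by count, descending

# Hypothetical sample standing in for the merged blogs/Twitter/news subset
sample = ["I love New York!", "I love new shoes, 2 pairs.", "We love New York"]
bigrams = ngram_counts(sample, 2)
print(bigrams[0])  # (('love', 'new'), 3)
```

Calling `ngram_counts(sample, n)` with n from 2 to 5 gives one sorted term-count table per n-gram order, which is the shape of data the slides describe saving to disk.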

Underlying Algorithm (Next Word Prediction App)

The next word prediction app provides a simple user interface to the next word prediction model.

Key Features:

  1. A simple text box for user input
  2. The predicted next word appears dynamically, right below the user input
  3. Tabs with plots of the most frequent n-grams in the data set
  4. Side panel with user instructions

Key Benefits:

  1. Rapid response time.
  2. The method scales to large training sets, which leads to better next-word predictions.
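
The slides do not spell out how the model selects a word from the n-gram tables. One common approach, assumed here purely for illustration, is a simple back-off lookup: try the longest available context (up to four preceding words, matching the 5-gram limit) and fall back to shorter contexts until a match is found. The sketch below is in Python rather than the project's R, and `build_tables`, `predict_next`, and the toy corpus are hypothetical names and data.

```python
from collections import Counter, defaultdict

MAX_N = 5  # the project builds tables up to 5-grams

def build_tables(sentences, max_n=MAX_N):
    """Map each context (tuple of preceding words) to a Counter of following words."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[context][tokens[i + n - 1]] += 1
    return tables

def predict_next(tables, phrase, max_n=MAX_N):
    """Assumed back-off rule: longest matching context wins, then shorter ones."""
    tokens = phrase.lower().split()
    for k in range(min(max_n - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-k:])
        if context in tables:
            return tables[context].most_common(1)[0][0]
    return None  # no context of any length was seen in training

# Hypothetical toy corpus
corpus = ["thank you very much", "thank you for coming", "thank you very much indeed"]
tables = build_tables(corpus)
print(predict_next(tables, "I want to thank you"))  # very
```

Because each prediction is a handful of dictionary lookups against precomputed counts, response time stays short even as the training set grows, which is consistent with the benefits listed above.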

Shiny App Link: https://zerimar.shinyapps.io/WordCrystalBall/

Further Exploration

  • Future work can address the main weakness of this approach: long-range context.
    1. The current algorithm discards contextual information beyond 5-grams.
    2. Future work could recover this context by clustering the underlying training corpus and predicting which cluster the entire sentence falls into.
    3. This would allow prediction using ONLY the data subset that fits the long-range context of the sentence, while preserving the performance characteristics of an n-gram model and the structure of the prediction model.
  • To protect the proprietary nature of the app and algorithm, the R code is available upon request.

References

Tidy Data
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Text Mining with R: A Tidy Approach
https://www.tidytextmining.com/tidytext.html

Shiny App
https://zerimar.shinyapps.io/WordCrystalBall/