Data Science Capstone Project

RamirezJA

2/5/23

Capstone Project Overview

To try the app, visit the WordCrystalBall Shiny App: https://zerimar.shinyapps.io/WordCrystalBall/

- This project involves Natural Language Processing. The critical task is to take a user’s input phrase (group of words) and to output a predicted next word.

- The app predicts a sequence of words as the user types a sentence.

- The app works much like the predictive keyboards found on many smartphones today, such as SwiftKey.

Project deliverables: 1) a next word prediction model as the basis for an app, 2) a next word prediction app hosted at shinyapps.io, and 3) this presentation, hosted on RPubs.

Retrieving & Cleaning the Data

- A subset of the original data was sampled from three sources (blogs, Twitter, and news), and the samples were then merged into one corpus.

- Next, the data is cleaned by converting it to lowercase, stripping extra whitespace, and removing punctuation and numbers.

- The corresponding n-grams are then created (bigrams, trigrams, quadgrams, and quintgrams).

- Next, term-count tables are extracted from the n-grams and sorted by frequency, in descending order.

- Last, the n-gram objects are saved as compressed R data files (.RData).
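The cleaning and counting steps above can be sketched as follows. This is a minimal Python illustration, not the author's R code (which is not published); the function names `clean` and `ngram_counts` and the sample sentences are hypothetical, and the exact cleaning rules of the original pipeline may differ.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase, remove punctuation and numbers, and strip extra whitespace."""
    text = text.lower()
    # Keep only a-z and whitespace; this sketch also drops non-ASCII characters.
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def ngram_counts(lines, n):
    """Return (n-gram, count) pairs sorted by frequency, descending."""
    counts = Counter()
    for line in lines:
        tokens = clean(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common()


# Hypothetical sample standing in for the merged blogs/Twitter/news subset.
sample = ["I love New York!", "I love pizza, 2 slices.", "We love New York"]
bigrams = ngram_counts(sample, 2)
print(bigrams[:3])
```

The same function covers trigrams through quintgrams by changing `n`; the sorted tables would then be written to disk, which the original does with .RData files.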

Underlying Algorithm

The next word prediction app provides a simple user interface to the next word prediction model.
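The presentation does not spell out how the model looks up a prediction, so the following is only one plausible sketch, in Python rather than the author's R. It assumes a simple back-off: try the longest available n-gram first, then fall back to shorter ones. The names `build_tables` and `predict_next` and the toy corpus are hypothetical.

```python
from collections import Counter


def build_tables(sentences, max_n=5):
    """For each n (2..max_n), map an (n-1)-word prefix to counts of following words."""
    tables = {}
    for n in range(2, max_n + 1):
        table = {}
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i in range(len(tokens) - n + 1):
                prefix = tuple(tokens[i:i + n - 1])
                follower = tokens[i + n - 1]
                table.setdefault(prefix, Counter())[follower] += 1
        tables[n] = table
    return tables


def predict_next(phrase, tables):
    """Back off from the longest n-gram to the shortest; None if nothing matches."""
    tokens = phrase.lower().split()
    for n in sorted(tables, reverse=True):
        k = n - 1  # number of context words this table needs
        if len(tokens) < k:
            continue
        followers = tables[n].get(tuple(tokens[-k:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None


# Toy corpus (assumed, for illustration only).
corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow",
    "look at the sky",
]
tables = build_tables(corpus)
print(predict_next("Thanks for the", tables))
print(predict_next("stare at the", tables))
```

In the second call the three-word context is unseen, so the lookup falls back to the two-word context "at the". Precomputed lookup tables of this kind are one way to get the rapid response time listed under Key Benefits.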

Key Features

  1. A simple text box for user input

  2. The predicted next word appears dynamically, right below the user input

  3. Tabs with plots of the most frequent n-grams in the data set

Key Benefits

  1. Rapid response time.

  2. The method scales to large training sets, which leads to better next word predictions.

  3. The algorithm can be extended to other languages, such as German and Finnish.

Further Exploration

Additional work can address the main weakness of this approach, which is long-range context (beyond 4-grams):

  1. Future work could cluster the underlying training corpus and predict which cluster the entire sentence falls into.

  2. The model could then predict using ONLY the data subset that fits the long-range context of the sentence, while preserving the performance characteristics of the n-gram prediction model structure.
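The idea above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the two hand-labeled "clusters" stand in for the output of a real clustering step, the cluster is chosen by simple vocabulary overlap, and the names `follower_counts`, `pick_cluster`, and `predict_in_context` are hypothetical.

```python
from collections import Counter


def follower_counts(sentences):
    """Map each word to a Counter of the words that directly follow it."""
    table = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            table.setdefault(prev, Counter())[nxt] += 1
    return table


def pick_cluster(sentence, clusters):
    """Choose the cluster whose vocabulary overlaps most with the whole sentence."""
    words = set(sentence.lower().split())

    def overlap(name):
        vocab = {w for s in clusters[name] for w in s.lower().split()}
        return len(words & vocab)

    return max(clusters, key=overlap)


def predict_in_context(sentence, clusters):
    """Predict the next word using only the best-matching cluster's data."""
    name = pick_cluster(sentence, clusters)
    followers = follower_counts(clusters[name]).get(sentence.lower().split()[-1])
    return name, (followers.most_common(1)[0][0] if followers else None)


# Hand-labeled stand-ins for clusters found in the training corpus (assumed).
clusters = {
    "sports": ["the team won the match", "the team lost the match",
               "fans watched the match"],
    "cooking": ["stir the sauce slowly", "taste the sauce",
                "simmer the sauce", "season the sauce well"],
}
everything = [s for group in clusters.values() for s in group]
global_guess = follower_counts(everything)["the"].most_common(1)[0][0]
cluster_guess = predict_in_context("fans said the team lost the", clusters)
print(global_guess, cluster_guess)
```

On this toy data the global table and the cluster-restricted table disagree about the word after "the", because only the restricted lookup uses the earlier words of the sentence. The per-cluster tables keep the same n-gram structure, so lookups stay as cheap as before.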

References

Tidy Data: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Text Mining with R, A Tidy Approach: https://www.tidytextmining.com/tidytext.html

Shiny App: https://zerimar.shinyapps.io/WordCrystalBall/

To protect the proprietary nature of the app and algorithm, the R code is available upon request.