Data Science Capstone Project

author: Ramirez
date: February 5, 2023

Capstone Project Overview

For a demo of the app, visit the Shiny App link listed at the end of this document.

This project involves Natural Language Processing. The critical task is to take a user's input phrase (a group of words) and output a predicted next word.
The app predicts a sequence of words as the user types a sentence.
This is similar to the next-word suggestions offered by most smartphone keyboards today, such as SwiftKey.

Project deliverables:

A subset of the original data was sampled from three sources (blogs, Twitter, and news) and merged into a single corpus.
Next, the data was cleaned by converting to lowercase, stripping extra white space, and removing punctuation and numbers.
The corresponding n-grams were then created (bigrams, trigrams, quadgrams, and quintgrams).
Next, term-count tables were extracted from the n-grams and sorted by frequency in descending order.
Lastly, the n-gram objects were saved as compressed R data files (.RData files).
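
The steps above can be sketched in a few lines. The project itself was written in R and its code is not reproduced here, so this is only a minimal Python illustration under stated assumptions: the function names clean and ngram_counts and the three sample sentences are hypothetical, cleaning keeps ASCII letters only, and saving the tables to disk is left out.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, remove punctuation and digits, and collapse white space."""
    text = text.lower()
    # Assumption: keep ASCII letters and white space only (a simplification).
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(lines, n):
    """Return (n-gram, count) pairs sorted by count, highest first."""
    counts = Counter()
    for line in lines:
        tokens = clean(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common()

# Hypothetical sample standing in for the merged blogs/Twitter/news corpus.
sample = ["I love New York!", "I love 2 cats.", "We love New York"]

# One term-count table per n-gram order: bigram (2) through quintgram (5).
tables = {n: ngram_counts(sample, n) for n in range(2, 6)}

for ngram, count in tables[2]:
    print(" ".join(ngram), count)
```

With so small a sample the quintgram table is empty, which also shows why higher-order n-grams need a large corpus before they become useful.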

The next-word prediction app provides a simple user interface to the next-word prediction model.
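
One common way to turn n-gram tables into a next-word prediction is a simple backoff lookup: try the longest available context first, then fall back to shorter ones. The presentation does not state which scheme the app uses, so the following Python sketch is an assumption for illustration only. The names build_model and predict_next and the tiny corpus are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=5):
    """Map each context (1 to max_n - 1 words) to a Counter of following words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, phrase, max_n=5):
    """Back off from the longest context to the shortest; None if nothing matches."""
    tokens = phrase.lower().split()
    for k in range(min(max_n - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None

# Hypothetical corpus, far smaller than the sampled project data.
corpus = [
    "thank you very much",
    "thank you for the help",
    "thank you very much indeed",
]
model = build_model(corpus)
print(predict_next(model, "I want to thank you"))  # prints: very
```

Only the last four words of the input are ever consulted, which is the 5-gram limit discussed under further work.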

Key Features:

Key Benefits:

Further work can address the main weakness of this approach: long-range context.
1. The current algorithm discards contextual information beyond 5-grams.
2. Future work could recover this information by clustering the underlying training corpus and predicting which cluster the entire sentence falls into.
3. Prediction would then use ONLY the data subset that fits the long-range context of the sentence, while preserving the performance characteristics of an n-gram model and the structure of the prediction model.
4. To protect the proprietary nature of the app and algorithm, the R code is available only upon request.
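
The cluster-routing idea in items 2 and 3 could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the two hand-labelled clusters stand in for a learned clustering of the corpus, vocabulary overlap stands in for a real cluster classifier, and a bigram table stands in for the full n-gram model. None of it comes from the project code.

```python
from collections import Counter, defaultdict

def build_bigram_model(sentences):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

# Hypothetical, hand-labelled clusters standing in for a learned clustering.
clusters = {
    "sports": ["the team won the match", "the match went to overtime"],
    "cooking": ["stir the sauce and serve", "the sauce needs more salt"],
}
vocab_by_cluster = {
    name: {w for s in sents for w in s.lower().split()}
    for name, sents in clusters.items()
}
models = {name: build_bigram_model(sents) for name, sents in clusters.items()}

def pick_cluster(phrase):
    """Choose the cluster whose vocabulary overlaps most with the whole phrase."""
    words = set(phrase.lower().split())
    return max(vocab_by_cluster, key=lambda c: len(words & vocab_by_cluster[c]))

def predict_with_context(phrase):
    """Route the phrase to a cluster, then predict from that cluster's table only."""
    tokens = phrase.lower().split()
    if not tokens:
        return None, None
    cluster = pick_cluster(phrase)
    candidates = models[cluster].get(tokens[-1])
    word = candidates.most_common(1)[0][0] if candidates else None
    return cluster, word

print(predict_with_context("add salt then stir the"))  # ('cooking', 'sauce')
print(predict_with_context("our team lost the"))       # ('sports', 'match')
```

Both phrases end in the same word, yet the predictions differ because words earlier in the sentence select the data subset. The lookup itself stays a plain n-gram table lookup.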

Tidy Data
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Text Mining with R: A Tidy Approach
https://www.tidytextmining.com/tidytext.html

Shiny App
https://zerimar.shinyapps.io/WordCrystalBall/