Data Science Capstone Project
author: Ramirez
date: February 5, 2023
autosize: true
Capstone Project Overview
For a demo of the app, visit https://zerimar.shinyapps.io/WordCrystalBall/ and try it risk free!
- This project involves Natural Language Processing. The critical task is to take a user’s input phrase (group of words) and to output a predicted next word.
- The app predicts the next word as the user types a sentence.
- This is similar to the word suggestions offered by smartphone keyboards today, such as SwiftKey.
Project deliverables:
- Next Word Prediction Model, as basis for an app
- Next Word Prediction App hosted at shinyapps.io
- This presentation hosted at RPubs
Retrieving & Cleaning the Data
- A subset of the original data was sampled from three sources (blogs, Twitter, and news), and the samples were merged into one corpus.
- Next, the data was cleaned by converting it to lowercase, stripping white space, and removing punctuation and numbers.
- The corresponding n-grams were then created (i.e., bigrams, trigrams, quadgrams, and quintgrams).
- Next, term-count tables were extracted from the n-grams and sorted by frequency in descending order.
- Lastly, the n-gram objects were saved as compressed R data files (.RData files).
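
The steps above can be sketched in base R. This is a minimal illustration, not the project's actual code: the three sample lines, the helper names (`clean_text`, `make_ngrams`, `ngram_table`), and the output file name are all assumptions made for the example, and the real pipeline may rely on a text-mining package instead.

```r
# Hypothetical sample standing in for the merged blogs/Twitter/news subset.
sample_text <- c(
  "I love New York, 2 times a year!",
  "I love   new shoes.",
  "We love New York"
)

# Cleaning: lowercase, remove punctuation and numbers, strip extra white space.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:digit:]]+", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Build every n-gram of size n from one cleaned line.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Term-count table, sorted by frequency in descending order.
ngram_table <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n), use.names = FALSE)
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = as.character(names(counts)),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

cleaned  <- clean_text(sample_text)
bigrams  <- ngram_table(cleaned, 2)
trigrams <- ngram_table(cleaned, 3)

# Save the tables as a compressed .RData file (save() compresses by default).
out_file <- file.path(tempdir(), "ngrams.RData")
save(bigrams, trigrams, file = out_file)
```

The same `ngram_table()` call with `n = 4` or `n = 5` would give the quadgram and quintgram tables on a corpus with long enough lines.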
Underlying Algorithm (Next Word Prediction App)
The next word prediction app provides a simple user interface to the next word prediction model.
Key Features:
- A simple text box for user input
- The predicted next word is displayed dynamically, right below the user input
- Tabs with plots of the most frequent n-grams in the data set
- A side panel with user instructions
Key Benefits:
- Rapid response time.
- The method scales to large training sets, which lead to better next-word predictions
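
The slides do not spell out the lookup itself. One common way to predict from sorted term-count tables is a simple back-off: match the longest available context first, then fall back to shorter n-grams when nothing matches. The sketch below assumes that approach; the function `predict_next`, the table layout (`ngram` and `freq` columns), and the toy counts are hypothetical and are not taken from the app's code.

```r
# Hypothetical term-count tables, keyed by n-gram size.
ngram_tables <- list(
  "3" = data.frame(
    ngram = c("i love new", "love new york", "love new shoes"),
    freq  = c(2L, 2L, 1L),
    stringsAsFactors = FALSE
  ),
  "2" = data.frame(
    ngram = c("love new", "new york", "i love"),
    freq  = c(3L, 2L, 2L),
    stringsAsFactors = FALSE
  )
)

# Assumed back-off lookup: try the longest context first, then shorter ones.
predict_next <- function(phrase, tables, max_n = 5) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)

  top <- min(max_n, length(words) + 1)
  for (n in rev(2:top)) {
    tbl <- tables[[as.character(n)]]
    if (is.null(tbl)) next

    # The context is the last n - 1 words typed by the user.
    context <- paste(tail(words, n - 1), collapse = " ")
    hits <- tbl[startsWith(tbl$ngram, paste0(context, " ")), ]

    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$freq)]
      # The prediction is the final word of the most frequent match.
      return(tail(strsplit(best, " ", fixed = TRUE)[[1]], 1))
    }
  }
  NA_character_
}

predict_next("I love new", ngram_tables)  # matched in the trigram table
predict_next("brand new", ngram_tables)   # falls back to the bigram table
```

Because each lookup is a filter over a pre-computed table, response time depends on table size rather than on the size of the raw corpus.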
Further Exploration
- Further work can address the main weakness of this approach: long-range context
- The current algorithm discards contextual information beyond 5-grams
- Future work could recover this context by clustering the underlying training corpus and predicting which cluster the entire sentence falls into.
- This would allow prediction using ONLY the data subset that fits the long-range context of the sentence, while preserving the performance characteristics of an n-gram model and the structure of the prediction model.
- To protect the proprietary nature of the app and algorithm, the R code is available upon request.
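
The clustering idea above could look roughly like the following. This is only a sketch of one possible design: it assumes cluster labels already exist from an earlier clustering step, and it assigns a sentence to the cluster whose word-count centroid is closest by cosine similarity. The toy corpus, the labels, and the names `tokenize`, `vectorize`, and `pick_cluster` are invented for illustration.

```r
# Hypothetical training lines; the cluster labels are assumed to come from
# an earlier clustering step that is not shown here.
corpus <- data.frame(
  text = c("the team won the game last night",
           "the coach praised the team after the game",
           "the bank raised interest rates again",
           "interest rates hit the housing market"),
  cluster = c("sports", "sports", "finance", "finance"),
  stringsAsFactors = FALSE
)

tokenize <- function(x) {
  w <- strsplit(tolower(x), "[^a-z]+")[[1]]
  w[nzchar(w)]
}

vocab <- sort(unique(unlist(lapply(corpus$text, tokenize))))

# Word-count vector over the training vocabulary; unseen words are ignored.
vectorize <- function(x) {
  as.numeric(table(factor(tokenize(x), levels = vocab)))
}

# One summed word-count vector (centroid) per cluster, stored as columns.
centroids <- sapply(split(corpus$text, corpus$cluster),
                    function(texts) vectorize(paste(texts, collapse = " ")))

cosine <- function(a, b) {
  d <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (d == 0) 0 else sum(a * b) / d
}

# Assign the whole sentence to the most similar cluster.
pick_cluster <- function(sentence) {
  v <- vectorize(sentence)
  sims <- apply(centroids, 2, cosine, b = v)
  names(sims)[which.max(sims)]
}

sentence <- "the game was close and the team"
chosen   <- pick_cluster(sentence)

# Only this subset would feed the n-gram tables used for prediction.
subset_lines <- corpus$text[corpus$cluster == chosen]
```

In this design the n-gram tables would be built once per cluster, so the per-keystroke lookup stays the same as before and only the choice of table changes.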
References
Tidy Data
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
Text Mining with R: A Tidy Approach
https://www.tidytextmining.com/tidytext.html
Shiny App
https://zerimar.shinyapps.io/WordCrystalBall/