Coursera Data Science Specialization Capstone Project
author: Sang Myong Lee
date: 2025-02-01
The Project
This project uses Natural Language Processing (NLP).
The critical task is to take a user’s input phrase and output one or more predicted next words.
This presentation covers the NLP Next Word Predict application, including the instruction tab in
its user interface, and the details of the text prediction algorithm.
Project deliverables:
- NLP (Next Word) Prediction Model, the basis for this application
- NLP Prediction App hosted at shinyapps.io
- This presentation hosted at RPubs
NLP Next Word Prediction Model
The next word prediction model uses the principles of “tidy data” applied to text mining in R. Key model steps:
- Input: raw text files for model training
- Clean training data; split into 2-word, 3-word, and 4-word n-grams
- Sort n-grams by frequency and save them as data repos
- N-grams function: uses a “back-off” type prediction model
- user supplies an input phrase
- model uses the last 3, 2, or 1 word(s) of the phrase to look up the best-matching 4th, 3rd, or 2nd word in the repos, backing off to a shorter context when no match is found
- Output: next word prediction
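
The back-off lookup in the steps above can be sketched roughly as follows. This is a minimal base R illustration, not the application's actual code: the `repos` tables, their column names (`context`, `word`, `n`), the `predict_next` function, and all counts are hypothetical.

```r
# Hypothetical toy repos, keyed by context length (3, 2, or 1 words).
# Each row: the leading words, the word that followed them, and a made-up count.
repos <- list(
  "3" = data.frame(context = "thanks for the", word = "follow", n = 4,
                   stringsAsFactors = FALSE),
  "2" = data.frame(context = c("for the", "one of"), word = c("first", "the"),
                   n = c(9, 7), stringsAsFactors = FALSE),
  "1" = data.frame(context = c("the", "of"), word = c("same", "the"),
                   n = c(20, 15), stringsAsFactors = FALSE)
)

predict_next <- function(phrase, repos) {
  # Normalise the input: lower-case, keep only letters, apostrophes and spaces.
  clean <- gsub("[^a-z' ]", "", tolower(phrase))
  words <- strsplit(trimws(clean), "\\s+")[[1]]
  words <- words[nzchar(words)]

  # Try the longest context first, then back off to shorter ones.
  for (k in 3:1) {
    if (length(words) < k) next
    context <- paste(tail(words, k), collapse = " ")
    repo <- repos[[as.character(k)]]
    hits <- repo[repo$context == context, ]
    if (nrow(hits) > 0) {
      return(hits$word[which.max(hits$n)])  # most frequent continuation
    }
  }
  NA_character_  # nothing matched at any context length
}

predict_next("Thanks for the", repos)   # "follow" (matched on 3 words)
predict_next("waiting for the", repos)  # "first"  (backed off to 2 words)
predict_next("on top of", repos)        # "the"    (backed off to 1 word)
```

When no context matches at any length the sketch returns `NA`; a real application might instead fall back to the most frequent single word.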
Benefits: easy-to-read code; uses “pipes”; fast processing of training data; able to sample up to 25% of original corpus; relatively small output repos
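
As an illustration of the sampling and pipe-based processing listed under Benefits, the training step might look something like the sketch below. It assumes the `dplyr` and `tidytext` packages; the `build_ngrams` helper, the `freq` column name, and the four sample lines are hypothetical stand-ins, not the project's real code or data.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the raw training text, one line per element.
raw_lines <- c(
  "Thanks for the follow",
  "Thanks for the help",
  "See you at the game",
  "Thanks for the follow back"
)

# Sample a fraction of the lines, split them into n-grams, and count them.
build_ngrams <- function(lines, n, prop = 1) {
  tibble(text = lines) %>%
    slice_sample(prop = prop) %>%                          # keep a random share of lines
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>% # lower-cases and strips punctuation
    filter(!is.na(ngram)) %>%                              # lines shorter than n words give NA
    count(ngram, sort = TRUE, name = "freq")               # most frequent n-grams first
}

# All lines, 3-word n-grams: "thanks for the" comes out on top with freq 3.
trigrams <- build_ngrams(raw_lines, n = 3)

# A 25% sample, as in the Benefits list; which line is kept depends on the seed.
set.seed(42)
sampled_bigrams <- build_ngrams(raw_lines, n = 2, prop = 0.25)
```

The resulting tables could then be written to disk (for example with `saveRDS`) to serve as the frequency-sorted repos.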
NLP Prediction Application for Next Word
The next word prediction application provides an easy-to-use user interface to the next word prediction model.
Top Features:
- Text box for user input
- Predicted next word outputs dynamically below user input
- Tabs with plots of the most frequent n-grams in the dataset
- Side panel with user instructions
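
A layout with the features listed above could be put together along these lines. This is only a skeleton under assumptions: the input and output IDs, the `predict_next_stub` placeholder, and the bigram counts in the plot are invented for illustration and do not come from the hosted application.

```r
library(shiny)

# Placeholder predictor; a real app would query the n-gram repos instead.
predict_next_stub <- function(phrase) {
  if (!nzchar(trimws(phrase))) "" else "the"
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    # Side panel with user instructions
    sidebarPanel(
      h4("Instructions"),
      p("Type a phrase in the text box; the predicted next word appears below it.")
    ),
    mainPanel(
      tabsetPanel(
        # Text box with the prediction rendered directly beneath it
        tabPanel("Predict",
                 textInput("phrase", "Enter a phrase:"),
                 textOutput("prediction")),
        # One tab per n-gram plot; only bigrams shown here
        tabPanel("Bigrams", plotOutput("bigram_plot"))
      )
    )
  )
)

server <- function(input, output, session) {
  # Re-runs whenever the text box changes, so the output updates dynamically.
  output$prediction <- renderText(predict_next_stub(input$phrase))

  output$bigram_plot <- renderPlot({
    freq <- c("of the" = 30, "in the" = 25, "to the" = 12)  # made-up counts
    barplot(freq, main = "Most frequent bigrams (toy data)")
  })
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```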
Overall Benefits:
- Fast response
- Method allows for large training sets, leading to improved next-word predictions and user experience
Demo Application:
NLP Shiny App Link