Coursera Data Science Specialization Capstone Project

Elijah Appiah
2022-01-08

The Project

This project involves Natural Language Processing. The core task is to take a user's input phrase (a group of words) and output a predicted next word.

Project deliverables:

Next Word Prediction Model, as basis for an app
Next Word Prediction App hosted at shinyapps.io
This presentation hosted at RPubs

Next Word Prediction Model

The next word prediction model uses the principles of “tidy data” applied to text mining in R. Key model steps:

Input: raw text files for model training
Clean training data; separate into 2-word, 3-word, and 4-word n-grams; save as tibbles
Sort n-gram tibbles by frequency; save as repos
N-grams function: uses a “back-off” type prediction model
- user supplies an input phrase
- model uses the last 3, 2, or 1 words of the phrase to find the most frequent matching 4th, 3rd, or 2nd word in the repos
Output: next word prediction
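
The model steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's R code: the function names (tokenize, build_repos, predict_next), the regex-based cleaning, and the tiny sample corpus are all assumptions made for the example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes.

    Assumption: the project's actual cleaning rules are not shown in the
    slides; this is a simple stand-in.
    """
    return re.findall(r"[a-z']+", text.lower())


def build_repos(lines, sizes=(2, 3, 4)):
    """Build one lookup table per n-gram size.

    Each table maps a context (the first n-1 words) to a list of
    (next_word, count) pairs, most frequent first.
    """
    repos = {}
    for n in sizes:
        counts = Counter()
        for line in lines:
            words = tokenize(line)
            # Slide a window of n words across the line.
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        table = {}
        # most_common() yields n-grams from most to least frequent.
        for gram, freq in counts.most_common():
            table.setdefault(gram[:-1], []).append((gram[-1], freq))
        repos[n] = table
    return repos


def predict_next(phrase, repos):
    """Back off from the longest context to the shortest.

    Tries the last 3 words against the 4-grams, then the last 2 words
    against the 3-grams, then the last word against the 2-grams.
    Returns None when no context matches.
    """
    words = tokenize(phrase)
    for n in sorted(repos, reverse=True):
        k = n - 1  # number of context words for this n-gram size
        if len(words) < k:
            continue
        matches = repos[n].get(tuple(words[-k:]))
        if matches:
            return matches[0][0]
    return None


# Hypothetical toy corpus, for illustration only.
corpus = ["thanks for the help"] * 2 + ["looking for the best"] * 3
repos = build_repos(corpus)

print(predict_next("thanks for the", repos))   # 4-gram match
print(predict_next("waiting for the", repos))  # backs off to the 3-grams
```

In this toy corpus the first call finds a 4-gram match, while the second has no 4-gram match and falls back to the most frequent continuation of "for the".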

Benefits: easy-to-read code; uses “pipes”; fast processing of training data; able to sample up to 25% of the original corpus; relatively small output repos
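
Sampling a fraction of the corpus might look like the sketch below. It is written in Python for illustration and is not the project's code: the function name sample_lines, the fixed seed, and the per-line random draw are assumptions. Only the 25% figure comes from the slide.

```python
import random


def sample_lines(lines, fraction=0.25, seed=42):
    """Keep each line independently with probability `fraction`.

    Assumption: a fixed seed is used so that the sample is reproducible.
    The result holds roughly, not exactly, `fraction` of the input.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


# Hypothetical stand-in for the raw training text.
lines = [f"line {i}" for i in range(10000)]
sampled = sample_lines(lines)
print(len(sampled) / len(lines))
```

Drawing per line avoids loading a separate index of line numbers, at the cost of a sample size that varies slightly around the target fraction.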

Next Word Prediction App

The next word prediction app provides a simple user interface to the next word prediction model.

Key Features:

Text box for user input
Predicted next word updates dynamically below the user input
Tabs with plots of the most frequent n-grams in the data set
Side panel with user instructions

Key Benefits:

Fast response
Method allows for large training sets, leading to better next-word predictions

Shiny App Link

Documentation and Source Code

Tidy Data
http://vita.had.co.nz/papers/tidy-data.html

Text Mining with R: A Tidy Approach
http://tidytextmining.com/index.html

Shiny App
https://mblackmo.shinyapps.io/ngram_match/

Shiny App Source Code repository on GitHub
https://github.com/mark-blackmore/JHU-Data-Science-Capstone/tree/master/ngram_match

Data Science Specialization Capstone repository on GitHub
https://github.com/mark-blackmore/JHU-Data-Science-Capstone