Coursera Data Science Capstone Project

October 2024

Bryan Igboegwu

Introduction

This project uses data science techniques to design a Natural Language Processing (NLP) word prediction algorithm implemented as a Shiny app. The app predicts the next word based on a given phrase.

Word Prediction Shiny App

Link to App: Word Prediction Shiny App

Overview of Data

The data used in this project is derived from the HC Corpora (see References). It consists of:

Blogs
Tweets
News reports

Shiny App - Instructions

Instructions

To use the app:

Enter words separated by a single space.
The most probable next word is automatically displayed on the right side of the screen.

Table Presentation

The input words are matched against the 1-, 2-, 3-, and 4-gram data sets, and the highest-order match is used.
If a match is found, the corresponding n-gram table is displayed.
The number of n-grams in the table is controlled by the slider on the left side of the screen.

Data Preprocessing

Sampling

Each data file was sampled at a rate of 45%.
This rate was a compromise between coverage and keeping each n-gram data set smaller than 5 MB.
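The sampling step can be illustrated with a minimal sketch. The project itself was built with R tooling, so the Python below is only an illustration. The function name `sample_lines`, the fixed seed, and the choice of independent per-line sampling are assumptions, since the source states only the 45% rate.

```python
import random

def sample_lines(lines, rate=0.45, seed=42):
    """Keep each line independently with probability `rate`.

    Assumption: per-line Bernoulli sampling. The source gives only the
    45% rate, not the sampling method. The seed makes the sample repeatable.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]

# Hypothetical stand-in for one of the text files (blogs, tweets, or news).
lines = [f"line {i}" for i in range(10_000)]
sampled = sample_lines(lines)
print(f"kept {len(sampled) / len(lines):.1%} of the lines")
```

Lowering `rate` shrinks the resulting n-gram tables at the cost of coverage, which is the trade-off described above.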

Cleaning

The corpus was transformed into four document-feature matrices (DFMs) using the quanteda library.
Text was converted to lowercase, and numbers, Twitter symbols, and separators were removed.
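As a rough illustration of these cleaning steps, here is a small Python sketch. It does not reproduce quanteda's behavior. The function name `clean_text` is hypothetical, and reading "Twitter symbols" as the `@` and `#` characters is an assumption.

```python
import re

def clean_text(text):
    """Lowercase the text and strip numbers, Twitter symbols, and extra separators."""
    text = text.lower()
    text = re.sub(r"[@#]", "", text)   # assumed meaning of "Twitter symbols"
    text = re.sub(r"\d+", " ", text)   # remove numbers
    text = re.sub(r"\s+", " ", text)   # collapse whitespace separators
    return text.strip()

print(clean_text("Hello  @Bob I have 2 cats\t#Pets"))
```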

N-Gram Data Sets

Each DFM was converted into a data frame that pairs each leading n-gram with the 1-gram that follows it (the predicted word).
The data sets were then reduced to meet the 5 MB size limit for performance.
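The structure of such a table can be sketched as follows. This is an illustrative Python version, not the project's R data frames. The name `build_ngram_table` and the use of a dictionary of counters are assumptions made for the example.

```python
from collections import Counter

def build_ngram_table(tokens, n):
    """Map each leading n-gram (a tuple of n words) to counts of the word that follows it."""
    table = {}
    for i in range(len(tokens) - n):
        prefix = tuple(tokens[i:i + n])
        next_word = tokens[i + n]
        table.setdefault(prefix, Counter())[next_word] += 1
    return table

# Toy corpus, for illustration only.
tokens = "the cat sat on the mat and the cat ran".split()
bigram_table = build_ngram_table(tokens, 2)
unigram_table = build_ngram_table(tokens, 1)
print(unigram_table[("the",)].most_common(1))
```

One way to reduce such a table toward a size limit would be to drop prefixes whose followers were seen only once, though the source does not say which reduction was used.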

Description of Algorithm

Tail Function

The Shiny app employs a back-off model that queries the four n-gram data sets.
As users enter words, the app computes the 1-, 2-, 3-, and 4-grams of the input.
If more than four words are entered, the tail functions isolate the trailing words so that at most a 4-gram is used.
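The tail step described above might look like the following Python sketch. The helper name `candidate_ngrams` and its `max_order` parameter are hypothetical, and the app's own R tail functions may differ in detail.

```python
def candidate_ngrams(phrase, max_order=4):
    """Return the trailing n-grams of `phrase`, from highest order down to 1."""
    words = phrase.split()
    highest = min(max_order, len(words))
    return [tuple(words[-n:]) for n in range(highest, 0, -1)]

for gram in candidate_ngrams("i would like to go to"):
    print(gram)
```

Only the last four words contribute here, so longer inputs are handled the same way as a four-word phrase.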

N-gram Matching Model

The model matches the highest-order n-gram against its corresponding data set to predict the next word.
Higher-order n-grams are biased toward the end of the sentence for greater prediction diversity.
Choosing the most probable contextual match helps preserve grammar and slang usage.
If there are multiple matches, the highest-order n-gram is selected.
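A back-off lookup of this kind can be sketched in a few lines of Python. This is a simplified illustration and not the app's code. The function `predict_next`, the `tables` layout (n-gram order mapped to prefix tables), and the toy counts are all assumptions.

```python
from collections import Counter

def predict_next(phrase, tables, max_order=4):
    """Try the longest trailing n-gram first and back off to shorter ones.

    `tables` maps an order n to {prefix tuple: Counter of following words}.
    Returns the most frequent follower of the first prefix found, or None.
    """
    words = phrase.lower().split()
    for n in range(min(max_order, len(words)), 0, -1):
        prefix = tuple(words[-n:])
        followers = tables.get(n, {}).get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return None

# Toy tables, for illustration only.
tables = {
    1: {("the",): Counter({"cat": 2, "mat": 1})},
    2: {("on", "the"): Counter({"mat": 3})},
}
print(predict_next("sat on the", tables))  # matched at the 2-gram level
print(predict_next("see the", tables))     # backs off to the 1-gram level
```

Returning `None` when nothing matches leaves the fallback (for example, showing no prediction) to the caller.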

Conclusion

The Shiny app demonstrates how an n-gram back-off model, built on sampled and cleaned text data, can predict the next word from user input.

References

HC Corpora: http://www.corpora.heliohost.org/aboutcorpus.html