2021 Word Predictor

15 February 2021

1.0 What is the Word Predictor?

The 2021 Word Predictor:

is a prototype NLP app that predicts the next word from the preceding input.
uses the en_US capstone data set, a corpus of text from blogs, news and Twitter.
lets the user type a phrase and returns up to 3 candidates for the next word.
was built in R with the sbo (Stupid Back-Off) package, which implements N-gram prediction models.
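
How an N-gram model with back-off picks its candidates can be sketched as follows. This is a minimal Python illustration of the Stupid Back-Off idea, not the app's R code and not the sbo package API: the function names `train` and `predict`, the toy corpus, and the penalty value 0.4 are all assumptions made for the example.

```python
from collections import Counter, defaultdict

LAMBDA = 0.4  # back-off penalty (assumed; 0.4 is a commonly quoted default)


def train(sentences, max_order=3):
    """Count which words follow every context of up to max_order - 1 words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_order):
                if i - k < 0:
                    break
                context = tuple(tokens[i - k:i])  # the k words before position i
                counts[context][word] += 1
    return counts


def predict(counts, text, max_order=3, top=3):
    """Return up to `top` next-word candidates ranked by back-off score."""
    tokens = text.lower().split()
    scores = {}
    penalty = 1.0
    # Start from the longest usable context and back off to shorter ones.
    for k in range(min(max_order - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        followers = counts.get(context)
        if followers:
            total = sum(followers.values())
            for word, n in followers.items():
                # Keep the score from the longest context that has seen the word.
                scores.setdefault(word, penalty * n / total)
        penalty *= LAMBDA  # each back-off step is penalised
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top]]


# Toy corpus standing in for the blogs, news and Twitter text.
corpus = ["i love new york", "i love new shoes", "we love new york"]
model = train(corpus, max_order=3)
print(predict(model, "they love new"))  # ['york', 'shoes', 'love']
```

The first two candidates come from the trigram context "love new"; the third is only reached after backing off to single-word frequencies, which is why its score carries the penalty twice.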

Link: Capstone Dataset

2.0 App Structure

The app was built with the following scripts:

Data loading, preparation and cleaning (data_prepare_predict_01.R)
Model creation (sbo.prediction_01.R)
Testing and evaluation (sbo.prediction_01.R)
Shiny App creation and packaging (/Shiny/)

Link: Code @ GitHub

3.0 App Availability

Prototype Link: Shiny App

Screenshot

4.0 Discussion

4.1 Key parameters

Due to limited computing resources:

only 10% of the en_US corpus was used.
the sample was split 80-20 into training and validation data sets.
the prediction model used N-grams up to order 6.
the model returned at most 3 candidate words.
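
The sampling and splitting step can be sketched as below. This is a Python illustration rather than the app's R script, and it assumes the 10% was drawn as a simple random sample of lines, which the slides do not state. The function name `sample_and_split`, the seed, and the placeholder lines are hypothetical.

```python
import random


def sample_and_split(lines, sample_frac=0.10, train_frac=0.80, seed=2021):
    """Draw a random sample of lines, then split it into training and validation sets."""
    rng = random.Random(seed)  # fixed seed (assumed) so the split is reproducible
    n_sample = round(len(lines) * sample_frac)
    sample = rng.sample(lines, n_sample)  # sampled without replacement, in random order
    n_train = round(len(sample) * train_frac)
    return sample[:n_train], sample[n_train:]


# Placeholder lines standing in for the en_US corpus.
lines = [f"line {i}" for i in range(1000)]
train_set, valid_set = sample_and_split(lines)
print(len(train_set), len(valid_set))  # 80 20
```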

Other Links: Week 2 Milestone Report

4.2 Accuracy & Uncertainty

accuracy	uncertainty
0.23998	0.00174
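
One way such a pair of figures can be computed is sketched below in Python. Two assumptions are made that the slides do not confirm: a prediction counts as correct when the true next word is among the returned candidates, and the uncertainty is the binomial standard error sqrt(p(1 - p) / n). The function name and the example data are hypothetical.

```python
import math


def accuracy_with_uncertainty(correct_flags):
    """Return (accuracy, binomial standard error) for a list of True/False outcomes."""
    n = len(correct_flags)
    accuracy = sum(correct_flags) / n
    # Standard error of a proportion estimated from n independent trials (assumed definition).
    uncertainty = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, uncertainty


# Made-up outcomes: 24 correct predictions out of 100 validation cases.
flags = [True] * 24 + [False] * 76
acc, unc = accuracy_with_uncertainty(flags)
print(round(acc, 3), round(unc, 3))  # 0.24 0.043
```

Under this definition the uncertainty shrinks with the square root of the number of validation cases, so a larger validation set narrows the error bar without changing the expected accuracy.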

4.3 Improvements required

To improve the accuracy of the app:

Increase computing resources.
Use the entire en_US corpus.
Use a higher-order N-gram model.