Coursera Data Science Capstone Project: Word Prediction

cfw
11 Aug 2017

This presentation describes an application that predicts the word following a word or phrase. Text analysis and natural language processing were used to build the predictive model. The data come from a corpus called HC Corpora (see https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html).

The algorithm predicts the next word in a user-entered text string based on an n-gram model.

Unigrams, bigrams, and trigrams were collected from a subset of cleaned data from news, blogs, and Twitter feeds.
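As a rough illustration of the collection step, the sketch below counts unigrams, bigrams, and trigrams from a few lines of text. It is written in Python for brevity and is not the project's own code. The cleaning rule (lowercasing and keeping only letters and apostrophes) and the names `tokenize` and `count_ngrams` are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Assumed cleaning rule: lowercase, keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def count_ngrams(lines, n):
    """Count n-grams of order n; n-grams never span two input lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy stand-in for the cleaned news, blog, and Twitter sample.
sample = ["I love data science", "I love text mining."]
unigrams = count_ngrams(sample, 1)
bigrams = count_ngrams(sample, 2)
trigrams = count_ngrams(sample, 3)
```

Each table maps a tuple of words to its frequency, so `bigrams[("i", "love")]` is the number of times "i love" appeared in the sample.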

The predicted word is the candidate that scores highest under a linear interpolation of trigram, bigram, and unigram probabilities.
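A minimal sketch of linear interpolation follows, again in Python rather than the project's own code. Each candidate word w after the context (u, v) is scored as l3 * P(w | u, v) + l2 * P(w | v) + l1 * P(w), using maximum-likelihood estimates from the counts. The weights (0.6, 0.3, 0.1) and the names `build_counts` and `predict_next` are hypothetical; the weights actually used by the application are not stated here.

```python
from collections import Counter

def build_counts(tokens):
    """Build unigram, bigram, and trigram counts from one token sequence."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def predict_next(context, uni, bi, tri, lambdas=(0.6, 0.3, 0.1)):
    """Return the word with the highest interpolated score.

    lambdas = (trigram, bigram, unigram) weights; illustrative values only.
    """
    l3, l2, l1 = lambdas
    total = sum(uni.values())
    words = context.lower().split()
    v = words[-1] if words else None          # last word typed
    u = words[-2] if len(words) > 1 else None # word before it
    scores = {}
    for w in uni:
        p1 = uni[w] / total
        # Unseen contexts contribute zero, so lower orders take over.
        p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
        p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
        scores[w] = l3 * p3 + l2 * p2 + l1 * p1
    return max(scores, key=scores.get)

tokens = "the cat sat on the mat the cat ate the fish".split()
uni, bi, tri = build_counts(tokens)
print(predict_next("on the", uni, bi, tri))  # prints "mat"
```

With this toy corpus, the trigram evidence for "on the mat" outweighs the more frequent bigram "the cat". When the last word is unknown, only the unigram term is non-zero, so the most frequent word is returned.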

A Shiny application was developed to provide an interface that can be accessed by others via the internet.

The application can be found here.

As shown in the diagram, the user begins by typing text without punctuation in the input box.

After a short delay, the predicted next word in the phrase is shown in the output box below.

[Diagram: the application's input box and output box]