Coursera Data Science Capstone Final Presentation

Jun Zhang
May 1, 2020

Objective

This is the presentation for the Coursera Data Science Capstone final project. The goal of the project is to build a prediction algorithm, delivered as a Shiny app, that takes a word or phrase typed into a text box and outputs a prediction of the next word.

Here is the link to my Shiny App.

Background

The data come from a corpus called HC Corpora and are available for download through the course website. The files are in .txt format and were collected from blogs, news, and Twitter.

Because all three data sets are large and processing them takes a long time, I could only train on a random sample of the data (about 333,670 lines of text).
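
As a rough illustration of the random sampling step, here is a minimal Python sketch (the project itself is built in R, and this is not the project's code). The function name sample_lines, the sampling fraction, and the seed are all assumptions made for the example; the source only states that a random subset was used.

```python
import random

def sample_lines(lines, fraction, seed=0):
    """Keep each line independently with probability `fraction`.

    `fraction` and `seed` are illustrative parameters, not values
    taken from the project.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [line for line in lines if rng.random() < fraction]

# Toy stand-in for the blogs, news, and Twitter text files.
corpus = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus, fraction=0.1, seed=42)
```

Fixing the seed means the same subset is drawn on every run, which makes the trained n-gram tables repeatable.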

Algorithm Building

First, we clean the data by removing all non-English characters, converting all characters to lowercase, and removing punctuation, links, extra whitespace, and numbers.
Then we tokenize the text into n-grams, specifically 4-grams, 3-grams, and 2-grams. (An n-gram is a contiguous sequence of n items from a given sample of text or speech.)
Once a user enters a word or a phrase, we first look for the most frequent matching 4-gram. If no 4-gram matches, we try the 3-gram data. If no 3-gram matches, we use the 2-gram data. Finally, if no 2-gram matches either, we predict the most frequent word, “the.”
Because the amount of training data is limited by the performance of my computer, the input often matches no n-gram at all. In those cases the predicted next word is the fallback “the.”
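
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R code behind the Shiny app: the function names (clean, build_tables, predict), the regular expressions used for cleaning, and the three-line toy corpus are all made up for the example.

```python
import re
from collections import Counter, defaultdict

def clean(text):
    """Lowercase, drop links, then drop everything except a-z and spaces."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # links
    text = re.sub(r"[^a-z\s]", " ", text)  # non-English chars, punctuation, numbers
    return text.split()  # split() also collapses repeated whitespace

def build_tables(lines, max_n=4):
    """For n = 2..max_n, map each (n-1)-word prefix to counts of the next word."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in tables:
            for i in range(len(tokens) - n + 1):
                prefix = tuple(tokens[i:i + n - 1])
                tables[n][prefix][tokens[i + n - 1]] += 1
    return tables

def predict(tables, phrase, default="the"):
    """Try the 4-gram table first, then 3-gram, then 2-gram, then the default."""
    tokens = clean(phrase)
    for n in sorted(tables, reverse=True):
        if len(tokens) < n - 1:
            continue  # input too short to form this prefix
        prefix = tuple(tokens[-(n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return default

# Hypothetical toy corpus, only to exercise the functions.
toy_corpus = [
    "Thanks for the follow!",
    "Thanks for the follow, see you at 5",
    "thanks for the help",
]
tables = build_tables(toy_corpus)
print(predict(tables, "Thanks for the"))  # matched in the 4-gram table
print(predict(tables, "zzz qqq"))         # no match anywhere, falls back to "the"
```

Storing each table as a mapping from prefix to next-word counts makes a lookup a single dictionary access, so the back-off from 4-grams down to 2-grams costs at most three lookups per prediction.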

Example

Below is an example of how this works. If you enter “Thanks for the,” the predicted next word is “follow.”

Input Example