Capstone Project

Coursera Data Science Specialisation

Rony Morales
4th Nov 2020

Introduction

The goal of this project was to build a front end that can be used by a wide audience to demonstrate how the predictive algorithm works.

For this project, a back-end N-gram language model was built from a corpus of text provided by Coursera. Once developed, the algorithm was incorporated into a Shiny app that can be tested by potential users.
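As a rough illustration only, the sketch below shows what a minimal Shiny front end of this kind might look like; predict_next_words() is a hypothetical placeholder for the back-end algorithm described in the next sections, not the app's actual code.

```r
library(shiny)

# Hypothetical stand-in for the back-end n-gram predictor described below.
predict_next_words <- function(phrase) c("the", "a", "to")

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  verbatimTextOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderPrint({
    req(input$phrase)
    predict_next_words(input$phrase)  # top 3 candidate next words
  })
}

shinyApp(ui = ui, server = server)
```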

Description

The base corpus (text) files for the language model were provided by SwiftKey and consisted of three files: blogs (800,000 lines), news (1,000,000 lines) and twitter (2,000,000 lines).

The training set was created by reading about 500,000 lines of blogs, 600,000 lines of news and 1,000,000 lines of twitter from the above corpus. Numbers were removed from the text, as they do not contribute to prediction capability. Other steps, such as converting to lower case and removing punctuation, were handled within the tokenization process carried out by the tidytext framework and its functions. The text also contained many hashtags and hyperlinks (http://, www., etc.). During the initial cleaning process, eliminating these from such a large corpus took a long time (over 2 minutes) for such a small portion to be removed, so it was decided to keep them, without really impacting the prediction capability.
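A minimal sketch of this sampling and tokenization step is shown below, assuming the standard SwiftKey en_US file names and layout; unnest_tokens() from tidytext performs the lower-casing and punctuation removal mentioned above.

```r
library(dplyr)
library(tidytext)

# Read the sampled lines from each corpus file (paths assume the
# standard SwiftKey en_US layout).
blogs   <- readLines("final/en_US/en_US.blogs.txt",   n = 500000,  skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    n = 600000,  skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", n = 1000000, skipNul = TRUE)

train <- tibble(text = c(blogs, news, twitter)) %>%
  mutate(text = gsub("[0-9]+", "", text))  # numbers add no predictive value

# unnest_tokens() lower-cases and strips punctuation during tokenization.
unigrams <- train %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
```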

Prediction Algorithm

To build the text prediction algorithm, a 4-gram language model was used. This means that the next word is predicted based on the last three words. The following steps were followed:

- Tokenize into 1-, 2-, 3- and 4-grams using the tidytext package to build the n-gram tables as a one-time activity. Tokens with fewer than 3 counts were removed from the n-gram tables to reduce table size and processing time.
- Build the text prediction algorithm based on the Katz back-off model (see the sketch after this list).
- Look for observed 4-grams, 3-grams, 2-grams and 1-grams (with the highest counts) as prediction candidates.
- At each level, apply a discount to extract probability mass from observed n-grams to accommodate unobserved n-grams.
- Calculate the probabilities of the observed n-grams.
- For each level of unobserved n-grams, apportion the extracted probability mass in the ratio of the probabilities of the observed n-grams. As the model backs off to lower-order n-grams, the allocated probability decreases.
- Consolidate the final table, consisting of observed 4-grams (based on matching 4-gram hits) and unobserved 4-grams (based on matching 3-, 2- and 1-gram hits).
- List the top 3 entries by probability; these are the top 3 predictions.
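A condensed sketch of these steps is given below. The pruning threshold, discount value and helper names are illustrative assumptions, and the lookup is a simplified back-off: it finds the longest observed prefix and discounts observed counts, but omits the full Katz redistribution of discounted probability mass across levels.

```r
library(dplyr)
library(tidytext)

# One-time build of an n-gram count table, pruning rare tokens.
build_ngrams <- function(train, n) {
  train %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    count(ngram, name = "freq", sort = TRUE) %>%
    filter(freq >= 3)  # drop low-count n-grams to shrink the table
}

# tables[[k]] holds the k-gram counts, e.g.:
# tables <- lapply(1:4, build_ngrams, train = train)

# Simplified back-off lookup: try the longest matching prefix first,
# applying a fixed discount (an illustrative value) to observed counts.
predict_next <- function(phrase, tables, discount = 0.5) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)
  for (k in rev(seq_len(length(words)))) {
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- tables[[k + 1]] %>%
      filter(startsWith(ngram, paste0(prefix, " "))) %>%
      mutate(prob = (freq - discount) / sum(freq),
             next_word = sub(".*\\s", "", ngram))
    if (nrow(hits) > 0)
      return(head(arrange(hits, desc(prob))$next_word, 3))
  }
  head(tables[[1]]$ngram, 3)  # fall back to the most frequent unigrams
}
```

With the tables built once and saved, a call such as predict_next("thanks for the", tables) would return the top three candidates of the kind surfaced in the app.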
