Coursera Data Science Capstone Project

Subramanian Shankar
Mar 20 2021

App

Word Prediction Shiny App: https://ssubramanian90.shinyapps.io/shiny/

Summary

This project uses an NLP word prediction algorithm to predict the next word, given a phrase.

The data used in this project consists of blogs, tweets and news reports from the HC Corpora collection, which is described at http://www.corpora.heliohost.org/aboutcorpus.html

Instructions

Please enter words separated by a single space. The most probable next word is displayed automatically as you type.

The input words are matched against the 1-, 2- and 3-gram data sets, and the highest-order match is used. If a match is found, the corresponding table of n-grams is displayed. The number of n-grams shown in the table is controlled by the slider on the left side of the screen.

Data Preprocessing

Sampling and cleaning: Each data file was sampled at a rate of 50%. The samples were combined into a single file, and the resulting corpus was transformed into four document-feature matrices (DFMs) using the quanteda library. The transformation converted all text to lower case and removed numbers, Twitter symbols and separators.
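
The sampling and cleaning steps can be sketched roughly as follows. This is an illustrative Python sketch, not the R/quanteda code used in the app; the function names sample_lines and clean, the fixed seed, and the exact regular expressions are assumptions made for the example.

```python
import random
import re


def sample_lines(lines, rate=0.5, seed=42):
    """Keep each line with probability `rate` (hypothetical helper).

    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]


def clean(text):
    """Lower-case the text, drop digits and Twitter symbols, squeeze whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)   # remove numbers
    text = re.sub(r"[@#]", "", text)   # remove the Twitter symbols @ and #
    return " ".join(text.split())      # collapse runs of separators


if __name__ == "__main__":
    raw = ["Meet @Bob at 5pm  #Fun", "Breaking NEWS: 3 new blogs", "Hello   world"]
    sampled = sample_lines(raw, rate=0.5)
    print([clean(line) for line in sampled])
```

Sampling line by line keeps memory use low on large files, at the cost of a sample size that is only approximately 50% of the input.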

N-Gram Data Sets: Each DFM was subsequently converted into a data frame consisting of a leading n-gram and the following 1-gram.
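
To illustrate the shape of these data sets, the sketch below splits every n-gram into a leading part and the word that follows it, and counts how often each pair occurs. It is written in Python for illustration only and uses a nested dictionary in place of a data frame; the name build_ngram_table is hypothetical and is not taken from the app.

```python
from collections import Counter, defaultdict


def build_ngram_table(sentences, n):
    """Map each leading (n-1)-word phrase to a Counter of following words.

    `sentences` is an iterable of already-cleaned strings; `n` is the total
    n-gram length (leading words plus the following word).
    """
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            lead = " ".join(tokens[i:i + n - 1])   # leading n-gram
            following = tokens[i + n - 1]          # following 1-gram
            table[lead][following] += 1
    return table


if __name__ == "__main__":
    corpus = ["i love data science", "i love data"]
    trigrams = build_ngram_table(corpus, 3)
    for lead, followers in trigrams.items():
        print(lead, dict(followers))
```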

Description of Algorithm

The model matches the highest-order n-gram against its corresponding data set to return the following 1-gram (the next word). The high-order n-grams are biased toward the end of the sentence, which is where the prediction is most diverse. The highest-order n-gram also offers the highest probability of a contextual match, which is critical for grammar and slang. If n-grams of several orders match, the highest-order one is selected.
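
The matching rule described above can be sketched as a simple back-off lookup. The Python code below is an illustration under stated assumptions, not the app's implementation: the TABLES data, the counts in it, the predict function and the top_k parameter (standing in for the slider) are all invented for the example.

```python
# Hypothetical lookup tables: leading n-gram length -> {leading n-gram -> {next word: count}}
TABLES = {
    3: {"thanks for the": {"follow": 5, "support": 3}},
    2: {"for the": {"first": 7, "follow": 5}},
    1: {"the": {"same": 9, "first": 7}},
}


def predict(phrase, tables, top_k=3):
    """Return up to `top_k` next words from the highest-order matching n-gram."""
    tokens = phrase.lower().split()
    for order in sorted(tables, reverse=True):    # try 3-gram, then 2-gram, then 1-gram
        if order > len(tokens):
            continue                              # not enough input words for this order
        lead = " ".join(tokens[-order:])          # take the words at the end of the phrase
        followers = tables[order].get(lead)
        if followers:
            # Rank by count (descending), breaking ties alphabetically.
            ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
            return [word for word, _ in ranked[:top_k]]
    return []                                     # no match at any order


if __name__ == "__main__":
    print(predict("thanks for the", TABLES))      # matches the 3-gram table
    print(predict("waiting for the", TABLES))     # falls back to the 2-gram table
```

The loop stops at the first order that matches, so a lower-order table is consulted only when every higher-order lookup fails.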

Thank You