June 09 2020

Introduction

- The goal of the application is to implement a predictive model that suggests one or more words consistent with the sentence its user has entered. After cleaning, sampling and subsetting the data, we gather it into R data objects.

- Using techniques from Text Mining (tm) and NLP, we create sets of word combinations (n-grams). These are the main supports of the Katz Back-off algorithm that predicts the next word. Adaptations and heuristics have been developed specifically to improve this Shiny application.

Getting & Cleaning the Data

To build the word prediction algorithm, we first process and clean the data according to the steps below:

- From the three sources (blogs, Twitter and news), a subset of the original data is sampled and merged into one corpus.
- The data is cleaned by converting it to lower case, stripping extra white space, and deleting punctuation and numbers.
- The respective n-grams are then created (quadgrams, trigrams and bigrams).
- Term count tables are extracted from the n-grams and sorted by frequency in descending order.
- The n-gram objects are saved as compressed R data files (.RData), as sketched below.
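The sketch below illustrates these steps with base R string handling rather than the tm package, for self-containment. The file names, the 5% sample rate, and the output file name are assumptions for illustration, not the exact values used in the project.

    # Minimal sketch of the sampling / cleaning / n-gram step (assumed file names and sample rate)
    set.seed(1234)

    sample_lines <- function(path, p = 0.05) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, floor(length(lines) * p))
    }

    corpus <- c(sample_lines("en_US.blogs.txt"),
                sample_lines("en_US.twitter.txt"),
                sample_lines("en_US.news.txt"))

    # Cleaning: lower case, drop numbers and punctuation, squeeze white space
    clean <- tolower(corpus)
    clean <- gsub("[0-9]+", " ", clean)
    clean <- gsub("[[:punct:]]+", " ", clean)
    clean <- gsub("\\s+", " ", trimws(clean))

    # Tokenize and build n-gram frequency tables sorted by descending count
    tokens <- strsplit(clean, " ", fixed = TRUE)

    ngram_freq <- function(tokens, n) {
      grams <- unlist(lapply(tokens, function(w) {
        if (length(w) < n) return(character(0))
        sapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "))
      }))
      sort(table(grams), decreasing = TRUE)
    }

    bigram   <- ngram_freq(tokens, 2)
    trigram  <- ngram_freq(tokens, 3)
    quadgram <- ngram_freq(tokens, 4)

    # Save the frequency tables as a compressed .RData file
    save(bigram, trigram, quadgram, file = "ngrams.RData")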

Shiny Application

Word Prediction Model

The next-word prediction model is based on the Katz Back-off algorithm. Here is an explanation of the prediction flow:

- The compressed data sets containing the n-grams, sorted by descending frequency, are loaded first.
- The words entered by the user are cleaned in the same way as the training data before the prediction is made.
- For the prediction of the next word, the quadgram table is used first (the first three words of the quadgram are matched against the last three words of the user-supplied sentence).
- If no matching quadgram is found, the model backs off to the trigram table (the first two words of the trigram are matched against the last two words of the sentence).
- If no matching trigram is found, it backs off to the bigram table (the first word of the bigram is matched against the last word of the sentence).
- If no matching bigram is found, the model falls back to the most frequent word overall: "the".
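A minimal sketch of this back-off lookup is shown below. It assumes the frequency tables saved earlier (quadgram, trigram, bigram), whose names are space-separated n-grams already sorted by descending frequency; the function and file names are illustrative, not the app's actual code.

    # Minimal sketch of the back-off lookup described above (assumed object and file names)
    load("ngrams.RData")

    clean_input <- function(text) {
      text <- tolower(text)
      text <- gsub("[0-9]+", " ", text)
      text <- gsub("[[:punct:]]+", " ", text)
      strsplit(trimws(gsub("\\s+", " ", text)), " ")[[1]]
    }

    # Return the last word of the most frequent n-gram whose prefix matches
    lookup <- function(freq_table, prefix) {
      hits <- grep(paste0("^", prefix, " "), names(freq_table), value = TRUE)
      if (length(hits) == 0) return(NULL)
      tail(strsplit(hits[1], " ")[[1]], 1)   # tables are already sorted by frequency
    }

    predict_next <- function(sentence) {
      w <- clean_input(sentence)
      n <- length(w)
      # Back off from quadgram to trigram to bigram, then to the default "the"
      if (n >= 3) {
        hit <- lookup(quadgram, paste(tail(w, 3), collapse = " "))
        if (!is.null(hit)) return(hit)
      }
      if (n >= 2) {
        hit <- lookup(trigram, paste(tail(w, 2), collapse = " "))
        if (!is.null(hit)) return(hit)
      }
      if (n >= 1) {
        hit <- lookup(bigram, tail(w, 1))
        if (!is.null(hit)) return(hit)
      }
      "the"
    }

    predict_next("I would like a cup of")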

Thank you for reviewing my capstone project!