April 4th 2020

Introduction

This R Markdown presentation pitches a Shiny app that you can try at https://jordicabral.shinyapps.io/Next_Word_Prediction/

The presentation is part of the final Capstone project of the Data Science Specialization offered by Johns Hopkins University on Coursera.

The purpose of this application is to predict the next word: the user enters a sentence without its last word, and the application suggests that word. The prediction uses an n-gram backoff model trained on a data set drawn from Twitter, blogs, and news sources.

Methodology and approach

The core of this application is text analysis and natural language processing (NLP): we analyze a large corpus of text documents to discover the structure of the data and how words are put together, then build a predictive text model and sample from it.

We used tokenization as the first NLP step: identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it.
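
The tokenization step can be sketched as follows. This is a minimal illustration in Python, not the app's actual R code; the names `TOKEN_PATTERN`, `tokenize_text`, and `tokenize_file` are hypothetical, and the pattern assumes lower-cased English text in plain ASCII.

```python
import re
from pathlib import Path

# Assumed token types: words (with optional internal apostrophes),
# numbers (with an optional decimal part), and single punctuation marks.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\d+(?:\.\d+)?|[^\w\s]")


def tokenize_text(text):
    """Lower-case the text and split it into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def tokenize_file(path):
    """Read a UTF-8 text file and return its tokenized version."""
    return tokenize_text(Path(path).read_text(encoding="utf-8"))
```

Whether punctuation tokens are kept or dropped before counting n-grams is a design choice; this sketch keeps them so that the caller can decide.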

To clean the data set, we applied profanity filtering, removing profanity and other words that the app should not predict.
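
One simple way to filter is to drop every token that appears in a blocklist. The sketch below is in Python for illustration only; the function name `filter_tokens` and the placeholder blocklist are assumptions, and the app's real word list and filtering rules are not shown in this presentation.

```python
def filter_tokens(tokens, blocklist):
    """Return the tokens that are not in the blocklist (case-insensitive)."""
    blocked = {word.lower() for word in blocklist}
    return [token for token in tokens if token.lower() not in blocked]


# Placeholder blocklist for illustration; a real one would be loaded from a file.
example_blocklist = ["darn", "heck"]
cleaned = filter_tokens(["this", "is", "Darn", "good"], example_blocklist)
```

An alternative design is to discard the whole sentence that contains a blocked word, which avoids learning n-grams that span the gap left by the removed word.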

Shiny App

To use the application, type a sentence without its last word into the input box and click Predict; the app returns the most probable next word.

To predict the next word, we built a basic n-gram model that predicts the next word from the previous 1, 2, or 3 words.
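
The backoff idea can be sketched as follows: count which word follows each context of 1, 2, or 3 words, then look up the longest context first and fall back to shorter ones when it was never seen. This is a simplified Python illustration under our own assumptions, not the app's R code; the names `train_ngrams` and `predict_next` are hypothetical, and it uses raw counts without the discounting or weighting that a production backoff scheme might apply.

```python
from collections import Counter, defaultdict


def train_ngrams(sentences, max_context=3):
    """Count which word follows each context of 1..max_context words."""
    model = defaultdict(Counter)
    for tokens in sentences:
        for i in range(1, len(tokens)):
            for n in range(1, max_context + 1):
                if i - n < 0:
                    break  # not enough preceding words for this context length
                context = tuple(tokens[i - n:i])
                model[context][tokens[i]] += 1
    return model


def predict_next(model, words, max_context=3):
    """Try the longest available context first, then back off to shorter ones."""
    for n in range(min(max_context, len(words)), 0, -1):
        followers = model.get(tuple(words[-n:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None  # no context of any length was seen in training


# Tiny made-up corpus, already tokenized, for illustration only.
corpus = [
    ["i", "want", "to", "eat"],
    ["i", "want", "to", "eat"],
    ["you", "want", "to", "sleep"],
    ["going", "to", "sleep"],
    ["time", "to", "sleep"],
]
model = train_ngrams(corpus)
```

With this toy corpus, `predict_next(model, ["they", "want", "to"])` finds no match for the three-word context, backs off to "want to", and returns "eat"; `predict_next(model, ["need", "to"])` backs off to the single word "to" and returns "sleep".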

Additional information and Links: