Alejandro Cadavid Romero
The application presented is the capstone of the Coursera Data Science Specialization. The capstone is intended to consolidate all the knowledge acquired throughout the specialization in a Natural Language Processing project. Special thanks to the facilitators of the project: Coursera, the Johns Hopkins Bloomberg School of Public Health professors, and SwiftKey.
Given the foundations of a data science project workflow laid by the whole specialization (from the tools and programming expertise, to getting the data, to modeling and building data products), the goal is to build an application with an embedded Natural Language Processing model to predict words given a text context.
Use the technical and non-technical knowledge given in the specialization to build a Shiny app with an embedded NLP model.
Use different language models to build an app that delivers speed and accuracy in predicting the next word given a text context.
The data was archived by heliohost.org, retrieved via the Wayback Machine, and can be downloaded from this link. About corpus
The main corpus was cleaned and processed into n-gram tables (unigrams, bigrams and trigrams) for the final model. The model used was Katz back-off, which relies on absolute discounting and back-off to estimate the probabilities of words in an unseen context. \[ P(w_1^{n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1}) \quad \text{(general n-gram approximation)} \] \[ P_{backoff}(w_n \mid w_{n-N+1}^{n-1}) = \begin{cases} P^{discounted}(w_n \mid w_{n-N+1}^{n-1}), & \text{if } C(w_{n-N+1}^{n}) > 0 \\ \alpha(w_{n-N+1}^{n-1})\, P_{backoff}(w_n \mid w_{n-N+2}^{n-1}), & \text{otherwise} \end{cases} \]
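As a rough illustration of how such n-gram tables can be queried, the sketch below shows a simplified back-off lookup in R (it omits the discount and back-off weights of the full Katz estimator). The table names `trigrams`, `bigrams`, `unigrams` and their columns `prefix`, `word`, `prob` are assumptions for the example, not the actual objects used in the app.

```r
# Minimal sketch of a back-off style next-word lookup over pre-computed
# n-gram tables. Assumed (hypothetical) data.frames: `trigrams`, `bigrams`,
# `unigrams`, each with columns `prefix`, `word` and `prob`.
library(dplyr)

predict_next_word <- function(text, trigrams, bigrams, unigrams, n = 5) {
  tokens <- tolower(unlist(strsplit(trimws(text), "\\s+")))
  k <- length(tokens)

  # Try the trigram table first, using the last two words as the context
  if (k >= 2) {
    ctx <- paste(tokens[(k - 1):k], collapse = " ")
    hits <- trigrams %>% filter(prefix == ctx) %>% arrange(desc(prob))
    if (nrow(hits) > 0) return(head(hits[, c("word", "prob")], n))
  }

  # Back off to the bigram table, using only the last word
  if (k >= 1) {
    hits <- bigrams %>% filter(prefix == tokens[k]) %>% arrange(desc(prob))
    if (nrow(hits) > 0) return(head(hits[, c("word", "prob")], n))
  }

  # Final back-off: the most probable unigrams
  head(unigrams %>% arrange(desc(prob)) %>% select(word, prob), n)
}
```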
The app embeds the previous model to predict the next word of the entered text. To use it, the user enters text in the input box on the left of the UI.
Next, the user clicks the "Predict next words" button, and the app returns a table with the 5 most likely words to continue the text (with their respective probabilities), along with a plot of those probabilities.
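For readers unfamiliar with Shiny, a minimal sketch of this kind of interface is shown below: a text box, a predict button, a table of candidate words and a bar chart of their probabilities. It assumes a `predict_next_word()` function like the sketch above and illustrative object names; it is not the app's actual source code.

```r
# Minimal Shiny sketch of the described interface (illustrative only).
library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("user_text", "Enter your text:"),
      actionButton("predict", "Predict next words")
    ),
    mainPanel(
      tableOutput("predictions"),   # table of the 5 most likely words
      plotOutput("prob_plot")       # bar chart of their probabilities
    )
  )
)

server <- function(input, output) {
  # Recompute predictions only when the button is clicked
  preds <- eventReactive(input$predict, {
    predict_next_word(input$user_text, trigrams, bigrams, unigrams, n = 5)
  })
  output$predictions <- renderTable(preds())
  output$prob_plot <- renderPlot({
    p <- preds()
    barplot(p$prob, names.arg = p$word, ylab = "Probability")
  })
}

shinyApp(ui, server)
```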