Alejandro Cadavid Romero
The application presented is the capstone of the Coursera Data Science Specialization. The capstone is intended to consolidate all the knowledge acquired throughout the specialization in a Natural Language Processing project. Special thanks to the facilitators of the project: Coursera, the Johns Hopkins Bloomberg School of Public Health professors, and SwiftKey.
Given the foundations of a data science project workflow laid by the whole specialization (from the tools and programming expertise, to getting the data, to modeling and building data products), the goal is to build an application with an embedded Natural Language Processing model to predict words given a text context.
Use the technical and non-technical knowledge given in the specialization to build a Shiny app with an embedded NLP model.
Use different language models to build an app that delivers speed and accuracy in predicting the next word given a text context.
The data was archived by heliohost.org, retrieved via the Wayback Machine, and can be downloaded from this link. About corpus
The main corpus was cleaned and processed into n-gram tables (unigrams, bigrams and trigrams) for the final model. The model used was Katz back-off, which relies on absolute discounting and back-off to estimate the probabilities of words in an unseen context. \[ P(w_1^{n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1}) \quad \text{(general n-gram approximation)} \] \[ P_{backoff}(w_n \mid w_{n-N+1}^{n-1}) = \begin{cases} P^{discounted}(w_n \mid w_{n-N+1}^{n-1}), & \text{if } C(w_{n-N+1}^{n}) > 0 \\ \alpha(w_{n-N+1}^{n-1})\, P_{backoff}(w_n \mid w_{n-N+2}^{n-1}), & \text{otherwise} \end{cases} \]
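As a rough illustration of how such n-gram tables can be queried, the sketch below shows a simplified back-off lookup in R (it omits the discount and back-off weights of the full Katz estimator). The table names `trigrams`, `bigrams`, `unigrams` and their columns `prefix`, `word`, `prob` are assumptions for the example, not the actual objects used in the app.

```r
# Minimal sketch of a back-off style next-word lookup over pre-computed
# n-gram tables. Assumed (hypothetical) data.frames: `trigrams`, `bigrams`,
# `unigrams`, each with columns `prefix`, `word` and `prob`.
library(dplyr)

predict_next_word <- function(text, trigrams, bigrams, unigrams, n = 5) {
  tokens <- tolower(unlist(strsplit(trimws(text), "\\s+")))
  k <- length(tokens)

  # Try the trigram table first, using the last two words as the context
  if (k >= 2) {
    ctx <- paste(tokens[(k - 1):k], collapse = " ")
    hits <- trigrams %>% filter(prefix == ctx) %>% arrange(desc(prob))
    if (nrow(hits) > 0) return(head(hits[, c("word", "prob")], n))
  }

  # Back off to the bigram table, using only the last word
  if (k >= 1) {
    hits <- bigrams %>% filter(prefix == tokens[k]) %>% arrange(desc(prob))
    if (nrow(hits) > 0) return(head(hits[, c("word", "prob")], n))
  }

  # Final back-off: the most probable unigrams
  head(unigrams %>% arrange(desc(prob)) %>% select(word, prob), n)
}
```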
The app embeds the previous model to predict the next word of the entered text. To use it, the user enters text in the input box on the left of the UI.
Next, the user clicks the "Predict next words" button, and the app returns a table with the 5 most likely words to continue the text (with their respective probabilities), along with a plot of those probabilities.
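For readers unfamiliar with Shiny, a minimal sketch of this kind of interface is shown below: a text box, a predict button, a table of candidate words and a bar chart of their probabilities. It assumes a `predict_next_word()` function like the sketch above and illustrative object names; it is not the app's actual source code.

```r
# Minimal Shiny sketch of the described interface (illustrative only).
library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("user_text", "Enter your text:"),
      actionButton("predict", "Predict next words")
    ),
    mainPanel(
      tableOutput("predictions"),   # table of the 5 most likely words
      plotOutput("prob_plot")       # bar chart of their probabilities
    )
  )
)

server <- function(input, output) {
  # Recompute predictions only when the button is clicked
  preds <- eventReactive(input$predict, {
    predict_next_word(input$user_text, trigrams, bigrams, unigrams, n = 5)
  })
  output$predictions <- renderTable(preds())
  output$prob_plot <- renderPlot({
    p <- preds()
    barplot(p$prob, names.arg = p$word, ylab = "Probability")
  })
}

shinyApp(ui, server)
```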