Shiny web text prediction app

Maximiliano Fernandez
12/19/2020

Introduction

This work is the final project of the Coursera Data Science Specialization from Johns Hopkins University (https://www.coursera.org/learn/data-science-project). The objective of the project is to create a Shiny web application that uses a text prediction algorithm to suggest possible next words based on the words provided by the user. This is similar to how the SwiftKey predictive keyboard works on cellphones. The university worked together with SwiftKey to create the course's final project. The data used was provided by SwiftKey and is available from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

App creation and algorithm

The app allows the user to type one or more words, which are then used by a prediction algorithm to suggest three possible next words. The idea is to replicate what the predictive dictionary on our cellphones does when we type in WhatsApp, Telegram or similar mobile apps.

Packages such as tidytext, stringi and NLP were used to process the data set and find the most common n-grams (more information: https://en.wikipedia.org/wiki/N-gram). The probability of occurrence of each n-gram was then calculated and incorporated for use in the app. The algorithm takes the user's input and searches the most common n-grams using a stupid back-off model, starting with the longest matching context and backing off to shorter ones when no match is found (see the related Katz back-off model: https://en.wikipedia.org/wiki/Katz's_back-off_model).
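The back-off search described above can be illustrated with a minimal sketch. This is not the app's code (the app is written in R): it is a Python illustration, and the function names build_ngram_counts and predict_next, the toy corpus, the maximum n-gram order of 3 and the back-off weight alpha = 0.4 are all assumptions made for the example.

```python
from collections import Counter


def build_ngram_counts(tokens, max_n=3):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def predict_next(counts, text, k=3, alpha=0.4):
    """Return up to k candidate next words, ranked by a stupid back-off score.

    A word found with the longest available context gets the relative
    frequency count(context + word) / count(context). Each time the
    context is shortened by one word, the score is multiplied by alpha.
    """
    words = text.lower().split()
    # Highest n-gram order usable with the words the user typed.
    start_n = min(max(counts), len(words) + 1)
    total_unigrams = sum(counts[1].values())
    scores = {}
    weight = 1.0
    for n in range(start_n, 0, -1):
        # Last n-1 typed words; the empty tuple when n == 1.
        context = tuple(words[len(words) - (n - 1):])
        context_count = counts[n - 1][context] if n > 1 else total_unigrams
        if context_count > 0:
            for gram, c in counts[n].items():
                # Keep the score from the longest context that matched.
                if gram[:-1] == context and gram[-1] not in scores:
                    scores[gram[-1]] = weight * c / context_count
        weight *= alpha  # penalty for backing off to a shorter context
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


corpus = "the cat sat on the mat the cat ate the fish".split()
counts = build_ngram_counts(corpus)
print(predict_next(counts, "the cat"))
```

The scores produced this way are not true probabilities, since they do not sum to one; they are only used to rank candidates. Ties are broken alphabetically here so that the result is deterministic.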

Restrictions

To create a powerful prediction algorithm, large amounts of data are required. However, text analysis tasks such as n-gram creation are computationally expensive. In addition, shinyapps.io is very restrictive on memory usage, and with a larger data sample I could not load the app on the web. As a consequence, I used only a small 7% sample of the original data. Please take this into consideration when using the app. To improve the efficiency and accuracy of the app, I would use a larger sample and run it on a private server with access to more memory.
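As a rough illustration of the sampling step, the sketch below keeps each line of a text with a probability of 7%. It is written in Python rather than the project's R, and the function name sample_lines, the fixed seed and the line-by-line sampling strategy are assumptions; the exact sampling method used for the app is not described here.

```python
import random


def sample_lines(lines, fraction=0.07, seed=1234):
    """Keep each line independently with probability `fraction`.

    A fixed seed (an arbitrary value chosen for this example) makes
    the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


# Hypothetical stand-in for the lines of one of the corpus files.
lines = [f"line {i}" for i in range(10000)]
sample = sample_lines(lines)
print(len(sample), "of", len(lines), "lines kept")
```

Sampling line by line keeps memory use low, because the full data set never has to be tokenized; only the retained lines are passed on to n-gram creation.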

Link to app and more information

Shiny web app link: https://maximiliano-fernandez.shinyapps.io/text_pred/

Report from the first analysis of the data: https://github.com/maxinegueruela/Coursera_Capstone/blob/main/Milestone_report_week2_v2.Rmd

Link to code in GitHub

https://github.com/maxinegueruela/Coursera_Capstone