Maximiliano Fernandez
12/19/2020
This work is the final project of the Coursera Data Science Specialization from Johns Hopkins University (https://www.coursera.org/learn/data-science-project). The objective of the project is to create a Shiny web application that uses a text prediction algorithm to suggest possible next words based on the words provided by the user. This is similar to how the SwiftKey keyboard predicts words on cellphones. The university worked with SwiftKey to create the course's final project. The data was provided by SwiftKey and is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The app lets the user type one or more words, which a prediction algorithm then uses to suggest three possible next words. The idea is to replicate what the predictive dictionary on our cellphones does when we type in WhatsApp, Telegram or similar mobile apps.
Packages like tidytext, stringi and NLP were used to clean the data set and find the most common n-grams (more information: https://en.wikipedia.org/wiki/N-gram). The probability of occurrence of each n-gram was then calculated and incorporated for use in the app. The algorithm takes the input from the user and searches for the most common matching n-grams using a Stupid Backoff model, falling back to shorter n-grams when a longer one has no match. Stupid Backoff is a simplified relative of Katz's back-off model (https://en.wikipedia.org/wiki/Katz's_back-off_model).
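The lookup described above can be sketched in a few lines. The app itself was built in R, so the Python below is only an illustration of the general technique, not the app's code. The function names, the toy corpus, the whitespace tokenizer and the back-off factor of 0.4 (a value commonly used with Stupid Backoff) are all assumptions made for this example.

```python
from collections import Counter

ALPHA = 0.4  # assumed back-off penalty; not taken from the app


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_ngram_counts(sentences, max_n=3):
    """Count every n-gram of order 1..max_n, stored per order."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def predict_next(counts, text, k=3, max_n=3):
    """Return up to k (word, score) pairs ranked by Stupid Backoff score."""
    tokens = tokenize(text)
    total_unigrams = sum(counts[1].values())
    scores = {}
    penalty = 1.0
    # Start with the longest usable context, then shorten it step by step.
    for n in range(min(max_n, len(tokens) + 1), 0, -1):
        context = tuple(tokens[len(tokens) - (n - 1):])
        denom = counts[n - 1][context] if n > 1 else total_unigrams
        if denom:
            for ngram, count in counts[n].items():
                word = ngram[-1]
                # Keep the score from the highest order where the word was seen.
                if ngram[:-1] == context and word not in scores:
                    scores[word] = penalty * count / denom
        penalty *= ALPHA  # each back-off step multiplies the score by ALPHA
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical toy corpus, used only to exercise the functions.
corpus = [
    "i love data science",
    "i love data analysis",
    "i love r",
    "data science is fun",
]
counts = build_ngram_counts(corpus)
print(predict_next(counts, "i love"))
```

Scanning every n-gram on each query keeps the sketch short but is slow on real data. A production version would index n-grams by their context, or precompute the top three continuations for each context, which also reduces the memory needed at prediction time.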
A powerful prediction algorithm requires a large amount of data. However, text analysis steps such as n-gram creation are computationally expensive, and shinyapps.io is very restrictive on memory usage: with a larger data sample, I could not load the app on the web. As a consequence, I used only a small 7% sample of the original data. Please take this into consideration when using the app. To improve the efficiency and accuracy of the app, I would use a larger sample and run it on a private server with access to more memory.
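Drawing a fixed-size random sample of lines is one simple way to get such a subset. The sketch below is in Python for illustration only (the project itself was done in R). The function name, the seed and the placeholder input are assumptions; only the 7% fraction comes from the text above.

```python
import random


def sample_lines(lines, fraction=0.07, seed=1234):
    """Return a reproducible random subset holding `fraction` of the lines."""
    rng = random.Random(seed)          # fixed seed so the sample can be regenerated
    k = round(len(lines) * fraction)   # number of lines to keep
    return rng.sample(lines, k)        # sampling without replacement


# Placeholder input standing in for the lines of one corpus file.
lines = [f"line {i}" for i in range(1000)]
subset = sample_lines(lines)
print(len(subset))
```

Fixing the seed means the same subset is drawn on every run, so the n-gram tables built from it can be regenerated exactly.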
Shiny web app link: https://maximiliano-fernandez.shinyapps.io/text_pred/
Report from the first analysis of the data: https://github.com/maxinegueruela/Coursera_Capstone/blob/main/Milestone_report_week2_v2.Rmd
Link to code in GitHub