João Ramos
17th April 2020
This slide deck will briefly pitch an application for predicting the next word based on an expression or phrase.
This app is the final deliverable of the capstone course of the Coursera Data Science Specialization (Johns Hopkins University). The capstone course is supported by SwiftKey.
The main goal of this capstone project is to develop a ready-to-use app that accepts text as input and predicts the next word.
A corpus of text from blogs, news and tweets was provided.
Along the way, the data was cleaned and explored and the model's performance was analysed, which ultimately led to the creation of this app.
Importing, cleaning and processing the text, as well as building the algorithm, were all done with a variety of well-known R packages.
1. The text provided was imported and sampled (5%) because of computational limitations, mainly the memory limits of a free account at shinyapps.io.
2. All of the sampled text was cleaned (i.e. everything was converted to lowercase, and URLs, unnecessary whitespace and special characters were removed).
3. The output of the previous steps was tokenized into n-grams. These are contiguous sequences of n words, where n is chosen by the person building the algorithm. Using n-grams instead of single words lets the prediction take whole expressions into account, which should make the predicted next word more accurate.
4. A function was designed to clean the input text in the same way as the sampled text and then predict the next word.
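The four steps above can be sketched in a few lines. This is only an illustration in Python, not the R code behind the app: the function names (`sample_lines`, `clean`, `build_model`, `predict_next`), the cleaning patterns, the maximum n-gram size of 3 and the rule of backing off from longer to shorter contexts and returning the most frequent follower are all assumptions made for the example.

```python
import random
import re
from collections import Counter, defaultdict


def sample_lines(lines, fraction=0.05, seed=42):
    """Step 1: keep a random fraction of the lines (5% by default)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean(text):
    """Step 2: lowercase, drop URLs and special characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # assumed URL pattern
    text = re.sub(r"[^a-z' ]+", " ", text)             # assumed: keep letters and apostrophes
    return re.sub(r"\s+", " ", text).strip()


def build_model(lines, max_n=3):
    """Step 3: for every n-gram (n = 2..max_n), count the word that follows
    each context made of the n-1 preceding words."""
    model = defaultdict(Counter)
    for line in lines:
        words = clean(line).split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                model[context][words[i + n - 1]] += 1
    return model


def predict_next(model, text, max_n=3):
    """Step 4: clean the input like the corpus, then try the longest
    context first and back off to shorter ones (assumed strategy)."""
    words = clean(text).split()
    for size in range(max_n - 1, 0, -1):
        if len(words) < size:
            continue
        context = tuple(words[-size:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # no known context matched


# Tiny corpus, so sampling (step 1) is skipped here.
corpus = ["I love New York", "I love New York!", "We love new ideas"]
model = build_model(corpus)
print(predict_next(model, "They LOVE new"))  # prints "york"
```

In this toy corpus the context "love new" is followed by "york" twice and by "ideas" once, so "york" is returned. On the real corpus, `sample_lines` would be applied before `build_model` to keep the model small enough for the memory limits mentioned in step 1.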
The next word prediction app is hosted at shinyapps.io
This pitch deck is located at RPubs