Presentation of a text prediction application for the final course of the Johns Hopkins Data Science Specialisation
Author: Jorik Schra
Date: 20-06-2019
For the last course of the Data Science Specialisation of Johns Hopkins University, participants were required to create a text prediction application that predicts the next word while a sentence is being written. This is my attempt at creating such an application.
The application runs a model on the back-end which was trained on text data from Twitter, blogs and the news. On the following slides, the process of developing this application will be further explained.
As mentioned on the previous slide, the application uses text retrieved from Twitter, blogs and the news. The datasets can be downloaded here.
A 10% sample was taken from each text source, and the sampled text was then cleaned by applying the following transformations:
Using the cleaned text data, the next step was to loop over all the text to extract bigrams, trigrams and quadgrams, and to generate frequency tables based on how often each of these occurs throughout the text. These tables serve as the basis for making predictions.
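As a rough illustration of this counting step, the sketch below builds bigram, trigram and quadgram frequency tables from a few lines of text. It is not the code behind the application: the function names (`tokenize`, `ngram_counts`, `build_tables`) and the sample sentences are made up for this example, and lowercasing plus whitespace splitting is only a stand-in for the actual cleaning.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercasing and splitting on whitespace stands in
    # for the real cleaning pipeline, which is not shown here.
    return text.lower().split()


def ngram_counts(lines, n):
    # Count every run of n consecutive tokens, line by line, so that
    # no n-gram spans two separate lines.
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def build_tables(lines):
    # One frequency table each for bigrams (2), trigrams (3)
    # and quadgrams (4).
    return {n: ngram_counts(lines, n) for n in (2, 3, 4)}


# Hypothetical sample text, for illustration only.
sample_lines = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
]
tables = build_tables(sample_lines)
print(tables[3].most_common(2))
```

Keying each table on a tuple of words keeps lookups cheap, at the cost of memory that grows quickly with corpus size, which is one reason to work from a sample rather than the full text.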
Next, a simple N-gram model was built, which works as follows. For the given text input, evaluate:
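The evaluation steps themselves are not reproduced here. One common way to use such frequency tables, assumed for this sketch and possibly different from the model in the application, is a simple backoff: try the last three words against the quadgram counts, then the last two against the trigram counts, then the last word against the bigram counts, collecting candidates until three are found. The names `build_lookup` and `predict` and the sample sentences are hypothetical.

```python
from collections import Counter, defaultdict


def build_lookup(lines, max_n=4):
    # lookup[k][context] counts the words seen right after a k-word
    # context, so k = 1, 2, 3 correspond to bigrams, trigrams, quadgrams.
    lookup = {k: defaultdict(Counter) for k in range(1, max_n)}
    for line in lines:
        tokens = line.lower().split()
        for k in range(1, max_n):
            for i in range(len(tokens) - k):
                context = tuple(tokens[i:i + k])
                lookup[k][context][tokens[i + k]] += 1
    return lookup


def predict(lookup, text, top=3):
    # Assumed backoff scheme: start with the longest context and fall
    # back to shorter ones until `top` distinct words are collected.
    tokens = text.lower().split()
    suggestions = []
    for k in sorted(lookup, reverse=True):
        if len(tokens) < k:
            continue
        followers = lookup[k].get(tuple(tokens[-k:]))
        if not followers:
            continue
        for word, _ in followers.most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == top:
                return suggestions
    return suggestions


# Hypothetical sample text, for illustration only.
lines = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
    "see you in the morning",
]
lookup = build_lookup(lines)
print(predict(lookup, "thanks for the"))
```

In this toy corpus the quadgram context supplies two candidates and the third comes from backing off to the bigram level. An input whose last word never appears in the corpus returns an empty list.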
The resulting application can be found here.
To use it, simply input a sentence in the box in the left panel. As soon as you do, the three most likely words are returned in the main panel of the application.
For a detailed report on how the text was preprocessed and transformed into frequency tables, check this link.