Coursera Data Science Capstone Project

Presentation of a text prediction application for the final course of the Johns Hopkins Data Science Specialisation

Author: Jorik Schra

Date: 20-06-2019

The Project

For the last course of the Data Science Specialisation of Johns Hopkins University, participants were required to create a text prediction application that predicts the next word while a sentence is being written. This is my attempt at creating such an application.

On the back end, the application runs a model that was trained on text data from Twitter, blogs and the news. The following slides explain how the application was developed.

The Input

As mentioned on the previous slide, the application uses text retrieved from Twitter, blogs and the news. The datasets can be downloaded here.

A 10% sample was taken from each text source and then cleaned by applying the following transformations:

Remove punctuation
Remove numbers
Transform all text to lowercase
Apply a profanity filter
Strip all leading and trailing whitespace
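
The sampling and cleaning steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for the application: the function names, the fixed random seed and the tiny PROFANITY word list are all assumptions made for this example, since the slides do not say which profanity list or tooling was used.

```python
import random
import re
import string

# Hypothetical placeholder list; the actual profanity list is not given in the slides.
PROFANITY = {"darn", "heck"}


def sample_lines(lines, fraction=0.10, seed=42):
    """Return a random sample holding roughly `fraction` of the lines."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample is reproducible
    k = round(len(lines) * fraction)
    return rng.sample(lines, k)


def clean_line(line, profanity=PROFANITY):
    """Apply the five cleaning steps to a single line of text."""
    # 1. Remove punctuation (note that "don't" becomes "dont").
    line = line.translate(str.maketrans("", "", string.punctuation))
    # 2. Remove numbers.
    line = re.sub(r"\d+", "", line)
    # 3. Transform all text to lowercase.
    line = line.lower()
    # 4. Drop any word that appears in the profanity list.
    words = [w for w in line.split() if w not in profanity]
    # 5. Re-joining the words removes leading and trailing whitespace
    #    (and also collapses repeated spaces inside the line).
    return " ".join(words)
```

Dropping profane words one by one is only one possible design; an alternative would be to discard every line that contains such a word.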

Generating frequency tables & predictions

Using the cleaned text data, the next step was to loop over all the text to obtain bigrams, trigrams and quadgrams and to generate frequency tables based on the occurrence of these throughout the text. These tables serve as the basis for making predictions.
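
As an illustration of what such frequency tables could look like, here is a minimal Python sketch. It assumes the cleaned text is a list of lines that can be split on whitespace; the function names and the dictionary-of-counters layout are choices made for this example and are not taken from the application itself.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_frequency_tables(lines, sizes=(2, 3, 4)):
    """Count how often each bigram, trigram and quadgram occurs in the lines."""
    tables = {n: Counter() for n in sizes}
    for line in lines:
        tokens = line.split()
        for n in sizes:
            # Lines shorter than n words simply contribute no n-grams.
            tables[n].update(ngrams(tokens, n))
    return tables
```

N-grams are counted per line here, so no n-gram spans two separate tweets or sentences.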

Next, a simple N-gram model was built. For a given text input, it works as follows:

1. Determine the last three words of the input. If the input is shorter than three words, use all words provided.
2. Search the quadgram frequency table for matches on the last three words. Return the 3 most frequent results.
3. If step 2 did not produce 3 suggestions, search the trigram frequency table for matches on the last two words. Add these to the predictions, up to a total of 3.
4. If step 3 did not bring the total to 3 suggestions, search the bigram frequency table for matches on the last word. Add these to the predictions, again up to a total of 3.
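
The back-off procedure described above can be sketched in Python as follows. This is an illustrative approximation rather than the application's actual code: the `predict` function, the layout of `tables` (n-gram size mapped to a counter keyed by word tuples) and the counts in `example_tables` are all invented for this example.

```python
from collections import Counter


def predict(text, tables, k=3):
    """Suggest up to k next words, backing off from quadgrams to bigrams."""
    words = text.lower().split()
    suggestions = []
    for n in (4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough words typed to match this n-gram size
        # Candidate next words for this context with their frequencies.
        candidates = Counter({
            gram[-1]: count
            for gram, count in tables.get(n, {}).items()
            if gram[:-1] == context
        })
        for word, _ in candidates.most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


# Made-up counts, purely for demonstration.
example_tables = {
    4: Counter({("thanks", "for", "the", "follow"): 5,
                ("thanks", "for", "the", "support"): 3}),
    3: Counter({("for", "the", "first"): 7, ("for", "the", "follow"): 2}),
    2: Counter({("the", "same"): 9}),
}

print(predict("Thanks for the", example_tables))  # ['follow', 'support', 'first']
```

In this example the quadgrams supply only two suggestions, so the third comes from the trigram table. The sketch scans each table linearly for clarity; a real application would more likely index the tables by context to keep lookups fast.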

The Result

The resulting application can be found here.

To use it, simply input a sentence in the box in the left panel. As soon as you do, the three most likely words are returned in the main panel of the application.

For a detailed report on how the text was preprocessed and transformed into frequency tables, see this link.