Coursera Data Science Capstone Project

MJ
17 June 2017

Introduction

The Next Word Predictor application was built for the Capstone Project of the Coursera Data Science Specialization.

The application provides a user-friendly web interface that saves typing: it predicts the next word the user is likely to write in an English text and lets the user add that word to the text by clicking on it.

It was developed with Shiny, using shinydashboard to improve its appearance, and it is hosted on shinyapps.io.

Methodology

The methodology followed to build the model consisted of the following steps:

Obtain data from Twitter, news and blogs published on the web.
Clean the data, for instance by deleting profanities, emojis, URLs and icons.
Tokenize the data into groups of N words, called N-grams, and count how often each N-gram appears; these counts are also useful for data exploration. Low-frequency N-grams are discarded for efficiency and to reduce noise.
Investigate algorithms that could be used for next word prediction.
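
The cleaning, tokenizing and counting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the application: the regular expressions, the placeholder profanity list, the function names (clean, count_ngrams, prune) and the min_count threshold are all hypothetical.

```python
import re
from collections import Counter

# Assumed patterns: anything that looks like a URL, and anything that is
# not a lowercase letter, apostrophe or space (this also removes emojis).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_RE = re.compile(r"[^a-z' ]+")

# Hypothetical placeholder list; the source does not say which list was used.
PROFANITIES = {"darn", "heck"}


def clean(text):
    """Lowercase the text, strip URLs and symbols, and drop listed profanities."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_WORD_RE.sub(" ", text)
    return [word for word in text.split() if word not in PROFANITIES]


def count_ngrams(lines, n):
    """Count how often each n-gram (a tuple of n words) appears in the lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, min_count=2):
    """Discard n-grams seen fewer than min_count times (assumed threshold)."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


corpus = [
    "I love data science 😀 http://example.com",
    "I love data and I love R",
]
bigrams = prune(count_ngrams(corpus, 2))
print(bigrams)
```

On this toy corpus only the bigrams that occur at least twice survive pruning, which is the effect the last part of the tokenizing step describes.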

Algorithm description

The algorithm selected is Stupid Backoff, chosen with execution time and memory requirements in mind as well.

If the user has not written anything, the most frequent word in English is shown. Once the user starts writing, each time the spacebar is pressed the preceding words are used to determine which word the user is likely to write next, using data for N-grams of up to five words (5-grams).

Each candidate word receives a score that is computed using a back-off factor of 0.4.

Finally, the user sees the option with the highest score and can click on it to add it to the text.

Enjoy... it!

Enjoy the Next Word Predictor!

Important note: during the Capstone Project, the original version offered the top three predicted words, but the final submission should offer only one, so the application was modified accordingly.