28 August 2019

Introduction

  • The Johns Hopkins Data Science Specialization partnered with SwiftKey on a project for which they provided a large number of unstructured texts written in English;

  • The goal of this project was to develop a web application able to predict the next word given a sequence of words;

  • Using a sample of around 1 million texts, an algorithm was developed with Natural Language Processing and Text Mining techniques;

  • The potential applications of such functionality are manifold:

    1. Speed up the user’s typing by suggesting the next word;

    2. Search query autocomplete;

  • The app can be loaded by clicking here.

Methodology

  • The methodology is a model based on n-grams;

  • Specifically, it implements a version of Katz’s back-off model that uses up to 5-grams;

  • For each sequence of words supplied, the algorithm:

    1. First, counts how many words were given;
    2. If, for instance, 4 or more words were typed, searches all 5-grams for the last 4 words supplied, looking for 5 matches;
    3. If 5 candidates are found, returns them ranked by their score;
    4. If fewer than 5 candidates are found, searches all 4-grams for the last 3 words;
    5. If 5 candidates are still not found, searches all 3-grams for the last 2 words, and so on;
    6. Continues until it has found at least 5 candidates;
  • In general, if k words are supplied, the Katz’s back-off algorithm starts with the (k+1)-grams, capped at 5-grams;
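
The back-off search described above can be sketched in Python as follows. This is a minimal illustration and not the app's actual code: the function names, the whitespace tokenizer, and the use of raw n-gram counts as the "score" are all assumptions, and the sketch omits the discounting and back-off weights that a full Katz model applies.

```python
from collections import Counter, defaultdict

MAX_ORDER = 5        # highest n-gram order used, as stated in the text
NUM_CANDIDATES = 5   # number of predictions to return


def build_ngram_tables(sentences, max_order=MAX_ORDER):
    """For each order n, map a context of n-1 words to a Counter of next words.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    tables = {n: defaultdict(Counter) for n in range(1, max_order + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                next_word = tokens[i + n - 1]
                tables[n][context][next_word] += 1
    return tables


def predict_next(text, tables, max_order=MAX_ORDER, k=NUM_CANDIDATES):
    """Return up to k candidate next words, backing off to shorter contexts."""
    tokens = text.lower().split()
    # With k words supplied, start at the (k+1)-grams, capped at max_order.
    start_order = min(len(tokens) + 1, max_order)
    candidates = []
    for n in range(start_order, 0, -1):
        # Context is the last n-1 words; it is empty for unigrams.
        context = tuple(tokens[len(tokens) - (n - 1):])
        # Assumed score: the raw count of the n-gram in the training sample.
        ranked = tables[n].get(context, Counter()).most_common()
        for word, _count in ranked:
            if word not in candidates:
                candidates.append(word)
            if len(candidates) == k:
                return candidates
    return candidates


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the cat sat on the sofa",
        "the cat ate the fish",
    ]
    tables = build_ngram_tables(corpus)
    print(predict_next("the cat sat on the", tables))
```

Candidates found with longer contexts are kept ahead of those found after backing off, so the ranking favors the most specific match first.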

Application

  • The application is hosted at this link;

  • The application works as follows:

    1. The user starts typing words in the Enter text below box;
    2. A list of 5 predictions, ranked by their score, is displayed in the Predictions box;
    3. The user can choose among these 5 candidates in the Choose Your Next Word box;
    4. After the next word is chosen, the text in the Enter text below box is updated automatically and the next 5 candidates are displayed.
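
The interaction loop above (predict, choose, update the text, predict again) can be mimicked outside the app with a short Python sketch. Everything here is hypothetical: `predict` stands in for whatever prediction function the app uses, `choose` stands in for the user's selection, and the toy predictor exists only so that the example runs.

```python
def accept_prediction(current_text, chosen_word):
    """Append the chosen word to the text, as the app does after a selection."""
    return f"{current_text.rstrip()} {chosen_word}".strip()


def typing_session(initial_text, predict, choose, steps):
    """Repeat the predict, choose, update cycle for a fixed number of steps."""
    text = initial_text
    for _ in range(steps):
        candidates = predict(text)      # the ranked list shown to the user
        if not candidates:
            break                       # nothing to suggest, stop early
        text = accept_prediction(text, choose(candidates))
    return text


if __name__ == "__main__":
    # Toy predictor (assumption): suggests words from a fixed lookup table.
    toy_table = {"i": ["am", "was"], "am": ["happy", "here"]}

    def toy_predict(text):
        words = text.lower().split()
        return toy_table.get(words[-1], []) if words else []

    # Simulated user who always picks the top-ranked candidate.
    print(typing_session("I", toy_predict, lambda c: c[0], steps=2))
```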

Here is a screenshot of the app: