Coursera Data Science Capstone Project

Michael Ebner
12/03/2019

This slide deck is a pitch for an application that lets the user pick the next word from a list of words predicted from the user's input. The application was developed as the capstone project of the Johns Hopkins Coursera Data Science Specialization, in cooperation with SwiftKey.

The Model - The cornerstones

  • The training set is sampled from large text corpora from three different sources: Twitter, blogs and news. Other data sources can be added if more memory is available.
  • The text data has been cleaned of special characters and profanity.
  • The algorithm can deal with intra-word contractions and, unlike many other models, even with digits.
  • For prediction, the data has been tokenized into n-grams (from 1 to 5). The next-word probability calculation uses these n-grams to predict the next word. The underlying model is based on Katz's back-off model.
  • To boost performance, less relevant n-grams have been eliminated based on a threshold that takes the cumulative probability of the n-grams into account.
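
The preparation steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the function names, the placeholder profanity list and the `keep_mass` threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder; a real profanity list would be much longer.
PROFANITY = {"darn"}

def clean(text):
    """Lowercase, keep letters, digits and intra-word apostrophes, drop profanity."""
    text = text.lower()
    # Replace every run of other characters with a single space.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # Strip apostrophes at word edges but keep them inside contractions.
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in PROFANITY]

def ngram_counts(tokens, max_n=5):
    """Count every n-gram of order 1..max_n, keyed by order."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def prune(counter, keep_mass=0.9):
    """Keep the most frequent n-grams until they cover keep_mass of all occurrences."""
    total = sum(counter.values())
    kept, covered = Counter(), 0
    for gram, count in counter.most_common():
        if covered >= keep_mass * total:
            break  # the remaining n-grams are the rare tail
        kept[gram] = count
        covered += count
    return kept
```

Calling `prune` on each order's counter drops the rare tail of n-grams, which is one plausible reading of the cumulative-probability threshold mentioned above.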

The Model - What's under the hood?

  • The prediction is based on Katz's back-off approach.
  • In a nutshell, the model counts the number of times (k) each n-gram occurs in the training data. These counts are then used to assign a likelihood to each n-gram. If no result is found, the algorithm backs off to lower-order n-grams (e.g. if the 4-gram storage yields 0 results, the model moves on to the 3-gram storage, and so on).
  • To boost performance, the cumulative likelihood of the n-grams has been used to delete very unlikely word combinations.
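
The back-off step can be illustrated with a short sketch. It is a simplified back-off lookup that scores candidates by relative frequency and omits the discounting and back-off weights of full Katz back-off. The names `build_tables` and `predict` are hypothetical and do not come from the app.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=5):
    """Map each context (tuple of preceding words) to a Counter of next words."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[n][tuple(context)][word] += 1
    return tables

def predict(tables, history, top=3):
    """Return (word, relative frequency, n-gram order) from the highest matching order."""
    for n in range(max(tables), 0, -1):
        # An order-n lookup needs the last n-1 typed words as context.
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # not enough typed words for this order
        followers = tables[n].get(context)
        if followers:
            total = sum(followers.values())
            return [(w, c / total, n) for w, c in followers.most_common(top)]
        # No match at this order: back off to the next lower order.
    return []
```

For example, with tables built from "the cat sat on the mat the cat ran", the history `["a", "the"]` has no matching 3-gram context, so the lookup falls back to the 2-gram table and ranks "cat" first.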

How to use the app

  • The app can be found here
  • It's a simple-to-use app. There is one text field for user input on the left, and the results show up on the right. Each result is shown with its probability and the type of n-gram (1 to 5) used for the calculation.

Resources