Coursera Data Science Capstone Project

Laica Noguera
August 16, 2019

About the App

  • Blogs, News and Tweets corpora from SwiftKey
  • Together the corpora contain 4.3 million lines of text, amounting to over 100 million words

The App

Model Algorithm

  • Uses the Katz back-off approach with Good-Turing smoothing
  • The model takes the user's input text or phrase, cleans it, and, depending on the length of the phrase, runs a Katz back-off algorithm from the highest possible n-gram, starting at most from the 4-gram.
  • Katz back-off starts by looking for matches in the highest possible n-gram and takes all matches available there. It then looks in the lower-level n-grams for extra matches that did not appear at the higher level.
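
The lookup order described above can be illustrated with a minimal Python sketch. It is not the app's actual code: the function names `build_ngram_tables` and `predict_next`, the parameter `k`, and the toy corpus are all hypothetical, input cleaning is reduced to lowercasing and splitting on whitespace, and candidates are ranked by raw counts, so the Good-Turing discounting is left out.

```python
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_n=4):
    """Map each order n (1..max_n) to {context tuple: Counter of next words}."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])  # first n-1 words
            tables[n][context][tokens[i + n - 1]] += 1  # the n-th word
    return tables


def predict_next(phrase, tables, max_n=4, k=3):
    """Return up to k next-word candidates, backing off from high to low order.

    Assumption: raw counts stand in for Good-Turing discounted probabilities.
    """
    words = phrase.lower().split()  # stand-in for the real cleaning step
    candidates = []
    # A phrase of m words can use at most an (m + 1)-gram.
    start_n = min(max_n, len(words) + 1)
    for n in range(start_n, 0, -1):
        # Last n-1 words of the phrase; the empty tuple for unigrams.
        context = tuple(words[len(words) - (n - 1):])
        matches = tables[n].get(context, Counter())
        for word, _count in matches.most_common():
            # Keep only words not already found at a higher order.
            if word not in candidates:
                candidates.append(word)
                if len(candidates) == k:
                    return candidates
    return candidates


# Toy corpus, for illustration only.
corpus = "i love new york i love new shoes i love new york".split()
tables = build_ngram_tables(corpus)
print(predict_next("i love new", tables))
```

For the phrase "i love new", the 4-gram table supplies the first two candidates, the 3-gram and 2-gram tables add nothing new, and the unigram table fills the remaining slot with the most frequent word not yet listed.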