Next Word Prediction App Pitch

Daniel Baquero
2020-06-07

Introduction

The goal of the Capstone Project for the JHU Data Science Specialization on Coursera is to create a fast, accurate, and responsive next-word prediction algorithm. Accomplishing this requires all of the knowledge gained during the specialization, and more.
The corpus provided to build the app is a collection of more than 4 million lines of text from blogs, news, and Twitter.
My personal goal for the project is to create a model that is as small and fast as possible. This way, the app can run on not-so-smart phones and other low-tech gadgets.

The app can be found here.
The original database can be found here.

How To Use The App

Using the app is straightforward. Open the app, then type or paste your text into the text box. Once the text is ready, click the submit button!

  • The app returns only the most probable next word.
  • Accuracy may seem low because the n-gram database is trimmed as much as possible.
  • The total space needed to run the app is less than 146 MB.
  • Response time is almost always less than 0.1 seconds.

How It Works

  • The app uses an n-gram model to predict the next word.
  • N-grams are sequences of consecutive words found in a text database called a corpus.
  • The corpus is composed of more than 4 million lines of text from blogs, news, and Twitter.
  • Stop words are included in the model because these words are building blocks of language.
  • The largest n-gram used is a 4-gram. This means that the app takes only the last three words, or fewer, to make the prediction. If the model cannot find a matching sequence of words, it falls back to the next lower-order n-gram table.
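The back-off lookup described above can be illustrated with a minimal Python sketch. This is not the app's actual code: the names `build_tables` and `predict_next`, the whitespace tokenization, the raw-count ranking, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

MAX_N = 4  # highest n-gram order, matching the 4-gram limit described above


def build_tables(lines, max_n=MAX_N):
    """Count, for every context of 0..max_n-1 words, which word follows it."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for line in lines:
        words = line.lower().split()  # assumed tokenizer: split on whitespace
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                next_word = words[i + n - 1]
                tables[n][context][next_word] += 1
    return tables


def predict_next(text, tables, max_n=MAX_N):
    """Return the most frequent follower of the longest known context."""
    words = text.lower().split()
    # Start with the last max_n-1 words, then back off to shorter contexts.
    for n in range(min(max_n, len(words) + 1), 0, -1):
        context = tuple(words[len(words) - (n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # only reached when the tables are empty


# Hypothetical toy corpus, standing in for the blogs/news/Twitter data.
toy_corpus = [
    "thank you for the follow",
    "thank you for the support",
    "thank you for the support today",
    "see you at the game",
]
tables = build_tables(toy_corpus)
print(predict_next("thank you for the", tables))  # found in the 4-gram table
print(predict_next("meet me at the", tables))     # backs off to the 3-gram table
```

In this sketch the first query is answered from the 4-gram table, while the second has no matching three-word context and is answered from the 3-gram table instead. A trimmed model would additionally drop rare n-grams from each table before lookup.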

Advantages and Performance of the Model

  • The main advantages of the model are its small size and fast response time.
  • At less than 150 MB, the whole model can be carried over to and implemented on low-resource gadgets.
  • The elapsed time for a prediction is less than 0.1 seconds. The plot below shows the performance in terms of time needed.

[Plot: time needed per prediction]
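A response-time figure like the one quoted above could be checked with a small timing harness. The Python sketch below is only an illustration: `time_predictions` and `toy_predict` are hypothetical names, and the toy lookup table stands in for the real model, so the numbers it prints say nothing about the app itself.

```python
import statistics
import time


def time_predictions(predict, queries, repeats=5):
    """Return the median wall-clock seconds per call of predict(query)."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            predict(query)
            samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical stand-in for the real model: a lookup keyed on the last word.
toy_table = {"the": "support", "you": "for"}


def toy_predict(text):
    words = text.lower().split()
    return toy_table.get(words[-1]) if words else None


median_seconds = time_predictions(toy_predict, ["thank you", "for the", ""])
print(f"median time per prediction: {median_seconds:.6f} s")
```

The median is used here rather than the mean so that a single slow call does not dominate the result.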