May 8, 2018

Introduction

The application, named "Guessing next word in a sentence", demonstrates the Natural Language Processing model built during the Data Science Capstone Project course.

The course is offered by Johns Hopkins University on Coursera. The data used in this project was provided by SwiftKey.

Natural Language Processing Model

  • The next word is guessed from the frequency of combinations of two, three and four words. This is known as an N-gram model, here with N = 2, 3, 4.

  • The frequencies are calculated from text extracted from blogs, news and Twitter, which SwiftKey provided for this project.

  • The algorithm first tries the highest-order N-gram; if no match is found, it falls back to the (N-1)-grams.

  • The word 'it' is returned as the default guess when there is no hint to guess from or no pattern is found.
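
The fallback logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the app itself: the function names `build_tables` and `predict_next`, the whitespace tokenization and the toy sentences are all hypothetical.

```python
from collections import Counter, defaultdict

DEFAULT_WORD = "it"  # default guess named in the slides


def build_tables(sentences, max_n=4):
    """Count, for each N in 2..max_n, which word follows each (N-1)-word prefix.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                prefix = tuple(words[i:i + n - 1])
                tables[n][prefix][words[i + n - 1]] += 1
    return tables


def predict_next(tables, text, max_n=4):
    """Try the highest-order N-gram first, then fall back to shorter ones."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough context for this N
        prefix = tuple(words[-(n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return DEFAULT_WORD  # no pattern matched at any order


tables = build_tables([
    "i want to go home",
    "i want to sleep",
    "you want to go out",
])
print(predict_next(tables, "they want to"))
print(predict_next(tables, "unknown"))
```

In this toy example, "they want to" has no matching four-word pattern, so the lookup falls back to the three-word table, while "unknown" matches nothing and yields the default word.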

Performance

  • Data preparation for this model takes around 40 minutes.
  • This process takes the original data (news, blogs and tweets, totalling 2.5 million lines and 550 MB) and selects 50,000 random lines to build the frequency tables for combinations of two, three and four words.
  • The output of this process is the data used by the app. Its total size is 3 MB.
  • The app needs 0.30 s on average to look up the next word in the frequency tables.
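
The preparation step (draw a random sample of lines, then count word combinations) could look roughly like the Python sketch below. It is an assumption-laden illustration rather than the original pipeline: `sample_lines`, `ngram_counts`, the fixed seed and the tiny in-memory corpus are hypothetical, and the real process reads the SwiftKey files and samples 50,000 lines.

```python
import random
from collections import Counter


def sample_lines(lines, k, seed=1234):
    """Return k randomly chosen lines (or all of them if fewer than k exist).

    The seed is an assumption, added only to make the sample reproducible.
    """
    rng = random.Random(seed)
    lines = list(lines)
    return rng.sample(lines, min(k, len(lines)))


def ngram_counts(lines, n):
    """Frequency table of n consecutive words across all lines."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        # zip over n shifted copies of the word list yields each n-gram once
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts


corpus = ["a b c d", "a b c", "x y"]  # stand-in for the 2.5 million lines
sample = sample_lines(corpus, k=2)
tables = {n: ngram_counts(sample, n) for n in (2, 3, 4)}
```

Sampling before counting is what keeps the resulting tables small enough to ship with the app, at the cost of missing rarer word combinations.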

Shiny application