Guessing next word in a sentence

May 8, 2018

Introduction

The application Guessing next word in a sentence demonstrates the Natural Language Processing model built during the Data Science Capstone Project course.

This course is offered by Johns Hopkins University on Coursera. The data used in this project was provided by SwiftKey.

Natural Language Processing Model

The next word is guessed from the frequency of combinations of two, three, and four words. This is known as an N-gram model, here with N = 2, 3, 4.
The frequencies are calculated from text extracted from blogs, news, and Twitter, provided by SwiftKey for this project.
The algorithm tries the highest-order N-gram first; if it finds no match, it falls back to the (N-1)-grams.
The word 'it' is used as the default guess when no matching pattern is found.
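
The fallback logic described above can be sketched in a few lines. This is a minimal illustration in Python, not the code of the Shiny app itself: the function name `predict_next`, the layout of the `tables` dictionary, and the toy counts are all assumptions made for the example.

```python
from typing import Dict, Tuple

# Hypothetical table layout: for each N, map a context of N-1 words
# to the counts of the words observed right after that context.
Table = Dict[Tuple[str, ...], Dict[str, int]]

DEFAULT_WORD = "it"  # default guess named in the slides


def predict_next(sentence: str, tables: Dict[int, Table]) -> str:
    """Guess the next word, trying 4-grams, then 3-grams, then 2-grams."""
    words = sentence.lower().split()
    for n in (4, 3, 2):
        context_len = n - 1
        if len(words) < context_len:
            continue  # not enough typed words for this N-gram order
        context = tuple(words[-context_len:])
        candidates = tables.get(n, {}).get(context)
        if candidates:
            # Return the most frequent continuation for this context.
            return max(candidates, key=candidates.get)
    return DEFAULT_WORD  # no pattern matched at any order


# Toy frequency tables, invented for illustration only.
tables = {
    2: {("the",): {"end": 5, "cat": 3}},
    3: {("of", "the"): {"day": 4, "year": 2}},
    4: {("at", "the", "end"): {"of": 7}},
}

print(predict_next("at the end", tables))
print(predict_next("one of the", tables))
print(predict_next("hello", tables))
```

With these toy tables, the first call is answered by the four-word table, the second falls back to the three-word table, and the third matches nothing and returns the default word.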

Performance

Data preparation for this model takes around 40 minutes.
This process takes the original data (news, blogs, and tweets, totaling 2.5 million lines and 550 MB) and selects 50,000 random lines to build the frequency tables for combinations of two, three, and four words.
The output of this process is the data used by the app. Its total size is 3 MB.
The app needs 0.30 s on average to look up the next word in the frequency tables.
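
The preparation step described above (sample random lines, then count word combinations) might look roughly like the following. It is a simplified Python sketch under stated assumptions: tokenization is a plain lowercase whitespace split, and the helper names `sample_lines`, `count_ngrams`, and `build_tables` are hypothetical, not taken from the project.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def sample_lines(lines: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Pick up to k random lines (the slides use k = 50,000)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(list(lines), min(k, len(lines)))


def count_ngrams(lines: Sequence[str], n: int) -> Counter:
    """Count every run of n consecutive words within each line."""
    counts: Counter = Counter()
    for line in lines:
        words = line.lower().split()  # assumption: naive whitespace tokenizer
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def build_tables(lines: Sequence[str],
                 sizes: Tuple[int, ...] = (2, 3, 4)) -> Dict[int, Counter]:
    """Build one frequency table per N-gram size."""
    lines = list(lines)  # materialize so the lines can be scanned once per size
    return {n: count_ngrams(lines, n) for n in sizes}


# Tiny invented corpus standing in for the blogs, news, and Twitter files.
corpus = ["the cat sat on the mat", "the cat ran"]
freq_tables = build_tables(sample_lines(corpus, 50_000))
print(freq_tables[2].most_common(3))
```

Keeping only a random sample of lines is what lets the resulting tables stay small enough to ship with the app, at the cost of missing rarer word combinations.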

Shiny application

The Shiny application is available at https://mcastrol.shinyapps.io/guessing_next_word/
You can also download it from the GitHub repository https://github.com/mcastrol/dataScienceCaptoneProject
Enter a sentence. As you type, the app shows the predicted next word, along with the word found in each of the frequency tables for two, three, and four words.

App user interface
Thank you for your attention