Capstone Project - Data Science Specialization

Andre Morato
06/14/2017

The application

  • This presentation explains the app created to predict the next word in a sentence provided by the user.

Using the app:

  • Just type a sentence in the text box and wait for a table with the most promising next words.
  • The elapsed time is less than 2 seconds.

Link to the application:

https://amdmorato.shinyapps.io/Nextwordprediction/

Link to the GitHub repository containing all the code:

https://github.com/andmorato/Capstone_Project

How does it work?

  • N-gram theory is applied to obtain the words suggested to the user.
  • The main principle is to match the last n-1 words of the user's sentence against the first n-1 words of the n-grams in the database. The suggested word is the last word of the matching n-gram with the highest probability.

Example (n=4):

  • User sentence: I would like to
  • Last n-1 words: “would like to”
  term.1 term.2 term.3 term.4   Pkn
1  would   like     to    see 0.130
2  would   like     to   know 0.083
3  would   like     to  think 0.057
4  would   like     to     be 0.051
5  would   like     to   have 0.045
  • From the database search shown above, the suggested word is “see”, the candidate with the highest Pkn value.
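
The lookup in this example can be sketched in a few lines of Python. This is only an illustration of the matching principle, not the app's actual code: the function name suggest and the list-of-tuples layout are assumptions, while the rows themselves are copied from the table above.

```python
# Rows copied from the example table: (term.1, term.2, term.3, term.4, Pkn).
FOUR_GRAMS = [
    ("would", "like", "to", "see", 0.130),
    ("would", "like", "to", "know", 0.083),
    ("would", "like", "to", "think", 0.057),
    ("would", "like", "to", "be", 0.051),
    ("would", "like", "to", "have", 0.045),
]


def suggest(sentence, table, n=4):
    """Return (word, probability) candidates, most probable first.

    Hypothetical helper: matches the last n-1 words of the sentence
    against the first n-1 terms of every n-gram row in the table.
    """
    context = tuple(sentence.lower().split()[-(n - 1):])
    matches = [(row[n - 1], row[n]) for row in table if row[:n - 1] == context]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


candidates = suggest("I would like to", FOUR_GRAMS)
print(candidates[0][0])  # see
```

A linear scan is enough for five rows; a real database would more plausibly be indexed by the n-1 word context so that a lookup stays within the response-time budget.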

What if the sentence contains a word that is not in the app dictionary?

Example (n=4):

  • User sentence: I'm a huge pokemon fan and blastoise is my
  • Last n-1 words: “blastoise is my”

Problem:

  • “blastoise” is not in the database, so there will be no match between the user sentence and the database.

Solution:

  • Use a lower-order n-gram. In this case, n=3.

Example (n=3):

  • User sentence: I'm a huge pokemon fan and blastoise is my
  • Last n-1 words: “is my” [There are matches, and the suggested word is “favorite”]
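
The fallback from n=4 to n=3 can be sketched as a loop over decreasing n-gram orders. The Python below is a minimal sketch under stated assumptions: the function name suggest_with_backoff and the nested-dictionary layout are hypothetical, and every probability in the n=3 table (as well as the candidate “first”) is made up for illustration. Only the contexts and the word “favorite” come from the example. Whether the app rescales probabilities when it drops to a lower order is not stated here, so the sketch simply takes the best match at the first order that has one.

```python
# Toy tables keyed by n-gram order, then by the n-1 word context.
TABLES = {
    # Two rows taken from the n=4 example table.
    4: {("would", "like", "to"): {"see": 0.130, "know": 0.083}},
    # Made-up probabilities; only "is my" -> "favorite" comes from the example.
    3: {("is", "my"): {"favorite": 0.09, "first": 0.04}},
}


def suggest_with_backoff(sentence, tables, max_n=4):
    """Try the highest-order n-gram first, then fall back to lower orders.

    Returns (word, n) for the first order with a match, or (None, None).
    """
    words = sentence.lower().split()
    for n in range(max_n, 1, -1):
        context = tuple(words[-(n - 1):])
        candidates = tables.get(n, {}).get(context)
        if candidates:
            best = max(candidates, key=candidates.get)
            return best, n
    return None, None


word, order = suggest_with_backoff(
    "I'm a huge pokemon fan and blastoise is my", TABLES
)
print(word, order)  # favorite 3
```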

The final model

The selection of model parameters was driven by two constraints:

1) Time elapsed between user entry and app response.

  • The maximum waiting time was set to 2 seconds for a good user experience.

2) Accuracy.

  • Accuracy was measured with quiz #2 and quiz #3 on the Capstone Project course site.

Final model parameters:

  • N-gram of size 4.
  • No stopword removal.
  • Use of 1% of the total data provided.
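
As a rough illustration of these parameters, the Python sketch below samples 1% of a corpus and counts 4-grams without removing stopwords. It is not the app's pipeline: the function names, the random seed, the regular-expression tokenizer, and the toy corpus are all assumptions, and it produces raw counts rather than the Pkn values shown earlier.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep roughly `fraction` of the lines (at least one), chosen at random."""
    if not lines:
        return []
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)


def count_ngrams(lines, n=4):
    """Count n-grams line by line; stopwords such as "to" are kept."""
    counts = Counter()
    for line in lines:
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", line.lower())
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus standing in for the course data.
corpus = ["I would like to see you", "I would like to know more"] * 100
sampled = sample_lines(corpus)
counts = count_ngrams(sampled)
print(len(sampled), counts.most_common(1))
```

Keeping stopwords matters for this task because contexts such as “would like to” consist almost entirely of them.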

For more detailed information, see the Documentation tab in the application.