Data Science Capstone

Andrei Boulgakov
22 June 2020

How the model works

The model works on full statements (no stopwords).

  • Predicts the last word from the full sentence
  • Given a discount, checks unobserved n-grams
  • Backs off to the shorter n-gram
  • Combines the top 10 results
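
The steps above resemble discounted back-off. A minimal Python sketch follows, assuming a bigram-to-unigram back-off only and a fixed absolute discount. The function name `predict`, the toy counts, and the default discount of 0.5 are illustrative assumptions, not the app's actual code.

```python
from collections import Counter

def predict(context, bigrams, unigrams, discount=0.5, top_k=10):
    """Rank candidate words after `context` (hypothetical helper)."""
    # Observed bigrams that start with the context word.
    observed = {w2: c for (w1, w2), c in bigrams.items() if w1 == context}
    total = sum(observed.values())
    if total == 0:
        # Nothing observed: back off entirely to unigram frequencies.
        uni_total = sum(unigrams.values())
        scores = {w: c / uni_total for w, c in unigrams.items()}
    else:
        # Discounted probabilities for observed continuations.
        scores = {w: (c - discount) / total for w, c in observed.items()}
        # Probability mass freed by the discount goes to unobserved words,
        # shared in proportion to their unigram counts.
        leftover = discount * len(observed) / total
        unobserved = {w: c for w, c in unigrams.items() if w not in observed}
        unobs_total = sum(unobserved.values())
        for w, c in unobserved.items():
            scores[w] = leftover * c / unobs_total
    # Combine observed and unobserved candidates and keep the best ones.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy counts, invented for illustration only.
bigrams = Counter({("i", "love"): 3, ("love", "you"): 2,
                   ("love", "it"): 1, ("you", "too"): 1})
unigrams = Counter({"i": 3, "love": 3, "you": 3, "it": 1, "too": 1})

for word, prob in predict("love", bigrams, unigrams):
    print(f"{word}: {prob:.3f}")
```

In this sketch a larger discount moves more probability mass from observed bigrams to unobserved words, so unobserved candidates rise in the ranking.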

Performance

No marketing here! It works quite slowly.
The bigger the n-gram database, the slower the system.

  size time count
1  4.2  2.2   193
2 90.5  7.8  3045

Here size is the bigram database size in Mb, and count is the number of unique bigrams for “love”.

Technical details

To balance the mutually exclusive goals of speed and accuracy, I selected a smaller database
Internally, I use sqldf to read only the relevant n-grams
No smart indexing or other optimizations that a senior software developer might think of
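
The idea of reading only the relevant n-grams can be sketched with a filtered SQL query. The app itself uses sqldf in R; the Python version below uses the standard `sqlite3` module as a stand-in. The table name `bigrams`, its columns, and the helper `load_relevant_bigrams` are hypothetical.

```python
import sqlite3

def load_relevant_bigrams(conn, first_word):
    """Return (second, freq) rows for bigrams starting with first_word."""
    # The WHERE clause keeps the read limited to the n-grams of interest.
    cur = conn.execute(
        "SELECT second, freq FROM bigrams WHERE first = ? ORDER BY freq DESC",
        (first_word,),
    )
    return cur.fetchall()

# In-memory database with toy rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigrams (first TEXT, second TEXT, freq INTEGER)")
conn.executemany(
    "INSERT INTO bigrams VALUES (?, ?, ?)",
    [("i", "love", 3), ("love", "you", 2), ("love", "it", 1), ("you", "too", 1)],
)

rows = load_relevant_bigrams(conn, "love")
print(rows)
```

Without an index on the `first` column, a query like this scans the whole table, which is consistent with the timings growing as the database grows.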

User guide

Enter a sentence into the input text box
You can change the Discount to get more unobserved n-grams
The top graph is the final result with the top 10 predicted words
The middle graph is an intermediate result with observed and unobserved words
The last graph is the direct n-gram match

Closing slide

Thank you for reading my pitch!
This project was a great push to learn new things
Thanks to the JHU teaching team!