Andrei Boulgakov
22 June 2020
The model works on full statements (stopwords are not removed).
No marketing here!
It runs quite slowly.
The bigger the n-gram database, the slower the system.
database   size (MB)   time   count
1          4.2         2.2    193
2          90.5        7.8    3045

Here size is the bi-gram database size in MB, time is the lookup time, and count is the number of unique bigrams found for "love".
To deal with the mutually exclusive goals of speed and accuracy, I selected the smaller database.
For tokenization I used quanteda; the n-grams are stored in CSV files.
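The tokenization step might look roughly like the sketch below. It assumes the n-grams are counted and written out as a simple ngram,count CSV; the sample texts, column names, and file name are illustrative, not taken from the app.

```r
library(quanteda)

texts <- c("I love this course", "I love data science")

# Tokenize, dropping punctuation and numbers, then lowercase.
toks <- tokens_tolower(tokens(texts, remove_punct = TRUE,
                              remove_numbers = TRUE))

# Build space-joined bigrams and count how often each occurs.
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
freq <- sort(table(unlist(as.list(bigrams))), decreasing = TRUE)

# Persist the bigram table to CSV for the lookup step.
write.csv(data.frame(ngram = names(freq), count = as.integer(freq)),
          "bigrams.csv", row.names = FALSE)
```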
Internally I use sqldf to read only the n-grams of interest.
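The selective read can be done with read.csv.sql from the sqldf package, which runs the query against the file through SQLite so only matching rows are pulled into R. A minimal sketch, assuming the hypothetical bigrams.csv layout above:

```r
library(sqldf)

prefix <- "love"

# Only rows whose bigram starts with the prefix word are loaded,
# instead of reading the whole CSV into memory.
matches <- read.csv.sql(
  "bigrams.csv",
  sql = sprintf("select * from file where ngram like '%s %%'
                 order by count desc limit 10", prefix))
```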
No smart indexing or other optimizations that a senior software developer might think of.
Enter a sentence into the input text box.
You can raise the Discount to get more unobserved n-grams (see the sketch after this list).
The top graph is the final result, showing the top 10 predicted words.
The middle graph is the intermediate result, with observed and unobserved words.
The bottom graph shows the direct n-gram matches.
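For context: in a Katz-style back-off model, a discount subtracts probability mass from observed n-gram counts and reserves it for unobserved ones, which is presumably what the Discount control adjusts. A minimal illustration of that idea; backoff_mass, its arguments, and the default value are hypothetical, not the app's code:

```r
# Absolute discounting: subtract `discount` from each observed count
# and keep the leftover probability mass for unobserved n-grams.
backoff_mass <- function(counts, discount = 0.5) {
  observed_prob <- pmax(counts - discount, 0) / sum(counts)
  list(observed = observed_prob,
       unobserved_mass = 1 - sum(observed_prob))
}

# A larger discount leaves more mass for unobserved candidates, which
# is why raising Discount surfaces more unobserved n-grams.
backoff_mass(c(you = 30, it = 10, him = 5), discount = 0.75)
```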
Thank you for reading my pitch!
This project was a great push to learn new things
Thanks to the JHU teaching team!