Andrei Boulgakov
22 June 2020
The model works on full statements (stopwords are not removed).
No marketing here!
It runs quite slowly.
The bigger the n-gram database, the slower the system.
database   size (MB)   time   count
1          4.2         2.2    193
2          90.5        7.8    3045

Here size is the bi-gram database size in MB, time is the lookup time, and count is the number of unique bigrams found for "love".
To deal with the mutually exclusive goals of speed and accuracy, I selected the smaller database.
For tokenization I used quanteda; the n-grams are stored in CSV files.
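The tokenization step might look roughly like the sketch below. It assumes the n-grams are counted and written out as a simple ngram,count CSV; the sample texts, column names, and file name are illustrative, not taken from the app.

```r
library(quanteda)

texts <- c("I love this course", "I love data science")

# Tokenize, dropping punctuation and numbers, then lowercase.
toks <- tokens_tolower(tokens(texts, remove_punct = TRUE,
                              remove_numbers = TRUE))

# Build space-joined bigrams and count how often each occurs.
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
freq <- sort(table(unlist(as.list(bigrams))), decreasing = TRUE)

# Persist the bigram table to CSV for the lookup step.
write.csv(data.frame(ngram = names(freq), count = as.integer(freq)),
          "bigrams.csv", row.names = FALSE)
```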
Internally I use sqldf to read only the n-grams of interest.
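The selective read can be done with read.csv.sql from the sqldf package, which runs the query against the file through SQLite so only matching rows are pulled into R. A minimal sketch, assuming the hypothetical bigrams.csv layout above:

```r
library(sqldf)

prefix <- "love"

# Only rows whose bigram starts with the prefix word are loaded,
# instead of reading the whole CSV into memory.
matches <- read.csv.sql(
  "bigrams.csv",
  sql = sprintf("select * from file where ngram like '%s %%'
                 order by count desc limit 10", prefix))
```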
No smart indexing or other optimizations that a senior software developer might think of.
Enter a sentence into the input text box.
You can raise the Discount to get more unobserved n-grams (see the sketch after this list).
The top graph is the final result, showing the top 10 predicted words.
The middle graph is the intermediate result, with observed and unobserved words.
The bottom graph shows the direct n-gram matches.
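For context: in a Katz-style back-off model, a discount subtracts probability mass from observed n-gram counts and reserves it for unobserved ones, which is presumably what the Discount control adjusts. A minimal illustration of that idea; backoff_mass, its arguments, and the default value are hypothetical, not the app's code:

```r
# Absolute discounting: subtract `discount` from each observed count
# and keep the leftover probability mass for unobserved n-grams.
backoff_mass <- function(counts, discount = 0.5) {
  observed_prob <- pmax(counts - discount, 0) / sum(counts)
  list(observed = observed_prob,
       unobserved_mass = 1 - sum(observed_prob))
}

# A larger discount leaves more mass for unobserved candidates, which
# is why raising Discount surfaces more unobserved n-grams.
backoff_mass(c(you = 30, it = 10, him = 5), discount = 0.75)
```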
Thank you for reading my pitch!
This project was a great push to learn new things
Thanks to the JHU teaching team!