Data Science Specialization: Capstone Project

Brynjólfur Gauti Jónsson
2018-02-10

About the Application

Model

The app is based on Markov models. Given any state (the input text), it tries to predict the next state (the next text sequence).

Data

Trained on a large text corpus made available by SwiftKey, using the map-reduce philosophy: the dataset was split into smaller subsets, each subset was preprocessed on its own, and the results were merged back into one whole. This circumvents the limited RAM of any single computer.
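The split, preprocess, and merge steps described above can be sketched as follows. This is a minimal illustration and not the app's actual code: the function names (read_chunks, count_ngrams, merge_counts), the whitespace tokenizer, and the three-line toy corpus are all assumptions made for the example.

```python
from collections import Counter
from itertools import islice


def read_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_ngrams(chunk, n):
    """Map step: count the n-grams found in one chunk of lines."""
    counts = Counter()
    for line in chunk:
        words = line.lower().split()  # assumed tokenizer: lowercase, split on whitespace
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def merge_counts(partial_counts):
    """Reduce step: add the per-chunk counts into one table."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


# Hypothetical toy corpus standing in for the real data set.
corpus = [
    "at the end of the day",
    "by the end of the year",
    "at the end of the month",
]

# Only one chunk is held in memory at a time; the merged table is the only global state.
partials = (count_ngrams(chunk, 3) for chunk in read_chunks(corpus, 2))
trigram_counts = merge_counts(partials)
```

Because the map step only ever sees one chunk, peak memory depends on the chunk size and on the size of the merged count table, not on the size of the raw corpus.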

Outcome

A simple but effective text prediction model. Because the preprocessing is done in small chunks, it can easily be run on any system, even a small laptop.

Predicting with N-Grams

  • An n-gram is a unit of n words appearing in sequence.
  • By counting these n-grams we can find the most likely match for our next word given some input text.
  • If there is no match among the 6-grams, we try the 5-grams, and so on.
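The back-off lookup described in the bullets above might look roughly like this. It is a sketch under assumptions: the function name predict_next and the layout of tables (a dict from n to a dict of n-gram tuples and counts) are invented for illustration. The counts in the toy tables are taken from the 6-gram and 5-gram tables on these slides.

```python
def predict_next(text, tables, max_n=6):
    """Return the most frequent next word, backing off to shorter n-grams.

    tables maps n to a dict of {n-gram tuple: count} (an assumed layout).
    """
    words = text.lower().split()
    # Start with the longest usable n-gram and shrink the context on every miss.
    for n in range(min(max_n, len(words) + 1), 1, -1):
        context = tuple(words[len(words) - (n - 1):])
        candidates = {
            gram[-1]: count
            for gram, count in tables.get(n, {}).items()
            if gram[:-1] == context
        }
        if candidates:
            return max(candidates, key=candidates.get)
    return None  # no n-gram of any length matched


# A few rows copied from the tables on these slides.
tables = {
    6: {
        ("at", "the", "end", "of", "the", "day"): 1044,
        ("by", "the", "end", "of", "the", "year"): 254,
        ("at", "the", "end", "of", "the", "month"): 219,
    },
    5: {
        ("the", "end", "of", "the", "day"): 1298,
        ("the", "end", "of", "the", "year"): 718,
        ("the", "end", "of", "the", "month"): 482,
    },
}

print(predict_next("at the end of the", tables))     # matched among the 6-grams
print(predict_next("since the end of the", tables))  # falls back to the 5-grams
```

The linear scan over each table keeps the sketch short. A real lookup would more likely index the counts by context so that each step is a single dictionary access.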
6-Grams
word1 word2 word3 word4 word5 word6 n
at the end of the day 1044
on the other side of the 516
in the middle of the night 480
all you have to do is 362
this is going to be a 329
thank you so much for the 303
could not be reached for comment 302
let me know what you think 299
by the end of the year 254
vested interests vested interests vested interests 250
interests vested interests vested interests vested 249
happy mother’s day to all the 240
for the first time in a 237
for the rest of my life 236
rock and roll hall of fame 233
for the rest of the day 228
there is no such thing as 228
happy mothers day to all the 225
at the end of the month 219
and is subject to change or 214
as is and is subject to 214
certain content that appears on this 214
change or removal at any time 214
content is provided as is and 214
is and is subject to change 214
is provided as is and is 214
is subject to change or removal 214
provided as is and is subject 214
subject to change or removal at 214
this content is provided as is 214
to change or removal at any 214
a means for sites to earn 213
a participant in the amazon services 213
advertising and linking to amazon.com amazon.ca 213
advertising fees by advertising and linking 213
amazon eu associates programmes designed to 213
amazon eu this content is provided 213
amazon services llc and amazon eu 213
amazon services llc and or amazon 213
amazon.ca amazon.co.uk amazon.de amazon.fr amazon.it and 213
amazon.co.uk amazon.de amazon.fr amazon.it and amazon.es 213
amazon.com amazon.ca amazon.co.uk amazon.de amazon.fr amazon.it 213
amazon.de amazon.fr amazon.it and amazon.es certain 213
amazon.es certain content that appears on 213
amazon.fr amazon.it and amazon.es certain content 213
amazon.it and amazon.es certain content that 213
and amazon eu associates programmes designed 213
and amazon.es certain content that appears 213
and linking to amazon.com amazon.ca amazon.co.uk 213
and or amazon eu this content 213
5-Grams
word1 word2 word3 word4 word5 n
at the end of the 3624
in the middle of the 1809
for the first time in 1608
the end of the day 1298
by the end of the 1238
for the rest of the 1139
thank you so much for 1038
is going to be a 1020
there are a lot of 967
it’s going to be a 861
thanks for the shout out 858
to be a part of 833
is one of the most 811
let me know if you 793
the other side of the 786
for the first time since 769
can’t wait to see you 726
the end of the year 718
at the top of the 702
this is going to be 698
i can’t wait to see 692
on the other side of 684
thanks so much for the 631
thank you for the follow 619
i love you so much 610
and the rest of the 599
for those of you who 590
has nothing to do with 577
in the middle of a 571
keep up the good work 566
but at the same time 543
i thought it would be 541
at the time of the 534
hope you have a great 524
the middle of the night 517
is one of the best 513
the rest of the world 508
at the beginning of the 506
at the bottom of the 505
to be one of the 504
this is one of the 500
the rest of the day 496
for a chance to win 491
to figure out how to 487
in the bottom of the 483
the end of the month 482
i have no idea what 481
if you would like to 477
thanks for the follow i 461
happy mother’s day to all 445

Visualizing the Model

Graph

If there is an arrow pointing from one word to the next, those words are likely to appear in sequence. The darker the arrow, the higher the likelihood.

Model

For any input text, the model chooses the darkest arrow it can find and returns the word that the arrow points to.
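One way to read the graph picture in code is sketched below, with arrow darkness taken to be the share of times one word follows another. The function names (build_graph, darkest_arrow) and the bigram counts are hypothetical and are not taken from the app or its data.

```python
from collections import defaultdict


def build_graph(bigram_counts):
    """Turn bigram counts into weighted edges {word: {next_word: probability}}."""
    totals = defaultdict(int)
    for (first, _), count in bigram_counts.items():
        totals[first] += count
    graph = defaultdict(dict)
    for (first, second), count in bigram_counts.items():
        # Edge weight ("darkness"): share of times `second` follows `first`.
        graph[first][second] = count / totals[first]
    return graph


def darkest_arrow(graph, word):
    """Follow the heaviest outgoing edge, or return None if the word has none."""
    edges = graph.get(word)
    if not edges:
        return None
    return max(edges, key=edges.get)


# Hypothetical counts, for illustration only.
bigram_counts = {
    ("thank", "you"): 8,
    ("thank", "god"): 2,
    ("you", "so"): 3,
    ("you", "are"): 1,
}
graph = build_graph(bigram_counts)
print(darkest_arrow(graph, "thank"))
```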

About the Author

Brynjólfur Gauti Jónsson

Special Thanks