Ricardo Mavigno
12/13/2014
There is a saying that a monkey hitting keys at random on a typewriter for an infinite amount of time will almost surely type a given text, such as the complete works of William Shakespeare (the infinite monkey theorem).
With actual monkeys I think this is very improbable, but for a well-crafted NLP algorithm it may not be far from the truth :-)
Beyond prediction, the spontaneous formation of sentences based on n-gram
probabilities can give insights into a society's culture. Those who have seen the
phrase “I have a lot of work to do” appear just by starting with the word “I” will
understand what I'm saying.
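To make the idea concrete, here is a minimal sketch of how a phrase can grow out of a single word. It is written in Python purely for illustration and is not the application's actual code: it assumes a greedy bigram model (always append the most frequent follower), and the names `build_bigrams`, `babble`, `max_words` and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict

def build_bigrams(corpus):
    """Count, for every word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for sentence in corpus.lower().split("."):
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers

def babble(start, followers, max_words=8):
    """Grow a phrase from `start` by always appending the most frequent follower."""
    phrase = [start.lower()]
    # max_words is a hypothetical cap that keeps greedy choices from looping forever.
    while len(phrase) < max_words:
        options = followers.get(phrase[-1])
        if not options:
            break  # no known continuation for the last word: stop here
        phrase.append(options.most_common(1)[0][0])
    return " ".join(phrase)

# Toy corpus, invented for this example only.
corpus = "I have a lot of work to do. I have a lot of time. I like to do a lot."
followers = build_bigrams(corpus)
print(babble("I", followers))
```

With this toy corpus the call prints `i have a lot of work to do`: the phrase emerges only because those word pairs are the most frequent ones in the text the model was built from.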
Monkey Writer is an application that works very well at suggesting words, and it has a very nice feature that I call babble: input a word, press the Babble button, and it will make a complete phrase for you. Here are some insights I got while developing this application:
  ngram    %Positive
1 bigram    12.98307
2 trigram   17.52535
3 4-gram    18.50831
4 5-gram    19.13838
5 total     16.97376
  ngram    %Positive
1 <NA>       0.00000
2 bigram    14.71529
3 trigram   18.36345
4 4-gram    20.12977
5 5-gram    22.80735
6 total     17.82569
RECEIVE INPUT FROM USER
NORMALIZE INPUT TO MATCH THE NORMALIZATION OF THE NGRAMS
BREAK THE INPUT INTO SENTENCES
TAKE THE LAST SENTENCE
IF SENTENCE > 4 WORDS REDUCE IT TO ITS LAST 4 WORDS
IF SENTENCE = 4 WORDS
    FOUND MATCHES IN 5-GRAM?
        ACCUMULATE ALL MATCHES IN RESULT VECTOR
IF SENTENCE = 3 WORDS
    FOUND MATCHES IN 4-GRAM?
        ACCUMULATE ALL MATCHES IN RESULT VECTOR
IF SENTENCE = 2 WORDS
    FOUND MATCHES IN 3-GRAM?
        ACCUMULATE ALL MATCHES IN RESULT VECTOR
IF SENTENCE = 1 WORD
    FOUND MATCHES IN 2-GRAM?
        ACCUMULATE ALL MATCHES IN RESULT VECTOR
ELIMINATE DUPLICATES IN THE RESULT VECTOR
RETURN THE FIRST 3 MATCHES IN RESULT VECTOR
SHOW THE USER THE FIRST 3 MATCHES IN ORDER
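The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the application's real code: the function names (`normalize`, `last_sentence_words`, `build_tables`, `predict`), the normalization rules and the toy corpus are all hypothetical. In particular, the sketch assumes that after each lookup the context is shortened by dropping its oldest word, so that matches from shorter n-grams are accumulated after matches from longer ones.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and keep only letters, apostrophes and sentence punctuation."""
    return re.sub(r"[^a-z'.!?\s]", " ", text.lower())

def last_sentence_words(text):
    """Break the input into sentences and return the words of the last one."""
    sentences = [s for s in re.split(r"[.!?]+", normalize(text)) if s.strip()]
    return sentences[-1].split() if sentences else []

def build_tables(corpus, max_n=5):
    """For n = 2..max_n, map each (n-1)-word context to a Counter of next words."""
    tables = {n: {} for n in range(2, max_n + 1)}
    for sentence in re.split(r"[.!?]+", normalize(corpus)):
        words = sentence.split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[n].setdefault(context, Counter())[words[i + n - 1]] += 1
    return tables

def predict(text, tables, k=3):
    """Return up to k candidate next words, longest matching context first."""
    max_n = max(tables)
    words = last_sentence_words(text)[-(max_n - 1):]  # keep the last 4 words
    results = []
    # Assumption: back off by dropping the oldest word after each lookup.
    while words:
        followers = tables[len(words) + 1].get(tuple(words))
        if followers:
            results.extend(word for word, _ in followers.most_common())
        words = words[1:]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(results))[:k]

# Toy corpus, invented for this example only.
corpus = "I have a lot of work to do. I have a dog. I have a lot of time."
tables = build_tables(corpus)
print(predict("I have a", tables))
```

With this toy corpus the call prints `['lot', 'dog']`. Ordering candidates by longest context first is a design choice of the sketch: a match on four preceding words is treated as more specific than a match on one.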
The application tries to predict the next word of the user's phrase. It gives 3 alternatives based on the last 4 words entered in the textbox. The predicted alternatives appear above the textbox, and the user can choose the one she wants by pressing the control keys indicated above them.
The buttons below the textbox are helpers:
“Clear” will erase the textbox and put the application in the initial state.
“Babble” will generate phrases based on the words present in the textbox, with some parameters:
I think that the algorithm could be greatly improved by:
The application is here. You can also run the application directly from GitHub using the following command in your RStudio installation:
library(shiny)  # runGitHub() is provided by the shiny package
runGitHub("wordprev", "mavigno")
Enjoy! And if you want to send me any comments, complaints, or suggestions, just send an e-mail to ricardo@mavigno.com.