Monkey Writer

Ricardo Mavigno
12/13/2014

NLP Word Predictor

Foreword

There is a saying that a monkey hitting keys at random on a typewriter for an infinite amount of time will almost surely type any given text, such as the complete works of William Shakespeare (the infinite monkey theorem).

For actual monkeys I think this is very improbable, but for a well-crafted NLP algorithm it may not be far from the truth :-)

More than the prediction itself, the spontaneous formation of sentences from n-gram probabilities can give insights into a society's culture. Those who have seen the phrase “I have a lot of work to do” appear just by starting with the word “I” will understand what I mean.

Summary

Monkey Writer is an application that works very well at suggesting words. It also has a very nice feature that I call babble: input a word, press the Babble button, and it will build a complete phrase for you. Here are some insights I got while developing this application:

  • It's very difficult to achieve high accuracy with a general-purpose corpus like the one we used. In our tests the accuracy stayed between 13% and 19%. We need to investigate whether corpora from specialized domains (science, politics, etc.) could increase the accuracy.
  • The effect of vocabulary size turned out to be counterintuitive. We expected that increasing the size of the vocabulary would decrease the chance of a no-match in the n-grams, but the opposite happens. With a reduced vocabulary and an “unknown” tag for out-of-vocabulary words, the chance of a no-match goes toward zero, as the results below show. The tests with a bigger vocabulary (words at or above the 96% quantile) gave a slight increase in positive results but produced some NA results (no matches). With a smaller vocabulary we got no NAs.

  • Results of a test with a vocabulary restricted to words whose frequency is equal to or above the 99% quantile:
    ngram %Positive
1  bigram  12.98307
2 trigram  17.52535
3  4-gram  18.50831
4  5-gram  19.13838
5   total  16.97376
  • And with a vocabulary restricted to words whose frequency is equal to or above the 96% quantile:
    ngram %Positive
1    <NA>   0.00000
2  bigram  14.71529
3 trigram  18.36345
4  4-gram  20.12977
5  5-gram  22.80735
6   total  17.82569
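
The vocabulary cut-off behind these two tests can be illustrated with a minimal Python sketch. It is not the application's own code (the application runs in R); the function names build_vocabulary and replace_oov, the toy corpus, and the nearest-rank quantile rule are all assumptions made for illustration.

```python
from collections import Counter
import math

def build_vocabulary(tokens, quantile):
    """Keep the words whose count is at or above the given quantile of all word counts."""
    counts = Counter(tokens)
    if not counts:
        return set()
    ordered = sorted(counts.values())
    # Nearest-rank quantile (an assumption; R's quantile() interpolates by default).
    rank = max(1, math.ceil(quantile * len(ordered)))
    threshold = ordered[rank - 1]
    return {word for word, count in counts.items() if count >= threshold}

def replace_oov(tokens, vocabulary, unk="UNK"):
    """Replace every out-of-vocabulary token with the UNK tag."""
    return [t if t in vocabulary else unk for t in tokens]

# Toy corpus and toy cut-off; the tests in the slides use 0.99 and 0.96.
corpus = "the cat sat on the mat the cat".split()
vocabulary = build_vocabulary(corpus, 0.75)
print(sorted(vocabulary))
print(replace_oov("the dog sat".split(), vocabulary))
```

Because every rare word collapses into the same UNK tag, an input containing an unseen word can still match an n-gram that contains UNK, which is one way to read why the smaller vocabulary produced no NA results.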

Decisions

  • Use a small vocabulary: the words whose frequency is equal to or above the 99% quantile.
  • Use tags to mark the start of sentence (SOS) and end of sentence (EOS).
  • Replace all words with frequency below the 99% quantile with the tag UNK.
  • Eliminate undesired predictions, such as UNK or EOS, from the n-grams.
  • Also eliminate n-grams that combine partial phrases across a sentence boundary, such as “someword EOS SOS someword” or “EOS SOS someword someword”.
  • Reduce the number of predictions stored in the n-grams to save space (40 predictions for bigrams and 3 predictions for each of the others).
  • Use a backoff strategy with 5-grams, 4-grams, 3-grams and 2-grams, in that order.
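
As a rough illustration of how these decisions shape the n-gram tables, here is a minimal Python sketch (the application itself is written in R, and this is not its code). The function build_table, the table layout and the toy sentences are hypothetical. The sketch assumes the sentences are already tokenized and already have rare words replaced by UNK.

```python
from collections import Counter, defaultdict

SOS, EOS, UNK = "SOS", "EOS", "UNK"

def build_table(sentences, n, top_k):
    """Map each (n-1)-word context to its top_k most frequent next words."""
    counts = defaultdict(Counter)
    for words in sentences:
        # Tagging each sentence separately means no n-gram ever spans
        # an "EOS SOS" boundary between two sentences.
        tagged = [SOS] + list(words) + [EOS]
        for i in range(len(tagged) - n + 1):
            *context, target = tagged[i:i + n]
            if target in (UNK, EOS):
                continue  # never offer UNK or EOS as a prediction
            counts[tuple(context)][target] += 1
    # Keep only the top_k predictions per context to limit the table size.
    return {ctx: [w for w, _ in c.most_common(top_k)] for ctx, c in counts.items()}

sentences = [
    ["i", "have", "a", "dog"],
    ["i", "have", "a", "cat"],
    ["i", "have", "UNK"],
    ["i", "have", "a", "dog"],
]
bigrams = build_table(sentences, n=2, top_k=2)
print(bigrams[("a",)])
print(bigrams[("have",)])
```

UNK may still appear inside a context, so inputs with rare words can match; it is only excluded as a predicted word.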

The Application Algorithm

RECEIVE INPUT FROM USER
NORMALIZE INPUT TO MATCH THE NORMALIZATION OF THE NGRAMS
BREAK THE INPUT INTO SENTENCES
TAKE THE LAST SENTENCE
IF SENTENCE > 4 WORDS REDUCE INPUT TO THE LAST 4 WORDS
IF SENTENCE = 4 WORDS
    FOUND MATCHES IN 5-GRAM?
        ACCUMULATE ALL MATCHES IN RESULT VECTOR
IF SENTENCE = 3 WORDS
    FOUND MATCHES IN 4-GRAM?
        ACCUMULATE ALL MATCHES IN RESULT VECTOR
IF SENTENCE = 2 WORDS
    FOUND MATCHES IN 3-GRAM?
        ACCUMULATE ALL MATCHES IN RESULT VECTOR
IF SENTENCE = 1 WORD
    FOUND MATCHES IN 2-GRAM?
        ACCUMULATE ALL MATCHES IN RESULT VECTOR
ELIMINATE DUPLICATES IN THE RESULT VECTOR
RETURN THE FIRST 3 MATCHES IN RESULT VECTOR
SHOW TO USER THE FIRST 3 MATCHES BY ORDER
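
The steps above can be sketched in Python as follows. This is one reading of the pseudocode, not the application's R code. It assumes that backoff means trying the longest available context first and then each shorter one, accumulating matches along the way. The normalization rule (lowercase, letters and apostrophes only) and the layout of the tables dictionary (n-gram order mapped to context tuples mapped to ranked predictions) are hypothetical.

```python
import re

def predict(text, tables, max_results=3):
    """Return up to max_results next-word candidates, longest context first."""
    # Hypothetical normalization: lowercase, keep letters and apostrophes only.
    last_sentence = re.split(r"[.!?]+", text.lower())[-1]
    words = re.findall(r"[a-z']+", last_sentence)[-4:]  # at most the last 4 words
    results = []
    for k in range(len(words), 0, -1):  # k context words are looked up in the (k+1)-gram table
        context = tuple(words[-k:])
        for candidate in tables.get(k + 1, {}).get(context, []):
            if candidate not in results:  # eliminate duplicates, keep first-seen order
                results.append(candidate)
    return results[:max_results]

# Toy tables for illustration only.
tables = {
    2: {("have",): ["a", "to", "been"]},
    3: {("i", "have"): ["a", "been"]},
}
print(predict("Hello there. I have", tables))
```

Matches from longer n-grams come first in the result, so a shorter n-gram only fills the slots that the longer ones left empty.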

How to use the application

The application tries to predict the next word of the user's phrase. It gives 3 alternatives based on the last 4 words entered in the textbox. The predicted alternatives appear above the textbox, and the user can choose one by pressing the control keys indicated above them.

  • Ctrl-1 selects the word with the greatest probability,
  • Ctrl-2 selects the word with the second-greatest probability, and so on.

The buttons below the textbox are helpers:
“Clear” will erase the textbox and put the application in the initial state.
“Babble” will generate phrases based on the words present in the textbox, with some parameters:

  • # of words specifies how many words are generated from the input.
  • Random specifies how the words are chosen. If TRUE, each word is picked at random from the 3 words with the greatest probability. If FALSE, the choice is always the word with the greatest probability.
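
A minimal Python sketch of how Babble could work under these two parameters is shown below. It is an illustration, not the application's R code. The names babble, toy_predict and TOY_BIGRAMS are hypothetical, and the toy predictor looks only at the last word, whereas the application uses up to the last 4.

```python
import random

def babble(seed_words, predict, n_words, randomize, rng=None):
    """Extend seed_words by up to n_words predicted words and return the phrase."""
    rng = rng or random.Random()
    words = list(seed_words)
    for _ in range(n_words):
        candidates = predict(words)[:3]  # the 3 most probable next words
        if not candidates:
            break  # no match: stop early
        # Random = TRUE picks among the top 3; Random = FALSE always takes the best.
        words.append(rng.choice(candidates) if randomize else candidates[0])
    return " ".join(words)

# Toy predictor for illustration only: ranked next words keyed by the last word.
TOY_BIGRAMS = {
    "i": ["have", "am", "was"],
    "have": ["a", "to", "been"],
    "a": ["lot", "good", "new"],
    "lot": ["of"],
    "of": ["work", "the", "time"],
}

def toy_predict(words):
    return TOY_BIGRAMS.get(words[-1], [])

print(babble(["i"], toy_predict, n_words=4, randomize=False))
print(babble(["i"], toy_predict, n_words=4, randomize=True, rng=random.Random(0)))
```

With Random set to FALSE the output is fully determined by the seed words, so the same input always yields the same phrase.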

Final Words

I think that the algorithm could be greatly improved by:

  • Embedding a learning strategy based on the words already entered by the user and reducing the vocabulary to a minimum. This would make the application fall into the special case of a restricted domain (the user's domain).
  • Implementing a model to search for words as the user is typing.

The application is here. You can also run the application directly from GitHub using the following command in your RStudio installation:

library(shiny)
runGitHub("wordprev", "mavigno")

Enjoy! And if you want to send me any comments, complaints or suggestions, just send an e-mail to ricardo@mavigno.com.