Menno Oerlemans
January 7, 2018
The assignment is to build a prediction app (in Shiny) that takes a word or a sequence of words as input and predicts the next word. The prediction should be based on an N-gram model. The input data were provided as three text files (news, blogs and Twitter texts).
Main statistics about the files:
| File    | Size of file (MB) | Number of lines | Max. line length (characters) |
|---------|-------------------|-----------------|-------------------------------|
| Blogs   | 200.4242          | 899,288         | 40,835                        |
| News    | 196.2775          | 1,010,242       | 11,384                        |
| Twitter | 159.3641          | 2,360,148       | 213                           |
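These statistics can be reproduced with a few lines of R. The sketch below is illustrative only; the file names are assumptions (the standard names of the dataset), not taken from the report itself.

```r
# Minimal sketch (assumed file names): size, line count and maximum line length
files <- c(Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")

stats <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(`Size (MB)`        = round(file.size(f) / 1024^2, 4),
    `Number of lines`  = length(lines),
    `Max. line length` = max(nchar(lines)))
}))
stats
```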
The theory behind predicting the next word in a string of words with n-grams is explained in a series of videos by Professor Dan Jurafsky (theory video).
The basis is the chain rule of probability: what is the probability that a certain chain of words occurs?
It is not possible to estimate this probability directly from counts, because one cannot write down all possible English sentences. That is why the Markov assumption is used to simplify the definition of the probability: instead of conditioning on the whole collection of English sentences, you look only at the previous word(s) (one, two, three, etc.). The simplest version is the unigram model (each word on its own, without context); a bigram model conditions on the one previous word, a trigram model on the two previous words, and so on.
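In formula form (the notation used in the lecture videos): the chain rule decomposes the probability of a word sequence, and the Markov assumption approximates each factor by conditioning on only the previous word (the bigram case):

$$
P(w_1 w_2 \ldots w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}) \;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-1})
$$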
Although the model can be extended to longer contexts, an N-gram model remains an insufficient model of language (long-distance dependencies are not taken into account), but it can still give rewarding results.
Based on the N-gram counts, you can calculate the estimated probability of a next word: the number of times the full combination occurs divided by the number of times the prefix words occur. The word with the highest probability is then returned as the most likely next word in the sentence or chain of words. A minimal sketch of this idea is shown below.
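The sketch below illustrates the idea in R with a toy bigram model; the toy corpus and the `predict_next` function are illustrative assumptions, not the code of the actual app.

```r
# Toy corpus; in the real app the N-grams are built from the blogs/news/twitter data
corpus <- c("the cat sat on the mat",
            "the cat ran to the mat",
            "the dog sat on the rug")

# Count bigrams and their one-word prefixes, per sentence
tokens  <- strsplit(corpus, " ")
bigrams <- unlist(lapply(tokens, function(w) paste(head(w, -1), tail(w, -1))))
prefix  <- sub(" .*", "", bigrams)   # first word of each bigram

bigram_counts <- table(bigrams)
prefix_counts <- table(prefix)

# Estimated probability: count(prefix word, next word) / count(prefix word)
predict_next <- function(word) {
  candidates <- bigram_counts[startsWith(names(bigram_counts), paste0(word, " "))]
  if (length(candidates) == 0) return(NA)
  probs <- candidates / prefix_counts[word]
  sub(".* ", "", names(which.max(probs)))   # second word of the most probable bigram
}

predict_next("the")   # returns the most frequent word following "the" in the toy corpus
```

The actual app applies the same idea with higher-order N-grams built from the blogs, news and Twitter texts.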
Every language model has to be evaluated. The evaluation is based on the quality of the prediction (how often the model predicts well). N-gram language models score badly in such an evaluation unless the test data look just like the training data. A widely used measure is perplexity: the inverse probability of the test set, normalized by the number of words. So the lower the perplexity, the higher the probability. In an example with a training set of 38 million words, the perplexity was 962 for unigrams, 170 for bigrams and 109 for trigrams (see video 3), so the trigram model performs much better than the unigram model (a perplexity almost nine times lower).
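In formula form, the perplexity of a test set $W = w_1 w_2 \ldots w_N$ is the inverse probability of the test set, normalized by the number of words:

$$
PP(W) \;=\; P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} \;=\; \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
$$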
This project was done in a number of steps:
How the app works: