Data Science Capstone presentation

Project: Predicting next word

Menno Oerlemans

7 January 2018

Summary

The assignment is to build a prediction app (in Shiny) that takes a word or a sequence of words as input and predicts the next word. This should be based on an N-gram model. The input was given in the form of three text files (news, blogs and Twitter texts).

Main statistics about the files:

##  Filename  Size of file (MB)  Number of lines  Max. length of line
##     Blogs           200.4242           899288                40835
##      News           196.2775          1010242                11384
##   Twitter           159.3641          2360148                  213

Theory

The basis for the theory on predicting the next word from a string of words with an N-gram model is explained in the following videos by Professor Dan Jurafsky (theory video).

The basis is the chain rule of probability: what is the probability that a certain chain of words occurs?
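In formula form, the chain rule for a sequence of words w_1, ..., w_n is:

    P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})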

It is not possible to define the probability by simply counting, because it is not possible to write down all the sentences in English. That is why the Markov assumption is used to simplify the definition of the probability: instead of conditioning on the whole collection of previous words, you only look at the last one, two, three, etc. words. The simplest version is the unigram model (no previous words), then the bigram model (the previous word) and the trigram model (the two previous words), etc.
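For example, the bigram version of the Markov assumption approximates the full conditional probability by conditioning only on the previous word:

    P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})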

Although it is possible to extend the model, the N-gram model remains insufficient (because long-distance dependencies in language are not taken into account), but it can still give rewarding results.

Based on the N-gram structure, you can calculate the estimated probability: the number of times the word combination occurs divided by the number of times the prefix words occur. The word with the highest probability is then returned as the most likely next word in the sentence or chain of words.
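As an illustration, a minimal sketch in base R of this estimated (maximum likelihood) probability for a bigram, computed on a small toy corpus (the corpus and the variable names are examples, not taken from the project code):

```r
# Toy corpus; in the project this would be the cleaned blogs/news/Twitter text
corpus <- c("the cat sat on the mat", "the cat ate the fish")

# Split each line into words and build the bigrams within each line
words_per_line <- strsplit(corpus, " ")
words   <- unlist(words_per_line)
bigrams <- unlist(lapply(words_per_line,
                         function(w) paste(head(w, -1), tail(w, -1))))

# Estimated probability of "cat" following "the":
# count("the cat") / count("the")
sum(bigrams == "the cat") / sum(words == "the")   # 2 / 4 = 0.5
```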

Every language model has to be evaluated. The evaluation is based on the quality of the prediction (how often the model predicts the next word correctly). N-gram language models score badly in this evaluation unless the test data looks just like the training data. Perplexity is used a lot: it is the inverse probability of the test set, normalized by the number of words. So the lower the perplexity, the higher the probability. In an example with a training set of 38 million words, the perplexity for unigrams was 962, for bigrams 170 and for trigrams 109 (see video 3). So the trigrams perform much better than the unigrams (roughly 9 times lower perplexity).
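In formula form, the perplexity of a test set W = w_1 w_2 ... w_N is the inverse probability of the test set, normalized by the number of words:

    PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}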

Approach building the app

This project was done in a number of steps:

  1. Loading and cleaning the data: after the data was loaded, it was cleaned of punctuation, extra spaces, numbers, etc. and was converted to lowercase only (a sketch of this cleaning is shown after this list).
  2. Exploring the data: the data was studied, looking into the file sizes, etc. (see sheet 2). After that the content of the files was explored, looking into the frequencies of words and n-grams, stopwords, etc. Link: Cleaning and exploring
  3. Preparing the data for the app: the total dataset was too big for my computer, so I had to use a smaller part of the original data. After preparing the N-grams, I found out that there were a lot of non-English words and spelling mistakes. Rows in the N-grams containing these words were removed. Link: Preparing the data
  4. Building the prediction model: first the input the user enters is cleaned. I simplified the model by keeping, for N-grams with the same prefix, only the prediction with the highest frequency, and by selecting the result based on the highest possible N-gram, falling back to the (N-1)-gram if nothing is found, and so on. If no result is found, the message "Sorry, I can't find the answer!" is returned (see the sketch after this list). Link: Building prediction model
  5. Building the Shiny app: the app was built and tested. Link: Shiny app
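A minimal sketch, in base R, of how the cleaning of the user input and the back-off lookup from steps 1 and 4 could look. The function names (clean_input, predict_next_word) and the structure of the N-gram tables (one data frame per order, with columns prefix, prediction and freq) are illustrative assumptions, not the exact project code:

```r
# Illustrative clean-up of the user input: lowercase, strip punctuation,
# numbers and extra spaces (mirrors the cleaning applied to the corpus)
clean_input <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:punct:]]", "", text)
  text <- gsub("[0-9]+", "", text)
  gsub("\\s+", " ", trimws(text))
}

# Back-off lookup: try the highest-order N-gram table first; if the prefix
# is not found, fall back to the (N-1)-gram table, and so on.
# 'ngram_tables' is assumed to be a list of data frames (highest order first),
# each with the columns prefix, prediction and freq; only the most frequent
# prediction per prefix is kept, as described in step 4.
predict_next_word <- function(input, ngram_tables) {
  words <- strsplit(clean_input(input), " ")[[1]]
  for (tab in ngram_tables) {
    n <- length(strsplit(tab$prefix[1], " ")[[1]])   # prefix length of this table
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    hit <- tab[tab$prefix == prefix, ]
    if (nrow(hit) > 0) {
      return(hit$prediction[which.max(hit$freq)])
    }
  }
  "Sorry, I can't find the answer!"
}

# Example call (with hypothetical pre-built tables, highest order first):
# predict_next_word("I want to", list(quadgram_table, trigram_table, bigram_table))
```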

Shiny app explanation

How the app works: