Text Prediction using an N-Gram Language Model



plot of chunk unnamed-chunk-1





Coursera Data Science Specialization Capstone Project

Victor Ruiz

N-gram Language Models

  • Given a sequence of words, a language model can be used to predict the probability of upcoming words given the previous words
  • In n-gram models, only the \( n-1 \) last words are considered relevant when predicting the next word (Markov assumption)
    • \( \tiny P(w_{k}\mid w_{1}, w_{2}, w_{3}, ..., w_{k-2}, w_{k-1}) \approx P(w_{k}\mid w_{k-n+1}, ... ,w_{k-1}) \)
  • Maximum-likelihood probability estimates (MLE), maximize the likelihood on the training data
    • \( \tiny P_{ML}(w_{k}\mid w_{k-n+1}, ... ,w_{k-1}) = \tiny \frac{c(w_{k-n+1}^{k})}{c(w_{k-n+1}^{k-1})} \), where \( \tiny w_{i}^{j} =w_{i}, w_{i+1},..,w_{j} \)
    • \( c(w_{i}^{j}) \) is the number of occurrences of the n-gram \( w_{i}^{j} \)
  • Problem of MLE: probabilities of n-grams unseen in training data are undefined
  • Solution: Mix higher order with lower order models
    • Interpolation vs backoff? Interpolation performs better

Building the Language Model

  • Prepare corpus: clean and normalize text, remove non-ascii characters, delimit sentences with special marks, mark some special classes of words (urls, numbers, time, dates, …)
  • Split corpus in training (90%), development test (5%) and testing (5%)
  • Prior to generate the n-grams model from training corpus, vocabulary is reduced to a subset of words covering 97.5% of total occurrences, that is, about 40,500 words. Any other words are marked as unknown
  • Split training corpus in 4-grams and get their counts. Use the counts to build models containing MLE probabilities for 1-grams, 2-grams, 3-grams and 4-grams. Additionally, build four more models, including all possible partial matches: \( \tiny(w_1,w_2,*,w_4), (w_1, *,w_3,w_4), (w_1, *,*,w_4), (*, w_2, *,w_4) \). Then, the models are pruned, only n-grams with count > 1 and only the 3 predicted words \( w_4 \) with highest score for each context are saved
  • The eight n-gram models obtained in the previous step are interpolated with coefficients \( \tiny a_{123} + a_{12} + a_{13} + a_{23} + a_{1} + a_{2} + a_{3} + a_{4}= 1 \)

Evaluation of the Language Model

Training corpus twitter news blogs
lines 2,140,000 900,000 800,000
words (millions) 27.17 30.76 33.32
unique 1-grams 265,523 224,521 263,565
unique 4-grams (millions) 16.85 23.12 25.54
  • Development test corpus: 215,000 lines / 52 million words
    • used to determine the optimum values for interpolation coefficients, in order to minimize entropy and maximize accuracy
  • Test corpus: 215,000 lines / 51.7 million words
    • Cross-entropy: 3.086 (4-grams)
    • Accuracy: 21.1% (4-grams)
    • Avg Speed: 15ms (per 4-gram query)

Text Prediction App

  • Accept a sequence of words, introduced by the user in a text input
  • Clean and normalize input text. It MUST be processed EXACTLY in the same way as the training data
  • Then use n-gram model to generate predictions for the next word from the last three words
  • Display a button with the predicted word, i.e, word with highest probability. If this button is clicked, the word will be automaticcally appended to the input text
  • Besides this, all the predicted words are shown in a wordcloud
  • Disclaimer: Due to the limited resources available in shinyapps.io, the model had to be pruned a bit more, so the memory footprint of the app can fit the requirements. In the 4-gram models, n-grams with count lower than 2 were discarded and only 2 best choices were saved. In the 3-gram models, only the 2 best predictions were saved. With this, accuracy dropped to 19.6%