Text Prediction using an N-Gram Language Model

plot of chunk unnamed-chunk-1

Coursera Data Science Specialization Capstone Project

Victor Ruiz

Given a sequence of words, a language model can be used to predict the probability of upcoming words given the previous words
In n-gram models, only the \( n-1 \) last words are considered relevant when predicting the next word (Markov assumption)
- \( \tiny P(w_{k}\mid w_{1}, w_{2}, w_{3}, ..., w_{k-2}, w_{k-1}) \approx P(w_{k}\mid w_{k-n+1}, ... ,w_{k-1}) \)
Maximum-likelihood probability estimates (MLE), maximize the likelihood on the training data
- \( \tiny P_{ML}(w_{k}\mid w_{k-n+1}, ... ,w_{k-1}) = \tiny \frac{c(w_{k-n+1}^{k})}{c(w_{k-n+1}^{k-1})} \), where \( \tiny w_{i}^{j} =w_{i}, w_{i+1},..,w_{j} \)
- \( c(w_{i}^{j}) \) is the number of occurrences of the n-gram \( w_{i}^{j} \)
Problem of MLE: probabilities of n-grams unseen in training data are undefined
Solution: Mix higher order with lower order models
- Interpolation vs backoff? Interpolation performs better

Prepare corpus: clean and normalize text, remove non-ascii characters, delimit sentences with special marks, mark some special classes of words (urls, numbers, time, dates, …)
Split corpus in training (90%), development test (5%) and testing (5%)
Prior to generate the n-grams model from training corpus, vocabulary is reduced to a subset of words covering 97.5% of total occurrences, that is, about 40,500 words. Any other words are marked as unknown
Split training corpus in 4-grams and get their counts. Use the counts to build models containing MLE probabilities for 1-grams, 2-grams, 3-grams and 4-grams. Additionally, build four more models, including all possible partial matches: \( \tiny(w_1,w_2,*,w_4), (w_1, *,w_3,w_4), (w_1, *,*,w_4), (*, w_2, *,w_4) \). Then, the models are pruned, only n-grams with count > 1 and only the 3 predicted words \( w_4 \) with highest score for each context are saved
The eight n-gram models obtained in the previous step are interpolated with coefficients \( \tiny a_{123} + a_{12} + a_{13} + a_{23} + a_{1} + a_{2} + a_{3} + a_{4}= 1 \)

Training corpus	twitter	news	blogs
lines	2,140,000	900,000	800,000
words (millions)	27.17	30.76	33.32
unique 1-grams	265,523	224,521	263,565
unique 4-grams (millions)	16.85	23.12	25.54

Development test corpus: 215,000 lines / 52 million words
- used to determine the optimum values for interpolation coefficients, in order to minimize entropy and maximize accuracy
Test corpus: 215,000 lines / 51.7 million words
- Cross-entropy: 3.086 (4-grams)
- Accuracy: 21.1% (4-grams)
- Avg Speed: 15ms (per 4-gram query)

Accept a sequence of words, introduced by the user in a text input
Clean and normalize input text. It MUST be processed EXACTLY in the same way as the training data
Then use n-gram model to generate predictions for the next word from the last three words
Display a button with the predicted word, i.e, word with highest probability. If this button is clicked, the word will be automaticcally appended to the input text
Besides this, all the predicted words are shown in a wordcloud
Disclaimer: Due to the limited resources available in shinyapps.io, the model had to be pruned a bit more, so the memory footprint of the app can fit the requirements. In the 4-gram models, n-grams with count lower than 2 were discarded and only 2 best choices were saved. In the 3-gram models, only the 2 best predictions were saved. With this, accuracy dropped to 19.6%