\( c(w_{i}^{j}) \) is the number of occurrences of the n-gram \( w_{i}^{j} \)
Problem of MLE: probabilities of n-grams unseen in training data are undefined
Solution: Mix higher order with lower order models
Interpolation vs backoff? Interpolation performs better
Building the Language Model
Prepare corpus: clean and normalize text, remove non-ascii characters, delimit sentences with special marks, mark some special classes of words (urls, numbers, time, dates, …)
Split corpus in training (90%), development test (5%) and testing (5%)
Prior to generate the n-grams model from training corpus, vocabulary is reduced to a subset of words covering 97.5% of total occurrences, that is, about 40,500 words. Any other words are marked as unknown
Split training corpus in 4-grams and get their counts. Use the counts to build models containing MLE probabilities for 1-grams, 2-grams, 3-grams and 4-grams. Additionally, build four more models, including all possible partial matches: \( \tiny(w_1,w_2,*,w_4), (w_1, *,w_3,w_4), (w_1, *,*,w_4), (*, w_2, *,w_4) \). Then, the models are pruned, only n-grams with count > 1 and only the 3 predicted words \( w_4 \) with highest score for each context are saved
The eight n-gram models obtained in the previous step are interpolated with coefficients \( \tiny a_{123} + a_{12} + a_{13} + a_{23} + a_{1} + a_{2} + a_{3} + a_{4}= 1 \)
Evaluation of the Language Model
Training corpus
twitter
news
blogs
lines
2,140,000
900,000
800,000
words (millions)
27.17
30.76
33.32
unique 1-grams
265,523
224,521
263,565
unique 4-grams (millions)
16.85
23.12
25.54
Development test corpus: 215,000 lines / 52 million words
used to determine the optimum values for interpolation coefficients, in order to minimize entropy and maximize accuracy
Test corpus: 215,000 lines / 51.7 million words
Cross-entropy: 3.086 (4-grams)
Accuracy: 21.1% (4-grams)
Avg Speed: 15ms (per 4-gram query)
Text Prediction App
Accept a sequence of words, introduced by the user in a text input
Clean and normalize input text. It MUST be processed EXACTLY in the same way as the training data
Then use n-gram model to generate predictions for the next word from the last three words
Display a button with the predicted word, i.e, word with highest probability. If this button is clicked, the word will be automaticcally appended to the input text
Besides this, all the predicted words are shown in a wordcloud
Disclaimer: Due to the limited resources available in shinyapps.io, the model had to be pruned a bit more, so the memory footprint of the app can fit the requirements. In the 4-gram models, n-grams with count lower than 2 were discarded and only 2 best choices were saved. In the 3-gram models, only the 2 best predictions were saved. With this, accuracy dropped to 19.6%