Please feel free to contact us with any feedback you have at ivan@paspeur.com.
\( n \)-gram Language Models
We implemented three \( n \)-gram language models that use different intuitions to deal with a key data-sparsity observation: a large number of \( n \)-grams occur so infrequently that they are never observed in the training dataset.
Next-Word Prediction
With these language models in place, we then predict the next word as follows:
(Note: We used \( 1 \)-, \( 2 \)-, and \( 3 \)-grams)
| Variables | \( 3 \)-grams | \( 2 \)-grams | \( 1 \)-grams |
|---|---|---|---|
| \( q_{ML}(\cdot) \)(1,2) | \( q_{ML}(w_{3} | w_{1}, w_{2}) = \frac{c(w_{1}, w_{2}, w_{3})}{c(w_{1}, w_{2})} \) | \( q_{ML}(w_{2} | w_{1}) = \frac{c(w_{1}, w_{2})}{c(w_{1})} \) | \( q_{ML}(w_{1}) = \frac{c(w_{1})}{c(\cdot)} \) |
| \( q_{LI}(\cdot) \)(1,2) | \( q_{LI}(w_{3} | w_{1}, w_{2}) = \lambda^{3}_{3} \cdot q_{ML}(w_{3} | w_{1}, w_{2}) \) \( + \lambda^{3}_{2} \cdot q_{ML}(w_{3} | w_{2}) \) \( + \lambda^{3}_{1} \cdot q_{ML}(w_{3}) \) | \( q_{LI}(w_{2} | w_{1}) = \lambda^{2}_{2} \cdot q_{ML}(w_{2} | w_{1}) + \lambda^{2}_{1} \cdot q_{ML}(w_{2}) \) | \( q_{LI}(w_{1}) = q_{ML}(w_{1}) \) |
| \( d^{*}(\cdot) \)(3) | \( d^{*}(w_{1}, w_{2}, w_{3}) = \left (1 + \frac{1}{c(w_{1}, w_{2}, w_{3})} \right ) ^ {1+b_{3}} \) | \( d^{*}(w_{1}, w_{2}) = \left (1 + \frac{1}{c(w_{1}, w_{2})} \right ) ^ {1+b_{2}} \) | \( d^{*}(w_{1}) = \left (1 + \frac{1}{c(w_{1})} \right ) ^ {1+b_{1}} \) |
| \( q_{BO}(\cdot) \)(1,2,4) | \( q_{BO}(w_{3} | w_{1}, w_{2}) = \begin{cases} d^{*}(w_{1}, w_{2}, w_{3}) \cdot q_{ML}(w_{3} | w_{1}, w_{2}) \\ \text{ if } c(w_{1}, w_{2}, w_{3}) > 0 \\ \alpha(w_{1}, w_{2}) \cdot q_{BO}(w_{3} | w_{2}) \\ \text{ otherwise} \end{cases} \) \( \alpha(w_{1}, w_{2}) = \frac{\beta(w_{1}, w_{2})}{\sum_{w_{i}: c(w_{1}, w_{2}, w_{i}) = 0} q_{BO}(w_{i} | w_{2})} \) \( \beta(w_{1}, w_{2}) = 1 - \sum_{w_{i}: c(w_{1}, w_{2}, w_{i}) > 0} d^{*}(w_{1}, w_{2}, w_{i}) \cdot q_{ML}(w_{i} | w_{1}, w_{2}) \) | \( q_{BO}(w_{2} | w_{1}) = \begin{cases} d^{*}(w_{1}, w_{2}) \cdot q_{ML}(w_{2} | w_{1}) \\ \text{ if } c(w_{1}, w_{2}) > 0 \\ \alpha(w_{1}) \cdot q_{BO}(w_{2}) \\ \text{ otherwise} \end{cases} \) \( \alpha(w_{1}) = \frac{\beta(w_{1})}{\sum_{w_{i}: c(w_{1}, w_{i}) = 0} q_{BO}(w_{i})} \) \( \beta(w_{1}) = 1 - \sum_{w_{i}: c(w_{1}, w_{i}) > 0} d^{*}(w_{1}, w_{i}) \cdot q_{ML}(w_{i} | w_{1}) \) | \( q_{BO}(w_{1}) = d^{*}(w_{1}) \cdot q_{ML}(w_{1}) \) |
| \( q_{KN}(\cdot) \)(2,5) | \( q_{KN}(w_{3} | w_{1}, w_{2}) = \frac{\max(c(w_{1}, w_{2}, w_{3}) - \delta, 0)}{\sum_{w_{i}} c(w_{1}, w_{2}, w_{i})} \) \( + \lambda(w_{1}, w_{2}) \cdot q_{KN}(w_{3} | w_{2}) \) \( \lambda(w_{1}, w_{2}) = \frac{\delta}{\sum_{w_{i}} c(w_{1}, w_{2}, w_{i})} \cdot \left | \{ w_{i} : c(w_{1}, w_{2}, w_{i}) > 0 \} \right | \) | \( q_{KN}(w_{2} | w_{1}) = \frac{\max(c(w_{1}, w_{2}) - \delta, 0)}{\sum_{w_{i}} c(w_{1}, w_{i})} \) \( + \lambda(w_{1}) \cdot q_{KN}(w_{2}) \) \( \lambda(w_{1}) = \frac{\delta}{\sum_{w_{i}} c(w_{1}, w_{i})} \cdot \left | \{ w_{i} : c(w_{1}, w_{i}) > 0 \} \right | \) | \( q_{KN}(w_{1}) = \frac{\left | \{ w_{i} : c(w_{i}, w_{1}) > 0 \} \right |}{\left | \{ (w_{i}, w_{j}) : c(w_{i}, w_{j}) > 0 \} \right |} \) |
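To make the Kneser–Ney column concrete, here is a minimal Python sketch of those three formulas computed directly from raw count tables. This is an illustration under our own naming, not our actual implementation: the helpers (`build_counts`, `q_kn_unigram`, `q_kn_bigram`, `q_kn_trigram`) and the default discount `delta = 0.5` are assumptions, the counts live in in-memory dictionaries rather than in the SQLite store described below, and an entirely unseen history simply falls back to the lower-order estimate.

```python
from collections import Counter

def build_counts(sentences):
    """Raw 1-, 2-, and 3-gram counts from tokenized sentences (illustrative helper)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        uni.update(words)                            # not needed by Kneser-Ney itself,
        bi.update(zip(words, words[1:]))             # which relies on continuation counts,
        tri.update(zip(words, words[1:], words[2:])) # but mirrors the 1-gram column
    return uni, bi, tri

def q_kn_unigram(w, bi):
    # Continuation probability: distinct bigrams ending in w over all distinct bigrams.
    return sum(1 for (_, b) in bi if b == w) / len(bi)

def q_kn_bigram(w2, w1, bi, delta=0.5):
    total = sum(c for (a, _), c in bi.items() if a == w1)   # sum_i c(w1, wi)
    if total == 0:                                          # unseen history: back off
        return q_kn_unigram(w2, bi)
    followers = sum(1 for (a, _) in bi if a == w1)          # |{wi : c(w1, wi) > 0}|
    lam = delta * followers / total
    return max(bi[(w1, w2)] - delta, 0) / total + lam * q_kn_unigram(w2, bi)

def q_kn_trigram(w3, w1, w2, bi, tri, delta=0.5):
    total = sum(c for (a, b, _), c in tri.items() if (a, b) == (w1, w2))
    if total == 0:                                          # unseen history: back off
        return q_kn_bigram(w3, w2, bi, delta)
    followers = sum(1 for (a, b, _) in tri if (a, b) == (w1, w2))
    lam = delta * followers / total
    return max(tri[(w1, w2, w3)] - delta, 0) / total + lam * q_kn_bigram(w3, w2, bi, delta)
```

With `uni, bi, tri = build_counts(corpus)`, ranking candidate words by `q_kn_trigram(w, w1, w2, bi, tri)` gives the Kneser–Ney prediction for the history `(w1, w2)`. The linear scans over the count tables keep the sketch short; the pre-computed, indexed storage described below avoids them.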
Please note that we used the Good–Turing smoothing method to compute the discounts used by Katz's Back-Off and Kneser–Ney smoothing (Good–Turing is not a model by itself); a short sketch of this computation follows the references below.
(1) Columbia — Natural Language Processing Course, Michael Collins, Coursera
(2) Stanford — Natural Language Processing Course, Dan Jurafsky and Christopher Manning, Coursera
(3) Good–Turing Smoothing Without Tears, William A. Gale, AT&T Bell Laboratories
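As a rough illustration of how the exponents \( b_{n} \) in the \( d^{*}(\cdot) \) row can be obtained following reference (3), the sketch below fits \( \log N_{r} = a + b \log r \) to the counts-of-counts by ordinary least squares and turns the slope into the discount \( d^{*}(c) = (1 + 1/c)^{1+b} \). The function names are ours, and this is the plain log–log regression, without the averaging over empty frequency classes that Gale recommends.

```python
import math
from collections import Counter

def gt_exponent(ngram_counts):
    """Slope b of the least-squares fit log(N_r) = a + b*log(r),
    where N_r is the number of n-grams observed exactly r times."""
    counts_of_counts = Counter(ngram_counts.values())   # r -> N_r
    xs, ys = [], []
    for r, n_r in sorted(counts_of_counts.items()):
        xs.append(math.log(r))
        ys.append(math.log(n_r))
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
           sum((x - x_bar) ** 2 for x in xs)

def gt_discount(c, b):
    """Good-Turing discount ratio d*(c) = c*/c for an n-gram seen c > 0 times."""
    return (1 + 1 / c) ** (1 + b)
```

For the fit to yield an actual discount (\( d^{*} < 1 \)), the slope must satisfy \( b < -1 \), the same condition reference (3) places on the log–log fit; the probability mass removed this way is what \( \beta \) and \( \alpha \) redistribute in the Katz Back-Off row.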
In the offline language model phase, we pre-computed all \( \alpha \)'s, \( \beta \)'s, \( \delta \)'s, \( d^{*} \)'s, and \( q^{*} \)'s for our (observed) \( n \)-grams (\( 1 \leq n \leq 3 \)). These building blocks were then stored in a SQLite database.
Storing these pre-computed building blocks in a SQLite database leaves minimal work to the back-end (queries are simple, with few joins) and keeps the memory footprint small (the model stays on disk and only a little data is loaded into main memory).
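As an illustration of this design, the sketch below shows one way such pre-computed blocks could be laid out in SQLite so that a next-word lookup is a single indexed `SELECT`. The schema, table, and column names here are hypothetical and do not reflect the exact layout of our database.

```python
import sqlite3

conn = sqlite3.connect("ngrams.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS trigram (
    w1   TEXT NOT NULL,
    w2   TEXT NOT NULL,
    w3   TEXT NOT NULL,
    q_bo REAL NOT NULL,           -- pre-computed Katz Back-Off probability
    q_kn REAL NOT NULL,           -- pre-computed Kneser-Ney probability
    PRIMARY KEY (w1, w2, w3)
);
""")

def top_next_words(w1, w2, k=3):
    """Top-k candidates for the history (w1, w2), read straight from the
    pre-computed trigram probabilities -- a single query, no joins."""
    return conn.execute(
        "SELECT w3, q_bo FROM trigram WHERE w1 = ? AND w2 = ? "
        "ORDER BY q_bo DESC LIMIT ?",
        (w1, w2, k),
    ).fetchall()

# If the query returns nothing, the back-end would issue the analogous query
# against a bigram table, mirroring the back-off structure of the models above.
```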
For more information on the data clean-up and the \( n \)-gram generation, please refer to our earlier report.