Jacob Govshteyn
Aug 2015
SwiftKey capstone project.
Internals:
N-grams of order 2 to 5 were constructed from ~150,000 lines of training data drawn from news, blogs, and Twitter.
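As an illustration of the construction step, here is a minimal Python sketch of counting n-grams of order 2 to 5. It is not the app's actual code: the function names `ngrams` and `count_ngrams`, the lowercase-and-split tokenization, and the two sample lines are all assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, min_n=2, max_n=5):
    """Count n-grams of every order from min_n to max_n across all lines."""
    counts = {n: Counter() for n in range(min_n, max_n + 1)}
    for line in lines:
        # Assumption: simple lowercase + whitespace tokenization.
        tokens = line.lower().split()
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts

# Hypothetical toy corpus standing in for the real training lines.
sample = ["the cat sat on the mat", "the cat ate"]
counts = count_ngrams(sample)
print(counts[2].most_common(1))
```

Lines shorter than n tokens simply contribute no n-grams of that order, so short tweets still feed the lower-order tables.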
To analyze n-gram frequencies, the following preprocessing steps were performed:
<UNK> placeholder
Enter Partial Phrase in Text Box
Submit Server Request
Complete The Phrase
We want a heuristic that more accurately estimates the number of times we might expect to see word w in a new, unseen context. The Kneser-Ney intuition is to base our estimate on the number of different contexts word w has appeared in (Huang, X. & Deng, L. (2010). An Overview of Modern Speech Recognition).
\( P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1}, w')} + \lambda(w_{i-1}) \dfrac{\left| \{ w' : c(w', w_i) > 0 \} \right|}{\left| \{ (w', w'') : c(w', w'') > 0 \} \right|} \)
where \[ \lambda(w_{i-1}) = \dfrac{\delta}{c(w_{i-1})} \left| \{w' : c(w_{i-1}, w') > 0\} \right| \]
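To make the formula concrete, the sketch below computes the bigram Kneser-Ney probability in Python from a table of bigram counts. It is an illustrative sketch and not the app's implementation: the function name `kneser_ney_bigram`, the discount value 0.75, the toy counts, and the fallback for an unseen context are assumptions.

```python
from collections import Counter

def kneser_ney_bigram(bigram_counts, delta=0.75):
    """Return a function prob(w, prev) giving P_KN(w | prev).

    bigram_counts maps (prev, w) tuples to counts; delta is the discount.
    """
    context_totals = Counter()  # c(prev): sum over w' of c(prev, w')
    followers = Counter()       # number of distinct w' with c(prev, w') > 0
    preceders = Counter()       # number of distinct contexts w appeared in
    for (prev, w), c in bigram_counts.items():
        if c > 0:
            context_totals[prev] += c
            followers[prev] += 1
            preceders[w] += 1
    bigram_types = sum(preceders.values())  # number of distinct bigram types

    def prob(w, prev):
        continuation = preceders[w] / bigram_types
        total = context_totals[prev]
        if total == 0:
            # Assumption: for an unseen context, use the continuation term alone.
            return continuation
        discounted = max(bigram_counts.get((prev, w), 0) - delta, 0) / total
        lam = delta / total * followers[prev]
        return discounted + lam * continuation

    return prob

# Hypothetical toy counts.
bigrams = Counter({("the", "cat"): 2, ("the", "mat"): 1, ("a", "cat"): 1})
p = kneser_ney_bigram(bigrams)
print(p("cat", "the"), p("mat", "the"))
```

In this toy example the discounted mass removed from the seen bigrams is redistributed through the continuation term, so the probabilities of all words following "the" still sum to 1.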
Links and references
Word Predictor Shiny app
Data Science Specialization by Johns Hopkins University
Natural Language Processing by Stanford University on Coursera