Alexander Alexandrov
Thursday, July 07, 2016
Smart Keyboard makes it easier for people to type on their mobile devices. One cornerstone of a smart keyboard is its predictive text model.
When someone types: I went to the
The keyboard presents several options: gym, store, restaurant
Manual:
Features:
The training data is “raw” text from blogs, news, and Twitter:
| Source | Size | Message Count | Word Count |
|---|---|---|---|
| Blogs | 200 MB | 899,288 | 38,031,339 |
| News | 196 MB | 77,259 | 2,643,972 |
| Twitter | 159 MB | 2,360,148 | 30,374,033 |
The data was cleaned of punctuation, numbers, stop words, profanity and swear words, URLs, email addresses, and account names, because these tokens do not make sense in the context of word prediction.
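The cleaning step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_text`, the regular expressions, and the tiny `STOP_WORDS` and `PROFANITY` sets are all assumptions made for the example, and a real pipeline would load full word lists.

```python
import re

# Hypothetical placeholder lists; the real stop-word and profanity
# lists used in the project are not given in the source.
STOP_WORDS = {"the", "a", "an", "to", "of", "and"}
PROFANITY = {"darn"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # web addresses
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")           # email addresses
ACCOUNT_RE = re.compile(r"@\w+")                 # @account mentions
NON_LETTER_RE = re.compile(r"[^a-z'\s]")         # punctuation and digits


def clean_text(text):
    """Return the list of lowercase word tokens that survive cleaning."""
    text = text.lower()
    # Remove URLs and emails first, while their punctuation is still intact.
    for pattern in (URL_RE, EMAIL_RE, ACCOUNT_RE, NON_LETTER_RE):
        text = pattern.sub(" ", text)
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens
            if t and t not in STOP_WORDS and t not in PROFANITY]


tokens = clean_text(
    "Visit http://example.com or mail me@example.com, "
    "@bob went to the gym 3 times!"
)
print(tokens)
```

The order of the substitutions matters: URLs, emails, and accounts are removed before punctuation is stripped, otherwise their fragments would remain as ordinary words.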
The algorithm is based on an n-gram frequency dictionary. Following the exploratory analysis, rare n-grams were removed from the dictionary to reduce memory usage and improve performance.
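As a rough sketch of what an n-gram frequency dictionary with pruning might look like: the names `count_ngrams`, `prune_rare`, and `suggest`, and the `min_count` threshold, are hypothetical and chosen for illustration; the source does not state the actual cutoff used.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count every run of n consecutive tokens across tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune_rare(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times (assumed threshold)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


def suggest(counts, context, k=3):
    """Return up to k most frequent words following the given context tuple."""
    followers = Counter()
    for gram, c in counts.items():
        if gram[:-1] == tuple(context):
            followers[gram[-1]] += c
    return [word for word, _ in followers.most_common(k)]


sentences = [["went", "gym"], ["went", "gym"], ["went", "store"]]
bigrams = count_ngrams(sentences, 2)
print(suggest(bigrams, ["went"]))
print(prune_rare(bigrams, min_count=2))
```

Pruning trades a little coverage for a much smaller dictionary, since in natural text most distinct n-grams occur only once.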
Kneser-Ney smoothing was applied to the n-gram frequencies.
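To illustrate the idea, here is a minimal sketch of interpolated Kneser-Ney for bigrams only. It is an assumption-laden example rather than the project's implementation: the function name `kneser_ney_bigram`, the fixed discount of 0.75, and the restriction to bigrams are all choices made for brevity, and the source does not say which n-gram orders or discount were used.

```python
from collections import Counter, defaultdict


def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Build P(w | v) with interpolated Kneser-Ney from {(v, w): count}.

    Assumes every stored count is a positive integer.
    """
    context_total = Counter()        # c(v): total count of bigrams starting with v
    followers = defaultdict(set)     # distinct words seen after v
    predecessors = defaultdict(set)  # distinct words seen before w
    for (v, w), c in bigram_counts.items():
        context_total[v] += c
        followers[v].add(w)
        predecessors[w].add(v)
    bigram_types = len(bigram_counts)

    def prob(w, v):
        # Continuation probability: in how many distinct contexts w appears.
        p_cont = len(predecessors.get(w, ())) / bigram_types
        total = context_total[v]
        if total == 0:
            return p_cont  # unseen context: fall back to continuation only
        discounted = max(bigram_counts.get((v, w), 0) - discount, 0) / total
        # Mass removed by discounting is redistributed via p_cont.
        backoff_weight = discount * len(followers.get(v, ())) / total
        return discounted + backoff_weight * p_cont

    return prob


counts = {("went", "gym"): 2, ("went", "store"): 1, ("the", "gym"): 1}
prob = kneser_ney_bigram(counts)
print(prob("gym", "went"), prob("store", "went"))
```

The key difference from simple frequency ranking is the continuation term: a word that follows many different contexts gets more of the redistributed probability than a word that is frequent only after one specific predecessor.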
More information can be found in the reports: