Johns Hopkins Data Science Project
SwiftKey Prediction



Amy Jiang
April 21, 2015

Challenge: Language Model

  • Ambiguity
  • Non-standard English
  • Segmentation issues
  • Idioms
  • World knowledge
  • Neologisms
  • Entity names
  • Open-class and closed-class words

Challenge: User Experience

  • Performance vs. Accuracy: The prediction model must balance performance against accuracy.
  • Reactive Input: To simulate the user experience of SwiftKey on a smartphone or tablet, predictions must be made in real time.
  • Conditional Output: As the user types, the prediction must refine its result based on the new input.
  • Responsive UI Design: The UI of the app must display nicely on cell phones and tablets.

Solution: My Prediction Model

  • Katz's back-off model \( P_{bo}(W_{i} \mid W_{i-n+1} \cdots W_{i-1}) \):

\[ \begin{cases} d_{W_{i-n+1} \cdots W_{i}} \dfrac{C(W_{i-n+1} \cdots W_{i-1} W_{i})}{C(W_{i-n+1} \cdots W_{i-1})} & \text{if } C(W_{i-n+1} \cdots W_{i}) > k \\ \alpha_{W_{i-n+1} \cdots W_{i-1}} \, P_{bo}(W_{i} \mid W_{i-n+2} \cdots W_{i-1}) & \text{otherwise} \end{cases} \]

  • Enhanced unigram model: To let the unigram model offer auto-completion while the user is still typing a word, the model treats the space character as the boundary between unigram input and bigram/trigram input. The output is conditioned on up to the first four characters of the user's input.
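
The two ideas above can be sketched together in a few lines of Python. This is only an illustration, not the code behind the app: the class name `BackoffModel`, the constants `D` and `K`, and the toy corpus are all hypothetical. The sketch stops at bigrams, and it uses a constant discount ratio where Katz's method derives \( d \) from Good-Turing counts. It also assumes that "the first four characters" refers to the word currently being typed.

```python
from collections import Counter, defaultdict

D = 0.75  # assumed constant discount ratio (Katz derives d from Good-Turing counts)
K = 0     # back off when the bigram count is not greater than K


class BackoffModel:
    """Toy bigram back-off model with unigram auto-completion (illustrative only)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for sentence in sentences:
            words = sentence.lower().split()
            self.unigrams.update(words)
            for prev, word in zip(words, words[1:]):
                self.bigrams[prev][word] += 1
        self.total = sum(self.unigrams.values())

    def p_unigram(self, word):
        return self.unigrams[word] / self.total

    def p_backoff(self, word, prev):
        """P_bo(word | prev): discounted bigram estimate, else weighted unigram."""
        followers = self.bigrams.get(prev, {})
        context_count = sum(followers.values())
        if context_count == 0:
            return self.p_unigram(word)  # context never seen: plain unigram
        seen = {w: c for w, c in followers.items() if c > K}
        if word in seen:
            return D * seen[word] / context_count
        p_word = self.p_unigram(word)
        if p_word == 0.0:
            return 0.0  # word is outside the vocabulary
        # alpha hands the discounted-away mass to words not seen after prev
        left_over = 1.0 - sum(D * c / context_count for c in seen.values())
        unseen_mass = 1.0 - sum(self.p_unigram(w) for w in seen)
        return (left_over / unseen_mass) * p_word

    def predict_next(self, prev, top=3):
        ranked = sorted(self.unigrams, key=lambda w: (-self.p_backoff(w, prev), w))
        return ranked[:top]

    def complete(self, prefix, top=3):
        matches = [w for w in self.unigrams if w.startswith(prefix)]
        return sorted(matches, key=lambda w: (-self.unigrams[w], w))[:top]

    def suggest(self, text, top=3):
        """A trailing space means 'predict the next word'; otherwise complete the word."""
        words = text.lower().split()
        if not words:
            return self.complete("", top)
        if text.endswith(" "):
            return self.predict_next(words[-1], top)
        return self.complete(words[-1][:4], top)  # only the first four characters


corpus = ["the cat sat on the mat", "the cat ate the fish", "the dog sat on the log"]
model = BackoffModel(corpus)
print(model.suggest("the ", top=1))  # next-word prediction -> ['cat']
print(model.suggest("the c"))        # auto-completion -> ['cat']
```

In this sketch the space character alone decides which model answers, so no separate mode switch is needed while the user types.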

Result: My Prediction Model

  • Tiny footprint: 11.6 MB total file size
  • Excellent coverage: 66.8% (on test data)
  • Fast response time: 0.006 seconds!
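
Figures like these could be measured along the following lines. The slides do not define coverage, so this sketch assumes it means the share of test cases whose true next word appears among the suggestions. The names `evaluate` and `toy_predict` and the test cases are hypothetical stand-ins, not the project's actual evaluation code or data.

```python
import time


def evaluate(predict, test_cases, top=3):
    """Return (coverage, mean seconds per prediction) over (context, expected) pairs."""
    hits = 0
    elapsed = 0.0
    for context, expected in test_cases:
        start = time.perf_counter()
        suggestions = predict(context)[:top]
        elapsed += time.perf_counter() - start
        hits += expected in suggestions  # True counts as 1
    n = len(test_cases)
    return hits / n, elapsed / n


def toy_predict(context):
    """Stand-in predictor: looks up the last word in a tiny hand-made table."""
    table = {"thank": ["you", "god"], "new": ["york", "year"]}
    words = context.split()
    return table.get(words[-1], ["the"]) if words else ["the"]


cases = [("thank", "you"), ("happy new", "year"), ("see you", "soon"), ("at", "the")]
coverage, mean_seconds = evaluate(toy_predict, cases)
print(f"coverage: {coverage:.1%}, mean response time: {mean_seconds:.6f} s")
```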


Result: The Data Product
[Image: the data product]