TypeLess - the best prediction app for your typewriter

Task and data set

  • the goal of the project was to build an input prediction web application using data sets obtained from the HC Corpora
  • all three subsets (Twitter, Blogs, News) were used to train and test the application

Preprocessing

  • mark sentence boundaries
  • replace “bad” words, URLs, hashtags, nicknames and e-mail addresses with tags
  • identify dates, currency amounts and parts of addresses, and replace them with tags
  • replace groups of emoticons with tags
  • tokenize the remaining text, removing punctuation and any remaining special characters
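
The steps above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the tag names (`<url>`, `<email>`, `<hashtag>`, `<nick>`, `<emoticon>`), the regular expressions and the function name `preprocess` are all assumptions, sentence boundaries are approximated by splitting on `.`, `!` and `?`, and the rules for “bad” words, dates, currency amounts and addresses are left out.

```python
import re

# Hypothetical patterns and tag names; the original rules are not given in the text.
# Order matters: URLs and e-mail addresses are replaced before "@" and "." are reused.
RULES = [
    (re.compile(r"https?://\S+|www\.\S+"), " <url> "),
    (re.compile(r"\S+@\S+\.\w+"), " <email> "),
    (re.compile(r"#\w+"), " <hashtag> "),
    (re.compile(r"@\w+"), " <nick> "),
    # A whole group of adjacent emoticons collapses into a single tag.
    (re.compile(r"(?<!\w)(?:[:;=][-']?[()DPp]\s*)+"), " <emoticon> "),
]

# A token is either a tag or a run of letters, digits and apostrophes.
TOKEN = re.compile(r"<\w+>|[a-z0-9']+")


def preprocess(text):
    """Return a list of sentences, each a list of lower-case tokens and tags."""
    for pattern, tag in RULES:
        text = pattern.sub(tag, text)
    sentences = []
    # Simplified sentence boundary detection: split on ., ! and ?
    for chunk in re.split(r"[.!?]+", text):
        tokens = TOKEN.findall(chunk.lower())
        if tokens:
            sentences.append(tokens)
    return sentences
```

Replacing entities before splitting into sentences keeps the dots inside URLs and e-mail addresses from being treated as sentence boundaries.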

Algorithm

  • pre-process the input text using the same rules as for the main corpora
  • predict the next token with the MLE method, using the \( n, n - 1, n - 2, ..., 1 \) trailing tokens (with default \( n = 5 \))
  • combine the models using an exponential penalty for decreasing context length (for example, a 5-gram based prediction scores 10 times higher than one based on 4-grams)
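
A minimal sketch of this scheme is shown below. It is one possible reading rather than the application's implementation: the names `train` and `predict` are hypothetical, the input is assumed to be already tokenized, and taking the maximum weighted score per candidate is an assumption, since the text does not say how scores from different context lengths are merged. The only figure taken from the text is the factor of 10 between adjacent context lengths.

```python
from collections import Counter, defaultdict


def train(sentences, n=5):
    """Count next-token frequencies for every context of 1..n trailing tokens."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i in range(1, len(tokens)):
            for k in range(1, min(n, i) + 1):
                counts[tuple(tokens[i - k:i])][tokens[i]] += 1
    return counts


def predict(counts, tokens, n=5, penalty=10.0, top=3):
    """Rank candidates by penalized MLE probability over all context lengths."""
    scores = Counter()
    for k in range(1, min(n, len(tokens)) + 1):
        followers = counts.get(tuple(tokens[-k:]))
        if not followers:
            continue
        total = sum(followers.values())
        # Each token of context dropped divides the score by `penalty`.
        weight = penalty ** (k - n)
        for word, count in followers.items():
            # Assumption: keep the best weighted MLE estimate per candidate.
            scores[word] = max(scores[word], weight * count / total)
    return [word for word, _ in scores.most_common(top)]
```

With this weighting, a candidate seen after a longer matching context usually outranks a more frequent candidate that is supported only by a shorter context.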

Summary

Future directions:

  • apply spelling correction to reduce noise and the size of the model
  • test context-based predictions, with initial clustering by topic and sentiment, using emoticons as noisy labels
  • test alternative storage modes

Interested?