TypeLess - the best prediction app for your typewriter
Task and data set
- the goal of the project was to build an input prediction web application using data sets obtained from the HC Corpora
- all three subsets (Twitter, Blogs, News) have been used to train and test the application
Preprocessing
- mark sentence boundaries
- replace “bad” words, URLs, hashtags, nicknames and emails with tags
- identify dates, currency amounts and parts of addresses, and replace them with tags
- replace groups of emoticons with tags
- tokenize the remaining text, removing punctuation and remaining special characters
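
The preprocessing steps above could be sketched roughly as follows. This is a minimal Python illustration, not the application's actual code: the tag names (`<url>`, `<user>`, `<money>` and so on), the regular expressions, the `BAD_WORDS` list and the `preprocess` function are all assumptions made for the example, and address detection is left out.

```python
import re

# Hypothetical placeholder list; the real "bad" word list is not given in the slides.
BAD_WORDS = {"darn"}

# Hypothetical tag names and patterns, applied in order (URLs and emails first).
RULES = [
    (re.compile(r"https?://\S+|www\.\S+"), " <url> "),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), " <email> "),
    (re.compile(r"(?<!\w)@\w+"), " <user> "),
    (re.compile(r"(?<!\w)#\w+"), " <hashtag> "),
    (re.compile(r"[$€£]\s?\d+(?:[.,]\d+)*"), " <money> "),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), " <date> "),
    # A run of adjacent emoticons collapses into a single tag.
    (re.compile(r"(?<!\w)(?:[:;=]-?[)(DPp]\s*)+"), " <emoticon> "),
]

SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")
TOKEN = re.compile(r"<\w+>|[a-z]+(?:'[a-z]+)?")


def preprocess(text):
    """Return a list of sentences, each a list of tokens wrapped in <s> ... </s>."""
    # Tag special items before splitting, so dots in URLs or amounts
    # are not mistaken for sentence boundaries.
    for pattern, tag in RULES:
        text = pattern.sub(tag, text)
    sentences = []
    for part in SENTENCE_END.split(text):
        # Keep tags and words; punctuation and other characters are dropped.
        tokens = TOKEN.findall(part.lower())
        tokens = ["<profanity>" if t in BAD_WORDS else t for t in tokens]
        if tokens:
            sentences.append(["<s>"] + tokens + ["</s>"])
    return sentences
```

Calling `preprocess("Check http://x.co now! Paid $5.50 to @bob :) :)")` yields two sentences, with the URL, the amount, the nickname and the emoticon group each replaced by a single tag.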
Algorithm
- pre-process the input text using the same rules as for the main corpora
- predict the next token by maximum likelihood estimation (MLE) using the \( n, n - 1, n - 2, ..., 1 \) trailing tokens (with default \( n = 5 \))
- combine the models using an exponential penalty for decreasing length (for example, a 5-gram based prediction scores 10 times higher than one based on 4-grams)
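
One way to read the scoring scheme above is sketched below in Python. It is an illustration under stated assumptions rather than the application's implementation: the `train` and `predict` functions are hypothetical, the penalty is taken to be a factor of `base = 10` per token of context lost, and the per-length scores are summed for each candidate (the slides do not say whether scores are summed or the best one is taken).

```python
from collections import Counter, defaultdict


def train(sentences, n=5):
    """Count next-token frequencies for every context of 1..n trailing tokens."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i in range(1, len(tokens)):
            for k in range(1, min(n, i) + 1):
                counts[tuple(tokens[i - k:i])][tokens[i]] += 1
    return counts


def predict(counts, tokens, n=5, base=10.0, top=3):
    """Rank candidates by MLE probability, penalising shorter contexts."""
    scores = Counter()
    for k in range(1, min(n, len(tokens)) + 1):
        followers = counts.get(tuple(tokens[-k:]))
        if not followers:
            continue
        total = sum(followers.values())
        for word, c in followers.items():
            # MLE estimate c / total, scaled by base ** (k - n):
            # each token of context lost costs a factor of `base`.
            scores[word] += base ** (k - n) * c / total
    return [word for word, _ in scores.most_common(top)]


corpus = [["i", "like", "tea"], ["i", "like", "tea"], ["you", "like", "coffee"]]
counts = train(corpus)
print(predict(counts, ["you", "like"]))  # ['coffee', 'tea']
```

In the toy corpus, "tea" is the more frequent follower of "like" alone, but the longer context "you like" outweighs it, so "coffee" is ranked first.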
Summary
Future directions:
- applying spelling correction to reduce noise and the size of the model
- test context-based predictions using initial clustering by topic and sentiment, with emoticons as noisy labels
- test alternative storage modes
Interested?