TypeLess - the best prediction app for your typewriter

Task and data set

the goal of the project was to build an input prediction web application using data sets obtained from the HC Corpora
all three subsets (Twitter, Blogs, News) were used to train and test the application

Preprocessing

mark sentence boundaries
replace “bad” words, URLs, hashtags, nicknames and emails with tags
identify dates, currency amounts and parts of addresses and replace them with tags
replace groups of emoticons with tags
tokenize the remaining text, removing punctuation and any remaining special characters
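
The steps above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the tag names (`<url>`, `<user>`, `<bad>` and so on), the regular expressions and the one-word `BAD_WORDS` list are all assumptions made for the example, and address detection is left out.

```python
import re

# Hypothetical tag names and deliberately simple patterns; the source does not
# specify the real ones. Order matters: emails are handled before nicknames.
RULES = [
    (re.compile(r"https?://\S+|www\.\S+"), " <url> "),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), " <email> "),
    (re.compile(r"#\w+"), " <hashtag> "),
    (re.compile(r"@\w+"), " <user> "),
    (re.compile(r"[$€£]\s?\d+(?:[.,]\d+)*"), " <money> "),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), " <date> "),
    # A run of adjacent emoticons collapses into a single tag.
    (re.compile(r"(?:[:;=][-o]?[()dp]\s*)+"), " <emoticon> "),
]

BAD_WORDS = {"darn"}  # placeholder list, purely illustrative

# A token is either a tag or a lowercase word (optionally with an apostrophe).
TOKEN = re.compile(r"<\w+>|[a-z]+(?:'[a-z]+)?")


def preprocess(text):
    """Return a list of sentences, each a list of tokens with boundary marks."""
    text = text.lower()
    for pattern, tag in RULES:
        text = pattern.sub(tag, text)
    sentences = []
    # Split on sentence-ending punctuation after tagging, so that dots inside
    # URLs or amounts no longer count as boundaries.
    for chunk in re.split(r"[.!?]+", text):
        tokens = ["<bad>" if t in BAD_WORDS else t for t in TOKEN.findall(chunk)]
        if tokens:
            sentences.append(["<s>"] + tokens + ["</s>"])
    return sentences
```

Tagging before tokenizing keeps the punctuation inside URLs, emails and amounts from being mistaken for sentence boundaries.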

Algorithm

pre-process input text using the same rules as for the main corpora
predict the next token with the maximum likelihood estimation (MLE) method, using the \( n, n - 1, n - 2, \ldots, 1 \) trailing tokens (with default \( n = 5 \))
combine the models using an exponential penalty for decreasing length (for example, a prediction based on 5-grams scores 10 times higher than one based on 4-grams)
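
A minimal sketch of this scheme is shown below. It rests on several assumptions: contexts of 1 to n trailing tokens are counted, each token dropped from the context divides the weight by 10 (read off the 5-gram versus 4-gram example), and the weighted MLE probabilities are summed per candidate. The names `train`, `predict`, `base` and `top` are invented for the example, and the real application may combine the models differently.

```python
from collections import Counter, defaultdict


def train(sentences, n=5):
    """Count next-token frequencies for every context of 1..n tokens."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i in range(1, len(tokens)):
            for k in range(1, min(n, i) + 1):
                counts[tuple(tokens[i - k:i])][tokens[i]] += 1
    return counts


def predict(counts, tokens, n=5, base=10.0, top=3):
    """Rank candidates by penalized MLE scores summed over context lengths."""
    scores = Counter()
    for k in range(min(n, len(tokens)), 0, -1):
        following = counts.get(tuple(tokens[-k:]))
        if not following:
            continue
        total = sum(following.values())
        # Weight is 1 for an n-token context and shrinks by `base` per token dropped.
        weight = base ** (k - n)
        for token, count in following.items():
            scores[token] += weight * count / total  # MLE: count / context total
    return [token for token, _ in scores.most_common(top)]


# Toy corpus: "tea" is more frequent overall, but "we like" is only ever
# followed by "coffee".
corpus = [["<s>", "i", "like", "tea", "</s>"]] * 3 + [["<s>", "we", "like", "coffee", "</s>"]]
model = train(corpus)
print(predict(model, ["we", "like"]))  # ['coffee', 'tea']
```

In the toy run the two-token context outweighs the more frequent single-token continuation, which is the effect the exponential penalty is meant to produce.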

Summary

Future directions:

apply spelling correction to reduce noise and the size of the model
test context-based predictions, using initial clustering by topic and sentiment with emoticons as noisy labels
test alternative storage modes

Interested?

Visit TypeLess on shinyapps.io