We write hundreds of words every day: emails, social media posts, business documents, text messages, etc.
Understanding patterns in language can help people write these documents faster and more efficiently
Predictive text helps users by suggesting the next word, allowing them to choose from a list of words which commonly follow the phrase they have already typed
My app allows users to see between 1 and 5 potential next words after entering a word or phrase
The ability to choose how many words to predict allows flexibility: more words may provide greater accuracy, but fewer words provide greater speed
How to predict text?
Building a corpus from blogs, tweets, and news articles gives us a lot of data in which to find patterns in language
An N-gram is a group of N words which appear together. One of the most frequent 3-grams is “one of the”
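To make the idea concrete, here is a minimal sketch of counting N-grams. It assumes simple lowercasing and whitespace tokenization, which is not necessarily what the app does; the names `ngrams`, `count_ngrams`, and `toy_corpus` are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, n):
    """Count N-grams across sentences (assumed tokenizer: lowercase + split)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(ngrams(tokens, n))
    return counts

# Hypothetical toy corpus, standing in for the blogs, tweets, and news data
toy_corpus = [
    "one of the best days",
    "one of the most common phrases",
    "this is one of the reasons",
]
trigram_counts = count_ngrams(toy_corpus, 3)
```

In this toy corpus, `trigram_counts.most_common(1)` puts the 3-gram ("one", "of", "the") at the top, since it appears in every sentence.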
N-gram frequencies follow statistical trends such as Zipf's law. We can use these trends to predict text
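Zipf's law says that an item's frequency is roughly inversely proportional to its rank, so a plot of log frequency against log rank is close to a straight line with slope near -1. The sketch below estimates that slope with ordinary least squares. The function names and the `ideal` counts are illustrative assumptions, not measurements from the project's corpus.

```python
import math
from collections import Counter

def rank_frequency(counts):
    """List (rank, frequency) pairs, most frequent item first."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_exponent(counts):
    """Negated least-squares slope of log(frequency) against log(rank).

    Needs at least two items; a value near 1 is consistent with Zipf's law.
    """
    pairs = rank_frequency(counts)
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Made-up counts that follow frequency = 1200 / rank exactly
ideal = Counter({f"w{rank}": 1200 // rank for rank in (1, 2, 3, 4, 5, 6)})
exponent = zipf_exponent(ideal)
```

Real corpus counts will only approximate this line, especially at the very top and in the long tail of rare items.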
My app primarily uses 3-grams, 2-grams, and 1-grams
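One common way to combine 3-grams, 2-grams, and 1-grams is a simple back-off: try the longest context first and fall back to shorter ones until enough suggestions are found. The sketch below assumes that strategy, which may differ from what the app actually does. `build_model`, `predict`, and the toy sentences are hypothetical, and `k` plays the role of the 1 to 5 suggestions the user can request.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=3):
    """Map each context tuple (0 to max_n - 1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model

def predict(model, phrase, k=3, max_n=3):
    """Suggest up to k next words, trying the longest context first."""
    tokens = phrase.lower().split()
    suggestions = []
    for n in range(max_n, 0, -1):
        if len(tokens) < n - 1:
            continue  # phrase is too short for this N-gram order
        context = tuple(tokens[len(tokens) - (n - 1):])
        for word, _ in model.get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions

# Hypothetical toy corpus
toy = ["one of the best", "one of the most", "one of the best", "out of the box"]
model = build_model(toy)
top_two = predict(model, "of the", k=2)
```

With this toy data, "of the" is followed by "best" twice and by "most" and "box" once each, so the two suggestions come straight from the 3-gram counts. An unseen phrase falls through to the 1-gram counts and simply returns the most frequent words.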
To decrease loading and processing times, N-grams whose frequency was below 0.1% of the frequency of the most frequent N-gram were omitted from the data
These infrequent N-grams were mostly hapaxes: items which occur only once in a corpus, but which may make up as much as 50% of the data
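The pruning rule can be sketched as follows. The 0.1% threshold comes from the description above; everything else is an assumption made for illustration, including the function names, the made-up counts, and the reading of "share of the data" as the fraction of distinct N-grams.

```python
from collections import Counter

def prune(counts, relative_threshold=0.001):
    """Drop items whose count is below relative_threshold times the top count."""
    if not counts:
        return Counter()
    cutoff = relative_threshold * max(counts.values())
    return Counter({item: c for item, c in counts.items() if c >= cutoff})

def hapax_share(counts):
    """Fraction of distinct items that occur exactly once."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Made-up counts, not taken from the real corpus
counts = Counter({
    "one of the": 5000,
    "a lot of": 3000,
    "of the year": 6,
    "the purple cat": 1,
    "zebra at noon": 1,
})
kept = prune(counts)          # cutoff is about 5 occurrences here
share = hapax_share(counts)   # 2 of the 5 distinct N-grams occur once
```

Because the cutoff scales with the most frequent N-gram, the same rule can be applied separately to the 1-gram, 2-gram, and 3-gram tables.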
Future work may incorporate more advanced predictive models, including 4-grams, 5-grams, and machine learning algorithms. These may be more accurate but come at the cost of slower prediction times
Future work may incorporate user input: users who write about certain topics will naturally have certain words appearing more frequently than the average user
Future work may expand to other languages, such as German, Russian, or Finnish
Acknowledgements
Many thanks to:
SwiftKey for providing the corpora used in this project
Stefan Th. Gries at UCSB for teaching me R and introducing me to regular expressions
Jeff Leek, Roger Peng, and Brian Caffo at Johns Hopkins University for teaching the Data Science Specialization