Text Prediction App

Onna Nelson
April 24, 2015

Presented as part of the Capstone Project for the Data Science Specialization through Coursera and Johns Hopkins University

We write hundreds of words every day: emails, social media posts, business documents, text messages, etc.
Understanding patterns in language can help people write these documents faster and more efficiently
Predictive text helps users by suggesting the next word, allowing them to choose from a list of words which commonly appear after the phrase they have already typed
My app allows users to see between 1 and 5 potential next words after entering a word or phrase
The ability to choose how many words to predict allows flexibility: more words may provide greater accuracy, but fewer words provide greater speed
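
The suggestion step can be sketched as follows. This is a minimal illustration in Python, not the app's own code: the `suggest` function and the toy `counts` table are hypothetical, and the counts are invented. Given how often each candidate word followed the typed phrase, it returns as many suggestions as the user asked for, between 1 and 5.

```python
from collections import Counter

def suggest(counts, n_words=3):
    """Return the n_words most frequent candidate next words (1 to 5)."""
    if not 1 <= n_words <= 5:
        raise ValueError("n_words must be between 1 and 5")
    # most_common returns (word, count) pairs sorted by descending count
    return [word for word, _ in Counter(counts).most_common(n_words)]

# Toy counts of words seen after "one of" (invented for illustration)
counts = {"the": 120, "my": 45, "those": 30, "these": 28, "them": 20, "a": 5}

print(suggest(counts, 1))  # ['the']
print(suggest(counts, 5))  # ['the', 'my', 'those', 'these', 'them']
```

Asking for one word gives a single best guess; asking for five widens the list so the intended word is more likely to be on it.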

Building a corpus from blogs, tweets, and news articles gives us a lot of data in which to find patterns in language
An N-gram is a group of N words which appear together. One of the most frequent 3-grams is “one of the”
N-gram frequencies follow statistical trends such as Zipf's law. We can use these trends to predict text
My app primarily uses 3-grams, 2-grams, and 1-grams
To decrease loading and processing times, N-grams whose frequency was below 0.1% of that of the most frequent N-gram were omitted from the data
These infrequent N-grams were mostly hapaxes: items which occur only once in a corpus, yet may make up as much as 50% of the data
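
The counting, pruning, and lookup described above can be sketched as follows. This is an illustration in Python under stated assumptions, not the app's implementation: all function names are hypothetical, the toy corpus is invented, and the fallback from 3-grams to 2-grams to 1-grams is one plausible way to combine the three tables (the slides only say that all three are used).

```python
from collections import Counter, defaultdict

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prune(counts, ratio=0.001):
    """Drop N-grams rarer than ratio (0.1%) times the top N-gram's count."""
    if not counts:
        return Counter()
    cutoff = ratio * max(counts.values())
    return Counter({g: c for g, c in counts.items() if c >= cutoff})

def build_tables(tokens, max_n=3, ratio=0.001):
    """For each n, map a context (first n-1 words) to counts of next words."""
    tables = {}
    for n in range(1, max_n + 1):
        table = defaultdict(Counter)
        for gram, c in prune(count_ngrams(tokens, n), ratio).items():
            table[gram[:-1]][gram[-1]] = c
        tables[n] = table
    return tables

def predict(tables, phrase, k=3, max_n=3):
    """Assumed strategy: try the longest context, then fall back to shorter."""
    words = phrase.lower().split()
    for n in range(max_n, 0, -1):
        context = tuple(words[len(words) - (n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # phrase is too short for this N-gram order
        candidates = tables[n].get(context)
        if candidates:
            return [w for w, _ in candidates.most_common(k)]
    return []

# Toy corpus (invented); a real corpus would be far larger
tokens = "one of the best one of the most one of a kind".split()
tables = build_tables(tokens)

print(predict(tables, "one of", k=2))   # ['the', 'a']  (3-gram match)
print(predict(tables, "kind of", k=1))  # ['the']       (falls back to 2-grams)
```

On a corpus this small the 0.1% cutoff removes nothing; it only matters once the most frequent N-gram occurs more than a thousand times, at which point N-grams seen once fall below the cutoff.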

screenshot

Future work may incorporate more advanced predictive models, including 4-grams, 5-grams, and machine learning algorithms. These may be more accurate but come at the cost of slower prediction times
Future work may incorporate user input: users who write about certain topics will naturally have certain words appearing more frequently than the average user
Future work may expand to other languages, such as German, Russian, or Finnish

Many thanks to:
- SwiftKey for providing the corpora used in this project
- Stefan Th. Gries at UCSB for teaching me R and introducing me to regular expressions
- Jeff Leek, Roger Peng, and Brian Caffo at Johns Hopkins University for teaching the Data Science Specialization