Text Prediction App

Onna Nelson
April 24 2015

Presented as part of the Capstone Project for the Data Science Specialization through Coursera and Johns Hopkins University

Why predict text?

  • We write hundreds of words every day: emails, social media posts, business documents, text messages, etc.
  • Understanding patterns in language can help people write these documents faster and more efficiently
  • Predictive text helps users by suggesting the next word, allowing them to choose from a list of words which commonly appear in the phrase they have already typed
  • My app allows users to see between 1 and 5 potential next words after entering a word or phrase
  • The ability to choose how many words to predict allows flexibility: more words may provide greater accuracy, but fewer words provide greater speed

How to predict text?

  • Building a corpus from blogs, tweets, and news articles gives us a lot of data in which to find patterns in language
  • An N-gram is a group of N words which appear together. One of the most frequent 3-grams is “one of the”
  • N-gram frequencies follow statistical trends such as Zipf's law. We can use these trends to predict text
  • My app primarily uses 3-grams, 2-grams, and 1-grams
  • To decrease loading and processing times, N-grams occurring less than 0.1% as often as the most frequent N-gram were omitted from the data
  • These infrequent N-grams were mostly hapaxes: items which occur only once in a corpus, but which may make up as much as 50% of the data
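
The steps above can be sketched in a few lines of code. This is a minimal illustration in Python, not the app's actual implementation (which was written in R for Shiny). The slides do not say how the 3-gram, 2-gram, and 1-gram tables are combined, so the simple backoff used here (try 3-grams first, then 2-grams, then 1-grams) is an assumption, as are the function names `count_ngrams`, `prune`, and `predict`, the whitespace tokenizer, and the toy corpus.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real tokenizer
    # would also deal with punctuation, numbers, and so on.
    return text.lower().split()


def count_ngrams(sentences, n):
    # Count every run of n consecutive words in each sentence.
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, ratio=0.001):
    # Drop N-grams whose count is below `ratio` (0.1%) of the count
    # of the most frequent N-gram in the same table.
    if not counts:
        return counts
    threshold = ratio * max(counts.values())
    return Counter({g: c for g, c in counts.items() if c >= threshold})


def predict(phrase, tables, k=3):
    # Suggest between 1 and 5 next words. Assumed strategy: simple backoff
    # from 3-grams to 2-grams to 1-grams, most frequent candidates first.
    k = max(1, min(5, k))
    tokens = tokenize(phrase)
    suggestions = []
    for n in (3, 2, 1):
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # the phrase is too short for this N-gram order
        candidates = [(c, g[-1]) for g, c in tables[n].items()
                      if g[:-1] == context]
        # Sort by descending count, breaking ties alphabetically.
        for _, word in sorted(candidates, key=lambda x: (-x[0], x[1])):
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


# Toy corpus, invented for illustration only.
corpus = ["one of the best", "one of the most",
          "one of the most", "some of them"]
tables = {n: prune(count_ngrams(corpus, n)) for n in (1, 2, 3)}

print(predict("one of", tables, k=3))  # ['the', 'them', 'of']
```

With a corpus this small, pruning removes nothing, because 0.1% of the top count is below 1. On a large corpus the same threshold discards the long tail of rare N-grams, which is where the savings in loading and processing time come from.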

Using the Shiny App

Future research

  • Future work may incorporate more advanced predictive models, including 4-grams, 5-grams, and machine learning algorithms. These may be more accurate but come at the cost of slower prediction times
  • Future work may incorporate user input: users who write about certain topics will naturally have certain words appearing more frequently than the average user
  • Future work may expand to other languages, such as German, Russian, or Finnish

Acknowledgements

  • Many thanks to:
    • SwiftKey for providing the corpora used in this project
    • Stefan Th. Gries at UCSB for teaching me R and introducing me to regular expressions
    • Jeff Leek, Roger Peng, and Brian Caffo at Johns Hopkins University for teaching the Data Science Specialization