This presentation is for the Data Science Specialization Capstone offered by Johns Hopkins University on Coursera.
The course instructors gave us a curated dataset kindly provided by SwiftKey, a company that develops text prediction technology. It contains text from Twitter, blogs, and news sites.
The goal is to create an algorithm that predicts the next word given one or more words (a phrase or sentence) as input. A large corpus of more than 4 million documents was loaded, sampled, tokenized, and analyzed. N-grams of order 1 to 4 were extracted from the corpus and then used to build the predictive model.
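As a rough illustration of the pipeline described above, the following Python sketch tokenizes a tiny made-up corpus, counts n-grams of order 1 to 4, and predicts the next word by trying the longest matching context first and falling back to shorter ones. The function names (`tokenize`, `build_ngram_tables`, `predict_next`), the sample sentences, and the simple back-off strategy are all assumptions made for this example; the actual model may use different tokenization, sampling, and smoothing.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_ngram_tables(documents, max_n=4):
    """For each order n, map a context of n-1 words to counts of the next word."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for doc in documents:
        tokens = tokenize(doc)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                next_word = tokens[i + n - 1]
                tables[n][context][next_word] += 1
    return tables


def predict_next(tables, phrase, max_n=4):
    """Return the most frequent follower of the longest context seen in training.

    This is a plain back-off (an assumption for this sketch): try the last
    three words, then two, then one, and finally the most common unigram.
    """
    tokens = tokenize(phrase)
    for n in range(max_n, 0, -1):
        if n - 1 > len(tokens):
            continue  # the phrase is too short to supply this much context
        context = tuple(tokens[len(tokens) - (n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the tables are empty


# Hypothetical three-document corpus, for illustration only.
corpus = [
    "I love New York",
    "I love New York pizza",
    "I love new ideas",
]
tables = build_ngram_tables(corpus)
print(predict_next(tables, "I love new"))  # uses the 4-gram table
print(predict_next(tables, "love"))        # backs off to the bigram table
```

With this toy corpus, "i love new" is followed by "york" twice and "ideas" once, so the first call returns "york"; the second call has only one word of context and returns "new" from the bigram counts.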