This app takes the English dataset from SwiftKey, sourced from Twitter and various news sites and blogs, with each source containing between 70,000 and over 2 million lines. It calculates a modified probability of each word in the context of the preceding words, and uses that probability to predict the word that follows a user-input sentence.
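To make the idea of "probability of a word in the context of previous words" concrete, here is a minimal sketch in Python. It is not the app's actual method: the modification applied to the probabilities is not described here, so this sketch assumes plain relative frequencies over a one-word context (bigrams). The function names `build_bigram_counts` and `predict_next` and the toy corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that comes right after it.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, text):
    """Return (word, probability) for the most likely next word, or None.

    Assumption: the probability is the unmodified relative frequency
    count(prev, next) / count(prev, anything), using only the last
    word of the input as context.
    """
    words = text.lower().split()
    if not words or words[-1] not in counts:
        return None
    followers = counts[words[-1]]
    total = sum(followers.values())
    word, n = followers.most_common(1)[0]
    return word, n / total


# Toy corpus, invented for this example only.
corpus = ["I love new york", "I love new ideas", "new york is big"]
counts = build_bigram_counts(corpus)
prediction = predict_next(counts, "welcome to new")
```

In this toy corpus "new" is followed by "york" twice and by "ideas" once, so `prediction` holds "york" with an estimated probability of 2/3. A longer context or a smoothed estimate would replace the simple ratio in `predict_next`.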
The dataset is summarized in my Milestone Report; each source contains millions of words. Although more data is generally better, analyzing the entire dataset, while feasible, would be undesirable: it would consume considerable time and memory and make an online application unwieldy. Therefore, 1% of the total number of lines is randomly sampled, and this subset serves as the basis of the application.
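The sampling step can be sketched as follows in Python. This is an illustration under stated assumptions rather than the app's own code: the helper name `sample_lines`, the fixed seed, and the choice to preserve the original line order are all hypothetical, and sampling is done without replacement.

```python
import random


def sample_lines(lines, fraction=0.01, seed=42):
    """Return a random subset of `lines` containing about `fraction` of them.

    Assumptions: sampling is without replacement, the seed is fixed so
    the subset is reproducible, and the original order is preserved.
    """
    if not lines:
        return []
    rng = random.Random(seed)
    # Keep at least one line so very small inputs are not emptied out.
    k = max(1, round(len(lines) * fraction))
    chosen = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in chosen]


# Stand-in data; the real input would be the lines read from each source file.
lines = [f"line {i}" for i in range(1000)]
subset = sample_lines(lines)
```

With 1,000 input lines and the default 1% fraction, `subset` contains 10 lines. Fixing the seed means the same subset is drawn on every run, which keeps downstream word counts comparable between runs.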