The app predicts the next word of a partial sentence from the words that precede it, using n-grams and the Stupid Backoff algorithm. The following steps were used to generate the n-grams:
- Datasets from three sources (Twitter, blogs, and news) were available from SwiftKey.
- The blog, news, and Twitter datasets were sampled (10%).
- The datasets were converted into a corpus and cleaned.
- The sentences were tokenized into unigrams, bigrams, trigrams, quadragrams (4-grams), and pentagrams (5-grams), and their frequencies were counted.
- Very low-frequency n-grams (frequency < 2, i.e. those seen only once) were removed, and the remaining n-grams were written to datasets.
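
The counting, pruning, and backoff steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the function names (`tokenize`, `count_ngrams`, `predict_next`) are hypothetical, the regex tokenizer is only a stand-in for the cleaning step, the 10% sampling is omitted, and the backoff factor `alpha=0.4` is an assumption (it is the value commonly used with Stupid Backoff, but it is not stated here).

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase and keep runs of letters/apostrophes (stand-in for the cleaning step)."""
    return re.findall(r"[a-z']+", sentence.lower())


def count_ngrams(sentences, max_n=5, min_count=2):
    """Count n-grams of order 1..max_n and drop those seen fewer than min_count times."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return {
        n: Counter({gram: c for gram, c in ctr.items() if c >= min_count})
        for n, ctr in counts.items()
    }


def predict_next(text, counts, max_n=5, alpha=0.4, top_k=3):
    """Rank candidate next words with Stupid Backoff scores (alpha is an assumed value)."""
    tokens = tokenize(text)
    total_unigrams = sum(counts[1].values())
    scores = {}
    penalty = 1.0
    for n in range(max_n, 0, -1):
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        if len(context) != n - 1:
            continue  # the input is too short for this order; no penalty applied
        context_count = counts[n - 1][context] if n > 1 else total_unigrams
        if context_count > 0:
            # Linear scan for simplicity; a real app would index n-grams by context.
            for gram, c in counts[n].items():
                word = gram[-1]
                if gram[:-1] == context and word not in scores:
                    scores[word] = penalty * c / context_count
        penalty *= alpha  # each step down to a shorter context is discounted
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Toy usage with a made-up corpus
toy_corpus = [
    "I love New York",
    "I love New Jersey",
    "I love New York pizza",
    "you love old films",
    "you love old films",
]
toy_counts = count_ngrams(toy_corpus)
print(predict_next("I love new", toy_counts))
```

A word keeps the score from the highest-order n-gram in which it appears after the given context; lower orders only contribute words not already found, each discounted by a further factor of `alpha`. The scores are relative frequencies rather than true probabilities, which is what distinguishes Stupid Backoff from smoothed backoff models.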