The prediction model was built using the English blogs, news and Twitter datasets provided for this project.
The following steps were used to build the model:
- Combined the three text datasets into one corpus.
- Converted all text to lowercase.
- Removed punctuation, numbers and extra spaces.
- Created unigram, bigram, trigram and quadgram tables.
- Counted how often each n-gram appeared in the combined corpus.
- Saved the final n-gram tables as RDS files to reduce loading time.
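
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code (the RDS files suggest the original was written in R). The names `clean` and `ngram_counts` and the three tiny sample lists are hypothetical, and the exact cleaning rule (deleting every character that is not a letter or whitespace) is an assumption.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, delete punctuation and digits, collapse extra whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # assumption: keep only a-z and whitespace
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(lines, n):
    """Count every run of n consecutive words, line by line."""
    counts = Counter()
    for line in lines:
        tokens = clean(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical stand-ins for the blogs, news and Twitter datasets.
blogs = ["The cat sat on the mat."]
news = ["The cat sat down, twice in 2020!"]
twitter = ["the   cat ran"]

corpus = blogs + news + twitter               # one combined corpus
tables = {n: ngram_counts(corpus, n) for n in range(1, 5)}  # unigram .. quadgram
```

Here `tables[2]` holds the bigram counts, `tables[3]` the trigram counts, and so on. Writing each table to disk once (with `pickle` in Python, or as an RDS file in R) would let the application load precomputed counts instead of rebuilding them at startup.
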
When a user enters a phrase, the application searches these n-gram tables to find the most likely next word.
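
The text does not say how the tables are searched, so the following is only one plausible reading: a simple backoff lookup that tries the longest n-gram first and falls back to shorter ones. It is a sketch in Python rather than the application's own code, and `predict_next` and the toy `tables` dictionary are hypothetical.

```python
import re
from collections import Counter

def predict_next(phrase, tables, max_n=4):
    """Return the most frequent continuation, backing off from long to short context."""
    tokens = re.sub(r"[^a-z\s]", "", phrase.lower()).split()
    for n in range(max_n, 1, -1):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue                          # phrase too short for this n-gram order
        # Collect counts of every word that follows this context in the n-gram table.
        candidates = {gram[-1]: count
                      for gram, count in tables.get(n, {}).items()
                      if gram[:-1] == context}
        if candidates:
            return max(candidates, key=candidates.get)
    unigrams = tables.get(1, {})
    if unigrams:
        return max(unigrams, key=unigrams.get)[0]   # last resort: most common word
    return None

# Toy tables keyed by n-gram order; the counts are made up for illustration.
tables = {
    1: Counter({("the",): 5, ("cat",): 3, ("sat",): 2}),
    2: Counter({("the", "cat"): 3, ("the", "mat"): 1, ("cat", "sat"): 2}),
    3: Counter({("the", "cat", "sat"): 2, ("the", "cat", "ran"): 1}),
    4: Counter({("sat", "on", "the", "mat"): 1}),
}

suggestion = predict_next("I saw the cat", tables)
```

The linear scan over each table keeps the sketch short. A real application would more likely index each table by its context so that a lookup does not have to touch every row.
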