For the Johns Hopkins Data Science Capstone Project, the challenge was to build a next word prediction (NLP) model, using R. The aid with building the model, we were given a large data set of US English text (>550MB).
- 102 million-word training corpus
- Corpus comprised of twitter (X), news and blog - raw text data.