The purpose of this project was to build a natural language processing model that predicts the next word for any phrase a user enters. I achieved this in three general steps:
- Generating unigrams, bigrams, and trigrams from the provided corpus of news articles, tweets, and blogs.
- Cleaning noise from the data, such as capitalization, punctuation, numbers, special characters, and stop words, that does not contribute to the substance of an input.
- Implementing a Katz Back-Off algorithm to assign probabilities to all observed and unobserved n-grams for a given input.
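
The three steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the names `clean`, `BackoffModel`, and `STOP_WORDS` are hypothetical, the stop-word list and corpus are toy stand-ins, and the model uses a fixed absolute discount in place of the Good-Turing discounting that the original Katz formulation prescribes.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "on"}

def clean(text):
    """Lowercase, replace anything that is not a-z or whitespace, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in text.split() if w not in STOP_WORDS]

class BackoffModel:
    """Katz-style back-off over trigrams, bigrams, and unigrams.

    Simplification: a fixed absolute discount replaces Good-Turing discounting.
    """

    def __init__(self, sentences, discount=0.5):
        self.d = discount
        # Maps a history tuple (0, 1, or 2 words) to a Counter of next words.
        self.followers = defaultdict(Counter)
        for tokens in sentences:
            for n in (1, 2, 3):
                for i in range(len(tokens) - n + 1):
                    gram = tuple(tokens[i:i + n])
                    self.followers[gram[:-1]][gram[-1]] += 1
        self.vocab = set(self.followers[()])

    def prob(self, word, history):
        history = tuple(history)[-2:]  # keep at most two words of context
        if not history:
            # Base case: maximum-likelihood unigram estimate.
            unigrams = self.followers[()]
            return unigrams[word] / sum(unigrams.values())
        seen = self.followers.get(history)
        if not seen:
            # History never observed: back off to a shorter history.
            return self.prob(word, history[1:])
        total = sum(seen.values())
        if word in seen:
            # Observed n-gram: discounted relative frequency.
            return (seen[word] - self.d) / total
        # Unobserved n-gram: share the discounted mass in proportion
        # to the lower-order probabilities of the unseen words.
        alpha = self.d * len(seen) / total
        unseen_mass = sum(self.prob(w, history[1:])
                          for w in self.vocab if w not in seen)
        if unseen_mass == 0:
            return 0.0
        return alpha * self.prob(word, history[1:]) / unseen_mass

    def predict(self, phrase):
        history = clean(phrase)
        # Sort the vocabulary so that ties are broken deterministically.
        return max(sorted(self.vocab), key=lambda w: self.prob(w, history))

# Toy corpus standing in for the news, tweet, and blog data.
corpus = [
    "The cat sat on the mat.",
    "The cat sat on the sofa!",
    "A dog sat on the mat in 2020.",
    "The cat sat on the mat again.",
]
model = BackoffModel([clean(s) for s in corpus])
print(model.predict("The cat sat"))  # mat
```

With a fixed discount, the probability mass removed from observed n-grams is exactly what gets redistributed to unobserved ones, so the probabilities for any observed history still sum to 1. A fuller implementation would estimate the discounts from the count-of-counts and precompute the back-off weights rather than recomputing them on each call.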