Data Science Specialization Capstone Project: Natural Language Processing
Mike W
24 August, 2020
Natural Language Processing
- The objective of the capstone project was to apply the skills we acquired in the Data Science Specialization to natural language processing by building a predictive text model from a corpus of Twitter, blog, and news texts.
Our Model
- To construct our model, we randomly selected a subset of the data, cleaned the text (e.g., removed punctuation and profanity), and tokenized the lines of text into n-grams (unigrams, bigrams, trigrams, and 4-grams).
- Our model predicted the next word in a string based on the frequency of the n-grams in the sample, relying on a Stupid Backoff scheme to score novel or unobserved n-grams by falling back to progressively shorter n-grams.
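
The steps above can be illustrated with a minimal Python sketch. It is not the project's actual code: the function names (`tokenize`, `count_ngrams`, `predict_next`), the regular expression used for cleaning, and the backoff factor of 0.4 (the value commonly paired with Stupid Backoff) are all assumptions made for illustration, and profanity filtering and random sampling are left out.

```python
import re
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the value commonly used with Stupid Backoff


def tokenize(line):
    """Lowercase the line and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())


def count_ngrams(lines, max_n=4):
    """Return a dict mapping n to a Counter of n-gram tuples (n = 1..max_n)."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def predict_next(text, counts, max_n=4, top_k=3):
    """Rank candidate next words by Stupid Backoff score (not a probability)."""
    tokens = tokenize(text)
    scores = {}
    penalty = 1.0
    # Start at the longest order the input can support, then back off.
    start = min(max_n, len(tokens) + 1)
    for n in range(start, 1, -1):
        context = tuple(tokens[-(n - 1):])
        context_count = counts[n - 1][context]
        if context_count:
            for gram, c in counts[n].items():
                # Keep the score from the highest order where the word was seen.
                if gram[:-1] == context and gram[-1] not in scores:
                    scores[gram[-1]] = penalty * c / context_count
        penalty *= ALPHA  # each backoff step multiplies the score by ALPHA
    # Final fallback: unigram relative frequency.
    total = sum(counts[1].values())
    for (word,), c in counts[1].items():
        if word not in scores:
            scores[word] = penalty * c / total
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


corpus = [
    "The cat sat on the mat.",
    "The cat sat on the sofa.",
    "The cat ran!",
]
counts = count_ngrams(corpus)
print(predict_next("the cat", counts))
```

This sketch scans every stored n-gram on each prediction, which is fine for a toy corpus; a model built on a large sample would more likely index the n-grams by their context for fast lookup.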