Data Science Specialization Capstone Project: Natural Language Processing

Mike W
24 August, 2020

Natural Language Processing

  • The objective of the capstone project was to apply the skills acquired in the Data Science specialization to natural language processing by creating a predictive text model based on a corpus of texts from Twitter, blogs, and news sources.

Our Model

  • To construct our model, we randomly selected a subset of the data, cleaned the text (e.g., removed punctuation and profanity), and tokenized the lines of text into n-grams (unigrams, bigrams, trigrams, and 4-grams).
  • Our model predicted the next word in a string based on the frequency of the n-grams in the sample, backing off to shorter n-grams under a Stupid Backoff scheme to score word sequences that were not observed in the sample.
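
The approach described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code: the function names (`tokenize`, `count_ngrams`, `score`, `predict_next`) are hypothetical, the backoff factor of 0.4 is the value commonly used with Stupid Backoff rather than one stated here, and the cleaning step only lowercases and strips punctuation (sampling and profanity filtering are omitted).

```python
import re
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the value used in the project is not stated


def tokenize(line):
    """Lowercase a line and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())


def count_ngrams(lines, max_n=4):
    """Count every n-gram of order 1..max_n, keyed by a tuple of tokens."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts, total_unigrams):
    """Stupid Backoff score of `word` following `context` (a tuple of tokens).

    The result is a relative score, not a normalized probability.
    """
    factor = 1.0
    context = tuple(context)
    while context:
        seen = counts.get(context + (word,), 0)
        if seen > 0:
            # The context is a prefix of an observed n-gram, so its count is > 0.
            return factor * seen / counts[context]
        context = context[1:]  # drop the earliest word and back off
        factor *= ALPHA
    return factor * counts.get((word,), 0) / total_unigrams


def predict_next(text, counts, max_n=4, top_k=3):
    """Return the top_k candidate next words for `text`, best first."""
    tokens = tokenize(text)
    context = tuple(tokens[-(max_n - 1):]) if max_n > 1 else ()
    total = sum(c for gram, c in counts.items() if len(gram) == 1)
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(vocab, key=lambda w: (-score(w, context, counts, total), w))
    return ranked[:top_k]


sample = [
    "The cat sat on the mat.",
    "The cat sat on the sofa.",
    "The cat ate the fish.",
]
counts = count_ngrams(sample)
suggestions = predict_next("the cat", counts)
```

With this toy sample, the trigram counts alone decide the first two suggestions, and the third comes from backing off all the way to unigram frequencies. Scoring every vocabulary word is fine for a sketch but would be too slow for a real corpus, where candidate words would normally be looked up from the observed n-grams instead.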

Shiny App