Slide deck

CC
Sep 21, 2020

Coursera Capstone Project for Data Science Specialization

This presentation was produced for the final project of the Data Science Specialization course hosted by Coursera. I created an application for Natural Language Processing (NLP) with the intent to generate predictive text from a word or phrase input by a user.

Goal

  • Goal is to create a data product containing an algorithm that predicts the next word, given a word or phrase input by the user.
  • Divided into several tasks: examining the data, data cleaning, experimenting with algorithms, building the app.
  • Used n-gram models on a corpus of tweets, blog posts, and news sources, which build up predictions from frequency analysis of the text.

Methodology details

  • Data were supplied by the course website, then downloaded and read in.
  • Cleaned up the corpus to remove some artefacts and typos.
  • Used n-gram tokenization to find common phrases in the corpus.
  • In particular, we looked at phrases of 2, 3, and 4 words for patterns, transforming them into lookup tables.
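
The tokenization and lookup-table steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the app: the helper names (`tokenize`, `build_lookup`), the three-sentence toy corpus, and the choice to keep the top 3 candidates per context are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Assumed cleaning: lowercase and split on whitespace only.
    # The real cleaning step (artefacts, typos) is not shown here.
    return text.lower().split()

def build_lookup(sentences, n, top_k=3):
    """Map each (n-1)-word context to its most frequent next words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        # Slide a window of n words over the sentence.
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            counts[context][next_word] += 1
    # Keep only the top_k candidates, ordered by frequency.
    return {
        context: [word for word, _ in counter.most_common(top_k)]
        for context, counter in counts.items()
    }

# Hypothetical toy corpus standing in for the tweets, blogs, and news data.
corpus = [
    "thanks for the follow",
    "thanks for the support",
    "thanks for the follow back",
]

# One lookup table per phrase length (2, 3, and 4 words).
tables = {n: build_lookup(corpus, n) for n in (2, 3, 4)}
print(tables[4][("thanks", "for", "the")])
```

Each table is keyed by the leading words of a phrase, so a prediction becomes a dictionary lookup rather than a scan of the corpus.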

Application Usage

The user enters text into the application text box on the left and receives several predictions for the next word on the right.
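
As a rough idea of what could happen between the text box and the predictions, the sketch below looks up the last words of the input in n-gram tables, trying the longest context first. The fallback from longer to shorter contexts is an assumption, since the slides do not say how the app combines its tables. The function `predict_next` and the hard-coded `toy_tables` are hypothetical.

```python
def predict_next(phrase, tables, max_context=3):
    """Return candidate next words for a phrase.

    `tables` maps n (phrase length) to a dict of
    context tuple -> list of candidate next words.
    """
    tokens = phrase.lower().split()
    # Try the longest available context first, then shorter ones (assumed strategy).
    for size in range(min(max_context, len(tokens)), 0, -1):
        context = tuple(tokens[-size:])
        table = tables.get(size + 1, {})
        if context in table:
            return table[context]
    return []  # No matching context in any table.

# Hypothetical hand-written tables for 2-, 3-, and 4-word phrases.
toy_tables = {
    2: {("the",): ["follow", "support"]},
    3: {("for", "the"): ["follow", "support"]},
    4: {("thanks", "for", "the"): ["follow", "support"]},
}

print(predict_next("Thanks for the", toy_tables))
```

An input with no match in any table returns an empty list here. A real app would more likely fall back to the most common words overall.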