15/11/2021

Overview

  • This presentation is for the Johns Hopkins University Data Science Specialization Capstone on Coursera
  • The dataset for this project is provided by SwiftKey
  • The goal of the capstone project is to build an application that predicts the next word in a phrase or sentence

Data used

The corpus, provided by SwiftKey, is publicly available and was collected by a web crawler. Four data sets were available; our application uses the English data set only. The data was taken from random news articles, blog posts, and Twitter feeds. For use in this project, the data had to be cleaned by removing extraneous punctuation, excessive whitespace, profanity, and other non-text elements. A portion of the cleaned data was then tokenized into n-gram tables.
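
The cleaning and tokenizing steps described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for the app: the function names `clean` and `ngram_table` are hypothetical, the exact cleaning rules are assumptions, and `PROFANITY` is a placeholder list because the list actually used is not given here.

```python
import re
from collections import Counter

# Placeholder word list (assumption); the real profanity list is not given.
PROFANITY = {"darn", "heck"}


def clean(text):
    """Lowercase the text, drop non-letter characters, and remove listed words."""
    text = text.lower()
    # Replace punctuation, digits, and other non-text characters with a space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    return [word for word in text.split() if word not in PROFANITY]


def ngram_table(lines, n):
    """Count every run of n consecutive words across the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = clean(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


bigrams = ngram_table(["The cat sat.", "The cat ran!"], 2)
print(bigrams.most_common(1))
```

Each table maps an n-gram (stored as a tuple of words) to how often it occurred, which is the information a next-word lookup needs.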

App features

  • Side panel with user instructions
  • Text box for user input
  • Predicted next word displayed dynamically below the user input

Benefits of using the app

  • Lightweight
  • Fast response
  • Method allows for large training sets, leading to better next-word predictions
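
To illustrate why a table lookup can be both lightweight and fast, here is a rough sketch of next-word prediction from n-gram counts. The presentation does not name the prediction method, so the simple back-off lookup below is an assumption, and `build_tables` and `predict_next` are hypothetical names used only for this example.

```python
from collections import Counter, defaultdict


def build_tables(sentences, max_n=3):
    """Map each context (0 to max_n - 1 preceding words) to next-word counts."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for k in range(max_n):
                if i - k < 0:
                    break  # not enough preceding words for a context of length k
                tables[tuple(words[i - k:i])][word] += 1
    return tables


def predict_next(tables, phrase, max_n=3):
    """Try the longest available context first, then back off to shorter ones."""
    words = phrase.lower().split()
    for k in range(max_n - 1, -1, -1):
        if k > len(words):
            continue
        context = tuple(words[len(words) - k:])
        candidates = tables.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # no counts at all, e.g. empty training data


tables = build_tables([
    "i love data science",
    "i love data analysis",
    "i love data science",
])
print(predict_next(tables, "I love data"))
```

Under these assumptions, prediction is a handful of dictionary lookups regardless of how much text was used to build the tables, so a larger training set mainly increases table size rather than response time.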