26/07/2020

Objective

  • This application is the capstone project for the Coursera Data Science Specialization, run in cooperation with SwiftKey.
  • The main goal of this capstone project is to build a Shiny application that predicts the next word as accurately as possible.
  • The text data, taken from a corpus called HC Corpora, is used to build a frequency dictionary, which in turn is used to predict the next word.

Methodology

  • A data sample was drawn from the HC Corpora data and cleaned by converting it to lowercase and removing punctuation, links, extra white space, numbers and all kinds of special characters.

  • This data sample was then tokenized into n-grams.

  • The aggregated bi-, tri- and quadgram term frequency matrices were then converted into frequency dictionaries.

  • Lastly, the resulting data frames are used to predict the next word from the text entered by the user and the frequencies in the underlying n-gram tables.
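
The steps above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the application's actual R/Shiny code. The function names (clean, build_dictionaries, predict_next) are hypothetical, and the lookup order (try the quadgram table first, then fall back to trigrams and bigrams) is an assumption, since the exact lookup strategy is not described here.

```python
import re
from collections import Counter, defaultdict


def clean(text):
    """Lowercase the text and strip links, numbers, punctuation and extra white space."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # links
    text = re.sub(r"[^a-z\s]", " ", text)  # numbers, punctuation, special characters
    return re.sub(r"\s+", " ", text).strip()


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dictionaries(lines, orders=(2, 3, 4)):
    """For each n, map every (n-1)-word prefix to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in orders}
    for line in lines:
        tokens = clean(line).split()
        for n in orders:
            for gram in ngrams(tokens, n):
                tables[n][gram[:-1]][gram[-1]] += 1
    return tables


def predict_next(tables, text):
    """Return the most frequent follower of the longest matching prefix, or None."""
    tokens = clean(text).split()
    for n in sorted(tables, reverse=True):  # assumed order: longest n-gram first
        if len(tokens) < n - 1:
            continue
        prefix = tuple(tokens[-(n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return None
```

For example, after building the tables from a handful of sentences with build_dictionaries, calling predict_next(tables, "I love new") looks up the prefix ("i", "love", "new") in the quadgram table and returns the word that followed it most often in the sample.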

Usage

Using the application is straightforward:

  1. Enter text.
  2. The field with the predicted next word refreshes instantaneously.
  3. The whole text input is displayed as well.

Resources and Links