13/4/2020

Introduction

Background

This presentation is a pitch for a Shiny app built as the final assignment for the Capstone Project course of the Data Science Specialization from Johns Hopkins University.

What is the app for?

The app simulates the word prediction feature of smart keyboards such as SwiftKey by taking a word sequence as input and offering predictions of the next word as output.

URL

https://dalibor.shinyapps.io/predictword/

Text Corpus

The course provided text corpora in several languages. The US English corpus was split into three files, drawn from

  • blogs
  • news
  • twitter

Since each file was on the order of 150-200 MB, the corpus was subset to a size chosen to optimize predicted-word coverage. The corpus was also cleaned by

  • transforming words to lowercase
  • removing punctuation and numbers
  • stripping whitespace
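
The cleaning steps above can be sketched as follows. The app itself was written in R; this is only a minimal Python illustration, and the function name clean_text is hypothetical. It assumes ASCII punctuation and digits are the characters to remove.

```python
import re
import string


def clean_text(text: str) -> str:
    """Lowercase, drop punctuation and digits, and strip extra whitespace.

    Hypothetical helper for illustration; not the app's actual R code.
    """
    text = text.lower()
    # Remove ASCII punctuation and digits in a single pass.
    drop = str.maketrans("", "", string.punctuation + string.digits)
    text = text.translate(drop)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()
```

Note that removing punctuation outright merges contractions (for example, "it's" becomes "its"), which is one reason a real pipeline might treat apostrophes separately.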

Prediction algorithm

The prediction is based on a 3-gram tokenizer defined in a custom function. The corpus was tokenized and filtered into a more efficient, frequency-sorted trigram dictionary with a final size on the order of tens of MB. This dictionary is accessed by the custom prediction function, which takes the last two words of the provided text as input and outputs the three words that most frequently follow them. In addition, if no trigram is found for the given text, the model backs off to a bigram lookup to increase coverage.
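
A minimal sketch of this lookup with backoff, written in Python rather than the app's R and under several assumptions: the names build_ngram_table and predict_next are hypothetical, the dictionaries are plain in-memory mappings, and input text is split on whitespace only.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, n):
    """Map each (n-1)-word prefix to its following words, most frequent first."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        counts[prefix][tokens[i + n - 1]] += 1
    # Store only the frequency-sorted words; the raw counts are not needed at lookup time.
    return {p: [w for w, _ in c.most_common()] for p, c in counts.items()}


def predict_next(text, trigrams, bigrams, k=3):
    """Return up to k candidate next words, backing off from trigrams to bigrams."""
    words = text.lower().split()
    candidates = []
    if len(words) >= 2:
        # Trigram lookup keyed on the last two words.
        candidates = trigrams.get(tuple(words[-2:]), [])
    if not candidates and words:
        # Backoff: bigram lookup keyed on the last word only.
        candidates = bigrams.get(tuple(words[-1:]), [])
    return candidates[:k]
```

With a toy corpus, the tables would be built once and then queried per keystroke:

```python
tokens = "i love new york and i love new york and i love new jersey i love you".split()
trigrams = build_ngram_table(tokens, 3)
bigrams = build_ngram_table(tokens, 2)
predict_next("I love new", trigrams, bigrams)  # trigram hit on ("love", "new")
predict_next("we love", trigrams, bigrams)     # no trigram, falls back to ("love",)
```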

Usage

Thank you for taking the time to review my submission!

This presentation was generated using R version 3.5.1 (2018-07-02).