Overview of the Next Word Prediction App

HengTai Ann
09-Sep-2017

Introduction

This slide deck covers the motivation, the methodology and a short manual for a next word prediction app built with Shiny. It was developed as part of the data science specialisation. The purpose of the app is to predict the next word based on one or more previous words.

The task was to analyse and use pre-existing corpora to build an app in Shiny. The three given corpora were taken from blogs, news and Twitter. (Source link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)

Methodology

After cleansing the data of special characters such as $, ! and *, the corpora were used to create repositories of n-grams. Four n-gram orders were built to obtain enough distinctive data: 1-gram, 2-gram, 3-gram and 4-gram.
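The cleansing and counting steps can be sketched as follows. This is a minimal illustration in Python, not the code behind the app (which relies on R packages); the function names `clean` and `count_ngrams`, the toy corpus and the exact cleaning rule (keep only letters, apostrophes and spaces) are assumptions made for the example.

```python
import re
from collections import Counter

def clean(text):
    """Lower-case the text, replace special characters with spaces, split into words."""
    # Assumption: everything except a-z, apostrophes and spaces counts as "special".
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    return text.split()

def count_ngrams(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus standing in for the blogs, news and Twitter files.
corpus = ["I love data science!", "I love R & Shiny $$$"]

# One frequency table per n-gram order, 1-gram to 4-gram.
tables = {n: count_ngrams(corpus, n) for n in range(1, 5)}
```

Each table maps a tuple of words to its frequency, so `tables[2][("i", "love")]` holds the count of the bigram "i love" in the toy corpus.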

Due to the enormous size of the resulting tables, all rare n-grams (frequency cut-off: 10 occurrences) were discarded. This ensured a sensible compromise between accuracy, runtime and memory usage.
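A frequency cut-off of this kind can be sketched in a few lines. The example below is illustrative only: the function name `prune`, the toy counts and the choice to keep n-grams seen at least 10 times (rather than strictly more than 10) are assumptions, since the exact boundary is not stated here.

```python
from collections import Counter

def prune(counts, min_count):
    """Keep only the n-grams seen at least min_count times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

# Toy bigram counts (hypothetical values).
counts = Counter({("of", "the"): 25, ("the", "zebra"): 2, ("in", "the"): 10})

# Assumed cut-off of 10 occurrences; rare n-grams are dropped.
pruned = prune(counts, min_count=10)
```

Raising `min_count` shrinks the tables and speeds up lookups, at the price of losing rare but valid word sequences.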

Libraries and Algorithm

Libraries

  • For the calculation of n-grams: quanteda
  • For data modelling and data storage: SnowballC, MASS, data.table and pryr

Algorithm

  • For predicting the next word: a simple Katz's back-off algorithm. It tries to find occurrences of the given word sequence in the calculated n-gram tables, starting with the longest n-gram and backing off to shorter ones when no match is found.
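The back-off lookup can be sketched as below. This is a simplified illustration in Python, not the app's own code: it returns the most frequent continuation and omits the discounting and probability redistribution of the full Katz model. The function name `predict_next` and the toy tables are assumptions made for the example.

```python
def predict_next(tables, words):
    """Return the most frequent next word, backing off to shorter contexts."""
    # A 4-gram table needs at most the last 3 words as context.
    context = tuple(w.lower() for w in words)[-3:]
    while True:
        order = len(context) + 1
        # Collect every n-gram of this order whose prefix matches the context.
        candidates = {
            gram[-1]: count
            for gram, count in tables.get(order, {}).items()
            if gram[:-1] == context
        }
        if candidates:
            return max(candidates, key=candidates.get)
        if not context:
            return None  # even the 1-gram table is empty
        context = context[1:]  # back off: drop the oldest word

# Toy n-gram tables keyed by order (hypothetical counts).
tables = {
    1: {("the",): 5, ("data",): 3},
    2: {("data", "science"): 4, ("data", "table"): 2},
    3: {("love", "data", "science"): 2},
    4: {},
}

print(predict_next(tables, ["I", "love", "data"]))
print(predict_next(tables, ["hello"]))
```

With these toy tables the first call finds a match in the 3-gram table, while the second finds no match for "hello" and falls back to the most frequent single word.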

Repository

  • n-gram repository

App

The app is hosted here. After a short loading time it shows the following GUI:

[Screenshot: app]

Future Planning

Algorithm: The simple Katz's back-off is a very basic algorithm for predicting the next word. For further enhancement, other, more sophisticated algorithms should be applied.

Memory: Throughout the task, many of the struggles were due to the high amount of memory used by the tables. Therefore it might be sensible to think about further pruning or hashing of word sequences to save space.
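One possible reading of the hashing idea is to store a fixed-size integer key instead of the full word sequence. The sketch below is only an illustration of that idea in Python; the names `context_key`, `compact` and `lookup` are hypothetical, and CRC32 is an assumed choice of hash function. Distinct sequences can collide on the same key, so a real table would need to tolerate or detect that.

```python
import zlib

def context_key(words):
    """Map a word sequence to a 32-bit integer (collisions are possible)."""
    return zlib.crc32(" ".join(words).encode("utf-8"))

# Hypothetical compact table: hashed context -> best next word.
compact = {
    context_key(["love", "data"]): "science",
    context_key(["data"]): "science",
}

def lookup(words):
    """Return the stored next word for a context, or None if it is unknown."""
    return compact.get(context_key(words))
```

The saving comes from replacing variable-length strings with one integer per context; the trade-off is that the original words can no longer be recovered from the table.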

Thank you for your attention!