Overview of SKT NUGU

htann
2017/9/2

Introduction

This slide deck covers the motivation, the methodology and a short manual for a word prediction app built with Shiny. The app was developed as part of the data science specialisation. Its purpose is to predict the next word based on one or more previous words.

The task was to analyse and use preexisting corpora to build an app in Shiny. The three given corpora were taken from blogs, news and Twitter. (Source link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)

Methodology

After the data was cleansed of special characters such as $, ! and *, the corpora were used to create repositories of n-grams. Three kinds of n-gram were built, which gave enough distinctive data:

  • 1-gram (unigram)
  • 2-gram (bigram)
  • 3-gram (trigram)
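The cleansing and counting steps can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the app's code; the function names `tokenize` and `count_ngrams`, the cleaning rule (keep only letters and apostrophes) and the two sample sentences are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes.

    Assumed cleaning rule: everything else ($, !, *, digits, ...) is dropped.
    """
    return re.findall(r"[a-z']+", text.lower())

def count_ngrams(lines, n):
    """Count every run of n consecutive tokens, one line at a time."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample input standing in for the blog, news and Twitter corpora.
sample = ["My name is Anna!", "My name is $Bob*"]
unigrams = count_ngrams(sample, 1)
bigrams = count_ngrams(sample, 2)
trigrams = count_ngrams(sample, 3)
```

Counting line by line keeps n-grams from spanning two unrelated sentences, which is one reasonable choice among several.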

Due to the enormous size of the resulting tables, all rare n-grams were discarded, using a frequency threshold of 10 occurrences. This gave a sensible compromise between accuracy, runtime and memory usage.
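A frequency cut-off of this kind can be sketched in a few lines. This is an illustration in plain Python, not the app's code; the helper name `prune`, the example counts and the decision to keep an n-gram seen exactly 10 times are all assumptions.

```python
from collections import Counter

def prune(counts, min_count=10):
    """Keep only n-grams seen at least min_count times.

    Assumption: the boundary case (exactly min_count) is kept.
    """
    return Counter({ng: c for ng, c in counts.items() if c >= min_count})

# Hypothetical bigram counts, for illustration only.
counts = Counter({("of", "the"): 120, ("the", "zebra"): 3, ("in", "a"): 10})
kept = prune(counts)
```

Raising `min_count` shrinks the tables and speeds up lookups, at the cost of losing rare but valid continuations.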

Libraries and Algorithm

Libraries

  • For the calculation of n-grams: quanteda
  • For data modelling and data storage: data.table and dplyr
  • For charts: ggplot2

Algorithm

  • For predicting the next word: the Stupid Backoff (SB) algorithm. SB tries to find occurrences of the given word sequence in the calculated n-gram tables, and falls back to a shorter sequence when it finds none. Example: the input sequence is "my name is" -> if the 4-gram table has no match, the input is cut to "name is" -> the search is repeated in the 3-gram table
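The back-off search can be sketched as follows. This is a plain Python illustration, not the app's code: the function name `predict`, the tiny example tables and the scoring are assumptions. The penalty of 0.4 per back-off step is the value proposed in the original Stupid Backoff paper; the slides do not say which value the app uses.

```python
from collections import Counter

ALPHA = 0.4  # assumed back-off penalty, taken from the Stupid Backoff paper

def predict(context, tables, top=3):
    """Rank candidate next words by backing off to shorter contexts.

    tables maps n to a Counter of n-gram tuples; context is a list of words.
    """
    max_n = max(tables)
    # Use at most (max_n - 1) words of context.
    context = context[-(max_n - 1):] if max_n > 1 else []
    scores = {}
    penalty = 1.0
    while True:
        n = len(context) + 1
        ctx = tuple(context)
        # All n-grams whose first n-1 words equal the context.
        matches = {ng[-1]: c for ng, c in tables.get(n, {}).items()
                   if ng[:-1] == ctx}
        # The context count is approximated by the sum over its matches.
        total = sum(matches.values())
        for word, c in matches.items():
            # A word keeps the score from the longest context that found it.
            scores.setdefault(word, penalty * c / total)
        if not context:
            break
        context = context[1:]  # drop the oldest word and back off
        penalty *= ALPHA
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]

# Hypothetical tables, for illustration only.
tables = {
    1: Counter({("is",): 5, ("my",): 3, ("name",): 3, ("anna",): 2, ("bob",): 1}),
    2: Counter({("name", "is"): 3, ("is", "anna"): 2, ("is", "bob"): 1}),
    3: Counter({("name", "is", "anna"): 2, ("name", "is", "bob"): 1}),
}
suggestions = predict(["my", "name", "is"], tables)
```

The linear scan over each table keeps the sketch short; an indexed lookup keyed on the context would be the obvious choice for tables of realistic size.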

Repository

  • n-gram repository: around 170 MB of memory usage

App

The app is hosted here. After a short loading time, it shows the following GUI:

[Screenshot of the app GUI]

Future Planning

  • New Product Release: End of 2017 (Small Unit)
  • SoftPoC: support for T map and Kids Phone
  • Open AI open platform for building an ecosystem: around the middle of 2018