Word prediction application

Jakub Vedral
31.8.2017

Data Science Capstone project

Application Introduction

This presentation will briefly introduce my Data Science Capstone application for predicting the next word.

The application is written in R with the help of the following additional packages:

shinythemes, shiny, markdown
data.table, dplyr, dtplyr
quanteda

Prediction uses the HC Corpora dataset, which is quite large (at least for my computer), so the model was built on a 30% sample of the data.
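Drawing such a sample can be sketched as follows. This is a minimal illustration, not the code used for the app: the helper name `sample_lines`, the seed, and the toy input are assumptions.

```r
# Hypothetical helper: keep a random fraction of the lines of a text vector.
sample_lines <- function(lines, fraction = 0.3, seed = 1234) {
  set.seed(seed)                                  # make the sample reproducible
  n_keep <- round(length(lines) * fraction)       # number of lines to keep
  lines[sort(sample.int(length(lines), n_keep))]  # keep original line order
}

# Toy input standing in for lines read with readLines() from the corpus files.
corpus_lines <- paste("line", 1:10)
sampled <- sample_lines(corpus_lines)
length(sampled)  # 3 of the 10 lines are kept
```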

Methods, Models

After creating a data sample from the HC Corpora data, the following transformations (cleaning) were applied to the dataset:

conversion to lowercase
removal of punctuation, links, white space, numbers and special characters
removal of profanity (coarse language), to avoid offending anyone
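The cleaning steps above could look roughly like this in base R. This is a sketch under assumptions: the helper `clean_text` and the one-word profanity list are hypothetical, apostrophes are kept so that contractions survive, and the app itself may rely on quanteda for these steps.

```r
# Hypothetical cleaning helper using base R regular expressions only.
clean_text <- function(x, profanity = character(0)) {
  x <- tolower(x)                                # conversion to lowercase
  x <- gsub("(https?://|www\\.)[^ ]+", " ", x)   # remove links
  x <- gsub("[^a-z' ]", " ", x)                  # drop punctuation, numbers, special characters
  words <- strsplit(x, " +")                     # split on runs of spaces
  vapply(words, function(w) {
    w <- w[nzchar(w) & !(w %in% profanity)]      # drop empty tokens and listed words
    paste(w, collapse = " ")                     # rejoin with single spaces
  }, character(1))
}

clean_text("Visit http://example.com NOW, you darn 2 fools!", profanity = "darn")
# "visit now you fools"
```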

This data sample was then converted into n-grams of several sizes (bigrams, trigrams, 4-grams and 5-grams).
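As an illustration of what the n-gram conversion produces, here is a small base R sketch that counts n-grams in cleaned lines. The helper `count_ngrams` is hypothetical; for the real data a dedicated tokenizer such as the one in quanteda is far faster.

```r
# Hypothetical helper: count all n-grams of size n in a vector of cleaned lines.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " +"), function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))      # line too short for this n
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  # Return a named integer vector, most frequent n-gram first.
  sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)
}

toy_lines <- c("i love new york", "i love new shoes")
bigrams <- count_ngrams(toy_lines, 2)
bigrams[["i love"]]    # 2
bigrams[["new york"]]  # 1
```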

Prediction is based on a lookup mechanism that goes from the longest n-grams to the shortest and tries to find the most frequent matching combination, which then supplies the next word. The search is based on the user's input.
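The lookup described above can be sketched as a simple back-off search. Everything here is an assumption made for illustration: the table layout (`prefix`, `word`, `freq`), the function `predict_next`, and the toy frequencies are not taken from the app.

```r
# Hypothetical model layout: one data frame covering all n-gram sizes.
# prefix holds the first n-1 words of an n-gram, word holds its last word.
model <- data.frame(
  prefix = c("i love new", "i love new", "love new", "new", "new"),
  word   = c("york", "shoes", "york", "york", "year"),
  freq   = c(5, 2, 7, 20, 15),
  stringsAsFactors = FALSE
)

predict_next <- function(input, model, max_prefix = 4, k = 3) {
  words <- strsplit(tolower(trimws(input)), " +")[[1]]
  words <- words[nzchar(words)]
  # Try the longest usable prefix first, then back off to shorter ones.
  for (len in rev(seq_len(min(max_prefix, length(words))))) {
    prefix <- paste(tail(words, len), collapse = " ")
    hits <- model[model$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$freq), ]   # most frequent continuation first
      return(head(hits$word, k))
    }
  }
  character(0)                            # no n-gram matched the input
}

predict_next("I love new", model)    # "york" "shoes"
predict_next("a brand new", model)   # "york" "year" (backs off to the one-word prefix)
```

Here `max_prefix = 4` corresponds to 5-grams as the longest n-grams; returning up to `k` words gives several suggestions ordered by frequency.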

Problems

Not everything went well while creating the prediction model. I had severe performance problems and had to try many different R packages to get to the finish line.

Because of poor performance I discarded (in order):

tm package with RWeka tokenizer (took 2 hours to build bi- and tri-grams)
ngram package (difficult to operate)

I also had to use the data.table package, as a regular data.frame was too slow for operations on a 150 MB dataset.

In the end, building the complete data model took the following times for each n-gram size:

[1] "ngram size: 1"
Time difference of 1.072537 mins
[1] "ngram size: 2"
Time difference of 3.327966 mins
[1] "ngram size: 3"
Time difference of 12.02057 mins
[1] "ngram size: 4"
Time difference of 20.06881 mins
[1] "ngram size: 5"
Time difference of 34.46598 mins
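Output in this form is what R prints for a `difftime` object, so timings like these can be collected with a loop along the following lines. This is a sketch only: `build_ngrams` is a hypothetical stand-in for the real model-building step.

```r
# Hypothetical stand-in for the real n-gram building step.
build_ngrams <- function(n) Sys.sleep(0.01)

timings <- list()
for (n in 1:5) {
  start <- Sys.time()
  build_ngrams(n)
  elapsed <- Sys.time() - start        # a difftime; units are chosen automatically
  print(paste("ngram size:", n))
  print(elapsed)
  timings[[n]] <- elapsed
}
```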

How to use

The application is fairly simple to use. Just type your text in the single text field and get your suggestions below.

Screenshot

The app gives more than one prediction, listing words in order of decreasing probability from left to right.