Coursera R Capstone: A Word Prediction Shiny App

Antonio Rubiera

1/24/2020

Introduction

We design a language model based on a dataset provided by SwiftKey, which is now part of Microsoft (https://www.microsoft.com/en-us/swiftkey?activetab=pivot_1%3aprimaryr2). The dataset contains subsets in four languages (Finnish, Russian, German, and English); we use the English data here. After a detailed exploratory analysis, we decided to build a model from N-grams of one, two, three, four, and five words using all of the English data, supplemented with 6-grams from the blogs subset.
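As a rough sketch of the counting step (the function ngram_freq and the vector english_lines are illustrative names, and we assume the tidytext, dplyr, and tidyr packages rather than the exact pipeline used), frequency tables like the examples below can be built as follows:

library(dplyr)
library(tidyr)
library(tidytext)

# Count all n-grams of a given order in a character vector of text lines,
# then split each n-gram string into one column per word (word1 ... wordn).
ngram_freq <- function(lines, n) {
  tibble(text = lines) %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    filter(!is.na(ngram)) %>%
    count(ngram, sort = TRUE, name = "freq") %>%
    separate(ngram, into = paste0("word", seq_len(n)), sep = " ")
}

bigrams  <- ngram_freq(english_lines, 2)  # english_lines: cleaned corpus text
trigrams <- ngram_freq(english_lines, 3)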

We design a Shiny app hosting a version of the language model that covers 50 percent of all words found in the data, and use it to predict the next word given an input text. The app can handle N-grams with N as large as 100, and it uses as many as the last five words entered for its prediction.
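The prediction itself can be implemented as a highest-frequency backoff over the N-gram tables. The sketch below is illustrative, not the app's exact code; it assumes the model is stored as a list model[[n]] of data frames with columns word1 ... wordn and freq, as in the examples in the next section:

# Predict the next word from up to the last (max_n - 1) words of input,
# backing off to shorter contexts until a match is found.
predict_next <- function(input, model, max_n = 6) {
  words <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  top_unigram <- model[[1]]$word1[which.max(model[[1]]$freq)]
  if (length(words) == 0) return(top_unigram)
  for (n in seq(min(max_n, length(words) + 1), 2)) {
    context <- tail(words, n - 1)
    hits <- model[[n]]
    for (i in seq_len(n - 1)) {
      hits <- hits[hits[[paste0("word", i)]] == context[i], ]
    }
    if (nrow(hits) > 0) {
      return(hits[[paste0("word", n)]][which.max(hits$freq)])
    }
  }
  top_unigram  # no match at any order: fall back to the top unigram
}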

The Language Model: Examples

The most frequent 2-grams in the data:

##     word1    word2 freq
## 1   happy birthday 9266
## 2    good  morning 8250
## 3   years      ago 7498
## 4 mothers      day 6468
## 5  follow     back 6227
## 6    high   school 5937

The most frequent 3-grams:

##    word1      word2 word3 freq
## 1  happy    mothers   day 3419
## 2  cinco         de  mayo 1098
## 3   love       love  love  784
## 4  happy valentines   day  689
## 5   hope      great   day  516
## 6 couple      weeks   ago  514

The Language Model: Average Perplexity

Here is a sample of 5-grams from our language model with perplexity = 2; this means that, on average, we have a 50 percent chance of predicting word5 correctly given word1 through word4.
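To spell out the arithmetic (this is just the standard definition of perplexity applied to a single predicted word, not anything specific to our model):

$$ PP = P(w_5 \mid w_1, \dots, w_4)^{-1} = 2 \quad \Longrightarrow \quad P(w_5 \mid w_1, \dots, w_4) = \frac{1}{PP} = \frac{1}{2} $$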

##       word1  word2     word3   word4     word5 freq
## 10000   bit  bored afternoon writing      tort    2
## 10001   bit    box   covered   piece cardboard    2
## 10002   bit  broke     movie   crack      made    2
## 10003   bit butter    dipped   sauce       pot    2
## 10004   bit  catch       hit    town      true    2
## 10005   bit change      nice   heres     catch    2

The Shiny App