Antonio Rubiera
1/24/2020
We design a language model based on a dataset provided by SwiftKey, which is now part of Microsoft (https://www.microsoft.com/en-us/swiftkey?activetab=pivot_1%3aprimaryr2). The dataset contains subsets in four languages (Finnish, Russian, German, and English); we use the English data here. After a detailed exploratory analysis, we decided to build a model from N-grams of one, two, three, four, and five words drawn from all of the data provided, supplemented with 6-grams from the blogs subset.
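As a sketch of how such frequency tables can be built (the exact tokenization we used is not shown here; the file name en_US.blogs.txt and the tidytext/dplyr calls below are assumptions), counting 2-grams in R might look like this:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Read one subset of the corpus (file name assumed from the SwiftKey download)
blogs <- tibble(text = readLines("en_US.blogs.txt",
                                 encoding = "UTF-8", skipNul = TRUE))

# Count 2-grams; the same call with n = 3, 4, 5 builds the higher-order tables
bigrams <- blogs %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  count(ngram, sort = TRUE, name = "freq") %>%
  separate(ngram, into = c("word1", "word2"), sep = " ")
```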
We design a Shiny app hosting a version of the language model that covers 50 percent of all word occurrences found in the data, and use it to predict the next word given an input text. The app can handle N-grams with N as large as 100, and it uses up to the last five words entered for its prediction.
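The lookup itself is not spelled out in this report; one simple possibility, consistent with using up to the last five words, is a longest-match backoff over the N-gram tables (the function and table layout below are a sketch, not the app's actual code):

```r
# Sketch of a longest-match backoff lookup. `tables[[n]]` is assumed to be
# a data frame of n-grams with columns word1..wordn and freq, matching the
# samples shown later in this report.
predict_next <- function(input, tables) {
  words <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  if (length(words) == 0) return(NA_character_)
  for (n in min(length(words), 5):1) {   # longest context first
    context <- tail(words, n)
    hits <- tables[[n + 1]]              # (n+1)-grams predict the next word
    for (i in seq_len(n)) hits <- hits[hits[[i]] == context[i], , drop = FALSE]
    if (nrow(hits) > 0) {
      return(hits[[n + 1]][which.max(hits$freq)])  # last word column
    }
  }
  NA_character_                          # no context matched at any order
}
```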
Using a similar coverage-based selection, we retained 258,058 3-grams, 407,723 4-grams, 464,767 5-grams, and 70,052 6-grams.
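A coverage cut of this kind can be expressed directly on a frequency table; the sketch below keeps the most frequent N-grams until they account for the target share of all occurrences (the exact criterion behind the counts above is an assumption):

```r
# Keep the most frequent N-grams until they cover `coverage` of all
# N-gram occurrences (0.5 matching the 50 percent figure quoted above).
prune_to_coverage <- function(tab, coverage = 0.5) {
  tab <- tab[order(-tab$freq), ]
  tab[cumsum(tab$freq) / sum(tab$freq) <= coverage, ]
}

# e.g. pruned_bigrams <- prune_to_coverage(bigrams)
```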
The Shiny App is available at: https://rubiera.shinyapps.io/capstone/
The six most frequent 2-grams in the data:

## word1 word2 freq
## 1 happy birthday 9266
## 2 good morning 8250
## 3 years ago 7498
## 4 mothers day 6468
## 5 follow back 6227
## 6 high school 5937
And the six most frequent 3-grams:

## word1 word2 word3 freq
## 1 happy mothers day 3419
## 2 cinco de mayo 1098
## 3 love love love 784
## 4 happy valentines day 689
## 5 hope great day 516
## 6 couple weeks ago 514
Here is a sample of 5-grams from our language model with perplexity = 2; that is, the model has a 50 percent chance of predicting word5 correctly (a perplexity of 2 corresponds to an average per-word probability of 1/2).
## word1 word2 word3 word4 word5 freq
## 10000 bit bored afternoon writing tort 2
## 10001 bit box covered piece cardboard 2
## 10002 bit broke movie crack made 2
## 10003 bit butter dipped sauce pot 2
## 10004 bit catch hit town true 2
## 10005 bit change nice heres catch 2
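The perplexity-to-probability reading above follows from perplexity being the inverse geometric mean of the per-word probabilities; a quick check in R, with hypothetical probabilities:

```r
# Perplexity is the inverse geometric mean of the per-word probabilities,
# so a constant probability of 0.5 per word gives perplexity 2.
p <- c(0.5, 0.5, 0.5, 0.5)           # hypothetical word5 probabilities
perplexity <- exp(-mean(log(p)))     # = 2
```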