Antonio Rubiera
1/24/2020
We design a language model based on a dataset provided by SwiftKey, which is now part of Microsoft (https://www.microsoft.com/en-us/swiftkey?activetab=pivot_1%3aprimaryr2). The dataset contains subsets in four languages (Finnish, Russian, German, and English); we use the English data here. After a detailed exploratory analysis, we decided to build a model from N-grams of one, two, three, four, and five words drawn from all of the data provided, supplemented with 6-grams from the blogs subset.
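As a sketch of how such frequency tables can be built (the exact tokenization we used is not shown here; the file name en_US.blogs.txt and the tidytext/dplyr calls below are assumptions), counting 2-grams in R might look like this:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Read one subset of the corpus (file name assumed from the SwiftKey download)
blogs <- tibble(text = readLines("en_US.blogs.txt",
                                 encoding = "UTF-8", skipNul = TRUE))

# Count 2-grams; the same call with n = 3, 4, 5 builds the higher-order tables
bigrams <- blogs %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  count(ngram, sort = TRUE, name = "freq") %>%
  separate(ngram, into = c("word1", "word2"), sep = " ")
```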
We design a Shiny app hosting a version of the language model that covers 50 percent of all word occurrences found in the data, and use it to predict the next word given an input text. The app can handle N-grams with N as large as 100, and it uses up to the last five words entered for its prediction.
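The lookup itself is not spelled out in this report; one simple possibility, consistent with using up to the last five words, is a longest-match backoff over the N-gram tables (the function and table layout below are a sketch, not the app's actual code):

```r
# Sketch of a longest-match backoff lookup. `tables[[n]]` is assumed to be
# a data frame of n-grams with columns word1..wordn and freq, matching the
# samples shown later in this report.
predict_next <- function(input, tables) {
  words <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  if (length(words) == 0) return(NA_character_)
  for (n in min(length(words), 5):1) {   # longest context first
    context <- tail(words, n)
    hits <- tables[[n + 1]]              # (n+1)-grams predict the next word
    for (i in seq_len(n)) hits <- hits[hits[[i]] == context[i], , drop = FALSE]
    if (nrow(hits) > 0) {
      return(hits[[n + 1]][which.max(hits$freq)])  # last word column
    }
  }
  NA_character_                          # no context matched at any order
}
```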
Using a similar coverage-based selection, we retained 258,058 3-grams, 407,723 4-grams, 464,767 5-grams, and 70,052 6-grams.
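A coverage cut of this kind can be expressed directly on a frequency table; the sketch below keeps the most frequent N-grams until they account for the target share of all occurrences (the exact criterion behind the counts above is an assumption):

```r
# Keep the most frequent N-grams until they cover `coverage` of all
# N-gram occurrences (0.5 matching the 50 percent figure quoted above).
prune_to_coverage <- function(tab, coverage = 0.5) {
  tab <- tab[order(-tab$freq), ]
  tab[cumsum(tab$freq) / sum(tab$freq) <= coverage, ]
}

# e.g. pruned_bigrams <- prune_to_coverage(bigrams)
```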
The Shiny App is available at: https://rubiera.shinyapps.io/capstone/
The six most frequent 2-grams in the data:

## word1 word2 freq
## 1 happy birthday 9266
## 2 good morning 8250
## 3 years ago 7498
## 4 mothers day 6468
## 5 follow back 6227
## 6 high school 5937
And the six most frequent 3-grams:

## word1 word2 word3 freq
## 1 happy mothers day 3419
## 2 cinco de mayo 1098
## 3 love love love 784
## 4 happy valentines day 689
## 5 hope great day 516
## 6 couple weeks ago 514
Here is a sample of 5-grams from our language model with perplexity = 2; that is, the model has a 50 percent chance of predicting word5 correctly (a perplexity of 2 corresponds to an average per-word probability of 1/2).
## word1 word2 word3 word4 word5 freq
## 10000 bit bored afternoon writing tort 2
## 10001 bit box covered piece cardboard 2
## 10002 bit broke movie crack made 2
## 10003 bit butter dipped sauce pot 2
## 10004 bit catch hit town true 2
## 10005 bit change nice heres catch 2
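The perplexity-to-probability reading above follows from perplexity being the inverse geometric mean of the per-word probabilities; a quick check in R, with hypothetical probabilities:

```r
# Perplexity is the inverse geometric mean of the per-word probabilities,
# so a constant probability of 0.5 per word gives perplexity 2.
p <- c(0.5, 0.5, 0.5, 0.5)           # hypothetical word5 probabilities
perplexity <- exp(-mean(log(p)))     # = 2
```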