Jakub Vedral
31.8.2017
Data Science Capstone project
This presentation will briefly introduce my Data Science Capstone application for predicting the next word.
The application is written in R with the help of additional R packages, namely:
Prediction uses the HC Corpora dataset, which is quite large (at least for my computer), so the model was built on a 30% sample of the data.
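A minimal sketch of how a 30% line sample could be drawn in base R. The vector `corpus_lines` is a hypothetical stand-in for the lines read from the corpus files, and the seed value is an arbitrary choice for reproducibility; the original sampling code is not shown in this presentation.

```r
# Hypothetical stand-in for the lines read from the corpus files.
corpus_lines <- paste("line", 1:1000)

set.seed(2017)  # arbitrary seed so the same sample is drawn on every run

# Pick 30% of the line indices without replacement, keeping original order.
sample_size <- round(0.30 * length(corpus_lines))
sample_idx <- sort(sample(length(corpus_lines), size = sample_size))
data_sample <- corpus_lines[sample_idx]

length(data_sample)  # 300
```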
After creating a data sample from the HC Corpora data, the following transformations (cleaning) were applied to the dataset:
This data sample was then converted into n-grams of several sizes (bigrams, trigrams, 4-grams and 5-grams).
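The conversion into n-grams can be illustrated with a small base R sketch. The helper names `make_ngrams` and `count_ngrams` are hypothetical, and the tokenization rule (lowercase, split on anything that is not a letter or apostrophe) is an assumption for illustration, not necessarily the approach used in the application.

```r
# Build all n-grams of size n from a character vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-gram frequencies over a vector of text lines.
# Assumed tokenization: lowercase, split on non-letter characters.
count_ngrams <- function(lines, n) {
  tokens_per_line <- strsplit(tolower(lines), "[^a-z']+")
  grams <- unlist(lapply(tokens_per_line,
                         function(tok) make_ngrams(tok[tok != ""], n)))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input: "the cat" occurs twice, so it is listed first.
bigrams <- count_ngrams(c("the cat sat", "the cat ran"), 2)
print(bigrams)
```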
Prediction is based on a lookup mechanism that goes from the longest n-grams to the shortest and tries to find the most frequent combination, which is then used as the next word. The search is based on the user's input.
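The longest-to-shortest lookup described above can be sketched as follows. The table layout (`prefix`, `word`, `freq` columns), the toy counts and the function name `predict_next` are all assumptions made for illustration; the application itself works on much larger tables.

```r
# Hypothetical toy n-gram tables, keyed by n-gram size.
# 'prefix' holds the preceding words, 'word' the next word,
# 'freq' how often that combination was seen.
ngram_tables <- list(
  "3" = data.frame(prefix = c("i love", "i love"),
                   word = c("you", "it"),
                   freq = c(5L, 2L),
                   stringsAsFactors = FALSE),
  "2" = data.frame(prefix = c("love", "love", "the"),
                   word = c("you", "is", "end"),
                   freq = c(7L, 3L, 4L),
                   stringsAsFactors = FALSE)
)

# Try the longest n-gram first and fall back to shorter ones.
# Returns up to k candidate words, most frequent first.
predict_next <- function(input, tables, k = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[words != ""]
  sizes <- sort(as.integer(names(tables)), decreasing = TRUE)
  for (n in sizes) {
    if (length(words) < n - 1) next  # not enough input words for this size
    prefix <- paste(tail(words, n - 1), collapse = " ")
    tbl <- tables[[as.character(n)]]
    hits <- tbl[tbl$prefix == prefix, , drop = FALSE]
    if (nrow(hits) > 0) {
      hits <- hits[order(hits$freq, decreasing = TRUE), , drop = FALSE]
      return(head(hits$word, k))
    }
  }
  character(0)  # nothing matched at any size
}

predict_next("I love", ngram_tables)     # "you" "it"  (trigram match)
predict_next("they love", ngram_tables)  # "you" "is"  (falls back to bigrams)
```

Returning several candidates ordered by frequency also matches how the app presents more than one suggestion.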
Not everything went well during the creation of the prediction model. I had severe performance problems and had to try many different R packages to get to the finish line.
Because of poor performance I discarded (in order):
I also had to use the data.table package, as a regular data.frame was too slow for operations on a 150 MB dataset.
In the end I got the following times to completely build my data model for these n-grams:
[1] "ngram size: 1"
Time difference of 1.072537 mins
[1] "ngram size: 2"
Time difference of 3.327966 mins
[1] "ngram size: 3"
Time difference of 12.02057 mins
[1] "ngram size: 4"
Time difference of 20.06881 mins
[1] "ngram size: 5"
Time difference of 34.46598 mins
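Output of this shape is what R prints for a string and for a `difftime` value. A minimal sketch of a timing loop that would print in the same format, assuming the build step is wrapped in a function; `build_ngram_model` is a hypothetical placeholder, not the actual build code.

```r
# Hypothetical placeholder for the real n-gram building step.
build_ngram_model <- function(n) {
  Sys.sleep(0.01)
  invisible(n)
}

timings <- list()
for (n in 1:5) {
  print(paste("ngram size:", n))
  started <- Sys.time()
  build_ngram_model(n)
  elapsed <- Sys.time() - started  # a 'difftime' object
  print(elapsed)                   # printed as "Time difference of ..."
  timings[[as.character(n)]] <- elapsed
}
```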
The application is fairly simple to use. Just type your text in the single text field and get your suggestions below.
The app gives more than one prediction, printing words in order of decreasing probability from left to right.