Natural Language Processing (NLP) prediction algorithm presentation

16 December 2017

Introduction

The goal of this project was to build a predictive text model: given some input words, it should predict the next word the user will type. Key points:

which data? –> English text data sets from three different media sources (blogs, news and tweets); some files have >800000 lines (approx. 200 MB per file)
which tools? –> the 'tm' package turned out to be inefficient with large data, so we used the 'quanteda' package.
Model 1: 50000-line sample. Model 2: 100000-line sample.
evaluate models –> cross entropy, benchmarks
build an app –> shiny

Prediction Model: set up

Required transformations:

conversion to ASCII format

replace contractions with their expanded forms (qdap package)

remove URLs, numbers and special characters (~!@#$%^&*|(){}_+:"<>?,./;’[]-=)

profanity filter (bad words removed)

lowercasing and stemming applied
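
The transformations above can be sketched as follows. This is a minimal Python illustration, not the project's code (which used R with the qdap and quanteda packages). The contraction table and bad-word list are tiny hypothetical stand-ins, lowercasing is done early so the lookups match, and stemming is left out because it needs a stemmer library.

```python
import re
import unicodedata

# Hypothetical, tiny stand-ins for a real contraction table and profanity list.
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "it's": "it is"}
BAD_WORDS = {"darn"}

def clean_line(text):
    """Return the cleaned tokens of one line of raw text."""
    # Convert to ASCII; characters without an ASCII equivalent are dropped.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lowercase first so the contraction and bad-word lookups match.
    text = text.lower()
    # Expand contractions before the apostrophes are stripped.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove URLs, then numbers and special characters (anything but a-z and spaces).
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    # Tokenise on whitespace and drop bad words.
    return [tok for tok in text.split() if tok not in BAD_WORDS]

tokens = clean_line("I'm reading https://example.com 24 darn blogs!")
```

Here `tokens` holds only the lowercased words that survive the filters.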

Building the models:

corpus of the data sample

bigrams, trigrams and 4-grams used

extract most frequent words (freq >1)

compare models and select one
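
As a rough illustration of the counting step, the sketch below builds n-gram frequency tables from tokenised lines and keeps only entries seen more than once. It is written in Python for brevity (the project used quanteda in R). The function names and the toy sample are hypothetical, and applying the freq > 1 cut-off to n-grams rather than to single words is an assumption.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every run of n consecutive tokens across the tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_freq=2):
    """Keep only n-grams seen at least min_freq times (i.e. freq > 1 by default)."""
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Toy sample standing in for the cleaned corpus.
sample = [
    ["thanks", "for", "the", "follow"],
    ["thanks", "for", "the", "help"],
    ["thanks", "for", "coming"],
]
bigrams = prune(count_ngrams(sample, 2))
trigrams = prune(count_ngrams(sample, 3))
```

Dropping n-grams that occur only once shrinks the tables considerably, which matters given the memory restrictions discussed later.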

Prediction algorithm

Last three input words –> search 4-grams (highest frequencies)
If not found –> decrease the number of input words (search trigrams, then bigrams)
Several word combinations (simulating skip-grams) are incorporated.
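
The back-off search described above could look roughly like this. It is a simplified Python sketch, not the project's implementation. The `build_table` and `predict` names are hypothetical, candidates are ranked by raw frequency only, and the skip-gram-like word combinations are left out.

```python
from collections import Counter, defaultdict

def build_table(sentences, max_n=4):
    """Map each context of 1 to max_n-1 words to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, nxt = tokens[i:i + n]
                table[tuple(context)][nxt] += 1
    return table

def predict(table, words, top=3):
    """Try the last 3 words (4-gram lookup), then the last 2, then the last 1."""
    for k in (3, 2, 1):
        context = tuple(words[-k:])
        if len(context) == k and context in table:
            # Most frequent continuations first.
            return [w for w, _ in table[context].most_common(top)]
    return []  # nothing matched at any order

# Toy corpus standing in for the cleaned training sample.
corpus = [
    ["i", "want", "to", "go", "home"],
    ["i", "want", "to", "go", "out"],
    ["we", "have", "to", "go", "home"],
]
table = build_table(corpus)
full_match = predict(table, ["i", "want", "to"])           # found via the 4-gram table
backed_off = predict(table, ["they", "need", "to", "go"])  # falls back to the bigram context
```

In the second call the three-word context is unseen, so the search shortens the context until the last two words match.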

Evaluation: incomplete sentences of 2 and 3 words were selected to test the models.

Model 1: input sample size = 10 –> selected

CrossEntropy = 19.1618
#Accuracy = 50%
AverageCrossEntropy = 3.83236

Model 2: input sample size = 10

CrossEntropy = 21.29822
#Accuracy = 50%
AverageCrossEntropy = 4.259644
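
The presentation does not state how these figures were computed. One common definition, assumed here, sums the negative log probability the model assigns to each true next word and divides by the number of scored sentences. The reported averages equal the totals divided by 5 (19.1618 / 5 = 3.83236 and 21.29822 / 5 = 4.259644), which would be consistent with averaging over the five sentences predicted correctly out of 10, although the source does not say so. The log base and the example probabilities below are also assumptions.

```python
import math

def cross_entropy(probs):
    """probs: probability the model gave to the true next word, one per scored sentence.

    Returns (total, average) cross entropy in bits (log base 2 is an assumption).
    """
    total = -sum(math.log2(p) for p in probs)
    return total, total / len(probs)

# Hypothetical probabilities for three scored sentences.
total, average = cross_entropy([0.5, 0.25, 0.125])

# Consistency check on the reported figures: average = total / 5.
model1_average = 19.1618 / 5
model2_average = 21.29822 / 5
```

Lower values mean the model assigned higher probability to the words that were actually typed, which fits Model 1 being the one selected.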

Benchmarking: input sample size = 600 (benchmark tool: https://github.com/hfoffani/dsci-benchmark)

#with ''blogs'' dataset
Overall top-3 score:     13.43 %
Overall top-3 precision: 16.45 %
Average runtime:         129.13 msec
Number of predictions:   14483
Total memory used:       14.00 MB

Comparing with other results, we found an exponential relationship between accuracy and memory: for an algorithm to gain 2% in accuracy, 250 MB more memory is needed.

Shiny App

Step 1: Provide an incomplete sentence (e.g. "if you know any") and click the “Predict” button. The “Clear” button clears the text box and the prediction table.

On the right side of the app, a table appears showing the matching words in order of importance; the first word is the most probable one. The elapsed time the computer took to make the prediction is also shown, as a measure of performance.

Step 2: The user can now help improve the model by providing feedback, just as when typing on a smartphone. She or he chooses a word from those listed in the results table, or provides a new correct one, and clicks the “Send feedback” button.

Conclusions

Exploratory data analysis was performed on the text samples
A lot of cleaning and preparation of the data was needed
Building a model required many decisions:
- sample size?
- good accuracy-performance ratio?
- memory restrictions
- possible improvements?
A model was chosen and some benchmarking done
The focus was on top-3 predictions (3 input words)
Find the app here: https://cyberosa.shinyapps.io/MakingPredictionsApp/
Source code available on GitHub: https://github.com/cyberosa/datasciencecoursera/tree/master/NaturalLanguageProcessing