16 December 2017

Introduction

The goal of this project was to build a predictive text model: given some input words, it should predict the next word the user will type. Key points:

  • which data? –> English text data sets from three different media sources (blogs, news and tweets); some files have >800000 lines (approx. 200 MB per file)
  • which tools? –> the ‘tm’ package proved inefficient on large data, so we used the ‘quanteda’ package.
  • Model 1: 50000-line sample. Model 2: 100000-line sample.
  • evaluate models –> cross entropy, benchmarks
  • build an app –> shiny

Prediction Model: set up

Required transformations:

  • conversion to ASCII format
  • replace contractions with their expanded forms (qdap package)
  • remove URLs, numbers and special characters (~!@#$%^&*|(){}_+:"<>?,./;’[]-=)
  • profanity filter (bad words removed)
  • conversion to lower case and stemming
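
As a rough illustration of the transformations listed above, here is a minimal Python sketch (the project itself used R with quanteda and qdap, so this is not the original code). The `CONTRACTIONS` and `BADWORDS` tables are tiny hypothetical stand-ins for the real resources. Stemming is left out because it would need an external library. Lower-casing is done early here so the lookups stay simple.

```python
import re
import unicodedata

# Hypothetical, tiny stand-ins for the real resources (a contraction table
# such as the one qdap provides, and a full bad-word list).
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "it's": "it is", "can't": "cannot"}
BADWORDS = {"darn"}

def clean_text(line):
    """Apply the cleaning steps to one line of raw text (stemming omitted)."""
    # Conversion to ASCII: map typographic apostrophes to plain ones,
    # then drop any character that has no ASCII equivalent.
    line = line.replace("\u2019", "'")
    line = unicodedata.normalize("NFKD", line).encode("ascii", "ignore").decode("ascii")
    # Lower case (done early so the contraction lookup is case-insensitive).
    line = line.lower()
    # Replace contractions with their expanded forms.
    for short, full in CONTRACTIONS.items():
        line = line.replace(short, full)
    # Remove URLs first, then numbers and special characters.
    line = re.sub(r"(https?://|www\.)\S+", " ", line)
    line = re.sub(r"[^a-z\s]", " ", line)
    # Profanity filter; splitting and re-joining also collapses whitespace.
    tokens = [t for t in line.split() if t not in BADWORDS]
    return " ".join(tokens)
```

For example, `clean_text("I’m at https://example.com, don't call 911 you darn café!")` returns `"i am at do not call you cafe"`.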

Building the models:

  • corpus of the data sample
  • bigrams, trigrams and four-grams used
  • extract most frequent words (freq >1)
  • compare models and select one
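
The n-gram tables can be sketched as follows, again in Python rather than the R/quanteda code the project used. The function names and the `min_freq` parameter are assumptions for illustration. The sketch reads "freq >1" as "keep n-grams seen at least twice".

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every run of n consecutive tokens in already-cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def build_tables(sentences, min_freq=2):
    """Return {n: Counter} for n = 2, 3, 4, dropping n-grams seen fewer than min_freq times."""
    tables = {}
    for n in (2, 3, 4):
        counts = ngram_counts(sentences, n)
        tables[n] = Counter({g: c for g, c in counts.items() if c >= min_freq})
    return tables
```

With the toy input `["thanks for the follow", "thanks for the help", "thanks for coming"]`, only `("thanks", "for")`, `("for", "the")` and `("thanks", "for", "the")` survive the frequency cut, and the four-gram table is empty.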

Prediction algorithm

  1. Last three input words –> search four-grams (highest frequencies)

  2. If not found –> decrease the number of input words (search trigrams and bigrams)

  3. Several word combinations (simulating skip-grams) are incorporated.
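
Steps 1 and 2 amount to a back-off search, which the Python sketch below illustrates (it is not the project's R code). The `predict` function and the toy `TABLES` are hypothetical. The skip-gram-like combinations of step 3 are not covered. A real implementation would index the tables by context instead of scanning them.

```python
def predict(tables, text, k=3):
    """Return up to k candidate next words, trying the longest context first."""
    tokens = text.lower().split()
    for n in (4, 3, 2):  # four-grams, then trigrams, then bigrams
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough input words for this n-gram order
        # Collect the last word of every n-gram whose prefix equals the context.
        matches = {g[-1]: c for g, c in tables.get(n, {}).items() if g[:-1] == context}
        if matches:
            # Highest frequency first; ties are broken alphabetically.
            ranked = sorted(matches.items(), key=lambda kv: (-kv[1], kv[0]))
            return [word for word, _ in ranked[:k]]
    return []

# Toy frequency tables, invented for illustration only.
TABLES = {
    4: {("thanks", "for", "the", "follow"): 3, ("thanks", "for", "the", "help"): 2},
    3: {("for", "the", "first"): 4, ("for", "the", "follow"): 3},
    2: {("the", "same"): 9, ("the", "first"): 7, ("of", "the"): 12},
}
```

Here `predict(TABLES, "Thanks for the")` is answered from the four-grams, `predict(TABLES, "wait for the")` falls back to the trigrams, and `predict(TABLES, "see the")` falls back to the bigrams.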

Evaluation: incomplete sentences of 2 and 3 words were selected to test the models.

Model 1 : input sample size = 10 –> selected

CrossEntropy = 19.1618
#Accuracy = 50%
AverageCrossEntropy = 3.83236

Model 2 : input sample size = 10

CrossEntropy = 21.29822
#Accuracy = 50%
AverageCrossEntropy = 4.259644
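
The slides do not say how these figures were computed. One common definition, assumed here, sums the negative log-probability the model assigns to each correct next word and divides by the number of items scored. The Python sketch below uses base-2 logarithms, which is also an assumption. In both models the reported average equals the reported total divided by 5.

```python
import math

def cross_entropy(probabilities):
    """Total and average cross entropy, in bits.

    `probabilities` holds the probability the model gave to the correct next
    word of each test sentence. Zero probabilities are not handled here and
    would need smoothing first.
    """
    if not probabilities:
        raise ValueError("need at least one probability")
    total = -sum(math.log2(p) for p in probabilities)
    return total, total / len(probabilities)
```

For instance, `cross_entropy([0.5, 0.25, 0.125])` gives a total of 6 bits and an average of 2 bits. Lower is better, which matches the choice of Model 1.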

Benchmarking: input sample size = 600 https://github.com/hfoffani/dsci-benchmark

#with ''blogs'' dataset
Overall top-3 score:     13.43 %
Overall top-3 precision: 16.45 %
Average runtime:         129.13 msec
Number of predictions:   14483
Total memory used:       14.00 MB

Comparing with other results, we found an exponential relationship between accuracy and memory. For an algorithm to gain 2% in accuracy, 250 MB more memory is needed.

Shiny App

Step 1: Provide an incomplete sentence (e.g. “if you know any”) and click the “Predict” button. The “Clear” button clears the text box and the prediction table.

On the right side of the app, a table appears showing the matching words in order of importance; the first word is the most probable one. The elapsed time the computer took to make the prediction is also given, as a measure of performance.

Step 2: The user can now help improve the model by providing feedback, as when typing on a smartphone. She or he must choose a word from those listed in the results table, or provide a new correct one, and click the “Send feedback” button.

Conclusions

  1. Exploratory data analysis was performed on the text samples
  2. A lot of cleaning and preparation of the data was needed
  3. Building a model required many decisions:
    • sample size?
    • good accuracy-performance ratio?
    • memory restrictions
    • possible improvements?
  4. A model was chosen and some benchmarking done
  5. The focus was on top-3 predictions (3 input words)
  6. Find the app here: https://cyberosa.shinyapps.io/MakingPredictionsApp/
  7. Source code available on GitHub: https://github.com/cyberosa/datasciencecoursera/tree/master/NaturalLanguageProcessing