Next word prediction application

[Coursera Data Science specialization]

DMalygin 11/09/2019

The application overview

The aim of the application was to apply the knowledge obtained during the course by building a complete Natural Language Processing pipeline, from downloading the data to delivering a finished data product.

In order to create the application the following steps were undertaken:

  • Download the data
  • Pre-process the data: clean, tokenize, create a document-feature matrix
  • N-gram creation: count frequency for each of the N-grams
  • Save prepared datasets
  • Create prediction algorithm
  • Create Shiny application
  • Test and deploy the application

Pre-processing part overview

The data set (corpus) combines three sources, 'blogs', 'news' and 'twitter', which were collected with a web crawler.

The data was cleaned of elements that are useless for the prediction process:

  • numbers
  • special symbols
  • profanity words
  • web-links etc.

To keep the predictions natural, 'stop' words were not removed, and profanity was removed together with the whole sentence containing it (to avoid an unnatural word order).

The steps mentioned above are described in detail, with plots, here: https://rpubs.com/DMalygin/dsMilestone

The 'quanteda' package was used for the operations above: https://quanteda.io/
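
The cleaning rules listed above can be illustrated with a minimal sketch. It is written in plain Python rather than R with quanteda, so it is not the code used in the application. The profanity list, the regular expressions and the function names are hypothetical placeholders, since the source does not specify them.

```python
import re

# Hypothetical placeholder list; the source does not say which profanity list was used.
PROFANITY = {"darn", "heck"}

URL_RE = re.compile(r"(https?://|www\.)\S+")   # web links
NON_LETTER_RE = re.compile(r"[^a-z' ]+")       # numbers and special symbols


def clean_sentence(sentence):
    """Lower-case a sentence, strip links, numbers and symbols, return tokens."""
    text = URL_RE.sub(" ", sentence.lower())
    text = NON_LETTER_RE.sub(" ", text)
    return text.split()


def clean_corpus(sentences):
    """Return token lists, dropping whole sentences that contain profanity."""
    cleaned = []
    for sentence in sentences:
        tokens = clean_sentence(sentence)
        # Stop words are kept on purpose; only profane sentences are dropped.
        if tokens and not PROFANITY.intersection(tokens):
            cleaned.append(tokens)
    return cleaned


sample = [
    "Visit https://example.com for 3 great tips!",
    "What the heck is this?",
    "I am at the store",
]
cleaned = clean_corpus(sample)
```

In this sketch the second sample sentence is discarded entirely instead of having one word cut out, which mirrors the choice described above to avoid an unnatural word order.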

N-gram creation overview

After the data was cleaned, tokenization was performed. The text was split into:

  • Bigrams (digrams, pairs of words)
  • Trigrams
  • Tetragrams
  • Pentagrams

After that, the frequency of every N-gram was counted, and the N-gram lists were sorted from the most frequent N-grams to the rarest ones.

With pentagrams available, the application can predict the next word from up to 4 preceding words.
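
The counting and sorting step can be sketched as follows. This is a plain-Python illustration, not the quanteda code used in the application, and the function names and the toy sentences are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_table(sentences, n):
    """Count n-grams over tokenised sentences, most frequent first."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts.most_common()


# Toy data standing in for the cleaned corpus.
sentences = [
    ["i", "am", "at", "the", "store"],
    ["i", "am", "at", "home"],
]
bigram_table = ngram_table(sentences, 2)
pentagram_table = ngram_table(sentences, 5)
```

Sentences shorter than n words simply contribute no n-grams, so only the first toy sentence yields a pentagram.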

Prediction algorithm overview

The application uses a simple 'Backoff' algorithm to predict the next word.

The following steps describe the prediction process:

  1. A user enters several words.
  2. The application passes the string to the algorithm.
  3. The algorithm looks up the last N-1 words in the corresponding N-gram list.
  4. If the algorithm finds a next word (the N-th word), it returns it; if not, it searches for the sequence of the last N-2 words in the shorter N-gram list, and so on.
  5. If the algorithm still doesn't find a next word, it returns the most frequent word from the 'stop' words list.
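
The lookup described above can be sketched as a simple backoff over context tables. This is a minimal Python illustration under stated assumptions, not the application's R implementation: the table layout, the function names and the fallback word "the" (standing in for the most frequent stop word) are all hypothetical.

```python
from collections import Counter, defaultdict

MAX_N = 5  # pentagrams are the longest N-grams used


def build_tables(sentences, max_n=MAX_N):
    """Map each context of n-1 words to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for tokens in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict_next(text, tables, fallback="the"):
    """Try the longest usable context first, then back off to shorter ones."""
    words = text.lower().split()
    max_n = max(tables)
    for n in range(min(max_n, len(words) + 1), 1, -1):
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    # Nothing matched at any order: fall back to the assumed top stop word.
    return fallback


# Toy data standing in for the prepared N-gram datasets.
sentences = [
    ["i", "am", "at", "the", "store"],
    ["i", "am", "at", "home"],
    ["i", "am", "at", "home"],
]
tables = build_tables(sentences)
full_match = predict_next("I am at", tables)
backed_off = predict_next("you are at", tables)
no_match = predict_next("hello world", tables)
```

With this toy data, "I am at" is matched directly in the tetragram table, "you are at" only succeeds after backing off to the single-word context "at", and "hello world" matches nothing and falls through to the fallback word.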

The application can be tried here: https://dmalygin.shinyapps.io/wordPredictorApp/