[Coursera Data Science specialization]
DMalygin 11/09/2019
The aim of the application was to apply the knowledge obtained during the course to build a Natural Language Processing pipeline, from downloading the data to delivering a complete data product.
In order to create the application the following steps were undertaken:
The data set (corpus) combines three sources, 'blogs', 'news' and 'twitter', which were collected with a web crawler.
The data was cleaned of elements that are useless for the prediction process:
To keep the predictions natural, 'stop' words were not deleted. Profanity was removed together with the whole sentences containing it, in order to avoid an unnatural word order.
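The profanity rule above can be illustrated with a minimal sketch. The application itself relies on the quanteda package in R, so this Python snippet is not its code. The function name `drop_profane_sentences`, the naive sentence splitter and the placeholder word list are all assumptions made for illustration.

```python
import re

# Hypothetical placeholder list; the word list used by the application is not given here.
PROFANITY = {"darn", "heck"}

def drop_profane_sentences(text, banned=PROFANITY):
    """Split text into sentences and drop each sentence that contains a banned word."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        # Keep non-empty sentences that share no word with the banned list.
        # Stop words are left untouched.
        if sentence and words.isdisjoint(banned):
            kept.append(sentence)
    return kept
```

Dropping the whole sentence, rather than only the banned word, avoids creating word sequences that never occurred in the original text.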
The steps mentioned above are described in detail, with plots, here: https://rpubs.com/DMalygin/dsMilestone
For the operations above, the 'Quanteda' package was used: https://quanteda.io/
After the data was cleaned, tokenization was performed. The text was split into N-grams of up to five words (pentagrams).
After that, the frequency of every N-gram was counted, and the N-gram lists were sorted from the most frequent N-grams to the rarest ones.
With pentagrams, the application can predict the next word from the preceding 4 words.
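The counting and sorting step can be sketched as follows. This is a Python illustration under simple assumptions (whitespace tokenization, a toy sentence), not the quanteda-based code used by the application. The helper names `ngram_counts` and `sorted_ngrams` are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sorted_ngrams(tokens, n):
    """Return (ngram, count) pairs ordered from the most frequent to the rarest."""
    return ngram_counts(tokens, n).most_common()

# Toy example: whitespace tokenization of a short sentence.
tokens = "the cat sat on the mat and the cat ran".split()
bigrams = sorted_ngrams(tokens, 2)
# The most frequent bigram comes first.
top_bigram, top_count = bigrams[0]
```

The same function covers every N-gram order up to pentagrams by calling it with `n` from 1 to 5.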
The application uses a simple 'Backoff' algorithm to predict the next word.
The following steps describe the prediction process:
1. A user enters several words.
2. The application passes the string to the algorithm.
3. The algorithm looks up the last N-1 words in the corresponding N-gram list.
4. If the algorithm finds the next word (the N-th word), it returns it; if not, it looks up the sequence of the last N-2 words, and so on.
5. If the algorithm still doesn't find the next word, it returns the most frequent word from the 'stop' words list.
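The steps above can be sketched as a small backoff lookup. This is a hedged Python illustration, not the application's R code. The names `build_tables` and `predict_next` are hypothetical, and the fallback word "the" is only an assumed stand-in for the most frequent word in the 'stop' words list.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=5):
    """For n = 2..max_n, map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(text, tables, fallback="the"):
    """Return the most frequent follower of the longest known context, backing off to shorter ones."""
    words = text.lower().split()
    max_n = max(tables)
    # Start with the longest usable context (at most max_n - 1 words), then shorten it.
    for n in range(min(max_n, len(words) + 1), 1, -1):
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    # Nothing matched at any order: return the assumed most frequent stop word.
    return fallback

corpus = "we want to go home and they want to go out and i want to sleep".split()
tables = build_tables(corpus)
```

With this toy corpus, `predict_next("i want to", tables)` matches the three-word context directly, while `predict_next("you want to", tables)` has to back off to the two-word context `want to` before it finds a follower.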
The application can be tried here: https://dmalygin.shinyapps.io/wordPredictorApp/