Vadim Bondarenko
July 25, 2016
Data scientists are often required to handle large amounts of messy, unstructured data. One example of such data is a large collection of natural human language stored in various text files. This area of data science is commonly called Natural Language Processing (NLP).
For this project I had the opportunity to:
Data Source: I used text files from three different sources:
Data Preprocessing: I took the following steps to clean the text files:
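As a rough illustration of this kind of cleaning, here is a minimal Python sketch. The deployed app runs on shinyapps.io, which suggests an R/Shiny implementation, so the Python below is purely illustrative, and the specific operations shown (lowercasing, URL removal, stripping digits and punctuation) are assumptions rather than the project's exact pipeline:

```python
import re

def clean_text(raw: str) -> str:
    """Typical cleaning steps (assumed, not the project's exact pipeline):
    lowercase, drop URLs, keep only letters/apostrophes, squeeze whitespace."""
    text = raw.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z' ]+", " ", text)     # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_text("Check out https://example.com -- it's GREAT!!! 123"))
# -> "check out it's great"
```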
Tokenization: I split the text into N-grams, which are sequences of N consecutive words observed in the training data. The number of unique tokens in my training set was:
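As a sketch of what tokenization into N-grams looks like (the `ngrams` helper name is hypothetical, for illustration only):

```python
def ngrams(tokens, n):
    """Return all n-grams (tuples of n consecutive words) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps over the lazy dog".split()
bigrams = ngrams(tokens, 2)
print(bigrams[:3])        # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(len(set(bigrams)))  # count of unique bigrams, i.e. unique tokens at N=2
```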
Relative Frequencies: Once the corpus is tokenized into N-grams, it's straightforward to compute their frequencies of occurrence in the training set and treat those relative frequencies as estimates of the probability mass of natural language.
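For instance, with bigrams (N=2) the relative frequencies could be estimated as follows (an illustrative sketch; `relative_freqs` is a made-up helper name):

```python
from collections import Counter

def relative_freqs(tokens, n):
    """Estimate P(ngram) by its relative frequency in the corpus."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

tokens = "to be or not to be".split()
print(relative_freqs(tokens, 2))
# {('to', 'be'): 0.4, ('be', 'or'): 0.2, ('or', 'not'): 0.2, ('not', 'to'): 0.2}
```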
The algorithm predicts the most likely word that follows the previous 1, 2, or 3 words provided as input. Given the last N words, the model returns the most frequent (N+1)-gram that begins with those N words.
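A minimal sketch of this lookup, assuming trigram counts built as in the snippets above (`predict_next` is a hypothetical name):

```python
from collections import Counter

def predict_next(counts, context):
    """Return the last word of the most frequent (n+1)-gram whose first
    n words equal `context`, or None if the context was never observed."""
    candidates = {g: c for g, c in counts.items() if g[:-1] == context}
    return max(candidates, key=candidates.get)[-1] if candidates else None

tokens = "i want to eat i want to sleep i want a nap".split()
trigrams = Counter(tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2))
print(predict_next(trigrams, ("i", "want")))  # 'to' (seen twice, vs 'a' once)
```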
Katz's Backoff Model: Often a prediction is needed for a combination of words that is not observed in the training corpus. For those cases, I implemented a form of Katz's backoff algorithm. When a given N-gram is not found, the model “backs off” to the next level of (N-1)-grams.
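The back-off logic might look like the sketch below. Note that this simplified version backs off on raw frequencies and omits the discounting that full Katz backoff applies; the function and variable names are illustrative, not the project's actual code:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (tuples of n consecutive words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def predict_with_backoff(models, context):
    """Try the longest context first; if no matching N-gram exists,
    back off to progressively shorter contexts."""
    for n in sorted(models, reverse=True):      # e.g. 3-word, then 2-, then 1-
        ctx = tuple(context[-n:])
        candidates = {g: c for g, c in models[n].items() if g[:-1] == ctx}
        if candidates:  # ties broken by first occurrence in the corpus
            return max(candidates, key=candidates.get)[-1]
    return None  # could fall back to the overall most frequent word

tokens = "a cat sat on the mat a dog sat on the rug".split()
models = {n: ngram_counts(tokens, n + 1) for n in (1, 2, 3)}
print(predict_with_backoff(models, ("sat", "on")))     # 'the' (trigram match)
print(predict_with_backoff(models, ("xyzzy", "the")))  # 'mat' (backs off to bigrams)
```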
Finally, after testing prediction accuracy, I turned the model into a basic interactive application and deployed it so that it is accessible to anyone over the internet. My main requirements for the app were:
As mentioned before, I had to balance prediction accuracy (which improves with a larger sample size) against relatively lightweight computing resources (which favor a smaller sample size).
The app can be accessed at https://vadimus202.shinyapps.io/word_predict/. Please enjoy!