This project is based on the HC Corpora dataset.
The goal of this project is to create a Shiny app that predicts the next word based on the user’s input.
The data is provided by SwiftKey. It contains text from news, Twitter and blogs. It is available in multiple languages such as English, French and Russian.
This report presents the exploratory data analysis of the English-language data set.
Different n-gram models and their performance are also included in this report.
The following summary describes the data from the English news, blogs and Twitter files.
| f_names | f_size | f_lines | n_char | n_words | pct_n_char | pct_lines | pct_words |
|---|---|---|---|---|---|---|---|
| blogs | 200.4242 | 899288 | 208361438 | 37334131 | 0.54 | 0.27 | 0.53 |
| news | 196.2775 | 77259 | 15683765 | 2643969 | 0.04 | 0.02 | 0.04 |
| twitter | 159.3641 | 2360148 | 162384825 | 30373543 | 0.42 | 0.71 | 0.43 |
The files are too large to be analyzed as they are, because the resources available to process them are limited.
To work around this, I sampled the data from each file and performed the analysis on the sample after cleaning and tidying it.
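The sampling and cleaning step can be sketched as follows. This is a minimal illustration in Python, not the code used for this report: the function names `sample_lines` and `clean_line`, the 5% sampling fraction, the seed and the specific cleaning rules (lowercasing, dropping everything except letters, apostrophes and spaces) are all assumptions made for the example.

```python
import random
import re

def sample_lines(lines, fraction, seed=2024):
    """Keep each line independently with probability `fraction` (assumed scheme)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

def clean_line(line):
    """Lowercase the text and keep only letters, apostrophes and single spaces."""
    lowered = line.lower()
    letters_only = re.sub(r"[^a-z' ]+", " ", lowered)
    return " ".join(letters_only.split())  # collapse repeated whitespace

# Toy stand-in for the lines of one source file.
corpus = ["Hello, World! %d" % i for i in range(1000)]
sample = sample_lines(corpus, fraction=0.05)
cleaned = [clean_line(line) for line in sample]
```

Sampling line by line keeps memory use proportional to the sample rather than to the full file, which is the point of the step when resources are limited.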
A unigram model can be treated as the combination of several one-state finite automata.
In this model, the probability of each word only depends on that word’s own probability in the document, so we only have one-state finite automata as units. The automaton itself has a probability distribution over the entire vocabulary of the model, summing to 1.
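As a small illustration of the unigram model described above, the sketch below estimates each word's probability as its relative frequency and checks that the probabilities sum to 1. The toy sentence and the helper names `unigram_probabilities` and `sequence_probability` are assumptions made for the example.

```python
import math
from collections import Counter

def unigram_probabilities(tokens):
    """Estimate P(word) as count(word) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def sequence_probability(probs, words):
    """Under a unigram model the words are independent, so probabilities multiply."""
    return math.prod(probs.get(word, 0.0) for word in words)

tokens = "the end of the day is the end of the week".split()
probs = unigram_probabilities(tokens)

total_probability = sum(probs.values())                 # distribution over the vocabulary
p_the_end = sequence_probability(probs, ["the", "end"])  # P(the) * P(end)
```

Because each word's probability ignores its neighbours, the model assigns the same probability to "the end" and "end the".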
The different sources are news, blogs and twitter.
Uni-gram distributions are plotted based on relative frequency. The same plots are produced for each set of n-grams.
The predictions are based on the n-gram tables.
In the bi-gram model, the next word is predicted from the last word, choosing the candidate with the highest relative frequency.
In the tri-gram model, the next word is predicted from the last two words and their relative frequency.
In the quad-gram model, the next word is predicted from the last three words and their relative frequency.
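The lookup described above can be sketched as follows. This is a hedged Python illustration rather than the implementation behind this report: `build_ngram_table` and `predict_next` are hypothetical names, the toy text is invented, and returning `None` for an unseen context is an assumption, since the report does not say how unseen contexts are handled.

```python
from collections import Counter, defaultdict

def build_ngram_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it (n >= 2)."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        table[context][next_word] += 1
    return table

def predict_next(table, text, n):
    """Return the most frequent follower of the last n-1 words, or None if unseen."""
    context = tuple(text.lower().split()[-(n - 1):])
    candidates = table.get(context)
    if not candidates:
        return None
    # Within one context, the highest count is also the highest relative frequency.
    return candidates.most_common(1)[0][0]

tokens = ("at the end of the day at the end of the week "
          "at the end of the day").split()
bigrams = build_ngram_table(tokens, 2)
trigrams = build_ngram_table(tokens, 3)
quadgrams = build_ngram_table(tokens, 4)

bigram_guess = predict_next(bigrams, "at the", 2)            # uses "the"
trigram_guess = predict_next(trigrams, "end of the", 3)      # uses "of the"
quadgram_guess = predict_next(quadgrams, "end of the", 4)    # uses "end of the"
```

Longer contexts are more specific but are seen less often, which is why the three models can disagree on the same input.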
| word1 | word2 | word3 | word4 | n | proportion | coverage |
|---|---|---|---|---|---|---|
| the | end | of | the | 497 | 8.00e-05 | 0.0000800 |
| the | rest | of | the | 454 | 7.31e-05 | 0.0001531 |
| at | the | end | of | 405 | 6.52e-05 | 0.0002183 |
| for | the | first | time | 397 | 6.39e-05 | 0.0002822 |
| thank | you | for | the | 359 | 5.78e-05 | 0.0003401 |
| is | going | to | be | 358 | 5.76e-05 | 0.0003977 |
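In the table above, the coverage column appears to be the running total of the proportion column (for example, 8.00e-05 + 7.31e-05 = 1.531e-04). A minimal Python sketch of that calculation is shown below; the function name `proportion_and_coverage` and the toy counts are assumptions made for the example and are not taken from the data.

```python
from itertools import accumulate

def proportion_and_coverage(counts):
    """Rank n-grams by count, then add relative frequency and cumulative coverage."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    proportions = [n / total for _, n in ranked]
    coverage = list(accumulate(proportions))  # running sum of the proportions
    return [
        (gram, n, p, c)
        for (gram, n), p, c in zip(ranked, proportions, coverage)
    ]

# Toy counts, invented for illustration only.
toy_counts = {"the end of the": 5, "at the end of": 2, "the rest of the": 3}
rows = proportion_and_coverage(toy_counts)
```

Reading the coverage column from the top shows what share of all observed n-grams is accounted for by the most frequent entries up to that row.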
The data is sampled for ease of analysis and better use of resources.
Exploratory Data Analysis is done.
N-gram models are built together with a prediction model.