This is part of my Capstone Project Assignment for week 2 and uses data from the HC Corpora dataset for analysis. My ultimate goal is to create a Shiny app for predicting n-grams.
The following is my first summarized milestone report for an exploratory data analysis.
I use three files, originating from news, blogs, and twitter, which I read into R. Note: the news file contains three hidden null characters that prevent a full file read and required manual deletion in a text editor, e.g. Notepad.
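The summary table below was produced along these lines; here is a minimal sketch, assuming the three files sit in a local `data/` folder (paths and the helper name are illustrative) and using `readLines(skipNul = TRUE)` as an alternative to deleting the null characters by hand. The proportion columns follow by dividing each count column by its total.

```r
## Illustrative file summary; paths and names are assumptions, not the exact script.
f_names <- c("blogs", "news", "twitter")
f_paths <- file.path("data", paste0("en_US.", f_names, ".txt"))

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)  # skipNul sidesteps the hidden nulls
  data.frame(
    f_size  = file.size(path) / 1024^2,               # file size in MB
    f_lines = length(lines),                          # number of lines
    n_char  = sum(nchar(lines)),                      # total characters
    n_words = sum(lengths(strsplit(lines, "\\s+")))   # rough word count
  )
}

file_summary <- cbind(f_names, do.call(rbind, lapply(f_paths, summarise_file)))
file_summary
```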
| f_names | f_size (MB) | f_lines | n_char | n_words | pct_n_char | pct_lines | pct_words |
|---|---|---|---|---|---|---|---|
| blogs | 200.4242 | 899288 | 208361438 | 37334131 | 0.36 | 0.21 | 0.37 |
| news | 196.2775 | 1010242 | 203791400 | 34372528 | 0.35 | 0.24 | 0.34 |
| twitter | 159.3641 | 2360148 | 162385035 | 30373583 | 0.28 | 0.55 | 0.30 |
The file sizes push against R's memory limit and make processing slow.
I took a 10% sample from each file, cleaned the samples, and created n-grams. To further speed up processing, I subsetted the n-grams to those that covered 90% of the sample; a sketch of this pipeline is shown below.
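A minimal sketch of this pipeline for bi-grams, assuming the dplyr and tidytext packages and that `blogs`, `news`, and `twitter` hold the lines read with `readLines` above (object and column names are illustrative):

```r
library(dplyr)
library(tidytext)

set.seed(1234)
sample_lines <- function(lines, pct = 0.10) {
  lines[rbinom(length(lines), size = 1, prob = pct) == 1]   # keep roughly 10% of the lines
}

corpus <- tibble(text = c(sample_lines(blogs),              # blogs/news/twitter: assumed
                          sample_lines(news),               # character vectors from readLines
                          sample_lines(twitter))) %>%
  mutate(text = tolower(text),
         text = gsub("['\u2019]", "", text),     # drop apostrophes: "don't" -> "dont", "i'm" -> "im"
         text = gsub("[^a-z ]", " ", text))      # replace remaining non-letters with spaces

bigrams <- corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  mutate(proportion = n / sum(n),
         coverage   = cumsum(proportion)) %>%
  filter(coverage <= 0.90)                       # keep the bi-grams covering 90% of occurrences
```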
Be aware that the corpora contain acronyms and abbreviations such as “rt”, meaning re-tweet, or “lol” for laugh out loud. I chose to leave shortened forms such as “im” (I am) and “dont” (don’t / do not) as they are, hence they show up as uni-grams.
The word distribution is summarized with a word cloud, where colour and size represent frequency in the corpora. The words “im” and “time” show up as the most frequent, followed by “people”, “dont”, “day”, and “love”. This is a popular visual method, but I prefer the relative frequency column plots shown further below.
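A sketch of the word cloud, assuming the wordcloud and RColorBrewer packages and a hypothetical `unigrams` frequency table with columns `word` and `n` (analogous to the bi-gram table built above):

```r
library(wordcloud)
library(RColorBrewer)

## unigrams: assumed table of single words and their counts
with(unigrams, wordcloud(words = word, freq = n,
                         max.words    = 100,                   # show the 100 most frequent words
                         random.order = FALSE,                 # plot frequent words in the centre
                         colors       = brewer.pal(8, "Dark2")))
```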
The different sources - news, blogs, and twitter - had different relative word frequencies. Notice that, among the most frequent words, “rt” occurs only on twitter, “ic” and “donc” only in blogs, and “city”, “percent”, and “county” only in news.
Distributions were created for each set of n-grams, based on relative frequency. The charts are shown below.
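A sketch of one such column plot, assuming ggplot2 and the `bigrams` table built above (the same pattern applies to the other n-gram tables):

```r
library(dplyr)
library(ggplot2)

bigrams %>%
  slice_max(proportion, n = 20) %>%                         # 20 most frequent bi-grams
  ggplot(aes(x = reorder(bigram, proportion), y = proportion)) +
  geom_col() +
  coord_flip() +                                            # horizontal bars for readable labels
  labs(x = NULL, y = "Relative frequency",
       title = "Top 20 bi-grams in the 10% sample")
```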
I am using the n-gram tables created for bi-grams, tri-grams, and quad-grams as the basis for prediction. When the user inputs a single word, the model finds the bi-gram with the highest relative frequency that starts with that word; the tri-gram table is used for predictions from two-word entries, and so on. A sketch of this lookup follows the quad-gram table below.
| word1 | word2 | word3 | word4 | n | proportion | coverage |
|---|---|---|---|---|---|---|
| the | end | of | the | 806 | 8.93e-05 | 0.0000893 |
| at | the | end | of | 656 | 7.27e-05 | 0.0001619 |
| the | rest | of | the | 651 | 7.21e-05 | 0.0002340 |
| for | the | first | time | 613 | 6.79e-05 | 0.0003019 |
| at | the | same | time | 506 | 5.60e-05 | 0.0003580 |
| is | going | to | be | 482 | 5.34e-05 | 0.0004113 |
Please notice that, in the quad-gram table above, the 4-grams are separated into word columns and sorted by relative frequency. When the user enters three words, the model matches them against the first three columns and returns the fourth word with the greatest relative frequency. When there is no match, or when more than three words are entered, the completion is chosen at random.
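A minimal sketch of this lookup, assuming dplyr and a `quadgrams` table with the columns shown above; `predict_word` is an illustrative name, and the fallback mirrors the random completion described above:

```r
library(dplyr)

predict_word <- function(w1, w2, w3, quadgrams) {
  hit <- quadgrams %>%
    filter(word1 == w1, word2 == w2, word3 == w3) %>%   # match the three input words
    slice_max(proportion, n = 1, with_ties = FALSE)     # keep the most frequent continuation

  if (nrow(hit) == 1) {
    hit$word4                       # best continuation found in the table
  } else {
    sample(quadgrams$word4, 1)      # no match: fall back to a random completion
  }
}

predict_word("at", "the", "same", quadgrams)   # should return "time", per the table above
```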