Introduction to the Analysis

This is part of my Capstone Project assignment for week 2 and uses data from the HC Corpora dataset for the analysis. My ultimate goal is to create a Shiny app that uses n-grams to predict the next word.

The following is my first summarized milestone report for the exploratory data analysis.

File Summary

I use three files, originating from news, blogs, and Twitter, which I load into R. Note: the news file contains three hidden nul characters that prevent a full file read and required manual deletion in a text editor, e.g. Notepad.
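
The loading code is not shown in this report; below is a minimal sketch (the file names are the standard capstone file names and are assumed here). As an alternative to deleting the nul characters by hand, base R's readLines() can skip them directly with skipNul = TRUE:

```r
# Assumed file names; adjust paths as needed.
# skipNul = TRUE drops embedded nul characters instead of stopping the read.
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```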

file       size (MB)      lines    characters        words   % chars   % lines   % words
blogs       200.4242    899,288   208,361,438   37,334,131      0.36      0.21      0.37
news        196.2775  1,010,242   203,791,400   34,372,528      0.35      0.24      0.34
twitter     159.3641  2,360,148   162,385,035   30,373,583      0.28      0.55      0.30

The file sizes strain R's memory limit and make processing slow, so I took a 10% sample from each file. I cleaned the samples and created n-grams. To further speed up processing, I subsetted the n-grams to those covering 90% of the sampled text.
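
The sampling and tokenization code is not included in this report; the following is a minimal sketch of one possible workflow using dplyr and tidytext (the package choice, the sample_lines() helper, and object names such as blogs are my assumptions, not the original code). The same unnest_tokens() call with token = "ngrams" and n = 2, 3, or 4 produces the bi-, tri-, and quad-gram tables.

```r
library(dplyr)
library(tidytext)   # unnest_tokens() tokenizes text into words or n-grams

set.seed(1234)

# Hypothetical helper: keep roughly 10% of the lines from one source
sample_lines <- function(lines, fraction = 0.10) {
  lines[as.logical(rbinom(length(lines), size = 1, prob = fraction))]
}

sample_text <- c(sample_lines(blogs), sample_lines(news), sample_lines(twitter))

# Uni-gram counts, relative frequencies, and cumulative coverage
unigrams <- tibble(text = sample_text) %>%
  unnest_tokens(word, text) %>%            # lower-cases and strips punctuation
  count(word, sort = TRUE) %>%
  mutate(proportion = n / sum(n),
         coverage   = cumsum(proportion))

# Keep only the most frequent uni-grams covering 90% of the sampled tokens
unigrams_90 <- filter(unigrams, coverage <= 0.90)
```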

UNI-GRAM

Be aware that the corpora contain acronyms and abbreviations such as “rt”, meaning re-tweet, or “lol” for laugh out loud. I chose to leave shorthand such as “im” for I am and “dont” for don’t / do not as is, so they show up as uni-grams.

CHART: Uni-gram Wordcloud

The word distribution is summarized with a word cloud, as follows, where colour and size represent frequency in the corpora. The words “im” and “time” show up as most frequent, followed by “people”, “dont”, “day”, and “love”. This is a popular visual method, but I prefer the relative frequency column plots shown below.
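
The plotting code is not shown in the report; a word cloud like this one could, for example, be drawn with the wordcloud package, assuming the unigrams table from the sketch above:

```r
library(wordcloud)
library(RColorBrewer)

# Draw up to 100 of the most frequent uni-grams; colour reflects frequency
wordcloud(words = unigrams$word, freq = unigrams$n,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```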

CHART: Uni-grams, By Source

The different files - blogs, news, twitter - had different relative word frequencies. Notice that among the most frequent words, “rt” occurs only in twitter, “ic” and “donc” only in blogs, and “city”, “percent”, and “county” only in news.

CHART: Uni-gram Distribution

Distribution charts were created for each set of n-grams based on relative frequency. The charts are below, followed by a sketch of the plotting code.

CHART: Bi-gram Distribution

CHART: Tri-gram Distribution

CHART: Quad-gram Distribution
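
The plotting code is not included in the report; the following is a minimal sketch of how such relative-frequency column plots could be drawn with ggplot2 (the plot_ngram_freq() helper and the ngram/proportion column names are assumptions based on the tables in this report):

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: column plot of the top n-grams by relative frequency
plot_ngram_freq <- function(ngrams, title, top_n = 20) {
  ngrams %>%
    slice_max(proportion, n = top_n) %>%
    ggplot(aes(x = reorder(ngram, proportion), y = proportion)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Relative frequency")
}

plot_ngram_freq(rename(unigrams, ngram = word), "Uni-gram distribution")
```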

The N-gram Prediction Model

I am using the bi-gram, tri-gram, and quad-gram tables as the basis for prediction. When the user inputs a single word, the model finds the bi-gram with the highest relative frequency that starts with that word; the tri-gram table is used for two-word entries, and so on.

word1   word2   word3   word4     n   proportion    coverage
the     end     of      the     806     8.93e-05   0.0000893
at      the     end     of      656     7.27e-05   0.0001619
the     rest    of      the     651     7.21e-05   0.0002340
for     the     first   time    613     6.79e-05   0.0003019
at      the     same    time    506     5.60e-05   0.0003580
is      going   to      be      482     5.34e-05   0.0004113

Please notice that in the quad-gram table the 4-grams are split into one word per column and sorted by relative frequency. When the user inputs three words, the model matches them against the first three columns and returns the fourth word with the greatest relative frequency. Cases where there is no match, or where more than three words are entered, are handled with a random completion.
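
The prediction code itself is not part of this report; the sketch below shows one way the lookup could work, assuming the n-gram tables above (columns word1..wordN and proportion) plus a uni-gram table for the random fallback. The predict_next() function name and the fall-through from longer to shorter n-grams are my assumptions; the report only specifies a random completion when there is no match.

```r
library(dplyr)
library(stringr)

# Minimal lookup sketch: match the last 1-3 input words against the n-gram
# tables and return the continuation with the highest relative frequency.
predict_next <- function(input, bigrams, trigrams, quadgrams, unigrams) {
  words <- str_split(str_to_lower(str_trim(input)), "\\s+")[[1]]
  k <- length(words)

  if (k >= 3) {
    hit <- quadgrams %>%
      filter(word1 == words[k - 2], word2 == words[k - 1], word3 == words[k]) %>%
      slice_max(proportion, n = 1, with_ties = FALSE)
    if (nrow(hit) == 1) return(hit$word4)
  }
  if (k >= 2) {
    hit <- trigrams %>%
      filter(word1 == words[k - 1], word2 == words[k]) %>%
      slice_max(proportion, n = 1, with_ties = FALSE)
    if (nrow(hit) == 1) return(hit$word3)
  }
  if (k >= 1) {
    hit <- bigrams %>%
      filter(word1 == words[k]) %>%
      slice_max(proportion, n = 1, with_ties = FALSE)
    if (nrow(hit) == 1) return(hit$word2)
  }

  # No match at all: fall back to a random completion, as described above
  sample(unigrams$word, 1)
}

predict_next("at the same", bigrams, trigrams, quadgrams, unigrams)
```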