This is part of my Capstone Project Assignment for week 2 and uses data from the HC Corpora dataset for analysis. My ultimate goal is to create a Shiny app for predicting n-grams.
The following is my first summarized milestone report for an exploratory data analysis.
I use three files, originating from news, blogs, and twitter, which I read into R. Note: the news file contains three hidden null characters that prevent a full file read and required manual deletion in a text editor, e.g. Notepad.
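The summary table below was produced along these lines; here is a minimal sketch, assuming the three files sit in a local `data/` folder (paths and the helper name are illustrative) and using `readLines(skipNul = TRUE)` as an alternative to deleting the null characters by hand. The proportion columns follow by dividing each count column by its total.

```r
## Illustrative file summary; paths and names are assumptions, not the exact script.
f_names <- c("blogs", "news", "twitter")
f_paths <- file.path("data", paste0("en_US.", f_names, ".txt"))

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)  # skipNul sidesteps the hidden nulls
  data.frame(
    f_size  = file.size(path) / 1024^2,               # file size in MB
    f_lines = length(lines),                          # number of lines
    n_char  = sum(nchar(lines)),                      # total characters
    n_words = sum(lengths(strsplit(lines, "\\s+")))   # rough word count
  )
}

file_summary <- cbind(f_names, do.call(rbind, lapply(f_paths, summarise_file)))
file_summary
```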
| f_names | f_size (MB) | f_lines | n_char | n_words | pct_n_char | pct_lines | pct_words |
|---|---|---|---|---|---|---|---|
| blogs | 200.4242 | 899288 | 208361438 | 37334131 | 0.36 | 0.21 | 0.37 |
| news | 196.2775 | 1010242 | 203791400 | 34372528 | 0.35 | 0.24 | 0.34 |
| twitter | 159.3641 | 2360148 | 162385035 | 30373583 | 0.28 | 0.55 | 0.30 |
The file sizes push against R's memory limit and make processing slow.
I took a 10% sample from each file, cleaned the samples, and created n-grams. To further speed up processing, I subsetted the n-grams to those that covered 90% of the sample; a sketch of this pipeline is shown below.
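A minimal sketch of this pipeline for bi-grams, assuming the dplyr and tidytext packages and that `blogs`, `news`, and `twitter` hold the lines read with `readLines` above (object and column names are illustrative):

```r
library(dplyr)
library(tidytext)

set.seed(1234)
sample_lines <- function(lines, pct = 0.10) {
  lines[rbinom(length(lines), size = 1, prob = pct) == 1]   # keep roughly 10% of the lines
}

corpus <- tibble(text = c(sample_lines(blogs),              # blogs/news/twitter: assumed
                          sample_lines(news),               # character vectors from readLines
                          sample_lines(twitter))) %>%
  mutate(text = tolower(text),
         text = gsub("['\u2019]", "", text),     # drop apostrophes: "don't" -> "dont", "i'm" -> "im"
         text = gsub("[^a-z ]", " ", text))      # replace remaining non-letters with spaces

bigrams <- corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  mutate(proportion = n / sum(n),
         coverage   = cumsum(proportion)) %>%
  filter(coverage <= 0.90)                       # keep the bi-grams covering 90% of occurrences
```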
Be aware that the corpora contain acronyms and abbreviations such as “rt”, meaning re-tweet, or “lol” for laugh out loud. I chose to leave shortened forms such as “im” (I am) and “dont” (don’t / do not) as they are, hence they show up as uni-grams.
The word distribution is summarized with a word cloud, where colour and size represent frequency in the corpora. The words “im” and “time” show up as the most frequent, followed by “people”, “dont”, “day”, and “love”. This is a popular visual method, but I prefer the relative frequency column plots shown further below.
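A sketch of the word cloud, assuming the wordcloud and RColorBrewer packages and a hypothetical `unigrams` frequency table with columns `word` and `n` (analogous to the bi-gram table built above):

```r
library(wordcloud)
library(RColorBrewer)

## unigrams: assumed table of single words and their counts
with(unigrams, wordcloud(words = word, freq = n,
                         max.words    = 100,                   # show the 100 most frequent words
                         random.order = FALSE,                 # plot frequent words in the centre
                         colors       = brewer.pal(8, "Dark2")))
```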
The different sources - news, blogs, and twitter - had different relative word frequencies. Notice that, among the most frequent words, “rt” occurs only on twitter, “ic” and “donc” only in blogs, and “city”, “percent”, and “county” only in news.
Distributions were created for each set of n-grams, based on relative frequency. The charts are shown below.
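A sketch of one such column plot, assuming ggplot2 and the `bigrams` table built above (the same pattern applies to the other n-gram tables):

```r
library(dplyr)
library(ggplot2)

bigrams %>%
  slice_max(proportion, n = 20) %>%                         # 20 most frequent bi-grams
  ggplot(aes(x = reorder(bigram, proportion), y = proportion)) +
  geom_col() +
  coord_flip() +                                            # horizontal bars for readable labels
  labs(x = NULL, y = "Relative frequency",
       title = "Top 20 bi-grams in the 10% sample")
```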
I am using the n-gram tables created for bi-grams, tri-grams, and quad-grams as the basis for prediction. When the user inputs a single word, the model finds the bi-gram with the highest relative frequency that starts with that word; the tri-gram table is used for predictions from two-word entries, and so on. A sketch of this lookup follows the quad-gram table below.
| word1 | word2 | word3 | word4 | n | proportion | coverage |
|---|---|---|---|---|---|---|
| the | end | of | the | 806 | 8.93e-05 | 0.0000893 |
| at | the | end | of | 656 | 7.27e-05 | 0.0001619 |
| the | rest | of | the | 651 | 7.21e-05 | 0.0002340 |
| for | the | first | time | 613 | 6.79e-05 | 0.0003019 |
| at | the | same | time | 506 | 5.60e-05 | 0.0003580 |
| is | going | to | be | 482 | 5.34e-05 | 0.0004113 |
Please notice that, in the quad-gram table above, the 4-grams are separated into word columns and sorted by relative frequency. When the user enters three words, the model matches them against the first three columns and returns the fourth word with the greatest relative frequency. When there is no match, or when more than three words are entered, the completion is chosen at random.
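A minimal sketch of this lookup, assuming dplyr and a `quadgrams` table with the columns shown above; `predict_word` is an illustrative name, and the fallback mirrors the random completion described above:

```r
library(dplyr)

predict_word <- function(w1, w2, w3, quadgrams) {
  hit <- quadgrams %>%
    filter(word1 == w1, word2 == w2, word3 == w3) %>%   # match the three input words
    slice_max(proportion, n = 1, with_ties = FALSE)     # keep the most frequent continuation

  if (nrow(hit) == 1) {
    hit$word4                       # best continuation found in the table
  } else {
    sample(quadgrams$word4, 1)      # no match: fall back to a random completion
  }
}

predict_word("at", "the", "same", quadgrams)   # should return "time", per the table above
```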