This report provides an exploratory analysis of a large sample of blogs, news and tweets in English taken from HC Corpora. It begins by exploring the size of the data sets (or corpora) and then proceeds to analyze the distribution of terms (words, essentially), bigrams and trigrams. Finally, next steps for modelling are laid out.
The table below summarizes the size of the 3 data sets in terms of number of elements, terms (or words) and unique terms (after removing numbers and punctuation). Given the sheer size of the data sets (blogs with 899,288 elements; news with 1,010,242; and twitter with 2,360,148), only a portion of each data set was explored here, specifically 10,000 elements from each. While this doesn’t cover the majority of the raw data, this sample should suffice for analytical purposes.
Blogs contain the most terms (317,243) and unique terms (31,048). Tweets are much smaller in size than blogs or news, which is not surprising given the restrictions on their length.
| Data set | Elements | Total Terms | Unique Terms |
|---|---|---|---|
| blogs | 10,000 | 317,243 | 31,048 |
| news | 10,000 | 273,688 | 30,565 |
| twitter | 10,000 | 95,010 | 15,155 |
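As a rough illustration of how these summary counts can be produced, the Python sketch below samples 10,000 elements (lines) from each file and tallies total and unique terms after stripping numbers and punctuation. The file names, cleaning rule and random seed are assumptions for the sketch, not details taken from the original analysis.

```python
import random
import re

def summarize_sample(path, sample_size=10_000, seed=42):
    """Sample lines from a corpus file and count total and unique terms.

    Numbers and punctuation are dropped before counting, mirroring the
    preprocessing described above (the exact tokenization is assumed).
    """
    with open(path, encoding="utf-8", errors="ignore") as f:
        lines = f.readlines()

    random.seed(seed)
    sample = random.sample(lines, min(sample_size, len(lines)))

    terms = []
    for line in sample:
        cleaned = re.sub(r"[^a-z\s]", " ", line.lower())  # keep letters only
        terms.extend(cleaned.split())

    return {
        "elements": len(sample),
        "total_terms": len(terms),
        "unique_terms": len(set(terms)),
    }

# Assumed file names; adjust to wherever the corpora are stored locally.
for name in ("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"):
    print(name, summarize_sample(name))
```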
The distribution of term frequencies (after removing standard ‘stopwords’) is explored in the plots below. The predominant theme is that term frequencies are highly skewed, with the vast majority of terms appearing relatively few times and a few terms appearing very frequently. The news data set is especially skewed, which may be due to the relatively standardized language and format of news articles compared with blogs and tweets.
Anywhere from roughly 1% to 4% of unique terms are required to cover at least 50% of term instances among the data sets. For 90% coverage, roughly 40% to 56% of unique terms are needed.
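For context, coverage here means sorting unique terms from most to least frequent and walking down the list until the cumulative count reaches the target share of all term instances. A minimal Python sketch of that calculation, with a toy stopword list standing in for the standard one, might look like this:

```python
from collections import Counter

# Tiny illustrative stopword list; in practice this would come from a
# standard stopword resource (an assumption, not the report's exact list).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def coverage_fraction(terms, target=0.5):
    """Fraction of unique terms (most frequent first) needed to cover
    `target` of all term instances, after stopword removal."""
    counts = Counter(t for t in terms if t not in STOPWORDS)
    total = sum(counts.values())
    running, needed = 0, 0
    for _, freq in counts.most_common():
        running += freq
        needed += 1
        if running / total >= target:
            break
    return needed / len(counts)

# Toy token list; in the report this would be a sampled, cleaned corpus.
tokens = "the cat sat on the mat and the cat ate".split()
print(coverage_fraction(tokens, 0.5), coverage_fraction(tokens, 0.9))
```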
The most frequent terms below provide some extra colour to the previous distribution plots. As might be expected, the word ‘said’ is by far the most frequent term among news documents. The top twitter terms tend to reflect views or opinions about something, such as ‘like’, ‘love’ and ‘good’.
| Rank | blogs | Freq | news | Freq | twitter | Freq |
|---|---|---|---|---|---|---|
| 1 | one | 1,327 | said | 2,484 | just | 656 |
| 2 | will | 1,233 | will | 1,084 | like | 509 |
| 3 | just | 1,158 | one | 835 | get | 473 |
| 4 | can | 1,137 | new | 675 | love | 435 |
| 5 | like | 1,091 | year | 598 | good | 419 |
| 6 | time | 1,016 | also | 586 | will | 402 |
| 7 | get | 768 | can | 576 | thanks | 394 |
| 8 | know | 717 | two | 576 | can | 385 |
| 9 | now | 676 | just | 552 | day | 374 |
| 10 | people | 627 | last | 524 | one | 359 |
The plots below illustrate how bigrams and trigrams are distributed in the data sets. The main difference between these distributions and that of single terms is that the tail is generally fatter, especially for trigrams. This makes sense, as any given sequence of words is unlikely to appear very frequently, and the longer the sequence, the rarer it becomes.
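To make the counting concrete, frequency tables like those below can be built by sliding a window of length n over the token stream and tallying each window. The Python sketch here shows the idea; the tokenization and any stopword filtering applied beforehand are assumptions rather than the report’s exact pipeline.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams (as space-joined strings) in a token list."""
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

# Toy example; in the report, `tokens` would be the cleaned sample corpus.
tokens = "new york is not new jersey but new york is big".split()
print(ngram_counts(tokens, 2).most_common(3))  # top bigrams
print(ngram_counts(tokens, 3).most_common(3))  # top trigrams
```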
The most frequent bigrams and trigrams in the news data set are what one would expect (mainly cities, time references and politicians), as are those among tweets, where greetings dominate. Other findings worth noting are the preponderance of New York-related terms and the trigram ‘u u u’ taking first place in the news data set.
| Rank | blogs | Freq | news | Freq | twitter | Freq |
|---|---|---|---|---|---|---|
| 1 | can see | 62 | last year | 126 | right now | 69 |
| 2 | first time | 59 | new york | 108 | last night | 57 |
| 3 | new york | 59 | high school | 89 | happy birthday | 39 |
| 4 | make sure | 58 | st louis | 79 | just got | 37 |
| 5 | right now | 50 | years ago | 69 | good morning | 36 |
| 6 | even though | 49 | new jersey | 66 | looking forward | 34 |
| 7 | last year | 49 | last week | 60 | can get | 32 |
| 8 | feel like | 46 | los angeles | 46 | follow back | 31 |
| 9 | years ago | 46 | first time | 45 | thanks follow | 30 |
| 10 | every day | 41 | health care | 43 | feel like | 25 |
| Rank | blogs | Freq | news | Freq | twitter | Freq |
|---|---|---|---|---|---|---|
| 1 | new york times | 9 | u u u | 17 | happy mothers day | 17 |
| 2 | new york city | 8 | president barack obama | 13 | let us know | 9 |
| 3 | amazon services llc | 6 | first time since | 12 | cinco de mayo | 8 |
| 4 | happy new year | 6 | two years ago | 12 | happy new year | 8 |
| 5 | new york ny | 6 | gov chris christie | 11 | follow follow back | 6 |
| 6 | two weeks ago | 6 | new york city | 11 | lovz lovz lovz | 6 |
| 7 | hotel birmingham nec | 5 | pates fountain parks | 11 | lies lies lies | 5 |
| 8 | makes feel like | 5 | new york times | 9 | o o o | 5 |
| 9 | year old daughter | 5 | st louis county | 9 | show last night | 5 |
| 10 | years ago now | 5 | us district court | 9 | brenda brenda brenda | 4 |
The plan ahead for developing the prediction algorithm and app can be grouped under the following main steps: