Introduction

In the following we perform some basic exploratory data analysis on the files en_US.twitter.txt, en_US.news.txt, and en_US.blogs.txt. Ultimately, we wish to use these to build a predictive text analytics app; this report begins by developing a basic understanding of the data. We look at 1-grams, 2-grams, and 3-grams, stripping all punctuation and normalizing the text by lower-casing everything. This preprocessing, as well as aggregating counts by n-gram, was done in Python; the results are then loaded into R for the analysis below.
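
The preprocessing scripts themselves are not reproduced in this report. As a rough sketch, the Python logic for lower-casing, stripping punctuation, and aggregating counts per n-gram might look like the following (the function names and file handling here are illustrative, not the actual code used):

    import re
    from collections import Counter

    def tokenize(line):
        # Lower-case and drop everything except letters, digits, and whitespace.
        return re.sub(r"[^a-z0-9\s]", "", line.lower()).split()

    def ngram_counts(path, n):
        # Aggregate counts of n-grams, joined with "_" as in the tables below.
        counts = Counter()
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                tokens = tokenize(line)
                counts.update("_".join(tokens[i:i + n])
                              for i in range(len(tokens) - n + 1))
        return counts

    # Example: aggregate 2-gram counts for the Twitter file.
    # bigrams = ngram_counts("en_US.twitter.txt", 2)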

Counts

We now report the total and distinct n-gram counts, together with how many distinct n-grams are needed to cover 50% and 90% of all n-gram occurrences. We also provide logarithmically scaled plots of the n-gram frequency distributions, restricted to frequencies above a threshold chosen to give 90% total n-gram coverage. This is done for n = 1, 2, and 3 and for each of the three data sets. Finally, we list the ten most frequent n-grams.
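
The coverage figures below come from sorting the n-gram counts in descending order and accumulating them until the desired fraction of all occurrences is reached. The actual computation was done in R, but the logic is simple; here is a minimal sketch in Python, reusing the hypothetical Counter produced by the preprocessing sketch above:

    def distinct_needed(counts, fraction):
        # How many of the most frequent n-grams are needed to cover
        # `fraction` of all n-gram occurrences.
        total = sum(counts.values())
        running = 0
        for i, c in enumerate(sorted(counts.values(), reverse=True), start=1):
            running += c
            if running >= fraction * total:
                return i
        return len(counts)

    # Example (hypothetical): distinct_needed(bigrams, 0.5), distinct_needed(bigrams, 0.9)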

News

Number of lines = 1010242

distinct 1-grams = 399550
distinct 2-grams = 6761240
distinct 3-grams = 18702509

total 1-grams = 34253633
total 2-grams = 33243404
total 3-grams = 32237106

10 most frequent n-grams

1-gram     count   2-gram        count   3-gram               count
the      1967260   of_the       185993   one_of_the           14512
to        900704   in_the       176066   a_lot_of             11459
and       883594   to_the        83692   as_well_as            6233
a         873408   on_the        72710   part_of_the           5688
of        770956   for_the       68797   the_end_of            5617
in        673415   at_the        57798   according_to_the      5601
for       350846   and_the       51799   out_of_the            5552
that      345662   in_a          50507   some_of_the           5440
is        283823   to_be         46821   to_be_a               5363
on        266482   with_the      43438   in_the_first          5132
## [1] "Distinct 1-grams needed for 50%:" "219"
## [1] "Distinct 1-grams needed for 90%:" "9756"
## [1] "Distinct 2-grams needed for 50%:" "26781"
## [1] "Distinct 2-grams needed for 90%:" "793377"
## [1] "Distinct 3-grams needed for 50%:" "241646"
## [1] "Distinct 3-grams needed for 90%:" "2128053"

Twitter

Number of lines = 2360148

distinct 1-grams = 544991
distinct 2-grams = 5548330
distinct 3-grams = 13851470

total 1-grams = 29760155
total 2-grams = 27400007
total 3-grams = 25041177

10 most frequent n-grams

1-gram     count   2-gram        count   3-gram               count
the       933427   in_the        77991   thanks_for_the       23528
to        786379   for_the       73833   looking_forward_to    8711
i         712377   of_the        56811   thank_you_for         8588
a         606543   on_the        48294   cant_wait_to          8244
you       542155   to_be         46841   i_love_you            7974
and       433528   to_the        43298   for_the_follow        7777
for       384311   thanks_for    42747   going_to_be           7393
in        376726   at_the        37100   i_want_to             7030
of        358876   i_love        35397   a_lot_of              6224
is        357433   going_to      34170   to_be_a               5981
## [1] "Distinct 1-grams needed for 50%:" "127"
## [1] "Distinct 1-grams needed for 90%:" "6252"
## [1] "Distinct 2-grams needed for 50%:" "12306"
## [1] "Distinct 2-grams needed for 90%:" "503841"
## [1] "Distinct 3-grams needed for 50%:" "117201"
## [1] "Distinct 3-grams needed for 90%:" "1354424"

Blogs

Number of lines = 899288

distinct 1-grams = 517843
distinct 2-grams = 6841290
distinct 3-grams = 19576346

total 1-grams = 37214322
total 2-grams = 36315098
total 3-grams = 35427699

10 most frequent n-grams

1-gram     count   2-gram        count   3-gram               count
the      1848597   of_the       186623   one_of_the           14512
and      1084582   in_the       152515   a_lot_of             11459
to       1064762   to_the        85712   as_well_as            6233
a         894545   on_the        75010   part_of_the           5688
of        874700   to_be         67879   the_end_of            5617
i         762769   and_the       58154   according_to_the      5601
in        592243   for_the       57839   out_of_the            5552
that      458155   i_was         49075   some_of_the           5440
is        430735   and_i         48745   to_be_a               5363
it        397611   i_have        47439   in_the_first          5132
## [1] "Distinct 1-grams needed for 50%:" "127"
## [1] "Distinct 1-grams needed for 90%:" "6252"
## [1] "Distinct 2-grams needed for 50%:" "16041"
## [1] "Distinct 2-grams needed for 90%:" "636227"
## [1] "Distinct 3-grams needed for 50%:" "176523"
## [1] "Distinct 3-grams needed for 90%:" "1965105"

Conclusion

In all cases, a small number of n-grams occur extremely frequently, and the frequency distribution falls off quickly toward a long tail of rare n-grams. The distribution is concentrated at lower frequencies for 2-grams than for 1-grams, and even more so for 3-grams; that is, individual n-grams occur less often for larger n. This is also why the number of distinct n-grams is so much larger for n = 2 and n = 3.

This indicates that incorporating 2-gram context into the model should outperform a purely 1-gram (i.e., word-count) approach, and 3-grams should do better still. The 50% and 90% coverage numbers will be useful in that they identify a subset of the data that we can opt to work with to increase efficiency.