Data

SwiftKey provided three training data sets as text files. These data sets include text scraped from Twitter, blogs, and news sources, and were read into R using the readLines function, then converted to tibbles for exploratory analysis. Each row in the data sets corresponds to a single tweet or line from a blog or news article.
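A minimal sketch of this step, assuming the standard en_US file names from the SwiftKey corpus:

    library(tibble)

    # Read each corpus file; skipNul guards against embedded nul characters
    twitter <- readLines("en_US.twitter.txt", skipNul = TRUE, warn = FALSE)
    blogs   <- readLines("en_US.blogs.txt", skipNul = TRUE, warn = FALSE)
    news    <- readLines("en_US.news.txt", skipNul = TRUE, warn = FALSE)

    # One row per tweet / blog line / news line
    twitter_df <- tibble(source = "Twitter", text = twitter)
    blogs_df   <- tibble(source = "Blogs", text = blogs)
    news_df    <- tibble(source = "News", text = news)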

Exploratory Analysis

We begin by reporting summary statistics from each of the three data sources, shown in the table below. Length refers to the number of characters in each line from the data source.

Data Source      Lines   Mean Length   Median Length   St Dev Length
Twitter        2360148      68.68045              64        37.22725
Blogs           899288     229.98695             156       258.66081
News           1010242     201.16285             185       133.21714
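These statistics can be reproduced along the following lines, assuming the tibbles built above and the dplyr package:

    library(dplyr)

    bind_rows(twitter_df, blogs_df, news_df) %>%
      mutate(length = nchar(text)) %>%   # characters per line
      group_by(source) %>%
      summarise(lines         = n(),
                mean_length   = mean(length),
                median_length = median(length),
                st_dev_length = sd(length))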

In addition, we can view density plots of the line lengths from each data source. Note that the Twitter data has a maximum line length of 144 characters. For the blog and news sources, we plot the base-ten log of the number of characters, as the maximum line lengths are 40833 and 11384 characters, respectively.
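A sketch of one of these density plots, assuming the blogs_df tibble built above and the ggplot2 package (the news plot is analogous):

    library(ggplot2)

    # Blog line lengths are heavily right-skewed, so plot log10(characters)
    blogs_df %>%
      mutate(length = nchar(text)) %>%
      ggplot(aes(x = log10(length))) +
      geom_density() +
      labs(x = "Base-ten log of line length (characters)", y = "Density")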

To build a predictive text algorithm, we must break down each line by word. After filtering out the most common words (stop words), we also filter out profanity, using a list of banned words originally compiled by Google and available in the user RobertJGabriel’s GitHub repository. We then sort by the most frequent words in each data set, the top fifteen of which are shown in the table below; a code sketch of this processing follows the table.

Fifteen Most Frequent Words by Data Source
Twitter
Word Frequency
just 151115
like 122455
get 112459
love 106721
good 101026
day 91710
can 89847
thanks 89660
rt 89537
now 83986
one 82858
know 79916
u 77531
time 76794
great 76139
Blogs
Word Frequency
one 127287
just 100793
like 100442
can 98420
time 90918
get 71093
know 60496
now 60358
people 59574
also 55366
new 54847
day 52372
even 52174
first 51634
back 51306
News
Word Frequency
said 250418
one 88794
year 76765
new 70773
two 63867
can 58924
also 58786
first 57866
time 57062
just 53350
last 52079
like 50829
state 50095
people 47666
years 46969
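A sketch of this processing for the Twitter data, using the tidytext package; here profanity.txt stands in for the banned-word list downloaded from RobertJGabriel’s repository:

    library(dplyr)
    library(tidytext)

    # Placeholder path for the downloaded profanity list
    profanity <- readLines("profanity.txt")

    twitter_words <- twitter_df %>%
      unnest_tokens(word, text) %>%           # one lowercased word per row
      anti_join(stop_words, by = "word") %>%  # remove common stop words
      filter(!word %in% profanity) %>%        # remove banned words
      count(word, sort = TRUE)

    head(twitter_words, 15)  # the fifteen most frequent words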

We can also examine how frequently unique words appear in each source after filtering. The plot below shows the base-ten log of the frequency of each word in the Twitter data set, with words ranked by frequency along the horizontal axis. We do not show the analogous plots for the other two data sets, as they exhibit the same behavior as the plot for Twitter.
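A sketch of the Twitter plot, assuming the twitter_words counts from the sketch above:

    # Words are already sorted by frequency, so row number gives the rank
    twitter_words %>%
      mutate(rank = row_number()) %>%
      ggplot(aes(x = rank, y = log10(n))) +
      geom_line() +
      labs(x = "Word frequency rank", y = "Base-ten log of frequency")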

For an effective predictive algorithm, we must also be able to find the frequencies of pairs of words in each data set. The table below shows, for each data set, the fifteen most common ordered word pairs after filtering for stop words and profanity; a code sketch follows the table.

Fifteen Most Frequent Pairs of Words by Data Source
Twitter
Word 1 Word 2 Frequency
happy birthday 8389
social media 3886
mother’s day 2874
stay tuned 2657
mothers day 2572
san diego 2232
rt rt 2106
happy friday 1952
1 2 1919
ice cream 1899
happy hour 1859
beautiful day 1813
happy mothers 1769
lol rt 1646
tomorrow night 1605
Blogs
Word 1 Word 2 Frequency
1 2 3976
weeks ago 1606
ice cream 1585
1 4 1469
social media 1342
jesus christ 1314
south africa 1153
real life 1145
3 4 1108
10 minutes 1072
olive oil 1059
feel free 1014
blog post 997
months ago 983
30 minutes 968
News
Word 1 Word 2 Frequency
st louis 9329
los angeles 5333
30 p.m 4493
san francisco 4478
health care 4009
vice president 2906
1 2 2885
san diego 2712
7 p.m 2275
white house 2249
30 a.m 2188
law enforcement 2170
executive director 2156
real estate 2062
supreme court 2052
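A sketch of the bigram counts for the Twitter data, under the same assumptions as the word-count sketch above:

    library(tidyr)

    twitter_bigrams <- twitter_df %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
      separate(bigram, into = c("word1", "word2"), sep = " ") %>%
      filter(!is.na(word1)) %>%  # drop lines too short to form a pair
      filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word,
             !word1 %in% profanity, !word2 %in% profanity) %>%
      count(word1, word2, sort = TRUE)

    head(twitter_bigrams, 15)  # the fifteen most frequent pairs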

Our predictive model will use the relative frequencies of ordered word pairs and ordered word triples, computed from the training data set, to predict a word based on the two words preceding it. For input word pairs that do not appear in the training set, the algorithm will fall back to the overall word frequencies of the training set to generate a suggested word.
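A minimal sketch of the prediction step under this design; the trigram counts, the fallback rule, and the predict_word helper below are illustrative, not the final implementation:

    # Ordered word triples from the training text
    trigrams <- twitter_df %>%
      unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
      separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>%
      filter(!is.na(word1)) %>%
      count(word1, word2, word3, sort = TRUE)

    # Overall word frequencies, used when the input pair was never seen
    unigrams <- twitter_df %>%
      unnest_tokens(word, text) %>%
      count(word, sort = TRUE)

    predict_word <- function(w1, w2) {
      seen <- trigrams %>% filter(word1 == w1, word2 == w2)
      if (nrow(seen) > 0) {
        seen$word3[1]      # most frequent continuation of the pair
      } else {
        unigrams$word[1]   # fall back to the most common word overall
      }
    }

    predict_word("happy", "mothers")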