Exploratory Text Analysis

Introduction

In this assignment we explore the three SwiftKey data files provided, in an attempt to draw insights that may help us develop a model that predicts the next word. We explore the data as follows:

Explore the high level file row-counts and word-counts.
Explore the term frequency of the 3 files.
Explore the unigram, bigram and trigram frequency of the 3 files.
Explore the word correlation of the top terms for each of the 3 files.
Explore some of the entities (locations, persons and organisations) in each of the 3 files.

File Analysis

Below we explore the high-level row-counts and word-counts of the 3 files. The tweets file has the highest row-count, but the blogs file has the highest word-count.

File	Line Count	Word Count
en_US.blogs.txt	899,288	37,334,690
en_US.twitter.txt	2,360,148	30,374,206
en_US.news.txt	1,010,242	34,372,720
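
As a rough illustration of how counts like these can be obtained, the sketch below streams over lines and counts whitespace-delimited tokens. It is not necessarily how the figures above were produced: the function name `count_lines_and_words` is hypothetical, and a different tokenizer would give somewhat different word counts.

```python
def count_lines_and_words(lines):
    """Return (line_count, word_count) for any iterable of text lines.

    Assumption: a "word" is a whitespace-delimited token.
    """
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        word_count += len(line.split())
    return line_count, word_count


# Possible usage on one of the files (streams the file, so memory use stays small):
#
#     with open("en_US.blogs.txt", encoding="utf-8", errors="replace") as handle:
#         print(count_lines_and_words(handle))
```

Passing the open file handle rather than a list of lines avoids loading the whole file at once, which matters for files of this size.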

Data Pre-processing

Before we move on with exploratory analysis we need to clean the corpora by performing the following data pre-processing steps:

remove white-space
remove punctuation
convert each document to lowercase
remove stop-words & profanity
stem or lemmatize each term in the corpus
skip irrelevant stopwords based on a dictionary
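
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used for this report: the `STOP_WORDS` and `PROFANITY` sets are tiny placeholders, `preprocess` is a hypothetical name, and stemming or lemmatization is left out because it needs an external library.

```python
import string

# Placeholder word lists for illustration only; a real run would load
# full stop-word and profanity dictionaries.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}
PROFANITY = {"darn"}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def preprocess(text, stop_words=STOP_WORDS, banned=PROFANITY):
    """Lowercase, strip punctuation, collapse white-space, drop unwanted words."""
    text = text.lower()
    text = text.translate(_STRIP_PUNCTUATION)
    tokens = text.split()  # split() with no argument also collapses white-space
    return [t for t in tokens if t not in stop_words and t not in banned]
```

Note that deleting punctuation turns "don't" into "dont", which is the form that appears in the term tables later in this report.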

Term Frequency

The summary tables and word-clouds below show the ten most frequent terms in each of the three corpora.

Summary Tables

Tweets
word	freq
just	149,580
like	121,279
get	111,901
love	105,430
good	99,549
will	94,247
day	89,816
can	89,084
dont	88,678
thanks	88,588

News
word	freq
said	250,326
will	108,039
one	82,798
new	70,189
also	58,727
can	58,555
year	57,321
two	57,262
just	52,981
first	52,542

Blogs
word	freq
one	123,617
will	112,369
just	99,524
like	97,913
can	97,816
time	87,442
get	70,482
know	59,509
now	58,780
people	58,676
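
Tables like the ones above boil down to counting tokens across all documents of a corpus. A minimal sketch, assuming each document has already been pre-processed into a list of tokens (the name `top_terms` is hypothetical):

```python
from collections import Counter


def top_terms(documents, n=10):
    """Return the n most frequent terms as (term, frequency) pairs.

    `documents` is an iterable of token lists, one list per document.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts.most_common(n)
```

For example, `top_terms([["just", "like", "just"], ["like", "just", "love"]], n=2)` counts "just" three times and "like" twice.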

Word-clouds

Word-clouds reflect the same information as the summary tables, but are a nice way to visualize the results.

N-Gram Analysis

In addition to the summary tables and word-clouds, we look at n-grams to understand which words frequently appear together, that is, which groups of words are most highly correlated in each of the files.
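
To make the idea concrete, the sketch below extracts n-grams with a sliding window over each document's tokens and tallies them. The function names are hypothetical, and n-grams are counted within a document only, so they never span two lines of a file.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(documents, n):
    """Count n-grams over an iterable of token lists."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return counts
```

Calling `ngram_counts(documents, 2).most_common(10)` would give the ten most frequent bigrams; using `1` or `3` instead gives unigrams or trigrams.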

Word Association

Word associations are presented below for each of the top bigrams highlighted in the previous section. The word-association charts help us understand which words are highly correlated with the anchor word, and therefore which words our next-word model is likely to predict.
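
The report does not state how the associations were computed. One common choice, assumed here purely for illustration, is the Pearson correlation between the per-document counts of the anchor word and of another word. The function name `term_correlation` is hypothetical.

```python
import math


def term_correlation(documents, anchor, other):
    """Pearson correlation between per-document counts of two terms.

    `documents` is a list of token lists. Returns 0.0 when the correlation
    is undefined (no documents, or a term whose count never varies).
    """
    if not documents:
        return 0.0
    x = [doc.count(anchor) for doc in documents]
    y = [doc.count(other) for doc in documents]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / math.sqrt(var_x * var_y)
```

A value near 1 means the two words tend to occur in the same documents; a value near 0 means their occurrences are unrelated.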

Entity Extraction

Lastly, we look at the entities that appear in each corpus. Because entity extraction is compute-intensive and slow, the entities were extracted from a 1,000-document sample of each file. Entities may or may not help significantly with our next-word prediction model, but there is no harm in exploring them.
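
The sample-then-tally workflow can be sketched as follows. Two assumptions are made for illustration: the sample is drawn at random (the report does not say how the 1,000 documents were chosen), and the entity tagger is supplied by the caller as `extract_entities`, a hypothetical callable standing in for whichever named-entity recognizer is used.

```python
import random
from collections import Counter


def sample_documents(documents, k=1000, seed=42):
    """Draw up to k documents at random, reproducibly (assumed sampling scheme)."""
    documents = list(documents)
    rng = random.Random(seed)
    return rng.sample(documents, min(k, len(documents)))


def tally_entities(documents, extract_entities, n=10):
    """Count entities returned by a caller-supplied tagger.

    `extract_entities` takes one document (a string) and returns an
    iterable of entity strings; it is a stand-in, not a real library call.
    """
    counts = Counter()
    for doc in documents:
        counts.update(extract_entities(doc))
    return counts.most_common(n)
```

Running the tagger only on the sample keeps the cost roughly proportional to `k` rather than to the full file size.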

Locations

Tweets Locations
Entity	Freq
Chicago	3
Love	3
Wrigley	3
Colorado	2
Houston	2
New York	2
Pittsburgh	2
23.8M	1
518	1
Annapolis	1

News Locations
Entity	Freq
Cleveland	11
New Jersey	8
Afghanistan	7
Arizona	7
California	7
Chicago	7
New York	7
Washington	7
Detroit	6
Ohio	6

Blogs Locations
Entity	Freq
South Africa	6
Europe	4
Bette	3
London	3
Memphis	3
New York	3
Portland	3
Spring	3
Zimbabwe	3
Amazon	2

Persons

Tweets Persons
Entity	Freq
I	2
Twins	2
Wolf	2
Ah	1
Alberto	1
American Samoa	1
And	1
Anderson	1
Antonio Tabucchi	1
Ben -	1

News Persons
Entity	Freq
Obama	6
Adams	4
Johnson	4
Christie	3
Clinton	3
James	3
Jones	3
Ross	3
Baker	2
Beethoven	2

Blogs Persons
Entity	Freq
Chang Min	4
The	4
Chase	3
I	3
John	3
Sam	3
So	3
We	3
And	2
Charles	2

Organisations

Tweets Organisations
Entity	Freq
Academy	2
It	2
Red Sox	2
RT	2
6:30 PM Monrovia California	1
AT	1
Baylor	1
Beatles	1
Best	1
Cardinals	1

News Organisations
Entity	Freq
Senate	7
NBA	5
NFL	5
NCAA	4
The	4
House	3
Lakers	3
Legislature	3
NBC	3
Toyota	3

Blogs Organisations
Entity	Freq
Navy	3
School	3
UK	3
US	3
Lord	2
Team	2
University	2
2011	1
7-11	1
American Association of Poison Control Centers	1