Exploratory Data Analysis

Introduction

In this assignment we explore the 3 SwiftKey data files in an attempt to draw insights that may help us develop a model that predicts the next word. We explore the data as follows:

  • Explore the high level file row-counts and word-counts.

  • Explore the term frequency of the 3 files.

  • Explore the unigram, bigram and trigram frequencies of the 3 files.

  • Explore the word correlation of the top terms for each of the 3 files.

  • Explore some of the Entities of each of the 3 files.


File Analysis

Below we explore the high-level row-counts and word-counts of the 3 files. The tweets file has the highest row-count (2,360,148), while the blogs file has the highest word-count (37,334,690).

File                 Line Count    Word Count
en_US.blogs.txt         899,288    37,334,690
en_US.twitter.txt     2,360,148    30,374,206
en_US.news.txt        1,010,242    34,372,720
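
Counts like those in the table above can be gathered in a single pass over each file. The following is a minimal sketch in Python, not the code used for this report. It assumes the three files sit in the working directory and that a "word" is any whitespace-delimited token, so its word-counts may differ somewhat from the figures above depending on how the original tool tokenized the text. The function name count_lines_and_words is hypothetical.

```python
import os


def count_lines_and_words(path, encoding="utf-8"):
    """Return (line_count, word_count) for a text file, read line by line."""
    lines = 0
    words = 0
    # errors="replace" keeps the pass going if a byte sequence is not valid UTF-8.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            lines += 1
            # Assumption: a word is any whitespace-delimited token.
            words += len(line.split())
    return lines, words


if __name__ == "__main__":
    # Assumption: the files are in the current working directory.
    for name in ("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"):
        if os.path.exists(name):
            line_count, word_count = count_lines_and_words(name)
            print(f"{name}\t{line_count:,}\t{word_count:,}")
```

Reading line by line keeps memory use flat, which matters for files of this size.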


Data Pre-processing

Before we move on with the exploratory analysis, we need to clean the corpora by performing the following data pre-processing steps:

  • remove white-space

  • remove punctuation

  • convert each document to lowercase

  • remove stop-words & profanity

  • stem or lemmatize each term in the corpus

  • skip irrelevant stopwords based on a dictionary
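
Most of the steps above can be sketched as a single cleaning function. This is an illustrative Python sketch under stated assumptions, not the pipeline used for this report: the stop-word and profanity sets are tiny placeholders (a real run would load full dictionaries), the function name clean_document is hypothetical, and stemming or lemmatization is left out because it needs an external library.

```python
import string

# Placeholder lists for illustration only; real dictionaries are much larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}
PROFANITY = {"darn"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_document(text, stop_words=STOP_WORDS, banned=PROFANITY):
    """Lowercase, strip punctuation and extra white-space, drop unwanted words."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    # split() with no arguments also collapses runs of white-space.
    tokens = text.split()
    return [t for t in tokens if t not in stop_words and t not in banned]
```

Note that removing punctuation turns "don't" into "dont", which is the form that appears in the term-frequency tables.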


Term Frequency

Below are summary tables and word-clouds of the most frequent terms found in each corpus.


Summary Tables
Tweets
word freq
just 149,580
like 121,279
get 111,901
love 105,430
good 99,549
will 94,247
day 89,816
can 89,084
dont 88,678
thanks 88,588
News
word freq
said 250,326
will 108,039
one 82,798
new 70,189
also 58,727
can 58,555
year 57,321
two 57,262
just 52,981
first 52,542
Blogs
word freq
one 123,617
will 112,369
just 99,524
like 97,913
can 97,816
time 87,442
get 70,482
know 59,509
now 58,780
people 58,676
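
Term counts like those in the summary tables above can be produced by tallying tokens across all documents of a file. A minimal Python sketch, assuming each document has already been cleaned into a list of tokens; the function name top_terms is hypothetical.

```python
from collections import Counter


def top_terms(documents, n=10):
    """Return the n most frequent terms as (term, count) pairs.

    `documents` is an iterable of token lists, one list per document.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts.most_common(n)
```

For example, top_terms([["just", "like", "just"], ["love", "just"]], 1) returns [("just", 3)].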


Wordclouds

Word-clouds reflect the same information as the summary tables, but are a nice way to visualize the results.


N-Gram Analysis

In addition to the summary tables and word-clouds, we should look at N-Grams to understand which words frequently appear together, that is, which word sequences occur most often in each of the files.
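
To make the idea concrete, an N-Gram is a run of N consecutive tokens within one document. The Python sketch below counts them over cleaned token lists; the function names ngrams and ngram_counts are hypothetical, and N-Grams are deliberately not allowed to span document boundaries.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(documents, n):
    """Count n-grams over an iterable of token lists, one list per document."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return counts
```

Calling ngram_counts(documents, 2).most_common(10) would list the ten most frequent bigrams. Documents shorter than N simply contribute nothing.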


Word Association

The word associations below are presented for each of the top bigrams highlighted in the previous section. The word-association charts help us understand which words are highly correlated with the anchor word, and therefore which words our next-word algorithm/model is likely to predict.
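
The text does not say how the association scores were computed. One common choice, assumed here purely for illustration, is the Pearson correlation between the per-document counts of the anchor word and of every other term. The Python sketch below follows that assumption; the function names and the min_corr threshold are hypothetical, and the loop is written for clarity rather than speed.

```python
import math


def term_vector(documents, term):
    """Per-document count of `term`; `documents` is a list of token lists."""
    return [tokens.count(term) for tokens in documents]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(x)
    if n == 0:
        return 0.0
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def associations(documents, anchor, min_corr=0.3):
    """Terms whose correlation with `anchor` is at least min_corr, best first."""
    anchor_vec = term_vector(documents, anchor)
    vocabulary = {t for tokens in documents for t in tokens} - {anchor}
    scored = ((t, pearson(anchor_vec, term_vector(documents, t))) for t in vocabulary)
    return sorted((p for p in scored if p[1] >= min_corr), key=lambda p: (-p[1], p[0]))
```

A correlation of this kind is symmetric and ignores word order, so it can suggest candidate next words but does not by itself say which word comes first.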


Entity Extraction

Lastly, we take a look at the entities that exist in each corpus. Because entity extraction is quite compute-intensive and slow, the entities have been extracted from a 1,000-document sample of each file. Entities may or may not help us significantly with our next-word prediction model, but there is no harm in exploring them.
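
The text does not say how the 1,000-document sample was drawn. One way to take a uniform random sample from a large file without loading it into memory is reservoir sampling, sketched below in Python as an assumption rather than a description of what was actually done. The function name sample_lines and the fixed seed are hypothetical.

```python
import random


def sample_lines(lines, k=1000, seed=42):
    """Return a uniform random sample of up to k items from an iterable.

    Uses reservoir sampling, so the input is read once and only k items
    are held in memory at any time.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Keep the new item with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = line
    return reservoir
```

Passing an open file handle as lines samples documents directly from disk; the sample would then be handed to whichever entity extractor is being used.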

Locations
Tweets Locations
Entity Freq
Chicago 3
Love 3
Wrigley 3
Colorado 2
Houston 2
New York 2
Pittsburgh 2
23.8M 1
518 1
Annapolis 1
News Locations
Entity Freq
Cleveland 11
New Jersey 8
Afghanistan 7
Arizona 7
California 7
Chicago 7
New York 7
Washington 7
Detroit 6
Ohio 6
Blogs Locations
Entity Freq
South Africa 6
Europe 4
Bette 3
London 3
Memphis 3
New York 3
Portland 3
Spring 3
Zimbabwe 3
Amazon 2
Persons
Tweets Persons
Entity Freq
I 2
Twins 2
Wolf 2
Ah 1
Alberto 1
American Samoa 1
And 1
Anderson 1
Antonio Tabucchi 1
Ben - 1
News Persons
Entity Freq
Obama 6
Adams 4
Johnson 4
Christie 3
Clinton 3
James 3
Jones 3
Ross 3
Baker 2
Beethoven 2
Blogs Persons
Entity Freq
Chang Min 4
The 4
Chase 3
I 3
John 3
Sam 3
So 3
We 3
And 2
Charles 2
Organisations
Tweets Organisations
Entity Freq
Academy 2
It 2
Red Sox 2
RT 2
6:30 PM Monrovia California 1
AT 1
Baylor 1
Beatles 1
Best 1
Cardinals 1
News Organisations
Entity Freq
Senate 7
NBA 5
NFL 5
NCAA 4
The 4
House 3
Lakers 3
Legislature 3
NBC 3
Toyota 3
Blogs Organisations
Entity Freq
Navy 3
School 3
UK 3
US 3
Lord 2
Team 2
University 2
2011 1
7-11 1
American Association of Poison Control Centers 1