Introduction

The capstone project of the Data Science course focuses on text prediction. Based on an existing corpus of text, I’ll attempt to construct a model that predicts the most likely next word from the sequence of preceding words written by the user. The goal of this report is to load, clean and summarize the text corpus that will be used to train the prediction model.

Data loading

The data files are available at the address used in the download code below. The zip file is quite large (over 500MB) and contains corpora in four different languages: English, German, Russian and Finnish. I will focus on the English text corpus for the remainder of the project. The selected corpus consists of three files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. Let’s load them into the system.

# Download and unpack the corpus; mode = "wb" keeps the zip intact on Windows
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
              destfile = "Coursera-Swiftkey.zip", mode = "wb")
unzip("Coursera-Swiftkey.zip")
# Read the three English files; skipNul = TRUE skips embedded nul characters in the twitter file
en_blog_lines <- readLines("./final/en_US/en_US.blogs.txt", encoding = "latin1")
en_news_lines <- readLines("./final/en_US/en_US.news.txt", encoding = "latin1")
en_twitter_lines <- readLines("./final/en_US/en_US.twitter.txt", encoding = "latin1", skipNul = TRUE)
length(en_blog_lines)
## [1] 899288
length(en_news_lines)
## [1] 1010242
length(en_twitter_lines)
## [1] 2360148

The raw files read in directly have 899288, 1010242 and 2360148 lines, respectively. The next important step in the exploratory analysis is text cleaning.

Cleaning

Text pre-processing is important for downstream modeling: cleaned text is easier to interpret and exposes more consistent patterns. The steps I’ve implemented for text cleaning are as follows (a code sketch follows the list):

  1. Turn all letters to lowercase
  2. Map special characters - in this step I replace custom-encoded characters with their uniform version. The mapping file is available in the GitHub repository with the capstone code.
  3. Clean the unmapped special characters - in latin1 encoding these are represented by sequences of hexadecimal byte codes in angle brackets (e.g. <c3><a2><e2><82><ac><c2><a6>)
  4. [optional] Remove stopwords - commonly used words, like “the”, “an”, “of” etc.
  5. Remove obscene words - the list of these words comes from three different sources: here, here and here.
  6. Clean punctuation and numbers - first I remove apostrophes, so words like “i’ve” become concatenated into “ive”; then I remove all punctuation and all numbers.
  7. [optional] Remove skipwords - in this step I remove a manually curated list of artefacts that may be left over from the previous steps
  8. Clean spaces - I compress multiple spaces into one, and remove leading or trailing spaces.
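
A minimal sketch of this pipeline in R, assuming special_char_map (a data frame with pattern and replacement columns), profanity_words and skip_words stand in for the lists kept in the repository; the actual implementation may differ:

# Sketch of the cleaning pipeline; the word lists are placeholders, not the exact ones used
clean_text <- function(lines, remove_stopwords = FALSE) {
  x <- tolower(lines)                                      # 1. lowercase
  for (i in seq_along(special_char_map$pattern)) {         # 2. map custom-encoded characters
    x <- gsub(special_char_map$pattern[i], special_char_map$replacement[i], x, fixed = TRUE)
  }
  x <- gsub("<[0-9a-f]{2}>", "", x)                        # 3. drop unmapped <xx> byte codes
  if (remove_stopwords) {
    x <- tm::removeWords(x, tm::stopwords("en"))           # 4. optional stopword removal
  }
  x <- tm::removeWords(x, profanity_words)                 # 5. obscene words
  x <- gsub("'", "", x)                                    # 6. apostrophes first ("i've" -> "ive") ...
  x <- gsub("[[:punct:][:digit:]]+", " ", x)               #    ... then punctuation and numbers
  x <- tm::removeWords(x, skip_words)                      # 7. optional skipwords
  x <- gsub("\\s+", " ", x)                                # 8. compress spaces
  trimws(x)                                                #    and trim leading/trailing ones
}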

Word count exploration

After the cleaning step I can summarize the distribution of words in the three parts of the corpus - blogs, news and twitter. The figure below shows word counts ordered by decreasing frequency. In all three cases a small fraction of the unique words accounts for the vast majority of all word occurrences.

When we look at the most frequent words, we notice that they are so-called “stopwords”. They may be problematic in downstream modeling, as they are largely unspecific to any patterns appearing in the text.
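
For reference, a frequency table like the one below can be computed with base R; clean_lines is a placeholder for the cleaned lines of one corpus:

# Count word occurrences in a cleaned character vector (one line per element)
words <- unlist(strsplit(clean_lines, " ", fixed = TRUE))
words <- words[words != ""]
word_freq <- sort(table(words), decreasing = TRUE)
head(word_freq, 20)   # the twenty most frequent words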

rank blogs count news count twitter count
1 the 1860773 the 1974503 the 937948
2 and 1094900 to 906158 to 788906
3 to 1069553 a 893982 i 726831
4 a 904252 and 889535 a 616915
5 of 876893 of 774510 you 548482
6 i 777464 in 679104 and 438736
7 in 598741 for 353911 for 385485
8 that 460822 that 346835 in 380744
9 is 432858 is 284282 of 359753
10 it 404217 on 269849 is 358992
11 for 363965 with 254819 it 295457
12 you 298816 said 250432 my 292133
13 with 286781 was 228972 on 277973
14 was 278355 he 228687 that 234847
15 on 276447 it 219556 me 203448
16 my 270932 at 214199 be 188019
17 this 259183 as 188091 at 186839
18 as 224211 i 159110 with 173523
19 have 218949 his 157672 your 171344
20 be 209134 be 152872 have 168769

After removing the stopwords (using the stopwords() function from the tm package), the composition of the most frequent words changes, and some corpus-specific entries appear: the abbreviation “rt” (retweet) is among the most prevalent keywords in the twitter dataset. The impact of stopword removal should be carefully evaluated to see whether it improves prediction accuracy.
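
One way to arrive at the table below without re-cleaning the text is to filter the stopwords out of the frequency table computed above; this is a sketch, not necessarily how the report tables were produced:

library(tm)
# Drop English stopwords from the word-frequency table built earlier
word_freq_nostop <- word_freq[!names(word_freq) %in% stopwords("en")]
head(word_freq_nostop, 20)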

rank blogs count news count twitter count
1 one 127345 said 250432 just 151217
2 will 112848 will 108238 like 122526
3 just 100814 one 88796 get 112646
4 like 100457 year 76735 love 106894
5 can 98407 new 70787 good 101164
6 time 90972 two 63868 will 94818
7 get 71101 can 58842 day 92989
8 know 60503 also 58786 can 89869
9 now 60408 first 57868 thanks 89817
10 people 59588 time 57067 rt 89775
11 also 55378 just 53356 now 84183
12 new 54856 last 52083 one 82948
13 day 52413 years 51702 know 80003
14 even 52186 like 50831 time 76951
15 first 51644 state 50145 great 76213
16 back 51317 people 47702 go 73195
17 make 51216 get 43785 today 73113
18 well 50846 three 39369 new 69857
19 us 50468 city 37882 see 67117
20 see 50222 now 36530 back 58583

Exploration of ngram data

Constructing ngrams from the entire corpus is infeasible. To construct the ngrams I will (a code sketch follows the list):

  1. filter out lines that have fewer than 5 words
  2. sample 10% of the lines of each corpus
  3. construct and compare the sets of digrams, trigrams and quadgrams for the sample set.
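
A minimal base-R sketch of these three steps, assuming clean_lines holds the cleaned lines of one corpus; the actual implementation may differ:

# 1.-2. keep lines with at least 5 words and sample 10% of them
sample_lines <- function(lines) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  tokens <- tokens[lengths(tokens) >= 5]
  sample(tokens, round(0.1 * length(tokens)))
}
# 3. build a frequency table of n-word sequences from the tokenized sample
build_ngrams <- function(tokens, n) {
  ngrams <- unlist(lapply(tokens, function(w) {
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(ngrams), decreasing = TRUE)
}
set.seed(1234)                       # arbitrary seed, for reproducibility
tokens_sample <- sample_lines(clean_lines)
digrams   <- build_ngrams(tokens_sample, 2)
trigrams  <- build_ngrams(tokens_sample, 3)
quadgrams <- build_ngrams(tokens_sample, 4)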

The figure below summarizes the frequencies of ngrams observed in the sample datasets with and without stopwords. First of all, similarly to the word frequencies, we can see a long tail of low-frequency ngrams. The higher the order of the ngram, the more ngrams appear with low frequency. Also, a direct comparison between the corpora with and without stopwords shows that removing the stopwords greatly reduces the number of available ngrams.

Let’s look at the most frequent ngrams in the “with stopwords” corpus, listed in the table below.

rank digrams freq trigrams freq quadgrams freq
1 of the 42953 one of the 3402 the end of the 738
2 in the 41107 a lot of 3082 the rest of the 680
3 to the 21228 to be a 1873 for the first time 641
4 on the 19579 thanks for the 1835 at the end of 637
5 for the 19470 going to be 1730 at the same time 528
6 to be 16391 i want to 1566 thanks for the follow 478
7 at the 14017 out of the 1490 is going to be 471
8 and the 12640 the end of 1481 one of the most 443
9 in a 12186 it was a 1421 in the middle of 406
10 with the 10419 some of the 1383 is one of the 393
11 is a 9943 as well as 1372 to be able to 389
12 it was 9677 the u s 1360 going to be a 385
13 for a 9273 be able to 1333 when it comes to 383
14 i have 8786 i dont know 1267 i dont want to 349
15 from the 8780 part of the 1197 cant wait to see 336
16 i was 8697 i have a 1194 thank you for the 325
17 and i 8376 i have to 1177 if you want to 318
18 it is 8296 looking forward to 1145 in the u s 310
19 with a 8292 the rest of 1084 one of the best 300
20 will be 8073 the first time 1064 in the united states 280

Let’s compare them with the most frequent ngrams in the “without stopwords” corpus, listed in the table below.

rank digrams freq trigrams freq quadgrams freq
1 right now 2282 new york city 269 vested interests vested interests 251
2 new york 1967 interests vested interests 251 interests vested interests vested 250
3 year old 1966 vested interests vested 251 amazon services amazon eu 42
4 last year 1854 let us know 238 cake cake cake cake 42
5 last night 1498 happy mothers day 217 martin luther king jr 40
6 years ago 1410 two years ago 161 just finished mi run 37
7 high school 1369 happy new year 146 rock roll hall fame 33
8 first time 1271 president barack obama 137 new york new jersey 29
9 feel like 1265 cinco de mayo 119 amp amp amp gt 28
10 last week 1216 new york times 118 mg cholesterol mg sodium 28
11 make sure 1067 world war ii 118 happy cinco de mayo 27
12 looking forward 1060 will take place 109 calories protein carbohydrate fat 26
13 can get 1052 looking forward seeing 93 protein carbohydrate fat saturated 26
14 looks like 927 gov chris christie 92 cholesterol mg sodium fiber 25
15 even though 921 first time since 86 let us know think 25
16 new jersey 842 year old son 82 amp amp amp amp 24
17 just got 802 year old daughter 81 carbohydrate fat saturated mg 23
18 one day 779 four years ago 76 fat saturated mg cholesterol 23
19 next week 773 three years ago 76 get real rewards just 23
20 two years 768 new years eve 75 new york stock exchange 23

As we can see, the ngrams in the second table seem more specific.

Conclusions for future modeling

The available text corpus is large and needs to be filtered carefully before modeling. Given the long tails of low-frequency items, it may be useful to trim down the number of collected words and ngrams. When trimming down the corpus I will do the following (sketched in the code after this list):

  1. keep single words that represent 95% of the total word count
  2. remove ngrams with a frequency of 1.
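
A sketch of both trimming rules, assuming word_freq and the ngram tables are frequency tables sorted in decreasing order, as in the earlier sketches:

# 1. keep the most frequent words that together cover 95% of all word occurrences
cum_share <- cumsum(word_freq) / sum(word_freq)
word_freq_trimmed <- word_freq[cum_share <= 0.95]
# 2. drop ngrams that occur only once
digrams_trimmed <- digrams[digrams > 1]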

An initial comparison of object.size() values for the sample ngram corpus (see the snippet after this list) shows that after this reduction:

  1. “with stopwords” corpus size is reduced from 1.2Gb to 122.7Mb
  2. “without stopwords” corpus size is reduced from 907.6Mb to 45.4Mb
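
The sizes were obtained with object.size(); for example, using the illustrative object names from the sketches above:

# Report object sizes before and after trimming
format(object.size(digrams), units = "Mb")
format(object.size(digrams_trimmed), units = "Mb")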

This step will be helpful in making the downstream analysis more lightweight. Introducing higher order ngrams may be necessary, but will increase the size of the entire corpus.