Data Science Milestone Report

Martin Slíva

21/04/2020

Abstract

This report presents interim results from developing a text-prediction application as a task of the Data Science Capstone. It describes basic statistics of the loaded data and some interim results of data cleaning.

In the summary I describe some ideas that I am going to investigate next.

Data Source

The data was downloaded as a zip file from the course repository. The zip file contains four directories:

  • de_DE
  • en_US
  • fi_FI
  • ru_RU

For the analysis and product development I will use only the data from the en_US directory. The directory consists of three files:

## [1] en_US.blogs.txt   en_US.news.txt    en_US.twitter.txt
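
A minimal sketch of how such a listing can be produced in R; the relative path "en_US" is an assumption about where the unzipped data lives.

# List the files in the English sub-directory (path is an assumption)
list.files("en_US")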

Basic Summary

File                Size (MB)   No. of Lines   No. of Characters
en_US.news.txt         196.28      1,010,242         203,223,159
en_US.blogs.txt        200.42        899,288         206,824,505
en_US.twitter.txt      159.36      2,360,148         162,096,031
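
A minimal sketch of how these per-file statistics can be computed in base R; the file paths and the helper name file_stats are assumptions, not part of the original analysis.

# Basic statistics for one file: size, number of lines, number of characters
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    File       = basename(path),
    Size_MB    = round(file.size(path) / 1024^2, 2),
    Lines      = length(lines),
    Characters = sum(nchar(lines))
  )
}

files <- file.path("en_US", c("en_US.news.txt", "en_US.blogs.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, file_stats))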

Basic Data Cleaning

After loading some samples of the data we can see that the data are quite dirty - a lot of numbers, punctuation etc. It is almost impossible to guess the number of words, so we need to clean the data before we start the exploratory analysis. I have chosen a “brute force” strategy for the first round of data cleaning, in these steps (a sketch in R follows the list):

  1. transcode all characters to lower case
  2. remove all escaped quotation marks
  3. remove all numbers
  4. replace all non [a-z] characters with a space
  5. collapse all runs of multiple spaces into one space
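
A minimal sketch of this brute-force cleaning in base R, assuming raw_lines is a character vector with the loaded text; the function name clean_lines is an assumption.

# Brute-force cleaning: lower case, drop escaped quotes and numbers,
# keep only [a-z], collapse whitespace
clean_lines <- function(raw_lines) {
  x <- tolower(raw_lines)                     # 1. lower case
  x <- gsub('\\"', " ", x, fixed = TRUE)      # 2. remove escaped quotation marks
  x <- gsub("[0-9]+", " ", x)                 # 3. remove all numbers
  x <- gsub("[^a-z]", " ", x)                 # 4. replace non [a-z] characters by space
  x <- gsub(" +", " ", x)                     # 5. collapse multiple spaces
  trimws(x)
}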

As a result we get reasonably pre-cleaned data. Of course some problems remain, and we have also created some new mess in the data - like “ve” instead of the correct “we’ve”, or “st” instead of “1st”. But we can address those problems later while fine-tuning the models. For a low cost we got data clean enough for the exploratory analysis.

Exploratory Analysis

After the basic cleaning it is time to tokenize the text and have a look at the data. The number of tokens is a reasonably good approximation of the number of words (a tokenization sketch follows the table).

File                Tokens after cleaning   Unique tokens after cleaning
en_US.news.txt                 34,615,456                        212,552
en_US.blogs.txt                37,877,989                        253,523
en_US.twitter.txt              30,541,949                        305,383
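
Because the cleaned text contains only [a-z] characters and single spaces, tokenization reduces to splitting on spaces. A minimal sketch, reusing the hypothetical clean_lines helper and raw_lines vector from the cleaning sketch above:

# Tokenize by splitting on spaces (the text is already reduced to [a-z ] only)
tokens <- unlist(strsplit(clean_lines(raw_lines), " ", fixed = TRUE))
tokens <- tokens[tokens != ""]      # drop any empty strings

length(tokens)                      # total number of tokens
length(unique(tokens))              # number of unique tokens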

The result is surprising, especially when we compare the number of unique tokens with Shakespeare, who used about 31 thousand different words in all his writings. I did not expect such a big disproportion.

Root Cause

  1. Typos - definitely the main source of the inflated number of tokens.
  2. Plurals etc. - we distinguish between “years” and “year”, and the same holds for verb forms and other grammatical variants.
  3. Some problems are caused by the cleaning strategy itself (as described above).

Way to Solve It

Before solving the problems one by one, which can be time-consuming and ineffective, let’s first have a look at the data in a chart.

The chart below shows the frequency (blue) and cumulative frequency (red) of all tokens from all three files combined. The data are sorted by frequency in descending order.

The “saw teeth” on the frequency (blue) line in the chart indicate that more than one token shares the same frequency. For example, the last spike represents 280,627 tokens with only one occurrence; those tokens are mainly typos.

Let’s have a look at the beginning of the chart.

This chart shows that we need only 125 tokens to cover 50% of the total frequency!
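
A minimal sketch of how the frequency and cumulative frequency behind these charts can be computed, assuming tokens holds the combined tokens from all three files (as in the tokenization sketch above):

# Token frequencies, sorted from most to least frequent
freq     <- sort(table(tokens), decreasing = TRUE)
freq_pct <- 100 * freq / sum(freq)       # frequency in %
cum_pct  <- cumsum(freq_pct)             # cumulative frequency in %

sum(freq == 1)                           # tokens occurring exactly once (mostly typos)
head(data.frame(Feature    = names(freq),
                Frequency  = as.numeric(freq_pct),
                Cumulative = as.numeric(cum_pct)), 15)   # top 15 tokens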

Let’s have a look at the top 15 most frequent tokens (Frequency and Cumulative Frequency are in %):

Rank   Feature   Frequency   Cumulative Frequency
   1   the            4.63                   4.63
   2   to             2.68                   7.32
   3   and            2.35                   9.67
   4   a              2.35                  12.01
   5   i              1.96                  13.97
   6   of             1.95                  15.93
   7   in             1.61                  17.54
   8   it             1.12                  18.66
   9   that           1.09                  19.75
  10   for            1.07                  20.82
  11   s              1.07                  21.89
  12   is             1.04                  22.93
  13   you            1.01                  23.95
  14   on             0.80                  24.75
  15   with           0.69                  25.44

We can see that the top 10 tokens together cover 20% of the cumulative frequency, and the top 15 tokens cover 25%. We can also see some problems that we need to solve:

  1. Do we want to use “a” and “the” in predictions?
  2. We must find and fix the leftovers of the cleaning, such as “s”, which probably comes from " ’s ".

The last table shows the number of tokens needed to reach selected levels of cumulative frequency (a sketch of this computation follows the table).

Cumulative Frequency (%)   Number of Tokens Needed
                      25                        15
                      50                       125
                      75                     1,334
                      90                     6,748
                      95                    15,760
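
A minimal sketch of how these coverage counts can be derived, reusing the cum_pct vector from the frequency sketch above:

# Number of tokens needed to reach chosen cumulative-frequency levels
coverage      <- c(25, 50, 75, 90, 95)
tokens_needed <- sapply(coverage, function(p) which(cum_pct >= p)[1])
data.frame(Cumulative_Frequency = coverage, Tokens_Needed = tokens_needed)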

We can see that the number of tokens needed to cover 95% of the cumulative frequency is comparable to the lower bound of an English speaker’s active vocabulary (15,000-20,000 words, https://www.bbc.com/news/world-44569277). If we estimate the rate of typos and other errors at 5% (1 in 20), it seems that our cleaning strategy was pretty fair.

Summary

We showed that even though the data files are huge, we are able to clean them up and to identify a strategy for reducing the vocabulary to a number of tokens comparable to independent estimates of the size of a human vocabulary.

Next steps

  1. Next round of cleaning tokens - remove the leftovers of the initial cleaning
  2. Remove stopwords
  3. Manually go through the tokens that cover 75% of the cumulative frequency and correct possible errors (it is only 1,334 tokens, so it is not that much work)
  4. Decide whether to use only the tokens that represent 90% or 95% of the cumulative frequency (or 75%) - especially with respect to the resources needed later
  5. Join some tokens where possible (e.g. has / have)
  6. Create n-grams and continue with the model development (a bigram sketch follows this list)
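
As a preview of step 6, one possible way to build bigrams directly from the cleaned token stream; the tokens vector from the tokenization sketch above is an assumption, and this naive version ignores sentence and document boundaries.

# Pair each token with its successor to form bigrams (boundary effects ignored)
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
head(sort(table(bigrams), decreasing = TRUE), 10)   # 10 most frequent bigrams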