Data Science Milestone Report

Martin Slíva

21/04/2020

Abstract

This report presents interim results from developing a text-prediction application as a task of the Data Science Capstone. It describes basic statistics of the loaded data and some interim results of data cleaning.

In the summary I describe some ideas that I am going to investigate next.

Data Source

The data was downloaded as a zip file from the course repository. The zip file contains four directories:

  • de_DE
  • en_US
  • fi_FI
  • ru_RU

For the analysis and product development I will use only the data from the en_US directory. The directory consists of three files:

## [1] en_US.blogs.txt   en_US.news.txt    en_US.twitter.txt
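
A minimal sketch of how such a listing can be produced in R; the relative path "en_US" is an assumption about where the unzipped data lives.

# List the files in the English sub-directory (path is an assumption)
list.files("en_US")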

Basic Summary

File                Size (MB)   No. of Lines   No. of Characters
en_US.news.txt         196.28      1,010,242         203,223,159
en_US.blogs.txt        200.42        899,288         206,824,505
en_US.twitter.txt      159.36      2,360,148         162,096,031
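
A minimal sketch of how these per-file statistics can be computed in base R; the file paths and the helper name file_stats are assumptions, not part of the original analysis.

# Basic statistics for one file: size, number of lines, number of characters
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    File       = basename(path),
    Size_MB    = round(file.size(path) / 1024^2, 2),
    Lines      = length(lines),
    Characters = sum(nchar(lines))
  )
}

files <- file.path("en_US", c("en_US.news.txt", "en_US.blogs.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, file_stats))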

Basic Data Cleaning

After loading some samples of the data we can see that the data are quite dirty - a lot of numbers, punctuation etc. It is almost impossible to guess the number of words, so we need to clean the data before we start the exploratory analysis. I have chosen a “brute force” strategy for the first round of data cleaning, in these steps (a sketch in R follows the list):

  1. transcode all characters to lower case
  2. remove all escaped quotation marks
  3. remove all numbers
  4. replace all non [a-z] characters with a space
  5. collapse all runs of multiple spaces into one space
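
A minimal sketch of this brute-force cleaning in base R, assuming raw_lines is a character vector with the loaded text; the function name clean_lines is an assumption.

# Brute-force cleaning: lower case, drop escaped quotes and numbers,
# keep only [a-z], collapse whitespace
clean_lines <- function(raw_lines) {
  x <- tolower(raw_lines)                     # 1. lower case
  x <- gsub('\\"', " ", x, fixed = TRUE)      # 2. remove escaped quotation marks
  x <- gsub("[0-9]+", " ", x)                 # 3. remove all numbers
  x <- gsub("[^a-z]", " ", x)                 # 4. replace non [a-z] characters by space
  x <- gsub(" +", " ", x)                     # 5. collapse multiple spaces
  trimws(x)
}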

As a result we get reasonably pre-cleaned data. Of course some problems remain, and we have also created some new mess in the data - like “ve” instead of the correct “we’ve”, or “st” instead of “1st”. But we can address those problems later while fine-tuning the models. For a low cost we got data clean enough for the exploratory analysis.

Exploratory Analysis

After the basic cleaning it is time to tokenize the text and have a look at the data. The number of tokens is a reasonably good approximation of the number of words (a tokenization sketch follows the table).

File                Tokens after cleaning   Unique tokens after cleaning
en_US.news.txt                 34,615,456                        212,552
en_US.blogs.txt                37,877,989                        253,523
en_US.twitter.txt              30,541,949                        305,383
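
Because the cleaned text contains only [a-z] characters and single spaces, tokenization reduces to splitting on spaces. A minimal sketch, reusing the hypothetical clean_lines helper and raw_lines vector from the cleaning sketch above:

# Tokenize by splitting on spaces (the text is already reduced to [a-z ] only)
tokens <- unlist(strsplit(clean_lines(raw_lines), " ", fixed = TRUE))
tokens <- tokens[tokens != ""]      # drop any empty strings

length(tokens)                      # total number of tokens
length(unique(tokens))              # number of unique tokens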

The result is surprising, especially when we compare the number of unique tokens with Shakespeare, who used about 31 thousand different words in all his writings. I did not expect such a big disproportion.

Root Cause

  1. Typos - definitely the main source of the inflated number of tokens.
  2. Plurals etc. - we distinguish between “years” and “year”, and the same holds for verb forms and other grammatical variants.
  3. Some problems are caused by the cleaning strategy itself (as described above).

Way to Solve It

Before solving the problems one by one, which can be time-consuming and ineffective, let’s first have a look at the data in a chart.

The chart below shows the frequency (blue) and cumulative frequency (red) of all tokens from all three files combined. The data are sorted by frequency in descending order.

The “saw teeth” on the frequency (blue) line in the chart indicate that more than one token shares the same frequency. For example, the last spike represents 280,627 tokens with only one occurrence; those tokens are mainly typos.

Let’s have a look at the beginning of the chart.

This chart shows that we need only 125 tokens to cover 50% of the total frequency!
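
A minimal sketch of how the frequency and cumulative frequency behind these charts can be computed, assuming tokens holds the combined tokens from all three files (as in the tokenization sketch above):

# Token frequencies, sorted from most to least frequent
freq     <- sort(table(tokens), decreasing = TRUE)
freq_pct <- 100 * freq / sum(freq)       # frequency in %
cum_pct  <- cumsum(freq_pct)             # cumulative frequency in %

sum(freq == 1)                           # tokens occurring exactly once (mostly typos)
head(data.frame(Feature    = names(freq),
                Frequency  = as.numeric(freq_pct),
                Cumulative = as.numeric(cum_pct)), 15)   # top 15 tokens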

Let’s have a look at the top 15 most frequent tokens (Frequency and Cumulative Frequency are in %):

Rank   Feature   Frequency   Cumulative Frequency
   1   the            4.63                   4.63
   2   to             2.68                   7.32
   3   and            2.35                   9.67
   4   a              2.35                  12.01
   5   i              1.96                  13.97
   6   of             1.95                  15.93
   7   in             1.61                  17.54
   8   it             1.12                  18.66
   9   that           1.09                  19.75
  10   for            1.07                  20.82
  11   s              1.07                  21.89
  12   is             1.04                  22.93
  13   you            1.01                  23.95
  14   on             0.80                  24.75
  15   with           0.69                  25.44

We can see that the top 10 tokens together cover 20% of the cumulative frequency, and the top 15 tokens cover 25%. We can also see some problems that we need to solve:

  1. Do we want to use “a” and “the” in predictions?
  2. We must find and fix the leftovers of the cleaning, such as “s”, which probably comes from " ’s ".

The last table shows the number of tokens needed to reach selected levels of cumulative frequency (a sketch of this computation follows the table).

Cumulative Frequency (%)   Number of Tokens Needed
                      25                        15
                      50                       125
                      75                     1,334
                      90                     6,748
                      95                    15,760
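
A minimal sketch of how these coverage counts can be derived, reusing the cum_pct vector from the frequency sketch above:

# Number of tokens needed to reach chosen cumulative-frequency levels
coverage      <- c(25, 50, 75, 90, 95)
tokens_needed <- sapply(coverage, function(p) which(cum_pct >= p)[1])
data.frame(Cumulative_Frequency = coverage, Tokens_Needed = tokens_needed)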

We can see that the number of tokens needed to cover 95% of the cumulative frequency is comparable to the lower bound of an English speaker’s active vocabulary (15,000-20,000 words, https://www.bbc.com/news/world-44569277). If we estimate the rate of typos and other errors at 5% (1 in 20), it seems that our cleaning strategy was pretty fair.

Summary

We showed that even though the data files are huge, we are able to clean them up and to identify a strategy for reducing the vocabulary to a number of tokens comparable to independent estimates of the size of a human vocabulary.

Next steps

  1. Next round of cleaning tokens - remove the leftovers of the initial cleaning
  2. Remove stopwords
  3. Manually go through the tokens that cover 75% of the cumulative frequency and correct possible errors (it is only 1,334 tokens, so it is not that much work)
  4. Decide whether to use only the tokens that represent 90% or 95% of the cumulative frequency (or 75%) - especially with respect to the resources needed later
  5. Join some tokens where possible (e.g. has / have)
  6. Create n-grams and continue with the model development (a bigram sketch follows this list)
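
As a preview of step 6, one possible way to build bigrams directly from the cleaned token stream; the tokens vector from the tokenization sketch above is an assumption, and this naive version ignores sentence and document boundaries.

# Pair each token with its successor to form bigrams (boundary effects ignored)
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
head(sort(table(bigrams), decreasing = TRUE), 10)   # 10 most frequent bigrams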