2. Exploratory data analysis

For this project, we’re given several data sets, each consisting of multiple lines of text written in English (German, Russian and Finnish versions are also available), and we analyze those texts using R. The data sets can be downloaded from here.

2.1 Loading and looking at the data

We’ll focus on the English data, available in the final/en_US/ directory, which contains three text files:

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

Each file contains many lines of text taken from blogs, news articles or tweets, which can be analyzed to see which words tend to follow which others in natural language. The Appendix shows which packages we’ll be using for our analysis.

The following R packages are used for this analysis: dplyr, tidyr, ggplot2, LaF, tokenizers, stringr, stringi, quanteda, data.table, caret and, of course, knitr.

First, let’s load the data and build a summary table of it.

Here is a quick look at the result:

Statistic             Twitter        Blogs          News
Lines                 2,360,148      899,288        77,259
Characters            162,096,241    206,824,382    15,639,408
Words                 30,451,170     37,570,839     2,651,432
Min words per line    1              0              1
Mean words per line   13             42             35
Max words per line    47             6,726          1,123
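The code that produced this table is not shown here, so the following is only a minimal sketch of how such a summary could be computed. It uses base R only and assumes that a "word" is any run of characters separated by whitespace, which may differ from the counting rule used for the real table. The function name summarise_lines and the toy texts are made up for illustration.

```r
# Sketch: per-source summary statistics for a character vector of lines.
# Assumption: words are whitespace-separated runs of characters.
summarise_lines <- function(lines) {
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  c(
    lines      = length(lines),
    characters = sum(nchar(lines)),
    words      = sum(words_per_line),
    min_words  = min(words_per_line),
    mean_words = round(mean(words_per_line)),
    max_words  = max(words_per_line)
  )
}

# Toy stand-ins for the three files (not the real data)
toy <- list(
  Twitter = c("so excited for today", "lol"),
  Blogs   = c("", "I wrote a long post about my garden this week"),
  News    = c("The council met on Tuesday", "Rain is expected")
)

# One column per source, one row per statistic
summary_table <- sapply(toy, summarise_lines)
print(summary_table)
```

With the real files, each element of the list would instead hold the lines read from the corresponding text file, for example with readLines(). An empty line counts as zero words under this rule, which is one way a minimum of 0 can appear.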

As you can see, these are really big data sets, weighing in at over 550 MB between them. Because my computer has limited processing capacity, and given the size of the data sets, we’ll work with random samples drawn from them. The sampling uses the sample_lines() function from the LaF package. We’ll be working with samples of 10% of the total lines per data set. From this 10%, 20% will be for our training set and 5% for our testing set. Although my computer can handle a bigger data set for exploratory data analysis, it doesn’t have enough memory when it comes to fitting models and making predictions.
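As a rough illustration of the sampling step, here is a sketch that draws a random sample of lines and splits it into disjoint training and testing sets. It uses base R's sample.int() on lines already in memory rather than LaF's sample_lines(), which samples lines straight from a file. The function name split_sample, the default fractions and the seed are all assumptions chosen for the example.

```r
# Sketch: draw disjoint training and testing samples from a vector of lines.
# Assumptions: fractions are relative to the number of lines passed in,
# and a fixed seed is used so the split is reproducible.
split_sample <- function(lines, train_frac = 0.20, test_frac = 0.05, seed = 1234) {
  stopifnot(train_frac + test_frac <= 1)
  set.seed(seed)
  n       <- length(lines)
  n_train <- floor(n * train_frac)
  n_test  <- floor(n * test_frac)
  # Sample without replacement, then cut the indices into two parts
  idx <- sample.int(n, n_train + n_test)
  list(
    train = lines[idx[seq_len(n_train)]],
    test  = lines[idx[n_train + seq_len(n_test)]]
  )
}

# Toy data: 1000 distinct lines
toy_lines <- paste("line", 1:1000)
parts <- split_sample(toy_lines)
length(parts$train)
length(parts$test)
```

Because both sets come from a single draw without replacement, no line can end up in both the training and the testing set.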

2.2 Cleaning and partitioning our data sets

So, we now have samples 25% the size of our original data sets; given how large those are, that should suffice (for my computer’s sake). The next step is to extract the relevant data from the samples. That is, we’re not interested in the whole stories told in blogs, news or tweets, but in the frequency of individual words and phrases, and in which words follow which.

To do this, we’ll use the quanteda package, together with base R and regular expressions, to extract sets of words (tokens) and ngrams (sequences of tokens) from the texts. I’ve written a simple function that extracts tokens and ngrams of any number n of words. This function can then be applied to our data samples to get the most common tokens (words) and ngrams (sequences of tokens).
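The extraction function itself is not reproduced here. The sketch below shows the general idea in base R only, as a self-contained stand-in for the quanteda-based version (quanteda provides tokens() and tokens_ngrams() for the same purpose). The function names and the cleaning rule (lowercase, keep only letters and apostrophes) are assumptions made for the example.

```r
# Sketch: tokenize lines and count n-grams with base R.
# Assumption: tokens are lowercase runs of letters/apostrophes.
tokenize_line <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z' ]+", " ", line)          # drop digits, punctuation, symbols
  toks <- strsplit(trimws(line), "\\s+")[[1]]
  toks[nzchar(toks)]
}

# Build all n-grams (as space-separated strings) from one token vector
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over many lines; n-grams never span two lines
count_ngrams <- function(lines, n = 1) {
  grams <- unlist(lapply(lines, function(l) make_ngrams(tokenize_line(l), n)))
  sort(table(as.character(grams)), decreasing = TRUE)
}

toy_lines <- c("The cat sat on the mat.", "The cat ran!")
unigrams <- count_ngrams(toy_lines, n = 1)
bigrams  <- count_ngrams(toy_lines, n = 2)
print(head(unigrams, 3))
print(head(bigrams, 3))
```

Setting n = 1 gives plain token counts, so the same function covers both the single-word and the multi-word case.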

For this example, we’ll look at the most common tokens both with all words included and with so-called stopwords excluded (words that are very common but carry little meaning in an overall analysis, such as “the” or “is”). Luckily, the quanteda package exports a stopwords() function, which provides a list of 175 common English words that can be excluded from the analysis. We’ll also look at the most common 2-token, 3-token and 4-token ngrams.
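To make the stopword step concrete, here is a small base R sketch. It uses a tiny hand-written stopword vector in place of the list returned by quanteda's stopwords(), and it assumes that stopwords are dropped before the ngrams are formed, which is only one possible ordering. The names toy_stopwords and drop_stopwords are made up for the example.

```r
# Sketch: remove stopwords from a token vector, then form 2-token ngrams.
# Assumption: a toy stopword list stands in for quanteda's stopwords().
toy_stopwords <- c("the", "is", "on", "a", "of", "and")

drop_stopwords <- function(tokens, stopwords) {
  tokens[!tokens %in% stopwords]
}

tokens <- c("the", "cat", "is", "on", "the", "mat")
content_tokens <- drop_stopwords(tokens, toy_stopwords)

# Pair each remaining token with its successor
bigrams_nsw <- paste(head(content_tokens, -1), tail(content_tokens, -1))
print(content_tokens)
print(bigrams_nsw)
```

Note that removing stopwords first creates pairs of words that were not adjacent in the original text, so the resulting ngrams describe content-word sequences rather than literal phrases.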

We’ll plot the results for the training data set.

2.3 Plotting our results

The following tabs show the occurrence of each token or ngram for Twitter, blogs and news. For quick reference:

1T. Most common tokens
1T-nsw. Most common tokens, without stopwords
2T. Most common 2-token ngrams
2T-nsw. Most common 2-token ngrams, without stopwords
3T. Most common 3-token ngrams
4T. Most common 4-token ngrams

[Figures, one tab per item in the list above (1T, 1T-nsw, 2T, 2T-nsw, 3T, 4T): most common tokens or ngrams, with separate panels for Twitter, Blogs and News.]
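For readers who want to reproduce plots of this kind, the following is a minimal sketch of a "most common tokens" bar chart. It uses base graphics rather than ggplot2 so that it runs without extra packages, and the counts are toy values, not results from the data sets. The function names top_counts and plot_top are assumptions for the example.

```r
# Sketch: pick the n most frequent items and draw a horizontal bar chart.
# Assumption: `counts` is a named numeric vector of occurrences.
top_counts <- function(counts, n = 10) {
  head(sort(counts, decreasing = TRUE), n)
}

plot_top <- function(counts, n = 10, title = "Most common tokens") {
  top <- top_counts(counts, n)
  # rev() puts the most frequent item at the top of the horizontal chart
  barplot(rev(top), horiz = TRUE, las = 1, main = title, xlab = "Occurrences")
}

# Toy counts (made up for illustration)
toy_counts <- c(the = 50, of = 20, and = 10, cat = 8, mat = 6)
top3 <- top_counts(toy_counts, 3)
print(top3)

# Only draw when run interactively, so scripts do not create plot files
if (interactive()) plot_top(toy_counts, n = 3, title = "1T (toy data)")
```

The same function can be called once per source and per ngram size to produce one panel for each combination.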

2.4 Word Coverage

Now we want to see how many distinct words account for a given percentage of all the words used. For this, we’ll create some basic plots to help illustrate it. We’ll notice that a relatively small number of words covers 50% of all word occurrences; coverage keeps climbing quickly up to almost 80% of total word usage, after which each additional word adds less and less. For quick reference:

1T. Most common tokens
1T-nsw. Most common tokens, without stopwords
2T. Most common 2-token ngrams
2T-nsw. Most common 2-token ngrams, without stopwords
3T. Most common 3-token ngrams
4T. Most common 4-token ngrams

[Figures, one tab per item in the list above (1T, 1T-nsw, 2T, 2T-nsw, 3T, 4T): cumulative distribution of tokens or ngrams, with separate panels for Twitter, Blogs and News.]
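The coverage curves in this section boil down to a cumulative sum over the sorted frequency counts. Here is a minimal base R sketch of that calculation, using toy counts rather than the real frequencies. The function names coverage_curve and types_needed are made up for the example.

```r
# Sketch: cumulative share of all occurrences covered by the k most
# frequent items, and the smallest k that reaches a target coverage.
coverage_curve <- function(counts) {
  counts <- sort(counts, decreasing = TRUE)
  cumsum(counts) / sum(counts)
}

types_needed <- function(counts, target) {
  unname(which(coverage_curve(counts) >= target)[1])
}

# Toy counts (made up; they sum to 100 for easy reading)
toy_counts <- c(the = 50, of = 20, and = 10, cat = 8,
                mat = 6, ran = 3, sat = 2, zebra = 1)

curve  <- coverage_curve(toy_counts)
needed <- c(half = types_needed(toy_counts, 0.50),
            ninety = types_needed(toy_counts, 0.90))
print(round(curve, 2))
print(needed)
```

Plotting the curve against the rank of each item gives the cumulative distribution; a steep start followed by a long flat tail is the pattern described in the text.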

2.5 Interpreting the cumulative frequency plots

What we notice in the cumulative frequency plots is very straightforward. With single words, we reach 90% of all word occurrences with a relatively small number of unique words. In this sense, predicting the first word to be typed should be easy (especially if we restrict ourselves to analyzing only the first word typed in every sentence in our data sets). However, ngrams are ordered sequences of n words, so the number of possible combinations grows quickly with n, and covering even 50% of the ngram occurrences takes us into the hundreds of thousands of distinct ngrams.

So, 4-token ngrams may be much better for predicting the fourth word typed given the previous three. However, the amount of data this requires is a clear drawback that we have to weigh.
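To illustrate the trade-off, here is a hypothetical sketch of how a table of 4-token ngram counts could be used to suggest the fourth word. It is not the model built in this report, only a plain lookup of the most frequent continuation, and the counts are invented for the example. The function name predict_next is an assumption.

```r
# Sketch: suggest the next word by finding the most frequent ngram that
# starts with the given context words.
# Assumption: `ngram_counts` is a named numeric vector whose names are
# space-separated ngrams.
predict_next <- function(ngram_counts, context) {
  prefix <- paste0(paste(context, collapse = " "), " ")
  hits <- ngram_counts[startsWith(names(ngram_counts), prefix)]
  if (length(hits) == 0) return(NA_character_)   # context never seen
  best <- names(hits)[which.max(hits)]
  substring(best, nchar(prefix) + 1)             # keep only the last word
}

# Invented toy counts
four_grams <- c("thanks for the follow" = 12,
                "thanks for the rt"     = 7,
                "at the end of"         = 9)

seen   <- predict_next(four_grams, c("thanks", "for", "the"))
unseen <- predict_next(four_grams, c("once", "upon", "a"))
print(seen)
print(unseen)
```

The second call shows the drawback: any three-word context that never occurred in the sample yields no suggestion at all, and the longer the ngram, the more often that happens unless the sample is very large.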

Milestone Report

Diógenes Cruz Figueroa García

2020-04-14
