Introduction

This document was written as a status report to a manager. As such, it favors summary findings over implementation detail.

The data

Files in several languages from three sources were provided. However, we limit ourselves to the English files:

lines words bytes file_name
899,288 37,334,117 210,160,014 en_US.blogs.txt
1,010,242 34,365,936 205,811,889 en_US.news.txt
2,360,148 30,373,559 167,105,338 en_US.twitter.txt

Attempting a full analysis of this corpus would be ill-advised given memory and CPU constraints. To keep this exercise tractable, I sampled 33% of the lines from each source and analyzed each source separately.
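A minimal sketch of the sampling step (file names as above; the fixed seed and the exact sampling mechanics are assumptions for illustration):

```r
set.seed(1234)  # assumed seed, purely so the sample is reproducible

sample_lines <- function(path, rate = 0.33) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # keep each line with probability `rate`, i.e. roughly 33% of the file
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

blogs   <- sample_lines("en_US.blogs.txt")
news    <- sample_lines("en_US.news.txt")
twitter <- sample_lines("en_US.twitter.txt")
```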

Cleaning

There is a lot to clean in this text. Blog data can and does include HTML code that is useless to us. Twitter is a culture of hashtags and codewords due to its character limit, and some of those need to be excluded.

This is an iterative process that would be too long to document here.
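To give a flavor of the kind of rules involved, here is a minimal, illustrative sketch; the specific patterns are assumptions, not the full cleaning pipeline:

```r
library(stringr)

clean_text <- function(x) {
  x |>
    str_remove_all("<[^>]+>") |>            # strip HTML tags (mostly from blogs)
    str_remove_all("https?://\\S+") |>      # strip URLs
    str_remove_all("#\\S+|@\\S+") |>        # strip hashtags and @mentions (twitter)
    str_replace_all("[^[:alpha:]'[:space:]]", " ") |>  # keep letters, apostrophes, spaces
    str_squish() |>
    str_to_lower()
}

blogs   <- clean_text(blogs)
news    <- clean_text(news)
twitter <- clean_text(twitter)
```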

Exploratory Analysis

After cleaning the basic text, I tokenized it into three distinct datasets: one-grams, bigrams, and trigrams.
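A sketch of the tokenization using tidytext's unnest_tokens(); the tibble construction and the `source`/`term` column names are assumptions:

```r
library(dplyr)
library(tibble)
library(tidytext)

# one row of cleaned text per line, tagged with its source
corpus <- bind_rows(
  tibble(source = "blogs",   text = blogs),
  tibble(source = "news",    text = news),
  tibble(source = "twitter", text = twitter)
)

one_grams <- corpus |> unnest_tokens(term, text)                           # single words
bi_grams  <- corpus |> unnest_tokens(term, text, token = "ngrams", n = 2)  # word pairs
tri_grams <- corpus |> unnest_tokens(term, text, token = "ngrams", n = 3)  # word triples
```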

Once we have the data in this format, we can start showing the basics of what this data set contains.

Distribution of n-gram frequencies

A first approach to checking the sanity of the data sets is to look at term frequency. Languages have a small set of words that get repeated a lot and many words that are much less frequent, so we would expect the frequency distributions to have long tails. If they didn't, the data would be immediately suspect.

one-grams

bigrams

trigrams

All of these distributions look the way they should: high at the most frequent terms, with a long tail to the right.
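For reference, a sketch of how the one-gram distribution can be computed and plotted (assuming the tidy `one_grams` dataset from the tokenization step); the same pattern applies to bigrams and trigrams:

```r
library(dplyr)
library(ggplot2)

one_gram_freqs <- one_grams |>
  count(source, term, sort = TRUE)   # term frequency per source

ggplot(one_gram_freqs, aes(n)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +                  # long tails are easier to see on a log scale
  facet_wrap(~ source, scales = "free_y") +
  labs(x = "term frequency", y = "number of terms")
```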

Top 20 most frequent terms

Now we can start looking at the most frequent one-, two-, and three-word terms.
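A sketch of how the top-20 lists can be pulled out of the tidy n-gram datasets (the helper function and column names are assumptions):

```r
library(dplyr)

top_20 <- function(ngrams) {
  ngrams |>
    count(source, term, sort = TRUE) |>  # term frequency per source
    group_by(source) |>
    slice_max(n, n = 20) |>              # top 20 terms within each source
    ungroup()
}

top_20(one_grams)
top_20(bi_grams)
top_20(tri_grams)
```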

For one-grams

Love, day, people, and time stand out in all three sources of text. It is interesting, even at this early stage, how some words are more “news”y (percent, million, county), others more “blog”y (god, family, books), and others, well… yup, very twitteresque (rt, lol, im, game)…

For bigrams

When we look at bigrams, some meaning starts to emerge: names of cities and government positions in the news, salutations on Twitter, everyday interests in blogs.

For trigrams

Trigrams offer even more meaning than bigrams, and now we get a more interesting characterization of what people talk about in these three text sources.

TF-IDF: a better frequency analysis

From the previous plots you can see that plain frequency analysis of n-grams (term frequency) only gets us so far. In the book Tidy Text Mining, Silge and Robinson (2016) propose term frequency-inverse document frequency (tf-idf) as a way to summarize a document by n-gram frequency such that the most commonly used words weigh less than the less commonly used ones. This yields a ranked list of terms characteristic of each document. From Silge and Robinson (2016):

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.

The name tf-idf is confusing because it looks like a subtraction, when in fact the statistic is calculated by multiplying two quantities:

\[ tf\_idf = tf \times idf \]

Here tf is just the term frequency: the number of times a term appears divided by the total number of terms in a document (here our three “documents” are the texts extracted from blogs, Twitter, and news). And idf, the inverse document frequency, is a heuristic quantity that is useful for text mining but rests on somewhat shaky information-theory foundations. Since we are doing text mining here, we can just calculate it as:

\[ idf(term) = \ln\left(\frac{n_{documents}}{n_{documents\ containing\ term}}\right) \]
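In the tidytext framework this calculation is handled by bind_tf_idf(); a minimal sketch for one-grams, assuming the tidy n-gram counts from above:

```r
library(dplyr)
library(tidytext)

one_gram_tf_idf <- one_grams |>
  count(source, term, sort = TRUE) |>   # term counts per "document" (source)
  bind_tf_idf(term, source, n) |>       # adds tf, idf, and tf_idf columns
  arrange(desc(tf_idf))

# top 20 tf-idf terms per source
one_gram_tf_idf |>
  group_by(source) |>
  slice_max(tf_idf, n = 20) |>
  ungroup()
```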

So let's look at the most important terms for each text source, for one-, two-, and three-word grams:

For one-grams

For bigrams

For trigrams

In all three of the above cases, a very different list of the top 20 terms appears when using tf-idf compared to plain frequency analysis. The top 20 tf-idf terms are a much better way to get a gist of what is in the news, blogs, and Twitter, since tf-idf eliminates terms that appear frequently in every source document.

By the way, the tf-idf distribution has the same long tail as the term-frequency distributions. This is to be expected in any language, as far as we know, which lends support to the idea that our sources are reasonably well cleaned and actually contain mostly English text.

Conclusion

With this exploration, I believe we are ready to do some analysis.

Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” The Journal of Open Source Software 1 (3): 37. https://doi.org/10.21105/joss.00037.