This document was written as a status report to a manager. As such
Files in several languages from three sources were provided:
Sources: news, blogs, and Twitter
Languages: German, Finnish, and Russian, in addition to English
However, we limit ourselves to the English files:
| lines | words | bytes | file_name |
|---|---|---|---|
| 899,288 | 37,334,117 | 210,160,014 | en_US.blogs.txt |
| 1,010,242 | 34,365,936 | 205,811,889 | en_US.news.txt |
| 2,360,148 | 30,373,559 | 167,105,338 | en_US.twitter.txt |
Attempting a full analysis of this corpus would be ill-advised given memory and CPU constraints. Instead, I sampled 33% of each source and analyzed the three samples separately for this exercise.
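As a rough illustration of the sampling step, here is a minimal Python sketch. It is not the code used for this report: the function name `sample_lines`, the seed, and the toy input are assumptions made for the example, and each line is kept independently with probability 0.33, so the sample size is only approximately 33%.

```python
import random

def sample_lines(lines, fraction=0.33, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Toy stand-in for one source file. In practice the lines would be read
# from a file such as en_US.blogs.txt, one document line per element.
toy_lines = [f"line {i}" for i in range(10_000)]
sample = sample_lines(toy_lines)

# The share kept is close to, but not exactly, 0.33.
print(len(sample) / len(toy_lines))
```

Sampling line by line keeps memory use low, since the full file never has to be tokenized.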
There is a lot to clean in the text. Blog text can and does include HTML code that is useless to us. Twitter has a culture of hashtags and codewords driven by its character limit, and some of those need to be excluded.
This is an iterative process that would be too long to document here.
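To give a flavor of what one cleaning pass might look like, here is a small Python sketch. The rules below (dropping HTML tags, URLs, hashtags, and @mentions, then lowercasing) are assumptions chosen for illustration; the actual rules used for this report were refined iteratively and are not reproduced here.

```python
import re

def clean_text(text):
    """Apply a few illustrative cleaning rules to one line of text."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[#@]\w+", " ", text)       # drop hashtags and @mentions
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)     # keep letters, apostrophes, spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_text("<p>Loving the <b>game</b>!</p> #winning http://t.co/xyz"))
# prints: loving the game
```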
After cleaning the basic text, I parsed it into three distinct datasets:
tok_corpora: one token per row, where each token is a single word
tok_corpora_bigram: same as above, with two consecutive words per token
tok_corpora_trigram: same, with three consecutive words per token
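The three datasets above can be sketched in Python as follows. This is only an illustration of the one-token-per-row idea: the helper `ngrams` and the toy input are assumptions, and the rows are plain `(source, token)` tuples rather than the data frames used in the analysis.

```python
def ngrams(text, n):
    """Return all runs of n consecutive words in `text`."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Toy cleaned lines, keyed by source.
lines = {"twitter": "thanks for the follow", "news": "the county board voted"}

# One row per token: (source, token).
tok_corpora = [(src, t) for src, line in lines.items() for t in ngrams(line, 1)]
tok_corpora_bigram = [(src, t) for src, line in lines.items() for t in ngrams(line, 2)]
tok_corpora_trigram = [(src, t) for src, line in lines.items() for t in ngrams(line, 3)]

print(tok_corpora_bigram[:3])
# prints: [('twitter', 'thanks for'), ('twitter', 'for the'), ('twitter', 'the follow')]
```

A line with fewer than n words simply contributes no n-grams.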
Once we have this kind of data format, we can start showing the basics of what this data set contains.
A first approach to checking the sanity of the datasets is to look at term frequency. Languages have a small set of words that are repeated a lot and a large number of words that are much less frequent. We would thus expect the frequency distributions to have long tails. If they didn't, the data would be immediately suspect.
All of these distributions look like they should: high at the most frequent terms, with a long tail to the right.
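The same sanity check can be sketched numerically. The Python snippet below is an illustration on a toy token list, not the report's data: it counts terms, ranks them by frequency, and reports what share of distinct terms occur only once, which is one simple sign of a long tail.

```python
from collections import Counter

# Toy token list standing in for one row-per-token dataset.
tokens = "the cat and the dog and the bird saw the fox".split()

counts = Counter(tokens)
total = sum(counts.values())

# Rank terms by count and attach the term frequency (count / total).
ranked = [(term, n, n / total) for term, n in counts.most_common()]
print(ranked[0])  # the most frequent term with its count and frequency

# Share of distinct terms that appear exactly once.
singletons = sum(1 for n in counts.values() if n == 1)
print(singletons / len(counts))
```

On real text, a handful of terms at the top of `ranked` should account for a large share of all tokens, while most distinct terms appear only a few times.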
Now we can start looking at the most frequent one-, two-, and three-word terms:
Love, day, people, time stand out in all three sources of text. It is interesting even at this early stage how some words are more “news”y (percent, million, county), others more “blog”y (god, family, books), others well… yup, very twitteresque (rt, lol, im, game)…
When we do it by bigrams, some meaning starts to emerge: names of cities and government positions in the news, salutations on Twitter, everyday interests in blogs.
Trigrams offer still more meaning than bigrams, and now it's a more interesting characterization of what people talk about in these three text sources.
From the previous plots you can see that plain frequency analysis of n-grams (or term frequency) only gets us so far. In the book Tidy Text Mining (Silge and Robinson 2016), the authors propose term frequency-inverse document frequency (TF-IDF) as a way to summarize a document by n-gram frequency such that the most commonly used words weigh less than the less commonly used ones. This yields a ranked list of terms characteristic of each document. From Silge and Robinson (2016):
The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
The term tf-idf is confusing because it looks like a subtraction when in fact the statistic is calculated by multiplying two quantities:
\[ tf\_idf = tf \times idf \]
Here tf is just term frequency: the number of times a term appears in a document divided by the total number of terms in that document (our three “documents” are the texts extracted from blogs, Twitter, and news). The idf is regarded as a heuristic quantity, useful for text mining but with somewhat shaky information-theoretic foundations. Since we are doing text mining here, we can simply calculate it as follows:
\[ idf(term) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right) \]
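The two formulas can be put together in a few lines of Python. This is a minimal sketch on toy data, not the code behind the plots: the function name `tf_idf` and the three tiny "documents" are assumptions for illustration. Note that with only three documents, idf can take just three values: ln(3/1), ln(3/2), and ln(3/3) = 0.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a document name to its list of terms.

    Returns {document: {term: tf * idf}} using
    tf = count / terms in document and idf = ln(n_docs / docs containing term).
    """
    n_docs = len(docs)
    counts = {name: Counter(terms) for name, terms in docs.items()}
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for c in counts.values() for term in c)
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
    return scores

# Toy documents, one per source.
docs = {
    "news": "percent county love time".split(),
    "blogs": "god family love time".split(),
    "twitter": "rt lol love game".split(),
}
scores = tf_idf(docs)

print(scores["news"]["love"])     # 0.0, since "love" appears in all three documents
print(scores["news"]["percent"])  # 0.25 * ln(3), since "percent" appears only in news
```

A term that appears in every document scores zero no matter how frequent it is, which is why the ranking by tf-idf differs so much from the ranking by raw frequency.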
So let's look at the most important terms for each text source, as one-, two-, and three-word n-grams:
In all three of the above cases, a very different list of top 20 terms appears when using tf-idf compared to plain frequency analysis. The top 20 tf-idf terms are a much better way to get the gist of what is in the news, blogs, and Twitter, because tf-idf down-weights terms that are common to all sources: a term that appears in all three documents gets idf = ln(3/3) = 0, however frequent it is.
By the way, the tf-idf distribution has the same long tail as the term frequency distributions. This is to be expected in any language, as far as we know, which lends support to the idea that our sources are reasonably well cleaned and contain mostly English text.
With this exploration, I believe we are ready to do some analysis.