A word cloud (or tag cloud) is a visual representation of text data: by making the most prominent terms of a large body of text immediately visible, it captures the overall feel of the subject matter at a glance. I'll demonstrate the technique by forming a word cloud of the most viral articles (September 2016) from the Swedish newspaper Aftonbladet.
The first step in forming a word cloud is to acquire the source text. In the case of Aftonbladet no web API is provided, so I ended up coding a "shaky" web scraper for the task. One might argue that for a one-time data collection a simple web browser and copy/paste strategy would have been sufficient; however, a web scraper provides at least some measure of repeatability (imperative in the field of data science) for the project. The scraper itself is a Python implementation, with the source code available at my GitHub page. The harvested data is stored locally under the folder "data/raw". As usual, encoding posed a minor challenge for the import: it is vital to keep everything encoded as UTF-8 when dealing with Scandinavian character sets.
I created a text corpus for the text mining using the procedure below. In order to read the Scandinavian letters properly, it is important (again) to set the encoding to UTF-8 in DirSource.
library(tm)

# Build a VCorpus from all files in pDirectory; UTF-8 encoding keeps the
# Scandinavian characters intact.
createCorpus <- function(pDirectory, pLanguage) {
  ptm <- proc.time()
  corpus <- VCorpus(DirSource(pDirectory, encoding = "utf-8"),
                    readerControl = list(language = pLanguage))
  print(proc.time() - ptm)  # report how long the import took
  return(corpus)
}
corpus.raw <- createCorpus("data/raw", "se")
The corpus is cleaned with common text-mining procedures. Note that I have extended the built-in stopwords("swedish") with my own list stopword_extended, to which I have added custom stop words, e.g. names of reporters that would otherwise have appeared in the output; this kind of iteration is often required when working with word clouds.
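The custom list itself is not reproduced in this post; a minimal sketch of how stopword_extended could be assembled is shown below, where the reporter names are placeholders rather than the actual entries.

library(tm)

# Sketch only: extend tm's built-in Swedish stop word list with custom terms.
# The names below are placeholders for the reporters filtered out in the post.
stopword_extended <- c(stopwords("swedish"),
                       "reporter_name_1", "reporter_name_2")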
processCorpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))    # lower-case all text
  corpus <- tm_map(corpus, removePunctuation)               # strip punctuation
  corpus <- tm_map(corpus, removeNumbers)                   # strip digits
  corpus <- tm_map(corpus, removeWords, stopword_extended)  # drop the extended stop word list
  corpus <- tm_map(corpus, stripWhitespace)                 # collapse repeated whitespace
  corpus <- tm_map(corpus, PlainTextDocument)               # convert back to plain text documents
  return(corpus)
}
corpus.processed <- processCorpus(corpus.raw)
For an accurate word cloud representation it is imperative to reduce the words to their root form, a process known as word stemming. However, we must also complete the stemmed words back to their original form, otherwise we end up with a word cloud written in a vocabulary only a two-year-old might use.
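The stemming call itself is not shown in this post. A minimal sketch using tm's stemDocument (backed by SnowballC's Swedish stemmer) could look as follows; the variable names are mine, and an unstemmed copy of the corpus is kept to serve as the completion dictionary later.

library(SnowballC)

# Keep an unstemmed copy to use as the dictionary for stem completion below.
corpus.dictionary <- corpus.processed

# Reduce every word to its Swedish stem.
corpus.stemmed <- tm_map(corpus.processed, stemDocument, language = "swedish")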
Unfortunately the stemCompletion function is inherently broken in R's tm package. I found a workaround to get things working again, for which the credit goes to stackoverflow.com.
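The exact workaround is not reproduced here; the sketch below follows the commonly cited stackoverflow pattern of completing each document word by word against an unstemmed dictionary corpus and wrapping the result back into a PlainTextDocument. Treat it as an illustration rather than my exact code.

# Workaround sketch: complete each stemmed word against the dictionary corpus.
stemCompletionDoc <- function(x, dictionary) {
  words <- unlist(strsplit(as.character(x), " "))
  words <- words[words != ""]
  PlainTextDocument(stripWhitespace(
    paste(stemCompletion(words, dictionary = dictionary), collapse = " ")))
}

corpus.completed <- tm_map(corpus.stemmed, stemCompletionDoc,
                           dictionary = corpus.dictionary)

The cloud itself can then be rendered from a term-document matrix. Again a sketch, assuming the wordcloud and RColorBrewer packages; the word limit and colour palette are arbitrary choices.

library(wordcloud)
library(RColorBrewer)

# Count word frequencies across the completed corpus and plot the cloud.
tdm <- TermDocumentMatrix(corpus.completed)
frequencies <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(names(frequencies), frequencies, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))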
When the code is run, the following word cloud appears.
And hence the title of this blog post: Polis on fire. The viral articles contain words related to authorities and accidents, just as sensational journalism tends to. I am not going to analyze the implications of these results further, as the scope of this article was to demonstrate how to produce a word cloud from an online source. All the code is available at my GitHub page.