Earlier in the course, we used text analytics as a predictive tool, using word frequencies as independent variables in our models. However, sometimes our goal is to understand commonly occurring topics in text data rather than to predict the value of some dependent variable. In such cases, word clouds can be a visually appealing way to display the most frequent words in a body of text.
A word cloud arranges the most common words in some text, using size to indicate the frequency of a word.
[Figure: word cloud for the complete works of Shakespeare, with English stopwords removed]
While we could generate word clouds using free generators available on the Internet, we will have more flexibility and control over the process if we do so in R. We will visualize the text of tweets about Apple, a dataset we used earlier in the course. As a reminder, this dataset (which can be downloaded from tweets.csv) has the following variables:
Tweet – the text of the tweet
Avg – the sentiment of the tweet, as assigned by users of Amazon Mechanical Turk. The score ranges on a scale from -2 to 2, where 2 means highly positive sentiment, -2 means highly negative sentiment, and 0 means neutral sentiment.
Download the dataset “tweets.csv”, and load it into a data frame called “tweets” using the read.csv() function, remembering to use stringsAsFactors=FALSE when loading the data.
# Read data
tweets = read.csv("tweets.csv", stringsAsFactors = FALSE)
Next, perform the following pre-processing tasks (like we did in Unit 5), noting that we don’t stem the words in the document or remove sparse terms:
# Load package
library(tm)
# Create a corpus using the Tweet variable
corpus = VCorpus(VectorSource(tweets$Tweet))
# Convert the corpus to lowercase
corpus = tm_map(corpus, content_transformer(tolower))
# Remove punctuation from the corpus
corpus = tm_map(corpus, removePunctuation)
# Remove all English-language stopwords
corpus = tm_map(corpus, removeWords, stopwords("english"))
# Build a document-term matrix out of the corpus
dtm = DocumentTermMatrix(corpus)
# Convert the document-term matrix to a data frame called allTweets
allTweets = as.data.frame(as.matrix(dtm))
# Unique words
dtm
## <<DocumentTermMatrix (documents: 1181, terms: 3780)>>
## Non-/sparse entries: 10273/4453907
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
We do not stem the words because the word cloud is easier to read and understand if it shows full words instead of just word stems.
As we can read from ?wordcloud, we will need to provide the function with a vector of words and a vector of word frequencies.
library(wordcloud)
Each tweet represents a row of allTweets, and each word represents a column. We need the names of all the columns of allTweets, which is returned by colnames(allTweets). While str(allTweets) displays the names of the variables along with other information, it doesn’t return a vector that we can use as the first argument to wordcloud().
For the same reason, the frequency of each word is the sum of its column in allTweets, which is returned by colSums(allTweets).
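To make the two vectors concrete, here is a minimal sketch on a made-up data frame with the same layout as allTweets (rows are documents, columns are words). The name toy and its counts are assumptions for illustration only and are not taken from tweets.csv.

```r
# Toy stand-in for allTweets: 3 documents (rows) by 3 words (columns).
# The words and counts are invented for illustration.
toy = data.frame(
  iphone = c(1, 0, 2),
  itunes = c(0, 1, 0),
  new    = c(1, 1, 1)
)

words = colnames(toy)  # character vector of words, one per column
freqs = colSums(toy)   # named numeric vector: total count of each word

print(words)
print(freqs)
```

The two vectors have the same length and the same order, which is what wordcloud() expects for its first two arguments.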
Because we are plotting a large number of words, you might get warnings that some of the words could not be fit on the page and were therefore not plotted – this is especially likely if you are using a smaller screen. You can address these warnings by plotting the words smaller. From ?wordcloud, we can see that the “scale” parameter controls the sizes of the plotted words. By default, the sizes range from 4 for the most frequent words to 0.5 for the least frequent, as denoted by the parameter “scale=c(4, 0.5)”. We could obtain a much smaller plot with, for instance, parameter “scale=c(2, 0.25)”.
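As a rough sketch of the scale parameter in use, the call below plots a handful of made-up words at the smaller size range. The words and frequencies are assumptions for illustration, and the plot is only drawn if the wordcloud package is installed.

```r
# Made-up words and frequencies, for illustration only.
words = c("iphone", "itunes", "new", "phone", "ipad")
freqs = c(50, 30, 12, 8, 3)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  library(wordcloud)
  # scale = c(largest size, smallest size); the default is c(4, 0.5).
  wordcloud(words, freqs, scale = c(2, 0.25), min.freq = 1)
}
```

Smaller values of scale shrink every word, so more of them fit on the page.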
# Build wordcloud
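wordcloud(colnames(allTweets), colSums(allTweets))
The most prominent word in the resulting cloud is “apple”. Since all of the tweets are about Apple, we repeat the pre-processing with “apple” removed along with the stopwords: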
# Create a corpus using the Tweet variable
corpus = VCorpus(VectorSource(tweets$Tweet))
# Convert the corpus to lowercase
corpus = tm_map(corpus, content_transformer(tolower))
# Remove punctuation from the corpus
corpus = tm_map(corpus, removePunctuation)
# Remove "apple" and all English-language stopwords
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
# Build a document-term matrix out of the corpus
dtm = DocumentTermMatrix(corpus)
# Convert the document-term matrix to a data frame called allTweets
allTweets = as.data.frame(as.matrix(dtm))
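# Build a smaller word cloud from the corpus without "apple"
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25))
With “apple” removed, the most prominent word is “iphone”.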
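Rather than reading the largest word off the plot, one way to check the most frequent words is to sort the column sums. This is a minimal sketch on a made-up data frame; toy and its counts are assumptions, not values from tweets.csv.

```r
# Toy stand-in for allTweets; the counts are invented for illustration.
toy = data.frame(
  iphone = c(1, 0, 2),
  itunes = c(0, 1, 0),
  new    = c(1, 1, 0)
)

freqs = colSums(toy)
top = sort(freqs, decreasing = TRUE)  # most frequent word first

print(head(top, 2))
```

Applied to the real data, head(sort(colSums(allTweets), decreasing = TRUE)) would list the words that appear largest in the cloud.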
So far, the word clouds we’ve built have not been too visually appealing – they are crowded by having too many words displayed, and they don’t take advantage of color. One important step to building visually appealing visualizations is to experiment with the parameters available, which in this case can be viewed by typing ?wordcloud in your R console. In this problem, you should look through the help page and experiment with different parameters to answer the questions.
Below are four word clouds, each of which uses different parameter settings in the call to the wordcloud() function:
# Word Cloud A
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25), rot.per=0.5)
# Word Cloud B
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25), min.freq=10, random.order=FALSE)
# Word Cloud C
negativeTweets = subset(allTweets, tweets$Avg <= -1)
wordcloud(colnames(negativeTweets), colSums(negativeTweets))
# Word Cloud D
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25), min.freq=10, random.order=FALSE, random.color=TRUE, colors=brewer.pal(9,"Purples")[5:9])
Word Cloud C
Word Cloud A
Word Cloud B and Word Cloud D
Word Cloud A
Word Cloud D
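The row filtering behind the negative-tweet word cloud relies on allTweets and tweets having one row per tweet in the same order, so a condition on tweets$Avg can select rows of allTweets. A minimal sketch with made-up data follows; toy_tweets and toy_counts are hypothetical names used only here.

```r
# Hypothetical sentiment scores and word counts for 4 tweets (same row order).
toy_tweets = data.frame(Avg = c(-1.5, 0.5, -1, 2))
toy_counts = data.frame(
  hate = c(1, 0, 1, 0),
  love = c(0, 1, 0, 2)
)

# Keep only the rows whose sentiment is -1 or lower.
neg = subset(toy_counts, toy_tweets$Avg <= -1)

print(nrow(neg))     # number of negative tweets kept
print(colSums(neg))  # word frequencies within the negative tweets only
```

Words that never occur in the selected rows still appear as columns with a sum of 0, and wordcloud() leaves them out because they fall below min.freq.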
The use of a palette of colors can often improve the overall effect of a visualization. We can easily select our own colors when plotting; for instance, we could pass c("red", "green", "blue") as the colors parameter to wordcloud(). The RColorBrewer package, which is based on the ColorBrewer project (colorbrewer.org), provides pre-selected palettes that can lead to more visually appealing images. Though these palettes are designed specifically for coloring maps, we can also use them in our word clouds and other visualizations.
Begin by installing and loading the “RColorBrewer” package. This package may have already been installed and loaded when you installed and loaded the “wordcloud” package, in which case you don’t need to go through this additional installation step. If you obtain errors (for instance, “Error: lazy-load database ‘P’ is corrupt”) after installing and loading the RColorBrewer package and running some of the commands, try closing and re-opening R.
The function brewer.pal() returns color palettes from the ColorBrewer project when provided with appropriate parameters, and the function display.brewer.all() displays the palettes we can choose from.
# Display the palettes
library(RColorBrewer)
display.brewer.all()
YlOrRd is a “sequential palette,” with earlier colors being lighter and later colors being darker. Therefore, it is a good palette choice for indicating low-frequency vs. high-frequency words.
Palette “Greys” is the only one completely in grayscale.
# Color wordcloud
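wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25), min.freq=10, colors=brewer.pal(9,"Blues")[-1:-4])
Both brewer.pal(9, "Blues")[-1:-4] and brewer.pal(9, "Blues")[5:9] remove the first four (lightest) colors of the 9-color palette and keep the remaining five.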
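The equivalence of the two indexing forms can be checked on any vector of length 9. The sketch below uses a made-up character vector as a stand-in for the palette, so it runs without RColorBrewer; palette9 is a hypothetical name.

```r
# Stand-in for a 9-color palette, ordered from lightest to darkest.
palette9 = c("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9")

dropped = palette9[-1:-4]  # negative indices remove elements 1 through 4
kept    = palette9[5:9]    # positive indices keep elements 5 through 9

print(dropped)
print(identical(dropped, kept))
```

In R, -1:-4 is parsed as (-1):(-4), that is, the vector -1, -2, -3, -4, which is why it drops exactly the first four elements.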