MITx: 15.071x The Analytics Edge - VISUALIZING TEXT DATA USING WORD CLOUDS

INTRODUCTION

Earlier in the course, we used text analytics as a predictive tool, using word frequencies as independent variables in our models. However, sometimes our goal is to understand commonly occurring topics in text data instead of to predict the value of some dependent variable. In such cases, word clouds can be a visually appealing way to display the most frequent words in a body of text.

While we could generate word clouds using free generators available on the Internet, we will have more flexibility and control over the process if we do so in R. We will visualize the text of tweets about Apple. The data set has the following variables:

Tweet – the text of the tweet

Avg – the sentiment of the tweet, as assigned by users of Amazon Mechanical Turk. The score ranges from -2 to 2, where 2 means highly positive sentiment, -2 means highly negative sentiment, and 0 means neutral sentiment.

PREPARING THE DATA

# Load the data set (keep the tweet text as character strings, not factors)
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
library(tm)
# Create corpus
corpus = Corpus(VectorSource(tweets$Tweet))
# Convert to lower-case (content_transformer keeps the documents in the form tm expects)
corpus = tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus = tm_map(corpus, removePunctuation)
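# Remove English stop words and the word "apple", which is common to tweets about Apple
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
# Create the document-term matrix (one row per tweet, one column per word)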
frequencies = DocumentTermMatrix(corpus)
# Convert to a data frame
allTweets = as.data.frame(as.matrix(frequencies))
# How many unique words are there across all the documents?
ncol(allTweets)

## [1] 3779
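
To make the preparation steps more concrete, the following base-R sketch builds a tiny document-term matrix by hand. The three tweets and the four-word stop list are made up for illustration, and the code is not how tm implements DocumentTermMatrix; it only mirrors the idea of lower-casing, stripping punctuation, dropping stop words, and counting words per document.

```r
# Toy tweets (made up for illustration; not taken from tweets.csv)
toy <- c("I love my new phone!",
         "The phone battery is bad",
         "Love the new screen")

# Lower-case, strip punctuation, and split on whitespace
clean  <- gsub("[[:punct:]]", "", tolower(toy))
tokens <- strsplit(clean, "\\s+")

# Drop a small hand-picked stop word list (hypothetical; tm's list is much longer)
stops  <- c("i", "my", "the", "is")
tokens <- lapply(tokens, function(w) w[!(w %in% stops)])

# Build a documents-by-terms count matrix over the shared vocabulary
vocab     <- sort(unique(unlist(tokens)))
toyMatrix <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))

toyMatrix
ncol(toyMatrix)     # number of unique words
colSums(toyMatrix)  # frequency of each word across all toy tweets
```

As with allTweets, each row is one document and each column is one word, so ncol counts the vocabulary and colSums gives the totals that a word cloud would plot.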

BUILDING A WORD CLOUD

Because we are plotting a large number of words, you might get warnings that some of the words could not be fit on the page and were therefore not plotted – this is especially likely if you are using a smaller screen. You can address these warnings by plotting the words smaller. From ?wordcloud, we can see that the “scale” parameter controls the sizes of the plotted words. By default, the sizes range from 4 for the most frequent words to 0.5 for the least frequent, as denoted by the parameter “scale=c(4, 0.5)”. We could obtain a much smaller plot with, for instance, parameter “scale=c(2, 0.25)”.

library(wordcloud)

## Loading required package: Rcpp
## Loading required package: RColorBrewer

words <- colnames(allTweets)
# obtain the frequency of each word across all tweets
freq <- colSums(allTweets)
wordcloud(words, freq, scale = c(4, 0.5))

[Figure: word cloud of all tweets]
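
If the call to wordcloud produces warnings about words that could not be fit on the page, the smaller scale mentioned earlier can be tried. The sketch below uses made-up words and frequencies (toyWords and toyFreq are hypothetical names, not part of the data set) so that it runs on its own. With the real data, the same scale argument would be passed alongside words and freq.

```r
library(wordcloud)

# Toy words and frequencies (made up for illustration)
toyWords <- c("iphone", "itunes", "ipad", "new", "phone", "battery", "store", "app")
toyFreq  <- c(40, 25, 20, 15, 12, 8, 5, 3)

# Smaller scale: the most frequent word is drawn at size 2, the least frequent at 0.25
wordcloud(toyWords, toyFreq, scale = c(2, 0.25))
```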

So far, the word cloud we've built has not been very visually appealing: it is crowded with too many words, and it doesn't take advantage of color. One important step in building appealing visualizations is to experiment with the available parameters, which in this case can be viewed by typing ?wordcloud in your R console.
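
As one possible starting point for that experimentation, the sketch below combines several of the parameters listed in ?wordcloud. The words, frequencies, and chosen values are made up for illustration, and the "Blues" palette from RColorBrewer is only one option; none of this is prescribed by the assignment.

```r
library(wordcloud)
library(RColorBrewer)

# Toy words and frequencies (made up for illustration)
toyWords <- c("iphone", "itunes", "ipad", "new", "phone", "battery", "store", "app")
toyFreq  <- c(40, 25, 20, 15, 12, 8, 5, 3)

# Sequential palette; drop the lightest shades, which are hard to read on white
blues <- brewer.pal(9, "Blues")[5:9]

set.seed(123)  # the layout is randomized, so fix the seed for a repeatable plot
wordcloud(toyWords, toyFreq,
          min.freq = 5,          # skip words that occur fewer than 5 times
          max.words = 100,       # cap the number of words plotted
          random.order = FALSE,  # plot the most frequent words first, in the center
          rot.per = 0.1,         # proportion of words rotated by 90 degrees
          colors = blues)        # colors assigned from least to most frequent
```

Raising min.freq or lowering max.words reduces crowding, while a multi-color palette lets color carry frequency information in addition to size.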

# Word cloud of negative tweets (average sentiment of -1 or lower)
negativeTweets = subset(allTweets, tweets$Avg <= -1)
# With random.order = FALSE, the most frequent (largest) words are plotted
# first, so they are displayed together in the center of the word cloud
wordcloud(colnames(negativeTweets), colSums(negativeTweets), colors = "purple", 
    random.order = FALSE)

[Figure: word cloud of negative tweets]
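
The subset call above filters the rows of allTweets using a condition computed from a different data frame, tweets. This works because row i of allTweets describes the same tweet as row i of tweets. A minimal sketch of that pattern, using made-up values and hypothetical names (toyTweets, toyCounts):

```r
# Toy stand-ins for the two row-aligned objects (values made up)
toyTweets <- data.frame(Avg = c(-1.5, 0.2, -1.0, 1.8))
toyCounts <- data.frame(bad  = c(1, 0, 2, 0),
                        love = c(0, 1, 0, 2))

# A logical vector built from one data frame selects rows of the other
isNegative  <- toyTweets$Avg <= -1
toyNegative <- toyCounts[isNegative, ]

# Word frequencies among the negative toy tweets only
colSums(toyNegative)
```

If the rows of the two objects were ever reordered or filtered separately, the alignment would break, so the subset should be taken before any such change.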