Word Clouds

This week we look at creating word clouds from Twitter data.

Data you’ll need

In this tutorial we’ll work with JSON files. For this example, the tweets were collected through Twitter’s API (more on how to do that can be found at http://pablobarbera.com/blog/archives/1.html ).

For simplicity, you can download the data I’ve used here: https://www.dropbox.com/s/dxhtrdqlcfifaes/tweets_BS.json?dl=0

Note: Make sure your R working directory is the folder that contains the JSON file (check it with getwd() and change it with setwd()).

Some things to remember when making word clouds:

stem your text - e.g. verbs (walk, walked, walking => walk), nouns (cat, cats => cat)
remove stop words (very common words that carry little meaning) - e.g. ‘a’, ‘the’
make the case consistent - e.g. convert everything to lowercase so that ‘Vote’ and ‘vote’ count as the same word
decide how to handle links - e.g. remove the links/images from tweets
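
To make the checklist above concrete, here is a minimal base R sketch of the lowercase, link and stop-word steps. The helper clean_tweet(), its tiny stop-word list and the example tweet are all made up for illustration; they are not part of the tm workflow used later, and stemming is left out here.

```r
# Hypothetical helper: lowercases text, strips links and punctuation,
# and drops a few stop words. The stop-word list is a toy example.
clean_tweet <- function(text, stop_words = c("a", "an", "the", "is", "of")) {
  text <- tolower(text)                          # make the case consistent
  text <- gsub("http[^[:space:]]*", "", text)    # remove links
  text <- gsub("[[:punct:]]", "", text)          # remove punctuation
  words <- strsplit(text, "[[:space:]]+")[[1]]   # split into words
  words <- words[nzchar(words) & !(words %in% stop_words)]
  paste(words, collapse = " ")
}

clean_tweet("The cats are walking: http://example.com/pic A CAT walked!")
```

Note that the link is removed before the punctuation. Doing it the other way round would turn the URL into an ordinary-looking word that is much harder to filter out.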

Libraries you’ll need:

Make sure the following libraries are installed. If one is missing, library() will stop with an error message. To fix this, install the missing library first, then load it:

install.packages("library_name")  ## replace library_name with the name of the package, in quotes
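
If you would rather check everything in one go, something like the following sketch lists which packages still need installing. The helper missing_packages() is hypothetical, and SnowballC is included on the assumption that tm's stemDocument needs it for stemming.

```r
# Packages loaded in this tutorial. SnowballC is an assumption: it is
# listed because tm's stemDocument() relies on it.
needed <- c("RCurl", "bitops", "rjson", "streamR",
            "RColorBrewer", "wordcloud", "NLP", "tm", "SnowballC")

# Hypothetical helper: returns the packages that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```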

##Make sure you have RCurl, bitops, rjson loaded before you load streamR
library(RCurl)
library(bitops)
library(rjson)
library(streamR)

##Make sure you have RColorBrewer loaded before you load wordcloud
library(RColorBrewer)
library(wordcloud)

##Make sure you have NLP loaded before you load tm
library(NLP)
library(tm)

Ok, now for the good stuff:

## First parse your tweets into a dataframe
tweets_BS.df <- parseTweets("tweets_BS.json", simplify = TRUE)

## You should get the following message:
## 813 tweets have been parsed.

## Next, convert the text to ASCII, dropping any characters that can't be converted (emoji, accented letters, etc.)
tweets_BS.df$text <- iconv(tweets_BS.df$text, "latin1", "ASCII", sub = "")
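
To see what that conversion does, here is a small sketch on a made-up string (not taken from the tweet data). It contains one Latin-1 byte with no ASCII equivalent. The exact output can vary by platform, so the check below only confirms that what comes back is pure ASCII.

```r
# A made-up string with one Latin-1 byte (\xe9, an accented "e")
x <- "caf\xe9 society"

# Same call as in the tutorial: bytes that cannot be converted are
# replaced by the empty string given in sub
ascii <- iconv(x, "latin1", "ASCII", sub = "")
print(ascii)

# Every remaining byte should be in the ASCII range (below 128)
all(as.integer(charToRaw(ascii)) < 128)
```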

## Now that your tweet data is ready to go, you can clean it!
## Collapse all the tweets into a single string, separated by spaces
TweetCorpus <- paste(unlist(tweets_BS.df$text), collapse = " ")
## Build a tm corpus from that string
TweetCorpus <- Corpus(VectorSource(TweetCorpus))
## Convert everything to lowercase first, so that capitalised stop words (e.g. 'The') are matched below
TweetCorpus <- tm_map(TweetCorpus, content_transformer(tolower))
## Remove punctuation
TweetCorpus <- tm_map(TweetCorpus, removePunctuation)
## Remove English stop words
TweetCorpus <- tm_map(TweetCorpus, removeWords, stopwords('english'))
## Stem each word (e.g. walking => walk); stemDocument relies on the SnowballC package
TweetCorpus <- tm_map(TweetCorpus, stemDocument)
## Convert the result to plain text documents (a workaround from older tm versions; since tolower is wrapped in content_transformer() above, this step may be unnecessary)
TweetCorpus <- tm_map(TweetCorpus, PlainTextDocument)

## And finally, draw the word cloud: at most 100 words, plotted in decreasing order of frequency
wordcloud(TweetCorpus, max.words = 100, random.order = FALSE)
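
If you want to sanity-check which words should end up biggest in the cloud, a quick frequency count helps. This is a base R sketch with a hypothetical helper, word_counts(), run on a made-up string; it is not how wordcloud counts terms internally.

```r
# Hypothetical helper: counts the words in a string, most frequent first.
# Anything that is not a letter, digit, # or @ is treated as a separator.
word_counts <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9#@]+")[[1]]
  words <- words[nzchar(words)]
  sort(table(words), decreasing = TRUE)
}

# Made-up example text, not taken from the tweet data
counts <- word_counts("bernie vote bernie debate vote bernie")
head(counts, 3)
```

The words at the top of this table are the ones you would expect to be drawn largest.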

Result:

[Image: Bernie Sanders word cloud]