Twitter Word Cloud

In this natural language processing problem, we will create a word cloud from live Twitter data.

During this exercise, we will perform the following steps:
  1. Extract data from Twitter
  2. Clean the data, build a term document matrix, and find frequent words and associations
  3. Create a data frame of term frequencies
  4. Visualize important words

1. Extract Data from Twitter

1.1 Let’s load the required packages

# Load packages
library(twitteR)
library(tm)
library(wordcloud)
library(e1071)
library(RColorBrewer)
library(class)
library(ggplot2)

1.2 Let’s set up Twitter authorization

To set up authorization, we need a Twitter account and an app created on Twitter Apps.

Once the app is created, we use its consumer key, consumer secret, access token, and access token secret (stored here as ckey, skey, token, and stoken) to authenticate and connect to Twitter.

# Connect to Twitter
setup_twitter_oauth(ckey,skey,token,stoken)
## [1] "Using direct authentication"
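
The four credentials should not be typed into a script that gets shared. One possible approach, sketched below, is to read them from environment variables. The variable names (TWITTER_CKEY and so on) are hypothetical and not something twitteR requires; the sketch only calls setup_twitter_oauth() when all four values are present and twitteR is installed.

```r
# Hypothetical environment variable names; define them in ~/.Renviron
# or in the shell before starting R.
cred.names <- c("TWITTER_CKEY", "TWITTER_SKEY", "TWITTER_TOKEN", "TWITTER_STOKEN")
creds <- Sys.getenv(cred.names, unset = NA)

# Names of any credentials that were not found
missing.creds <- cred.names[is.na(creds)]

if (length(missing.creds) > 0) {
  message("Missing credentials: ", paste(missing.creds, collapse = ", "))
} else if (requireNamespace("twitteR", quietly = TRUE)) {
  twitteR::setup_twitter_oauth(creds[["TWITTER_CKEY"]], creds[["TWITTER_SKEY"]],
                               creds[["TWITTER_TOKEN"]], creds[["TWITTER_STOKEN"]])
}
```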

1.3 Let’s get up to 1000 English-language tweets containing the word ‘football’

football.tweets <- searchTwitter('football', n=1000, lang = 'en')

1.4 Let’s extract the text from each tweet

football.text <- sapply(football.tweets, function(x) x$getText())

2. Clean the Data and Build a Document Matrix

2.1 Clean text data

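Let’s convert the text from UTF-8 to ASCII. Note that this does not strip individual characters: any tweet containing a character that ASCII cannot represent, such as an emoji, becomes NA as a whole.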

football.text <- iconv(football.text, 'UTF-8', 'ASCII')
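
If losing whole tweets is not desired, one option is the sub argument of iconv(), which replaces non-convertible bytes instead of returning NA. The sketch below compares the two behaviours on three made-up tweets. The exact result can vary between platforms and iconv implementations, so treat it as an illustration rather than a guarantee.

```r
# Made-up sample tweets: plain ASCII, an emoji, and an accented letter
tweets <- c("Great game today",
            "Touchdown! \U0001F3C8",
            "Caf\u00e9 before kickoff")

# Default behaviour: a string with any non-convertible character becomes NA
strict <- iconv(tweets, from = "UTF-8", to = "ASCII")

# With sub = "", non-convertible bytes are dropped and the rest is kept
stripped <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

print(strict)
print(stripped)
print(sum(is.na(strict)))  # number of tweets lost under the strict conversion
```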

2.2 View Tweets

Let’s look at some tweets from our data

football.text[1:10]
##  [1] NA                                                                                                                                         
##  [2] "RT @RaveenTheDream: He's an Atlanta celebrity from Atlanta at a football game in Atlanta. https://t.co/xKC2V6IA7g"                        
##  [3] "RT @saraholmesSTL: NFL commentary from @jthom1: 'All's well with the NFL if the rich get richer' https://t.co/mBKoFm0rAa | @stltoday #nfl"
##  [4] "Im sorry but I hate football"                                                                                                             
##  [5] "Bill Snyder is college football&amp;#039;s greatest asset https://t.co/kYoUacTNPL"                                                        
##  [6] NA                                                                                                                                         
##  [7] "RT @kananmj: This makes my heart happy. Congrats to @CoachTimLester, to @WMU_Football and all of Bronco Nation! https://t.co/2S13MYIKOr"  
##  [8] "Sukoa Sports Ball Pump with Pin Needle - Soccer, Volleyball, Basketball, Rugby, Football - Superior.. https://t.co/LNkvJyfjQe"            
##  [9] "RT @HilariousRoasts: This is what football is all about. https://t.co/iWCt8NMhNx"                                                         
## [10] "#TEAM SATURDAYENTRY Football"
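
Most of these tweets end in a t.co link. Once punctuation is removed, such links turn into tokens like httpstcoxkcviag, which then show up among the frequent terms. A minimal sketch of stripping links before building the corpus follows; strip_urls is a hypothetical helper written for this example, not a function from twitteR or tm.

```r
# Hypothetical helper, not part of twitteR or tm
strip_urls <- function(x) {
  x <- gsub("https?://\\S+", "", x, perl = TRUE)  # drop http/https links
  x <- gsub("\\s+", " ", x, perl = TRUE)          # collapse runs of whitespace
  trimws(x)                                       # trim leading/trailing spaces
}

sample.tweets <- c(
  "RT @HilariousRoasts: This is what football is all about. https://t.co/iWCt8NMhNx",
  "Im sorry but I hate football",
  NA
)

cleaned <- strip_urls(sample.tweets)
print(cleaned)
```

NA entries pass through unchanged, so the helper could be applied to football.text right after the iconv() step.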

2.3 Create a Corpus

Let’s wrap the text vector in a vector source and build a corpus from it, so that each tweet becomes one document

football.corpus <- Corpus(VectorSource(football.text))

2.4 Create a Term Document Matrix

While building the term document matrix, we will:
  1. Remove punctuation
  2. Remove stop words (the standard English list plus a few custom words such as 'football')
  3. Remove numbers
  4. Convert all words to lower case

football.term.doc <- TermDocumentMatrix(football.corpus,
                      control = list(removePunctuation = TRUE,
                      stopwords=c('football','soccer','girls','boys','like',
                                  'came','get','game','play',
                                  stopwords('en')),
                      removeNumbers=TRUE, tolower=TRUE))
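
To make concrete what a term document matrix holds, here is a rough base R sketch that builds one by hand for three made-up documents. It is not how tm does it internally: the whitespace tokenizer and the two-word stop list are simplifying assumptions chosen for illustration.

```r
# Made-up documents and a tiny illustrative stop list (not stopwords('en'))
docs <- c("The NFL game tonight!", "Watch the NFL 2 nights a week", "Atlanta game")
stop.words <- c("the", "a")

normalise <- function(x) {
  x <- tolower(x)                  # convert to lower case
  x <- gsub("[[:punct:]]", "", x)  # remove punctuation
  x <- gsub("[[:digit:]]", "", x)  # remove numbers
  x
}

# Split on whitespace, then drop empty tokens and stop words
tokens <- lapply(strsplit(normalise(docs), "\\s+"),
                 function(tok) tok[nzchar(tok) & !(tok %in% stop.words)])
terms <- sort(unique(unlist(tokens)))

# One row per term, one column per document; each cell holds a count
toy.tdm <- vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = terms))),
                  integer(length(terms)))
dimnames(toy.tdm) <- list(Terms = terms, Docs = paste0("doc", seq_along(docs)))
print(toy.tdm)
```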

2.5 Frequent Words and Associations

Let’s look at frequently used words in our Twitter data and at the words most strongly associated with one of them

# Find words with a frequency >= 20
findFreqTerms(football.term.doc, lowfreq = 20)
##  [1] "atlanta"           "barstoolsports"    "boyfriend"        
##  [4] "celebrity"         "hes"               "httpstconrjiwitd" 
##  [7] "httpstcovmrtzieyl" "httpstcoxkcviag"   "just"             
## [10] "lil"               "nfl"               "raveenthedream"   
## [13] "see"               "texans"            "watch"            
## [16] "watching"          "whatever"          "worldstarfunny"

# Find words whose correlation with the word 'atlanta' is at least 0.4
findAssocs(football.term.doc, 'atlanta', corlimit=0.4)
## $atlanta
##       celebrity httpstcoxkcviag  raveenthedream             hes 
##            1.00            1.00            1.00            0.91
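
The scores above are correlations between a term's per-document counts and those of 'atlanta'. A score of 1.00 means two terms rise and fall together across all documents, which is what happens when a single tweet is retweeted many times. The sketch below reproduces the idea with base R's cor() on a made-up matrix; it illustrates the calculation as I understand it and is not tm's own code.

```r
# Toy term-document counts: rows are terms, columns are five documents
toy <- rbind(
  atlanta   = c(3, 3, 0, 0, 0),
  celebrity = c(1, 1, 0, 0, 0),
  hes       = c(1, 1, 0, 1, 0),
  nfl       = c(0, 0, 1, 1, 1)
)

# Pearson correlation of every other term's counts with those of 'atlanta'
assoc <- apply(toy[rownames(toy) != "atlanta", ], 1,
               function(counts) cor(counts, toy["atlanta", ]))
assoc <- round(sort(assoc, decreasing = TRUE), 2)

# Keep only terms at or above the chosen correlation limit of 0.4
print(assoc[assoc >= 0.4])
```

Here 'celebrity' appears only in the documents that mention 'atlanta', so its correlation is 1, while 'nfl' never co-occurs with it and is filtered out.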

3. Create a Data Frame

3.1 Create a matrix from TDM

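The term document matrix is stored in a sparse format rather than as a regular R matrix. With this step, let’s convert it into a regular (dense) matrix so that functions such as rowSums work on it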

football.matrix <- as.matrix(football.term.doc)

3.2 Get the Word Count

Let’s get the word count in decreasing order of frequency

term.freq <- sort(rowSums(football.matrix), decreasing = TRUE)

3.3 Create Data Frame

Let’s create a data frame of the words and their frequencies

football.df <- data.frame(term = names(term.freq), freq = term.freq)
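
The three steps in this section can be traced on a small made-up matrix. One detail worth knowing: because term.freq is a named vector, data.frame() also uses the terms as row names. Passing row.names = NULL, as in this sketch, gives plain numbered rows instead; that is a stylistic choice, not something the steps above require.

```r
# Made-up counts: rows are terms, columns are three documents
toy.matrix <- rbind(
  atlanta = c(1, 0, 0),
  game    = c(1, 0, 1),
  nfl     = c(1, 1, 1)
)

# Total count per term, most frequent first
toy.freq <- sort(rowSums(toy.matrix), decreasing = TRUE)

# One row per term; row.names = NULL keeps the row names as 1, 2, 3
toy.df <- data.frame(term = names(toy.freq), freq = toy.freq, row.names = NULL)
print(toy.df)
```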

4. Visualize Important Words

4.1 Create a Bar Plot

Let’s create a bar plot of words with frequencies >= 20

ggplot(subset(football.df, freq >= 20), aes(term, freq, fill = freq)) +
      geom_bar(stat = 'identity') +
      labs(x = 'Terms', y = 'Count', title = 'Term Frequencies') +
      coord_flip()
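
By default the bars appear in alphabetical order of the terms. If a frequency-ordered chart is preferred, one option is to turn term into a factor whose levels follow the counts. The sketch below uses made-up frequencies and only draws the plot when ggplot2 is installed.

```r
# Made-up terms and counts for illustration
toy.df <- data.frame(term = c("nfl", "atlanta", "watch", "texans"),
                     freq = c(48, 95, 27, 33))

# ggplot2 draws a discrete axis in factor-level order, so set the levels
# from least to most frequent; after coord_flip() the largest bar is on top
toy.df$term <- factor(toy.df$term, levels = toy.df$term[order(toy.df$freq)])
print(levels(toy.df$term))

# Draw the plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy.df, aes(term, freq, fill = freq)) +
    geom_bar(stat = "identity") +
    labs(x = "Terms", y = "Count", title = "Term Frequencies") +
    coord_flip()
  print(p)
}
```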

4.2 Create a Word Cloud

Let’s create a word cloud of up to 200 words, each with a minimum frequency of 5

wordcloud(football.df$term, football.df$freq, min.freq=5, max.words=200,
          random.order=FALSE, colors=brewer.pal(8, 'Dark2'))
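
Two things are worth checking before trusting the picture: which terms actually pass the min.freq and max.words filters, and whether the layout is reproducible. The sketch below uses made-up frequencies, applies the same two filters by hand in base R, and sets a seed (the value 42 is arbitrary) because word placement involves random numbers. The cloud is drawn only when the wordcloud package is installed.

```r
# Made-up terms and counts for illustration
toy.df <- data.frame(term = c("atlanta", "nfl", "texans", "watch", "lil", "see"),
                     freq = c(95, 48, 33, 27, 4, 2))

# Terms that pass min.freq = 5, capped at the 200 most frequent
eligible <- toy.df[toy.df$freq >= 5, ]
eligible <- head(eligible[order(-eligible$freq), ], 200)
print(eligible$term)

# Fix the random seed so the layout is the same on every run
set.seed(42)
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(eligible$term, eligible$freq, min.freq = 5,
                       max.words = 200, random.order = FALSE)
}
```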