The main aim of this report is to present the results of a short analysis of Elon Musk’s Twitter account using text mining packages for R.

What is Twitter?


Twitter has many different definitions, among them:

Some numbers:

Elon Musk


Elon Musk is an entrepreneur, engineer and inventor, currently best known as the CEO and founder of SpaceX and Tesla. He started his career by co-founding PayPal. His current net worth is estimated at 20 billion USD.

Techniques and Tools

Beyond basic skills with base R and fundamental packages, in particular dplyr and ggplot2, we used the following text mining techniques and tools:

  • Techniques
    • Text mining
    • Topic modelling
    • Sentiment analysis
    • Social network analysis (partial)
  • Tools
    • Twitter API
    • R and its packages:
      • twitteR
      • tm
      • topicmodels
      • sentiment140
      • igraph

These tools map naturally onto the basic step-by-step process we followed:

  1. Extract tweets and followers from the Twitter website with R and the twitteR package
  2. With the tm package, clean text by removing punctuations, numbers, hyperlinks and stop words, followed by stemming and stem completion
  3. Build a term-document matrix
  4. Analyse topics with the topicmodels package
  5. Analyse sentiment with the sentiment140 package
  6. Analyse following/followed and retweeting relationships with the igraph package


The next chapters follow this process and explain the basics of the relevant functions and methods in detail.

Twitter scraping

Using the twitteR package, anyone can scrape tweets into R with one simple function, userTimeline(). We connect to the Twitter API (where, of course, we must authenticate ourselves), download the data in JSON format, and convert it into a single R list.
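The authentication step is not shown in the report; a minimal sketch using twitteR’s setup_twitter_oauth(), with placeholder credentials from your own Twitter developer app:

library(twitteR)
# placeholder credentials -- substitute the values from your own Twitter app
consumer_key    <- "xxxx"
consumer_secret <- "xxxx"
access_token    <- "xxxx"
access_secret   <- "xxxx"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)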

elon <- userTimeline("elonmusk", n = 3200)

What we obtain is a list of 438 of Elon Musk’s tweets, with fields such as the tweet text or the retweet count. It is convenient to transform this list into a data frame.

We could use the purrr package to map the fields into data frame columns, but the biggest advantage of the twitteR package is that it already creates a data frame in a tidy way, so very little work was needed here.

elon_df <- twListToDF(elon)

Preview of the tidy data table:

Note that not every column is presented in the preview, but the full set of columns is as follows:

  • text - text of the tweet,
  • favorited - logical, whether we favorited the tweet,
  • favoriteCount - total number of favorites,
  • replyToSN - screen name of the user the tweet replies to,
  • created - creation date,
  • truncated - whether the tweet was cut off (only 140 characters scraped),
  • replyToSID - ID of the replied-to tweet,
  • id - ID of the tweet,
  • replyToUID - ID of the replied-to tweet’s author,
  • screenName - user’s screen name,
  • retweetCount - number of retweets,
  • isRetweet - whether it is a retweet,
  • longitude - geographical localization variable,
  • latitude - geographical localization variable.

Elon’s random tweets

We randomly selected a few tweets to see what they look like and, later in the report, to show how our operations on the dataset changed the original text.
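The sampling code is not shown; a minimal sketch, assuming elon_df from above (the seed value is our own choice):

set.seed(123)  # assumed seed; the report does not state one
idx <- sample(nrow(elon_df), 3)
elon_df[idx, c("text", "created")]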

## Apparently, some customs agencies are saying they won’t
## allow shipment of anything called a “Flamethrower”. To
## solv… https://t.co/OCtjvdXo95
## 2018-02-02 22:34:10
## 1,000 ordered already, only 19,000 left!
## 2018-01-28 03:07:40
## @atlasobscura A reminder of the youth of our 10,000 year
## (to be generous) civilization. We are not even a flash in…
## https://t.co/FXEDnXjwoh
## 2017-10-21 03:16:03

Text Cleaning

The first of our operations concerns data cleaning, for which we used the tm package, one of the most popular packages for text data analysis. The main structure for managing documents in tm is a so-called Corpus, representing a collection of text documents. The code below served as pre-cleaning to prepare the data for text mining. Each tweet is treated as a separate document.
Useful source: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

(corp <- Corpus(VectorSource(elon_df$text)))
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 437
corp <- tm_map(corp, function(x) iconv(enc2utf8(x), sub = "byte"))  # fix broken encodings
corp <- tm_map(corp, content_transformer(tolower))                  # 1. lowercase
corp <- tm_map(corp, content_transformer(urlDel))                   # 2. remove URLs
corp <- tm_map(corp, content_transformer(nonAlpha))                 # 3. remove non-letters
corp <- tm_map(corp, removeWords, stopwords("english"))             # 4. remove stopwords
corp <- tm_map(corp, stripWhitespace)                               # 5. collapse whitespace

The code above first creates a Corpus object. Then we perform a set of operations:

  1. Lowercasing the whole text, to avoid problems with varying letter case,
  2. Deletion of URLs, e.g. https://t.co/BBe3w78sIs, which mainly come from retweets (via our urlDel() helper, sketched after this list),
  3. Deletion of punctuation and other non-letter characters (via our nonAlpha() helper), e.g. in “The singularity for this level of the simulation is coming soon.”,
  4. Removal of stopwords, such as: and, or, by, then, etc.,
  5. Removal of unnecessary whitespace (double or more spaces that may result from the above operations).
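The helper functions urlDel() and nonAlpha() are our own, and their definitions are not shown in the report; a plausible minimal implementation with gsub(), to be defined before the tm_map() calls above:

urlDel   <- function(x) gsub("http[^[:space:]]*", "", x)       # drop URLs
nonAlpha <- function(x) gsub("[^[:alpha:][:space:]]", " ", x)  # keep only letters and spaces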

Stemming

The second step is stemming, which is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not itself a valid root.

corp <- tm_map(corp, stemDocument)
## appar custom agenc say won’t allow shipment anyth call
## “flamethrower” solv…
## order alreadi left
## jamesharvey wont even need ask time

To complete stemming, we derived a function that uses tm’s stemCompletion(). It heuristically completes stemmed words using the pre-stemming corpus, thus giving more readable results. We once again obtain a corpus.
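Neither stemCompletion2() nor corpcop appears in the report’s code; a minimal sketch, assuming corpcop is a copy of the corpus saved just before the stemDocument() step:

# corpcop <- corp  # run this BEFORE stemDocument() to keep the unstemmed text
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]                                  # drop empty tokens
  x <- stemCompletion(x, dictionary = dictionary)  # tm's heuristic completion
  paste(x, collapse = " ")
}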

corp <- lapply(corp, stemCompletion2, dictionary = corpcop)
corp <- Corpus(VectorSource(corp))
## apparently customer agencies say won’t allow shipment
## anything call “flamethrower” solv…
## ordered left
## jamesharvey wont even need ask time

Tweets before and after

To compare how the tweets changed at the particular steps of text cleaning, consider the example below:

Original tweet:

## Apparently, some customs agencies are saying they won’t
## allow shipment of anything called a “Flamethrower”. To
## solv… https://t.co/OCtjvdXo95

After first stemming:

## appar custom agenc say won’t allow shipment anyth call
## “flamethrower” solv…

After stemCompletion:

## apparently customer agencies say won’t allow shipment
## anything call “flamethrower” solv…

Number of particular words

We have created a function, wordFreq(), that returns the number of occurrences of a particular word in the corpus.
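Its definition is not shown in the report; one possible implementation, counting whole-word matches across all documents:

wordFreq <- function(corpus, word) {
  texts <- sapply(corpus, as.character)
  matches <- gregexpr(paste0("\\b", word, "\\b"), texts)  # positions of each match
  sum(sapply(matches, function(m) sum(m > 0)))            # -1 means no match
}

For example, for the word “tesla”: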

(n.tesla <- wordFreq(corpcop, "tesla"))
## [1] 28

And for the word “ai”:

(n.ai <- wordFreq(corpcop, "ai"))
## [1] 17

Term Document Matrix

A common approach in text mining is to create a term-document matrix from a corpus. In the tm package the classes TermDocumentMatrix and DocumentTermMatrix (depending on whether you want terms as rows and documents as columns, or vice versa) employ sparse matrices for corpora. Inspecting a term-document matrix displays a sample, whereas as.matrix() yields the full matrix in dense format.

(tdm <- TermDocumentMatrix(corp, control = list(wordLengths = c(1, Inf))))
## <<TermDocumentMatrix (terms: 1571, documents: 437)>>
## Non-/sparse entries: 3122/683405
## Sparsity           : 100%
## Maximal term length: 17
## Weighting          : term frequency (tf)

We might want to calculate occurrences using this matrix. The brute-force way is to inspect the relevant rows directly:

idx <- which(dimnames(tdm)$Terms %in% c("ai", "tesla", "yes"))
as.matrix(tdm[idx, 100:115])
##        Docs
## Terms   100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
##   yes     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   tesla   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
##   ai      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

Besides the fact that on this matrix a huge amount of R functions (like clustering, classifications, etc.) can be applied, this package brings some shortcuts. Imagine we want to find those terms that occur at least fifteen times, then we can use the findFreqTerms() function:

(freq.terms <- findFreqTerms(tdm, lowfreq = 15))
##  [1] "launch"  "falcon"  "heavier" "will"    "yes"     "come"    "one"    
##  [8] "rocket"  "tesla"   "need"    "good"    "like"    "work"    "can"    
## [15] "just"    "amp"     "hat"

Top Frequent Terms

Below you can see how the occurrences are distributed in our dataset:
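The plotting code is not shown; a sketch of how such a frequency chart could be drawn with ggplot2, assuming tdm from above:

library(ggplot2)
term.freq <- rowSums(as.matrix(tdm))
term.freq <- term.freq[term.freq >= 15]  # same threshold as findFreqTerms()
df <- data.frame(term = names(term.freq), freq = term.freq)
ggplot(df, aes(x = reorder(term, freq), y = freq)) +
    geom_col() +
    coord_flip() +
    xlab("Terms") + ylab("Count")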

Wordcloud

We used a word cloud to present the words in Elon Musk’s tweets, in which the size of each word indicates its frequency or importance.

  1. The first step is to convert the TermDocumentMatrix into a regular matrix;
  2. The second step is to calculate the frequency of each word and sort it in descending order;
  3. The last step is to create the word cloud, with the most frequent words in the centre.
library(wordcloud)
library(RColorBrewer)
mat <- as.matrix(tdm)
# frequency of each word, sorted in descending order
word.freq <- sort(rowSums(mat), decreasing = T)
# colour palette (drop the lightest shades)
pal <- brewer.pal(9, "BuGn")[-(1:4)]
# generate the wordcloud
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3, random.order = F, 
    colors = pal, scale = c(2, 0.5))

Analyses

Associations

For any given word, findAssocs() calculates its correlation with every other word in a TDM or DTM. Scores range from 0 to 1. A score of 1 means that two words always appear together, while a score of 0 means that they never appear together.

findAssocs(tdm, "tesla", 0.2)
## $tesla
##      future…         past         semi supercharger      present 
##         0.36         0.36         0.31         0.25         0.25 
##      believe          oct          low        stock      obvious 
##         0.25         0.25         0.25         0.25         0.23 
##         cool         want 
##         0.23         0.23

The word “tesla” has its biggest correlation with “ismailnathij”, with a score of about 0.45.

findAssocs(tdm, "spacex", 0.2)
## $spacex
##    spaceship   neddesmond       design   blueorigin    jeffbezos 
##         0.39         0.39         0.31         0.27         0.27 
## cristatolive          run      attempt       first…   spacecraft 
##         0.27         0.27         0.27         0.27         0.27 
##          be…      blooper         epic    explosion      footage 
##         0.27         0.27         0.27         0.27         0.27 
##       messed     together         blvd   bryanclark         jack 
##         0.27         0.27         0.27         0.27         0.27 
##          ju…     northrop     parallel        later   commercial 
##         0.27         0.27         0.27         0.27         0.27 
##        ocean      program   undergoing      distant         less 
##         0.27         0.27         0.27         0.27         0.27 
##         lik…    proximity        south      targets  complacency 
##         0.27         0.27         0.27         0.27         0.27 
##    confident          dod   suggesting       rocket         week 
##         0.27         0.27         0.27         0.24         0.23

The word “spacex” has its biggest correlation with the word “bebeoutside”. Both cases suggest that Musk spoke a lot about Tesla with someone called “ismailnathij” and about SpaceX with “bebeoutside”. An important point to note is that the presence of a term in these lists is not indicative of its frequency; rather, it is a measure of how frequently the two terms (search term and result term) co-occur across documents.

Network of terms

We often want to know the connections between words, just as between people. With network analysis, not only can we determine which terms frequently appear together, we can also visualize how keywords and tweets are connected as a network of terms. This way, we can see how many connections each keyword has with the others. We have chosen to show the network of the 15 most frequent terms.
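The network-building code is not shown in the report; a sketch with igraph, assuming tdm and freq.terms from above:

library(igraph)
idx <- which(dimnames(tdm)$Terms %in% freq.terms)
m <- as.matrix(tdm[idx, ])
m[m > 1] <- 1                     # binarize: is the term present in the document?
termM <- m %*% t(m)               # term-term co-occurrence counts
g <- graph.adjacency(termM, weighted = TRUE, mode = "undirected", diag = FALSE)
plot(g, vertex.size = degree(g), layout = layout.fruchterman.reingold)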

We can see, for example, that “hat” is connected only to the word “will”, perhaps because of the limited edition of hats made by his Boring Company.

Topic Modelling

Topic modelling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for. Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language. In practice, we choose how many topics we want, and the function then finds the words that best describe each topic.
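The model-fitting code is not shown; a minimal sketch using the topicmodels package, with k = 8 matching the eight topics below:

library(topicmodels)
dtm <- as.DocumentTermMatrix(tdm)       # LDA() expects documents as rows
dtm <- dtm[slam::row_sums(dtm) > 0, ]   # drop documents left empty by cleaning
lda <- LDA(dtm, k = 8)
terms(lda, 9)                           # top 9 terms of each topic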

##                                                        Topic 1 
##           "sure,one,car,good,just,ai,use,appreciated,computer" 
##                                                        Topic 2 
## "hat,will,now,maralkalajian,yeah,get,right,tesla,blundellapps" 
##                                                        Topic 3 
##    "come,land,hyperloop,rocket,like,thatâs,soon,version,first" 
##                                                        Topic 4 
##           "companies,tunnel,build,time,will,k,just,amp,cities" 
##                                                        Topic 5 
##       "í,businessinsider,tesla,amp,much,us,anything,high,work" 
##                                                        Topic 6 
##       "yes,just,rocket,spacex,like,flamethrower,will,fun,real" 
##                                                        Topic 7 
##                  "yes,even,maybe,part,will,ok,matter,way,just" 
##                                                        Topic 8 
##       "falcon,heavier,launch,thrust,can,tesla,cool,model,will"

On the accompanying plot we can see how the density of topics changes over time.

Sentiment Analysis

When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. Therefore we present a graph showing sentiment of Elon’s tweets.

## 
## negative  neutral positive 
##       17      327       93

We add a score column, assigning (a sketch of the scoring follows below):

  • 1 point if a word is positive,
  • -1 point if a word is negative.
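The scoring code is not shown; a hedged sketch, assuming the sentiment() function from the sentiment140 package (which installs as the sentiment package from the okugami79/sentiment140 repository) returns a polarity for each text:

library(sentiment)  # assumption: provides sentiment(), returning a polarity column
sent <- sentiment(elon_df$text)
table(sent$polarity)
elon_df$score <- ifelse(sent$polarity == "positive", 1,
                        ifelse(sent$polarity == "negative", -1, 0))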

As we can see, the tweets are rarely negative; most are neutral or positive. The most negative periods were mostly in July and December.

Followers analysis

Retrieve User Info

We used the twitteR package to retrieve user info; the relevant code is shown below. Sadly, we cannot access all of Elon’s followers, since they number about 18 million and we can retrieve only about 50k per hour. There is an option to retrieve the followers of his followers, but this is even more prohibitively expensive.
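A sketch of the retrieval call with twitteR’s getUser():

user <- getUser("elonmusk")
twListToDF(list(user))  # one-row summary like the listing below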

  • statusesCount: 3839
  • followersCount: 18539046
  • favoritesCount: 746
  • friendsCount: 48
  • url: NA
  • name: Elon Musk
  • created: 2009-06-02 20:12:29
  • protected: FALSE
# friends <- user$getFriends() # who this user follows

# followers <- user$getFollowers() # this user's followers

Top Retweeted Tweets

Here we plot the top retweeted tweets. We set the threshold at 50’000 retweets, though obviously we can set it at any other value. We can also change colours, label positions, etc.
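The plotting code is not shown; a sketch with ggplot2, assuming elon_df and the 50’000-retweet threshold:

library(ggplot2)
top <- subset(elon_df, retweetCount > 50000)
ggplot(top, aes(x = created, y = retweetCount)) +
    geom_point(colour = "steelblue") +
    geom_text(aes(label = substr(text, 1, 30)), vjust = -0.8, size = 3)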

Tracking Message Propagation

Using the package, we can extract up to 100 retweeters. Usually, it will return fewer.
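A sketch of how the retweeters and their follower counts could be pulled with twitteR (we assume retweeters() and lookupUsers() behave as documented; the tweet ID is the one shown below):

rts <- retweeters("959555569953660928", n = 100)  # user IDs of up to 100 retweeters
rt_users <- lookupUsers(rts)
rt_followers <- sapply(rt_users, function(u) u$followersCount)
sum(rt_followers)  # total reach of the sampled retweeters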

Tweet’s text:

## [1] "Apparently, some customs agencies are saying they won’t allow shipment of anything called a “Flamethrower”. To solv… https://t.co/OCtjvdXo95"

Tweet’s ID:

## [1] "959555569953660928"

Retweeters’ follower numbers:

## [1] 20684

Here we make an attempt at estimating message propagation, given that we cannot access the real data because Elon is too popular (how sad). The tweet has 105051 retweets in total and our sample of retweeters had a total of 216919 followers, so scaling the sample’s average follower count to all retweets (105051 × 216919 / 100) gives a guesstimated reach of about 227’875’578 people. Of course, there is a very high chance this number is inflated: some of those ~217k followers are counted more than once, and you can see the tweet only once even if you follow several people who retweeted it. We can therefore treat that number as an upper bound.

R packages (once more)

Most of the time, we used the following packages:

  • Twitter data extraction: twitteR
  • Text cleaning and mining: tm
  • Word cloud: wordcloud
  • Topic modelling: topicmodels, lda
  • Sentiment analysis: sentiment140
  • Social network analysis: igraph, sna
  • Visualisation: wordcloud, Rgraphviz, ggplot2