The main aim of this report is to present the results of a short analysis of Elon Musk’s Twitter account using text mining packages for R.
Twitter can be defined in many different ways; among them:
Some numbers:
Elon Musk is an entrepreneur, engineer and inventor, currently known as the CEO and founder of SpaceX and Tesla. He started his career by co-founding PayPal. His current net worth is estimated at 20 billion USD.
Leaving aside basic skills with base R and fundamental packages, in particular dplyr and ggplot2, we used the following specific text mining packages and techniques:
These packages translate naturally into a basic step-by-step process describing what we wanted to do:
The next chapters follow this process and explain the basics of the functions and methods used.
Using the twitteR package anyone can scrape tweets into R with one simple function, userTimeline. We connect to the Twitter API (where, of course, we must authenticate ourselves), download the data in JSON format and translate it into a single list in R.
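Before userTimeline can be called, the session has to be authenticated. A minimal sketch, assuming a registered Twitter app (the placeholder credentials below are not real):

library(twitteR)

# Placeholder credentials from your own Twitter app settings (hypothetical values)
consumer_key    <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token    <- "your_access_token"
access_secret   <- "your_access_secret"

# Register the credentials for the current R session
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)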
elon <- userTimeline("elonmusk", n = 3200)
What we obtain is a list of all Elon Musk’s tweets, 438 in total, with fields such as the text of the tweet or the number of retweets. It is convenient to transform this list into a data frame.
We use the purrr package to map the fields into data frame columns. The biggest advantage of the twitteR package is that it already creates a data frame in a tidy way, so not much work was needed here.
elon_df <- twListToDF(elon)
Note that not every column is presented here, but the full set of columns is as follows:
We randomly selected a few tweets to see how they look, and later to show how our operations on the dataset changed the original text.
## Apparently, some customs agencies are saying they won’t
## allow shipment of anything called a “Flamethrower”. To
## solv… https://t.co/OCtjvdXo95
## 2018-02-02 22:34:10
## 1,000 ordered already, only 19,000 left!
## 2018-01-28 03:07:40
## @atlasobscura A reminder of the youth of our 10,000 year
## (to be generous) civilization. We are not even a flash in…
## https://t.co/FXEDnXjwoh
## 2017-10-21 03:16:03
The first of our operations concerns data cleaning, which is why we used the tm package, one of the most popular packages for text data analysis. The main structure for managing documents in tm is the so-called Corpus, representing a collection of text documents. The code below therefore served as pre-cleaning, preparing clean text mining data. Each tweet is treated as a separate document.
Useful source: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
(corp <- Corpus(VectorSource(elon_df$text)))
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 437
corp <- tm_map(corp, function(x) iconv(enc2utf8(x), sub = "byte"))  # force valid UTF-8
corp <- tm_map(corp, content_transformer(tolower))                  # lowercase everything
corp <- tm_map(corp, content_transformer(urlDel))                   # remove URLs
corp <- tm_map(corp, content_transformer(nonAlpha))                 # keep letters and spaces only
corp <- tm_map(corp, removeWords, stopwords("english"))             # drop English stopwords
corp <- tm_map(corp, stripWhitespace)                               # collapse extra whitespace
The code above first creates the Corpus object and then performs the cleaning operations annotated in the comments. The urlDel and nonAlpha helpers are small custom functions; a possible sketch is shown below.
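A minimal sketch of the two custom helpers, assuming they simply strip URLs and non-letter characters (their actual definitions are not shown in this report):

# Assumed definitions of the custom content transformers used above
urlDel <- function(x) gsub("http[^[:space:]]*", "", x)          # remove URLs
nonAlpha <- function(x) gsub("[^[:alpha:][:space:]]", " ", x)   # keep letters and spaces only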
The second step was stemming, which is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
corp <- tm_map(corp, stemDocument)
## appar custom agenc say wonât allow shipment anyth call
## âflamethrowerâ solvâ
## order alreadi left
## jamesharvey wont even need ask time
To complete the stems we derived a function, stemCompletion2, that wraps tm’s stemCompletion function. It heuristically completes stemmed words using the pre-stemming corpus, giving more relevant results. We once again obtain a corpus.
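A minimal sketch of what stemCompletion2 might look like (its exact definition is not shown here; corpcop is assumed to be a copy of the corpus taken before stemming):

# Split each stemmed document into words, complete every stem against the
# pre-stemming corpus used as a dictionary, and paste the document back together.
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  paste(x, collapse = " ")
}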
corp <- lapply(corp, stemCompletion2, dictionary = corpcop)
corp <- Corpus(VectorSource(corp))
## apparently customer agencies say wonât allow shipment
## anything call âflamethrowerâ solvâ
## ordered left
## jamesharvey wont even need ask time
To compare how the tweets change at particular steps of text cleaning, consider the example below:
Original tweet:
## Apparently, some customs agencies are saying they won’t
## allow shipment of anything called a “Flamethrower”. To
## solv… https://t.co/OCtjvdXo95
After first stemming:
## appar custom agenc say wonât allow shipment anyth call
## âflamethrowerâ solvâ
After stemCompletion:
## apparently customer agencies say wonât allow shipment
## anything call âflamethrowerâ solvâ
We have created a function, wordFreq, that counts the occurrences of a particular word in the corpus; a sketch of a possible implementation is shown after the two examples below. For the word “tesla” the count is:
(n.tesla <- wordFreq(corpcop, "tesla"))
## [1] 28
And for the word “ai”:
(n.ai <- wordFreq(corpcop, "ai"))
## [1] 17
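A minimal sketch of how such a wordFreq helper might be written (the original definition is not shown; this version counts token occurrences in the pre-stemming corpus corpcop and may differ in detail from the report’s own helper):

# Count how many times `word` appears as a token across all documents in a corpus
wordFreq <- function(corpus, word) {
  sum(sapply(corpus, function(doc) {
    tokens <- unlist(strsplit(as.character(doc), "\\s+"))
    sum(tokens == word)
  }))
}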
A common approach in text mining is to create a term-document matrix from a corpus. In the tm package the classes TermDocumentMatrix and DocumentTermMatrix (depending on whether you want terms as rows and documents as columns, or vice versa) employ sparse matrices for corpora. Inspecting a term-document matrix displays a sample, whereas as.matrix() yields the full matrix in dense format.
(tdm <- TermDocumentMatrix(corp, control = list(wordLengths = c(1, Inf))))
## <<TermDocumentMatrix (terms: 1571, documents: 437)>>
## Non-/sparse entries: 3122/683405
## Sparsity : 100%
## Maximal term length: 17
## Weighting : term frequency (tf)
We might want to calculate occurrences using this matrix. The brute-force way is to inspect occurrences row by row, like here:
idx <- which(dimnames(tdm)$Terms %in% c("ai", "tesla", "yes"))
as.matrix(tdm[idx, 100:115])
## Docs
## Terms 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
## yes 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## tesla 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## ai 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Besides the fact that a huge number of R functions (clustering, classification, etc.) can be applied to this matrix, the package also brings some shortcuts. Imagine we want to find the terms that occur at least fifteen times; we can use the findFreqTerms() function:
(freq.terms <- findFreqTerms(tdm, lowfreq = 15))
## [1] "launch" "falcon" "heavier" "will" "yes" "come" "one"
## [8] "rocket" "tesla" "need" "good" "like" "work" "can"
## [15] "just" "amp" "hat"
Below you can see how the occurrences are distributed in our dataset:
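A possible sketch of the frequency barplot (assuming ggplot2 and the tdm and freq.terms objects from above; the original plotting code is not shown):

library(ggplot2)

# Keep only terms that occur at least 15 times and plot their counts
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 15)
freq.df <- data.frame(term = names(term.freq), freq = term.freq)
ggplot(freq.df, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Terms", y = "Count")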
We used a wordcloud to present the words in Elon Musk’s tweets, in which the size of each word indicates its frequency or importance.
library(wordcloud)
library(RColorBrewer)

mat <- as.matrix(tdm)
# Word frequencies, sorted from most to least frequent
word.freq <- sort(rowSums(mat), decreasing = TRUE)
# Colour palette (drop the lightest shades)
pal <- brewer.pal(9, "BuGn")[-(1:4)]
# Generate the wordcloud
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3, random.order = FALSE,
          colors = pal, scale = c(2, 0.5))
For any given word, findAssocs() calculates its correlation with every other word in a TDM or DTM. Scores range from 0 to 1. A score of 1 means that two words always appear together, while a score of 0 means that they never appear together.
findAssocs(tdm, "tesla", 0.2)
## $tesla
## futureâ past semi supercharger present
## 0.36 0.36 0.31 0.25 0.25
## believe oct low stock obvious
## 0.25 0.25 0.25 0.25 0.23
## cool want
## 0.23 0.23
The word “tesla” has its highest correlation with “ismailnathij”, with a score of about 0.45.
findAssocs(tdm, "spacex", 0.2)
## $spacex
## spaceship neddesmond design blueorigin jeffbezos
## 0.39 0.39 0.31 0.27 0.27
## cristatolive run attempt firstâ spacecraft
## 0.27 0.27 0.27 0.27 0.27
## beâ blooper epic explosion footage
## 0.27 0.27 0.27 0.27 0.27
## messed together blvd bryanclark jack
## 0.27 0.27 0.27 0.27 0.27
## juâ northrop parallel later commercial
## 0.27 0.27 0.27 0.27 0.27
## ocean program undergoing distant less
## 0.27 0.27 0.27 0.27 0.27
## likâ proximity south targets complacency
## 0.27 0.27 0.27 0.27 0.27
## confident dod suggesting rocket week
## 0.27 0.27 0.27 0.24 0.23
The word “spacex” has its highest correlation with “bebeoutside”. Both cases suggest that Musk talked a lot about Tesla with someone called “ismailnathij” and about SpaceX with “bebeoutside”. An important point to note is that the presence of a term in these lists is not indicative of its frequency; rather, it is a measure of how often the two terms (the search term and the result term) co-occur across documents.
We often want to know the connections between words, just as between people. With network analysis, not only can we determine which terms appear together frequently, we can also visualize how keywords and tweets are connected as a network of terms. This way we can see how many connections keywords have with one another, and how many connections a specific keyword has with other keywords. We have chosen to show the network of the 15 most frequent terms.
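A possible sketch of how such a term network could be built, using the igraph package on a term-term co-occurrence matrix (the original report may have used a different plotting approach):

library(igraph)

# Restrict the term-document matrix to the most frequent terms
idx <- which(dimnames(tdm)$Terms %in% freq.terms)
m <- as.matrix(tdm[idx, ])
m[m > 1] <- 1                     # binary: term present in a document or not
termM <- m %*% t(m)               # term-term co-occurrence counts

# Build and plot an undirected, weighted term network
g <- graph_from_adjacency_matrix(termM, weighted = TRUE, mode = "undirected", diag = FALSE)
plot(g, vertex.label = V(g)$name, layout = layout_with_fr(g))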
We can see, for example, that “hat” is connected only to the word “will”, perhaps because of the limited-edition hats made by The Boring Company.
Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for. Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language. The process is that we choose how many topics we want, and the function then picks the words that best describe each topic.
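A possible sketch of how the topics below might have been fitted (assuming the topicmodels package; the exact call and output formatting in the original analysis may differ):

library(topicmodels)

# LDA works on a document-term matrix and requires non-empty documents
dtm <- DocumentTermMatrix(corp)
dtm <- dtm[which(rowSums(as.matrix(dtm)) > 0), ]

lda <- LDA(dtm, k = 8)     # eight topics, as listed below
terms(lda, 9)              # top nine terms per topic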
## Topic 1
## "sure,one,car,good,just,ai,use,appreciated,computer"
## Topic 2
## "hat,will,now,maralkalajian,yeah,get,right,tesla,blundellapps"
## Topic 3
## "come,land,hyperloop,rocket,like,thatâs,soon,version,first"
## Topic 4
## "companies,tunnel,build,time,will,k,just,amp,cities"
## Topic 5
## "í,businessinsider,tesla,amp,much,us,anything,high,work"
## Topic 6
## "yes,just,rocket,spacex,like,flamethrower,will,fun,real"
## Topic 7
## "yes,even,maybe,part,will,ok,matter,way,just"
## Topic 8
## "falcon,heavier,launch,thrust,can,tesla,cool,model,will"
On the plot we can see how the density of topics changes over time.
When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. Therefore we present a graph showing the sentiment of Elon’s tweets.
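A possible sketch of how the polarity counts below might have been computed (syuzhet is used here as one option; the original report may rely on a different sentiment package):

library(syuzhet)

# Score each tweet and classify it as negative, neutral or positive
scores <- get_sentiment(elon_df$text, method = "afinn")
polarity <- ifelse(scores < 0, "negative", ifelse(scores > 0, "positive", "neutral"))
table(polarity)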
##
## negative neutral positive
## 17 327 93
We add a score column and assign:
As far as we can see, the tweets are positive most of the time. The most negative periods were mostly in July and December.
We used the twitteR package to retrieve user info; the appropriate code is shown below. Sadly, we cannot access all of Elon’s followers, since their number is about 18 million and each hour we can retrieve only about 50k. There is an option to retrieve the followers of those followers, but this is even more prohibitively expensive.
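A minimal sketch of the user-info retrieval (getUser and twListToDF are part of twitteR; this is an assumed reconstruction of the call behind the table below):

# Fetch Elon Musk's user object and turn it into a one-row data frame
user <- getUser("elonmusk")
user_df <- twListToDF(list(user))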
| statusesCount | followersCount | favoritesCount | friendsCount | url | name | created | protected |
|---|---|---|---|---|---|---|---|
| 3839 | 18539046 | 746 | 48 | NA | Elon Musk | 2009-06-02 20:12:29 | FALSE |
# friends <- user$getFriends() # who this user follows
# followers <- user$getFollowers() # this user's followers
Here we are plotting the top retweeted tweets. We set the limit at 50,000, but obviously we could set it to anything else. We can also change colours, label positions, etc.
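A possible sketch of such a plot (assuming elon_df from earlier and ggplot2; the threshold, colours and labels are illustrative, and the original code may differ):

library(ggplot2)

# Keep tweets with more than 50,000 retweets and plot them over time
top <- subset(elon_df, retweetCount > 50000)
ggplot(top, aes(x = created, y = retweetCount)) +
  geom_point(colour = "steelblue") +
  geom_text(aes(label = substr(text, 1, 30)), hjust = 0, vjust = -0.5, size = 3)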
Using the package, we can extract up to 100 retweeters; usually it will return fewer.
Tweets text:
## [1] "Apparently, some customs agencies are saying they won’t allow shipment of anything called a “Flamethrower”. To solv… https://t.co/OCtjvdXo95"
Tweets’ ID:
## [1] "959555569953660928"
Retweeters follower numbers:
## [1] 20684
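A minimal sketch of how the retweeters and their follower counts might be pulled for the most retweeted tweet (retweeters() and lookupUsers() are part of twitteR; the exact code used in the report is not shown):

# Find the most retweeted tweet, fetch up to 100 of its retweeters and sum their followers
top_id <- elon_df$id[which.max(elon_df$retweetCount)]
rt_ids <- retweeters(top_id, n = 100)
rt_users <- lookupUsers(rt_ids)
sum(sapply(rt_users, function(u) u$followersCount))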
Here we make an attempt at estimating message propagation, given that we cannot access the real data because Elon is too popular (how sad). The tweet has 105051 retweets in total, and our sample of retweeters had a total of 216919 followers, so we can guesstimate it reached about 227,875,578 people (roughly the total number of retweets multiplied by the average follower count of our sampled retweeters, 105051 × 216919/100). Of course, there is a very high chance this number is inflated, because some of those followers are counted more than once, and you can see the tweet only once even if you follow several people who retweeted it. We can therefore treat that number as an upper bound.
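A minimal sketch of the guesstimate above (the sample size of 100 retweeters is an assumption consistent with the numbers quoted):

total_retweets    <- 105051
sampled_followers <- 216919
sample_size       <- 100

# Extrapolate: total retweets times the average follower count per sampled retweeter
total_retweets * (sampled_followers / sample_size)   # ~227.9 million, as quoted above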
Most of the time, we used the following packages: