Just finished the Text Mining with R course on DataCamp, really excited to put my new skills into practice! The dataset I will be using can be found here. It contains Elon Musk's tweets from 2010-06-04 to 2017-04-05, including the Tweet ID, the date and time each tweet was created, the tweet text and mentions.
I will be using the following packages:
library(tidyverse)
library(wordcloud)
library(RColorBrewer)
library(qdap)
library(tm)
library(gridExtra)
library(dendextend)
library(ggthemes)
library(RWeka)
library(tidytext)
library(textdata)
library(lubridate)
library(anytime)
theme_func <- function() {
  theme_minimal() +
    theme(
      text = element_text(family = "serif", color = "gray25"),
      plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray45"),
      plot.caption = element_text(color = "gray30"),
      plot.background = element_rect(fill = "gray95"),
      plot.margin = unit(c(5, 10, 5, 10), units = "mm"),
      plot.title = element_text(hjust = 0.5),
      strip.text = element_text(color = "white")
    )
}
data <- read_csv("elonmusk_tweets.csv")
str(data)
## tibble [2,819 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:2819] 8.50e+17 8.49e+17 8.49e+17 8.49e+17 8.48e+17 ...
## $ created_at: POSIXct[1:2819], format: "2017-04-05 14:56:29" "2017-04-03 20:01:01" ...
## $ text : chr [1:2819] "b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'" "b\"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but tha"| __truncated__ "b'@waltmossberg @mims @defcon_5 Et tu, Walt?'" "b'Stormy weather in Shortville ...'" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. created_at = col_datetime(format = ""),
## .. text = col_character()
## .. )
Let's have a look at some of his latest tweets.
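The call that printed the tweets below isn't included in the post; a minimal equivalent, assuming the tweets are stored newest first as in the raw file, would be:
# Print the first five tweets in the file (the most recent ones)
head(data$text, 5)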
## [1] "b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'"
## [2] "b\"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\\xe2\\x80\\xa6 https://t.co/qQcTqkzgMl\""
## [3] "b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
## [4] "b'Stormy weather in Shortville ...'"
## [5] "b\"@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead.\""
Next comes text preprocessing, which includes converting all letters to lower case, removing URLs, removing anything other than English letters and spaces, removing stopwords and extra whitespace, and performing stemming.
clean_corpus <- function(corpus) {
  # Transform to lower case
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove URLs
  removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
  corpus <- tm_map(corpus, content_transformer(removeURL))
  # Remove anything other than English letters or space
  removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
  corpus <- tm_map(corpus, content_transformer(removeNumPunct))
  # Strip whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  # Remove English stopwords plus a few custom ones
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "brt", "amp", "s"))
  # Stemming
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}
We have our text data as a vector, so we convert this vector to a corpus and then call the clean_corpus() function on it, as sketched below, before re-examining the contents of the first document.
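The chunk that actually builds and cleans the corpus isn't shown in the post; a sketch of what it presumably looks like follows. The intermediate names tweets_source and tweets_corpus are my own, while clean_tweets matches the object used in later chunks.
# Turn the text column into a source, then into a volatile corpus
tweets_source <- VectorSource(data$text)
tweets_corpus <- VCorpus(tweets_source)
# Apply the cleaning function defined above
clean_tweets <- clean_corpus(tweets_corpus)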
Have a look at the difference before and after Text Preprocessing.
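The printing calls aren't shown either; something along these lines would display the first tweet before and after cleaning (the exact calls are an assumption):
# First document before and after cleaning
data$text[1]
content(clean_tweets[[1]])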
## [1] "b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'"
## [1] "band robot spare human"
Wow! That's some serious processing done.
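Next comes a term-document matrix. The chunk that builds it isn't shown, but the tweets_tdm and tweets_m objects used below are presumably created along these lines:
# Build a term-document matrix from the cleaned corpus, plus a plain matrix version
tweets_tdm <- TermDocumentMatrix(clean_tweets)
tweets_m <- as.matrix(tweets_tdm)
# Printing the TDM gives the summary below
tweets_tdm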
## <<TermDocumentMatrix (terms: 6290, documents: 2819)>>
## Non-/sparse entries: 25533/17705977
## Sparsity : 100%
## Maximal term length: 26
## Weighting : term frequency (tf)
Let us have a quick look at the top 15 most frequently used words by Elon Musk in his tweets.
term_frequency <- rowSums(tweets_m)
# Sort term_frequency in decreasing order
term_frequency <- sort(term_frequency,decreasing = TRUE)
# Plot a barchart of the 15 most common words
barplot(term_frequency[1:15], col = "tan", las = 2)
Next we perform hierarchical clustering on a distance matrix of the scaled term frequencies, and then visualise the word-frequency distances using a dendrogram.
# remove sparse terms
tdm2 <- removeSparseTerms(tweets_tdm, sparse=0.96)
m2 <- as.matrix(tdm2)
# cluster terms
distMatrix <- dist(scale(m2))
hc <- hclust(distMatrix)
hcd <- as.dendrogram(hc)
# Change the branch colour to red for "teslamotor", "car", "launch" and "rocket"
hcd_colored <- branches_attr_by_labels(hcd, c("teslamotor", "car", "launch", "rocket"), "red")
# Plot hcd_colored
plot(hcd_colored, main = "Cluster Dendrogram")
We can see that car and teslamotor are clustered into one group, whereas launch and rocket are clustered into another group.
Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a TDM.
Let us find the words associated with stock, spacex and falcon.
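The call behind the associations listed below isn't shown; it was presumably something like the following, though the correlation cut-off of 0.2 is my assumption and may have differed per term:
# Words whose frequencies correlate with "stock", "spacex" and "falcon"
findAssocs(tweets_tdm, terms = c("stock", "spacex", "falcon"), corlimit = 0.2)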
## $stock
## scream transact destroy reader tabl
## 0.63 0.63 0.45 0.45 0.45
## $spacex
## launch dragon et photo falcon spacest nasa pad
## 0.32 0.27 0.26 0.25 0.22 0.22 0.21 0.20
## $falcon
## heavi launch spacex
## 0.24 0.23 0.22
Here our focus is on tokens containing two words (bigrams), as they can surface useful phrases that lead to additional insights.
# Make tokenizer function
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
# Create unigram_dtm
unigram_dtm <- DocumentTermMatrix(clean_tweets)
# Create bigram_dtm
bigram_dtm <- DocumentTermMatrix(
clean_tweets,
control = list(tokenize = tokenizer)
)
# Create bigram_dtm_m
bigram_dtm_m <- as.matrix(bigram_dtm)
# Create freq
freq <- colSums(bigram_dtm_m)
# Create bi_words
bi_words <- names(freq)
# Plot a wordcloud
par(bg = "black")
wordcloud(bi_words, freq, max.words = 500, random.order = FALSE,
          scale = c(4, 0.4), colors = terrain.colors(length(bi_words), alpha = 0.9),
          rot.per = 0.3)
Wow!! There are so many interesting bigrams - tesla model, model x, spacex dragon, space station, climate change, falcon rocket, launch pad, solar power and so on.
This sentiment analysis is based on the Bing lexicon, which labels words as positive or negative. The lexicon object is obtained with tidytext's get_sentiments() function. In the steps below we assign each word a polarity (positive count minus negative count) and classify it as positive or negative based on that polarity.
#Transform sentences into words
data_tibble <- data %>%
unnest_tokens(output = "words", input = text, token = "words")
#Remove stop words from tibble
tweets_tibble_clean <- data_tibble %>%
anti_join(stop_words, by=c("words"="word"))
word_freq <- tweets_tibble_clean %>%
# Inner join to bing lexicon by words = word
inner_join(get_sentiments("bing"), by = c("words" = "word")) %>%
# Count by words and sentiment, weighted by count
count(words, sentiment) %>%
# Spread sentiment, using n as values
spread(sentiment, n, fill = 0) %>%
# Mutate to add a polarity column
mutate(polarity = positive - negative)%>%
filter(abs(polarity) >= 9) %>%
mutate(
pos_or_neg = ifelse(polarity > 0, "positive", "negative")
)
# Plot polarity vs. words (reordered by polarity), filled by pos_or_neg
ggplot(word_freq, aes(x = reorder(words, polarity),y = polarity, fill = pos_or_neg)) +
geom_col() +
labs(
x = "words",
title = "Sentiment Word Frequency"
)+
theme_func()+
# Rotate text and vertically justify
theme(axis.text.x = element_text(angle = 55))
The majority of his words are positive, with love, awesome and cool having the highest scores. Similarly, hard, risk and wrong score highest among the negative words.
Now we transition to the AFINN lexicon. Unlike Bing, which only labels words as positive or negative, AFINN assigns numeric values from -5 to 5, and its sentiment score column is called value rather than sentiment. In the steps below I calculate the average sentiment score for each year.
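A quick way to confirm the column name is to print the lexicon itself before building the pipeline (this simply shows the AFINN tibble shipped with the textdata package):
# The AFINN lexicon: a tibble with a word column and a numeric value column
get_sentiments("afinn")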
year_tibble <- data%>%
mutate(date = anytime::anydate(data$created_at))%>%
mutate(year = format(as.Date(date, format="%d/%m/%Y"),"%Y"))%>%
unnest_tokens(output = "words", input = text, token = "words")%>%
anti_join(stop_words, by=c("words"="word"))%>%
group_by(words)%>%
mutate(count=n())
word_sentiments <- year_tibble %>%
inner_join(get_sentiments("afinn"), by = c("words"="word")) %>%
group_by(year) %>%
summarize(score = sum(value *count) / sum(count))
word_sentiments %>%
mutate(year = reorder(year, score)) %>%
ggplot(aes(year, score, fill = score > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ggtitle("Sentiment Score of Elon Musk Tweets") +
labs(subtitle = "2010-2017",
y = "Average sentiment score") +
theme_func()
Initially I was puzzled by this: why does only the year 2010 have a negative average sentiment score, and by such a large margin?
Then I found out that there was only one tweet in the year 2010 and it was….
twenty_ten <- data%>%
mutate(date = anytime::anydate(data$created_at))%>%
mutate(year = format(as.Date(date, format="%d/%m/%Y"),"%Y"))%>%
filter(year==2010)%>%
select(text,year)
twenty_ten
## # A tibble: 1 x 2
## text year
## <chr> <chr>
## 1 b'Please ignore prior tweets, as that was someone pretending to be me :~ 2010
Here again we use the AFINN lexicon, this time to calculate the average sentiment score by hour, and visualise it with a lollipop plot.
hour_tibble <- data%>%
mutate(hour=format(data$created_at, "%H"))%>%
unnest_tokens(output = "words", input = text, token = "words")%>%
anti_join(stop_words, by=c("words"="word"))%>%
group_by(words)%>%
mutate(count=n())
letters_sentiments1 <- hour_tibble %>%
inner_join(get_sentiments("afinn"), by = c("words"="word")) %>%
group_by(hour) %>%
summarize(score = sum(value *count) / sum(count))
ggplot(letters_sentiments1,aes(hour,score))+
geom_segment(aes(x=hour,xend=hour,y=0,yend=score))+
geom_point(size=5,color="red",fill=alpha("orange",0.3),alpha=0.7,shape=21,stroke=2)+
theme_func()+
labs(title = "Sentiment Score in Each Hour")On an average we can say the sentiment ranges from 0.5 to 1.5,with a huge spike in the sentiment score in the 11th hour followed by very low sentiment score in the 12th,13th and 14th hour of the day.
The NRC Word-Emotion Association Lexicon tags words according to Plutchik's eight emotions plus positive and negative. Since we have already analysed positive and negative sentiment, we will concentrate on the remaining eight emotions.
#Transform sentences into words
data_tibble <- data %>%
unnest_tokens(output = "words", input = text, token = "words")
#Remove stop words from tibble
tweets_tibble_clean <- data_tibble %>%
anti_join(stop_words, by=c("words"="word"))%>%
group_by(words)%>%
summarise(count=n())
word_freq_nrc <- tweets_tibble_clean %>%
# Inner join to the NRC lexicon by words = word
inner_join(get_sentiments("nrc"), by = c("words" = "word")) %>%
filter(!grepl("positive|negative", sentiment)) %>%
group_by(sentiment) %>%
summarize(total_count = sum(count))
ggplot(word_freq_nrc, aes(reorder(x = sentiment,total_count), y = total_count,fill = sentiment)) +
# Add a column geom
geom_col()+
theme_func()+
coord_flip() + theme(legend.position="none") +
labs(x = "Sentiments",title = "Top Sentiments using NRC Lexicon") It seems Elon Musk’s tweets have been more of anticipation and trust, and a good amount of fear and joy.
data %>%
unnest_tokens(output = "words", input = text, token = "words")%>%
anti_join(stop_words, by=c("words"="word"))%>%
count(words) %>%
inner_join(get_sentiments("nrc"), by = c("words"="word")) %>%
group_by(sentiment) %>%
filter(!grepl("positive|negative", sentiment)) %>%
top_n(5, n) %>%
ungroup() %>%
mutate(word = reorder(words, n)) %>%
ggplot(aes(word, n,fill=sentiment)) +
geom_col() +
coord_flip() +
facet_wrap(~ sentiment, scales = "free") +
theme_func()+
labs(y = "count",title = "Top Words under Each Sentiment")The above plot clearly tells us what causes the following emotions, for example he might be anticipating launch of a new rocket or production of a tesla car.
And with that, my first text mining project has come to an end. Looking forward to playing with more interesting datasets. The source code for this post can be found here.