Just finished the Text Mining with R course on DataCamp, really excited to put my new skills into practice! The dataset I will be using can be found here. It contains Elon Musk's tweets from 2010-06-04 to 2017-04-05, including the Tweet ID, the date and time each tweet was created, the tweet text and mentions.
I will be using the following packages:
library(tidyverse)
library(wordcloud)
library(RColorBrewer)
library(qdap)
library(tm)
library(gridExtra)
library(dendextend)
library(ggthemes)
library(RWeka)
library(tidytext)
library(textdata)
library(lubridate)
library(anytime)
theme_func <- function() {
  theme_minimal() +
    theme(
      text = element_text(family = "serif", color = "gray25"),
      plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray45"),
      plot.caption = element_text(color = "gray30"),
      plot.background = element_rect(fill = "gray95"),
      plot.margin = unit(c(5, 10, 5, 10), units = "mm"),
      plot.title = element_text(hjust = 0.5),
      strip.text = element_text(color = "white")
    )
}
data <- read_csv("elonmusk_tweets.csv")
str(data)
## tibble [2,819 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:2819] 8.50e+17 8.49e+17 8.49e+17 8.49e+17 8.48e+17 ...
## $ created_at: POSIXct[1:2819], format: "2017-04-05 14:56:29" "2017-04-03 20:01:01" ...
## $ text : chr [1:2819] "b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'" "b\"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but tha"| __truncated__ "b'@waltmossberg @mims @defcon_5 Et tu, Walt?'" "b'Stormy weather in Shortville ...'" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. created_at = col_datetime(format = ""),
## .. text = col_character()
## .. )
Let's have a look at some of his latest tweets.
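The call that printed the tweets below isn't included in the post; a minimal equivalent, assuming the tweets are stored newest first as in the raw file, would be:
# Print the first five tweets in the file (the most recent ones)
head(data$text, 5)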
## [1] "b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'"
## [2] "b\"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\\xe2\\x80\\xa6 https://t.co/qQcTqkzgMl\""
## [3] "b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
## [4] "b'Stormy weather in Shortville ...'"
## [5] "b\"@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead.\""
Next comes text preprocessing, which includes converting all letters to lower case, removing URLs, removing anything other than English letters and spaces, removing stopwords and extra whitespace, and performing stemming.
clean_corpus <- function(corpus) {
  # Transform to lower case
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove URLs
  removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
  corpus <- tm_map(corpus, content_transformer(removeURL))
  # Remove anything other than English letters or space
  removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
  corpus <- tm_map(corpus, content_transformer(removeNumPunct))
  # Strip whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  # Remove English stopwords plus a few custom ones
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "brt", "amp", "s"))
  # Stemming
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}
We have our text data as a vector, so we convert this vector to a corpus and then call the clean_corpus() function on it, as sketched below, before re-examining the contents of the first document.
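The chunk that actually builds and cleans the corpus isn't shown in the post; a sketch of what it presumably looks like follows. The intermediate names tweets_source and tweets_corpus are my own, while clean_tweets matches the object used in later chunks.
# Turn the text column into a source, then into a volatile corpus
tweets_source <- VectorSource(data$text)
tweets_corpus <- VCorpus(tweets_source)
# Apply the cleaning function defined above
clean_tweets <- clean_corpus(tweets_corpus)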
Have a look at the difference before and after Text Preprocessing.
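The printing calls aren't shown either; something along these lines would display the first tweet before and after cleaning (the exact calls are an assumption):
# First document before and after cleaning
data$text[1]
content(clean_tweets[[1]])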
## [1] "b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'"
## [1] "band robot spare human"
Wow! That's some serious processing done.
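Next comes a term-document matrix. The chunk that builds it isn't shown, but the tweets_tdm and tweets_m objects used below are presumably created along these lines:
# Build a term-document matrix from the cleaned corpus, plus a plain matrix version
tweets_tdm <- TermDocumentMatrix(clean_tweets)
tweets_m <- as.matrix(tweets_tdm)
# Printing the TDM gives the summary below
tweets_tdm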
## <<TermDocumentMatrix (terms: 6290, documents: 2819)>>
## Non-/sparse entries: 25533/17705977
## Sparsity : 100%
## Maximal term length: 26
## Weighting : term frequency (tf)
Let us have a quick look at the top 15 most frequently used words by Elon Musk in his tweets.
term_frequency <- rowSums(tweets_m)
# Sort term_frequency in decreasing order
term_frequency <- sort(term_frequency,decreasing = TRUE)
# Plot a barchart of the 15 most common words
barplot(term_frequency[1:15], col = "tan", las = 2)
Next we perform hierarchical clustering on a distance matrix of the scaled term frequencies, and then visualise the word-frequency distances using a dendrogram.
# remove sparse terms
tdm2 <- removeSparseTerms(tweets_tdm, sparse=0.96)
m2 <- as.matrix(tdm2)
# cluster terms
distMatrix <- dist(scale(m2))
hc <- hclust(distMatrix)
hcd <- as.dendrogram(hc)
# Change the branch colour to red for "teslamotor", "car", "launch" and "rocket"
hcd_colored <- branches_attr_by_labels(hcd, c("teslamotor", "car", "launch", "rocket"), "red")
# Plot hcd_colored
plot(hcd_colored, main = "Cluster Dendrogram")
We can see that car and teslamotor are clustered into one group, whereas launch and rocket are clustered into another group.
Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a TDM.
Let us find the words associated with stock, spacex and falcon.
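The call behind the associations listed below isn't shown; it was presumably something like the following, though the correlation cut-off of 0.2 is my assumption and may have differed per term:
# Words whose frequencies correlate with "stock", "spacex" and "falcon"
findAssocs(tweets_tdm, terms = c("stock", "spacex", "falcon"), corlimit = 0.2)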
## $stock
## scream transact destroy reader tabl
## 0.63 0.63 0.45 0.45 0.45
## $spacex
## launch dragon et photo falcon spacest nasa pad
## 0.32 0.27 0.26 0.25 0.22 0.22 0.21 0.20
## $falcon
## heavi launch spacex
## 0.24 0.23 0.22
Here our focus is on tokens containing two words (bigrams), as they can surface useful phrases that lead to additional insights.
# Make tokenizer function
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
# Create unigram_dtm
unigram_dtm <- DocumentTermMatrix(clean_tweets)
# Create bigram_dtm
bigram_dtm <- DocumentTermMatrix(
clean_tweets,
control = list(tokenize = tokenizer)
)
# Create bigram_dtm_m
bigram_dtm_m <- as.matrix(bigram_dtm)
# Create freq
freq <- colSums(bigram_dtm_m)
# Create bi_words
bi_words <- names(freq)
# Plot a wordcloud
par(bg = "black")
wordcloud(bi_words, freq, max.words = 500, random.order = FALSE,
          scale = c(4, 0.4), colors = terrain.colors(length(bi_words), alpha = 0.9),
          rot.per = 0.3)
Wow!! There are so many interesting bigrams - tesla model, model x, spacex dragon, space station, climate change, falcon rocket, launch pad, solar power and so on.
This sentiment analysis is based on the Bing lexicon, which labels words as positive or negative. The lexicon object is obtained with tidytext's get_sentiments() function. In the steps below we assign each word a polarity (positive count minus negative count) and classify it as positive or negative based on that polarity.
#Transform sentences into words
data_tibble <- data %>%
unnest_tokens(output = "words", input = text, token = "words")
#Remove stop words from tibble
tweets_tibble_clean <- data_tibble %>%
anti_join(stop_words, by=c("words"="word"))
word_freq <- tweets_tibble_clean %>%
# Inner join to bing lexicon by words = word
inner_join(get_sentiments("bing"), by = c("words" = "word")) %>%
# Count by words and sentiment, weighted by count
count(words, sentiment) %>%
# Spread sentiment, using n as values
spread(sentiment, n, fill = 0) %>%
# Mutate to add a polarity column
mutate(polarity = positive - negative)%>%
filter(abs(polarity) >= 9) %>%
mutate(
pos_or_neg = ifelse(polarity > 0, "positive", "negative")
)
# Plot polarity vs. words (reordered by polarity), filled by pos_or_neg
ggplot(word_freq, aes(x = reorder(words, polarity),y = polarity, fill = pos_or_neg)) +
geom_col() +
labs(
x = "words",
title = "Sentiment Word Frequency"
)+
theme_func()+
# Rotate text and vertically justify
theme(axis.text.x = element_text(angle = 55))
The majority of his words are positive, with love, awesome and cool having the highest scores. Similarly, hard, risk and wrong score highest among the negative words.
Now we transition to the AFINN lexicon. Unlike Bing, which only labels words as positive or negative, AFINN assigns numeric values from -5 to 5, and its sentiment score column is called value rather than sentiment. In the steps below I calculate the average sentiment score for each year.
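A quick way to confirm the column name is to print the lexicon itself before building the pipeline (this simply shows the AFINN tibble shipped with the textdata package):
# The AFINN lexicon: a tibble with a word column and a numeric value column
get_sentiments("afinn")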
year_tibble <- data%>%
mutate(date = anytime::anydate(data$created_at))%>%
mutate(year = format(as.Date(date, format="%d/%m/%Y"),"%Y"))%>%
unnest_tokens(output = "words", input = text, token = "words")%>%
anti_join(stop_words, by=c("words"="word"))%>%
group_by(words)%>%
mutate(count=n())
word_sentiments <- year_tibble %>%
inner_join(get_sentiments("afinn"), by = c("words"="word")) %>%
group_by(year) %>%
summarize(score = sum(value *count) / sum(count))
word_sentiments %>%
mutate(year = reorder(year, score)) %>%
ggplot(aes(year, score, fill = score > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ggtitle("Sentiment Score of Elon Musk Tweets") +
labs(subtitle = "2010-2017",
y = "Average sentiment score") +
theme_func()
Initially I was puzzled by this: why does only the year 2010 have a negative average sentiment score, and by such a large margin?
Then I found out that there was only one tweet in the year 2010 and it was….
twenty_ten <- data%>%
mutate(date = anytime::anydate(data$created_at))%>%
mutate(year = format(as.Date(date, format="%d/%m/%Y"),"%Y"))%>%
filter(year==2010)%>%
select(text,year)
twenty_ten
## # A tibble: 1 x 2
## text year
## <chr> <chr>
## 1 b'Please ignore prior tweets, as that was someone pretending to be me :~ 2010
Here again we use the AFINN lexicon, this time to calculate the average sentiment score by hour, and visualise it with a lollipop plot.
hour_tibble <- data%>%
mutate(hour=format(data$created_at, "%H"))%>%
unnest_tokens(output = "words", input = text, token = "words")%>%
anti_join(stop_words, by=c("words"="word"))%>%
group_by(words)%>%
mutate(count=n())
letters_sentiments1 <- hour_tibble %>%
inner_join(get_sentiments("afinn"), by = c("words"="word")) %>%
group_by(hour) %>%
summarize(score = sum(value *count) / sum(count))
ggplot(letters_sentiments1,aes(hour,score))+
geom_segment(aes(x=hour,xend=hour,y=0,yend=score))+
geom_point(size=5,color="red",fill=alpha("orange",0.3),alpha=0.7,shape=21,stroke=2)+
theme_func()+
labs(title = "Sentiment Score in Each Hour")On an average we can say the sentiment ranges from 0.5 to 1.5,with a huge spike in the sentiment score in the 11th hour followed by very low sentiment score in the 12th,13th and 14th hour of the day.
The NRC Word-Emotion Association Lexicon tags words according to Plutchik's eight emotions plus positive and negative. Since we have already analysed positive and negative sentiment, we will concentrate on the remaining eight emotions.
#Transform sentences into words
data_tibble <- data %>%
unnest_tokens(output = "words", input = text, token = "words")
#Remove stop words from tibble
tweets_tibble_clean <- data_tibble %>%
anti_join(stop_words, by=c("words"="word"))%>%
group_by(words)%>%
summarise(count=n())
word_freq_nrc <- tweets_tibble_clean %>%
# Inner join to the NRC lexicon by words = word
inner_join(get_sentiments("nrc"), by = c("words" = "word")) %>%
filter(!grepl("positive|negative", sentiment)) %>%
group_by(sentiment) %>%
summarize(total_count = sum(count))
ggplot(word_freq_nrc, aes(reorder(x = sentiment,total_count), y = total_count,fill = sentiment)) +
# Add a column geom
geom_col()+
theme_func()+
coord_flip() + theme(legend.position="none") +
labs(x = "Sentiments",title = "Top Sentiments using NRC Lexicon") It seems Elon Musk’s tweets have been more of anticipation and trust, and a good amount of fear and joy.
data %>%
unnest_tokens(output = "words", input = text, token = "words")%>%
anti_join(stop_words, by=c("words"="word"))%>%
count(words) %>%
inner_join(get_sentiments("nrc"), by = c("words"="word")) %>%
group_by(sentiment) %>%
filter(!grepl("positive|negative", sentiment)) %>%
top_n(5, n) %>%
ungroup() %>%
mutate(word = reorder(words, n)) %>%
ggplot(aes(word, n,fill=sentiment)) +
geom_col() +
coord_flip() +
facet_wrap(~ sentiment, scales = "free") +
theme_func()+
labs(y = "count",title = "Top Words under Each Sentiment")The above plot clearly tells us what causes the following emotions, for example he might be anticipating launch of a new rocket or production of a tesla car.
And with that, my first text mining project has come to an end. Looking forward to playing with more interesting datasets. The source code for this post can be found here.