Introduction

Just finished the Text Mining with R course on DataCamp, and I'm really excited to put my new skills into practice! The dataset I will be using can be found here. It contains Elon Musk's tweets from 2010-06-04 to 2017-04-05, and includes the Tweet ID, the date and time each tweet was created, and the tweet text including mentions.

I will be using the following packages:
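
A minimal sketch of the setup; the exact package list is inferred from the analysis that follows, and the file name elonmusk_tweets.csv is my assumption:

```r
library(readr)     # read_csv()
library(dplyr)     # data manipulation
library(tm)        # corpus cleaning, TermDocumentMatrix(), findAssocs()
library(tidytext)  # tokenization and sentiment lexicons
library(ggplot2)   # plotting

# read the tweets and inspect the structure
tweets <- read_csv("elonmusk_tweets.csv")
str(tweets)
```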

## tibble [2,819 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id        : num [1:2819] 8.50e+17 8.49e+17 8.49e+17 8.49e+17 8.48e+17 ...
##  $ created_at: POSIXct[1:2819], format: "2017-04-05 14:56:29" "2017-04-03 20:01:01" ...
##  $ text      : chr [1:2819] "b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'" "b\"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but tha"| __truncated__ "b'@waltmossberg @mims @defcon_5 Et tu, Walt?'" "b'Stormy weather in Shortville ...'" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   created_at = col_datetime(format = ""),
##   ..   text = col_character()
##   .. )

Let's have a look at some of his latest tweets.
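
A one-liner, assuming the data frame read above is named tweets:

```r
head(tweets$text, 5)
```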

## [1] "b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'"                                                                                               
## [2] "b\"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\\xe2\\x80\\xa6 https://t.co/qQcTqkzgMl\""
## [3] "b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"                                                                                                                  
## [4] "b'Stormy weather in Shortville ...'"                                                                                                                            
## [5] "b\"@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead.\""

Text Preprocessing

This includes converting all letters to lower case, removing URLs, removing anything other than English letters and spaces, removing stopwords and extra white space, and performing stemming.
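
A sketch of such a cleaning function with tm; the name clean_corpus and the exact order of the steps are my reconstruction:

```r
library(tm)

clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  # drop URLs before stripping punctuation, so "https://..." leaves no residue
  corpus <- tm_map(corpus, content_transformer(function(x) gsub("http\\S+", "", x)))
  # keep only English letters and spaces
  corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^a-z ]", "", x)))
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  corpus
}
```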

We have our text data as a vector, so we first convert it to a corpus and then apply the cleaning function to it. Now let's re-examine the contents of the first document and compare it before and after text preprocessing.
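
A sketch, assuming the data frame and cleaning function defined above:

```r
# build a corpus from the text column and clean it
corpus <- VCorpus(VectorSource(tweets$text))
clean_corp <- clean_corpus(corpus)

# first document, before and after
tweets$text[1]
content(clean_corp[[1]])
```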

## [1] "b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'"
## [1] "band robot spare human"

Wow! That's some serious processing.
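
The matrix summarized below is a term-document matrix; a one-line sketch with tm, assuming the cleaned corpus from above:

```r
tdm <- TermDocumentMatrix(clean_corp)
tdm
```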

## <<TermDocumentMatrix (terms: 6290, documents: 2819)>>
## Non-/sparse entries: 25533/17705977
## Sparsity           : 100%
## Maximal term length: 26
## Weighting          : term frequency (tf)

Most Frequent Words in His Tweets

Let us have a quick look at the top 15 most frequently used words by Elon Musk in his tweets.
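
A sketch of how the counts can be pulled from the TDM (the plot styling is my own):

```r
# per-term counts across all tweets
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
barplot(term_freq[1:15], las = 2, col = "steelblue",
        main = "Top 15 Most Frequent Words")
```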

Clustering Words

Here we first perform hierarchical cluster analysis on the dissimilarities of the word-frequency distance matrix, and then visualize the result as a dendrogram.
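
A sketch of the clustering step; trimming the TDM first with removeSparseTerms() keeps the dendrogram readable, and the 0.975 threshold is my choice:

```r
tdm_small <- removeSparseTerms(tdm, sparse = 0.975)
# distances between rows (terms), clustered hierarchically
hc <- hclust(dist(as.matrix(tdm_small)))
plot(hc, main = "Word Frequency Dendrogram")
```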

We can see that car and teslamotor are clustered into one group, whereas launch and rocket are clustered into another group.

Word Association

Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a TDM.

Let us find the words associated with stock, spacex, and falcon.
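
A sketch of the call; the per-term correlation floors are my guesses, chosen to roughly match the output below:

```r
findAssocs(tdm, terms = c("stock", "spacex", "falcon"),
           corlimit = c(0.45, 0.20, 0.20))
```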

## $stock
##   scream transact  destroy   reader     tabl 
##     0.63     0.63     0.45     0.45     0.45
## $spacex
##  launch  dragon      et   photo  falcon spacest    nasa     pad 
##    0.32    0.27    0.26    0.25    0.22    0.22    0.21    0.20
## $falcon
##  heavi launch spacex 
##   0.24   0.23   0.22

Working with Bigrams

Here our focus is on tokens containing two words, since these can help extract useful phrases and lead to additional insights.
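
A sketch with tidytext's unnest_tokens(); the column names and the stopword filtering are my own choices:

```r
library(dplyr)
library(tidyr)
library(tidytext)

bigram_counts <- tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
```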

Wow! There are so many interesting bigrams: tesla model, model x, spacex dragon, space station, climate change, falcon rocket, launch pad, solar power, and so on.

Sentiment Analysis

This sentiment analysis is based on the Bing lexicon, which labels words as positive or negative. The lexicon object is obtained using tidytext's get_sentiments() function. In the steps below we match each word against the lexicon and classify it as positive or negative.
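
A minimal sketch of that join, assuming a tidy one-word-per-row table:

```r
library(dplyr)
library(tidytext)

bing <- get_sentiments("bing")

word_counts <- tweets %>%
  unnest_tokens(word, text) %>%
  inner_join(bing, by = "word") %>%
  count(word, sentiment, sort = TRUE)
```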

Words that Contribute the most to Positive/Negative Sentiment Scores

The majority of his words are positive, with love, awesome, and cool contributing the most. Similarly, hard, risk, and wrong contribute the most among the negative words.

Sentiment by Year

Now we transition to the AFINN lexicon. The AFINN lexicon assigns numeric values from -5 to 5, not just positive or negative labels. Unlike the Bing lexicon, the AFINN lexicon's sentiment score column is called value. In the steps below I calculate the average sentiment score for each year.
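
A sketch of the per-year average; year() comes from lubridate, which is an extra assumption:

```r
library(dplyr)
library(tidytext)
library(lubridate)

tweets %>%
  mutate(year = year(created_at)) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(year) %>%
  summarise(avg_sentiment = mean(value))
```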

Initially I was puzzled looking at this: why does only the year 2010 have a negative average sentiment score, and by such a large margin?

Then I found out that there was only one tweet in 2010, and it was…

## # A tibble: 1 x 2
##   text                                                                     year 
##   <chr>                                                                    <chr>
## 1 b'Please ignore prior tweets, as that was someone pretending to be me :~ 2010

NRC Lexicon

The NRC Word-Emotion Association Lexicon tags words according to Plutchik's eight emotions plus positive and negative. But since we have already analysed the positive and negative sentiments, we will concentrate on the remaining eight emotions.
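
A sketch of the emotion counts, reusing the same tidy pipeline as above:

```r
library(dplyr)
library(tidytext)

tweets %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  count(sentiment, sort = TRUE)
```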

It seems Elon Musk's tweets lean mostly towards anticipation and trust, with a good amount of fear and joy as well.

The above plot gives a clear sense of what drives these emotions; for example, he might be anticipating the launch of a new rocket or the production of a Tesla car.

And with that, I think my first text mining project has come to an end. I'm looking forward to playing with more interesting datasets. The source code for this post can be found here.