This is a simple application of R to text mining. You can follow the same steps with your own data.

Access the data

First, apply for API keys from Twitter.

## [1] "Using direct authentication"

The maximum request is 3,200 tweets. I got 192, which is not bad.

## [1] 192

Have a look at the first three, then convert all 192 tweets to a data frame.

## [[1]]
## [1] "realDonaldTrump: “Our economy, right now, is the Gold Standard throughout the World.” @IngrahamAngle  So true, and not even close!"
## 
## [[2]]
## [1] "realDonaldTrump: A low level staffer that I hardly knew named Cliff Sims wrote yet another boring book based on made up stories and… https://t.co/SGN2UN9Ohw"
## 
## [[3]]
## [1] "realDonaldTrump: How does Da Nang Dick (Blumenthal) serve on the Senate Judiciary Committee when he defrauded the American people ab… https://t.co/JBKLZ66qx6"
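
The code behind this step isn't shown. Here is a minimal sketch of what it might look like, assuming the twitteR package (its `setup_twitter_oauth()` prints the "Using direct authentication" message seen above). The helper name `fetch_timeline` and the environment variable names are made up for illustration.

```r
# Sketch only: assumes the twitteR package and valid Twitter API credentials.
# fetch_timeline() and the environment variable names are hypothetical.
fetch_timeline <- function(user, n = 3200) {
  twitteR::setup_twitter_oauth(
    consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
    consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET")
  )
  # Request up to n of the user's most recent tweets (a list of status objects)
  tweets <- twitteR::userTimeline(user, n = n)
  # Flatten the list into a data frame, one row per tweet
  twitteR::twListToDF(tweets)
}

# Usage (needs network access and working keys):
# tweets_df <- fetch_timeline("realDonaldTrump")
# nrow(tweets_df)
```

The API usually returns fewer tweets than requested, which is why the count above is far below 3,200.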

The text cleaning process converts all letters to lower case, removes URLs, removes anything other than English letters and spaces, removes stopwords, and strips extra white space. Judging by the output below, the words are also stemmed.

Look at these three tweets again.

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] economi right now gold standard throughout world ingrahamangl true even close       
## [2] low level staffer hard knew name cliff sim wrote yet anoth bore book base made stori
## [3] da nang dick blumenth serv senat judiciari committe defraud american peopl ab
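
The cleaning steps can be sketched with the tm package, which is an assumption based on the `SimpleCorpus` output above. The two input strings are toy stand-ins for the tweet text, and the helper names `remove_url` and `keep_letters` are mine.

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

# Toy input standing in for the tweet text
tweets <- c("Our economy, right now, is the Gold Standard! https://t.co/abc123",
            "Border security is   VERY important")

corpus <- Corpus(VectorSource(tweets))

# Hypothetical helpers for the two custom steps
remove_url   <- function(x) gsub("http[^[:space:]]*", "", x)
keep_letters <- function(x) gsub("[^a-z[:space:]]", "", x)  # run after lower-casing

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(remove_url))
corpus <- tm_map(corpus, content_transformer(keep_letters))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)

cleaned <- content(corpus)  # plain character vector, one element per tweet
cleaned
```

The order matters: URLs have to go before the letters-only filter, otherwise the leftover fragments of each link stay in the text.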

A few stemmed words need to be replaced: ‘peopl’ with ‘people’, ‘whitehous’ with ‘whitehouse’, ‘countri’ with ‘country’, and ‘secur’ with ‘secure’.
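
One way to do this replacement is a whole-word `gsub()` over the text. This is a sketch in base R, and `fix_stems` is a hypothetical helper, not the code used for the output in this post.

```r
# Hypothetical helper: map stemmed forms back to readable words.
fix_stems <- function(text, fixes) {
  for (stem in names(fixes)) {
    # \\b limits the match to whole words, so "peopl" is replaced
    # while an existing "people" is left alone
    text <- gsub(paste0("\\b", stem, "\\b"), fixes[[stem]], text, perl = TRUE)
  }
  text
}

fixes <- c(peopl = "people", whitehous = "whitehouse",
           countri = "country", secur = "secure")

fix_stems("defraud american peopl ab", fixes)
# [1] "defraud american people ab"
```

With a tm corpus, the same function could be applied through `tm_map(corpus, content_transformer(fix_stems), fixes = fixes)`.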

Building a term-document matrix

This is a matrix of counts that keeps track of how often each term is used in each document of a corpus. Most entries are 0 or 1, but a term that appears twice in one tweet gets a 2.

## <<TermDocumentMatrix (terms: 962, documents: 192)>>
## Non-/sparse entries: 1939/182765
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)

As you can see, the term-document matrix is composed of 962 terms from 192 documents (tweets). It is very sparse, with 99% of the entries being zero. Let’s have a look at several interesting terms like ‘border’, ‘wall’, ‘great’, ‘secure’ and ‘shutdown’ in tweets numbered 21 to 40.

##           Docs
## Terms      21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##   great     0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0  1  0
##   wall      1  1  1  0  0  0  0  0  1  0  0  0  0  1  2  0  0  0  0  0
##   border    0  0  0  0  0  0  1  0  1  0  1  0  0  0  1  0  0  0  0  0
##   secure    0  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
##   shutdown  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
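
Building and slicing such a matrix can be sketched with tm's `TermDocumentMatrix()`. The three documents below are toy data, and the `wordLengths` setting is my assumption rather than something taken from the original code.

```r
library(tm)

# Toy documents standing in for the cleaned tweets
docs <- c("build wall border", "great wall wall", "border secure")
corpus <- Corpus(VectorSource(docs))

# wordLengths = c(1, Inf) keeps short words that the default (minimum 3 letters) would drop
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
tdm

# Pick a few terms and documents, like the 21-to-40 slice above
m <- as.matrix(tdm[c("wall", "border"), 1:3])
m
```

In this toy example, ‘wall’ gets a 2 for the second document because it occurs twice there.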

Word Cloud

Which words are associated with ‘will’?

## $will
##       build        fall      negoti         bot        easi         goe 
##        0.37        0.34        0.34        0.33        0.33        0.33 
##      immedi         car chattanooga       congr      electr    tennesse 
##        0.33        0.33        0.33        0.33        0.33        0.33 
##  volkswagen          co     environ      remark      thrive        make 
##        0.33        0.33        0.33        0.33        0.33        0.26 
##       start        wall       quick     testifi       doubt      dollar 
##        0.26        0.24        0.21        0.21        0.21        0.21 
##     million       staff    interior        mick 
##        0.21        0.21        0.21        0.21

Which words are associated with ‘great’? This is obvious.

## $great
##   game   turn deserv   come 
##   0.29   0.23   0.23   0.22

Which words are associated with ‘border’?

## $border
##   secure southern   danger      tea     wall  support    parti    crime 
##     0.61     0.45     0.37     0.37     0.35     0.30     0.30     0.28 
##     drug    crisi traffick  concern    cross 
##     0.26     0.25     0.24     0.24     0.24
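
Association lists like these are what tm's `findAssocs()` returns: for a given term, the other terms whose counts across documents correlate with it above a chosen limit. Here is a small sketch on toy documents; the 0.5 limit is picked for the example only.

```r
library(tm)

# Toy documents standing in for the cleaned tweets
docs <- c("border wall secure", "border secure crime", "great game", "great wall")
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))

# Terms whose correlation with "border" across documents is at least 0.5
assoc <- findAssocs(tdm, "border", 0.5)
assoc
```

Here ‘secure’ appears in exactly the same documents as ‘border’, so it tops the list, while ‘great’ never co-occurs with it and is left out.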

Clustering Words

##      now   people   presid    great     make    state      new    crime 
##        1        2        3        4        1        1        1        1 
##     wall     will     fake     just     year  country     want democrat 
##        5        6        1        7        7        8        1        9 
##   border   secure     mani     time     work      get 
##       10        1        1        1        1        1
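
The numbers under each word are group labels. A table like this can come from hierarchical clustering followed by a cut into a fixed number of groups (ten above). This is a base R sketch on a toy matrix; the choice of Ward linkage and the scaling step are my assumptions.

```r
# Toy term-by-document count matrix (rows are terms, columns are tweets)
m <- matrix(c(2, 1, 0, 0, 1,
              1, 2, 0, 0, 1,
              0, 0, 1, 2, 0,
              0, 0, 2, 1, 0,
              1, 0, 0, 0, 2),
            nrow = 5, byrow = TRUE,
            dimnames = list(c("wall", "border", "great", "make", "will"),
                            paste0("tweet", 1:5)))

# Euclidean distance between terms after standardising each column
d <- dist(scale(m))

# Agglomerative clustering, then cut the tree into k groups
fit <- hclust(d, method = "ward.D2")
groups <- cutree(fit, k = 2)
groups
```

Terms that tend to appear in the same tweets end up with the same label.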

Clustering Tweets with the k-means Algorithm

##     now people presid great  make state   new crime  wall  will  fake
## 1 0.167  0.000  0.000 0.000 0.000 0.000 0.167 0.500 2.000 0.667 0.000
## 2 0.071  0.071  0.214 0.000 0.000 0.143 0.000 0.000 0.000 0.000 0.143
## 3 0.176  0.235  0.059 0.059 0.059 0.059 0.059 0.176 0.235 0.059 0.000
## 4 0.000  0.000  0.105 0.000 0.158 0.053 0.105 0.211 0.211 1.158 0.000
## 5 0.063  0.084  0.137 0.000 0.021 0.063 0.042 0.011 0.021 0.000 0.084
## 6 0.040  0.080  0.000 1.000 0.160 0.080 0.200 0.000 0.000 0.080 0.000
## 7 0.000  0.444  0.111 0.333 0.000 0.000 0.000 0.333 0.333 0.111 0.000
## 8 0.000  0.571  0.000 0.286 0.000 0.000 0.143 0.000 0.143 0.000 0.000
##    just  year country  want democrat border secure  mani  time  work   get
## 1 0.000 0.167   0.333 0.167    0.000  0.667  0.500 0.000 0.000 0.167 0.000
## 2 0.357 1.143   0.000 0.000    0.000  0.071  0.000 0.071 0.143 0.071 0.000
## 3 0.000 0.000   0.059 0.000    0.176  1.059  0.412 0.059 0.000 0.118 0.118
## 4 0.000 0.053   0.000 0.000    0.053  0.053  0.000 0.053 0.000 0.000 0.053
## 5 0.074 0.000   0.074 0.000    0.084  0.000  0.011 0.063 0.063 0.000 0.011
## 6 0.120 0.000   0.080 0.040    0.040  0.000  0.000 0.000 0.040 0.040 0.120
## 7 0.000 0.000   0.222 1.111    0.333  0.333  0.111 0.111 0.000 0.000 0.111
## 8 0.286 0.000   0.286 0.000    0.143  0.000  0.000 0.000 0.143 1.143 0.429

Check the top three words in every cluster

## cluster 1: wall will border 
## cluster 2: year just presid 
## cluster 3: border secure people 
## cluster 4: will crime wall 
## cluster 5: presid people fake 
## cluster 6: great new make 
## cluster 7: want people great 
## cluster 8: work people get
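
The two steps above (cluster centres, then top words per cluster) can be sketched with base R's `kmeans()`. The matrix is toy data with tweets as rows, which is the transpose of the term-document matrix, and the seed and `nstart` values are arbitrary.

```r
set.seed(122)  # k-means starts from random centres, so fix the seed for repeatable output

# Toy document-by-term matrix (rows are tweets, columns are terms)
m <- matrix(c(2, 1, 0, 0,
              1, 2, 0, 0,
              2, 2, 0, 1,
              0, 0, 2, 1,
              0, 1, 1, 2,
              0, 0, 2, 2),
            nrow = 6, byrow = TRUE,
            dimnames = list(NULL, c("wall", "border", "great", "make")))

k <- 2
fit <- kmeans(m, centers = k, nstart = 10)

# Each row of fit$centers holds the mean term counts of one cluster
round(fit$centers, 3)

# Top terms per cluster: sort each centre in decreasing order
top_terms <- lapply(seq_len(k), function(i) {
  names(sort(fit$centers[i, ], decreasing = TRUE))[1:2]
})
for (i in seq_len(k)) {
  cat("cluster ", i, ": ", paste(top_terms[[i]], collapse = " "), "\n", sep = "")
}
```

Cluster numbers are arbitrary and can change between runs, so only the word groupings are meaningful.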

It’s obvious that Trump’s recent tweets focus heavily on the wall and the border. He paid much less attention to the government shutdown.

Sentiment Analysis

The sentiment analysis used here is based on the NRC Word-Emotion Association Lexicon, available from the tidytext package, which was developed by Julia Silge and David Robinson. The lexicon associates words with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).

Have a look at the head of the sentiment scores for Trump’s tweets:

##   anger anticipation disgust fear joy sadness surprise trust negative
## 1     0            0       0    0   0       0        0     1        0
## 2     0            0       0    1   0       0        0     1        1
## 3     0            1       0    0   0       0        0     4        1
## 4     0            0       0    0   1       0        0     0        0
## 5     1            0       0    1   0       1        0     1        1
## 6     0            0       0    0   0       0        0     1        0
##   positive
## 1        1
## 2        1
## 3        0
## 4        1
## 5        2
## 6        3
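
The scoring code isn't shown. The column layout printed above matches what `get_nrc_sentiment()` from the syuzhet package returns, so this sketch assumes that function; the original may well have used something else. The two sentences are toy input.

```r
library(syuzhet)  # assumption: the package actually used is not shown in this post

# Toy input standing in for the tweet text
texts <- c("We love our great country",
           "Crime and drugs are a terrible danger")

# One row per text, one column per NRC emotion or sentiment
scores <- get_nrc_sentiment(texts)
head(scores)
```

Each cell counts how many words in that text the lexicon links to the given emotion or sentiment.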

Then combine the tweets data frame and the sentiment data frame.

Let’s visualize it!
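
A minimal base R sketch of the combine-and-plot step, using toy stand-ins for the two data frames (the names `tweets_df` and `sentiment_df` and the numbers in them are made up for illustration):

```r
# Toy stand-ins for the tweets and sentiment data frames
tweets_df <- data.frame(text = c("tweet one", "tweet two", "tweet three"),
                        stringsAsFactors = FALSE)
sentiment_df <- data.frame(anger = c(0, 1, 0), trust = c(1, 1, 4),
                           negative = c(0, 1, 1), positive = c(1, 2, 0))

# Column-bind: row i of both frames must describe the same tweet
combined <- cbind(tweets_df, sentiment_df)

# Total score per emotion or sentiment across all tweets
totals <- sort(colSums(combined[, names(sentiment_df)]), decreasing = TRUE)

barplot(totals, las = 2, ylab = "Total score",
        main = "Sentiment scores (toy data)")
```

Keeping the tweet text next to its scores makes it easy to check which tweets drive a given bar.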

Trump’s tweets appear more positive than negative, and show more trust than anger. Have Trump’s tweets always been positive, or only since he won the election?

What we can see is that the positive sentiment scores are always higher than the negative sentiment scores. Positive sentiment dropped significantly in the recent period and then rose to its highest point. Given the current situation between Democrats and Republicans, this change is reasonable.

The end