Analysing Donald Trump’s tweets

In this tutorial we will examine the Twitter account of Donald Trump. Much of the content and code has been borrowed from this blog post authored by David Robinson.

As with most scientific ideas, we start with a hypothesis, and as with most scientific ideas, we didn’t come up with this hypothesis ourselves. When you post on Twitter, it records what type of device you are posting from. Various Twitter commentators noted that posts on Donald Trump’s account, realDonaldTrump, were being made from both Android and iPhone devices, and that there was a noticeable difference in the type and tone of the posts from these two devices. This led to the hypothesis that Trump himself was posting using the Android device, and that his staff were using the iPhone.

We will examine this idea using some of the Natural Language Processing tools we have told you about. We will use the data science software R with the tidytext package to do text analytics and the twitteR package to connect to Twitter and download data.

Before we get started we need to create what is called a Twitter application. After doing this we will be able to connect to the app and download data from Twitter. Creating the application is a simple process (but you need a Twitter account before you start):

  1. go to apps.twitter.com
  2. click create a new application, giving it a name and description
  3. once you have created your app, copy the consumer key and consumer secret from the app homepage
  4. click the create my access token button, and copy the access token and access token secret

Now we can get started.

The first step in any R analysis is to load the required packages:

library(twitteR) #connect to twitter
library(dplyr) #manipulate table data
library(purrr) #writing neat functions
library(tidyr) #manipulate table data
library(lubridate) #manipulate dates
library(scales) #for plotting
library(ggplot2) #for plotting
library(tidytext) #manipulating text data and natural language processing
library(sentimentr) #natural language processing
library(lexicon) #sentiment dictionaries
library(stringr) #manipulating text data
library(wordcloud2) #create wordclouds

Let’s connect to our Twitter app using the keys we copied earlier, and download Donald Trump’s tweets:

consumer_key <- "paste key"
consumer_secret <- "paste key"
access_token <- "paste key"
access_secret <- "paste key"

#now let's connect
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

# We can request at most 3200 tweets at a time; the API may
# return fewer than we ask for
trump_tweets <- userTimeline("realDonaldTrump", n = 3200)
trump_tweets_df <- as_tibble(map_df(trump_tweets, as.data.frame)) #as_tibble() replaces the deprecated tbl_df()

Let’s turn the data Twitter has given us into a tidy table that is easy to work with:

tweets <- trump_tweets_df %>%
  select(id, statusSource, text, created) %>% #select only the columns we want
  extract(statusSource, "source", "Twitter for (.*?)<") %>% #extract a text string
  filter(source %in% c("iPhone", "Android")) #we only want tweets from android or iphone

head(tweets)
## # A tibble: 6 × 4
##                   id  source
##                <chr>   <chr>
## 1 762669882571980801 Android
## 2 762641595439190016  iPhone
## 3 762439658911338496  iPhone
## 4 762425371874557952 Android
## 5 762400869858115588 Android
## 6 762284533341417472 Android
## # ... with 2 more variables: text <chr>, created <dttm>
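
To see what the extract() step is doing, here is a small self-contained sketch in base R. The statusSource strings are made up for illustration (they are shaped like the HTML snippets stored in that column), and regexec() is standing in for tidyr's extract(), so treat it as an approximation of that step rather than the exact code path.

```r
# Hypothetical statusSource values (illustrative only, not downloaded data)
status_source <- c(
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'
)

# Same pattern as in the tutorial: capture whatever follows "Twitter for "
# up to the next "<" (the ? makes the match as short as possible)
pattern <- "Twitter for (.*?)<"
matches <- regmatches(status_source, regexec(pattern, status_source, perl = TRUE))

# Element 2 of each match is the captured device name; no match gives NA
device <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)
device
```

Anything that was not posted from an app called "Twitter for ..." ends up as NA, which is why the filter() step afterwards keeps only the rows where source is iPhone or Android.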

Now that we have nice clean data we can start to explore the differences between the iPhone and Android tweets. Before we start to look at language, there are other features that suggest the two devices have different users. First let’s look at the time of day that the tweets are sent:

#tweet time
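tweets_hour <- tweets %>%
  count(source, hour = hour(with_tz(created, "EST"))) %>% #count tweets per device and hour
  group_by(source) %>% #recent dplyr versions return count() ungrouped, so group explicitly
  mutate(percent = n / sum(n)) %>% #convert to a share of each device's tweets
  ungroup()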
  
  
ggplot(data=tweets_hour,aes(hour, percent, color = source)) + #create plot
  geom_line() + #add a line graph
  scale_y_continuous(labels = percent_format()) + #format axes
  labs(x = "Hour of day (EST)", #label axes
       y = "% of tweets",
       color = "")

Already we see a noticeable difference in the time that the tweets are sent out. Tweets from the Android device are sent in the early mornings and evening, and tweets from the iPhone are mostly made during working hours.
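
When comparing the two lines, it helps if each hour is expressed as a share of that device's own tweets, so that a device with more tweets overall does not simply sit higher on the plot. A minimal base R sketch of that normalisation, using made-up counts (the toy_counts table and its numbers are purely illustrative):

```r
# Made-up hourly tweet counts for two devices (not real data)
toy_counts <- data.frame(
  source = c("Android", "Android", "iPhone", "iPhone"),
  hour   = c(7, 22, 10, 15),
  n      = c(30, 10, 5, 15)
)

# ave() returns, for every row, the total of its own group,
# so each count is divided by that device's total
toy_counts$percent <- toy_counts$n / ave(toy_counts$n, toy_counts$source, FUN = sum)
toy_counts

# Sanity check: the shares for each device add up to 1
share_totals <- tapply(toy_counts$percent, toy_counts$source, sum)
share_totals
```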

Just looking at the number of hashtags we can see the difference in behaviour:

tweet_hash_counts <- tweets %>%
  filter(!str_detect(text, '^"')) %>% #remove tweets that start with a quote mark
  count(source, hashtag = ifelse(str_detect(text, "#"), "Hashtag", "No Hashtag")) #count number of tweets with/without hashtags

ggplot(tweet_hash_counts, aes(source, n, fill = hashtag)) + #create plot
  geom_bar(stat = "identity", position = "dodge") + #add bar graph
  labs(x = "", y = "Number of tweets", fill = "") #label
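
The hashtag check itself is just two pattern matches per tweet. A small base R sketch on three invented tweets (grepl() stands in for str_detect() here, and the tweet text is made up):

```r
# Three invented tweets (illustrative only)
toy_text <- c(
  '"@someone: great speech!" Thank you!',      # starts with a quote mark
  "Join us tomorrow! #MakeAmericaGreatAgain",
  "The media is very dishonest!"
)

# Drop tweets that begin with a quote mark, as in the filter() step
keep <- !grepl('^"', toy_text)

# Label the remaining tweets by whether they contain a "#"
label <- ifelse(grepl("#", toy_text[keep], fixed = TRUE), "Hashtag", "No Hashtag")
table(label)
```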

Now let’s start to look at the actual language that is being used. Is there a meaningful difference in the sentiments being expressed? To find out, the first step is to extract the individual words from the tweet text:

#this pattern matches everything that is NOT part of a word: any character other than a letter, digit, #, @ or apostrophe,
#plus apostrophes that are not followed by one of those characters. The text is split wherever the pattern matches
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- tweets %>%
  filter(!str_detect(text, '^"')) %>% #remove tweets that start with a quote mark
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>% #remove website links and the HTML entity &amp;
  unnest_tokens(word, text, token = "regex", pattern = reg) %>% #split into words, keeping hashtags and @mentions intact
  filter(!word %in% stop_words$word,str_detect(word, "[a-z]")) #remove stopwords and tokens containing no letters

head(tweet_words)
## # A tibble: 6 × 4
##                   id source             created                   word
##                <chr>  <chr>              <dttm>                  <chr>
## 1 676494179216805888 iPhone 2015-12-14 20:09:15                 record
## 2 676494179216805888 iPhone 2015-12-14 20:09:15                 health
## 3 676494179216805888 iPhone 2015-12-14 20:09:15 #makeamericagreatagain
## 4 676494179216805888 iPhone 2015-12-14 20:09:15             #trump2016
## 5 676509769562251264 iPhone 2015-12-14 21:11:12               accolade
## 6 676509769562251264 iPhone 2015-12-14 21:11:12             @trumpgolf
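
To make the splitting rule concrete, here is a rough base R sketch of the same steps on one invented tweet. strsplit() stands in for unnest_tokens(), and toy_stop_words is a tiny hypothetical stop list used in place of tidytext's stop_words, so this is an approximation of the pipeline rather than a replacement for it.

```r
# One invented tweet (illustrative only)
text <- "Crooked Hillary's emails! #MAGA @realDonaldTrump https://t.co/abc123 &amp; more"

# Same two patterns as in the tutorial code
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
clean <- gsub("https://t.co/[A-Za-z\\d]+|&amp;", "", text, perl = TRUE)

# Lower-case, then split wherever the "not part of a word" pattern matches;
# perl = TRUE is needed for the (?! ... ) lookahead
tokens <- strsplit(tolower(clean), reg, perl = TRUE)[[1]]
tokens <- tokens[nzchar(tokens)]  # runs of separators leave empty strings behind

# Hypothetical mini stop list standing in for stop_words$word
toy_stop_words <- c("more", "the", "and")
words <- tokens[!tokens %in% toy_stop_words & grepl("[a-z]", tokens)]
words
```

Notice that the apostrophe inside "hillary's" is kept because it is followed by a letter, and that the hashtag and the @mention survive as single tokens.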

Word clouds are a nice way of visualizing the frequency of words. Let’s compare the word clouds from the Android and iPhone tweets. First we look at Android:

words1 <- tweet_words %>% filter(source=="Android") %>% select(word) %>% table %>% as.data.frame %>% filter(Freq>5)
wordcloud2(words1,size=1)

And now the iPhone:

words2 <- tweet_words %>% filter(source=="iPhone") %>% select(word) %>% table %>% as.data.frame %>% filter(Freq>5)
wordcloud2(words2,size=2)

Probably the best way to compare the frequency of words between these two devices is to look at the relative frequency using a log-odds ratio. This ratio gives us an index of how much more frequent a word is in the Android tweets vs the iPhone tweets:

\[\log_2\left(\frac{\frac{\mbox{# in Android} + 1}{\mbox{Total Android} + 1}} {\frac{\mbox{# in iPhone} + 1}{\mbox{Total iPhone} + 1}}\right)\]
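
As a quick worked example with made-up counts (the three words and their frequencies are invented), here is the same kind of calculation in base R. It follows the smoothing used in the code in this tutorial, where 1 is added to every word count before converting to proportions:

```r
# Invented word counts per device (illustrative only)
toy <- data.frame(
  word    = c("badly", "join", "great"),
  Android = c(7, 0, 3),
  iPhone  = c(0, 7, 3)
)

# Add 1 to every count, then turn each column into proportions
android_share <- (toy$Android + 1) / sum(toy$Android + 1)
iphone_share  <- (toy$iPhone + 1) / sum(toy$iPhone + 1)

# Log (base 2) of the ratio of the two proportions
toy$logratio <- log2(android_share / iphone_share)
toy
```

With these numbers the ratios come out at 3, -3 and 0: a value of 3 means the word is 2^3 = 8 times more frequent (after smoothing) in the Android tweets, -3 means 8 times more frequent in the iPhone tweets, and 0 means it is equally frequent in both. The added 1 is what stops words with a count of zero from producing an infinite ratio.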

#create clean data of log-odds ratios
android_iphone_ratios <- tweet_words %>%
  count(word, source) %>% #count the frequency of words by source
  group_by(word) %>% #group so that the filter below uses each word's own total
  filter(sum(n) >= 5) %>% #only use words with at least 5 total occurrences
  ungroup() %>%
  spread(source, n, fill = 0) %>% #create a column for Android, column for iPhone
  mutate(across(-word, ~ (.x + 1) / sum(.x + 1))) %>% #add 1 to each count, then convert to proportions
  mutate(logratio = log2(Android / iPhone)) %>% #calculate log-odds ratio
  arrange(desc(logratio)) #sort by log-odds ratio

#plot clean data
android_iphone_ratios %>%
  group_by(logratio > 0) %>%
  top_n(15, abs(logratio)) %>% #get the 15 most distinctive words for each device
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>% #sort by log-odds ratio
  ggplot(aes(word, logratio, fill = logratio < 0)) + #create plot
  geom_bar(stat = "identity") + #add bar graph
  coord_flip() + #we want bars going horizontally, not vertically
  ylab("Android / iPhone log ratio") + #label axes
  scale_fill_manual(name = "", labels = c("Android", "iPhone"),values = c("red", "lightblue")) #color legend

OK, so we can see that there are patterns here. Most of the words that are more common from the Android device are associated with negative sentiments. How can we quantify this?

…using inbuilt sentiment lexicons

lexicon <- hash_sentiment_jockers #select our sentiment dictionary
names(lexicon) <- c("word","score") #name columns

lexicon
##               word score
##     1:     abandon -0.75
##     2:   abandoned -0.50
##     3:   abandoner -0.25
##     4: abandonment -0.25
##     5:    abandons -1.00
##    ---                  
## 10735:     zealous  0.40
## 10736:      zenith  0.40
## 10737:        zest  0.50
## 10738:      zombie -0.25
## 10739:     zombies -0.25

This is our sentiment dictionary. Each word in the dictionary is given a score to indicate how negative or positive it is. We will use this dictionary to score the sentiment of each tweet. First let’s do this by simply scoring sentiment word-by-word and calculating the mean per tweet:

tweet_sentiment_word <- tweet_words %>% 
  inner_join(lexicon,by="word") %>%  #join words and dictionary
  group_by(id) %>% #group by tweets
  summarise(mean_sent=mean(score),source=first(source),created=first(created)) #create new column with mean per tweet

head(tweet_sentiment_word)
## # A tibble: 6 × 4
##                   id  mean_sent  source             created
##                <chr>      <dbl>   <chr>              <dttm>
## 1 676509769562251264  0.3750000  iPhone 2015-12-14 21:11:12
## 2 680496083072593920 -0.1166667 Android 2015-12-25 21:11:23
## 3 680503951440121856 -1.0000000 Android 2015-12-25 21:42:39
## 4 680505672476262400  0.2500000 Android 2015-12-25 21:49:30
## 5 680734915718176768 -0.5000000 Android 2015-12-26 13:00:26
## 6 682764544402440192  0.7500000  iPhone 2016-01-01 03:25:27

ggplot(tweet_sentiment_word,aes(mean_sent,color=source)) + #create plot
  geom_freqpoly(binwidth=0.2) + #add frequency polygon (a histogram drawn as lines)
  xlim(-1, 1) + #axis limits
  xlab("Sentiment by word")
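
To see exactly what the word-by-word score is doing, here is a toy version in base R. Both the tokens and the lexicon scores are invented for illustration (they are not taken from hash_sentiment_jockers), and merge() plus aggregate() stand in for inner_join() and summarise().

```r
# Invented tokens for two tiny "tweets" (illustrative only)
toy_words <- data.frame(
  id   = c("t1", "t1", "t2", "t2"),
  word = c("great", "win", "not", "good"),
  stringsAsFactors = FALSE
)

# Hypothetical mini lexicon; the scores are made up
toy_lexicon <- data.frame(
  word  = c("great", "win", "good", "bad"),
  score = c(0.5, 0.5, 0.75, -0.75),
  stringsAsFactors = FALSE
)

# merge() does an inner join by default, so words that are not
# in the lexicon (here "not") are dropped before averaging
scored <- merge(toy_words, toy_lexicon, by = "word")

# Mean score per tweet
toy_sentiment <- aggregate(score ~ id, data = scored, FUN = mean)
toy_sentiment
```

The second "tweet" reads "not good", yet it comes out with a positive score, because "not" has no entry in the toy lexicon and is silently dropped by the join.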

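This is nice, but we can run into a lot of problems when just scoring single words and then averaging. For example, word-by-word scoring can treat “not good” as positive, because it ignores the negation. It can be much better to evaluate the sentiment of a whole sentence. The sentimentr package has functions that will do this for us: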

new_id <- data.frame(id=tweets$id,element_id=seq_len(nrow(tweets)),stringsAsFactors=FALSE) #create id to link tweets and sentences
tweet_sentiment_sentence <- sentiment_by(tweets$text,polarity_dt = hash_sentiment_jockers) %>%  #score each tweet for its sentiment
  inner_join(new_id, by = "element_id") %>% #attach tweet ids
  inner_join(tweets, by = "id") #attach tweet data

head(tweet_sentiment_sentence)
##   element_id word_count        sd ave_sentiment                 id  source
## 1          1         11 0.5303301    0.40926475 762669882571980801 Android
## 2          2         16 0.1750384    0.20282779 762641595439190016  iPhone
## 3          3          8        NA    0.24748737 762439658911338496  iPhone
## 4          4         23        NA   -0.09383149 762425371874557952 Android
## 5          5         23 0.5709214   -0.46392244 762400869858115588 Android
## 6          6         25 0.2593755    0.02947645 762284533341417472 Android
##                                                                                                                                         text
## 1                                                                        My economic policy speech will be carried live at 12:15 P.M. Enjoy!
## 2                         Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8
## 3                                                                           #ICYMI: "Will Media Apologize to Trump?" https://t.co/ia7rKBmioA
## 4     Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky!
## 5    The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, "the baby" and so much more. Very dishonest!
## 6 I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke!
##               created
## 1 2016-08-08 15:20:44
## 2 2016-08-08 13:28:20
## 3 2016-08-08 00:05:54
## 4 2016-08-07 23:09:08
## 5 2016-08-07 21:31:46
## 6 2016-08-07 13:49:29

ggplot(tweet_sentiment_sentence,aes(ave_sentiment,color=source,fill=source)) + #create plot
  geom_freqpoly(binwidth=0.1) + #add frequency polygon (a histogram drawn as lines)
  xlim(-1, 1) + #axis limits
  xlab("Sentiment by sentence")

We can see that the tweets coming from the Android device are more negative, and the tweets from the iPhone more positive.


It looks like Donald Trump or one of his staff picked up on this after it received coverage in the news. He has recently changed from Android to iPhone.

Does this mean that we can no longer distinguish his genuine tweets from those written by his staff, or can we outsmart him?