During the 2016 US presidential election, then-candidate Donald J. Trump used his Twitter account as a way to communicate with potential voters. On August 6, 2016, Todd Vaziri tweeted about Trump that “Every non-hyperbolic tweet is from iPhone (his staff). Every hyperbolic tweet is from Android (from him).” Data scientist David Robinson conducted an analysis to determine whether the data supported this assertion. Here we go through David’s analysis to learn some of the basics of text mining.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(tidyr)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
set.seed(1)
In general, we can extract data directly from Twitter using an R package written for that purpose. However, in this case, a group has already compiled the data for us and made it available at http://www.trumptwitterarchive.com.
url <- 'http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json'
trump_tweets <- map(2009:2017, ~sprintf(url, .x)) %>%
map_df(jsonlite::fromJSON, simplifyDataFrame = TRUE) %>%
filter(!is_retweet & !str_detect(text, '^"')) %>%
mutate(created_at = parse_date_time(created_at, orders = "a b! d! H!:M!:S! z!* Y!", tz="EST"))
## Date in ISO8601 format; converting timezone from UTC to "EST".
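As a quick check of how the yearly file URLs are built, the following sketch uses base R only. It relies on `sprintf` being vectorized, so the `%s` placeholder is filled once per year. The name `year_urls` is ours, and nothing is downloaded here.

```r
# Template taken from the code above; %s is replaced by each year.
url <- 'http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json'

# sprintf() recycles the template over the vector of years.
year_urls <- sprintf(url, 2009:2017)

length(year_urls)   # one URL per year
year_urls[1]
```

In the pipeline above, `map` plays the same role by calling `sprintf` once for each year before the files are read.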
trump_tweets %>% str()
## 'data.frame': 20761 obs. of 8 variables:
## $ source : chr "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" ...
## $ id_str : chr "6971079756" "6312794445" "6090839867" "5775731054" ...
## $ text : chr "From Donald Trump: Wishing everyone a wonderful holiday & a happy, healthy, prosperous New Year. Let’s think li"| __truncated__ "Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban "| __truncated__ "Wishing you and yours a very Happy and Bountiful Thanksgiving!" "Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa's Ultimate Merger: http://tinyurl.com/yk5m3lc" ...
## $ created_at : POSIXct, format: "2009-12-23 12:38:18" "2009-12-03 14:39:09" ...
## $ retweet_count : int 28 33 13 5 7 4 2 4 1 22 ...
## $ in_reply_to_user_id_str: chr NA NA NA NA ...
## $ favorite_count : int 12 6 11 3 6 5 2 10 4 30 ...
## $ is_retweet : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
names(trump_tweets)
## [1] "source" "id_str"
## [3] "text" "created_at"
## [5] "retweet_count" "in_reply_to_user_id_str"
## [7] "favorite_count" "is_retweet"
The tweets are stored in the “text” variable.
trump_tweets %>% select(text) %>% head()
## text
## 1 From Donald Trump: Wishing everyone a wonderful holiday & a happy, healthy, prosperous New Year. Let’s think like champions in 2010!
## 2 Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban Habitat http://bit.ly/sqvQq
## 3 Wishing you and yours a very Happy and Bountiful Thanksgiving!
## 4 Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa's Ultimate Merger: http://tinyurl.com/yk5m3lc
## 5 --Work has begun, ahead of schedule, to build the greatest golf course in history: Trump International – Scotland.
## 6 --From Donald Trump: "Ivanka and Jared’s wedding was spectacular, and they make a beautiful couple. I’m a very proud father."
The “source” variable tells us the device that was used to compose and upload each tweet.
trump_tweets %>% select(source) %>% tail()
## source
## 20756 Twitter for Android
## 20757 Twitter for Android
## 20758 Twitter for Android
## 20759 Twitter for Android
## 20760 Twitter for Android
## 20761 Twitter for iPhone
trump_tweets %>% count(source) %>% arrange(desc(n))
## # A tibble: 19 x 2
## source n
## <chr> <int>
## 1 Twitter Web Client 10718
## 2 Twitter for Android 4652
## 3 Twitter for iPhone 3962
## 4 TweetDeck 468
## 5 TwitLonger Beta 288
## 6 Instagram 133
## 7 Media Studio 114
## 8 Facebook 104
## 9 Twitter Ads 96
## 10 Twitter for BlackBerry 78
## 11 Mobile Web (M5) 54
## 12 Twitter for iPad 39
## 13 Twitlonger 22
## 14 Twitter QandA 10
## 15 Vine - Make a Scene 10
## 16 Periscope 7
## 17 Neatly For BlackBerry 10 4
## 18 Twitter for Websites 1
## 19 Twitter Mirror for iPad 1
Next we remove the “Twitter for” part of the source; sources that do not start with it become NA. Retweets were already filtered out when we loaded the data.
trump_tweets %>% extract(source, "source", "Twitter for (.*)") %>% count(source)
## # A tibble: 6 x 2
## source n
## <chr> <int>
## 1 <NA> 12029
## 2 Android 4652
## 3 BlackBerry 78
## 4 iPad 39
## 5 iPhone 3962
## 6 Websites 1
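To see what the capture group in "Twitter for (.*)" does, here is a minimal base R sketch on three source strings taken from the table above. It is an approximation of the `extract` step, not the tidyr implementation; the names `src` and `device` are ours.

```r
# Three source strings as they appear in the data.
src <- c("Twitter for Android", "Twitter Web Client", "Twitter for iPhone")

# regexec() returns the full match plus the capture group,
# or nothing at all when the pattern does not match.
m <- regmatches(src, regexec("Twitter for (.*)", src))

# Keep the captured device name; non-matching sources become NA.
device <- vapply(m,
                 function(x) if (length(x) == 2) x[2] else NA_character_,
                 character(1), USE.NAMES = FALSE)
device
```

This mirrors why "Twitter Web Client" and the other non-matching sources end up in the NA row of the count.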
We will focus on what was tweeted between the day Trump announced his campaign and election day.
campaign_tweets <- trump_tweets %>%
extract(source, "source", "Twitter for (.*)") %>%
filter(source %in% c("Android", "iPhone") &
created_at >= ymd("2015-06-17") &
created_at < ymd("2016-11-08")) %>%
filter(!is_retweet) %>%
arrange(created_at)
campaign_tweets %>% head()
## source id_str
## 1 Android 612063082186174464
## 2 Android 612066176294866945
## 3 Android 612076000529268736
## 4 Android 612083064945180672
## 5 Android 612406758351478788
## 6 Android 612415976890593284
## text
## 1 Why did @DanaPerino beg me for a tweet (endorsement) when her book was launched?
## 2 I like Mexico and love the spirit of Mexican people, but we must protect our borders from people, from all over, pouring into the U.S.
## 3 Mexico is killing the United States economically because their leaders and negotiators are FAR smarter than ours. But nobody beats Trump!
## 4 Druggies, drug dealers, rapists and killers are coming across the southern border. When will the U.S. get smart and stop this travesty?
## 5 My speech is right now on C-SPAN 1
## 6 Thank you @AnnCoulter for your nice words. The U.S. is becoming a dumping ground for the world. Pols don't get it. Make America Great Again!
## created_at retweet_count in_reply_to_user_id_str favorite_count
## 1 2015-06-19 20:03:05 166 <NA> 348
## 2 2015-06-19 20:15:22 1266 <NA> 2118
## 3 2015-06-19 20:54:25 1033 <NA> 1770
## 4 2015-06-19 21:22:29 1578 <NA> 2181
## 5 2015-06-20 18:48:43 86 <NA> 312
## 6 2015-06-20 19:25:21 299 <NA> 643
## is_retweet
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
We can now use data visualization to explore the possibility that two different groups were tweeting from these devices.
campaign_tweets %>% mutate(hour = hour(with_tz(created_at, "EST"))) %>%
count(source, hour) %>%
group_by(source) %>%
mutate(percent = n / sum(n)) %>%
ungroup %>%
ggplot(aes(hour, percent, color = source)) +
geom_line() +
geom_point() +
scale_y_continuous(labels = percent_format()) +
labs(x = "Hour of day (EST)",
y = "% of tweets",
color = "")
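The percentages in the plot are computed within each device, so each line sums to 100%. A small sketch of that normalization in base R, using made-up data (the `toy` data frame is hypothetical and only illustrates the arithmetic):

```r
# Hypothetical toy data: device and hour of day for six tweets.
toy <- data.frame(
  source = c("Android", "Android", "Android", "Android", "iPhone", "iPhone"),
  hour   = c(6, 6, 7, 20, 12, 20)
)

# Count tweets per device and hour, then divide each row by its device total.
counts  <- table(toy$source, toy$hour)
percent <- prop.table(counts, margin = 1)
percent
```

Normalizing within device is what makes the two curves comparable even though the devices produced different numbers of tweets.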
We notice a big peak for the Android in the early hours of the morning, between 6 and 8 AM. There seems to be a clear difference in these patterns. We will therefore assume that two different entities are using these two devices. Now we will study how their tweets differ.
library(tidytext)
Let’s take a quick look at one of the tweets and at how the default tokenizer splits it into words:
i <- 3008
campaign_tweets$text[i]
## [1] "Great to be back in Iowa! #TBT with @JerryJrFalwell joining me in Davenport- this past winter. #MAGA https://t.co/A5IF0QHnic"
campaign_tweets[i,] %>%
unnest_tokens(word, text) %>%
select(word)
## word
## 3008 great
## 3008.1 to
## 3008.2 be
## 3008.3 back
## 3008.4 in
## 3008.5 iowa
## 3008.6 tbt
## 3008.7 with
## 3008.8 jerryjrfalwell
## 3008.9 joining
## 3008.10 me
## 3008.11 in
## 3008.12 davenport
## 3008.13 this
## 3008.14 past
## 3008.15 winter
## 3008.16 maga
## 3008.17 https
## 3008.18 t.co
## 3008.19 a5if0qhnic
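The default tokenizer strips the # and @ symbols and breaks the link into pieces, so we need a regex that captures the characters that are meaningful on Twitter, such as hashtags and mentions.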
pattern <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- campaign_tweets %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = pattern)
tweet_words$word %>% head()
## [1] "why" "did" "@danaperino" "beg" "me"
## [6] "for"
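To get a feel for what this pattern treats as a separator, the sketch below applies it with base R's `strsplit` to part of the tweet shown earlier. This only approximates what the regex tokenizer does (we lower-case and drop empty strings by hand), and the names `pieces` and `tokens` are ours.

```r
# Separator pattern from the text: any character that is not a letter, digit,
# #, @ or apostrophe, plus any apostrophe not followed by one of those.
pattern <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

tweet <- "Great to be back in Iowa! #TBT with @JerryJrFalwell"

# Lower-case, split on the separators (perl = TRUE is needed for the
# lookahead), and drop the empty strings left between adjacent separators.
pieces <- strsplit(tolower(tweet), pattern, perl = TRUE)[[1]]
tokens <- pieces[nzchar(pieces)]
tokens
```

Unlike the default tokenizer output above, the hashtag and the mention keep their leading # and @.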
tweet_words %>% count(word) %>% arrange(desc(n))
## # A tibble: 6,211 x 2
## word n
## <chr> <int>
## 1 the 2335
## 2 to 1413
## 3 and 1246
## 4 a 1210
## 5 in 1189
## 6 i 1161
## 7 you 1000
## 8 of 983
## 9 is 944
## 10 on 880
## # … with 6,201 more rows
The top words are not very informative, so we remove them. In text mining, these commonly used words are referred to as stop words, and the tidytext package includes a database of them.
tweet_words <- campaign_tweets %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = pattern) %>%
filter(!word %in% stop_words$word )
tweet_words %>%
count(word) %>%
top_n(10, n) %>%
mutate(word = reorder(word, n)) %>%
arrange(desc(n))
## # A tibble: 10 x 2
## word n
## <fct> <int>
## 1 #trump2016 415
## 2 hillary 407
## 3 people 303
## 4 #makeamericagreatagain 296
## 5 america 254
## 6 clinton 239
## 7 poll 220
## 8 crooked 206
## 9 trump 199
## 10 cruz 162
Some tokens are just numbers, and some begin with a quote character. We remove the former and strip the leading quote from the latter.
tweet_words <- campaign_tweets %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = pattern) %>%
filter(!word %in% stop_words$word &
!str_detect(word, "^\\d+$")) %>%
mutate(word = str_replace(word, "^'", ""))
Now that we have all our words in a table, along with information about what device was used to compose the tweet they came from, we can start exploring which words are more common when comparing Android to iPhone.
android_iphone_or <- tweet_words %>%
count(word, source) %>%
spread(source, n, fill = 0) %>%
mutate(or = (Android + 0.5) / (sum(Android) - Android + 0.5) /
( (iPhone + 0.5) / (sum(iPhone) - iPhone + 0.5)))
android_iphone_or %>% arrange(desc(or))
## # A tibble: 5,509 x 4
## word Android iPhone or
## <chr> <dbl> <dbl> <dbl>
## 1 mails 22 0 39.3
## 2 poor 13 0 23.6
## 3 poorly 12 0 21.8
## 4 @cbsnews 11 0 20.1
## 5 bosses 11 0 20.1
## 6 turnberry 11 0 20.1
## 7 angry 10 0 18.3
## 8 write 10 0 18.3
## 9 brexit 9 0 16.6
## 10 defend 9 0 16.6
## # … with 5,499 more rows
android_iphone_or %>% arrange(or)
## # A tibble: 5,509 x 4
## word Android iPhone or
## <chr> <dbl> <dbl> <dbl>
## 1 #makeamericagreatagain 0 296 0.00144
## 2 #americafirst 0 71 0.00607
## 3 #draintheswamp 0 63 0.00683
## 4 #trump2016 3 412 0.00718
## 5 #votetrump 0 56 0.00769
## 6 join 1 157 0.00821
## 7 #imwithyou 0 51 0.00843
## 8 #crookedhillary 0 30 0.0143
## 9 #fitn 0 30 0.0143
## 10 #gopdebate 0 30 0.0143
## # … with 5,499 more rows
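The 0.5 added to each cell keeps the odds ratio finite when a word never appears on one device. The sketch below reproduces the calculation for a single word. The helper `or_corrected` is ours, and we assume the column totals equal the per-device word counts (15829 for Android, 13802 for iPhone) that the text reports for `tweet_words`.

```r
# Odds ratio with a 0.5 continuity correction, mirroring the formula above.
or_corrected <- function(android, iphone, total_android, total_iphone) {
  odds_android <- (android + 0.5) / (total_android - android + 0.5)
  odds_iphone  <- (iphone + 0.5) / (total_iphone - iphone + 0.5)
  odds_android / odds_iphone
}

# Assumption: totals are the per-device word counts given in the text.
# "mails" appears 22 times on Android and never on iPhone.
or_mails <- or_corrected(22, 0, 15829, 13802)
or_mails

# Without the correction the iPhone odds are 0 and the ratio is infinite.
(22 / (15829 - 22)) / (0 / 13802)
```

Under that assumption the result should agree with the value of about 39.3 shown for `mails` in the first table.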
Because several of these words occur rarely overall, we can impose a filter based on the total frequency like this:
android_iphone_or %>% filter(Android+iPhone > 100) %>%
arrange(desc(or))
## # A tibble: 30 x 4
## word Android iPhone or
## <chr> <dbl> <dbl> <dbl>
## 1 @cnn 104 18 4.95
## 2 bad 104 26 3.45
## 3 crooked 157 49 2.79
## 4 ted 85 28 2.62
## 5 interviewed 76 25 2.62
## 6 media 77 26 2.56
## 7 cruz 116 46 2.19
## 8 hillary 290 119 2.14
## 9 win 74 30 2.14
## 10 president 84 35 2.08
## # … with 20 more rows
android_iphone_or %>% filter(Android+iPhone > 100) %>%
arrange(or)
## # A tibble: 30 x 4
## word Android iPhone or
## <chr> <dbl> <dbl> <dbl>
## 1 #makeamericagreatagain 0 296 0.00144
## 2 #trump2016 3 412 0.00718
## 3 join 1 157 0.00821
## 4 tomorrow 25 101 0.218
## 5 vote 46 67 0.600
## 6 america 114 141 0.703
## 7 tonight 71 84 0.737
## 8 iowa 62 65 0.831
## 9 poll 117 103 0.990
## 10 trump 112 92 1.06
## # … with 20 more rows
We already see somewhat of a pattern in the types of words that are being tweeted more on one device versus the other. However, we are not interested in specific words but rather in the tone.
Vaziri’s assertion is that the Android tweets are more hyperbolic. Hyperbole is a hard sentiment to extract from words because it relies on interpreting phrases. However, words can be associated with more basic sentiments such as anger, fear, joy and surprise. In the next section we demonstrate basic sentiment analysis.
sentiments
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
library(textdata)
get_sentiments(lexicon = "bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
get_sentiments("loughran")
## # A tibble: 4,150 x 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # … with 4,140 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,891 more rows
For this analysis we are going to use the nrc lexicon.
nrc <- get_sentiments("nrc")
tweet_words %>% inner_join(nrc, by = "word") %>%
select(source, word, sentiment) %>% sample_n(10)
## source word sentiment
## 1 Android phony anger
## 2 Android vote joy
## 3 Android virtue trust
## 4 iPhone tomorrow anticipation
## 5 Android finally trust
## 6 iPhone honor trust
## 7 iPhone vote negative
## 8 iPhone bomb sadness
## 9 iPhone negative sadness
## 10 Android spirit positive
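Because a word can carry several sentiments in the lexicon, joining duplicates rows, and the type of join decides what happens to words with no match. A minimal base R sketch with `merge`, using a made-up three-word table and a made-up lexicon (both hypothetical, not the real nrc data):

```r
# Hypothetical toy inputs.
words <- data.frame(source = c("Android", "Android", "iPhone"),
                    word   = c("bad", "vote", "join"))
lex   <- data.frame(word      = c("bad", "bad", "vote"),
                    sentiment = c("anger", "disgust", "joy"))

# Inner join: only words found in the lexicon, one row per (word, sentiment).
inner <- merge(words, lex, by = "word")

# Left join: every word is kept; unmatched words get NA as sentiment.
left <- merge(words, lex, by = "word", all.x = TRUE)

nrow(inner)
nrow(left)
left
```

The same distinction applies to `inner_join` and `left_join`: the inner join drops words without a sentiment, while the left join keeps them with a missing value.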
Now we are ready to perform a quantitative analysis comparing Android and iPhone by comparing the sentiments of the tweets posted from each device. Here we could perform a tweet-by-tweet analysis, assigning a sentiment to each tweet. However, this is somewhat complex since each tweet will have several sentiments attached to it, one for each word appearing in the lexicon. For illustrative purposes, we will perform a much simpler analysis: we will count and compare how frequently each sentiment appears for each device.
sentiment_counts <- tweet_words %>%
left_join(nrc, by = "word") %>%
count(source, sentiment) %>%
spread(source, n) %>%
mutate(sentiment = replace_na(sentiment, replace = "none"))
sentiment_counts
## # A tibble: 11 x 3
## sentiment Android iPhone
## <chr> <int> <int>
## 1 anger 965 527
## 2 anticipation 915 710
## 3 disgust 641 318
## 4 fear 802 486
## 5 joy 698 540
## 6 negative 1668 935
## 7 positive 1834 1497
## 8 sadness 907 514
## 9 surprise 530 365
## 10 trust 1253 1010
## 11 none 11523 10739
Because more words were used on the Android than on the iPhone, the raw counts are not directly comparable:
tweet_words %>% group_by(source) %>% summarize(n = n())
## # A tibble: 2 x 2
## source n
## <chr> <int>
## 1 Android 15829
## 2 iPhone 13802
For each sentiment we can compute the odds of a word carrying that sentiment on each device (the proportion of words with the sentiment versus the proportion without) and then compute the odds ratio comparing the two devices.
sentiment_counts %>%
mutate(Android = Android / (sum(Android) - Android) ,
iPhone = iPhone / (sum(iPhone) - iPhone),
or = Android/iPhone) %>%
arrange(desc(or))
## # A tibble: 11 x 4
## sentiment Android iPhone or
## <chr> <dbl> <dbl> <dbl>
## 1 disgust 0.0304 0.0184 1.66
## 2 anger 0.0465 0.0308 1.51
## 3 negative 0.0831 0.0560 1.49
## 4 sadness 0.0435 0.0300 1.45
## 5 fear 0.0383 0.0283 1.35
## 6 surprise 0.0250 0.0211 1.18
## 7 joy 0.0332 0.0316 1.05
## 8 anticipation 0.0439 0.0419 1.05
## 9 trust 0.0612 0.0607 1.01
## 10 positive 0.0922 0.0927 0.994
## 11 none 1.13 1.56 0.725
So we do see some differences, and the order is interesting: the three largest odds ratios are for disgust, anger, and negative! But are these differences statistically significant? How do they compare with what we would see if sentiments were assigned at random?
To answer that question we can compute, for each sentiment, an odds ratio and a confidence interval. For each sentiment, the counts of entries with and without that sentiment on each device form a two-by-two table, from which we compute the log odds ratio, its standard error, and a 95% confidence interval.
library(broom)
log_or <- sentiment_counts %>%
mutate( log_or = log( (Android / (sum(Android) - Android)) / (iPhone / (sum(iPhone) - iPhone))),
se = sqrt( 1/Android + 1/(sum(Android) - Android) + 1/iPhone + 1/(sum(iPhone) - iPhone)),
conf.low = log_or - qnorm(0.975)*se,
conf.high = log_or + qnorm(0.975)*se) %>%
arrange(desc(log_or))
log_or
## # A tibble: 11 x 7
## sentiment Android iPhone log_or se conf.low conf.high
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 disgust 641 318 0.504 0.0694 0.368 0.640
## 2 anger 965 527 0.411 0.0551 0.303 0.519
## 3 negative 1668 935 0.395 0.0422 0.313 0.478
## 4 sadness 907 514 0.372 0.0562 0.262 0.482
## 5 fear 802 486 0.302 0.0584 0.187 0.416
## 6 surprise 530 365 0.168 0.0688 0.0332 0.303
## 7 joy 698 540 0.0495 0.0582 -0.0647 0.164
## 8 anticipation 915 710 0.0468 0.0511 -0.0533 0.147
## 9 trust 1253 1010 0.00726 0.0436 -0.0781 0.0926
## 10 positive 1834 1497 -0.00624 0.0364 -0.0776 0.0651
## 11 none 11523 10739 -0.321 0.0206 -0.362 -0.281
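The interval above is the usual normal-approximation (Wald) interval for a log odds ratio. The sketch below recomputes it for the disgust row only. The helper `log_or_ci` and its argument names are ours; the totals 21736 and 17641 are the column sums of the `sentiment_counts` table shown earlier.

```r
# Wald confidence interval for a log odds ratio from a two-by-two table:
#   a_yes / a_no = Android entries with / without the sentiment,
#   i_yes / i_no = iPhone entries with / without the sentiment.
log_or_ci <- function(a_yes, a_no, i_yes, i_no, level = 0.95) {
  log_or <- log((a_yes / a_no) / (i_yes / i_no))
  se     <- sqrt(1 / a_yes + 1 / a_no + 1 / i_yes + 1 / i_no)
  z      <- qnorm(1 - (1 - level) / 2)
  c(log_or = log_or, se = se,
    conf.low = log_or - z * se, conf.high = log_or + z * se)
}

# Disgust: 641 of 21736 Android entries and 318 of 17641 iPhone entries.
disgust <- log_or_ci(641, 21736 - 641, 318, 17641 - 318)
round(disgust, 3)
```

An interval that excludes 0 on the log scale corresponds to an odds ratio that differs from 1 by more than chance alone would readily explain, under the approximation's assumptions.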
A graphical visualization shows some sentiments that are clearly overrepresented.
log_or %>%
mutate(sentiment = reorder(sentiment, log_or)) %>%
ggplot(aes(x = sentiment, ymin = conf.low, ymax = conf.high)) +
geom_errorbar() +
geom_point(aes(sentiment, log_or)) +
ylab("Log odds ratio for association between Android and sentiment") +
coord_flip()
We see that the disgust, anger, negative, sadness, and fear sentiments are associated with the Android in a way that is hard to explain by chance alone. Words not associated with a sentiment were strongly associated with the iPhone source, which is in agreement with the original claim about hyperbolic tweets.
If we are interested in exploring which specific words are driving these differences, we can go back to our android_iphone_or object.
android_iphone_or %>% inner_join(nrc) %>%
filter(sentiment == "disgust" & Android + iPhone > 10) %>%
arrange(desc(or))
## Joining, by = "word"
## # A tibble: 20 x 5
## word Android iPhone or sentiment
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 mess 15 2 5.41 disgust
## 2 finally 12 2 4.36 disgust
## 3 unfair 12 2 4.36 disgust
## 4 bad 104 26 3.45 disgust
## 5 lie 13 3 3.37 disgust
## 6 terrible 31 8 3.24 disgust
## 7 lying 9 3 2.37 disgust
## 8 waste 12 5 1.98 disgust
## 9 phony 21 9 1.97 disgust
## 10 illegal 32 14 1.96 disgust
## 11 nasty 14 6 1.95 disgust
## 12 pathetic 11 5 1.82 disgust
## 13 horrible 14 7 1.69 disgust
## 14 disaster 21 11 1.63 disgust
## 15 winning 14 9 1.33 disgust
## 16 liar 6 5 1.03 disgust
## 17 dishonest 37 32 1.01 disgust
## 18 john 24 21 0.994 disgust
## 19 dying 6 6 0.872 disgust
## 20 terrorism 9 9 0.872 disgust
android_iphone_or %>% inner_join(nrc, by = "word") %>%
mutate(sentiment = factor(sentiment, levels = log_or$sentiment)) %>%
mutate(log_or = log(or)) %>%
filter(Android + iPhone > 10 & abs(log_or)>1) %>%
mutate(word = reorder(word, log_or)) %>%
ggplot(aes(word, log_or, fill = log_or < 0)) +
facet_wrap(~sentiment, scales = "free_x", nrow = 2) +
geom_bar(stat="identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))