In this tutorial we will examine the Twitter account of Donald Trump. Much of the content and code has been borrowed from this blog post authored by David Robinson.
As with most scientific ideas, we start with a hypothesis, and as with most scientific ideas, we didn’t come up with this hypothesis ourselves. When you post on Twitter, it records what type of device you are posting from. Various Twitter commentators noted that posts on Donald Trump’s account, realDonaldTrump, were being made from both Android and iPhone devices, and that there was a marked difference in the type and tone of the posts from these two devices. This led to the hypothesis that Trump himself was posting from the Android device, and that his staff were using the iPhone.
Every non-hyperbolic tweet is from iPhone (his staff).
Every hyperbolic tweet is from Android (from him). pic.twitter.com/GWr6D8h5ed
— Todd Vaziri (@tvaziri) August 6, 2016
We will examine this idea using some of the Natural Language Processing tools we have told you about. We will use the data science software R with the tidytext package to do text analytics and the twitteR package to connect to Twitter and download data.
Before we get started we need to create what is called a Twitter application. After doing this we will be able to connect to the app and download data from Twitter. Creating the application is a simple process (but you need a Twitter account before you start). Once the application exists, paste its four credentials into the placeholders in the code below.
library(twitteR) #connect to twitter
library(dplyr) #manipulate table data
library(purrr) #writing neat functions
library(tidyr) #manipulate table data
library(lubridate) #manipulate dates
library(scales) #for plotting
library(ggplot2) #for plotting
library(tidytext) #manipulating text data and natural language processing
library(sentimentr) #natural language processing
library(lexicon) #sentiment dictionaries
library(stringr) #manipulating text data
library(wordcloud2) #create wordclouds
consumer_key <- "paste key"
consumer_secret <- "paste key"
access_token <- "paste key"
access_secret <- "paste key"
#now let's connect
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
# We can request at most 3200 tweets at a time; the API may return fewer
trump_tweets <- userTimeline("realDonaldTrump", n = 3200)
# combine the list of tweets into a single table
trump_tweets_df <- as_tibble(map_df(trump_tweets, as.data.frame))
tweets <- trump_tweets_df %>%
  select(id, statusSource, text, created) %>% #select only the columns we want
  extract(statusSource, "source", "Twitter for (.*?)<") %>% #extract the device name from the source string
  filter(source %in% c("iPhone", "Android")) #we only want tweets from android or iphone
head(tweets)
## # A tibble: 6 × 4
## id source
## <chr> <chr>
## 1 762669882571980801 Android
## 2 762641595439190016 iPhone
## 3 762439658911338496 iPhone
## 4 762425371874557952 Android
## 5 762400869858115588 Android
## 6 762284533341417472 Android
## # ... with 2 more variables: text <chr>, created <dttm>
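To see what the `extract()` step is doing, here is a minimal sketch of the same regular expression applied with `stringr::str_match()`. The three `status_source` strings are made-up examples of what the `statusSource` column typically looks like, not values taken from the data above.

```r
library(stringr)

# Hypothetical examples of the HTML stored in statusSource
status_source <- c(
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'
)

# The capture group (.*?) grabs whatever sits between "Twitter for " and the next "<"
device <- str_match(status_source, "Twitter for (.*?)<")[, 2]
device
```

Sources that do not contain "Twitter for ..." come back as `NA`, which is why the later `filter()` keeps only the iPhone and Android rows.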
#tweet time
tweets_hour <- tweets %>%
  count(source, hour = hour(with_tz(created, "EST"))) %>% #count tweets per source and hour
  group_by(source) %>% #so that percentages are computed within each device
  mutate(percent = n / sum(n)) %>% #share of each device's tweets posted in that hour
  ungroup()
ggplot(data = tweets_hour, aes(hour, percent, color = source)) + #create plot
  geom_line() + #add a line graph
  scale_y_continuous(labels = percent_format()) + #format axes
  labs(x = "Hour of day (EST)", #label axes
       y = "% of tweets",
       color = "")
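The percentage in this plot is meant to be the share of each device's own tweets that fall in a given hour. A small sketch on made-up data (the `toy` table is hypothetical) shows how that can be computed so that the shares add up to 100% within each device:

```r
library(dplyr)

# Hypothetical tweets: three from Android, two from iPhone
toy <- tibble(
  source = c("Android", "Android", "Android", "iPhone", "iPhone"),
  hour   = c(7, 7, 22, 14, 15)
)

toy_hour <- toy %>%
  count(source, hour) %>%          # tweets per device and hour
  group_by(source) %>%             # percentages are taken within each device
  mutate(percent = n / sum(n)) %>%
  ungroup()

toy_hour
```

Grouping by `source` before dividing matters: without it, `sum(n)` would be the total over both devices, and a device that tweets less overall would look less active at every hour.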
#separator pattern: any character that is not a letter, digit, #, @ or apostrophe, plus any apostrophe not followed by one of those
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- tweets %>%
  filter(!str_detect(text, '^"')) %>% #drop tweets that start with a quotation mark (quoted text)
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>% #remove website links and HTML-escaped ampersands
  unnest_tokens(word, text, token = "regex", pattern = reg) %>% #split into words at the separators, keeping hashtags and @mentions
  filter(!word %in% stop_words$word, str_detect(word, "[a-z]")) #remove stopwords and tokens containing no letters
head(tweet_words)
## # A tibble: 6 × 4
## id source created word
## <chr> <chr> <dttm> <chr>
## 1 676494179216805888 iPhone 2015-12-14 20:09:15 record
## 2 676494179216805888 iPhone 2015-12-14 20:09:15 health
## 3 676494179216805888 iPhone 2015-12-14 20:09:15 #makeamericagreatagain
## 4 676494179216805888 iPhone 2015-12-14 20:09:15 #trump2016
## 5 676509769562251264 iPhone 2015-12-14 21:11:12 accolade
## 6 676509769562251264 iPhone 2015-12-14 21:11:12 @trumpgolf
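The pattern in `reg` describes the separators between words, not the words themselves. As a rough illustration, the sketch below splits a made-up sentence on the same pattern using `stringr::str_split()`; `unnest_tokens()` additionally lower-cases the text and handles the table bookkeeping, which is not shown here.

```r
library(stringr)

reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

# Hypothetical tweet text
text <- "don't miss #trump2016 at @trumpgolf, it's great!"

tokens <- str_split(text, reg)[[1]]
tokens <- tokens[tokens != ""]  # consecutive separators leave empty strings behind
tokens
```

Hashtags and @mentions survive as single tokens, and apostrophes inside words such as "don't" are kept, while spaces and punctuation are dropped.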
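words1 <- tweet_words %>%
  filter(source == "Android") %>% #Android tweets only
  select(word) %>% table %>% as.data.frame %>% #word frequency table
  filter(Freq > 5) #keep words used more than 5 times
wordcloud2(words1, size = 1)
words2 <- tweet_words %>%
  filter(source == "iPhone") %>% #iPhone tweets only
  select(word) %>% table %>% as.data.frame %>% #word frequency table
  filter(Freq > 5) #keep words used more than 5 times
wordcloud2(words2, size = 2)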
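To find the words that are most characteristic of each device, we compute for every word the log odds ratio between Android and iPhone:
\[\log_2\left(\frac{\frac{\mbox{# in Android} + 1}{\mbox{Total Android} + 1}} {\frac{\mbox{# in iPhone} + 1}{\mbox{Total iPhone} + 1}}\right)\]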
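As a quick worked example of this formula, the sketch below plugs in made-up counts (all four numbers are hypothetical): a word used 9 times out of 999 Android words and 4 times out of 1999 iPhone words.

```r
# Hypothetical counts for a single word
android_n     <- 9     # occurrences in Android tweets
android_total <- 999   # all words in Android tweets
iphone_n      <- 4     # occurrences in iPhone tweets
iphone_total  <- 1999  # all words in iPhone tweets

# Add-one smoothing keeps the ratio finite when a word never appears on one device
android_share <- (android_n + 1) / (android_total + 1)  # 10 / 1000
iphone_share  <- (iphone_n + 1) / (iphone_total + 1)    # 5 / 2000

logratio <- log2(android_share / iphone_share)
logratio
```

The Android share is four times the iPhone share, so the log ratio is 2. A value of 0 means the word is equally common on both devices, and negative values mean it is more common on the iPhone.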
#create clean data of log-odds ratios
android_iphone_ratios <- tweet_words %>%
  count(word, source) %>% #count the frequency of words by source
  group_by(word) %>% #so the filter sums over both sources for each word
  filter(sum(n) >= 5) %>% #only use words with at least 5 total occurrences
  ungroup() %>%
  spread(source, n, fill = 0) %>% #create a column for android, column for iphone
  mutate(across(-word, ~ (.x + 1) / sum(.x + 1))) %>% #smoothed share of each word within each source
  mutate(logratio = log2(Android / iPhone)) %>% #calculate log-odds ratio
  arrange(desc(logratio)) #sort by log-odds ratio
#plot clean data
android_iphone_ratios %>%
  group_by(logratio > 0) %>%
  top_n(15, abs(logratio)) %>% #get the 15 most distinctive words for each device
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>% #sort by log-odds ratio
  ggplot(aes(word, logratio, fill = logratio < 0)) + #create plot
  geom_bar(stat = "identity") + #add bar graph
  coord_flip() + #we want bars going horizontally, not vertically
  ylab("Android / iPhone log ratio") + #label axes
  scale_fill_manual(name = "", labels = c("Android", "iPhone"), values = c("red", "lightblue")) #color legend
lexicon <- hash_sentiment_jockers #select our sentiment dictionary
names(lexicon) <- c("word", "score") #name columns
lexicon
## word score
## 1: abandon -0.75
## 2: abandoned -0.50
## 3: abandoner -0.25
## 4: abandonment -0.25
## 5: abandons -1.00
## ---
## 10735: zealous 0.40
## 10736: zenith 0.40
## 10737: zest 0.50
## 10738: zombie -0.25
## 10739: zombies -0.25
tweet_sentiment_word <- tweet_words %>%
  inner_join(lexicon, by = "word") %>% #join words and dictionary
  group_by(id) %>% #group by tweets
  summarise(mean_sent = mean(score), source = first(source), created = first(created)) #create new column with mean per tweet
head(tweet_sentiment_word)
## # A tibble: 6 × 4
## id mean_sent source created
## <chr> <dbl> <chr> <dttm>
## 1 676509769562251264 0.3750000 iPhone 2015-12-14 21:11:12
## 2 680496083072593920 -0.1166667 Android 2015-12-25 21:11:23
## 3 680503951440121856 -1.0000000 Android 2015-12-25 21:42:39
## 4 680505672476262400 0.2500000 Android 2015-12-25 21:49:30
## 5 680734915718176768 -0.5000000 Android 2015-12-26 13:00:26
## 6 682764544402440192 0.7500000 iPhone 2016-01-01 03:25:27
ggplot(tweet_sentiment_word, aes(mean_sent, color = source)) + #create plot
  geom_freqpoly(binwidth = 0.2) + #add frequency polygon (a histogram drawn as a line)
  xlim(-1, 1) + #axis limits
  xlab("Sentiment by word")
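The word-level score used here is simply the mean dictionary score of the words in a tweet that appear in the dictionary. The sketch below repeats the join-and-average step on made-up data; `toy_lexicon` and `toy_words` are hypothetical, and the scores are invented for illustration rather than taken from `hash_sentiment_jockers`.

```r
library(dplyr)

# Hypothetical mini-dictionary with made-up scores
toy_lexicon <- tibble(
  word  = c("great", "joke", "dishonest"),
  score = c(0.5, -0.5, -1.0)
)

# Hypothetical tokenised tweets, one row per word
toy_words <- tibble(
  id   = c("1", "1", "2", "2", "2"),
  word = c("great", "speech", "dishonest", "joke", "media")
)

toy_sentiment <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%  # words missing from the dictionary are dropped
  group_by(id) %>%
  summarise(mean_sent = mean(score))        # average score per tweet

toy_sentiment
```

Tweet 1 scores 0.5 and tweet 2 scores -0.75. Because of the inner join, words without a dictionary entry do not count towards the mean, and a tweet with no dictionary words at all disappears from the result.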
new_id <- data.frame(id = tweets$id, element_id = seq_len(nrow(tweets)), stringsAsFactors = FALSE) #create id to link tweets and sentiment scores
tweet_sentiment_sentence <- sentiment_by(tweets$text, polarity_dt = hash_sentiment_jockers) %>% #score each tweet for its sentiment
  inner_join(new_id, by = "element_id") %>% #attach tweet ids
  inner_join(tweets, by = "id") #attach tweet data
head(tweet_sentiment_sentence)
## element_id word_count sd ave_sentiment id source
## 1 1 11 0.5303301 0.40926475 762669882571980801 Android
## 2 2 16 0.1750384 0.20282779 762641595439190016 iPhone
## 3 3 8 NA 0.24748737 762439658911338496 iPhone
## 4 4 23 NA -0.09383149 762425371874557952 Android
## 5 5 23 0.5709214 -0.46392244 762400869858115588 Android
## 6 6 25 0.2593755 0.02947645 762284533341417472 Android
## text
## 1 My economic policy speech will be carried live at 12:15 P.M. Enjoy!
## 2 Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8
## 3 #ICYMI: "Will Media Apologize to Trump?" https://t.co/ia7rKBmioA
## 4 Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky!
## 5 The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, "the baby" and so much more. Very dishonest!
## 6 I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke!
## created
## 1 2016-08-08 15:20:44
## 2 2016-08-08 13:28:20
## 3 2016-08-08 00:05:54
## 4 2016-08-07 23:09:08
## 5 2016-08-07 21:31:46
## 6 2016-08-07 13:49:29
ggplot(tweet_sentiment_sentence, aes(ave_sentiment, color = source)) + #create plot
  geom_freqpoly(binwidth = 0.1) + #add frequency polygon (a histogram drawn as a line)
  xlim(-1, 1) + #axis limits
  xlab("Sentiment by sentence")
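One reason to score whole sentences rather than single words is that `sentimentr` takes the surrounding words into account, for example negation. The sketch below uses two made-up sentences and the package's default dictionary rather than the `hash_sentiment_jockers` dictionary used above, so the exact numbers will differ from the tutorial's.

```r
library(sentimentr)

# Hypothetical example sentences: the same polarised word, with and without negation
examples <- c("I am happy.", "I am not happy.")

# get_sentences() splits the text into sentences before scoring
scores <- sentiment_by(get_sentences(examples))
scores$ave_sentiment
```

The first sentence should come out positive and the second negative, even though both contain the same sentiment word. A pure word-by-word average would give both the same score.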