Although I was tempted to analyze Twitter data surrounding the election, I decided to go with something a bit different (although related, I’m sure). I pulled in tweets with a few different hashtags and @users before settling on an analysis of tweets using the hashtag “#refugees.” To begin, I connected to Twitter and pulled 1000 tweets using my selected hashtag.
I pulled in the tweets, formatted them as a data frame, and then applied the regular-expression formatting to the tweet text. Initially I used the same regular expression as in the notes, but later, when joining the words to another data set, I wanted to remove the hashtag at the start of the words, so that is reflected in the reg <- statement below.
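# Packages used throughout (the Twitter connection is assumed to be set up already)
library(twitteR)
library(dplyr)
library(stringr)
library(tidytext)
library(knitr)
library(ggplot2)
library(countrycode)
num_tweets <- 1000
un <- searchTwitter('#refugees', n = num_tweets)
un_df <- twListToDF(un)
head(un_df)
# Delimiters: any character other than letters, digits, @ and '; '#' is no longer kept
reg <- "([^A-Za-z\\d@']|'(?![A-Za-z\\d#@]))"
un_words <- un_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))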
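To see what this pattern does to a single tweet, the following minimal sketch applies the same `reg` with base R's `strsplit()` instead of `unnest_tokens()`. The tweet text is made up for illustration, and the lowercasing is done by hand here (`unnest_tokens()` lowercases by default).

```r
# Same pattern as above: '#' is a delimiter, '@' and in-word apostrophes are kept
reg <- "([^A-Za-z\\d@']|'(?![A-Za-z\\d#@]))"

# Hypothetical tweet text, for illustration only
tweet <- "Support #refugees now, says @UNHCR: don't wait"

# Split on the pattern (perl = TRUE is needed for the lookahead), then drop empty pieces
pieces <- strsplit(tolower(tweet), reg, perl = TRUE)[[1]]
tokens <- pieces[pieces != ""]
print(tokens)
```

The hashtag comes through as the plain word `refugees`, while the `@` on the user handle and the apostrophe in `don't` are preserved.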
I used the wordcloud package to create a word cloud of the most common terms in the tweets that used #refugees. I made two versions, one using a color palette and one without. Although I would have liked to use color to illustrate frequency, I felt the palettes I tried were not adding to the visual, so I also created one in black. Following feedback from classmates, I excluded https, rt and the word I searched for, which gave a much better selection of words.
##Create WordCloud for the Text in #refugees Tweets
library(wordcloud)
un_words %>%
  filter(!word %in% c("rt", "refugees", "https")) %>%
  count(word) %>% arrange(desc(n)) %>%
  with(wordcloud(word, n, max.words = 100, scale = c(5, .5), min.freq = 5,
                 random.order = FALSE, rot.per = .15, colors = brewer.pal(9, "Blues")))
##Create WordCloud for the Text in #refugees Tweets Without Colors
library(wordcloud)
un_words %>%
  filter(!word %in% c("rt", "refugees", "https")) %>%
  count(word) %>% arrange(desc(n)) %>%
  with(wordcloud(word, n, max.words = 100, scale = c(5, .5), min.freq = 5,
                 random.order = FALSE, rot.per = .15))
Next, I wanted to see the most common words in a table rather than a word cloud. To be consistent, I excluded refugees, rt and https from the table below as well.
un_words %>%
  filter(!word %in% c("rt", "refugees", "https")) %>%
  group_by(word) %>%
  summarize(n = n()) %>%
  mutate(frequency = n / sum(n)) %>%
  arrange(desc(n)) %>%
  top_n(15) %>%
  kable()
| word | n | frequency |
|---|---|---|
| job | 229 | 0.0254360 |
| @kon | 157 | 0.0174386 |
| australia | 117 | 0.0129957 |
| advocacy | 97 | 0.0107742 |
| woman | 96 | 0.0106631 |
| @gilliantriggs | 95 | 0.0105520 |
| fearless | 95 | 0.0105520 |
| loses | 94 | 0.0104410 |
| en | 72 | 0.0079973 |
| gratis | 64 | 0.0071087 |
| rights | 48 | 0.0053316 |
| human | 46 | 0.0051094 |
| speaking | 43 | 0.0047762 |
| world | 43 | 0.0047762 |
| gillian | 42 | 0.0046651 |
| triggs | 42 | 0.0046651 |
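The table above has 16 rows even though the code asks for `top_n(15)`. This is because `top_n()` keeps ties, and “gillian” and “triggs” are tied at the cutoff with 42 occurrences each. A minimal sketch with toy counts (made up for illustration) shows the behavior, along with `slice_max()` as one possible alternative when an exact row count is wanted:

```r
library(dplyr)

# Toy counts, for illustration only; "c" and "d" tie at the cutoff
counts <- tibble(word = c("a", "b", "c", "d", "e"),
                 n    = c(10, 7, 4, 4, 1))

# top_n() keeps every row tied with the last-ranked value: 3 requested, 4 returned
with_ties <- counts %>% top_n(3, n)

# slice_max() with with_ties = FALSE returns exactly 3 rows (requires dplyr >= 1.0.0)
exact <- counts %>% slice_max(n, n = 3, with_ties = FALSE)

print(nrow(with_ties))
print(nrow(exact))
```

Keeping the ties is arguably the fairer choice for a frequency table, since dropping one of two equally common words would be arbitrary.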
I also wanted to use the sentiment analysis that we have learned. I think applying it to tweets using #refugees gives an interesting perspective. I anticipated a pretty big spread in the sentiments, with emotions running high and a lot of people with strong feelings about immigration and refugees.
##Join Words from #refugees Tweets to Sentiments
nrc <- sentiments %>%
filter(lexicon == "nrc") %>%
select(word, sentiment)
head(nrc)
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
un_sentiments <- un_words %>% inner_join(nrc, by = "word")
un_sentiments %>% group_by(sentiment) %>% summarize(n = n()) %>% mutate(frequency = n/ sum(n) ) %>% arrange(desc(n))
## # A tibble: 10 × 3
## sentiment n frequency
## <chr> <int> <dbl>
## 1 positive 1035 0.26668384
## 2 trust 507 0.13063643
## 3 anticipation 430 0.11079619
## 4 negative 410 0.10564288
## 5 joy 389 0.10023190
## 6 fear 340 0.08760629
## 7 anger 281 0.07240402
## 8 sadness 214 0.05514043
## 9 surprise 159 0.04096882
## 10 disgust 116 0.02988920
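One thing to keep in mind when reading these counts: the NRC lexicon can assign several sentiments to a single word (in the `head(nrc)` output above, “abandon” maps to fear, negative and sadness), so the join produces one row per word-sentiment pair, and words that are not in the lexicon drop out. A minimal sketch, using the first four lexicon rows printed above and a made-up word list:

```r
library(dplyr)

# First four rows of the nrc lexicon as printed above
toy_nrc <- tibble(word      = c("abacus", "abandon", "abandon", "abandon"),
                  sentiment = c("trust", "fear", "negative", "sadness"))

# Hypothetical tokens, for illustration only
toy_words <- tibble(word = c("abandon", "abacus", "refugees"))

# inner_join keeps only words found in the lexicon, one row per matching sentiment
toy_sentiments <- toy_words %>% inner_join(toy_nrc, by = "word")

toy_sentiments %>% count(sentiment, sort = TRUE) %>% print()
```

Here two matched words yield four rows, so the `n` column in the sentiment summary counts word-sentiment pairs rather than distinct words or tweets.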
I wanted to illustrate the sentiments of the tweets with a bar chart, as an easy way to see the most common feelings and emotions in the tweets that included #refugees. The graph clearly shows that the most common sentiment, both by frequency and by total number of words, is positive.
##Summarize Data for Bar Chart
summary <- un_sentiments %>% group_by(sentiment) %>% summarize(n = n()) %>% mutate(frequency = n/ sum(n) ) %>% arrange(desc(n)) %>% top_n(8)
##Create Table with Data and Create Graphic Representation
kable(summary, digits = 3)
| sentiment | n | frequency |
|---|---|---|
| positive | 1035 | 0.267 |
| trust | 507 | 0.131 |
| anticipation | 430 | 0.111 |
| negative | 410 | 0.106 |
| joy | 389 | 0.100 |
| fear | 340 | 0.088 |
| anger | 281 | 0.072 |
| sadness | 214 | 0.055 |
ggplot(summary, aes(x = sentiment, y = frequency, fill = n)) + geom_bar(stat = "identity", position = "dodge") + labs(x = "Sentiment", y = "Frequency of Sentiment", title = "Overall Sentiments of Tweets Using #refugees")
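By default ggplot2 orders a character x-axis alphabetically, so the bars do not appear in order of frequency. A possible variation, sketched here with the values from the table above typed in by hand, uses `reorder()` to sort the bars from most to least frequent. The name `sentiment_summary` is introduced only for this example.

```r
library(ggplot2)

# Values copied from the summary table above
sentiment_summary <- data.frame(
  sentiment = c("positive", "trust", "anticipation", "negative",
                "joy", "fear", "anger", "sadness"),
  n         = c(1035, 507, 430, 410, 389, 340, 281, 214),
  frequency = c(0.267, 0.131, 0.111, 0.106, 0.100, 0.088, 0.072, 0.055)
)

# reorder() sorts the sentiment levels by descending frequency
p <- ggplot(sentiment_summary,
            aes(x = reorder(sentiment, -frequency), y = frequency, fill = n)) +
  geom_col() +
  labs(x = "Sentiment", y = "Frequency of Sentiment",
       title = "Overall Sentiments of Tweets Using #refugees")
print(p)
```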
I also thought it might be interesting to break down the tweets by user. This showed that a wide variety of people tweet using the hashtag #refugees: the highest share of tweets coming from any one user is only 0.9%.
#Group Tweets by Users
un_users <- un_df %>% group_by(screenName) %>% summarize(n = n()) %>% mutate(percent = n / sum(n)) %>% arrange(desc(n)) %>% top_n(10)
kable(un_users)
| screenName | n | percent |
|---|---|---|
| SolidaridEstela | 9 | 0.009 |
| 5QU1RR3LZ | 4 | 0.004 |
| Kon__K | 4 | 0.004 |
| OmNico72 | 4 | 0.004 |
| TullyNYCity | 4 | 0.004 |
| chris_vd_post | 3 | 0.003 |
| Ebird2015 | 3 | 0.003 |
| ireneogrizek | 3 | 0.003 |
| labor4refugees1 | 3 | 0.003 |
| pcliers | 3 | 0.003 |
| Prison_Health | 3 | 0.003 |
| Rachel_Mantell | 3 | 0.003 |
| RaminFarhangniy | 3 | 0.003 |
| StrongInfidel | 3 | 0.003 |
I was also interested in connecting the words from the tweets to locations, if at all possible. I tried to pull the user locations and to use the longitude and latitude, but these either came back as NAs or timed out repeatedly.
I took an alternative approach and looked for a data frame with the names of countries, so that I could join the words from the tweets with country names.
Since I wanted to include a map of the countries mentioned in the tweets, I joined the words from the tweets with the list of country names.
MappingTweets <- inner_join(un_words, countrycode_data, by = c("word" = "country.name"))
head(MappingTweets)
## # A tibble: 6 × 30
## favorited favoriteCount replyToSN created truncated
## <lgl> <dbl> <chr> <dttm> <lgl>
## 1 FALSE 0 KagutaMuseveni 2016-11-16 19:16:58 TRUE
## 2 FALSE 0 qatarairways 2016-11-16 19:49:06 FALSE
## 3 FALSE 0 qatarairways 2016-11-16 22:36:09 FALSE
## 4 FALSE 0 sophia_christos 2016-11-16 23:27:24 FALSE
## 5 FALSE 0 V_of_Europe 2016-11-16 22:42:57 FALSE
## 6 FALSE 0 <NA> 2016-11-16 18:32:22 FALSE
## # ... with 25 more variables: replyToSID <chr>, id <chr>,
## # replyToUID <chr>, statusSource <chr>, screenName <chr>,
## # retweetCount <dbl>, isRetweet <lgl>, retweeted <lgl>, longitude <lgl>,
## # latitude <lgl>, word <chr>, cowc <chr>, cown <int>, fao <int>,
## # fips104 <chr>, imf <int>, ioc <chr>, iso2c <chr>, iso3c <chr>,
## # iso3n <int>, un <int>, wb <chr>, regex <chr>, continent <chr>,
## # region <chr>
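One caveat with this join is that it is an exact string match. `unnest_tokens()` lowercases tokens by default and each token is a single word, so a lookup table with capitalized names would need to be lowercased first, and multi-word names such as “South Sudan” cannot match a single token. The sketch below uses a small hypothetical lookup table (`toy_countries`, not the real `countrycode_data`) and made-up tokens to illustrate the idea.

```r
library(dplyr)

# Hypothetical lookup table; the capitalization here is an assumption for illustration
toy_countries <- tibble(country.name = c("Australia", "Syria", "South Sudan"),
                        continent    = c("Oceania", "Asia", "Africa"))

# Hypothetical lowercase single-word tokens
toy_tokens <- tibble(word = c("australia", "syria", "south", "sudan", "job"))

# Join on a lowercased copy of the country name
matched <- toy_tokens %>%
  inner_join(toy_countries %>% mutate(key = tolower(country.name)),
             by = c("word" = "key"))
print(matched)
```

Only the single-word names match here, so counts for countries with multi-word names are likely to be understated by this kind of join.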
I felt it would be useful to summarize the data in a couple of ways, since countrycode_data provides a fair amount of data to use. To begin, I thought showing the occurrence of countries by region and continent would make an interesting visualization.
TweetbyRegion <- MappingTweets %>% group_by(region) %>% summarize(n = n()) %>% mutate(frequency = n/ sum(n) ) %>% arrange(desc(n)) %>% top_n(8)
##Plot Tweets by Region (Top 8 Regions)
ggplot(TweetbyRegion, aes(x = region, y= n, fill= frequency)) + geom_bar(stat = "identity", position = "dodge") + labs(x = "Regions", y= "Count of Mentions by Region", title = "Regions Mentioned in Tweets Using #refugees") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
I then used the list of countries from the inner join to map them, which required some additional libraries. I added markers for the countries that were mentioned in the tweets. They are clearly concentrated in Europe and Africa, which makes sense given the refugee crisis in Europe, the Middle East and Northern and Eastern Africa.
library(leaflet)
library(rgdal)
library(ggmap)
TweetbyCountry <- MappingTweets %>% group_by(word) %>% summarize(n = n())
##Pull the Latitude and Longitude for the Countries in the Tweets
Countries <- geocode(TweetbyCountry$word, output="latlon", source = "google")
map <- leaflet() %>% setView(lng = 0, lat = 0, zoom = 1)
map %>% addProviderTiles("CartoDB.Positron") %>% addMarkers(lng = Countries$lon, lat = Countries$lat)
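Geocoding can return `NA` for names it cannot resolve, so it may be worth dropping those rows and labelling the markers before mapping. A minimal sketch, assuming a data frame shaped like the `geocode()` result with `lon` and `lat` columns. Here `toy_coords` and its approximate coordinates are typed in by hand for illustration and are not taken from the tweets.

```r
library(dplyr)
library(leaflet)

# Hypothetical geocoding result with approximate coordinates; one lookup failed (NA)
toy_coords <- tibble(word = c("australia", "germany", "unknownplace"),
                     lon  = c(133.8, 10.5, NA),
                     lat  = c(-25.3, 51.2, NA))

# Keep only rows that were geocoded successfully
located <- toy_coords %>% filter(!is.na(lon), !is.na(lat))

# Build the map, naming lng and lat explicitly and labelling each marker
toy_map <- leaflet(located) %>%
  setView(lng = 0, lat = 0, zoom = 1) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(lng = ~lon, lat = ~lat, label = ~word)
toy_map
```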