MBA 676: Assignment 4

Jordan Cunningham

Gathering the Data from Twitter and Cleaning it Up

Although I was tempted to analyze Twitter data surrounding the election, I decided to go with something a bit different (although related, I’m sure). I pulled in tweets with a few different hashtags and @users before settling on tweets using the hashtag “#refugees.” To begin, I connected to Twitter and pulled 1,000 tweets using that hashtag.

I pulled in the tweets, formatted them as a dataframe, and then applied a regular expression to tokenize each tweet’s text. Initially I used the same regular expression as in the notes, but when I later joined the words to another data set I wanted to remove the hashtag at the start of the words. That is reflected in the reg <- statement below, where “#” is left out of the first character class and is therefore treated as a separator.

library(twitteR)  # searchTwitter(), twListToDF(); assumes the session is already authenticated

num_tweets <- 1000
un <- searchTwitter('#refugees', n = num_tweets)
un_df <- twListToDF(un)
head(un_df)

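library(dplyr)     # %>%, filter(), mutate()
library(stringr)   # str_detect(), str_replace_all()
library(tidytext)  # unnest_tokens(), stop_words

# Split on anything that is not a letter, digit, @ or apostrophe ("#" is a separator here)
reg <- "([^A-Za-z\\d@']|'(?![A-Za-z\\d#@]))"
un_words <- un_df %>% 
  filter(!str_detect(text, '^"')) %>%                                               # drop tweets that start with a quotation mark
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%   # strip links and HTML ampersands
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))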
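
To see what the modified pattern does, the following minimal sketch applies it to a single made-up tweet using base R's strsplit() rather than unnest_tokens(). The sample text and the helper name split_tweet are assumptions for illustration only; reg_with_hash is a variant that keeps "#" in the first character class, included for comparison.

```r
# Pattern from the report: "#" is not in the first character class, so it acts as a separator
reg <- "([^A-Za-z\\d@']|'(?![A-Za-z\\d#@]))"
# Variant for comparison: "#" is kept, so hashtags survive as part of the token
reg_with_hash <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

# Hypothetical helper: lowercase the text, split on the pattern, drop empty pieces
split_tweet <- function(text, pattern) {
  pieces <- strsplit(tolower(text), pattern, perl = TRUE)[[1]]
  pieces[nzchar(pieces)]
}

tweet <- "RT @user: Help #refugees now, don't wait!"  # made-up example text

tokens_no_hash   <- split_tweet(tweet, reg)
tokens_with_hash <- split_tweet(tweet, reg_with_hash)

print(tokens_no_hash)
print(tokens_with_hash)
```

With "#" treated as a separator, "#refugees" and "refugees" collapse into the same token, which is what makes a later join on plain words possible.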

Creating Some Initial Visualizations

I used the wordcloud package to create a word cloud of the most common terms in the tweets that used #refugees. I made two versions, one using a color spectrum and one without. Although I would have liked to use color to illustrate frequency, I felt the palettes I tried were not adding to the visual, so I also created one in black. Per the feedback from classmates, I excluded https, rt and the word I searched for, which left a more informative selection of words.

##Create WordCloud for the Text in #refugees Tweets
library(wordcloud)
un_words %>% filter(!word %in% c("rt", "refugees", "https")) %>%
  count(word) %>% arrange(desc(n)) %>%
  with(wordcloud(word, n, max.words = 100, scale = c(5, .5), min.freq = 5,
                 random.order = FALSE, rot.per = .15, colors = brewer.pal(9, "Blues")))

##Create WordCloud for the Text in #refugees Tweets Without Colors
library(wordcloud)
un_words %>% filter(!word %in% c("rt", "refugees", "https")) %>%
  count(word) %>% arrange(desc(n)) %>%
  with(wordcloud(word, n, max.words = 100, scale = c(5, .5), min.freq = 5,
                 random.order = FALSE, rot.per = .15))

Next I wanted to see the most common words in a table rather than a word cloud. To be consistent, I excluded refugees, rt and https from the table below as well.

kable(un_words %>% filter(!word %in% c("rt", "refugees", "https")) %>% count(word) %>%
        mutate(frequency = n / sum(n)) %>% arrange(desc(n)) %>% top_n(15, n))

word               n  frequency
job              229  0.0254360
@kon             157  0.0174386
australia        117  0.0129957
advocacy          97  0.0107742
woman             96  0.0106631
@gilliantriggs    95  0.0105520
fearless          95  0.0105520
loses             94  0.0104410
en                72  0.0079973
gratis            64  0.0071087
rights            48  0.0053316
human             46  0.0051094
speaking          43  0.0047762
world             43  0.0047762
gillian           42  0.0046651
triggs            42  0.0046651
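
The table has 16 rows even though the code asks for 15, because top_n() keeps every row tied at the cutoff (here gillian and triggs, both at 42). A small sketch with made-up counts shows the behavior, along with slice_max(), which can drop ties if an exact row count is wanted. The toy data frame is an assumption for illustration.

```r
library(dplyr)

# Made-up counts with a tie at the cutoff (b and c both have n = 3)
counts <- tibble(word = c("a", "b", "c", "d"), n = c(5, 3, 3, 1))

# top_n() keeps both tied rows, so asking for 2 returns 3
with_ties <- top_n(counts, 2, n)

# slice_max() with with_ties = FALSE returns exactly 2 rows
without_ties <- slice_max(counts, n, n = 2, with_ties = FALSE)

print(with_ties)
print(without_ties)
```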

Sentiment Analysis

I also wanted to use the sentiment analysis that we have learned. I think applying it to tweets using #refugees offers an interesting perspective. I anticipated a pretty big spread of sentiments, with emotions running high and a lot of people holding strong feelings about immigration and refugees.

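##Join Words from #refugees Tweets to Sentiments
## (newer tidytext releases drop the lexicon column; get_sentiments("nrc") returns the NRC word/sentiment pairs there)
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)
head(nrc)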
## # A tibble: 6 × 2
##        word sentiment
##       <chr>     <chr>
## 1    abacus     trust
## 2   abandon      fear
## 3   abandon  negative
## 4   abandon   sadness
## 5 abandoned     anger
## 6 abandoned      fear
un_sentiments <- un_words %>% inner_join(nrc, by = "word")

un_sentiments %>% group_by(sentiment) %>% summarize(n = n()) %>% mutate(frequency = n/ sum(n) ) %>% arrange(desc(n))
## # A tibble: 10 × 3
##       sentiment     n  frequency
##           <chr> <int>      <dbl>
## 1      positive  1035 0.26668384
## 2         trust   507 0.13063643
## 3  anticipation   430 0.11079619
## 4      negative   410 0.10564288
## 5           joy   389 0.10023190
## 6          fear   340 0.08760629
## 7         anger   281 0.07240402
## 8       sadness   214 0.05514043
## 9      surprise   159 0.04096882
## 10      disgust   116 0.02988920
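
One thing to keep in mind when reading these counts is that the NRC lexicon can assign several sentiments to one word (abandon appears above under fear, negative and sadness), so the inner join produces one row per word-sentiment pair rather than one per word. A minimal base R sketch, using the lexicon rows printed above and a made-up token list, illustrates this; merge() stands in for inner_join() here.

```r
# Lexicon rows copied from the head(nrc) output
nrc_sample <- data.frame(
  word      = c("abacus", "abandon", "abandon", "abandon", "abandoned", "abandoned"),
  sentiment = c("trust", "fear", "negative", "sadness", "anger", "fear"),
  stringsAsFactors = FALSE
)

# Made-up tokens; "@kon" is a handle and has no entry in the sample lexicon
tokens <- data.frame(word = c("abandon", "abandoned", "@kon"), stringsAsFactors = FALSE)

# merge() with the default all = FALSE behaves like an inner join on "word"
joined <- merge(tokens, nrc_sample, by = "word")

# Count and share of each sentiment among the matched pairs
sentiment_counts <- table(joined$sentiment)
sentiment_share  <- prop.table(sentiment_counts)

print(joined)
print(sentiment_counts)
```

Three tokens go in, but five rows come out, and the unmatched handle disappears. The frequency column in the report is therefore a share of matched word-sentiment pairs, not a share of tweets.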

I wanted to illustrate the sentiments of the tweets with a bar chart, to give an easy way to see the most common feelings and emotions in the tweets that included #refugees. The graph clearly shows that the most common sentiment, both by frequency and by total number of words, is positive (about 27%).

##Summarize Data for Bar Chart
summary <- un_sentiments %>% group_by(sentiment) %>% summarize(n = n()) %>% mutate(frequency = n/ sum(n) ) %>% arrange(desc(n)) %>% top_n(8)

##Create Table with Data and Create Graphic Representation
kable(summary, digits = 3)

sentiment        n  frequency
positive      1035  0.267
trust          507  0.131
anticipation   430  0.111
negative       410  0.106
joy            389  0.100
fear           340  0.088
anger          281  0.072
sadness        214  0.055

ggplot(summary, aes(x = sentiment, y = frequency, fill = n)) + geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Sentiment", y = "Frequency of Sentiment", title = "Overall Sentiments of Tweets Using #refugees")
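
Because sentiment is a character column, ggplot2 orders the bars alphabetically rather than by size. One possible refinement, sketched here with the values from the summary table, is to reorder the x axis by frequency with reorder(). This is an optional variation, not the chart produced in the original analysis, and the name summary_df is an assumption.

```r
library(ggplot2)

# Values copied from the summary table
summary_df <- data.frame(
  sentiment = c("positive", "trust", "anticipation", "negative",
                "joy", "fear", "anger", "sadness"),
  n         = c(1035, 507, 430, 410, 389, 340, 281, 214),
  frequency = c(0.267, 0.131, 0.111, 0.106, 0.100, 0.088, 0.072, 0.055)
)

# reorder() sorts the factor levels by -frequency, so the tallest bar comes first
bar_order <- levels(reorder(summary_df$sentiment, -summary_df$frequency))

p <- ggplot(summary_df, aes(x = reorder(sentiment, -frequency), y = frequency)) +
  geom_col() +
  labs(x = "Sentiment", y = "Frequency of Sentiment",
       title = "Overall Sentiments of Tweets Using #refugees")

print(bar_order)
```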

I also thought it might be interesting to break down the tweets by user. This showed that a wide variety of people are tweeting with the hashtag #refugees: the highest share of tweets coming from any one user is only 0.9% (9 tweets).

#Group Tweets by Users
un_users <- un_df %>% group_by(screenName) %>%  summarize(n = n()) %>% mutate(percent = n / sum(n)) %>% arrange(desc(n)) %>% top_n(10)

kable(un_users)

screenName        n  percent
SolidaridEstela   9  0.009
5QU1RR3LZ         4  0.004
Kon__K            4  0.004
OmNico72          4  0.004
TullyNYCity       4  0.004
chris_vd_post     3  0.003
Ebird2015         3  0.003
ireneogrizek      3  0.003
labor4refugees1   3  0.003
pcliers           3  0.003
Prison_Health     3  0.003
Rachel_Mantell    3  0.003
RaminFarhangniy   3  0.003
StrongInfidel     3  0.003

Linking the Tweets to Countries Referenced

I was also interested in connecting the words from the tweets to locations, if at all possible. I tried to pull the user locations or use the longitude and latitude, but these fields either came back as NAs or the requests timed out repeatedly.

I took an alternative approach and looked for a dataframe of country names (countrycode_data, from the countrycode package) that I could join to the words from the tweets.

Because I wanted to include a map of the countries mentioned in the tweets, I joined the words from the tweets with that list of country names.

MappingTweets <- inner_join(un_words, countrycode_data, by = c("word" = "country.name"))
head(MappingTweets)
## # A tibble: 6 × 30
##   favorited favoriteCount       replyToSN             created truncated
##       <lgl>         <dbl>           <chr>              <dttm>     <lgl>
## 1     FALSE             0  KagutaMuseveni 2016-11-16 19:16:58      TRUE
## 2     FALSE             0    qatarairways 2016-11-16 19:49:06     FALSE
## 3     FALSE             0    qatarairways 2016-11-16 22:36:09     FALSE
## 4     FALSE             0 sophia_christos 2016-11-16 23:27:24     FALSE
## 5     FALSE             0     V_of_Europe 2016-11-16 22:42:57     FALSE
## 6     FALSE             0            <NA> 2016-11-16 18:32:22     FALSE
## # ... with 25 more variables: replyToSID <chr>, id <chr>,
## #   replyToUID <chr>, statusSource <chr>, screenName <chr>,
## #   retweetCount <dbl>, isRetweet <lgl>, retweeted <lgl>, longitude <lgl>,
## #   latitude <lgl>, word <chr>, cowc <chr>, cown <int>, fao <int>,
## #   fips104 <chr>, imf <int>, ioc <chr>, iso2c <chr>, iso3c <chr>,
## #   iso3n <int>, un <int>, wb <chr>, regex <chr>, continent <chr>,
## #   region <chr>
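
A join on country names like this depends on the tokens and the names matching exactly, including case, and unnest_tokens() lowercases its output by default. The sketch below, in base R with a small made-up country table, shows one way to normalize case before joining. It also shows that single-word tokens cannot match multi-word names such as South Sudan. The toy table and its values are assumptions, not rows from countrycode_data.

```r
# Made-up tokens, lowercased the way unnest_tokens() would return them
words <- data.frame(word = c("australia", "germany", "job", "sudan"),
                    stringsAsFactors = FALSE)

# Toy stand-in for a country lookup table (not the real countrycode_data)
countries <- data.frame(
  country.name = c("Australia", "Germany", "South Sudan", "Uganda"),
  continent    = c("Oceania", "Europe", "Africa", "Africa"),
  stringsAsFactors = FALSE
)

# Add a lowercased key so the join is not defeated by capitalization
countries$word <- tolower(countries$country.name)

# Inner join on the normalized key
matched <- merge(words, countries, by = "word")

print(matched)
```

Only australia and germany match: job is not a country, and sudan does not equal "south sudan", so multi-word countries would need separate handling (for example, matching on the full tweet text instead of single tokens).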

I felt it would be useful to summarize the data in a couple of ways, since countrycode_data provides a fair amount of data to use. To begin, I thought showing the occurrence of countries by region and continent would be an interesting visualization.

TweetbyRegion <- MappingTweets %>% group_by(region) %>% summarize(n = n()) %>% mutate(frequency = n/ sum(n) ) %>% arrange(desc(n)) %>% top_n(8)

##Plot Tweets by Region (Top 8 Regions)
ggplot(TweetbyRegion, aes(x = region, y= n, fill= frequency)) + geom_bar(stat = "identity", position = "dodge") + labs(x = "Regions", y= "Count of Mentions by Region", title = "Regions Mentioned in Tweets Using #refugees") +  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I then used the list of countries from the inner join to map them, which required some additional libraries. I added markers for the countries that were mentioned in the tweets. The markers are clearly concentrated in Europe and Africa, which makes sense given the refugee crisis in Europe, the Middle East, and Northern and Eastern Africa.

library(leaflet)
library(rgdal)
library(ggmap)
TweetbyCountry <- MappingTweets %>% group_by(word) %>% summarize(n = n()) 

##Pull the Latitude and Longitude for the Countries in the Tweets
## (recent ggmap versions need a Google API key set with register_google() first)
Countries <- geocode(TweetbyCountry$word, output = "latlon", source = "google")

##Center the Map on (0, 0) and Add One Marker per Geocoded Country
map <- leaflet() %>% setView(lng = 0, lat = 0, zoom = 1)
map %>% addProviderTiles("CartoDB.Positron") %>% addMarkers(lng = Countries$lon, lat = Countries$lat)
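
As a possible extension, the mention counts in TweetbyCountry could be used to size the markers. The sketch below assumes a small data frame with hand-entered, approximate coordinates and made-up counts in place of the geocode() results, so it runs without an API key. addCircleMarkers() is used instead of addMarkers() so that the radius can vary; the names country_points and mention_map are hypothetical.

```r
library(leaflet)

# Stand-in for TweetbyCountry combined with geocode() output;
# coordinates are approximate and the counts are made up for illustration
country_points <- data.frame(
  word = c("australia", "germany", "uganda"),
  n    = c(12, 6, 3),
  lon  = c(133, 10, 32),
  lat  = c(-25, 51, 1)
)

mention_map <- leaflet(country_points) %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(lng = 0, lat = 0, zoom = 1) %>%
  addCircleMarkers(lng = ~lon, lat = ~lat,
                   radius = ~sqrt(n) * 3,  # square root keeps large counts from dominating
                   label = ~word)

# Call print(mention_map) in an interactive session to display the map
```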