I used the following packages: tidyverse, twitteR, rtweet (which provides `search_tweets`), tidytext, stringr, reshape2, formattable, wordcloud, and lubridate.
I chose to investigate the hashtag “#climatechange” on Twitter for this analysis.
num_tweets <- 1000
climate_tweets <- search_tweets('#climatechange', n = num_tweets, include_rts = FALSE)
#head(climate_tweets)
The Twitter Web App is the single most common source (32% of tweets), but many tweets are also posted from mobile apps.
| source | n | percent_of_tweets |
|---|---|---|
| Twitter Web App | 325 | 0.32 |
| Twitter for iPhone | 153 | 0.15 |
| Twitter for Android | 127 | 0.13 |
| Hootsuite Inc. | 90 | 0.09 |
| Twitter Web Client | 53 | 0.05 |
| TweetDeck | 49 | 0.05 |
| Buffer | 41 | 0.04 |
| Twitter for iPad | 23 | 0.02 |
| Sprout Social | 13 | 0.01 |
| Twitter Media Studio | 9 | 0.01 |
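
The code behind this table is not shown. A minimal sketch of how such a summary could be computed with dplyr follows; the `toy_tweets` data frame is made up for illustration, and only the column names `source`, `n`, and `percent_of_tweets` are taken from the table above.

```r
library(dplyr)

# Hypothetical stand-in for climate_tweets: one row per tweet.
toy_tweets <- tibble(
  source = c(rep("Twitter Web App", 5),
             rep("Twitter for iPhone", 3),
             rep("TweetDeck", 2))
)

source_summary <- toy_tweets %>%
  count(source, sort = TRUE) %>%                     # tweets per client, largest first
  mutate(percent_of_tweets = round(n / sum(n), 2))   # share of all tweets, as a proportion

print(source_summary)
```

Note that a column computed this way holds proportions, so a value of 0.32 corresponds to 32% of tweets.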
There is a slight relationship between the age of an account and its number of followers, but with a lot of variation: some accounts have existed for a long time yet have few followers.
user_data<-climate_tweets %>%
dplyr::select(account_created_at, favourites_count, statuses_count, followers_count, friends_count) %>%
mutate(accountage = lubridate::now(tzone = "EST") - account_created_at) %>%
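mutate(accountage_num = as.numeric(accountage, units = "hours")) %>%  # force hours; difftime otherwise picks its units automatically
mutate(accountyears = accountage_num/8760)  # 8760 = 24 * 365 hours per year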
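One detail worth checking in the age calculation is the unit of the time difference. Subtracting two date-times in R returns a `difftime` whose units are chosen automatically, so dividing by 8760 only gives years if the value is in hours. The sketch below uses two made-up creation dates and a fixed reference time (both are assumptions for illustration) to show the automatic choice and an explicit conversion to hours.

```r
# Fixed, hypothetical "now" and account creation times (not real data).
now_time <- as.POSIXct("2019-12-10 12:00:00", tz = "UTC")
created  <- as.POSIXct(c("2018-12-10 12:00:00", "2010-12-10 12:00:00"), tz = "UTC")

# Automatic units: here every difference exceeds one day, so R reports days.
age_auto <- now_time - created
print(units(age_auto))

# Explicit units: ask for hours, then convert to years (8760 hours per year).
age_hours <- as.numeric(difftime(now_time, created, units = "hours"))
age_years <- age_hours / 8760
print(age_years)
```

Requesting the units explicitly keeps the result independent of how old the newest account in the sample happens to be.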
user_data %>%
filter(followers_count < 50000) %>%
ggplot(aes(accountyears, followers_count)) + geom_point(color = "#67a9cf") +
geom_smooth(method = "lm", color = "black") +
labs(x = "Age of Twitter account (years)", y="Number of followers",
title = "Relationship between age of account and number of followers") +
theme(panel.grid = element_blank(), axis.text = element_text(size=12), axis.title = element_text(size=13))
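To put a number on how slight the relationship is, one option is to look at the fitted slope, the R-squared of the linear fit, and a rank correlation. The sketch below runs on simulated data, since the real `user_data` depends on the Twitter API; `toy_users` and its values are assumptions and do not reproduce the results in the plot.

```r
set.seed(1)

# Simulated stand-in for user_data; these values are NOT real Twitter data.
toy_users <- data.frame(accountyears = runif(200, min = 0, max = 12))
toy_users$followers_count <- round(
  rexp(200, rate = 1 / (500 + 300 * toy_users$accountyears))
)

# Same model that geom_smooth(method = "lm") draws.
fit <- lm(followers_count ~ accountyears, data = toy_users)
slope     <- coef(fit)[["accountyears"]]   # followers per additional year of age
r_squared <- summary(fit)$r.squared        # share of variance explained by age

# Rank correlation is less sensitive to a few very large accounts.
rho <- cor(toy_users$accountyears, toy_users$followers_count, method = "spearman")

print(c(slope = slope, r_squared = r_squared, spearman = rho))
```

A low R-squared alongside a nonzero slope would match the description of a slight trend with a lot of scatter.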
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- climate_tweets %>%
dplyr::select(status_id, text) %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
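To see what the tokenizing pattern does, the sketch below applies it to a single made-up tweet using base R. This is only an approximation of the pipeline above: `unnest_tokens` uses a different regex engine and handles lowercasing itself, and the `toy_stop` list is a hypothetical stand-in for `stop_words`.

```r
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

# Made-up example tweet (not from the collected data).
tweet <- "Act now on #ClimateChange! Don't wait, @UN https://t.co/abc123"

# Strip the shortened link, lowercase, then split wherever the pattern matches.
no_url <- gsub("https://t.co/[A-Za-z\\d]+", "", tweet, perl = TRUE)
pieces <- strsplit(tolower(no_url), reg, perl = TRUE)[[1]]
tokens <- pieces[nzchar(pieces)]          # drop empty strings between adjacent delimiters

# Hypothetical two-word stop list standing in for stop_words$word.
toy_stop <- c("now", "on")
kept <- tokens[!tokens %in% toy_stop]

print(tokens)
print(kept)
```

The pattern splits on anything that is not a letter, digit, `#`, `@`, or an apostrophe inside a word, which is why hashtags, mentions, and contractions survive as single tokens.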
The hashtag #cop25 is trending because the UN Climate Change Conference (COP25) is taking place right now (Dec 2-13, 2019). The cloud also contains other interesting words such as “debate”, “research”, “action”, and “environment”. Based on this word cloud, most tweets appear to focus on spreading awareness of climate change or debating related issues.
myrdbupal<-c("#67001f","#b2182b","#d6604d","#f4a582","#92c5de","#4393c3","#2166ac","#053061")
cloudwords<-as.vector(tweet_words$word)
wordcloud::wordcloud(cloudwords, min.freq = 2, scale = c(7, 0.6), colors = myrdbupal,
random.color = TRUE, random.order = FALSE, max.words = 150)
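The cloud sizes each word by how often it occurs, and `min.freq = 2` leaves out words that appear only once. The same information can be checked as a plain frequency table. The sketch below uses a short made-up vector, `toy_words`, in place of `cloudwords`.

```r
# Hypothetical tokens standing in for cloudwords.
toy_words <- c("#cop25", "action", "climate", "#cop25",
               "research", "action", "#cop25")

# Count each word and order from most to least frequent.
word_freq <- sort(table(toy_words), decreasing = TRUE)

# Keep only words that occur at least twice, mirroring min.freq = 2.
frequent <- word_freq[as.vector(word_freq >= 2)]

print(frequent)
```

Reading the table next to the cloud makes it easier to tell whether a word looks large because it is frequent or only because of the random color and layout.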
positive <- get_sentiments("bing") %>%
filter(sentiment == "positive")
pos_words<-tweet_words %>%
semi_join(positive, by = "word") %>%
count(word, sort = TRUE) %>%
filter(!word %in% c("warm", "fast", "trump", "won", "silent", "hot")) %>%  # manually excluded words
mutate(sentiment = "Positive")
negative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
neg_words<-tweet_words %>%
semi_join(negative, by = "word") %>%
count(word, sort = TRUE) %>%
mutate(sentiment = "Negative")
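The lexicon step relies on `semi_join`, which keeps only the tokens that also appear in the sentiment list, without adding any columns from it. A minimal sketch with a three-word made-up lexicon (`toy_lexicon`, standing in for the Bing lexicon) and a handful of made-up tokens shows the effect.

```r
library(dplyr)

# Hypothetical mini-lexicon and tokens, for illustration only.
toy_lexicon <- tibble(
  word      = c("clean", "crisis", "threat"),
  sentiment = c("positive", "negative", "negative")
)
toy_tokens <- tibble(word = c("crisis", "clean", "energy", "crisis", "threat"))

toy_neg <- toy_tokens %>%
  semi_join(filter(toy_lexicon, sentiment == "negative"), by = "word") %>%  # keep negative words only
  count(word, sort = TRUE)                                                  # occurrences per word

print(toy_neg)
```

Words missing from the lexicon, such as "energy" here, drop out, so the counts only cover vocabulary the lexicon knows about.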
These Tweets contain similar numbers of positive-associated and negative-associated words.
bind_rows(pos_words, neg_words) %>%
filter(n > 4) %>%
mutate(n = ifelse(sentiment == "Negative", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col() +
coord_flip() +
labs(y = "Contribution to sentiment", x="Word" , fill = "Sentiment",
title = "Common words in #climatechange Tweets") +
theme(panel.grid = element_blank(), axis.text = element_text(size=13), axis.title = element_text(size=13)) +
scale_fill_manual(values = c(Negative = "#b2182b", Positive = "#67a9cf"))
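The statement that positive and negative words occur in similar numbers can be checked directly by totalling the counts per sentiment. The sketch below assumes tables shaped like `pos_words` and `neg_words` (columns `word`, `n`, `sentiment`); the rows in `toy_pos` and `toy_neg` are invented and do not reflect the actual tweets.

```r
library(dplyr)

# Hypothetical word counts with the same columns as pos_words / neg_words.
toy_pos <- tibble(word = c("clean", "support"), n = c(6L, 5L), sentiment = "Positive")
toy_neg <- tibble(word = c("crisis", "threat"), n = c(8L, 4L), sentiment = "Negative")

totals <- bind_rows(toy_pos, toy_neg) %>%
  group_by(sentiment) %>%
  summarise(
    distinct_words = n_distinct(word),  # how many different words per sentiment
    total_uses     = sum(n)             # how many times those words were used
  )

print(totals)
```

Comparing both columns helps distinguish "similar vocabulary size" from "similar overall usage", which can differ when a few words dominate one side.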