In this project I will use the Twitter API to pull live tweets directly into RStudio. The current date is 01/31/2020, and I will be pulling tweets focused on the US, Asian, and European economies.
After pulling the tweets I will tokenize the text and remove stop words. This will give me a better sense of whether related topics are being tweeted about around the world at the same point in time.
Next I will connect my personal Twitter API credentials to R. The keys are hidden because they are personal credentials and should not be shared.
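The analysis below assumes the following packages are loaded and that the four credential objects passed to setup_twitter_oauth() already exist; this is a sketch, with placeholder strings standing in for the real (hidden) keys.
library(twitteR)   # setup_twitter_oauth, searchTwitter, twListToDF
library(dplyr)     # pipes, mutate, count, anti_join
library(tidytext)  # unnest_tokens, stop_words
library(tidyr)     # spread, gather
library(stringr)   # str_extract
library(ggplot2)   # plotting
library(scales)    # percent_format
# Placeholder values only -- replace with your own keys
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"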
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
Now that Twitter is set up, I will pull three datasets based on specific hashtags.
USA <- twitteR::searchTwitter('#USA + #Economy', n = 400, since = '2020-01-01', retryOnRateLimit = 1e3)
u <- twitteR::twListToDF(USA)
EU <- twitteR::searchTwitter('#EU + #Economy', n = 400, since = '2020-01-01', retryOnRateLimit = 1e3)
e <- twitteR::twListToDF(EU)
ASIA <- twitteR::searchTwitter('#Asia + #Economy', n = 400, since = '2020-01-01', retryOnRateLimit = 1e3)
a <- twitteR::twListToDF(ASIA)
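searchTwitter() can return fewer than n tweets when a hashtag combination is sparse, so a quick sanity check of what actually came back is worthwhile (this check is a sketch and was not part of the original output):
sapply(list(USA = u, EU = e, ASIA = a), nrow)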
Tokenize dataframes and remove stop words
tidy_usa <- u %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_eu <- e %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_asia <- a %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
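Before combining the three regions, the most frequent tokens in any one of them can be inspected with a simple count; this snippet is illustrative and not part of the original analysis:
tidy_usa %>%
  count(word, sort = TRUE) %>%
  head(10)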
frequency <- bind_rows(mutate(tidy_usa, author = "USA"),
                       mutate(tidy_eu, author = "EU"),
                       mutate(tidy_asia, author = "ASIA")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(author, proportion) %>%
  gather(author, proportion, `EU`, `ASIA`)
head(frequency)
## # A tibble: 6 x 4
##   word            USA author proportion
##   <chr>         <dbl> <chr>       <dbl>
## 1 a          0.00142  EU       0.000366
## 2 aajtak     0.000177 EU      NA
## 3 aaron      0.000177 EU      NA
## 4 abbymartin 0.000177 EU      NA
## 5 abd        0.000354 EU      NA
## 6 abmq      NA        EU      NA
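In newer versions of tidyr, spread() and gather() are superseded by pivot_wider() and pivot_longer(); a sketch of the same reshaping with those verbs is below (frequency2 is just an illustrative name, and the column order may differ slightly):
frequency2 <- bind_rows(mutate(tidy_usa, author = "USA"),
                        mutate(tidy_eu, author = "EU"),
                        mutate(tidy_asia, author = "ASIA")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(c(EU, ASIA), names_to = "author", values_to = "proportion")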
ggplot(frequency, aes(x = proportion, y = `USA`,
                      color = abs(`USA` - proportion))) +
  geom_abline(color = "grey40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "USA", x = NULL)
## Warning: Removed 4234 rows containing missing values (geom_point).
## Warning: Removed 4236 rows containing missing values (geom_text).
Looking at the key words, when we benchmark the US against Asia and Europe we see a high frequency of terms related to China, health, and the virus. This coincides with the coronavirus outbreak of January 2020.
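To make that observation concrete, the proportions for a few outbreak-related words can be pulled straight from the frequency table; the word list here is illustrative and not part of the original code:
frequency %>%
  filter(word %in% c("china", "health", "virus", "coronavirus"))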
cor.test(data = frequency[frequency$author == "EU", ],
         ~ proportion + `USA`)
##
## Pearson's product-moment correlation
##
## data: proportion and USA
## t = 20.222, df = 184, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7797568 0.8703496
## sample estimates:
## cor
## 0.8304655
cor.test(data = frequency[frequency$author == "ASIA", ],
         ~ proportion + `USA`)
##
## Pearson's product-moment correlation
##
## data: proportion and USA
## t = 41.179, df = 126, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9503812 0.9750784
## sample estimates:
## cor
## 0.9647972
Both correlations are strong, showing that when users hashtag #Economy, the vocabulary they use alongside it is largely the same whether the tweet is tagged for the US, Europe, or Asia.
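As a compact alternative to running cor.test() once per region (a sketch, not part of the original analysis), both correlations can be computed in a single grouped summary:
frequency %>%
  group_by(author) %>%
  summarise(correlation = cor(proportion, USA, use = "complete.obs"))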