In this project I will use the Twitter API to pull live tweets directly into RStudio. The current date is 01/31/2020, and I will be pulling tweets focused on the US, Asian, and European economies.
After pulling the tweets I will tokenize the text and remove stop words. This will give me a better sense of whether related topics are being tweeted about around the world at the same point in time.
Next I will connect my personal Twitter API credentials to R. The keys are hidden because they are personal credentials and should not be shared.
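The analysis below assumes the following packages are loaded and that the four credential objects passed to setup_twitter_oauth() already exist; this is a sketch, with placeholder strings standing in for the real (hidden) keys.
library(twitteR)   # setup_twitter_oauth, searchTwitter, twListToDF
library(dplyr)     # pipes, mutate, count, anti_join
library(tidytext)  # unnest_tokens, stop_words
library(tidyr)     # spread, gather
library(stringr)   # str_extract
library(ggplot2)   # plotting
library(scales)    # percent_format
# Placeholder values only -- replace with your own keys
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"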
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
Now that Twitter is set up, I will pull three datasets based on specific hashtags.
USA <- twitteR::searchTwitter('#USA + #Economy', n = 400, since = '2020-01-01', retryOnRateLimit = 1e3)
u <- twitteR::twListToDF(USA)
EU <- twitteR::searchTwitter('#EU + #Economy', n = 400, since = '2020-01-01', retryOnRateLimit = 1e3)
e <- twitteR::twListToDF(EU)
ASIA <- twitteR::searchTwitter('#Asia + #Economy', n = 400, since = '2020-01-01', retryOnRateLimit = 1e3)
a <- twitteR::twListToDF(ASIA)
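searchTwitter() can return fewer than n tweets when a hashtag combination is sparse, so a quick sanity check of what actually came back is worthwhile (this check is a sketch and was not part of the original output):
sapply(list(USA = u, EU = e, ASIA = a), nrow)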
Tokenize dataframes and remove stop words
tidy_usa <- u %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_eu <- e %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_asia <- a %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
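Before combining the three regions, the most frequent tokens in any one of them can be inspected with a simple count; this snippet is illustrative and not part of the original analysis:
tidy_usa %>%
  count(word, sort = TRUE) %>%
  head(10)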
frequency <- bind_rows(mutate(tidy_usa, author = "USA"),
                       mutate(tidy_eu, author = "EU"),
                       mutate(tidy_asia, author = "ASIA")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(author, proportion) %>%
  gather(author, proportion, `EU`, `ASIA`)
head(frequency)
## # A tibble: 6 x 4
##   word            USA author proportion
##   <chr>         <dbl> <chr>       <dbl>
## 1 a          0.00142  EU       0.000366
## 2 aajtak     0.000177 EU      NA
## 3 aaron      0.000177 EU      NA
## 4 abbymartin 0.000177 EU      NA
## 5 abd        0.000354 EU      NA
## 6 abmq      NA        EU      NA
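In newer versions of tidyr, spread() and gather() are superseded by pivot_wider() and pivot_longer(); a sketch of the same reshaping with those verbs is below (frequency2 is just an illustrative name, and the column order may differ slightly):
frequency2 <- bind_rows(mutate(tidy_usa, author = "USA"),
                        mutate(tidy_eu, author = "EU"),
                        mutate(tidy_asia, author = "ASIA")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(c(EU, ASIA), names_to = "author", values_to = "proportion")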
ggplot(frequency, aes(x = proportion, y = `USA`,
                      color = abs(`USA` - proportion))) +
  geom_abline(color = "grey40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "USA", x = NULL)
## Warning: Removed 4234 rows containing missing values (geom_point).
## Warning: Removed 4236 rows containing missing values (geom_text).
Looking at the key words, when we benchmark the US against Asia and Europe we see a high frequency of terms related to China, health, and the virus. This coincides with the coronavirus outbreak of January 2020.
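To make that observation concrete, the proportions for a few outbreak-related words can be pulled straight from the frequency table; the word list here is illustrative and not part of the original code:
frequency %>%
  filter(word %in% c("china", "health", "virus", "coronavirus"))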
cor.test(data = frequency[frequency$author == "EU", ],
         ~ proportion + `USA`)
##
## Pearson's product-moment correlation
##
## data: proportion and USA
## t = 20.222, df = 184, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7797568 0.8703496
## sample estimates:
## cor
## 0.8304655
cor.test(data = frequency[frequency$author == "ASIA", ],
         ~ proportion + `USA`)
##
## Pearson's product-moment correlation
##
## data: proportion and USA
## t = 41.179, df = 126, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9503812 0.9750784
## sample estimates:
## cor
## 0.9647972
Both correlations are strong, showing that when users hashtag #Economy, the vocabulary they use alongside it is largely the same whether the tweet is tagged for the US, Europe, or Asia.
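As a compact alternative to running cor.test() once per region (a sketch, not part of the original analysis), both correlations can be computed in a single grouped summary:
frequency %>%
  group_by(author) %>%
  summarise(correlation = cor(proportion, USA, use = "complete.obs"))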