Executive Summary

Investigating word use and frequency is important when determining trends and sentiment. The term frequency-inverse document frequency (tf-idf) analysis shows which words are most important within a person's tweets, and the sentiment analysis shows how positive or negative the tweets are. We can follow these trends and add to the conversation regarding data analytics and R Studio.

We saw that Julia uses her Twitter account for both personal use and professional development, while David uses his account for professional development. David and Julia have had their accounts for about the same amount of time, but Julia has used hers more frequently. We verified the account use with the word frequency and tf-idf analyses: David mentions upcoming R Studio conferences, while Julia mostly mentions her family and personal life. Julia tweeted about four times as much as David, which makes sense for someone who combines business and personal use on Twitter.

We also investigated how David's and Julia's word use changed over the course of 2016. David had sharp upticks in word use tied to upcoming conferences, while Julia's top words trended downward. Julia's downward trends could be explained by her using a wide range of words in her tweets.

Finally, we investigated sentiment. Julia's sentiment was more positive than David's. We attribute this to David mostly commenting on R packages, R Studio conferences, and other data science topics, which the sentiment analysis scores as neutral. We then looked at which positive and negative words appeared most frequently. In both David's and Julia's tweets, the negative-sentiment words relate to data science topics and may not be negative given the context of the tweet. The words regression and error, for example, could have been referring to a model's effectiveness or some other analytics topic.

A follow-on analysis would give more insight into the context of each tweet. The tweets written by David and Julia could share themes with those of other data science commentators or R Studio contributors. We recommend investigating other R Studio contributors. We can collect this data by reaching out to followers of Julia and David and requesting their tweets so that we can compare them with David's and Julia's results. This added information could give us insight into how R Studio and data science topics are being presented on Twitter.

Background and Objectives

Twitter has been around since 2006 and offers a means of spreading information quickly. Twitter users acquire followers by posting tweets or mentioning someone else's tweet. Users follow trends by seeing which hashtags are trending, or they post tweets for their own niche following. Users post on certain topics or mention someone else in the hope of building their brand and following. Someone who is starting to gather a following in data science and R would like to know whom to mention and what phrases to include in their tweets. The objective of this analysis is to investigate the number of tweets to post, word and phrase frequency, and favorited posts and re-tweets. The text mining analysis of users' tweets will give insight into what is trending and how a new member of Twitter can gather a following among fellow data scientists.

Key Measures and Data Collection

The key measures for this analysis are word frequency, word use, change in word use pattern, and re-tweets sent by users. The data collected is from David Robinson and Julia Silge, who both post regularly on Twitter about R Studio and data science. Julia has posted just over thirteen thousand tweets and David has posted around forty-two hundred tweets. The period covered is from 2008 to 2017.

Model Specification & Fitting

The type of model we will use for this analysis is a term frequency-inverse document frequency (tf-idf) analysis. We will first investigate the term frequencies, word use, user mentions, and changes in word use. We will then perform a sentiment analysis followed by the tf-idf analysis. The sentiment analysis determines whether the tweets are more positive than negative. The tf-idf analysis measures how important a word is to a document, which here is each person's collection of tweets. All statistical analyses occur at the 0.05 significance level.

Exploratory Data Analysis

Before we create the model, we will explore the data by plotting histograms of the tweets sent by Julia and David as well as a scatter plot of the words most frequently used by each person. The histograms will show us who has been posting to Twitter consistently and who has only recently started posting. The scatter plot compares word frequencies between the two users.

Data Distribution of Tweets

We begin this analysis by downloading the data from GitHub. The data was originally downloaded from David's and Julia's Twitter accounts and then uploaded to GitHub. We then plot the number of posts from each user over time. The first plot below shows the distribution of tweets by Julia in blue and David in orange. As you can see, Julia has been posting consistently since 2008, while David has been posting more frequently since 2015. They both post at about the same rate in more recent years. Julia, however, has posted around four times as many tweets as David.

library(lubridate)
library(tidyverse)
library(tidytext)
library(textdata)
library(scales)
library(broom)
library(wordcloud)
library(reshape2)

tweets_julia <- read_csv("https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/data/tweets_julia.csv")
tweets_dave <- read_csv("https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/data/tweets_dave.csv")

tweets <- bind_rows(tweets_julia %>%
                      mutate(person = "Julia"),
                    tweets_dave %>%
                      mutate(person = "David")) %>%
  mutate(timestamp = ymd_hms(timestamp))

ggplot(tweets, aes(x = timestamp, fill = person)) +
  geom_histogram(position = "identity", bins = 20, show.legend = FALSE) +
  facet_wrap(~person, ncol = 1)+
  labs(title="Tweet Counts for David and Julia",
       x="Timestamp",y="Count")+
  scale_fill_manual(values = c("#E69F00","#56B4E9"))

Word Frequencies

First, let us look at word frequency with word clouds. The size of each word depends on how frequently the word is used in the tweets. The first word cloud is of David's tweets and the second is of Julia's tweets.

remove_reg <- "&amp;|&lt;|&gt;"
tidy_tweets <- tweets %>% 
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"))

tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  filter(person == "David") %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  filter(person == "Julia") %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, scale = c(3.5, 0.15), max.words = 100))

Now we move on to calculating the word frequencies. We group the words by user and count how many times each word was tweeted by David or Julia, then divide by each person's total word count. The scatter plot below compares the two sets of frequencies, with Julia's word frequency on the x-axis and David's on the y-axis. Points close to the red line are words that Julia and David use at similar frequencies. Words further from the line are used more by one individual. The text labels show the most frequently tweeted words. From the text labels we can also glean how each account is used: Julia has used her account for personal and professional development, while David has used his account for professional development. We will confirm this later in the analysis.

frequency <- tidy_tweets %>% 
  group_by(person) %>% 
  count(word, sort = TRUE) %>% 
  left_join(tidy_tweets %>% 
              group_by(person) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)

frequency <- frequency %>% 
  select(person, word, freq) %>% 
  spread(person, freq) %>%
  arrange(Julia, David)

ggplot(frequency, aes(Julia, David)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")
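
To make the frequency calculation concrete, the sketch below reproduces the same idea in base R on a handful of made-up words. The toy data frame is hypothetical and is not drawn from the real tweets. Each word's frequency is its count divided by that person's total word count, which is what the n/total step above computes.

```r
# Hypothetical toy data: one row per word occurrence (not the real tweets)
words <- data.frame(
  person = c("David", "David", "David", "David",
             "Julia", "Julia", "Julia", "Julia", "Julia"),
  word   = c("rstats", "rstats", "ggplot2", "data",
             "rstats", "family", "family", "data", "utah")
)

# Count each word per person: rows are words, columns are people
counts <- table(word = words$word, person = words$person)

# Divide each column by its total to get relative frequencies per person
freq <- prop.table(counts, margin = 2)

print(round(freq, 2))
```

Unlike the spread() call above, which leaves NA for a word one person never used, this table records such words as 0.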

Comparing Word Use

Now we compare word use by finding the ratio of word frequencies between David and Julia. Restricting the data to tweets from 2016 and to words used at least ten times, we compute the log odds ratio for each word and plot the fifteen words most characteristic of each person. The plot shows two distinct sets of words. The positive, orange bars are words used more frequently by David, while the negative, blue bars are words used more frequently by Julia. As you can see, David talks about conferences and other topics associated with data analysis, while Julia talks about more personal topics.

tidy_tweets <- tidy_tweets %>%
  filter(timestamp >= as.Date("2016-01-01"),
         timestamp < as.Date("2017-01-01"))

word_ratios <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  count(word, person) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  pivot_wider(names_from = person, values_from = n, values_fill = 0) %>%
  mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(David / Julia)) %>%
  arrange(desc(logratio))

word_ratios %>%
  group_by(logratio < 0) %>%
  slice_max(abs(logratio), n = 15) %>% 
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("Log ratio (David/Julia)") + xlab("Word") +
  ggtitle("Ratio of Word Frequency by David and Julia") +
  scale_fill_manual(values = c("#E69F00", "#56B4E9"))
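
The log ratio can be checked by hand on a few words. The following base R sketch uses hypothetical counts rather than the real tweets, and applies the same add-one smoothing as the (. + 1) / (sum(.) + 1) step above before taking log(David / Julia).

```r
# Hypothetical counts per word (not taken from the real tweets)
counts <- data.frame(
  word  = c("conference", "family", "data"),
  David = c(30, 0, 20),
  Julia = c(2, 40, 20)
)

# Add one to every count, then divide by the person's total plus one,
# so that a word one person never used does not produce log(0)
smooth <- function(n) (n + 1) / (sum(n) + 1)

counts$logratio <- log(smooth(counts$David) / smooth(counts$Julia))

# Positive values lean towards David, negative values towards Julia
print(counts[order(-counts$logratio), ])
```

A word used at a similar rate by both people, such as data in this toy example, ends up with a log ratio close to zero.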

Change in Word Use

Now we will investigate how word use has changed over time, looking at 2016, the year in which David was observed tweeting more frequently. The first plot shows the changes in David's word frequencies over the course of 2016. As you can see, his most frequent word related to the UseR conference, peaking in June 2016. Mentions of UseR dropped sharply during July and August. When we look at the changes in Julia's top word frequencies, we see a decreasing trend for both #rstats and post. This suggests that Julia spread her tweets across a wider variety of words over the course of 2016.

words_by_time <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  mutate(time_floor = floor_date(timestamp, unit = "1 month")) %>%
  count(time_floor, person, word) %>%
  group_by(person, time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(person, word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 30)

nested_data <- words_by_time %>%
  nest(data = c(-word, -person))

nested_models <- nested_data %>%
  mutate(models = map(data, ~ glm(cbind(count, time_total) ~ time_floor, ., 
                                  family = "binomial")))

slopes <- nested_models %>%
  mutate(models = map(models, tidy)) %>%
  unnest(cols = c(models)) %>%
  filter(term == "time_floor") %>%
  mutate(adjusted.p.value = p.adjust(p.value))

top_slopes <- slopes %>% 
  filter(adjusted.p.value < 0.05)

cbp1 <- c("#E69F00", "#56B4E9", "#009E73",
          "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

words_by_time %>%
  inner_join(top_slopes, by = c("word", "person")) %>%
  filter(person == "David") %>%
  ggplot(aes(time_floor, count/time_total, color = word)) +
  geom_line(size = 1.3) +
  labs(x = "Month", y = "Frequency",title="David's Word Frequency Changes in 2016")+
  scale_colour_manual(values=cbp1)

words_by_time %>%
  inner_join(top_slopes, by = c("word", "person")) %>%
  filter(person == "Julia") %>%
  ggplot(aes(time_floor, count/time_total, color = word)) +
  geom_line(size = 1.3) +
  labs(x = "Month", y = "Frequency",title="Julia's Word Frequency Changes in 2016")+
  scale_colour_manual(values=cbp1)
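
The trend test for a single word can be sketched in base R as follows. The monthly counts are hypothetical, and the two extra p-values stand in for other words. This sketch writes the failure column as time_total - count, which is the usual two-column response for glm() with a binomial family, whereas the code above passes time_total directly. Treat it as an illustration of the idea, not a reproduction of the fitted models.

```r
# Hypothetical monthly counts for one word (not the real tweets)
months     <- 1:12
time_total <- rep(1000, 12)  # all words tweeted in each month
count      <- c(5, 8, 10, 14, 18, 22, 27, 30, 36, 41, 45, 52)  # uses of the word

# Binomial regression: successes are uses of the word,
# failures are all other words tweeted that month
fit <- glm(cbind(count, time_total - count) ~ months, family = "binomial")

slope   <- coef(summary(fit))["months", "Estimate"]
p_value <- coef(summary(fit))["months", "Pr(>|z|)"]

# With one model per word, the p-values are adjusted for multiple testing.
# p.adjust() uses the Holm method by default; 0.04 and 0.20 are
# hypothetical p-values for two other words.
p_adj <- p.adjust(c(p_value, 0.04, 0.20))

print(slope)
print(p_adj)
```

A positive slope means the word's share of all words grew over the year. The second word's raw p-value of 0.04 no longer clears the 0.05 threshold after adjustment, which is why the filter above is applied to the adjusted values.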

Favorite Tweets and Retweets

Now we will look at word use in the favorited and re-tweeted tweets. As the first plot shows, tweets in which David and Julia mention R packages tend to receive more re-tweets. The next plot shows the words used in the most favorited tweets. As the second plot shows, the words differ only slightly between re-tweeted and favorited tweets.

tweets_julia <- read_csv("https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/data/juliasilge_tweets.csv")
tweets_dave <- read_csv("https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/data/drob_tweets.csv")
tweets <- bind_rows(tweets_julia %>% 
                      mutate(person = "Julia"),
                    tweets_dave %>% 
                      mutate(person = "David")) %>%
  mutate(created_at = ymd_hms(created_at))

tidy_tweets <- tweets %>% 
  filter(!str_detect(text, "^(RT|@)")) %>%
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token = "tweets", strip_url = TRUE) %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"))

totals <- tidy_tweets %>% 
  group_by(person, id) %>% 
  summarise(rts = first(retweets)) %>% 
  group_by(person) %>% 
  summarise(total_rts = sum(rts))

word_by_rts <- tidy_tweets %>% 
  group_by(id, word, person) %>% 
  summarise(rts = first(retweets)) %>% 
  group_by(person, word) %>% 
  summarise(retweets = median(rts), uses = n()) %>%
  left_join(totals) %>%
  filter(retweets != 0) %>%
  ungroup()

word_by_rts %>%
  filter(uses >= 5) %>%
  group_by(person) %>%
  slice_max(retweets, n = 10) %>% 
  arrange(retweets) %>%
  ungroup() %>%
  mutate(word = factor(word, unique(word))) %>%
  ungroup() %>%
  ggplot(aes(word, retweets, fill = person)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ person, scales = "free", ncol = 2) +
  coord_flip() +
  labs(x = NULL, 
       y = "Median value of retweets containing each word")+
  scale_fill_manual(values = c("#E69F00","#56B4E9"))

totals <- tidy_tweets %>% 
  group_by(person, id) %>% 
  summarise(favs = first(favorites)) %>% 
  group_by(person) %>% 
  summarise(total_favs = sum(favs))

word_by_favs <- tidy_tweets %>% 
  group_by(id, word, person) %>% 
  summarise(favs = first(favorites)) %>% 
  group_by(person, word) %>% 
  summarise(favorites = median(favs), uses = n()) %>%
  left_join(totals) %>%
  filter(favorites != 0) %>%
  ungroup()

word_by_favs %>%
  filter(uses >= 5) %>%
  group_by(person) %>%
  slice_max(favorites, n = 10) %>% 
  arrange(favorites) %>%
  ungroup() %>%
  mutate(word = factor(word, unique(word))) %>%
  ungroup() %>%
  ggplot(aes(word, favorites, fill = person)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ person, scales = "free", ncol = 2) +
  coord_flip() +
  labs(x = NULL, 
       y = "Median value of favorites for tweets containing each word")+
  scale_fill_manual(values = c("#E69F00","#56B4E9"))
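
The per-word medians plotted above can be illustrated on a small example. This base R sketch uses hypothetical tweets and retweet counts. It counts each word once per tweet and then takes the median retweet count over the tweets containing that word.

```r
# Hypothetical data: one row per (tweet, word) pair, not the real tweets
tw <- data.frame(
  id       = c(1, 1, 2, 2, 3, 3, 4),
  word     = c("tidytext", "release", "tidytext", "book",
               "book", "coffee", "coffee"),
  retweets = c(40, 40, 10, 10, 2, 2, 0)
)

# Keep each word once per tweet so repeated words are not double counted
tw <- unique(tw)

# Median retweets and number of tweets for each word
med  <- tapply(tw$retweets, tw$word, median)
uses <- tapply(tw$retweets, tw$word, length)

by_word <- data.frame(
  word     = names(med),
  retweets = as.numeric(med),
  uses     = as.numeric(uses)
)

print(by_word[order(-by_word$retweets), ])
```

The median is less sensitive than the mean to a single viral tweet, and the uses column supports a minimum-use filter like the uses >= 5 condition above.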

Sentiment Analysis

Now we will look at the types of words used by Julia and David, which is known as sentiment analysis. In the first plot below, we calculate whether the words in David's and Julia's tweets are more positive or negative. We subtract the number of negative words from the number of positive words, and the results are shown in the bar charts. As you can see, Julia uses more positive words than David. This could be attributed to Julia using her account longer and for both professional and personal development. Both have positive sentiment overall. The next plot shows which words from David's and Julia's tweets contribute most to the positive and negative sentiments. One thing to note is that the words regression and error appear among the negative sentiment words. These words may have been scored out of context, since they are commonly used to discuss data and data models. Follow-on analyses are required to show whether those tweets are truly negative.

tweet_sentiment <- tidy_tweets %>%
  inner_join(get_sentiments("bing")) %>%
  count(person, index=1, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

ggplot(tweet_sentiment, aes(index, sentiment, fill = person)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~person, ncol = 2, scales = "free_x")+
  labs(x="Index",y="Sentiment",title="Sentiment Analysis for David and Julia")+
  scale_fill_manual(values = c("#E69F00","#56B4E9"))

bing_word_counts <- tidy_tweets %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to Sentiment",
       y = "")+
  scale_fill_manual(values = c("#E69F00","#56B4E9"))
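
The net sentiment score is simply positive words minus negative words. The base R sketch below shows the calculation with a hypothetical four-word lexicon standing in for the Bing lexicon and a few made-up tokens. It also shows how technical words such as regression and error can pull a score down.

```r
# Hypothetical mini-lexicon standing in for the Bing lexicon
lexicon <- data.frame(
  word      = c("love", "great", "error", "regression"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Hypothetical tokenized tweets (not the real data)
tokens <- data.frame(
  person = c("David", "David", "David", "David", "Julia", "Julia", "Julia"),
  word   = c("regression", "error", "great", "model", "love", "great", "error")
)

# Inner join: words missing from the lexicon (e.g. "model") are dropped
scored <- merge(tokens, lexicon, by = "word")

# Count positive and negative words per person, then take the difference
tally <- table(scored$person, scored$sentiment)
net   <- tally[, "positive"] - tally[, "negative"]

print(net)
```

In this toy example David's score is negative only because of regression and error, which mirrors the caveat about context raised above.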

Zipf’s Law

We now continue on to the tf-idf analysis. We first check whether the tweets follow Zipf's law, which states that the frequency with which a word appears is inversely proportional to its rank. The following plot shows rank against term frequency for David and Julia on log scales. The dashed gray line shows the reference values for Zipf's law. As you can see, the tweets generally follow Zipf's law. Julia's tweets are in blue, and David's tweets are in orange.

tfidf <- tidy_tweets %>% 
  group_by(person) %>% 
  filter(!str_detect(word, "^@")) %>%
  count(word, sort = TRUE) %>% 
  left_join(tidy_tweets %>% 
              group_by(person) %>% 
              summarise(total = n()))

freq_by_rank <- tfidf %>% 
  group_by(person) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total) %>%
  ungroup()

rank_subset <- freq_by_rank %>% 
  filter(rank < 500,
         rank > 10)

lin_mod<-lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)

freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = person)) + 
  geom_abline(intercept = -0.62, slope = -1.1, 
              color = "gray50", linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10()+
  scale_color_manual(values = c("#E69F00","#56B4E9"))+
  labs(x="Rank",y="Frequency",title="Zipf's Law for David and Julia's Tweets")
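
On log-log axes, Zipf's law appears as a straight line with a slope close to -1. The sketch below fits the same kind of regression as lin_mod to synthetic frequencies that follow the law exactly. The data are generated for illustration and do not come from the tweets.

```r
# Synthetic term frequencies that follow Zipf's law exactly:
# frequency is proportional to 1 / rank
rank <- 1:1000
term_frequency <- (1 / rank) / sum(1 / rank)

# Linear fit on the log-log scale; the slope estimates the Zipf exponent
fit <- lm(log10(term_frequency) ~ log10(rank))

zipf_slope <- unname(coef(fit)[2])
print(coef(fit))
```

For real text the slope usually deviates somewhat from -1. The intercept and slope returned by coef() are in the form that geom_abline() expects, so a fitted line could be drawn in place of fixed values.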

TF-IDF Model

Finally, we move on to the tf-idf calculation. As stated before, tf-idf identifies which words play a more significant role in a document, which here is each person's collection of tweets. Words that are common to both people receive a lower weight than words that are distinctive to one person. The plot below shows the words with the highest tf-idf values for Julia and David. As you can see, David references different data science topics, while Julia talks about family-related topics.

person_tf_idf <- tfidf %>%
  bind_tf_idf(word, person, n)

per_tfidf<-person_tf_idf %>%
  select(-total) %>%
  arrange(desc(tf_idf))

per_tfidf %>%
  group_by(person) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = person)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~person, ncol = 2, scales = "free") +
  labs(x = "TF-IDF", y = "Word", title="TF-IDF Values for David and Julia")+
  scale_fill_manual(values = c("#E69F00","#56B4E9"))
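
The tf-idf weights can be reproduced by hand. This base R sketch uses hypothetical word counts and treats each person as one document, as the bind_tf_idf() call above does. Term frequency is a word's share of the person's words, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word.

```r
# Hypothetical word counts per person (not the real tweets)
counts <- data.frame(
  person = c("David", "David", "David", "Julia", "Julia", "Julia"),
  word   = c("rstats", "conference", "data", "rstats", "family", "data"),
  n      = c(10, 5, 5, 10, 8, 2)
)

n_docs <- length(unique(counts$person))  # two documents: David and Julia

# tf: share of each person's words taken up by the word
counts$tf <- counts$n / ave(counts$n, counts$person, FUN = sum)

# idf: ln(number of documents / number of documents containing the word)
docs_with  <- ave(rep(1, nrow(counts)), counts$word, FUN = sum)
counts$idf <- log(n_docs / docs_with)

counts$tf_idf <- counts$tf * counts$idf

print(counts[order(-counts$tf_idf), ])
```

With only two documents, any word used by both people has an idf of ln(2/2) = 0, so only words unique to one person receive a non-zero tf-idf. That is consistent with the plot separating David's data science terms from Julia's family terms.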

Conclusions

Recommendations

The tweets written by David and Julia could share themes with those of other data science commentators or R Studio contributors. We recommend investigating other R Studio contributors. We can collect this data by reaching out to followers of Julia and David and requesting their tweets so that we can compare them with David's and Julia's results. This added information could give us insight into how R Studio and data science topics are being presented or trending on Twitter.

Technical Notes

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.4     wordcloud_2.6      RColorBrewer_1.1-2 broom_0.7.12      
##  [5] scales_1.1.1       textdata_0.4.1     tidytext_0.3.2     forcats_0.5.1     
##  [9] stringr_1.4.0      dplyr_1.0.8        purrr_0.3.4        readr_2.1.2       
## [13] tidyr_1.2.0        tibble_3.1.6       ggplot2_3.3.5      tidyverse_1.3.1   
## [17] lubridate_1.8.0   
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2        sass_0.4.0        bit64_4.0.5       vroom_1.5.7      
##  [5] jsonlite_1.8.0    modelr_0.1.8      bslib_0.3.1       assertthat_0.2.1 
##  [9] highr_0.9         cellranger_1.1.0  yaml_2.3.5        pillar_1.7.0     
## [13] backports_1.4.1   lattice_0.20-45   glue_1.6.2        digest_0.6.29    
## [17] rvest_1.0.2       colorspace_2.0-3  htmltools_0.5.2   Matrix_1.4-0     
## [21] plyr_1.8.6        pkgconfig_2.0.3   haven_2.4.3       tzdb_0.2.0       
## [25] farver_2.1.0      generics_0.1.2    ellipsis_0.3.2    withr_2.5.0      
## [29] cli_3.2.0         magrittr_2.0.2    crayon_1.5.0      readxl_1.3.1     
## [33] evaluate_0.15     tokenizers_0.2.1  janeaustenr_0.1.5 fs_1.5.2         
## [37] fansi_1.0.2       SnowballC_0.7.0   xml2_1.3.3        tools_4.1.1      
## [41] hms_1.1.1         lifecycle_1.0.1   munsell_0.5.0     reprex_2.0.1     
## [45] compiler_4.1.1    jquerylib_0.1.4   rlang_1.0.2       grid_4.1.1       
## [49] rstudioapi_0.13   labeling_0.4.2    rmarkdown_2.12    gtable_0.3.0     
## [53] DBI_1.1.2         curl_4.3.2        R6_2.5.1          knitr_1.37       
## [57] fastmap_1.1.0     bit_4.0.4         utf8_1.2.2        stringi_1.7.6    
## [61] parallel_4.1.1    Rcpp_1.0.8        vctrs_0.3.8       dbplyr_2.1.1     
## [65] tidyselect_1.1.2  xfun_0.30

Case study: Comparing Twitter Archives

Lance Ostby

2022-04-24