Updated: 2018-05-31

Motivation

In their very excellent book, Text Mining with R: A Tidy Approach, Julia Silge and David Robinson include a case study comparing their respective twitter archives.1 They do so using the R package they co-authored, tidytext, which is a lot of fun to use. So, since I like playing with tidytext, and obviously want to be cool like Julia and Dave, I thought I’d poke around in my own tweets of yore.2 As I do not have a partner in crime with whom to compare my tweets, I’ll also draw on an older (but still awesome in its own right) post from Julia’s blog, Ten Thousand Tweets.

Getting the data

I’m literally following the TidyText guide, so just mosey over there for a step-by-step breakdown.

library(tidyverse)
library(lubridate)
library(scales)
library(hrbrthemes)
# read in data
tweets <- read_csv(here::here("data", "tweets.csv"),
  col_types = cols(in_reply_to_status_id = col_character(),
                   in_reply_to_user_id = col_character(),
                   retweeted_status_id = col_character(),
                   retweeted_status_user_id = col_character(),
                   tweet_id = col_character()))

# convert datetimes using lubridate
tweets <- tweets %>%
  mutate(timestamp = ymd_hms(timestamp)) %>%
  mutate(retweeted_status_timestamp = ymd_hms(retweeted_status_timestamp)) %>%
  mutate(person = "Mara") %>%  ## add Mara as person
  arrange(desc(timestamp))  ## arrange descending by date

We can take a look at our shiny new data frame using glimpse().

glimpse(tweets)
## Observations: 27,207
## Variables: 11
## $ tweet_id                   <chr> "1002117432289497088", "10021172079...
## $ in_reply_to_status_id      <chr> "1002024904290656256", "10020296170...
## $ in_reply_to_user_id        <chr> "2343198944", "2355386972", NA, NA,...
## $ timestamp                  <dttm> 2018-05-31 09:19:48, 2018-05-31 09...
## $ source                     <chr> "<a href=\"http://twitter.com/downl...
## $ text                       <chr> "@BecomingDataSci @haziq_azizi @chr...
## $ retweeted_status_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, "10...
## $ retweeted_status_user_id   <chr> NA, NA, NA, NA, NA, NA, NA, NA, "22...
## $ retweeted_status_timestamp <dttm> NA, NA, NA, NA, NA, NA, NA, NA, 20...
## $ expanded_urls              <chr> NA, NA, "http://bit.ly/not-science,...
## $ person                     <chr> "Mara", "Mara", "Mara", "Mara", "Ma...

The data go (temporally speaking) from when I first tweeted to when I downloaded the archive. Let’s find out when that first tweet went out.

first_tweet <- tweets %>%
  summarise(min(timestamp))

first_tweet
## # A tibble: 1 x 1
##   `min(timestamp)`   
##   <dttm>             
## 1 2015-05-14 13:32:33

Tweets over time

So, we’ve got just over three years’ worth of tweets. Let’s take a look at the distribution of tweets over time. Since that works out to roughly 36 months, we’ll use 36 bins, so that each one covers approximately a month. I’ll be using theme_ipsum_rc(), which is one of the themes in Bob Rudis’ (a.k.a. @hrbrmstr) aptly named hrbrthemes package.
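
As a quick sanity check on that month count, here’s a small lubridate sketch (my own aside, not part of the original walkthrough) computing the archive’s span in whole months:

# Hedged sketch: how many whole months the archive spans, which motivates
# using roughly one histogram bin per month below.
interval(min(tweets$timestamp), max(tweets$timestamp)) %/% months(1)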

ggplot(tweets, aes(x = timestamp)) +
  geom_histogram(position = "identity", bins = 36, show.legend = FALSE, color = "white", alpha = 0.9) +
  labs(title = "@dataandme tweet volume") +  ## use labs w/ hrbrthemes
  theme_ipsum_rc()

From the look of things, I’ve been tweeting up a storm of late.

The timestamp variable contains quite a bit of information, so it’s worth tidying things up. We can use the datetime to get the date in yyyy-mm-dd format, which can then be separated into year, month, and day.3

tweets <- tweets %>%
  mutate(date = as_date(timestamp)) %>%
  separate(date, into = c("year", "month", "day"), sep = "-", remove = FALSE)
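
If you’d rather skip the string splitting, here’s an equivalent sketch using lubridate’s accessors; note that year(), month(), and day() return numbers rather than the character columns separate() produces, so the year/month plots below would need a factor() or as.character() tweak:

# Hedged alternative to separate(): lubridate accessors pull out numeric
# year, month, and day components directly from the date.
tweets_alt <- tweets %>%
  mutate(date  = as_date(timestamp),
         year  = year(date),
         month = month(date),
         day   = day(date))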

Let’s take a look at how my monthly tweet volume has changed over the years.

ggplot(tweets, aes(x = month, fill = year)) +
  geom_bar(position = "identity", show.legend = FALSE, alpha = 0.9) +
  facet_wrap(~year, ncol = 1) +
  theme_minimal() +
  theme(strip.text.x = element_text(face = "bold"),
        panel.grid.minor = element_blank())

What about the days of the week?

ggplot(data = tweets, aes(x = wday(timestamp, label = TRUE))) +
  geom_bar(show.legend = FALSE, color = "white", alpha = 0.9) +
  labs(title = "@dataandme tweets by day of the week",
       x = "days of the week") +
  theme_ipsum_rc()

It looks like, in the aggregate, I’m pretty darn consistent. But let’s run a chi-squared test to see whether the day-to-day variation is really just in the realm of random sampling error.4

chisq.test(table(wday(tweets$timestamp, label = TRUE)))
## 
##  Chi-squared test for given probabilities
## 
## data:  table(wday(tweets$timestamp, label = TRUE))
## X-squared = 144.41, df = 6, p-value < 2.2e-16

Well well well… Looks can be deceiving.
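
To see which days are actually driving that result, the standardized residuals from the same test are handy (a quick aside of mine, not part of the original analysis):

# Hedged sketch: standardized residuals show which weekdays are over- or
# under-represented relative to an equal-tweets-per-day expectation.
day_counts <- table(wday(tweets$timestamp, label = TRUE))
chisq.test(day_counts)$stdres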

What’s in those tweets?

Well, let’s take tidytext out for a spin and see.5 We’ll also need to load the stringr package because, although it is part of the tidyverse collection of packages, it’s not among the core packages that are automatically loaded when you run library(tidyverse).

library(tidytext)
library(stringr)

replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>% 
  dplyr::filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  dplyr::filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

By using unnest_tokens(), I broke all of the tweets that aren’t retweets down into individual words.
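
To see why the custom unnest_reg pattern is worth the trouble (it keeps the # and @ prefixes that the default tokenizer strips), here’s a quick sketch on a made-up tweet:

# Hedged sketch on a toy tweet: the default tokenizer drops "#" and "@",
# while the regex tokenizer used above keeps hashtags and mentions intact.
toy <- tibble(text = "ICYMI: #rstats goodies from @hadleywickham https://t.co/abc123")
toy %>%
  unnest_tokens(word, text)  # rstats, hadleywickham, plus URL fragments
toy %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg)  # keeps #rstats, @hadleywickham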

frequency <- tidy_tweets %>% 
  group_by(person) %>% 
  count(word, sort = TRUE) %>% 
  left_join(tidy_tweets %>% 
              group_by(person) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)

library(pander)
pandoc.table(top_n(frequency, 5), style = "rmarkdown")
## 
## 
## | person |      word      |  n   | total  |   freq   |
## |:------:|:--------------:|:----:|:------:|:--------:|
## |  Mara  |    #rstats     | 3289 | 199295 |  0.0165  |
## |  Mara  |    #dataviz    | 1955 | 199295 | 0.00981  |
## |  Mara  |      data      | 1509 | 199295 | 0.007572 |
## |  Mara  | @hadleywickham | 954  | 199295 | 0.004787 |
## |  Mara  |     icymi      | 906  | 199295 | 0.004546 |

So, it looks like the things I tweet about most are R (#rstats), data visualization (#dataviz), data (and/or stuff pertaining thereto), bugging Hadley Wickham, and making sure the collective “you” didn’t miss things. All noble topics in their own right.

words_by_time <- tidy_tweets %>%
  dplyr::filter(!str_detect(word, "^@")) %>%
  mutate(time_floor = floor_date(timestamp, unit = "1 month")) %>%
  count(time_floor, person, word) %>%
  ungroup() %>%
  group_by(person, time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  dplyr::filter(word_total > 30)  ## just words w/ more than 30 mentions

words_by_time
## # A tibble: 21,083 x 6
##    time_floor          person word           count time_total word_total
##    <dttm>              <chr>  <chr>          <int>      <int>      <int>
##  1 2015-05-01 00:00:00 Mara   #data4good         1        209         73
##  2 2015-05-01 00:00:00 Mara   #datascience       1        209         87
##  3 2015-05-01 00:00:00 Mara   #dataviz           3        209       1955
##  4 2015-05-01 00:00:00 Mara   #nptech            1        209         32
##  5 2015-05-01 00:00:00 Mara   #visualization     1        209        187
##  6 2015-05-01 00:00:00 Mara   article            1        209         81
##  7 2015-05-01 00:00:00 Mara   batpig             1        209        228
##  8 2015-05-01 00:00:00 Mara   books              1        209        137
##  9 2015-05-01 00:00:00 Mara   building           1        209         53
## 10 2015-05-01 00:00:00 Mara   chance             1        209         49
## # ... with 21,073 more rows

nested_data <- words_by_time %>%
  nest(-word, -person)

nested_data
## # A tibble: 861 x 3
##    person word           data             
##    <chr>  <chr>          <list>           
##  1 Mara   #data4good     <tibble [24 × 4]>
##  2 Mara   #datascience   <tibble [22 × 4]>
##  3 Mara   #dataviz       <tibble [37 × 4]>
##  4 Mara   #nptech        <tibble [10 × 4]>
##  5 Mara   #visualization <tibble [30 × 4]>
##  6 Mara   article        <tibble [31 × 4]>
##  7 Mara   batpig         <tibble [36 × 4]>
##  8 Mara   books          <tibble [33 × 4]>
##  9 Mara   building       <tibble [27 × 4]>
## 10 Mara   chance         <tibble [24 × 4]>
## # ... with 851 more rows

That little bit of nesting was for modeling purrr-poses (get it?): now we can map a model over each word’s nested data.6

library(purrr)

nested_models <- nested_data %>%
  mutate(models = map(data, ~ glm(cbind(count, time_total) ~ time_floor, ., 
                                  family = "binomial")))

nested_models
## # A tibble: 861 x 4
##    person word           data              models   
##    <chr>  <chr>          <list>            <list>   
##  1 Mara   #data4good     <tibble [24 × 4]> <S3: glm>
##  2 Mara   #datascience   <tibble [22 × 4]> <S3: glm>
##  3 Mara   #dataviz       <tibble [37 × 4]> <S3: glm>
##  4 Mara   #nptech        <tibble [10 × 4]> <S3: glm>
##  5 Mara   #visualization <tibble [30 × 4]> <S3: glm>
##  6 Mara   article        <tibble [31 × 4]> <S3: glm>
##  7 Mara   batpig         <tibble [36 × 4]> <S3: glm>
##  8 Mara   books          <tibble [33 × 4]> <S3: glm>
##  9 Mara   building       <tibble [27 × 4]> <S3: glm>
## 10 Mara   chance         <tibble [24 × 4]> <S3: glm>
## # ... with 851 more rows

Using David Robinson’s broom package, we can get the results of our nested models back into a nice, tidy format for further examination and plotting.

library(broom)

slopes <- nested_models %>%
  unnest(map(models, tidy)) %>%
  dplyr::filter(term == "time_floor") %>%
  mutate(adjusted.p.value = p.adjust(p.value))

Now let’s pull out the words whose frequency is changing significantly over time: those with an adjusted p-value below 0.1, keeping the 20 most significant.

top_slopes <- slopes %>% 
  dplyr::filter(adjusted.p.value < 0.1) %>%
  arrange(adjusted.p.value) %>%
  mutate(rank = row_number()) %>%
  dplyr::filter(rank <= 20)

top_slopes
## # A tibble: 20 x 9
##    person word             term      estimate std.error statistic  p.value
##    <chr>  <chr>            <chr>        <dbl>     <dbl>     <dbl>    <dbl>
##  1 Mara   #dataviz         time_flo… -1.31e-8  8.94e-10    -14.7  8.01e-49
##  2 Mara   #rstats          time_flo…  9.64e-9  7.59e-10     12.7  5.67e-37
##  3 Mara   data             time_flo… -1.23e-8  1.02e- 9    -12.1  9.65e-34
##  4 Mara   #machinelearning time_flo… -3.58e-8  3.00e- 9    -11.9  8.65e-33
##  5 Mara   #nba             time_flo… -2.46e-8  2.58e- 9     -9.55 1.32e-21
##  6 Mara   #datascience     time_flo… -5.09e-8  6.05e- 9     -8.41 3.97e-17
##  7 Mara   code             time_flo…  1.40e-8  1.70e- 9      8.22 2.07e-16
##  8 Mara   #maths           time_flo… -4.06e-8  5.10e- 9     -7.96 1.77e-15
##  9 Mara   #data4good       time_flo… -4.82e-8  6.18e- 9     -7.80 6.10e-15
## 10 Mara   sports           time_flo… -2.02e-8  2.76e- 9     -7.33 2.25e-13
## 11 Mara   #polog           time_flo… -7.62e-8  1.04e- 8     -7.31 2.70e-13
## 12 Mara   #datasci         time_flo… -2.97e-8  4.08e- 9     -7.29 3.11e-13
## 13 Mara   #opendata        time_flo… -4.78e-8  6.60e- 9     -7.24 4.38e-13
## 14 Mara   basketball       time_flo… -2.92e-8  4.14e- 9     -7.06 1.62e-12
## 15 Mara   #civictech       time_flo… -9.27e-8  1.33e- 8     -6.97 3.07e-12
## 16 Mara   #stemed          time_flo… -5.10e-8  7.50e- 9     -6.80 1.06e-11
## 17 Mara   totally          time_flo…  1.52e-8  2.36e- 9      6.43 1.26e-10
## 18 Mara   #bigdata         time_flo… -7.39e-8  1.16e- 8     -6.35 2.09e-10
## 19 Mara   impact           time_flo… -4.02e-8  6.52e- 9     -6.16 7.31e-10
## 20 Mara   qt               time_flo… -7.66e-8  1.31e- 8     -5.87 4.35e- 9
## # ... with 2 more variables: adjusted.p.value <dbl>, rank <int>

words_by_time %>%
  inner_join(top_slopes, by = c("word", "person")) %>%
  ggplot(aes(time_floor, count/time_total, color = word)) +
  geom_line(size = 1) +
  labs(title = "Trending words in @dataandme tweets", 
       x = NULL, y = "Word frequency",
       caption = "Extracted from Twitter archive 2018-05-31") +
  theme_ipsum_rc()
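
If that single trend plot gets hard to read, one option (my own sketch, not from the original post) is to split rising and falling words into separate panels based on the sign of each model’s slope:

# Hedged sketch: facet the trending words by whether their frequency is
# rising (positive slope estimate) or falling (negative slope estimate).
words_by_time %>%
  inner_join(top_slopes, by = c("word", "person")) %>%
  mutate(trend = if_else(estimate > 0, "rising", "falling")) %>%
  ggplot(aes(time_floor, count/time_total, color = word)) +
  geom_line(size = 1) +
  facet_wrap(~trend, ncol = 1, scales = "free_y") +
  labs(x = NULL, y = "Word frequency") +
  theme_ipsum_rc()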


All this made possible by…

library(thankr)
thankees <- shoulders()
library(DT)
datatable(thankees)
library(praise)
praise::praise()
## [1] "You are striking!"

  1. Julia (@juliasilge) and Dave (@drob) are excellent tweeters, well worth following, though given that you’re reading this, I’m guessing the probability that you already do so is pretty high.

  2. See Twitter’s Downloading your Twitter archive if you want to check out your own 140-character collection.

  3. I’m also setting the separate() argument remove = FALSE. This way we keep the date, and have individual year, month, and day values.

  4. Also, as previously mentioned, I want to be super cool like Julia, and she did this for her Ten Thousand Tweets analysis. So, there.

  5. Again, I’m still following Tidy Text Mining with R, specifically 7.2 Word frequencies here.

  6. This is very well explained in the twitter-case-study section of Tidy Text Mining with R.