Updated: 2018-05-31
In their very excellent book, Text Mining with R: A Tidy Approach, Julia Silge and David Robinson include a case study comparing their respective Twitter archives.1 They do so using tidytext, the R package they co-authored, which is a lot of fun to use. So, since I like playing with tidytext, and obviously want to be cool like Julia and Dave, I thought I’d poke around in my own tweets of yore.2 As I don’t have a partner in crime with whom to compare my tweets, I’ll also draw on an older (but still awesome in its own right) post from Julia’s blog, Ten Thousand Tweets.
I’m literally following the tidytext guide, so just mosey over there for a step-by-step breakdown.
library(tidyverse)
library(lubridate)
library(scales)
library(hrbrthemes)
# read in data (the ID columns are read in as characters,
# since the IDs are too long to store precisely as doubles)
tweets <- read_csv(here::here("data", "tweets.csv"),
                   col_types = cols(in_reply_to_status_id = col_character(),
                                    in_reply_to_user_id = col_character(),
                                    retweeted_status_id = col_character(),
                                    retweeted_status_user_id = col_character(),
                                    tweet_id = col_character()))
# convert datetimes using lubridate
tweets <- tweets %>%
  mutate(timestamp = ymd_hms(timestamp)) %>%
  mutate(retweeted_status_timestamp = ymd_hms(retweeted_status_timestamp)) %>%
  mutate(person = "Mara") %>%  ## add Mara as person
  arrange(desc(timestamp))     ## arrange descending by date
We can take a look at our shiny new data frame using glimpse().
glimpse(tweets)
## Observations: 27,207
## Variables: 11
## $ tweet_id <chr> "1002117432289497088", "10021172079...
## $ in_reply_to_status_id <chr> "1002024904290656256", "10020296170...
## $ in_reply_to_user_id <chr> "2343198944", "2355386972", NA, NA,...
## $ timestamp <dttm> 2018-05-31 09:19:48, 2018-05-31 09...
## $ source <chr> "<a href=\"http://twitter.com/downl...
## $ text <chr> "@BecomingDataSci @haziq_azizi @chr...
## $ retweeted_status_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, "10...
## $ retweeted_status_user_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, "22...
## $ retweeted_status_timestamp <dttm> NA, NA, NA, NA, NA, NA, NA, NA, 20...
## $ expanded_urls <chr> NA, NA, "http://bit.ly/not-science,...
## $ person <chr> "Mara", "Mara", "Mara", "Mara", "Ma...
The data run (temporally speaking) from when I first tweeted to when I downloaded the archive. Let’s find out when that first tweet went out.
first_tweet <- tweets %>%
  summarise(min(timestamp))
first_tweet
## # A tibble: 1 x 1
## `min(timestamp)`
## <dttm>
## 1 2015-05-14 13:32:33
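A quick sketch with lubridate’s interval arithmetic gives the archive’s span in whole months (given the first and last timestamps above, it should come out to about 36):

# whole months elapsed between the first and the last tweet
interval(min(tweets$timestamp), max(tweets$timestamp)) %/% months(1)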
So, we’ve basically got three years’ worth of tweets. Let’s take a look at the distribution of tweets over time. Since we’ve got roughly 36 months, we’ll use 36 bins, so that each one covers approximately a month. I’ll be using theme_ipsum_rc(), which is one of the themes in Bob Rudis’ (a.k.a. @hrbrmstr) aptly named hrbrthemes package.
ggplot(tweets, aes(x = timestamp)) +
  geom_histogram(position = "identity", bins = 36, show.legend = FALSE,
                 color = "white", alpha = 0.9) +
  labs(title = "@dataandme tweet volume") +  ## use labs w/ hrbrthemes
  theme_ipsum_rc()
From the look of things, I’ve been tweeting up a storm of late.
The timestamp variable contains quite a bit of information, so it’s worth tidying things up. We can use the datetime to get the date, and put it into yyyy-mm-dd format, which can then be separated into year, month, and day.3
tweets <- tweets %>%
  mutate(date = as_date(timestamp)) %>%
  separate(date, into = c("year", "month", "day"), sep = "-", remove = FALSE)
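Just to confirm that remove = FALSE kept the date column alongside the new year, month, and day columns (which come out of separate() as characters, handily discrete for counting and faceting), here’s a quick peek:

tweets %>%
  select(timestamp, date, year, month, day) %>%
  head(3)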
Let’s take a look at how my monthly tweet volume has changed over the years.
ggplot(tweets, aes(x = month, fill = year)) +
  geom_bar(position = "identity", show.legend = FALSE, alpha = 0.9) +
  facet_wrap(~year, ncol = 1) +
  theme_minimal() +
  theme(strip.text.x = element_text(face = "bold"),
        panel.grid.minor = element_blank())
What about the days of the week?
ggplot(data = tweets, aes(x = wday(timestamp, label = TRUE))) +
  geom_bar(show.legend = FALSE, color = "white", alpha = 0.9) +
  labs(title = "@dataandme tweets by day of the week",
       x = "day of the week") +
  theme_ipsum_rc()
It looks like, in the aggregate, I’m pretty darn consistent. But let’s run a chi-squared test to see whether the day-to-day variation really falls within the realm of random sampling error.4
chisq.test(table(wday(tweets$timestamp, label = TRUE)))
##
## Chi-squared test for given probabilities
##
## data: table(wday(tweets$timestamp, label = TRUE))
## X-squared = 144.41, df = 6, p-value < 2.2e-16
Well well well… Looks can be deceiving.
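To see which days are doing the deceiving, we can peek at the Pearson residuals from that same chi-squared test (a minimal sketch; positive values mean more tweets than a uniform week would predict):

dow_counts <- table(wday(tweets$timestamp, label = TRUE))
round(chisq.test(dow_counts)$residuals, 2)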
So, what have I actually been tweeting about all this time? Let’s take tidytext out for a spin and see.5 We’ll also need to load the stringr package because, although it’s part of the tidyverse collection of packages, it’s not among the core packages that are automatically loaded when you run library(tidyverse).
library(tidytext)
library(stringr)
replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>%
  dplyr::filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  dplyr::filter(!word %in% stop_words$word,
                str_detect(word, "[a-z]"))
By using unnest_tokens(), I broke all of the non-retweet tweets down into individual words.
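To see what those regexes are actually doing, here’s a minimal sanity check on a made-up tweet (hypothetical data; it reuses the replace_reg and unnest_reg patterns from above):

# a fake tweet with an HTML entity, a mention, a hashtag, and a t.co link
demo <- tibble(text = "Loving #rstats &amp; @hadleywickham's pkgs! https://t.co/abc123")
demo %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg)
# should leave the lowercased tokens: loving, #rstats, @hadleywickham's, pkgs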
frequency <- tidy_tweets %>%
  group_by(person) %>%
  count(word, sort = TRUE) %>%
  left_join(tidy_tweets %>%
              group_by(person) %>%
              summarise(total = n()),
            by = "person") %>%
  mutate(freq = n / total)
library(pander)
pandoc.table(top_n(frequency, 5), style = "rmarkdown")
##
##
## | person | word | n | total | freq |
## |:------:|:--------------:|:----:|:------:|:--------:|
## | Mara | #rstats | 3289 | 199295 | 0.0165 |
## | Mara | #dataviz | 1955 | 199295 | 0.00981 |
## | Mara | data | 1509 | 199295 | 0.007572 |
## | Mara | @hadleywickham | 954 | 199295 | 0.004787 |
## | Mara | icymi | 906 | 199295 | 0.004546 |
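Since tables only get you so far, here’s a minimal bar-chart sketch of the same frequency data frame, just to make the pecking order easier to see:

frequency %>%
  top_n(10, n) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(alpha = 0.9) +
  coord_flip() +
  labs(title = "Most-used words in @dataandme tweets",
       x = NULL, y = "n") +
  theme_ipsum_rc()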
So, it looks like the things I tweet about most are R (#rstats), data visualization (#dataviz), data (and/or stuff pertaining thereto), bugging Hadley Wickham, and making sure the collective “you” didn’t miss things. All noble topics in their own right. But has what I tweet about changed over time? To find out, we’ll bin the tweets by month and count how often each word shows up in each bin, keeping only words mentioned more than 30 times overall.
words_by_time <- tidy_tweets %>%
  dplyr::filter(!str_detect(word, "^@")) %>%
  mutate(time_floor = floor_date(timestamp, unit = "1 month")) %>%
  count(time_floor, person, word) %>%
  ungroup() %>%
  group_by(person, time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  dplyr::filter(word_total > 30)  ## just words w/ 30+ mentions
words_by_time
## # A tibble: 21,083 x 6
## time_floor person word count time_total word_total
## <dttm> <chr> <chr> <int> <int> <int>
## 1 2015-05-01 00:00:00 Mara #data4good 1 209 73
## 2 2015-05-01 00:00:00 Mara #datascience 1 209 87
## 3 2015-05-01 00:00:00 Mara #dataviz 3 209 1955
## 4 2015-05-01 00:00:00 Mara #nptech 1 209 32
## 5 2015-05-01 00:00:00 Mara #visualization 1 209 187
## 6 2015-05-01 00:00:00 Mara article 1 209 81
## 7 2015-05-01 00:00:00 Mara batpig 1 209 228
## 8 2015-05-01 00:00:00 Mara books 1 209 137
## 9 2015-05-01 00:00:00 Mara building 1 209 53
## 10 2015-05-01 00:00:00 Mara chance 1 209 49
## # ... with 21,073 more rows
nested_data <- words_by_time %>%
  nest(-word, -person)
nested_data
## # A tibble: 861 x 3
## person word data
## <chr> <chr> <list>
## 1 Mara #data4good <tibble [24 × 4]>
## 2 Mara #datascience <tibble [22 × 4]>
## 3 Mara #dataviz <tibble [37 × 4]>
## 4 Mara #nptech <tibble [10 × 4]>
## 5 Mara #visualization <tibble [30 × 4]>
## 6 Mara article <tibble [31 × 4]>
## 7 Mara batpig <tibble [36 × 4]>
## 8 Mara books <tibble [33 × 4]>
## 9 Mara building <tibble [27 × 4]>
## 10 Mara chance <tibble [24 × 4]>
## # ... with 851 more rows
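Each element of the data list-column is just a little per-word tibble of monthly counts; peeking at the first one (which, per the output above, holds the 24 observed months for #data4good):

nested_data$data[[1]]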
For modeling purrr-poses (get it?), we did that little bit of nesting.6 Now we can use purrr’s map() to fit a model, a binomial GLM of each word’s count over time, to each of the nested data frames.
library(purrr)
nested_models <- nested_data %>%
  mutate(models = map(data, ~ glm(cbind(count, time_total) ~ time_floor, .,
                                  family = "binomial")))
nested_models
## # A tibble: 861 x 4
## person word data models
## <chr> <chr> <list> <list>
## 1 Mara #data4good <tibble [24 × 4]> <S3: glm>
## 2 Mara #datascience <tibble [22 × 4]> <S3: glm>
## 3 Mara #dataviz <tibble [37 × 4]> <S3: glm>
## 4 Mara #nptech <tibble [10 × 4]> <S3: glm>
## 5 Mara #visualization <tibble [30 × 4]> <S3: glm>
## 6 Mara article <tibble [31 × 4]> <S3: glm>
## 7 Mara batpig <tibble [36 × 4]> <S3: glm>
## 8 Mara books <tibble [33 × 4]> <S3: glm>
## 9 Mara building <tibble [27 × 4]> <S3: glm>
## 10 Mara chance <tibble [24 × 4]> <S3: glm>
## # ... with 851 more rows
Using David Robinson’s broom package, we can get the results of our nested models back into a nice, tidy format for further examination and plotting.
library(broom)
slopes <- nested_models %>%
  unnest(map(models, tidy)) %>%
  dplyr::filter(term == "time_floor") %>%
  mutate(adjusted.p.value = p.adjust(p.value))  ## adjust for multiple comparisons (Holm, by default)
Now let’s take a look at the words whose frequencies have changed most significantly over time, i.e. those with the smallest adjusted p-values (keeping the top 20).
top_slopes <- slopes %>%
  dplyr::filter(adjusted.p.value < 0.1) %>%
  arrange(adjusted.p.value) %>%
  mutate(rank = row_number()) %>%
  dplyr::filter(rank <= 20)
top_slopes
## # A tibble: 20 x 9
## person word term estimate std.error statistic p.value
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Mara #dataviz time_flo… -1.31e-8 8.94e-10 -14.7 8.01e-49
## 2 Mara #rstats time_flo… 9.64e-9 7.59e-10 12.7 5.67e-37
## 3 Mara data time_flo… -1.23e-8 1.02e- 9 -12.1 9.65e-34
## 4 Mara #machinelearning time_flo… -3.58e-8 3.00e- 9 -11.9 8.65e-33
## 5 Mara #nba time_flo… -2.46e-8 2.58e- 9 -9.55 1.32e-21
## 6 Mara #datascience time_flo… -5.09e-8 6.05e- 9 -8.41 3.97e-17
## 7 Mara code time_flo… 1.40e-8 1.70e- 9 8.22 2.07e-16
## 8 Mara #maths time_flo… -4.06e-8 5.10e- 9 -7.96 1.77e-15
## 9 Mara #data4good time_flo… -4.82e-8 6.18e- 9 -7.80 6.10e-15
## 10 Mara sports time_flo… -2.02e-8 2.76e- 9 -7.33 2.25e-13
## 11 Mara #polog time_flo… -7.62e-8 1.04e- 8 -7.31 2.70e-13
## 12 Mara #datasci time_flo… -2.97e-8 4.08e- 9 -7.29 3.11e-13
## 13 Mara #opendata time_flo… -4.78e-8 6.60e- 9 -7.24 4.38e-13
## 14 Mara basketball time_flo… -2.92e-8 4.14e- 9 -7.06 1.62e-12
## 15 Mara #civictech time_flo… -9.27e-8 1.33e- 8 -6.97 3.07e-12
## 16 Mara #stemed time_flo… -5.10e-8 7.50e- 9 -6.80 1.06e-11
## 17 Mara totally time_flo… 1.52e-8 2.36e- 9 6.43 1.26e-10
## 18 Mara #bigdata time_flo… -7.39e-8 1.16e- 8 -6.35 2.09e-10
## 19 Mara impact time_flo… -4.02e-8 6.52e- 9 -6.16 7.31e-10
## 20 Mara qt time_flo… -7.66e-8 1.31e- 8 -5.87 4.35e- 9
## # ... with 2 more variables: adjusted.p.value <dbl>, rank <int>
words_by_time %>%
  inner_join(top_slopes, by = c("word", "person")) %>%
  ggplot(aes(time_floor, count / time_total, color = word)) +
  geom_line(size = 1) +
  labs(title = "Trending words in @dataandme tweets",
       x = NULL, y = "Word frequency",
       caption = "Extracted from Twitter archive 2018-05-31") +
  theme_ipsum_rc()
In keeping with the slope estimates above, #rstats, code, and totally are trending upward, while the data-adjacent hashtags of my early tweeting days (#dataviz, #datascience, #machinelearning, and friends) are on the wane. And, since none of this would be possible without R package developers, let’s use thankr to see whose shoulders I’ve been standing on.
library(thankr)
## identify the maintainers of the packages loaded in this session
thankees <- shoulders()
library(DT)
## show the thankees in a sortable, interactive table
datatable(thankees)
library(praise)
## and, to close things out, a little self-administered positive reinforcement
praise::praise()
## [1] "You are striking!"
Julia (@juliasilge) and Dave (@drob) are excellent tweeters, well worth following. Though, given that you’re reading this, I’m guessing the probability that you already do so is pretty high.↩
See Twitter’s Downloading your Twitter archive if you want to check out your own 140-character collection.↩
I’m also setting the separate() argument remove = FALSE. This way we keep the date column and get individual year, month, and day values.↩
Also, as previously mentioned, I want to be super cool like Julia, and she did this for her Ten Thousand Tweets analysis. So, there.↩
Again, I’m still following Text Mining with R, specifically the 7.2 Word frequencies section.↩
This is very well explained in the twitter-case-study section of Text Mining with R.↩