Dirk on SO chat:
We need someone to write a script to parse r-devel archives to find how often Luke screamed DO NOT DO THIS!!!. Is today a first?
Archives of the r-devel mailing list are available online.
Shall we scrape?
robotstxt::paths_allowed("https://stat.ethz.ch/pipermail/r-devel/")
## [1] TRUE
Let’s then.
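library(tidyverse)
library(rvest)
library(knitr)  ## kable(), used for the tables below
library(scales) ## comma(), used to format the word counts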
r_devel_links <- read_html("https://stat.ethz.ch/pipermail/r-devel/") %>%
  html_node("table") %>%
  html_nodes("td ~ td + td a") %>%
  html_attr("href")
## tibble() replaces the deprecated data_frame()
r_devel_text <- tibble(date = r_devel_links) %>%
  mutate(link = paste0("https://stat.ethz.ch/pipermail/r-devel/", date),
         date = str_extract(date, "[^\\.]*")) %>%
  mutate(text = map(link, read_lines)) %>%
  unnest(text)
Let’s clean up the text, removing quoted text from messages being replied to, and identifying who sent each message and where messages start.
library(lubridate)
r_devel_parsed <- r_devel_text %>%
  filter(!str_detect(text, "<|>|^\\|"), ## strip out text from messages being replied to and headers
         !str_detect(text, "^From ")) %>%
  mutate(date = ymd(paste0(date, "-01")),
         author = ifelse(str_detect(text, "^From:"), text, NA),
         author = str_extract(author, "\\(.+\\)"),
         message_id = ifelse(str_detect(author, "[:alpha:]"), row_number(), NA),
         author = str_replace_all(author, "\\(|\\)", "")) %>%
  fill(author, message_id, .direction = "down") %>%
  mutate(author = case_when(str_detect(str_to_lower(author), "dalgaard") ~ "Peter Dalgaard",
                            str_detect(str_to_lower(author), "ripley") ~ "Brian Ripley",
                            str_detect(str_to_lower(author), "maechler") ~ "Martin Maechler",
                            str_detect(str_to_lower(author), "ligges") ~ "Uwe Ligges",
                            str_detect(str_to_lower(author), "wickham") ~ "Hadley Wickham",
                            str_detect(str_to_lower(author), "tierney") ~ "Luke Tierney",
                            TRUE ~ author)) %>%
  filter(author != "en148") ## I think these are some kind of auto generated messages from a decade ago?
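To make the header-and-fill step easier to follow, here is a minimal sketch on a handful of made-up archive lines (the names and addresses are hypothetical, not real r-devel content). It marks each `From:` line, pulls the name out of the parentheses, uses the row number of the header as a message id, and then fills both values down over the body lines. It uses `!is.na(author)` where the pipeline above uses `str_detect(author, "[:alpha:]")`, which is a simplification for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)

## made-up archive lines for illustration only
toy <- tibble(text = c(
  "From: someone at example.org (Ann Author)",
  "First line of the first message",
  "Second line of the first message",
  "From: other at example.org (Bob Builder)",
  "Only line of the second message"
))

toy_parsed <- toy %>%
  mutate(author = if_else(str_detect(text, "^From:"), text, NA_character_),
         ## keep only the "(Name)" part of the header line
         author = str_extract(author, "\\(.+\\)"),
         ## the row number of each header line doubles as a message id
         message_id = if_else(!is.na(author), row_number(), NA_integer_),
         author = str_replace_all(author, "\\(|\\)", "")) %>%
  ## carry author and id down to the body lines of each message
  fill(author, message_id, .direction = "down")

toy_parsed
```

Every line of a message ends up carrying the author and id of the header above it, which is what lets later steps count messages with `distinct(author, message_id)`.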
Who has sent the most messages?
top_12 <- r_devel_parsed %>%
  distinct(author, message_id) %>%
  count(author, sort = TRUE) %>%
  top_n(12)
top_12_authors <- top_12 %>% pull(author)
top_12 %>%
  kable(col.names = c("Author", "Total number of messages since 1997"))
| Author | Total number of messages since 1997 |
|---|---|
| Brian Ripley | 7351 |
| Peter Dalgaard | 5029 |
| Duncan Murdoch | 3242 |
| Martin Maechler | 2158 |
| Kurt Hornik | 1780 |
| Paul Gilbert | 1573 |
| Simon Urbanek | 1310 |
| Thomas Lumley | 1309 |
| Uwe Ligges | 1290 |
| Gabor Grothendieck | 1195 |
| Dirk Eddelbuettel | 951 |
| Henrik Bengtsson | 800 |
Who is sending more or fewer messages over time?
r_devel_parsed %>%
  mutate(year = year(date)) %>%
  distinct(year, author, message_id) %>%
  count(year, author) %>%
  filter(author %in% top_12_authors) %>%
  ggplot(aes(year, n, color = author)) +
  geom_line(size = 1.5, alpha = 0.8) +
  expand_limits(y = 0) +
  theme(legend.position="none") +
  facet_wrap(~author) +
  labs(x = NULL, y = "# of messages per year",
       title = "Who posts messages on r-devel?",
       subtitle = "2017 is not quite a full year yet")
[Plot: number of messages per year on r-devel, one panel for each of the top 12 authors]
Did Brian Ripley really post almost 1000 messages to r-devel in 2001 and 2002? As far as I can tell from spot-checking the data, yes.
Who has written the most words since 2004? Before this, I can’t easily parse which words in an email are written by the author and which are part of an email being responded to.
(This accounting of words includes code, links, email signatures, etc., and not just written English words.)
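To see why signatures and the like end up in the totals, here is a small sketch on two made-up lines (a hypothetical author, not real r-devel content). `unnest_tokens()` splits every line into lowercase word tokens without knowing which lines are prose and which are a sign-off, so both count toward the author.

```r
library(dplyr)
library(tidytext)

## made-up message lines for illustration only
toy_message <- tibble(
  author = "Example Author",
  text = c("I do not think so.",  ## prose
           "Example Author")     ## signature line
)

toy_words <- toy_message %>%
  unnest_tokens(word, text)

## every token counts toward the author's total, signature included
count(toy_words, author)
```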
library(tidytext)
line_starts <- regex("^From: |^Date: |^Subject: |^=-=-=-=-=-=-=-=-=-=-=-=-|^r-devel mailing list |^Send \"info\", \"help\",|^\\(in the \"body\", not the|^-.-.-.-.-.-.-.-.-.-.-.-.-.|^_._._._._._._._._._._._._._._._._._._._._|^In-Reply-To: |^On ")
tidy_r_devel <- r_devel_parsed %>%
  filter(date >= "2004-01-01") %>%
  filter(!str_detect(text, line_starts)) %>%
  unnest_tokens(word, text)
tidy_r_devel %>%
  count(author, sort = TRUE) %>%
  top_n(10) %>%
  mutate(n = comma(round(n, -2))) %>%
  kable(col.names = c("Author", "Total number of words since 2004"))
| Author | Total number of words since 2004 |
|---|---|
| Brian Ripley | 459,700 |
| Duncan Murdoch | 214,700 |
| Martin Maechler | 173,400 |
| Peter Dalgaard | 146,400 |
| Simon Urbanek | 129,000 |
| Henrik Bengtsson | 107,900 |
| Dirk Eddelbuettel | 106,200 |
| Gabor Grothendieck | 99,400 |
| Ben Bolker | 56,100 |
| Martin Morgan | 56,000 |
These are the more loquacious posters.
tidy_r_devel %>%
  mutate(year = year(date)) %>%
  filter(author %in% top_12_authors) %>%
  count(year, author, message_id) %>%
  group_by(year, author) %>%
  summarise(words = median(n)) %>%
  ggplot(aes(year, words, color = author)) +
  geom_line(size = 1.5, alpha = 0.8) +
  geom_smooth(method = "lm", lty = 2, size = 0.9, se = FALSE) +
  expand_limits(y = 0) +
  theme(legend.position="none") +
  facet_wrap(~author) +
  labs(x = NULL, y = "Median words per message",
       title = "How long are messages on r-devel?",
       subtitle = "Most users don't exhibit dramatic changes, even over more than a decade!")
[Plot: median words per message by year, one panel for each of the top 12 authors, with a dashed linear trend line]
Has Luke Tierney ever specifically told anyone not to do something before?
r_devel_parsed %>%
  mutate(text = str_to_lower(text)) %>%
  filter(author == "Luke Tierney",
         str_detect(text, "do not|should not")) %>%
  distinct(date, text) %>%
  kable()
| date | text |
|---|---|
| 2017-11-01 | do not do this!!! setlength and set_truelength are not part of the api |
| 2017-09-01 | framework). merging the remainder of the framework should not |
| 2016-11-01 | making assumptions that it should not. |
| 2016-09-01 | should not assume anything about the order in which finalizers are run. |
| 2015-03-01 | implementation. i do not think your proposed approach would do that. |
| 2015-01-01 | r’s semantics do not permit this sort of optimization in general – |
| 2015-01-01 | costs that should not need to be paid. but maybe we can leave that to |
| 2014-08-01 | there should not be – the language manual i believe had language that |
| 2014-08-01 | suggests that this is an implementation detail that should not be |
| 2011-07-01 | able to define replacement functions that do not duplicate in cases |
| 2011-03-01 | i do not. |
| 2011-03-01 | i do not know |
| 2011-01-01 | there are lots of functions in the internals, api or not, that do not |
| 2008-10-01 | have a class. you can argue that this should not be so but it is so, |
| 2008-10-01 | accessors should not retrun r_missingarg either and will be fixed in |
| 2008-09-01 | we definitely do not want this frozen into the public api. |
| 2007-09-01 | environment should not return a list with promises in as promises |
| 2007-09-01 | should not be visible at the r level. (another loophole that needs |
| 2007-09-01 | closing is $ for environments). so behavior of results that should not |
| 2007-09-01 | # tempted to remove it for efficiency but do not: it is needed |
| 2007-09-01 | that should not usually cause much total growth. if you are ooking at |
| 2006-10-01 | any changes do not affect performance will be tricky. |
| 2006-05-01 | it should not be too hard to write a delayedassignmentreset function |
| 2005-10-01 | with the possible ramifications. in any case but i also do not |
| 2005-09-01 | enough to schedule in the near future. so for now you should not use |
| 2004-04-01 | second behavior. but we do not have it. set_slot is a macro that |
| 2002-09-01 | resources that we do not have convenient means of releasing explicitly |
| 2002-05-01 | the dispatch change should not be noticed at all but is needed to |
| 2002-03-01 | we do not yet have a syntactic mechanism like dbi::connect for |
| 2002-02-01 | signals and has a default sigint handler. if mkl/atlas do not provide |
| 2001-11-01 | system calls after signals, others do not. for those that don’t, low |
| 1998-06-01 | functional language. (another reason is that most s users do not come |
| 1998-06-01 | a feature. i do not think it is the best design choice. but it is the |
| 1997-09-01 | thinking they have independent runs when they probably do not in any |
| 1997-05-01 | why does r have the graphical parameters “xlog” and “ylog”? they do not |
| 1997-04-01 | a module (mainly for debugging – maybe this should not |
Nope! No commands to not do things before.
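One way to separate commands from lines that merely mention "do not" would be to anchor the pattern at the start of the line. A minimal sketch, using three lines copied from the table above; the anchored pattern is an assumption for illustration, not something the analysis above used, and it would miss a command that starts in the middle of a wrapped line.

```r
library(dplyr)
library(stringr)

## three lowercased lines taken from the table above
luke_lines <- tibble(text = c(
  "do not do this!!! setlength and set_truelength are not part of the api",
  "implementation. i do not think your proposed approach would do that.",
  "we definitely do not want this frozen into the public api."
))

luke_commands <- luke_lines %>%
  ## TRUE only when the line itself opens with "do not"
  mutate(is_command = str_detect(text, "^do not\\b"))

luke_commands
```

Only the first line is flagged; the other two contain "do not" but do not open with it.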