Dirk on SO chat:

We need someone to write a script to parse r-devel archives to find how often Luke screamed DO NOT DO THIS!!!. Is today a first?

Archives of the r-devel mailing list are available online.

Shall we scrape?

robotstxt::paths_allowed("https://stat.ethz.ch/pipermail/r-devel/")
## [1] TRUE

Let’s then.

library(tidyverse)
library(rvest)
library(knitr)   ## for kable()
library(scales)  ## for comma()

r_devel_links <- read_html("https://stat.ethz.ch/pipermail/r-devel/") %>%
    html_node("table") %>%
    html_nodes("td ~ td + td a") %>%
    html_attr("href")


r_devel_text <- tibble(date = r_devel_links) %>%
    mutate(link = paste0("https://stat.ethz.ch/pipermail/r-devel/", date),
           date = str_extract(date, "[^\\.]*")) %>%
    mutate(text = map(link, read_lines)) %>%
    unnest(text)
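
To make the mutate() step concrete, here is a minimal sketch on made-up hrefs. It assumes each link in the archive table looks like 2017-November.txt.gz; the hrefs vector is hypothetical and nothing is downloaded.

```r
library(stringr)

# Hypothetical hrefs, in the form the archive index is assumed to use
hrefs <- c("2017-November.txt.gz", "2017-October.txt.gz")

# Full URL of each monthly archive
links <- paste0("https://stat.ethz.ch/pipermail/r-devel/", hrefs)

# Everything before the first "." becomes the month label
month_label <- str_extract(hrefs, "[^\\.]*")
month_label
```

The month label is what later gets "-01" appended so that it can be parsed as a date.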

Let’s clean up the text, removing quoted text from messages being replied to, and identifying who sent each message and where messages start.

library(lubridate)
r_devel_parsed <- r_devel_text %>%
    filter(!str_detect(text, "<|>|^\\|"),   ## strip out text from messages being replied to and headers
           !str_detect(text, "^From ")) %>%
    mutate(date = ymd(paste0(date, "-01")),
           author = ifelse(str_detect(text, "^From:"), text, NA),
           author = str_extract(author, "\\(.+\\)"),
           message_id = ifelse(str_detect(author, "[:alpha:]"), row_number(), NA),
           author = str_replace_all(author, "\\(|\\)", "")) %>%
    fill(author, message_id, .direction = "down") %>%
    mutate(author = case_when(str_detect(str_to_lower(author), "dalgaard") ~ "Peter Dalgaard",
                              str_detect(str_to_lower(author), "ripley") ~ "Brian Ripley",
                              str_detect(str_to_lower(author), "maechler") ~ "Martin Maechler",
                              str_detect(str_to_lower(author), "ligges") ~ "Uwe Ligges",
                              str_detect(str_to_lower(author), "wickham") ~ "Hadley Wickham",
                              str_detect(str_to_lower(author), "tierney") ~ "Luke Tierney",
                              TRUE ~ author)) %>%
    filter(author != "en148") ## I think these are some kind of auto generated messages from a decade ago?
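
To see how the author and message boundaries fall out of this, here is a small sketch that runs the same steps on a few invented lines. The toy lines only imitate the archive format as I understand it (a "From " separator, then a "From:" header with the name in parentheses); they are not real messages.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical lines in the style of a pipermail archive (not real messages)
toy <- tibble(text = c(
  "From someone at example.org  Wed Nov  1 10:00:00 2017",
  "From: someone at example.org (Some One)",
  "Subject: [Rd] a question",
  "Is this allowed?",
  "> quoted line from an earlier message",
  "From other at example.org  Wed Nov  1 11:00:00 2017",
  "From: other at example.org (Other Person)",
  "No, please do not."
))

parsed <- toy %>%
  # drop quoted lines and the "From " separator lines
  filter(!str_detect(text, "<|>|^\\|"),
         !str_detect(text, "^From ")) %>%
  # the "From:" header carries the name in parentheses and marks a new message
  mutate(author = ifelse(str_detect(text, "^From:"), text, NA),
         author = str_extract(author, "\\(.+\\)"),
         message_id = ifelse(str_detect(author, "[:alpha:]"), row_number(), NA),
         author = str_replace_all(author, "\\(|\\)", "")) %>%
  # carry the author and message id down to the body lines
  fill(author, message_id, .direction = "down")

parsed
```

Each remaining line ends up labelled with the author and the row number of the "From:" header that opened its message, so the message ids are unique but not consecutive.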

Who has sent the most messages?

top_12 <- r_devel_parsed %>%
    distinct(author, message_id) %>%
    count(author, sort = TRUE) %>%
    top_n(12) 

top_12_authors <- top_12 %>% pull(author)

top_12 %>%
    kable(col.names = c("Author", "Total number of messages since 1997"))
Author              Total number of messages since 1997
Brian Ripley        7351
Peter Dalgaard      5029
Duncan Murdoch      3242
Martin Maechler     2158
Kurt Hornik         1780
Paul Gilbert        1573
Simon Urbanek       1310
Thomas Lumley       1309
Uwe Ligges          1290
Gabor Grothendieck  1195
Dirk Eddelbuettel   951
Henrik Bengtsson    800

Who is sending more or fewer messages over time?

r_devel_parsed %>%
    mutate(year = year(date)) %>%
    distinct(year, author, message_id) %>%
    count(year, author) %>%
    filter(author %in% top_12_authors) %>%
    ggplot(aes(year, n, color = author)) +
    geom_line(size = 1.5, alpha = 0.8) +
    expand_limits(y = 0) +
    theme(legend.position="none") +
    facet_wrap(~author) +
    labs(x = NULL, y = "# of messages per year",
         title = "Who posts messages on r-devel?",
         subtitle = "2017 is not quite a full year yet")
plot of chunk posts_over_time

Did Brian Ripley really post almost 1000 messages to r-devel in 2001 and 2002? As far as I can tell from spot-checking the data, yes.

Who has written the most words since 2004? Before this, I can’t easily parse which words in an email are written by the author and which are part of an email being responded to.

(This accounting of words includes code, links, email signatures, etc., and not just written English words.)

library(tidytext)

line_starts <- regex("^From: |^Date: |^Subject: |^=-=-=-=-=-=-=-=-=-=-=-=-|^r-devel mailing list |^Send \"info\", \"help\",|^\\(in the \"body\", not the|^-.-.-.-.-.-.-.-.-.-.-.-.-.|^_._._._._._._._._._._._._._._._._._._._._|^In-Reply-To: |^On ")

tidy_r_devel <- r_devel_parsed %>%
    filter(date >= "2004-01-01") %>%
    filter(!str_detect(text, line_starts)) %>%
    unnest_tokens(word, text)

tidy_r_devel %>%
    count(author, sort = TRUE) %>%
    top_n(10) %>%
    mutate(n = comma(round(n, -2))) %>%
    kable(col.names = c("Author", "Total number of words since 2004"))
Author              Total number of words since 2004
Brian Ripley        459,700
Duncan Murdoch      214,700
Martin Maechler     173,400
Peter Dalgaard      146,400
Simon Urbanek       129,000
Henrik Bengtsson    107,900
Dirk Eddelbuettel   106,200
Gabor Grothendieck  99,400
Ben Bolker          56,100
Martin Morgan       56,000
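
As a rough illustration of how these word counts come about, the sketch below drops header-like lines from a few invented lines and tokenizes the rest. The header_like pattern is a shortened, hypothetical stand-in for line_starts, and the toy lines are made up.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical message lines (not taken from the archive)
toy <- tibble(text = c(
  "Subject: [Rd] a question",
  "On Wed, 1 Nov 2017, Some One wrote:",
  "Thanks, that works",
  "Please see the NEWS file"
))

# Shortened stand-in for the line_starts pattern
header_like <- regex("^From: |^Date: |^Subject: |^In-Reply-To: |^On ")

toy_words <- toy %>%
  filter(!str_detect(text, header_like)) %>%  # drop headers and "On ... wrote:" lines
  unnest_tokens(word, text)                   # one lowercased word per row

nrow(toy_words)
```

Anything that survives the filter is tokenized and counted, which is why code, links, and signatures end up in the totals.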

Which of the top posters are the more loquacious, message for message?

tidy_r_devel %>%
    mutate(year = year(date)) %>%
    filter(author %in% top_12_authors) %>%    
    count(year, author, message_id) %>%
    group_by(year, author) %>%
    summarise(words = median(n)) %>%
    ggplot(aes(year, words, color = author)) +
    geom_line(size = 1.5, alpha = 0.8) +
    geom_smooth(method = "lm", lty = 2, size = 0.9, se = FALSE) +
    expand_limits(y = 0) +
    theme(legend.position="none") +
    facet_wrap(~author) +
    labs(x = NULL, y = "Median words per message",
         title = "How long are messages on r-devel?",
         subtitle = "Most users don't exhibit dramatic changes, even over more than a decade!")
plot of chunk words_over_time
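
The quantity plotted here is the median of per-message word counts, not a total. A minimal sketch with invented data (authors "A" and "B" and the message ids are hypothetical) shows the two steps:

```r
library(dplyr)

# One row per word, as after unnest_tokens(); values are made up
toy_words <- tibble(
  year = 2017,
  author = c(rep("A", 6), rep("B", 3)),
  message_id = c(1, 1, 1, 2, 3, 3, 4, 4, 5)
)

# Step 1: words per message
per_message <- toy_words %>%
  count(year, author, message_id)

# Step 2: median message length per author and year
medians <- per_message %>%
  group_by(year, author) %>%
  summarise(words = median(n), .groups = "drop")

medians
```

Author "A" has messages of 3, 1, and 2 words, so a median of 2; author "B" has messages of 2 and 1 words, so a median of 1.5.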

Has Luke Tierney ever specifically told anyone not to do something before?

r_devel_parsed %>%
    mutate(text = str_to_lower(text)) %>%
    filter(author == "Luke Tierney",
           str_detect(text, "do not|should not")) %>%
    distinct(date, text) %>%
    kable()
date text
2017-11-01 do not do this!!! setlength and set_truelength are not part of the api
2017-09-01 framework). merging the remainder of the framework should not
2016-11-01 making assumptions that it should not.
2016-09-01 should not assume anything about the order in which finalizers are run.
2015-03-01 implementation. i do not think your proposed approach would do that.
2015-01-01 r’s semantics do not permit this sort of optimization in general –
2015-01-01 costs that should not need to be paid. but maybe we can leave that to
2014-08-01 there should not be – the language manual i believe had language that
2014-08-01 suggests that this is an implementation detail that should not be
2011-07-01 able to define replacement functions that do not duplicate in cases
2011-03-01 i do not.
2011-03-01 i do not know
2011-01-01 there are lots of functions in the internals, api or not, that do not
2008-10-01 have a class. you can argue that this should not be so but it is so,
2008-10-01 accessors should not retrun r_missingarg either and will be fixed in
2008-09-01 we definitely do not want this frozen into the public api.
2007-09-01 environment should not return a list with promises in as promises
2007-09-01 should not be visible at the r level. (another loophole that needs
2007-09-01 closing is $ for environments). so behavior of results that should not
2007-09-01 # tempted to remove it for efficiency but do not: it is needed
2007-09-01 that should not usually cause much total growth. if you are ooking at
2006-10-01 any changes do not affect performance will be tricky.
2006-05-01 it should not be too hard to write a delayedassignmentreset function
2005-10-01 with the possible ramifications. in any case but i also do not
2005-09-01 enough to schedule in the near future. so for now you should not use
2004-04-01 second behavior. but we do not have it. set_slot is a macro that
2002-09-01 resources that we do not have convenient means of releasing explicitly
2002-05-01 the dispatch change should not be noticed at all but is needed to
2002-03-01 we do not yet have a syntactic mechanism like dbi::connect for
2002-02-01 signals and has a default sigint handler. if mkl/atlas do not provide
2001-11-01 system calls after signals, others do not. for those that don’t, low
1998-06-01 functional language. (another reason is that most s users do not come
1998-06-01 a feature. i do not think it is the best design choice. but it is the
1997-09-01 thinking they have independent runs when they probably do not in any
1997-05-01 why does r have the graphical parameters “xlog” and “ylog”? they do not
1997-04-01 a module (mainly for debugging – maybe this should not

Nope! No commands to not do things before.
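
The search above lowercases the text first, so it finds any mention of "do not" or "should not" rather than shouting specifically. As a rough sketch of a stricter check, one could keep the original case and look for a run of upper-case words followed by exclamation marks. The toy_lines vector and the pattern are illustrative assumptions, not something I ran against the archive.

```r
library(stringr)

# Hypothetical lines (the first imitates the message that started all this)
toy_lines <- c(
  "DO NOT DO THIS!!! This is not part of the API.",
  "I do not think that would work.",
  "This should not be relied on."
)

# Case-insensitive mentions, as in the search above
mentions <- str_detect(str_to_lower(toy_lines), "do not|should not")

# Two or more upper-case words in a row, ending in at least one "!"
shouted <- str_detect(toy_lines, "\\b[A-Z]{2,}( [A-Z]{2,})+!+")

data.frame(mentions, shouted)
```

All three toy lines count as mentions, but only the first counts as shouting.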