Motivation

I’m curious to know which R packages and functions I use the most. To answer that question, I wrote some R code (Meta, I know…) to analyze every R file I’ve ever written. Let me show you how.

Finding all R Files

First, I have to find every R file on my machine, which I do using the list.files function. After finding every R file, I noticed that many of them don’t correspond to code that I have written. So, I prune this list by excluding everything from the Program Files, win-library, and OneDrive (these are backups) directories. Finally, I use the read_file function to import the text of each R file as a single string.

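#Load the packages used throughout this post
library(tidyverse) #dplyr, purrr, stringr, readr, tidyr
library(magrittr)  #set_colnames
library(lubridate) #year

#Get all path names for R files on my machine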
list.files.r <- list.files("C:/", pattern = "\\.R$", recursive = TRUE, full.names = TRUE)

#Prune the list
list.files.r.true <- 
  list.files.r %>%
    as_tibble() %>%
    filter(!str_detect(value, "Program Files")) %>% #These are not my code
    filter(!str_detect(value, "win-library")) %>% #These are not my code
    filter(!str_detect(value, "OneDrive"))  #These are duplicates stored in my OneDrive

#Import text for each R file
list.text.r.files <-
  list.files.r.true$value %>% 
    map(read_file)
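
The same find-and-prune step can be sketched in base R. This is only an illustration under a few assumptions: the helper name find_r_files and its exclude argument are mine, the demonstration runs in a temporary folder rather than on C:/, and ignore.case = TRUE is added so that files ending in a lowercase .r are picked up as well.

```r
# Sketch: find R scripts under a root folder and drop unwanted directories.
# `find_r_files` and `exclude` are illustrative names, not from the code above.
find_r_files <- function(root,
                         exclude = c("Program Files", "win-library", "OneDrive")) {
  paths <- list.files(root, pattern = "\\.R$", recursive = TRUE,
                      full.names = TRUE, ignore.case = TRUE)
  # Combine the folder names into one alternation pattern
  # (these particular names contain no regex metacharacters)
  drop <- grepl(paste(exclude, collapse = "|"), paths)
  paths[!drop]
}

# Small demonstration in a temporary directory
root <- file.path(tempdir(), "r-file-demo")
dir.create(file.path(root, "project"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "OneDrive"), recursive = TRUE, showWarnings = FALSE)
writeLines("library(dplyr)", file.path(root, "project", "analysis.R"))
writeLines("library(dplyr)", file.path(root, "project", "helper.r"))
writeLines("library(dplyr)", file.path(root, "OneDrive", "backup.R"))

kept <- find_r_files(root)
basename(kept)  # the OneDrive copy is excluded
```

Because the pattern is matched against the full path, any folder whose path contains one of the excluded names is dropped, which is the same behavior as the str_detect filters above.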

Here are some of the file names in my R SandBox:

Get R Packages

Now that we have every R file loaded into a list, we can find where each of them loads packages. Taking advantage of the fact that I load each library on a new line, I can get a list of all the packages that I use, ordered by frequency of use.

#Write a function to get the packages I load in each R file
get.r.packages <- function(x) {
  x %>%
    str_split("\r?\n") %>% #Split by line (Windows or Unix line endings)
    unlist() %>% 
    str_extract("library.+") %>% #Find all calls to load a library
    str_replace("library\\(", "") %>%
    str_replace("\\)","") %>%
    str_replace("\\#.+","") %>%
    trimws() %>%
    table() %>%
    as_tibble() %>%
    arrange(desc(n)) %>%
    set_colnames(c("package","count"))  
}


list.packages <-
  list.text.r.files %>%
    map(safely(get.r.packages))
    
  
df.total.packages <-
  list.packages %>%
    transpose() %>%
    .$result %>% 
    bind_rows() %>%
    group_by(package) %>%
    summarize(total = sum(count)) %>%
    arrange(desc(total)) 
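
To see what the extraction does on a small input, here is a base R sketch of the same idea. It is not the function above: count_packages is a hypothetical name, the sample script text is made up, and the regular expression assumes one library() call per line with the package name as the first argument.

```r
# Sketch: count library() calls in one script's text, whatever its line endings.
# `count_packages` is an illustrative name, not part of the code above.
count_packages <- function(text) {
  lines <- unlist(strsplit(text, "\r?\n"))   # handles both \r\n and \n
  lines <- sub("#.*$", "", lines)            # drop trailing comments
  # Capture the first argument of library(...), with or without quotes
  pattern <- "^[[:space:]]*library\\([[:space:]]*[\"']?([A-Za-z0-9.]+)[\"']?.*$"
  hits <- grepl(pattern, lines)
  pkgs <- sub(pattern, "\\1", lines[hits])
  as.data.frame(table(package = pkgs), responseName = "count",
                stringsAsFactors = FALSE)
}

# Made-up script text mixing Windows and Unix line endings
script <- "library(tidyverse) # core tools\r\nlibrary(\"readxl\")\nx <- 1\nlibrary(tidyverse)"
count_packages(script)
```

Packages loaded with require() or called as pkg::fun() are not counted by this sketch, and the same is true of the approach above.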

Here are my ten most used packages, shown as the percentage of my R files that load each one:

Package      Files (%)
tidyverse    44.8
readxl       34.0
stringr      31.5
magrittr     29.6
plyr         22.2
WindR        19.2
readr        18.7
rvest        17.7
TTR          13.8
xts          10.8
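
The code that turns raw counts into the percentages in this table is not shown here, so the following is only a guess at one way to do it. The list pkgs_by_file is hypothetical example data, with one character vector of package names per file.

```r
# Sketch: share of files (in percent) that load each package.
# `pkgs_by_file` is hypothetical example data, one vector of package names per file.
pkgs_by_file <- list(
  c("tidyverse", "readxl"),
  c("tidyverse", "tidyverse", "stringr"),  # loading a package twice counts once
  c("readxl"),
  character(0)                             # a file with no library() calls
)

# unique() makes each file count at most once per package
files_per_pkg <- table(unlist(lapply(pkgs_by_file, unique)))
share <- sort(round(100 * files_per_pkg / length(pkgs_by_file), 1),
              decreasing = TRUE)
share
```

Dividing by the number of files, rather than by the total number of library() calls, is what makes the figures read as "percent of files".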

Hadley is the best.

Get R functions

Taking inspiration from Matt Dancho’s blog post How to Learn R, Part 1: Learn from a Master Data Scientist’s Code, I also want to identify every function that I have ever used. To do so, we exploit the fact that every function call follows the same pattern: space + function_name + (. This is the function I wrote to identify each function used in my R code and count its uses.

get.function.names <- function(x) {
  x %>%
    str_split("\\(") %>% #Split text at every left parenthesis, which marks the end of a function name
    set_names("text") %>%
    as_tibble() %>%
    mutate(tibble.by.line = map(text, str_split, "\r\n")) %>% #Split by new line in original text
    select(-text) %>%
    unnest() %>% 
    mutate(last.object   = map_chr(tibble.by.line, ~ pluck(last(.x))), # get the final object in each line
           last.object.2 = map(last.object, str_split, " ")) %>% #split by space to only get the function call
    select(last.object.2) %>% 
    unnest() %>%
    mutate(name.function = map(last.object.2, ~ pluck(last(.x)))) %>% #select last object; this is the function call
    select(name.function) %>%
    unnest() %>%
    #Tidy the information
    group_by(name.function) %>%
    count(name.function) %>%
    arrange(desc(n)) %>%
    filter(name.function != "")
}
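
For comparison, the "name followed by a left parenthesis" idea can also be written as a single regular expression in base R. This is a swapped-in technique, not the function above: count_function_calls is a hypothetical name, and unlike the original it does not require a space before the function name, so it also picks up the fun in pkg::fun( and obj$fun(.

```r
# Sketch: count function calls by matching a name followed directly by "(".
# `count_function_calls` is an illustrative name, not part of the code above.
count_function_calls <- function(text) {
  # A name starts with a letter or dot, then letters, digits, dots, or underscores;
  # the lookahead (?=\\() requires an opening parenthesis without consuming it.
  pattern <- "[A-Za-z.][A-Za-z0-9._]*(?=\\()"
  calls <- unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
  sort(table(calls), decreasing = TRUE)
}

# Made-up script text
script <- "x <- c(1, 2)\ny <- mutate(df, z = sum(c(a, b)))\nplot(x)"
count_function_calls(script)
```

Like the original, this sketch treats anything shaped like a call as a call, so names inside comments or strings, and keywords written as if( or function(, are counted too.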

Here are my 20 most used functions, each expressed as a percentage of all the function calls in my code:

Function     Usage (%)
c            5.8
library      4.8
mutate       3.1
filter       2.7
function     2.5
mtext        2.2
select       1.9
group_by     1.7
arrange      1.6
aes          1.4
ifelse       1.4
subset       1.4
names        1.4
plot         1.3
unlist       1.2
lead         1.2
ggplot       1.2
sum          1.2
desc         1.1
summarize    0.9

Changing Preferences

I read Hadley Wickham’s R for Data Science in 2017. So, I was interested to see how my coding style had changed since then. By extending the above analysis with the file.info function to get the modification date of every R file on my machine, I was able to show how my R code had changed from 2016, when I got the machine, through today.

df.file.path.and.text <- 
  list.files.r.true %>%
    mutate(file.info = map(value,file.info)) %>% #use file.info function to get the date of modification for every R file on my machine
    unnest() %>%
    select(value, mtime) %>%
    mutate(file.text = map(value, read_file)) #Use the read_file function to get the text for each R file

df.function.count.by.year <- 
    df.file.path.and.text %>% 
      unnest() %>%
      mutate(df.functions = map(file.text, get.function.names)) %>% #Use my get.function.names function to extract the use count of each function in the R files
      #Tidy
      mutate(file.year = year(mtime)) %>%
      select(df.functions, file.year) %>% 
      unnest() %>%
      group_by(file.year, name.function) %>%
      summarize(count.uses = sum(n, na.rm = TRUE)) %>%
      arrange(desc(file.year), desc(count.uses)) %>%
      filter(name.function != "")
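
The date lookup at the heart of this step is easy to try on its own. Here is a small base R sketch, assuming two placeholder files created in a temporary folder; Sys.setFileTime is used only to backdate one of them so that the two years differ.

```r
# Sketch: look up each file's last-modified year with file.info().
# The file names below are placeholders created only for this demonstration.
demo_dir <- file.path(tempdir(), "mtime-demo")
dir.create(demo_dir, showWarnings = FALSE)
paths <- file.path(demo_dir, c("old_script.R", "new_script.R"))
for (p in paths) writeLines("plot(1)", p)

# Backdate the first file so the two years differ
Sys.setFileTime(paths[1], as.POSIXct("2016-06-01 12:00:00", tz = "UTC"))

info <- file.info(paths)  # one row per path; mtime is a POSIXct column
file_years <- data.frame(
  file = basename(paths),
  year = as.integer(format(info$mtime, "%Y")),
  stringsAsFactors = FALSE
)
file_years
```

Keep in mind that mtime records the last modification, not the creation date, so a script that was edited or re-saved later is attributed to the later year.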

Clearly, functions from the Tidyverse have taken over my coding vocabulary since 2016. The most obvious example is the disappearance of base plotting functions like plot, par, and mtext. At the same time, from 2016 to 2018 there is a clear rise in the usage of Tidyverse verbs like mutate, filter, and select.

Conclusion

Here is the code you can use to analyze your own most frequently used R packages and functions. Happy introspecting!