I’m curious to know which R packages and functions I use the most. To answer that question, I wrote some R code (Meta, I know…) to analyze every R file I’ve ever written. Let me show you how.
First, I have to find every R file on my machine, which I will do using the list.files function. After finding every R file, I notice that many of them don't contain code that I have written. So, I prune this list by excluding everything from the Program Files, win-library, and OneDrive (these are backups) directories. Finally, I use the read_file function to import the text of each R file as a single string.
```r
# Load the packages used throughout this analysis
library(tidyverse)  # dplyr, purrr, stringr, readr, tibble, ggplot2
library(magrittr)   # set_colnames()
library(lubridate)  # year(), used later on

# Get all path names for R files on my machine
list.files.r <- list.files("C:/", pattern = "\\.R$", recursive = TRUE, full.names = TRUE)

# Prune the list
list.files.r.true <-
  list.files.r %>%
  as_tibble() %>%
  filter(str_detect(value, "Program Files") != TRUE) %>% # This is not my code
  filter(str_detect(value, "win-library") != TRUE) %>%   # This is not my code
  filter(str_detect(value, "OneDrive") != TRUE)          # These are duplicates stored in my OneDrive

# Import the text of each R file as a single string
list.text.r.files <-
  list.files.r.true$value %>%
  map(read_file)
```
Here are some of the file names in my R SandBox:
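The listing is specific to my machine, but if you want a similar preview of your own pruned list, something like the following works (sample() and basename() are just one convenient way to pull a few representative names):

```r
# Preview a handful of the pruned R file paths (illustrative; your paths will differ)
list.files.r.true$value %>%
  sample(10) %>%
  basename()
```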
Now that we have every R file loaded into a list, we can find the packages loaded in each of them. Taking advantage of the fact that I load each library on its own line, I can build a list of all the packages I use, ordered by frequency of use.
```r
# Write a function to get the packages I load in each R file
get.r.packages <- function(x) {
  x %>%
    str_split("\r\n") %>%               # Split the file text by line
    unlist() %>%
    str_extract("library.+") %>%        # Find all calls that load a library
    str_replace("library\\(", "") %>%   # Strip the call, keeping the package name
    str_replace("\\)", "") %>%
    str_replace("\\#.+", "") %>%        # Drop any trailing comment
    trimws() %>%
    table() %>%
    as_tibble() %>%
    arrange(desc(n)) %>%
    set_colnames(c("package", "count"))
}
```
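To see what the function returns, here is a quick example on a made-up string standing in for the contents of one R file (the package names here are purely illustrative):

```r
# Illustrative example: the packages loaded in a made-up "file"
example.text <- "library(dplyr)\r\nlibrary(ggplot2) # for plots\r\nx <- 1\r\n"
get.r.packages(example.text)
#> A small tibble with one row per package, e.g. dplyr and ggplot2, each with count 1
```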
```r
list.packages <-
  list.text.r.files %>%
  map(safely(get.r.packages))    # safely() keeps one malformed file from stopping the whole run

df.total.packages <-
  list.packages %>%
  transpose() %>%
  .$result %>%
  bind_rows() %>%
  group_by(package) %>%
  summarize(total = sum(count)) %>%
  arrange(desc(total))
```
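The table below reads each package as the percentage of R files it appears in, rather than the raw totals computed above. That step isn't shown here, but a minimal sketch of one way to get there (counting a package once per file and dividing by the number of files analyzed; df.package.proportions is just a name I've made up) looks like this:

```r
# Sketch: express each package as the percentage of analyzed files that load it
df.package.proportions <-
  list.packages %>%
  transpose() %>%
  .$result %>%
  bind_rows(.id = "file.id") %>%    # Remember which file each row came from
  distinct(file.id, package) %>%    # A package counts once per file
  count(package, name = "n.files") %>%
  mutate(proportion = 100 * n.files / length(list.text.r.files)) %>%
  arrange(desc(proportion))
```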
Here are my top ten most used packages as a proportion of the number of R files that I use them in:
| package | proportion (%) |
|---|---|
| tidyverse | 44.8 |
| readxl | 34.0 |
| stringr | 31.5 |
| magrittr | 29.6 |
| plyr | 22.2 |
| WindR | 19.2 |
| readr | 18.7 |
| rvest | 17.7 |
| TTR | 13.8 |
| xts | 10.8 |
Hadley is the best.
Taking inspiration from Matt Dancho’s blog post How to Learn R, Part 1: Learn from a Master Data Scientist’s Code, I also want to identify every function that I have ever used. To do so, I exploit the fact that every function call follows the same pattern: a space, the function name, and then an opening parenthesis. This is the wrapper I wrote to identify each function used in my R code and count its uses.
```r
get.function.names <- function(x) {
  x %>%
    str_split("\\(") %>%    # Split the text at every left parenthesis - each split marks the end of a function call
    set_names("text") %>%
    as_tibble() %>%
    mutate(tibble.by.line = map(text, str_split, "\r\n")) %>%   # Split each fragment by new line in the original text
    select(-text) %>%
    unnest() %>%
    mutate(last.object = map_chr(tibble.by.line, ~ pluck(last(.x))),   # Get the final line of each fragment
           last.object.2 = map(last.object, str_split, " ")) %>%       # Split by space to isolate the function call
    select(last.object.2) %>%
    unnest() %>%
    mutate(name.function = map(last.object.2, ~ pluck(last(.x)))) %>%  # The last object on the line is the function name
    select(name.function) %>%
    unnest() %>%
    # Tidy the information
    group_by(name.function) %>%
    count(name.function) %>%
    arrange(desc(n)) %>%
    filter(name.function != "")
}
```
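To make the logic concrete, here is the wrapper applied to a small, made-up snippet of file text (the code inside the string is purely illustrative):

```r
# Illustrative example: the functions called in a two-line "file"
example.code <- "df <- mutate(df, y = sum(x))\r\nplot(df$x, df$y)\r\n"
get.function.names(example.code)
#> Returns one row each for mutate, sum, and plot, with n = 1
```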
Here are my top 20 functions, expressed as a proportion of all the function calls that I make:
| Function Name | Usage Proportion (%) |
|---|---|
| c | 5.8 |
| library | 4.8 |
| mutate | 3.1 |
| filter | 2.7 |
| function | 2.5 |
| mtext | 2.2 |
| select | 1.9 |
| group_by | 1.7 |
| arrange | 1.6 |
| aes | 1.4 |
| ifelse | 1.4 |
| subset | 1.4 |
| names | 1.4 |
| plot | 1.3 |
| unlist | 1.2 |
| lead | 1.2 |
| ggplot | 1.2 |
| sum | 1.2 |
| desc | 1.1 |
| summarize | 0.9 |
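The aggregation step for functions isn't shown above, but it mirrors the package roll-up. A sketch, assuming get.function.names is mapped over every file with safely() just like get.r.packages was (df.total.functions is a name chosen here for illustration):

```r
# Sketch: roll up function counts across all files and express each as a share of all calls
df.total.functions <-
  list.text.r.files %>%
  map(safely(get.function.names)) %>%
  transpose() %>%
  .$result %>%
  bind_rows() %>%
  group_by(name.function) %>%
  summarize(total = sum(n)) %>%
  mutate(proportion = 100 * total / sum(total)) %>%
  arrange(desc(proportion))
```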
I read Hadley Wickham’s R for Data Science in 2017, so I was interested to see how my coding style has changed since then. By extending the above analysis with the file.info function to get the modification date of every R file on my machine, I was able to show how my R code has changed from 2016, when I got the machine, through today.
```r
df.file.path.and.text <-
  list.files.r.true %>%
  mutate(file.info = map(value, file.info)) %>%   # Use file.info to get the modification date of every R file on my machine
  unnest() %>%
  select(value, mtime) %>%
  mutate(file.text = map(value, read_file))       # Use read_file to get the text of each R file

df.function.count.by.year <-
  df.file.path.and.text %>%
  unnest() %>%
  mutate(df.functions = map(file.text, get.function.names)) %>%  # Use my get.function.names function to count each function's uses in every R file
  # Tidy
  mutate(file.year = year(mtime)) %>%
  select(df.functions, file.year) %>%
  unnest() %>%
  group_by(file.year, name.function) %>%
  summarize(count.uses = sum(n, na.rm = TRUE)) %>%
  arrange(desc(file.year), desc(count.uses)) %>%
  filter(name.function != "")
```
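A quick way to eyeball the trend is to plot yearly usage counts for a few representative functions; something along these lines does the job (the handful of functions highlighted is just an illustrative choice):

```r
# Sketch: plot yearly usage of a few base-graphics functions vs. a few tidyverse verbs
df.function.count.by.year %>%
  filter(name.function %in% c("plot", "par", "mtext", "mutate", "filter", "select")) %>%
  ggplot(aes(x = file.year, y = count.uses, colour = name.function)) +
  geom_line() +
  geom_point() +
  labs(x = "Year file was last modified", y = "Number of uses", colour = "Function")
```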
Clearly, functions from the Tidyverse have taken over my coding vocabulary since 2016. The most obvious example is the disappearance of base plotting functions like plot, par, and mtext. At the same time, from 2016 to 2018 there is a clear rise in the usage of Tidyverse verbs like mutate, filter, and select.
The code above is everything you need to analyze your own most frequently used R packages and functions. Happy introspecting!