I’m curious to know which R packages and functions I use the most. To answer that question, I wrote some R code (Meta, I know…) to analyze every R file I’ve ever written. Let me show you how.
First, I have to find every R file on my machine, which I do using the list.files function. After finding every R file, I notice that many file names don't correspond to code that I have written. So, I prune the list by excluding everything in the Program Files, win-library, and OneDrive (these are backups) directories. Finally, I use the read_file function to import the text of each R file as a single string.
#Load the packages used throughout this analysis
library(tidyverse) #dplyr, purrr, stringr, readr, tibble, ggplot2
library(magrittr)  #for set_colnames()
library(lubridate) #for year()

#Get all path names for R files on my machine
list.files.r <- list.files("C:/", pattern = "\\.R$", recursive = TRUE, full.names = TRUE)

#Prune the list
list.files.r.true <-
  list.files.r %>%
  as_tibble() %>%
  filter(str_detect(value, "Program Files") != TRUE) %>% #These are not my code
  filter(str_detect(value, "win-library") != TRUE) %>% #These are not my code
  filter(str_detect(value, "OneDrive") != TRUE) #These are duplicates stored in my OneDrive

#Import text for each R file
list.text.r.files <-
  list.files.r.true$value %>%
  map(read_file)
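To spot-check the pruning, I can peek at a few of the surviving file names. This is a small sketch of my own; the exact paths will of course depend on your machine.

#Peek at a handful of the surviving file names (basename() drops the directory part)
list.files.r.true$value %>%
  basename() %>%
  head(10)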
Here are some of the file names in my R SandBox:
Now that we have every R file loaded into a list, we can find the instances where I load packages in each of them. Taking advantage of the fact that I load each library on a new line, I can build a list of all the packages that I use, ordered by frequency of use.
#Write a function to get the packages I load in each R file
get.r.packages <- function(x) {
  x %>%
    str_split("\r\n") %>% #Split by line
    unlist() %>%
    str_extract("library.+") %>% #Find all calls to load a library
    str_replace("library\\(", "") %>% #Strip the function name...
    str_replace("\\)", "") %>% #...and the closing parenthesis
    str_replace("\\#.+", "") %>% #Drop any trailing comment
    trimws() %>%
    table() %>% #Count the loads of each package
    as_tibble() %>%
    arrange(desc(n)) %>%
    set_colnames(c("package", "count"))
}
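As a quick sanity check, the function can be run on a small toy string of my own (with Windows-style line endings to match the str_split above):

#Toy example (not from my actual files): two library calls, one with a trailing comment
toy.code <- "library(dplyr)\r\nlibrary(ggplot2) #plots\r\nx <- 1\r\n"
get.r.packages(toy.code)
#Returns a two-column tibble (package, count) listing dplyr and ggplot2 once each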
#Apply the function to every file, capturing any errors with safely()
list.packages <-
  list.text.r.files %>%
  map(safely(get.r.packages))

#Combine the results and total the counts for each package
df.total.packages <-
  list.packages %>%
  transpose() %>%
  .$result %>% #Errors become NULL and are dropped by bind_rows()
  bind_rows() %>%
  group_by(package) %>%
  summarize(total = sum(count)) %>%
  arrange(desc(total))
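The table below expresses each package as a share of the files it appears in. Since I load a given package at most once per file, its total count approximates the number of files that load it, so a rough way to get that kind of share looks like the sketch below (an illustration of the normalisation, with the caveat that the exact rounding may differ):

#Rough normalisation: assumes each package is loaded at most once per file,
#so its total count approximates the number of files that load it
df.total.packages %>%
  mutate(proportion = round(100 * total / length(list.text.r.files), 1)) %>%
  select(package, proportion)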
Here are my top ten most used packages, expressed as the percentage of my R files in which I load them:
package | proportion of files (%) |
---|---|
tidyverse | 44.8 |
readxl | 34.0 |
stringr | 31.5 |
magrittr | 29.6 |
plyr | 22.2 |
WindR | 19.2 |
readr | 18.7 |
rvest | 17.7 |
TTR | 13.8 |
xts | 10.8 |
Hadley is the best.
Taking inspiration from Matt Dancho’s blog post How to Learn R, Part 1: Learn from a Master Data Scientist’s Code, I also want to identify every function that I have ever used. To do so, we exploit the fact that every function call follows the same pattern: space + function_name + (. This is the wrapper I wrote to identify each function used in my R code and count its uses.
get.function.names <- function(x) {
  x %>%
    str_split("\\(") %>% #Split text by left-hand parentheses - these signal the end of a function call
    set_names("text") %>%
    as_tibble() %>%
    mutate(tibble.by.line = map(text, str_split, "\r\n")) %>% #Split by new line in the original text
    select(-text) %>%
    unnest() %>%
    mutate(last.object = map_chr(tibble.by.line, ~ pluck(last(.x))), #Get the final object in each line
           last.object.2 = map(last.object, str_split, " ")) %>% #Split by space to isolate the function call
    select(last.object.2) %>%
    unnest() %>%
    mutate(name.function = map(last.object.2, ~ pluck(last(.x)))) %>% #Select the last object; this is the function name
    select(name.function) %>%
    unnest() %>%
    #Tidy the information
    group_by(name.function) %>%
    count(name.function) %>%
    arrange(desc(n)) %>%
    filter(name.function != "")
}
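For comparison, a rougher but more compact alternative is a single regular expression with a lookahead that grabs any identifier sitting immediately before an opening parenthesis. This is just a sketch of that alternative, not the wrapper used for the numbers below:

#Sketch of a regex-based alternative (not used for the results below)
count.functions.regex <- function(x) {
  x %>%
    str_extract_all("[A-Za-z._][A-Za-z0-9._]*(?=\\()") %>% #identifier followed by "("
    unlist() %>%
    table() %>%
    as_tibble() %>%
    set_colnames(c("name.function", "n")) %>%
    arrange(desc(n))
}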
Here are my top 20 functions, expressed as a percentage of all the function calls in my code:
Function Name | Usage Proportion (%) |
---|---|
c | 5.8 |
library | 4.8 |
mutate | 3.1 |
filter | 2.7 |
function | 2.5 |
mtext | 2.2 |
select | 1.9 |
group_by | 1.7 |
arrange | 1.6 |
aes | 1.4 |
ifelse | 1.4 |
subset | 1.4 |
names | 1.4 |
plot | 1.3 |
unlist | 1.2 |
lead | 1.2 |
ggplot | 1.2 |
sum | 1.2 |
desc | 1.1 |
summarize | 0.9 |
I read Hadley Wickham’s R for Data Science in 2017, so I was interested to see how my coding style had changed since then. By modifying the above analysis with the file.info function to get the modification date of every R file on my machine, I was able to show how my R code has changed from 2016, when I got the machine, through today.
df.file.path.and.text <-
  list.files.r.true %>%
  mutate(file.info = map(value, file.info)) %>% #Use the file.info function to get the modification date of every R file on my machine
  unnest() %>%
  select(value, mtime) %>%
  mutate(file.text = map(value, read_file)) #Use the read_file function to get the text of each R file
df.function.count.by.year <-
  df.file.path.and.text %>%
  unnest() %>%
  mutate(df.functions = map(file.text, get.function.names)) %>% #Use my get.function.names function to count each function's uses in every R file
  #Tidy
  mutate(file.year = year(mtime)) %>% #year() is from lubridate
  select(df.functions, file.year) %>%
  unnest() %>%
  group_by(file.year, name.function) %>%
  summarize(count.uses = sum(n, na.rm = TRUE)) %>%
  arrange(desc(file.year), desc(count.uses)) %>%
  filter(name.function != "")
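One simple way to visualise df.function.count.by.year is to chart a handful of hand-picked functions over time. This is a minimal plotting sketch of my own, not the only way to cut the data:

#Sketch: track a few hand-picked base and Tidyverse functions over time
df.function.count.by.year %>%
  filter(name.function %in% c("plot", "par", "mtext", "mutate", "filter", "select")) %>%
  ggplot(aes(x = file.year, y = count.uses, colour = name.function)) +
  geom_line() +
  labs(x = "Year file last modified", y = "Number of uses", colour = "Function")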
Clearly, functions from the Tidyverse have taken over my coding vocabulary since 2016. The most obvious examples are the disappearance of base plotting functions like plot, par, and mtext. At the same time, from 2016 to 2018 there is a clear rise in the usage of Tidyverse verbs like mutate, filter, and select.
Here is the code you can use to analyze your own most frequently used R packages and functions. Happy introspecting!