Calculating the proportion of code classified in an R file

Introduction

tidycode can be used to easily classify the lines of code in an R file (e.g., as data cleaning, setup, etc.).

This vignette shows how tidycode can be used to calculate the proportion of an R file's code that is classified into each category.

Loading and setting up

We will first load the tidyverse and tidycode packages, and then use the tidycode function read_rfiles() to read two example files:

library(tidyverse)
library(tidycode)

two_rfiles <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
)

Classify the lines of code in the R files

Next, we can classify the lines of code in the two R files stored in the object two_rfiles:

unnested_expressions <- unnest_calls(two_rfiles, expr)

classified_code <- unnested_expressions %>%
  dplyr::inner_join(
    get_classifications("crowdsource", include_duplicates = FALSE)
  ) %>%
  dplyr::anti_join(get_stopfuncs()) %>%
  dplyr::select(file, func, classification)
#> Joining, by = "func"
#> Joining, by = "func"
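
The two joins above do different jobs: inner_join() keeps only the calls that have a classification, and anti_join() then drops calls that appear in the stop-function list. The following is a minimal sketch of that pattern on made-up data. The tibbles calls, lookup, and stopfuncs are hypothetical stand-ins and are not the tables returned by get_classifications() or get_stopfuncs(). Passing by = "func" explicitly also avoids the "Joining, by" messages.

```r
library(dplyr)

# Hypothetical calls extracted from a script
calls <- tibble::tibble(
  func = c("library", "filter", "ggplot", "c", "my_helper")
)

# Hypothetical lookup table mapping functions to classifications
lookup <- tibble::tibble(
  func = c("library", "filter", "ggplot", "c"),
  classification = c("setup", "data cleaning", "visualization", "data cleaning")
)

# Hypothetical stop list of functions to ignore
stopfuncs <- tibble::tibble(func = "c")

toy_classified <- calls %>%
  inner_join(lookup, by = "func") %>%  # drops my_helper (no classification)
  anti_join(stopfuncs, by = "func")    # drops c (on the stop list)

toy_classified
```

Calls without a match in the lookup table are silently dropped by inner_join(), so the proportions computed later describe only the classified code, not every line in the file.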

Creating a function

Then, we will create a simple function that takes the classified code and calculates the proportion of the lines of code in each file that is classified into different categories:

calc_proportion_file <- function(d) {
  d %>% 
    count(file, classification) %>% 
    group_by(file) %>% 
    mutate(prop = n / sum(n))
}
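
As a quick sanity check, the function can be tried on a small hand-made table. This is only a sketch: toy_classified and its file names are invented for illustration, and the function is repeated so the example runs on its own. Because the proportions are computed within each file, they should sum to 1 per file.

```r
library(dplyr)

calc_proportion_file <- function(d) {
  d %>%
    count(file, classification) %>%
    group_by(file) %>%
    mutate(prop = n / sum(n))
}

# Hypothetical classified code for two small files
toy_classified <- tibble::tibble(
  file = c("a.R", "a.R", "a.R", "b.R"),
  func = c("library", "filter", "mutate", "ggplot"),
  classification = c("setup", "data cleaning", "data cleaning", "visualization")
)

toy_props <- calc_proportion_file(toy_classified)

# The result is still grouped by file, so this gives one total per file
toy_totals <- toy_props %>% summarise(total = sum(prop))
toy_totals
```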

Using the function

It is easy to use the function on our classified code:

proportion_of_file <- calc_proportion_file(classified_code)

proportion_of_file
#> # A tibble: 7 x 4
#> # Groups:   file [2]
#>   file                                                classification     n  prop
#>   <chr>                                               <chr>          <int> <dbl>
#> 1 /Library/Frameworks/R.framework/Versions/3.6/Resou… data cleaning      2 0.286
#> 2 /Library/Frameworks/R.framework/Versions/3.6/Resou… exploratory        1 0.143
#> 3 /Library/Frameworks/R.framework/Versions/3.6/Resou… setup              3 0.429
#> 4 /Library/Frameworks/R.framework/Versions/3.6/Resou… visualization      1 0.143
#> 5 /Library/Frameworks/R.framework/Versions/3.6/Resou… data cleaning      4 0.5  
#> 6 /Library/Frameworks/R.framework/Versions/3.6/Resou… setup              1 0.125
#> 7 /Library/Frameworks/R.framework/Versions/3.6/Resou… visualization      3 0.375
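
The file column holds full paths, which are truncated in the printed output and make for long labels. One possible way to shorten them is base R's basename(), sketched here on a hypothetical table (the paths in toy are placeholders, not the actual paths above). This assumes that no two files in different directories share the same name, since they would otherwise be merged into one label.

```r
library(dplyr)

# Hypothetical table with full paths standing in for the file column
toy <- tibble::tibble(
  file = c("/path/to/example_plot.R", "/path/to/example_analysis.R"),
  classification = c("setup", "data cleaning"),
  n = c(3L, 4L)
)

# Keep only the file name, dropping the directory part
toy_short <- toy %>%
  mutate(file = basename(file))

toy_short
```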

Visualizing classified code on a per-file basis

We can also easily visualize the results:

proportion_of_file %>%
  ggplot()+
  coord_flip()+
  geom_bar(aes(x=0, y=prop, fill=reorder(classification, prop)), stat="identity", size=1)+
  scale_y_continuous(labels=scales::percent_format())+
  facet_wrap(~file, ncol=1)+
  labs(
    title = "Proportion of Code by File",
    y = "Proportion of Code",
    fill = "Classification"
  )+
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank(),
    strip.text = element_text(hjust=0),
    panel.background = element_blank(),
    strip.background = element_blank(),
    panel.grid.major.x = element_line(color="grey80")
  )

This can become quite a large visualization if there are many files; thus, this approach may be more useful when visualizing code on a per-file basis for a relatively small number of files (perhaps 10-15 or fewer).

Next, we show how this approach may be scaled up to a larger number of files by visualizing the proportion of code classified across many files.

Visualizing classified code across files

First, we’ll create a function that is an analog to calc_proportion_file(), but for calculating the overall proportion of code in each category, pooled across all files:

calc_proportion_overall <- function(d) {
  d %>%
    group_by(classification) %>%
    count() %>%
    ungroup() %>%
    mutate(
      prop = prop.table(n)
    )
}

We can use this in the same way as calc_proportion_file(), passing classified_code as the sole argument:

proportion_overall <- calc_proportion_overall(classified_code)
proportion_overall
#> # A tibble: 4 x 3
#>   classification     n   prop
#>   <chr>          <int>  <dbl>
#> 1 data cleaning      6 0.4   
#> 2 exploratory        1 0.0667
#> 3 setup              4 0.267 
#> 4 visualization      4 0.267
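
These proportions pool the counts from all files, so longer files carry more weight; they are not the same as averaging the per-file proportions. The sketch below illustrates the difference using the counts from the per-file output shown earlier. The names file_a.R and file_b.R are placeholders for the two truncated paths, and filling missing categories with zero via tidyr::complete() is an assumption about how an absent category should be treated.

```r
library(dplyr)
library(tidyr)

# Counts copied from the per-file output; file names are placeholders
counts <- tibble::tribble(
  ~file,      ~classification, ~n,
  "file_a.R", "data cleaning",  2L,
  "file_a.R", "exploratory",    1L,
  "file_a.R", "setup",          3L,
  "file_a.R", "visualization",  1L,
  "file_b.R", "data cleaning",  4L,
  "file_b.R", "setup",          1L,
  "file_b.R", "visualization",  3L
)

# Pooled proportion: sum counts over files, then divide by the grand total
pooled <- counts %>%
  group_by(classification) %>%
  summarise(n = sum(n)) %>%
  mutate(prop = n / sum(n))

# Mean of per-file proportions: each file gets equal weight
mean_of_files <- counts %>%
  complete(file, classification, fill = list(n = 0L)) %>%
  group_by(file) %>%
  mutate(prop = n / sum(n)) %>%
  group_by(classification) %>%
  summarise(mean_prop = mean(prop))

pooled
mean_of_files
```

For data cleaning, the pooled value is 6 / 15, whereas the mean of the per-file proportions is the average of 2 / 7 and 4 / 8. Which of the two is appropriate depends on whether files or lines of code are the unit of interest.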

These results can be visualized as follows:

proportion_overall %>%
  ggplot()+
  coord_flip()+
  geom_bar(aes(x=reorder(classification, prop), y=1), stat="identity", fill="grey80")+
  geom_bar(aes(x=reorder(classification, prop), y=prop, fill=prop), stat="identity")+
  geom_text(aes(x=reorder(classification, prop), y=prop, label=paste0(round(prop*100, digits=0), "%"), hjust=-.5))+
  scale_y_continuous(labels=scales::percent_format())+
  labs(
    title = "Overall Proportion of Code",
    y = "Proportion of Code",
    x = "Classification"
  )+
  theme(
    panel.background = element_blank(),
    legend.position = "none"
  )