tidycode can be used to classify the lines of code in an R file (e.g., as data cleaning or setup).
This vignette shows how tidycode can be used to calculate the proportion of an R file that is classified into each category.
We will first load the tidyverse and tidycode packages, and then use the tidycode function read_rfiles() to read two example files:
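The loading step itself is not shown in this text, so the following is only a sketch of what it might look like. It assumes that tidycode ships a tidycode_example() helper that returns the path to a bundled example file, and that the two files are named example_plot.R and example_analysis.R; both the helper and the file names are assumptions to check against the installed package. The object name two_rfiles is taken from the text that follows.

```r
library(tidyverse)
library(tidycode)

# Assumption: tidycode_example() returns the path to an example file bundled
# with the package, and these two file names exist in the installed version.
two_rfiles <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
)
```

Any other pair of R script paths could be passed to read_rfiles() in the same way.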
Next, we can classify the lines of code in the two R files saved to the object two_rfiles:
unnested_expressions <- unnest_calls(two_rfiles, expr)
classified_code <- unnested_expressions %>%
  dplyr::inner_join(
    get_classifications("crowdsource", include_duplicates = FALSE)
  ) %>%
  dplyr::anti_join(get_stopfuncs()) %>%
  dplyr::select(file, func, classification)
#> Joining, by = "func"
#> Joining, by = "func"
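To make the two joins above easier to follow, here is a small self-contained sketch using only dplyr. The three toy tables are hypothetical stand-ins for the output of unnest_calls(), get_classifications(), and get_stopfuncs(); their contents are illustrative and are not the packages' actual lists. Passing by = "func" explicitly also avoids the "Joining, by" messages.

```r
library(dplyr)

# Hypothetical toy tables; the real ones come from tidycode.
calls <- tibble(
  file = "a.R",
  func = c("library", "filter", "%>%", "my_helper")
)
classifications <- tibble(
  func = c("library", "filter", "%>%"),
  classification = c("setup", "data cleaning", "data cleaning")
)
stopfuncs <- tibble(func = "%>%")

toy_classified <- calls %>%
  # keep only calls that have a known classification (drops my_helper)
  inner_join(classifications, by = "func") %>%
  # remove calls listed as stop functions (drops %>%)
  anti_join(stopfuncs, by = "func") %>%
  select(file, func, classification)

toy_classified
```

Only the library and filter rows remain: my_helper has no classification, and %>% is removed as a stop function.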
Then, we will create a simple function that takes the classified code and calculates the proportion of the lines of code in each file that is classified into different categories:
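The definition of this function does not appear in this text. The following is a minimal sketch, assuming it mirrors calc_proportion_overall() but groups by file first; it is consistent with the grouped output shown below, though the original definition may differ.

```r
library(dplyr)

# Sketch (assumed definition): count calls per file and classification,
# then divide by the total number of classified calls in each file.
calc_proportion_file <- function(d) {
  d %>%
    group_by(file, classification) %>%
    count() %>%
    group_by(file) %>%
    mutate(prop = prop.table(n))
}
```

Because the result stays grouped by file, the proportions sum to 1 within each file rather than across the whole table.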
It is easy to use the function on our classified code:
proportion_of_file <- calc_proportion_file(classified_code)
proportion_of_file
#> # A tibble: 7 x 4
#> # Groups:   file [2]
#>   file                                                classification     n  prop
#>   <chr>                                               <chr>          <int> <dbl>
#> 1 /Library/Frameworks/R.framework/Versions/3.6/Resou… data cleaning      2 0.286
#> 2 /Library/Frameworks/R.framework/Versions/3.6/Resou… exploratory        1 0.143
#> 3 /Library/Frameworks/R.framework/Versions/3.6/Resou… setup              3 0.429
#> 4 /Library/Frameworks/R.framework/Versions/3.6/Resou… visualization      1 0.143
#> 5 /Library/Frameworks/R.framework/Versions/3.6/Resou… data cleaning      4 0.5
#> 6 /Library/Frameworks/R.framework/Versions/3.6/Resou… setup              1 0.125
#> 7 /Library/Frameworks/R.framework/Versions/3.6/Resou… visualization      3 0.375
We can also easily visualize the results:
proportion_of_file %>%
  ggplot() +
  coord_flip() +
  geom_bar(aes(x = 0, y = prop, fill = reorder(classification, prop)), stat = "identity", size = 1) +
  scale_y_continuous(labels = scales::percent_format()) +
  facet_wrap(~file, ncol = 1) +
  labs(
    title = "Proportion of Code by File",
    y = "Proportion of Code",
    fill = "Classification"
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank(),
    strip.text = element_text(hjust = 0),
    panel.background = element_blank(),
    strip.background = element_blank(),
    panel.grid.major.x = element_line(color = "grey80")
  )
This plot can become quite large if there are many files; thus, this approach is most useful for visualizing code on a per-file basis for a relatively small number of files (perhaps 10-15 or fewer).
Next, we show how this approach may be scaled up to a larger number of files by visualizing the proportion of code classified across many files.
First, we’ll create a function that is an analog to calc_proportion_file(), but for calculating the overall proportion of code in each category, pooled across all files:
calc_proportion_overall <- function(d) {
  d %>%
    group_by(classification) %>%
    count() %>%
    ungroup() %>%
    mutate(
      prop = prop.table(n)
    )
}
We can use this in the same way as calc_proportion_file(), passing classified_code as the sole argument:
proportion_overall <- calc_proportion_overall(classified_code)
proportion_overall
#> # A tibble: 4 x 3
#>   classification     n   prop
#>   <chr>          <int>  <dbl>
#> 1 data cleaning      6 0.4
#> 2 exploratory        1 0.0667
#> 3 setup              4 0.267
#> 4 visualization      4 0.267
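Note that calc_proportion_overall() pools the calls from all files before dividing, which is not the same as averaging the per-file proportions. The sketch below compares the two, using the counts from the per-file output above. The labels file_1 and file_2 are placeholders, because the real paths are truncated in that output, and the column names pooled_prop and mean_file_prop are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

# Counts copied from the per-file output; file labels are placeholders.
counts <- tribble(
  ~file,    ~classification, ~n,
  "file_1", "data cleaning",  2,
  "file_1", "exploratory",    1,
  "file_1", "setup",          3,
  "file_1", "visualization",  1,
  "file_2", "data cleaning",  4,
  "file_2", "setup",          1,
  "file_2", "visualization",  3
)

comparison <- counts %>%
  # add explicit zero counts for categories a file never uses
  complete(file, classification, fill = list(n = 0)) %>%
  group_by(file) %>%
  mutate(prop_in_file = n / sum(n)) %>%
  group_by(classification) %>%
  summarise(n = sum(n), mean_file_prop = mean(prop_in_file)) %>%
  # pooled proportion: total calls per category over all classified calls
  mutate(pooled_prop = n / sum(n))

comparison
```

The two measures weight files differently: the mean of per-file proportions treats every file equally, while the pooled proportion gives longer files more weight. Here pooled_prop reproduces the prop column of proportion_overall, while mean_file_prop differs slightly (for data cleaning, about 0.393 versus 0.4).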
These results can be visualized as follows:
proportion_overall %>%
  ggplot() +
  coord_flip() +
  geom_bar(aes(x = reorder(classification, prop), y = 1), stat = "identity", fill = "grey80") +
  geom_bar(aes(x = reorder(classification, prop), y = prop, fill = prop), stat = "identity") +
  geom_text(aes(x = reorder(classification, prop), y = prop, label = paste0(round(prop * 100, digits = 0), "%"), hjust = -.5)) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Overall Proportion of Code",
    y = "Proportion of Code",
    x = "Classification"
  ) +
  theme(
    panel.background = element_blank(),
    legend.position = "none"
  )