Clean(ish) Code: dupree

Background and Links:

Today’s packages

Me

Preamble

# Dependencies:
# - lintr, dplyr, purrr, tibble, magrittr, methods, stringdist
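# (`dependencies = FALSE` skips them, so install these first)
if (!requireNamespace("dupree", quietly = TRUE)) {
  # needs devtools to be installed already
  devtools::install_github(
    repo = "russHyde/dupree", dependencies = FALSE
  )
}

suppressPackageStartupMessages({
  library(magrittr) # provides the %>% pipe used on later slides
  library(lintr)
  library(dupree)
  library(git2r)
})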

Code Smells & Architectural Ideals

"The most common design problems result from code that

Is duplicated
Is unclear
Is complicated"

Quote: Kerievsky ‘Refactoring to Patterns’

See also Fowler ‘Refactoring’, Martin ‘Clean Code’ and Jenny Bryan’s talk ‘Code smells and feels’

Types of duplication

Trivial stuff (library(dplyr))
Copy/paste-driven development (similar logic & code)
Functional duplication (same logic, different code)
? False duplication (different logic, similar code)

How to detect duplication?

Python
- pylint (looks for identical lines between files)
Java / C++ / C# etc
- lots of choice (code structure / identity)
R: nothing for source code (AFAIK)
- String / Sequence similarity: stringdist
- Text analysis: ropensci/textreuse
- (But there are related code-quality tools: goodpractice, lintr, styler, cyclocomp, pkgnet)

`dupree`

https://github.com/russHyde/dupree
All community input is welcome
Most data input is welcome:
- sets of files (dupree())
- a directory (dupree_dir())
- or a package (dupree_package())

Duplication in a script

# min_block_size: used to prevent dupree analysing really small code blocks
dupree("duplication_heavy.R", min_block_size = 3) %>%
  dplyr::select(-file_a, -file_b)

Duplication in a script (cont.)

library(dplyr)
data(diamonds, package = "ggplot2") # diamonds ships with ggplot2, not dplyr

diamonds %>%
  filter(clarity %in% c("SI1", "SI2")) %>%
  group_by(color) %>%
  summarise(m_price = mean(price), sd_price = sd(price))

diamonds %>%
  filter(cut >= "Very Good") %>%
  group_by(color) %>%
  summarise(m_price = mean(price), sd_price = sd(price))


# note that dupree can't tell that the following code is logically
# the same as the preceding code
summarise(
  group_by(
    filter(diamonds, cut >= "Very Good"),
    color
  ),
  sd_price = sd(price),
  m_price = mean(price)
)

Mechanics

Longest Common Subsequence

# breakf-a---st
# break-dance--
stringdist::stringdist("breakfast", "breakdance", method = "lcs")

## [1] 7
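The 7 can be checked by hand. The sketch below is a textbook dynamic-programming LCS in base R (the helper name lcs_length is made up for this slide; it is not part of stringdist or dupree). It assumes the "lcs" distance counts the characters left unmatched in both strings, i.e. |a| + |b| - 2 * |LCS|.

```r
# Length of the longest common subsequence of two vectors,
# computed with the standard dynamic-programming table.
# (hypothetical helper, for illustration only)
lcs_length <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # tab[i + 1, j + 1] holds the LCS length of x[1:i] and y[1:j]
  tab <- matrix(0L, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      tab[i + 1, j + 1] <- if (x[i] == y[j]) {
        tab[i, j] + 1L
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[n + 1, m + 1]
}

a <- strsplit("breakfast", "")[[1]]
b <- strsplit("breakdance", "")[[1]]

# shared subsequence is b-r-e-a-k-a
common <- lcs_length(a, b)

# unmatched characters across both words
lcs_distance <- length(a) + length(b) - 2 * common
lcs_distance
```

Six shared characters out of 9 + 10 leave 7 unmatched, which agrees with the stringdist result above.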

Code blocks

-> Sentences of function / variable names

-> “Sentences” of integers

-> Compute similarity score based on longest-common-subsequence
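A rough sketch of the first two arrows, using only base R's parser. This is not dupree's own code (dupree goes through lintr's file parsers); the helper symbols_in and the choice of which tokens to keep are assumptions made for illustration.

```r
# Two ways of writing the same logic
code_a <- 'diamonds %>% filter(cut >= "Very Good") %>% group_by(color)'
code_b <- 'group_by(filter(diamonds, cut >= "Very Good"), color)'

# Pull out the function / variable names of a code block, in source order
# (hypothetical helper; assumes only SYMBOL-like tokens matter)
symbols_in <- function(code) {
  parsed <- utils::getParseData(parse(text = code, keep.source = TRUE))
  parsed <- parsed[
    parsed$token %in% c("SYMBOL", "SYMBOL_FUNCTION_CALL"),
  ]
  parsed <- parsed[order(parsed$line1, parsed$col1), ]
  parsed$text
}

sym_a <- symbols_in(code_a)
sym_b <- symbols_in(code_b)

# One integer per distinct name, shared across both blocks
vocab <- unique(c(sym_a, sym_b))
ints_a <- match(sym_a, vocab)
ints_b <- match(sym_b, vocab)

rbind(ints_a, ints_b)
```

Both blocks use the same five names, but in a different order, so their integer "sentences" share only a short common subsequence. That is why piped and nested versions of the same logic score as less similar than copy/pasted code.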

Mechanics (cont.)

Use seq_sim to compute LCS-based similarity between vectors of integers

to_ints <- function(word){
  as.integer(factor(strsplit(word, "")[[1]], levels = letters))
}

to_ints("breakfast")

## [1]  2 18  5  1 11  6  1 19 20

stringdist::seq_sim(
  list(to_ints("breakfast")), list(to_ints("breakdance")), method = "lcs"
) # 1 - d / (|seq1| + |seq2|), where d is the LCS distance

## [1] 0.6315789
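To see where 0.63 comes from, the similarity can be rebuilt from the LCS distance. This is a small check rather than anything dupree does itself; it assumes stringdist scales the "lcs" distance by the combined length of the two sequences. The letters are encoded with utf8ToInt here, which gives different integers from to_ints but the same pattern of matches.

```r
library(stringdist)

a <- utf8ToInt("breakfast")
b <- utf8ToInt("breakdance")

# LCS distance between the two integer sequences
d <- seq_dist(list(a), list(b), method = "lcs")

# rescale to a 0-1 similarity by the largest possible distance
sim_by_hand <- 1 - d / (length(a) + length(b))

sim <- seq_sim(list(a), list(b), method = "lcs")

all.equal(sim_by_hand, sim)
```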

Duplication in a package

Downloaded the source code for lintr from GitHub using ropensci/git2r.

# temporary dir for storing `lintr`'s source code
lintr_path <- file.path(tempdir(), "lintr")
lintr_repo <- git2r::clone(
  "https://github.com/jimhester/lintr",
  lintr_path
)

Duplication in a package (cont)

Ran dupree on lintr

dups <- dupree::dupree_package(
  lintr_path, min_block_size = 40
)

Duplication in a package (cont)

dups %>%
  dplyr::filter(score > 0.4 & file_a != file_b) %>%
  dplyr::mutate_at(c("file_a", "file_b"), basename) %>%
  head()

GOTO: equals_na_lintr.R

Visualisation of duplication results

We make a tidygraph structure from the similarity scores

dup_graph <- dups %>%
  # keep code-block pairs with moderate similarity:
  dplyr::filter(score > 0.4) %>%
  dplyr::transmute(
    # indicate code-blocks by filename and start-line
    from = paste(basename(file_a), line_a),
    to = paste(basename(file_b), line_b),
    type = "duplication",
    score = score
  ) %>%
  tidygraph::as_tbl_graph() %>%
  # distinguish the file each code block came from
  dplyr::mutate(filename = gsub("(.*) \\d+$", "\\1", name))

Visualisation of duplication results (cont)

library(ggraph) # also attaches ggplot2 (for aes)

graph_image <- dup_graph %>%
  ggraph(layout = "gem") +
  geom_edge_link(
    aes(colour = type, edge_width = score)
  ) +
  geom_node_point(
    aes(colour = filename), size = 4, show.legend = FALSE
  ) +
  theme_graph()

Visualisation of duplication results (cont)

graph_image

Visualisation of duplication results (cont)

graph_image +
  geom_node_text(aes(label = name), repel = TRUE)

What was `lintr` by the way?

style / syntax checker for R
configurable
can be run
- in RStudio / vim / atom etc
- or on Travis
(and dupree uses lintr’s file parsers)

refactoRing

Improving the structure of code (without modifying its function)
The rule of 3
Examples
- Figures: Global theming / %+%
- Statements: Replace with function call
- Common functions: Move to a package
- RMarkdown: Configurable reports / child-stubs


Thanks
