Background and Links:

Preamble

# Dependencies:
# - lintr, dplyr, purrr, tibble, magrittr, methods, stringdist
# Install dupree from GitHub only if it is not already available
if (!requireNamespace("dupree", quietly = TRUE)) {
  devtools::install_github(
    repo = "russHyde/dupree", dependencies = FALSE
  )
}

suppressPackageStartupMessages({
  library(magrittr) # provides %>%, used throughout
  library(lintr)
  library(dupree)
  library(git2r)
})

Code Smells & Architectural Ideals

"The most common design problems result from code that

  • Is duplicated

  • Is unclear

  • Is complicated"

Quote: Kerievsky ‘Refactoring to Patterns’

See also Fowler ‘Refactoring’, Martin ‘Clean Code’ and Jenny Bryan’s talk ‘Code smells and feels’

Types of duplication

  • Trivial stuff (library(dplyr))

  • Copy/paste-driven development (similar logic & code)

  • Functional duplication (same logic, different code)

  • ? False duplication (different logic, similar code)
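
A minimal sketch of these types, using a made-up `scores` data frame. All names here are illustrative and do not come from dupree or from any real project.

```r
scores <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 5, 7)
)

# Copy/paste-driven: the two statements differ only in the group label
mean_a <- mean(scores$value[scores$group == "a"])
mean_b <- mean(scores$value[scores$group == "b"])

# Functional duplication: same logic as mean_a, written differently
mean_a_alt <- sum(scores$value[scores$group == "a"]) / sum(scores$group == "a")

# False duplication: looks like mean_a, but computes spread, not centre
sd_a <- sd(scores$value[scores$group == "a"])
```

A tool that compares code text will tend to flag the first and last cases and miss the middle one.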

How to detect duplication?

  • Python

    • pylint (looks for identical lines between files)
  • Java / C++ / C# etc

    • lots of choice (code structure / identity)
  • R: no dedicated tool for source code (AFAIK)

    • String / Sequence similarity: stringdist
    • Text analysis: ropensci/textreuse
    • (Other code-quality tools: goodpractice, lintr, styler, cyclocomp, pkgnet)

dupree

  • https://github.com/russHyde/dupree

  • All community input is welcome

  • Most data input is welcome:

    • sets of files (dupree())
    • a directory (dupree_dir())
    • or a package (dupree_package())

Duplication in a script

# min_block_size: used to prevent dupree analysing really small code blocks
dupree("duplication_heavy.R", min_block_size = 3) %>%
  dplyr::select(-file_a, -file_b)

Duplication in a script (cont.)

library(dplyr)
data(diamonds, package = "ggplot2") # diamonds ships with ggplot2, not dplyr

diamonds %>%
  filter(clarity %in% c("SI1", "SI2")) %>%
  group_by(color) %>%
  summarise(m_price = mean(price), sd_price = sd(price))

diamonds %>%
  filter(cut >= "Very Good") %>%
  group_by(color) %>%
  summarise(m_price = mean(price), sd_price = sd(price))


# note that dupree can't tell that the following code is logically
# the same as the preceding code
summarise(
  group_by(
    filter(diamonds, cut >= "Very Good"),
    color
  ),
  sd_price = sd(price),
  m_price = mean(price)
)

Mechanics

Longest Common Subsequence

# breakf-a---st
# break-dance--
stringdist::stringdist("breakfast", "breakdance", method = "lcs")
## [1] 7
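
The value 7 can be checked by hand: the longest common subsequence of the two words is "breaka" (length 6), and the LCS distance counts the characters left unmatched in both words, 9 + 10 - 2 * 6 = 7. A base-R sketch of that calculation follows. `lcs_length()` and `lcs_distance()` are hypothetical helpers written for this illustration, not part of stringdist.

```r
# Length of the longest common subsequence of two strings,
# computed with the standard dynamic-programming table.
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # dp[i + 1, j + 1]: LCS length of the first i chars of x and first j of y
  dp <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      dp[i + 1, j + 1] <- if (x[i] == y[j]) {
        dp[i, j] + 1L
      } else {
        max(dp[i, j + 1], dp[i + 1, j])
      }
    }
  }
  dp[length(x) + 1, length(y) + 1]
}

# LCS distance: number of characters not covered by the common subsequence
lcs_distance <- function(a, b) {
  nchar(a) + nchar(b) - 2L * lcs_length(a, b)
}

lcs_length("breakfast", "breakdance")
lcs_distance("breakfast", "breakdance")
```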

Code blocks

-> Sentences of function / variable names

-> “Sentences” of integers

-> Compute similarity score based on longest-common-subsequence

Mechanics (cont.)

Use seq_sim to compute LCS-based similarity between vectors of integers

to_ints <- function(word){
  as.integer(factor(strsplit(word, "")[[1]], levels = letters))
}

to_ints("breakfast")
## [1]  2 18  5  1 11  6  1 19 20
stringdist::seq_sim(
  list(to_ints("breakfast")), list(to_ints("breakdance")), method = "lcs"
) # similarity = 1 - lcs_distance / (|seq1| + |seq2|) = 1 - 7 / 19
## [1] 0.6315789
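
The same scoring can be applied to code once each block has been reduced to a sentence of integers. The sketch below illustrates the idea only and is not dupree's implementation: `code_to_tokens()` is a hypothetical helper, and the choice of which parse tokens to keep is an assumption.

```r
# Hypothetical helper: reduce a code block to its "sentence" of names
code_to_tokens <- function(code) {
  parsed <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)
  # keep leaf tokens only, in source order
  tokens <- tokens[tokens$terminal, ]
  tokens <- tokens[order(tokens$line1, tokens$col1), ]
  # Assumption: only variable and function names carry content
  keep <- tokens$token %in% c("SYMBOL", "SYMBOL_FUNCTION_CALL")
  tokens$text[keep]
}

block_a <- "x <- mean(price[clarity == 'SI1'])"
block_b <- "y <- mean(price[cut == 'Ideal'])"
block_c <- "plot(carat, depth)"

tokens <- lapply(list(block_a, block_b, block_c), code_to_tokens)

# One shared vocabulary, so the same name maps to the same integer everywhere
vocab <- unique(unlist(tokens))
ints <- lapply(tokens, match, table = vocab)

# LCS-based similarity between the integer sentences
sim_ab <- stringdist::seq_sim(ints[1], ints[2], method = "lcs")
sim_ac <- stringdist::seq_sim(ints[1], ints[3], method = "lcs")
```

Blocks a and b share most of their names in the same order, so `sim_ab` should come out higher than `sim_ac`.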

Duplication in a package

Download the source code for lintr from GitHub using ropensci/git2r.

# temporary dir for storing `lintr`'s source code
lintr_path <- file.path(tempdir(), "lintr")
lintr_repo <- git2r::clone(
  "https://github.com/jimhester/lintr",
  lintr_path
)

Duplication in a package (cont.)

Run dupree on lintr

dups <- dupree::dupree_package(
  lintr_path, min_block_size = 40
)

Duplication in a package (cont.)

dups %>%
  dplyr::filter(score > 0.4 & file_a != file_b) %>%
  dplyr::mutate_at(c("file_a", "file_b"), basename) %>%
  head()

GOTO: equals_na_lintr.R

Visualisation of duplication results

We make a tidygraph structure from the similarity scores

dup_graph <- dups %>%
  # keep code-block pairs with moderate similarity:
  dplyr::filter(score > 0.4) %>%
  dplyr::transmute(
    # indicate code-blocks by filename and start-line
    from = paste(basename(file_a), line_a),
    to = paste(basename(file_b), line_b),
    type = "duplication",
    score = score
  ) %>%
  tidygraph::as_tbl_graph() %>%
  # distinguish the file each code block came from
  dplyr::mutate(filename = gsub("(.*) \\d+$", "\\1", name))

Visualisation of duplication results (cont.)

library(ggraph) # also attaches ggplot2, which provides aes()

graph_image <- dup_graph %>%
  ggraph(layout = "gem") +
  geom_edge_link(
    aes(colour = type, edge_width = score)
  ) +
  geom_node_point(
    aes(colour = filename), size = 4, show.legend = FALSE
  ) +
  theme_graph()

Visualisation of duplication results (cont.)

graph_image

Visualisation of duplication results (cont.)

graph_image +
  geom_node_text(aes(label = name), repel = TRUE)

What is lintr, by the way?

  • style / syntax checker for R

  • configurable

  • can be run

    • in RStudio / Vim / Atom etc

    • or on Travis

  • (and dupree uses lintr’s file parsers)

refactoRing

  • Improving the structure of code (without modifying its function)

  • The rule of 3

  • Examples

    • Figures: Global theming / %+%

    • Statements: Replace with function call

    • Common functions: Move to a package

    • RMarkdown: Configurable reports / child-stubs
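
As a sketch of "Statements: Replace with function call", the two near-identical diamonds pipelines from the earlier slide can share one function. `summarise_price_by_color()` is a hypothetical name chosen for this example, and this is one possible refactoring rather than the only one.

```r
library(dplyr)
data(diamonds, package = "ggplot2")

# Hypothetical helper holding the logic both pipelines had in common
summarise_price_by_color <- function(data) {
  data %>%
    group_by(color) %>%
    summarise(m_price = mean(price), sd_price = sd(price))
}

# Each call site now states only what differs: the filter
si_summary <- diamonds %>%
  filter(clarity %in% c("SI1", "SI2")) %>%
  summarise_price_by_color()

good_cut_summary <- diamonds %>%
  filter(cut >= "Very Good") %>%
  summarise_price_by_color()
```

After this change the shared summary logic lives in one place, so a change to the summary statistics is made once.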

Thanks