Introduction

This is a reply to “The root of all evil in data science” by Markus von Ins.

Markus von Ins claims that it is easy to write “viruses” in R because one can rewrite existing functions. IMHO, Markus von Ins fails to understand lexical scoping in R.

All code snippets from Markus start with the comment # from Markus.

Changing var()

First Markus sets a baseline for comparison.

# from Markus
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
var(x)
## [1] 53.853
length(x)
## [1] 5

Let’s see what is in the global environment and how var() is defined.

ls()
## [1] "x"
var
## function (x, y = NULL, na.rm = FALSE, use) 
## {
##     if (missing(use)) 
##         use <- if (na.rm) 
##             "na.or.complete"
##         else "everything"
##     na.method <- pmatch(use, c("all.obs", "complete.obs", "pairwise.complete.obs", 
##         "everything", "na.or.complete"))
##     if (is.na(na.method)) 
##         stop("invalid 'use' argument")
##     if (is.data.frame(x)) 
##         x <- as.matrix(x)
##     else stopifnot(is.atomic(x))
##     if (is.data.frame(y)) 
##         y <- as.matrix(y)
##     else stopifnot(is.atomic(y))
##     .Call(C_cov, x, y, na.method, FALSE)
## }
## <bytecode: 0x3844138>
## <environment: namespace:stats>

Then Markus ‘overwrites’ var() with a different version.

# from Markus
var <- function(x){
  n <- length(x)
  mu <- mean(x)
  v <- sum((x - mu) ^ 2) / n
  v
}
var(x)
## [1] 43.0824

Let’s look again at the global environment. Notice that there is a new object var which wasn’t defined before. The new object is the function as defined by Markus. The original var() from the stats package is still available and unchanged.

ls()
## [1] "var" "x"
var
## function(x){
##   n <- length(x)
##   mu <- mean(x)
##   v <- sum((x - mu) ^ 2) / n
##   v
## }
stats::var
## function (x, y = NULL, na.rm = FALSE, use) 
## {
##     if (missing(use)) 
##         use <- if (na.rm) 
##             "na.or.complete"
##         else "everything"
##     na.method <- pmatch(use, c("all.obs", "complete.obs", "pairwise.complete.obs", 
##         "everything", "na.or.complete"))
##     if (is.na(na.method)) 
##         stop("invalid 'use' argument")
##     if (is.data.frame(x)) 
##         x <- as.matrix(x)
##     else stopifnot(is.atomic(x))
##     if (is.data.frame(y)) 
##         y <- as.matrix(y)
##     else stopifnot(is.atomic(y))
##     .Call(C_cov, x, y, na.method, FALSE)
## }
## <bytecode: 0x3844138>
## <environment: namespace:stats>
stats::var(x)
## [1] 53.853

So we have not changed stats::var(); we have only masked it with the version in the global environment. The one in the global environment is selected because the global environment is always first on the search path.

searchpaths()
## [1] ".GlobalEnv"                   "/usr/lib/R/library/stats"    
## [3] "/usr/lib/R/library/graphics"  "/usr/lib/R/library/grDevices"
## [5] "/usr/lib/R/library/utils"     "/usr/lib/R/library/datasets" 
## [7] "/usr/lib/R/library/methods"   "Autoloads"                   
## [9] "/usr/lib/R/library/base"
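
A small sketch of how one could inspect and undo the masking, assuming the default packages are attached. The names masked_result, where_found and unmasked_result are only there for illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Mask stats::var() with a population variance, as in the example from Markus.
var <- function(x) {
  sum((x - mean(x))^2) / length(x)
}
masked_result <- var(x)

# find() lists every entry on the search path that holds an object named "var".
# When run at top level this should include both ".GlobalEnv" and "package:stats".
where_found <- find("var")

# Removing the masking object makes the name resolve to stats::var() again.
rm(var)
unmasked_result <- var(x)
```

Removing the object from the global environment is enough to get the original behaviour back, which is hard to reconcile with the idea that stats::var() was overwritten.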

Changing length()

In a second example Markus creates a new function length() in the global environment.

# from Markus
var(x)
## [1] 43.0824
length <- function(x){
  le_2 <- sum(x ^ 2)
  le <- sqrt(le_2)
  le
}
length(x)
## [1] 25.70953
var(x)
## [1] 8.378682

The result of var(x) changes because the var() in the global environment calls length(). R looks up length() starting from the environment in which var() was defined, which is the global environment, so it finds the new length() first.

ls()
## [1] "length" "var"    "x"
base::length(x)
## [1] 5
stats::var(x)
## [1] 53.853
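
The environment() function shows where name lookup starts for a given function. The sketch below repeats the two definitions from Markus in compact form; user_env, stats_env, user_result and stats_result are illustrative names, not part of the original example.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Compact versions of the two masking functions from Markus.
length <- function(x) sqrt(sum(x^2))
var <- function(x) sum((x - mean(x))^2) / length(x)

# The enclosing environment of a function is where R starts looking up names.
# For the user-defined var() this is the environment the code was run in
# (the global environment when run at top level).
user_env <- environmentName(environment(var))
# For stats::var() it is the namespace of the stats package.
stats_env <- environmentName(environment(stats::var))

user_result <- var(x)          # picks up the masking length()
stats_result <- stats::var(x)  # not affected by the masking

# Clean up the masking objects.
rm(var, length)
```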

‘Virus’ packages can only mask functions

Loading a new package puts it on the search path just after the global environment. In case one of the exported functions has the same name as a function further along the search path, the user gets a warning message. The user is then able to inspect the masking function and test whether it influences their code.

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'lme4'
## The following object is masked from 'package:stats':
## 
##     sigma
searchpaths()
##  [1] ".GlobalEnv"                                                     
##  [2] "/home/thierry_onkelinx/R/x86_64-pc-linux-gnu-library/3.2/lme4"  
##  [3] "/home/thierry_onkelinx/R/x86_64-pc-linux-gnu-library/3.2/Matrix"
##  [4] "/usr/lib/R/library/stats"                                       
##  [5] "/usr/lib/R/library/graphics"                                    
##  [6] "/usr/lib/R/library/grDevices"                                   
##  [7] "/usr/lib/R/library/utils"                                       
##  [8] "/usr/lib/R/library/datasets"                                    
##  [9] "/usr/lib/R/library/methods"                                     
## [10] "Autoloads"                                                      
## [11] "/usr/lib/R/library/base"
identical(sigma, lme4::sigma)
## [1] TRUE
identical(sigma, stats::sigma)
## [1] FALSE
library(concordance)
## 
## Attaching package: 'concordance'
## The following object is masked from 'package:lme4':
## 
##     sigma
## The following object is masked from 'package:stats':
## 
##     sigma
identical(sigma, lme4::sigma)
## [1] FALSE
identical(sigma, stats::sigma)
## [1] FALSE

Note that the masking only affects code that is evaluated in the global environment. A function inside a package looks up names in the namespace of the package first, then in the imports of the package, then in the base namespace, and only after that in the global environment and the rest of the search path. A secure way of programming functions in a package is therefore to explicitly import each external function in the NAMESPACE.
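
A quick way to see this is stats::sd(), which calls var() internally. The sketch below masks var() with a deliberately wrong function and checks that sd() is not affected. It assumes the stats package is attached, as it is by default.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# A deliberately wrong var() in the calling environment.
var <- function(x) 0

masked_var <- var(x)  # uses the wrong version defined above
# sd() lives in the stats namespace, so its internal call to var()
# resolves to stats::var() and ignores the mask.
sd_result <- sd(x)

# Clean up the masking object.
rm(var)
```

If masking really rewrote the function, sd(x) would return 0 here. Instead it still equals sqrt(stats::var(x)).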

Safe version of user-defined var()

When we use the foo::bar() notation, masking bar() has no effect, because our function will always use bar() from the package foo.

var(x)
## [1] 8.378682
var <- function(x){
  n <- base::length(x)
  mu <- base::mean(x)
  v <- base::sum((x - mu) ^ 2) / n
  v
}
mean <- function(x){0}
sum <- function(x){Inf}
var(x)
## [1] 43.0824
stats::var(x)
## [1] 53.853
identical(var, stats::var)
## [1] FALSE
identical(length, base::length)
## [1] FALSE
identical(mean, base::mean)
## [1] FALSE
identical(sum, base::sum)
## [1] FALSE
length(x)
## [1] Inf
mean(x)
## [1] 0
sum(x)
## [1] Inf

Some guidelines

  1. Don’t mask existing functions.
  2. Check any function that masks another function.
  3. Place stable functions into a package.
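
For the second guideline, base R offers conflicts(), which reports names that occur more than once on the search path. A minimal sketch, assuming it is run at top level so that the new var() lands in the global environment; all_conflicts and global_masks are illustrative names.

```r
# Create a mask on purpose.
var <- function(x) sum((x - mean(x))^2) / length(x)

# With detail = TRUE the result is a list with one element per search path
# entry that holds conflicting names.
all_conflicts <- conflicts(detail = TRUE)

# Names in the global environment that also exist elsewhere on the search path.
# This is NULL when the global environment masks nothing.
global_masks <- all_conflicts[[".GlobalEnv"]]

# Clean up the masking object.
rm(var)
```

Running conflicts(detail = TRUE) after loading packages gives the same overview for package-level masking as the messages printed by library().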

More information