This is a reply to “The root of all evil in data science” by Markus von Ins.
Markus von Ins claims that it is easy to write “viruses” in R because one can rewrite existing functions. IMHO, Markus von Ins fails to understand lexical scoping in R.
All code snippets from Markus start with the comment # from Markus.
var()
First Markus sets a baseline for comparison.
# from Markus
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
var(x)
## [1] 53.853
length(x)
## [1] 5
Let’s see what is in the global environment and how var() is defined.
ls()
## [1] "x"
var
## function (x, y = NULL, na.rm = FALSE, use)
## {
## if (missing(use))
## use <- if (na.rm)
## "na.or.complete"
## else "everything"
## na.method <- pmatch(use, c("all.obs", "complete.obs", "pairwise.complete.obs",
## "everything", "na.or.complete"))
## if (is.na(na.method))
## stop("invalid 'use' argument")
## if (is.data.frame(x))
## x <- as.matrix(x)
## else stopifnot(is.atomic(x))
## if (is.data.frame(y))
## y <- as.matrix(y)
## else stopifnot(is.atomic(y))
## .Call(C_cov, x, y, na.method, FALSE)
## }
## <bytecode: 0x3844138>
## <environment: namespace:stats>
Then Markus ‘overwrites’ var() with a different version.
# from Markus
var <- function(x){
n <- length(x)
mu <- mean(x)
v <- sum((x - mu) ^ 2) / n
v
}
var(x)
## [1] 43.0824
Let’s look again at the global environment. Notice that there is a new object var which wasn’t defined before. The new object is the function as defined by Markus. The original var() from the stats package is still available and unchanged.
ls()
## [1] "var" "x"
var
## function(x){
## n <- length(x)
## mu <- mean(x)
## v <- sum((x - mu) ^ 2) / n
## v
## }
stats::var
## function (x, y = NULL, na.rm = FALSE, use)
## {
## if (missing(use))
## use <- if (na.rm)
## "na.or.complete"
## else "everything"
## na.method <- pmatch(use, c("all.obs", "complete.obs", "pairwise.complete.obs",
## "everything", "na.or.complete"))
## if (is.na(na.method))
## stop("invalid 'use' argument")
## if (is.data.frame(x))
## x <- as.matrix(x)
## else stopifnot(is.atomic(x))
## if (is.data.frame(y))
## y <- as.matrix(y)
## else stopifnot(is.atomic(y))
## .Call(C_cov, x, y, na.method, FALSE)
## }
## <bytecode: 0x3844138>
## <environment: namespace:stats>
stats::var(x)
## [1] 53.853
So we have not changed stats::var() but masked it with the version in the global environment. The version in the global environment is selected because the global environment is always first on the search path.
searchpaths()
## [1] ".GlobalEnv" "/usr/lib/R/library/stats"
## [3] "/usr/lib/R/library/graphics" "/usr/lib/R/library/grDevices"
## [5] "/usr/lib/R/library/utils" "/usr/lib/R/library/datasets"
## [7] "/usr/lib/R/library/methods" "Autoloads"
## [9] "/usr/lib/R/library/base"
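To check where a name is defined, R offers find() (from utils) and conflicts() (from base). The sketch below is meant for a fresh session, because it creates and then removes its own global var; the helper names where_var and var_is_masked are only illustrative.

```r
# Minimal sketch: locate every visible definition of `var` on the search path.
# Assumes a fresh session with the default packages attached.

# Put a masking var() in the global environment (same formula as Markus's version).
assign("var", function(x) sum((x - mean(x))^2) / length(x), envir = globalenv())

# find() returns the search path entries that contain an object named "var",
# in the order in which R searches them.
where_var <- utils::find("var")
print(where_var)

# conflicts() lists names that exist in more than one place on the search path.
var_is_masked <- "var" %in% conflicts()
print(var_is_masked)

# Remove the mask again so the sketch leaves no trace.
rm("var", envir = globalenv())
```

The first entry returned by find() is the definition that a plain var(x) typed at the prompt would use.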
length()
In a second example, Markus creates a new function length() in the global environment.
# from Markus
var(x)
## [1] 43.0824
length <- function(x){
le_2 <- sum(x ^ 2)
le <- sqrt(le_2)
le
}
length(x)
## [1] 25.70953
var(x)
## [1] 8.378682
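The result of var(x) changes because the var() in the global environment calls the function length(). Markus's var() was defined in the global environment, so R starts looking for length() there. Because the new length() is first on the search path, that one is used instead of base::length().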
ls()
## [1] "length" "var" "x"
base::length(x)
## [1] 5
stats::var(x)
## [1] 53.853
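Which length() a function picks up depends on the environment the function is enclosed in, not on the function's own code. Here is a small sketch of that idea. The names count, count_base, count_sandbox and sandbox are made up for this illustration, and the environments are set by hand so the result does not depend on what happens to be in the global environment.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Hypothetical helper that uses whichever `length` it can see.
count <- function(x) length(x)

# Copy 1: enclosed in the base environment, so `length` resolves to base::length.
count_base <- count
environment(count_base) <- baseenv()

# Copy 2: enclosed in a sandbox that holds a Euclidean-norm length(),
# equivalent to the one Markus defined.
sandbox <- new.env(parent = baseenv())
assign("length", function(x) base::sqrt(base::sum(x^2)), envir = sandbox)
count_sandbox <- count
environment(count_sandbox) <- sandbox

n_base <- count_base(x)        # number of elements
n_sandbox <- count_sandbox(x)  # Euclidean norm, because the sandbox is searched first
print(c(n_base, n_sandbox))
```

The body of both copies is identical; only the enclosing environment differs, and that alone decides which length() is called.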
Attaching a package with library() puts it on the search path just after the global environment. If one of its exported functions has the same name as a function further along the search path, R prints a message listing the masked objects. The user can then inspect the masking function and test whether it affects their code.
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'lme4'
## The following object is masked from 'package:stats':
##
## sigma
searchpaths()
## [1] ".GlobalEnv"
## [2] "/home/thierry_onkelinx/R/x86_64-pc-linux-gnu-library/3.2/lme4"
## [3] "/home/thierry_onkelinx/R/x86_64-pc-linux-gnu-library/3.2/Matrix"
## [4] "/usr/lib/R/library/stats"
## [5] "/usr/lib/R/library/graphics"
## [6] "/usr/lib/R/library/grDevices"
## [7] "/usr/lib/R/library/utils"
## [8] "/usr/lib/R/library/datasets"
## [9] "/usr/lib/R/library/methods"
## [10] "Autoloads"
## [11] "/usr/lib/R/library/base"
identical(sigma, lme4::sigma)
## [1] TRUE
identical(sigma, stats::sigma)
## [1] FALSE
library(concordance)
##
## Attaching package: 'concordance'
## The following object is masked from 'package:lme4':
##
## sigma
## The following object is masked from 'package:stats':
##
## sigma
identical(sigma, lme4::sigma)
## [1] FALSE
identical(sigma, stats::sigma)
## [1] FALSE
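The same behaviour can be reproduced without installing any package, because attach() also places a list at position 2 of the search path. In the sketch below, fake_pkg and its median() are invented stand-ins for a package that exports a clashing name.

```r
# Attach a one-element list as if it were a package exporting median().
# R reports that median is masked from package:stats.
attach(list(median = function(x, ...) "masked median"),
       name = "fake_pkg", warn.conflicts = TRUE)

pos_fake <- match("fake_pkg", search())          # position on the search path
masked_now <- !identical(median, stats::median)  # plain `median` is now the fake one
direct_call <- stats::median(c(1, 5, 9))         # the original is still reachable

# Detaching removes the mask again.
detach("fake_pkg", character.only = TRUE)
restored <- identical(median, stats::median)

print(c(pos_fake = pos_fake, masked_now = masked_now, restored = restored))
```

Nothing in the stats package was modified at any point; the fake median() simply sat earlier on the search path.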
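Note that this masking only affects code whose lookup starts in the global environment. Functions inside a package look up names in the package's own namespace first, then in its imports, then in the base namespace, and only afterwards in the global environment and the rest of the search path. Packages listed under Depends are merely attached to the search path, and packages listed under Suggests are not in scope at all unless they are called with the foo::bar() notation. A secure way of programming functions in a package is therefore to import each external function explicitly in the NAMESPACE.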
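A quick way to see that package code is shielded from masks in the global environment is stats::sd(), which, as far as I know, calls var() internally. The sketch below is meant for a fresh session, since it creates and removes its own global var; the variable names are illustrative.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
sd_before <- stats::sd(x)

# Mask var() in the global environment with a deliberately wrong version.
assign("var", function(x, ...) 0, envir = globalenv())

# The global mask really is in place...
masked_result <- get("var", envir = globalenv())(x)

# ...but sd() lives in namespace:stats and finds stats::var there first.
sd_after <- stats::sd(x)

# Remove the mask again.
rm("var", envir = globalenv())

print(c(masked_result = masked_result, sd_before = sd_before, sd_after = sd_after))
```

The standard deviation is the same before and after the mask is created, so the global var() never reached the code inside the stats namespace.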
var()
When we use the foo::bar() notation, masking the function has no effect because our function will always use the function bar() from the package foo.
var(x)
## [1] 8.378682
var <- function(x){
n <- base::length(x)
mu <- base::mean(x)
v <- base::sum((x - mu) ^ 2) / n
v
}
mean <- function(x){0}
sum <- function(x){Inf}
var(x)
## [1] 43.0824
stats::var(x)
## [1] 53.853
identical(var, stats::var)
## [1] FALSE
identical(length, base::length)
## [1] FALSE
identical(mean, base::mean)
## [1] FALSE
identical(sum, base::sum)
## [1] FALSE
length(x)
## [1] Inf
mean(x)
## [1] 0
sum(x)
## [1] Inf
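Masking is also fully reversible: removing the object from the global environment makes the original visible again. A minimal sketch, assuming a fresh session (the variable names are illustrative):

```r
# Create a mask for sum() in the global environment, as in the example above.
assign("sum", function(x) Inf, envir = globalenv())
masked_sum <- sum(1:3)      # uses the mask

# Remove the mask; the next lookup falls through to base::sum.
rm("sum", envir = globalenv())
restored_sum <- sum(1:3)    # uses base::sum again

print(list(masked = masked_sum, restored = restored_sum))
```

In the session from this post, rm(var, length, mean, sum) would do the same for all four masks at once.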