Module 5 Lesson 13 :R Programming Structures 2

Environment and Scope Issues

A function—formally referred to as a closure in the R documentation—consists not only of its arguments and body but also of its environment. The latter is made up of the collection of objects present at the time the function is created. An understanding of how environments work in R is essential for writing effective R functions.

The top-level Environment

Consider this example:

w<- 12
f<- function(y) {
  d<-8
  h<-function() {
    return(d*(w+y))
  }
  return(h())
}
environment(f)

## <environment: R_GlobalEnv>

Here, the function f() is created at the top level—that is, at the interpreter command prompt—and thus has the top-level environment, which in R output is referred to as R_GlobalEnv but which confusingly you refer to in R code as .GlobalEnv. If you run an R program as a batch file, that is considered top level, too.
The function ls() lists the objects of an environment. If you call it at the top level, you get the top-level environment. Let’s try it with our example code:

ls()

## [1] "f" "w"

As you can see, the top-level environment here includes the variable w, which is actually used within f(). Note that f() is here too, as functions are indeed objects and we did create it at the top level.
At levels other than the top, ls() works a little differently, as you’ll see in the next section.

You get a bit more information from ls.str():

ls.str()

## f : function (y)  
## w :  num 12

Next, we’ll look at how w and other variables come into play within f().

The Scope Hierarchy

Let’s first get an intuitive overview of how scope works in R and then relate it to environments.
If we were working with the C language (as usual, background in C is not assumed), we would say that the variable w in the previous section is global to f(), while d is local to f(). Things are similar in R, but R is more hierarchical. If we were working with the C language (as usual, background in C is not assumed), we would say that the variable w in the previous section is global to f(), while d is local to f(). Things are similar in R, but R is more hierarchical.
In C, we would not have functions defined within functions, as we have with h() inside f() in our example.
Yet, since functions are objects, it is possible—and sometimes desirable from the point of view of the encapsulation goal of object-oriented programming—to define a function within a function; we are simply creating an object, which we can do anywhere.
Here, we have h() being local to f(), just like d. In such a situation, it makes sense for scope to be hierarchical. Thus, R is set up so that d, which is local to f(), is in turn global to h(). The same is true for y, as arguments are considered locals in R.
Similarly, the hierarchical nature of scope implies that since w is global to f(), it is global to h() as well. Indeed, we do use w within h().
In terms of environments then, h()’s environment consists of whatever objects are defined at the time h() comes into existence; that is, at the time that this assignment is executed:

h<-function() {
  return(d*(w+y))
}

(If f() is called multiple times, h() will come into existence multiple times, going out of existence each time f() returns.)
What, then, will be in h()’s environment? Well, at the time h() is created, there are the objects d and y created within f(), plus f()’s environment (w). In other words, if one function is defined within another, then that inner function’s environment consists of the environment of the outer one, plus whatever locals have been created so far within the outer one. With multiple nesting of functions, you have a nested sequence of larger and larger environments, with the “root” consisting of the top-level objects.

Let’s try out the code:

f(2)

## [1] 112

What happened? The call f(2) resulted in setting the local variable d to 8, followed by the call h(). The latter evaluated d(w+y)—that is, 8(12+2)— giving us 112.
Note carefully the role of w. The R interpreter found that there was no local variable of that name, so it ascended to the next higher level—in this case, the top level—where it found a variable w with value 12.

Keep in mind that h() is local to f() and invisible at the top level.

## function() {
##   return(d*(w+y))
## }

It’s possible (though not desirable) to deliberately allow name conflicts in this hierarchy. In our example, for instance, we could have a local variable d within h(), conflicting with the one in f(). In such a situation, the innermost environment is used first. In this case, a reference to d within h() would refer to h()’s d, not f()’s.
Environments created by inheritance in this manner are generally referred to by their memory locations. Here is what happened after adding a print statement to f() (using edit(), not shown here) and then running the code:

## function(y) {
##   d<-8
##   h<-function() {
##     return(d*(w+y))
##   }
##   return(h())
## }

function(y) {
d <- 8
h <- function() {
return(d*(w+y))
}
print(environment(h))
return(h())
}

## function(y) {
## d <- 8
## h <- function() {
## return(d*(w+y))
## }
## print(environment(h))
## return(h())
## }

f(2)

## [1] 112

Compare all this to the situation in which the functions are not nested:

## function(y) {
##   d<-8
##   h<-function() {
##     return(d*(w+y))
##   }
##   return(h())
## }
## <bytecode: 0x00000000151e7090>

function(y) {
d <- 8
return(h())
}

## function(y) {
## d <- 8
## return(h())
## }

## function() {
##   return(d*(w+y))
## }

function() {
return(d*(w+y))
}

## function() {
## return(d*(w+y))
## }

The result is as follows:

f(5)

## [1] 136

This does not work, as d is no longer in the environment of h(), because h() is defined at the top level. Thus, an error is generated.
Worse, if by happenstance there had been some unrelated variable d in the top-level environment, we would not get an error message but instead would have incorrect results.
You might wonder why R didn’t complain about the lack of y in the alternate definition of h() in the preceding example. As mentioned earlier, R doesn’t evaluate a variable until it needs it under a policy called lazy evaluation. In this case, R had already encountered an error with d and thus never got to the point where it would try to evaluate y.

The fix is to pass d and y as arguments:

function(y) {
d <- 8
return(h(d,y))
}

## function(y) {
## d <- 8
## return(h(d,y))
## }

## function() {
##   return(d*(w+y))
## }

function(dee,yyy) {
return(dee*(w+yyy))
}

## function(dee,yyy) {
## return(dee*(w+yyy))
## }

f(2)

## [1] 112

Okay, let’s look at one last variation:

f<-function(y,ftn) {
d <- 8
print(environment(ftn))
return(ftn(d,y))
}

h<-function(dee,yyy) {
return(dee*(w+yyy))
}

w <- 12
f(3,h)

## <environment: R_GlobalEnv>

## [1] 120

When f() executed, the formal argument ftn was matched by the actual argument h. Since arguments are treated as locals, you might guess that ftn could have a different environment than top level. But as discussed, a closure includes environment, and thus ftn has h’s environment.
Note carefully that all the examples so far involving nonlocal variables are for reads, not writes. The case of writes is crucial, and it will be covered in Section.

More on ls()

Without arguments, a call to ls() from within a function returns the names of the current local variables (including arguments). With the envir argument, it will print the names of the locals of any frame in the call chain.

Here’s an example:

f<- function(y) {
d <- 8
return(h(d,y))
}

h<-function(dee,yyy) {
print(ls())
print(ls(envir=parent.frame(n=1)))
return(dee*(w+yyy))
}

f(2)

## [1] "dee" "yyy"
## [1] "d" "y"

## [1] 112

With parent.frame(), the argument n specifies how many frames to go up in the call chain. Here, we were in the midst of executing h(), which had been called from f(), so specifying n = 1 gives us f()’s frame, and thus we get its locals.

Functions Have (Almost) No Side Effects

Yet another influence of the functional programming philosophy is that functions do not change nonlocal variables; that is, generally, there are no side effects. Roughly speaking, the code in a function has read access to its nonlocal variables, but it does not have write access to them. Our code can appear to reassign those variables, but the action will affect only copies, not the variables themselves. Let’s demonstrate this by adding some more code to our previous example.

w <- 12
f<- function(y) {
  d <- 8
  w <- w + 1
  y <- y - 2
  print(w)
  h <- function() {
    return(d*(w+y))
    }
  return(h())
}
t <- 4
f(t)

## [1] 13

## [1] 120

## [1] 12

## [1] 4

So, w at the top level did not change, even though it appeared to change within f(). Only a local copy of w, within f(), changed. Similarly, the top-level variable t didn’t change, even though its associated formal argument y did change.

No Pointers in R

R does not have variables corresponding to pointers or references like those of, say, the C language. This can make programming more difficult in some cases. (As of this writing, the current version of R has an experimental feature called reference classes, which may reduce the difficulty.)
For example, you cannot write a function that directly changes its arguments. In Python, for instance, you can do this:

x = [13,5,12]
x.sort()
x

## [5, 12, 13]

Here, the value of x, the argument to sort(), changed. By contrast, here’s how it works in R:

x <- c(13,5,12)
sort(x)

## [1]  5 12 13

## [1] 13  5 12

The argument to sort() does not change. If we do want x to change in this R code, the solution is to reassign the arguments:

x <- sort(x)
x

## [1]  5 12 13

What if our function has several variables of output? A solution is to gather them together into a list, call the function with this list as an argument, have the function return the list, and then reassign to the original list.
An example is the following function, which determines the indices of odd and even numbers in a vector of integers:

oddsevens<-function(v){
    odds <- which(v %% 2 == 1)
    evens <- which(v %% 2 == 1)
    list(o=odds,e=evens)
}

In general, our function f() changes variables x and y. We might store them in a list lxy, which would then be our argument to f(). The code, both called and calling, might have a pattern like this:

f <- function(lxxyy) {
...
lxxyy$x <- ...
lxxyy$y <- ...
return(lxxyy)
}

# set x and y
lxy$x <- ...
lxy$y <- ...
lxy <- f(lxy)
# use new x and y
... <- lxy$x
... <- lxy$y

However, this may become unwieldy if your function will change many variables. It can be especially awkward if your variables, say x and y in the example, are themselves lists, resulting in a return value consisting of lists within a list. This can still be handled, but it makes the code more syntactically complex and harder to read.
Alternatives include the use of global variables, which we will look at in the next section, and the new R reference classes mentioned earlier.
Another class of applications in which lack of pointers causes difficulties is that of treelike data structures. C code normally makes heavy use of pointers for these kinds of structures. One solution for R is to revert to what was done in the “good old days” before C, when programmers formed their own “pointers” as vector indices.

Writing Upstairs

As mentioned earlier, code that exists at a certain level of the environment hierarchy has at least read access to all the variables at the levels above it. On the other hand, direct write access to variables at higher levels via the standard <- operator is not possible.
If you wish to write to a global variable—or more generally, to any variable higher in the environment hierarchy than the level at which your write statement exists—you can use the superassignment operator, <<-, or the assign() function. Let’s discuss the superassignment operator first.

Writing to Nonlocals with the Superassignment Operator

Consider the following code:

two <- function(u) {
  u <<- 2*u
  z <- 2*z
}

x <- 1
z <- 3
u
Error: object "u" not found 
two(x)
x
[1] 1
z
[1] 3
u
[1] 2

Let’s look at the impact (or not) on the three top-level variables x, z, and u:
- x: Even though x was the actual argument to two() in the example, it retained the value 1 after the call. This is because its value 1 was copied to the formal argument u, which is treated as a local variable within the function. Thus, when u changed, x did not change with it.
- z: The two z values are entirely unrelated to each other—one is top level, and the other is local to two(). The change in the local variable has no effect on the global variable. Of course, having two variables with the same name is probably not good programming practice.
- u: The u value did not even exist as a top-level variable prior to our calling two(), hence the “not found” error message. However, it was created as a top-level variable by the superassignment operator within two(), as confirmed after the call.

Though <<- is typically used to write to top-level variables, as in our example, technically, it does something a bit different. Use of this operator to write to a variable w will result in a search up the environment hierarchy, stopping at the first level at which a variable of that name is encountered. If none is found, then the level selected will be global. Look what happens in this little example:

f<-function() {
inc <- function() {x <<- x + 1}
x <- 3
inc()
return(x)
}
f()
[1] 4
x
Error: object 'x' not found

Here, inc() is defined within f(). When inc() is executing, and the R interpreter sees a superassignment to x, it starts going up the hierarchy. At the first level up—the environment within f()—it does find an x, and so that x is the one that is written to, not x at the top level.

Writing to Nonlocals with assign()

You can also use the assign() function to write to upper-level variables. Here’s an altered version of the previous example:

two<-function(u) {
assign("u",2*u,pos=.GlobalEnv)
z <- 2*z
}

two(x)
x
[1] 1
u
[1] 2

Here, we replaced the superassignment operator with a call to assign(). That call instructs R to assign the value 2*u (this is the local u) to a variable u further up the call stack, specifically in the top-level environment. In this case, that environment is only one call level higher, but if we had a chain of calls, it could be much further up.

Extended Example: Discrete-Event Simulation in R

Discrete-event simulation (DES) is widely used in business, industry, and government. The term discrete event refers to the fact that the state of the system changes only in discrete quantities, rather than changing continuously.
A typical example would involve a queuing system, say people lining up to use an ATM. Let’s define the state of our system at time t to be the number of people in the queue at that time. The state changes only by +1, when someone arrives, or by −1, when a person finishes an ATM transaction. This is in contrast to, for instance, a simulation of weather, in which temperature, barometric pressure, and so on change continuously.
This will be one of the longer, more involved examples in this book. But it exemplifies a number of important issues in R, especially concerning global variables, and will serve as an example when we discuss appropriate use global variables in the next section. Your patience will turn out to be a good investment of time. (It is not assumed here that the reader has any prior background in DES.)
Central to DES operation is maintenance of the event list, which is simply a list of scheduled events. This is a general DES term, so the word list here does not refer to the R data type. In fact, we’ll represent the event list by a data frame.
In the ATM example, for instance, the event list might at some point in the simulation look like this:
- customer 1 arrives at time 23.12
- customer 2 arrives at time 25.88
- customer 3 arrives at time 25.97
- customer 1 finishes service at time 26.02
Since the earliest event must always be handled next, the simplest form of coding the event list is to store it in time order, as in the example. (Readers with computer science background might notice that a more efficient approach might be to use some kind of binary tree for storage.) Here, we will implement it as a data frame, with the first row containing the earliest scheduled event, the second row containing the second earliest, and so on.
The main loop of the simulation repeatedly iterates. Each iteration pulls the earliest event off of the event list, updates the simulated time to reflect the occurrence of that event, and reacts to this event. The latter action will typically result in the creation of new events. For example, if a customer arrival occurs when the queue is empty, that customer’s service will begin—one event triggers setting up another. Our code must determine the customer’s service time, and then it will know the time at which service will be finished, which is another event that must be added to the event list.
One of the oldest approaches to writing DES code is the event-oriented paradigm. Here, the code to handle the occurrence of one event directly sets up another event, reflecting our preceding discussion.
As an example to guide your thinking, consider the ATM situation. At time 0, the queue is empty. The simulation code randomly generates the time of the first arrival, say 2.3. At this point, the event list is simply (2.3,“arrival”). This event is pulled off the list, simulated time is updated to 2.3, and we react to the arrival event as follows:
The queue for the ATM is empty, so we start the service by randomly generating the service time—say it is 1.2 time units. Then the completion of service will occur at simulated time 2.3 + 1.2 = 3.5.
We add the completion of service event to the event list, which will now consist of (3.5,“service done”)).
We also generate the time to the next arrival, say 0.6, which means the arrival will occur at time 2.9. Now the event list consists of (2.9,“arrival”) and (3.5,“service done”).
The code consists of a generally applicable library. We also have an example application, which simulates an M/M/1 queue, which is a singleserver queue in which both interarrival time and service time are exponentially distributed.

Here is a summary of the library functions:

schedevnt(): Inserts a newly created event into the event list.
getnextevnt(): Pulls the earliest event off the event list.
dosim(): Includes the core loop of the simulation. Repeatedly calls getnextevnt() to get the earliest of the pending events; updates the current simulated time, sim$currtime, to reflect the occurrence of that event; and calls the application-specific function reactevnt() to process this newly occurred event.

The code uses the following application-specific functions: - initglbls(): Initializes the application-specific global variables. - reactevnt(): Takes the proper actions when an event occurs, typically generating new events as a result. - prntrslts(): Prints the application-specific results of the simulation.

# DES.R: R routines for discrete-event simulation (DES)

# each event will be represented by a data frame row consisting of the
# following components: evnttime, the time the event is to occur;
# evnttype, a character string for the programmer-defined event type;
# optional application-specific components, e.g.
# the job's arrival time in a queuing app

# a global list named "sim" holds the events data frame, evnts, and
# current simulated time, currtime; there is also a component dbg, which
# indicates debugging mode

# forms a row for an event of type evntty that will occur at time
# evnttm; see comments in schedevnt() regarding appin
evntrow <- function(evnttm,evntty,appin=NULL) {
rw <- c(list(evnttime=evnttm,evnttype=evntty),appin)
return(as.data.frame(rw))
}

# insert event with time evnttm and type evntty into event list;
# appin is an optional set of application-specific traits of this event,
# specified in the form a list with named components
schedevnt <- function(evnttm,evntty,appin=NULL) {
newevnt <- evntrow(evnttm,evntty,appin)
# if the event list is empty, set it to consist of evnt and return
if (is.null(sim$evnts)) {
sim$evnts <<- newevnt
return()
}
# otherwise, find insertion point
inspt <- binsearch((sim$evnts)$evnttime,evnttm)
# now "insert," by reconstructing the data frame; we find what
# portion of the current matrix should come before the new event and
# what portion should come after it, then string everything together
before <-if (inspt == 1) NULL else sim$evnts[1:(inspt-1),]
nr <- nrow(sim$evnts)
after <- if (inspt <= nr) sim$evnts[inspt:nr,] else NULL
sim$evnts <<- rbind(before,newevnt,after)
}

# binary search of insertion point of y in the sorted vector x; returns
# the position in x before which y should be inserted, with the value
# length(x)+1 if y is larger than x[length(x)]; could be changed to C
# code for efficiency
binsearch <- function(x,y) {
n <- length(x)
lo <- 1
hi <- n
 while(lo+1 < hi) {
mid <- floor((lo+hi)/2)
if (y == x[mid]) return(mid)
if (y < x[mid]) hi <- mid else lo <- mid
}
if (y <= x[lo]) return(lo)
if (y < x[hi]) return(hi)
return(hi+1)
}

# start to process next event (second half done by application
# programmer via call to reactevnt())
getnextevnt <- function() {
head <- sim$evnts[1,]
# delete head
if (nrow(sim$evnts) == 1) {
sim$evnts <<- NULL
} else sim$evnts <<- sim$evnts[-1,]
return(head)
}

# simulation body
# arguments:
# initglbls: application-specific initialization function; inits
# globals to statistical totals for the app, etc.; records apppars
# in globals; schedules the first event
# reactevnt: application-specific event handling function, coding the
# proper action for each type of event
# prntrslts: prints application-specific results, e.g. mean queue
# wait
# apppars: list of application-specific parameters, e.g.
# number of servers in a queuing app
# maxsimtime: simulation will be run until this simulated time
# dbg: debug flag; if TRUE, sim will be printed after each event
dosim <- function(initglbls,reactevnt,prntrslts,maxsimtime,apppars=NULL, dbg=FALSE) {
  sim <<- list()
  sim$currtime <<- 0.0 # current simulated time
  sim$evnts <<- NULL # events data frame
sim$dbg <<- dbg
initglbls(apppars)
while(sim$currtime < maxsimtime) {
head <- getnextevnt()
sim$currtime <<- head$evnttime # update current simulated time
reactevnt(head) # process this event
if (dbg) print(sim)
}
prntrslts()
}

The following is an example application of the code. Again, the simulation models an M/M/1 queue, which is a single-server queuing system in which service times and times between job arrivals are exponentially distributed.

# DES application: M/M/1 queue, arrival rate 0.5, service rate 1.0
# the call
# dosim(mm1initglbls,mm1reactevnt,mm1prntrslts,10000.0,
# list(arrvrate=0.5,srvrate=1.0))
# should return a value of about 2 (may take a while)

# initializes global variables specific to this app
mm1initglbls <- function(apppars) {
mm1glbls <<- list()
# simulation parameters
mm1glbls$arrvrate <<- apppars$arrvrate
mm1glbls$srvrate <<- apppars$srvrate
# server queue, consisting of arrival times of queued jobs
mm1glbls$srvq <<- vector(length=0)
# statistics
mm1glbls$njobsdone <<- 0 # jobs done so far
mm1glbls$totwait <<- 0.0 # total wait time so far
# set up first event, an arrival; the application-specific data for
# each event will consist of its arrival time, which we need to
# record in order to later calculate the job's residence time in the
# system
arrvtime <- rexp(1,mm1glbls$arrvrate)
schedevnt(arrvtime,"arrv",list(arrvtime=arrvtime))
}

 # application-specific event processing function called by dosim()
# in the general DES library
mm1reactevnt <- function(head) {
if (head$evnttype == "arrv") { # arrival
 # if server free, start service, else add to queue (added to queue
# even if empty, for convenience)

if (length(mm1glbls$srvq) == 0) {
mm1glbls$srvq <<- head$arrvtime
srvdonetime <- sim$currtime + rexp(1,mm1glbls$srvrate)
schedevnt(srvdonetime,"srvdone",list(arrvtime=head$arrvtime))
} else mm1glbls$srvq <<- c(mm1glbls$srvq,head$arrvtime)
 # generate next arrival
arrvtime <- sim$currtime + rexp(1,mm1glbls$arrvrate)
schedevnt(arrvtime,"arrv",list(arrvtime=arrvtime))
} else { # service done
# process job that just finished
# do accounting
mm1glbls$njobsdone <<- mm1glbls$njobsdone + 1
mm1glbls$totwait <<-
mm1glbls$totwait + sim$currtime - head$arrvtime
 # remove from queue
mm1glbls$srvq <<- mm1glbls$srvq[-1]
# more still in the queue?
if (length(mm1glbls$srvq) > 0) {
# schedule new service
srvdonetime <- sim$currtime + rexp(1,mm1glbls$srvrate)
schedevnt(srvdonetime,"srvdone",list(arrvtime=mm1glbls$srvq[1]))
}
 }
 }

 mm1prntrslts <- function() {
print("mean wait:")
 print(mm1glbls$totwait/mm1glbls$njobsdone)
}

To see how all this works, take a look at the M/M/1 application code. There, we have set up a global variable, mm1glbls, which contains variables relevant to the M/M/1 code, such as mm1glbls$totwait, the running total of the wait time of all jobs simulated so far. As you can see, the superassignment operator is used to write to such variables, as in this statement:

mm1glbls$srvq <<- mm1glbls$srvq[-1]

Let’s look at mm1reactevnt() to see how the simulation works, focusing on the code portion in which a “service done” event is handled.

} else { # service done
# process job that just finished
# do accounting
mm1glbls$njobsdone <<- mm1glbls$njobsdone + 1
mm1glbls$totwait <<-
mm1glbls$totwait + sim$currtime - head$arrvtime
# remove this job from queue
mm1glbls$srvq <<- mm1glbls$srvq[-1]
# more still in the queue?
if (length(mm1glbls$srvq) > 0) {
# schedule new service
srvdonetime <- sim$currtime + rexp(1,mm1glbls$srvrate)
schedevnt(srvdonetime,"srvdone",list(arrvtime=mm1glbls$srvq[1]))
}
}

First, this code does some bookkeeping, updating the totals of number of jobs completed and wait time. It then removes this newly completed job from the server queue. Finally, it checks if there are still jobs in the queue and, if so, calls schedevnt() to arrange for the service of the one at the head. What about the DES library code itself? First note that the simulation state, consisting of the current simulated time and the event list, has been placed in an R list structure, sim. This was done in order to encapsulate all the main information into one package, which in R, typically means using a list. The sim list has been made a global variable.
As mentioned, a key issue in writing a DES library is the event list. This code implements it as a data frame, sim$evnts. Each row of the data frame corresponds to one scheduled event, with information about the event time, a character string representing the event type (say arrival or service completion), and any application-specific data the programmer wishes to add. Since each row consists of both numeric and character data, it was natural to choose a data frame for representing this event list. The rows of the data frame are in ascending order of event time, which is contained in the first column.
The main loop of the simulation is in dosim() of the DES library code, beginning at line 91:

while(sim$currtime < maxsimtime) {
head <- getnextevnt()
sim$currtime <<- head$evnttime # update current simulated time
reactevnt(head) # process this event
if (dbg) print(sim)
}

First, getnextevnt() is called to remove the head (the earliest event) from the event list. (Note the side effect: The event list changes.) Then the current simulated time is updated according to the scheduled time in the head event. Finally, the programmer-supplied function reactevnt() is called to process the event (as seen in the M/M/1 code discussed earlier).
The main potential advantage of using a data frame as our structure here is that it enables us to maintain the event list in ascending order by time via a binary search operation by event time. This is done in line 31 within schedevnt(), the function that inserts a newly created event into the event list:

inspt <- binsearch((sim$evnts)$evnttime,evnttm)

Here, we wish to insert a newly created event into the event list, and the fact that we are working with a vector enables the use of a fast binary search. (As noted in the comments in the code, though, this really should be implemented in C for good performance.) A later line in schedevnt()is a good example of the use of rbind():

sim$evnts <<- rbind(before,newevnt,after)

Now, we have extracted the events in the event list whose times are earlier than that of evnt and stored them in before. We also constructed a similar set in after for the events whose times are later than that of newevnt. We then use rbind() to put all these together in the proper order.

When Should You Use Global Variables?

Use of global variables is a subject of controversy in the programming community. Obviously, the question raised by the title of this section cannot be answered in any formulaic way, as it is a matter of personal taste and style. Nevertheless, most programmers would probably consider the outright banning of global variables, which is encouraged by many teachers of programming, to be overly rigid. In this section, we will explore the possible value of globals in the context of the structures of R. Here, the term global variable, or just global, will be used to include any variable located higher in the environment hierarchy than the level of the given code of interest.
The use of global variables in R is more common than you may have guessed. You might be surprised to learn that R itself makes very substantial use of globals internally, both in its C code and in its R routines. The superassignment operator <<-, for instance, is used in many of the R library functions (albeit typically in writing to a variable just one level up in the environment hierarchy). Threaded code and GPU code, which are used for writing fast programs (as described in Chapter 16), tend to make heavy use of global variables, which provide the main avenue of communication between parallel actors.
Now, to make our discussion concrete, let’s return to the earlier example from previous section:

f <- function(lxxyy) { # lxxyy is a list containing x and y
...
lxxyy$x <- ...
lxxyy$y <- ...
return(lxxyy)
}

# set x and y
lxy$x <- ...
lxy$y <- ...
lxy <- f(lxy)
# use new x and y
... <- lxy$x
... <- lxy$y

As noted earlier, this code might be a bit unwieldy, especially if x and y are themselves lists. By contrast, here is an alternate pattern that uses globals:

f <- function() {
...
x <<- ...
y <<- ...
}
# set x and y
x <- ...
y <- ...
f() # x and y are changed in here
# use new x and y
... <- x
... <- y

Arguably, this second version is much cleaner, being less cluttered and not requiring manipulation of lists. Cleaner code is usually easier to write, debug, and maintain.
It is for these reasons—avoiding clutter and simplifying the code—that we chose to use globals, rather than to return lists, in the DES code earlier in this chapter. Let’s explore that example further. We had two global variables (both lists, encapsulating various information): sim, associated with the library code, and mm1glbls, associated with our M/M/1 application code. Let’s consider sim first.
Even many programmers who have reservations about using globals agree that such variables may be justified if they are truly global, in the sense that they are used broadly in the program. This is the case for sim in our DES example. It is used both in the library code (in schedevnt(), getnextevnt(), and dosim()) and in in our M/M/1 application code (in mm1reactevnt()). The latter access to sim is on a read-only basis in this particular instance, but it could involve writes in some applications. A common example of such writes is when an event needs to be canceled. This might arise in modeling a “whichever comes first” situation; two events are scheduled, and when one of them occurs, the other must be canceled.
So, using sim as a global seems justified. Nevertheless, if we were bound and determined to avoid using globals, we could have placed sim as a local within dosim(). This function would then pass sim as an argument to all of the functions mentioned in the previous paragraph (schedevnt(), getnextevnt(), and so on), and each of these functions would return the modified sim. Line 94 for example, would change from this:

reactevnt(head)

to this:

sim <- reactevnt(head)

We would then need to add a line like the following to our applicationspecific function mm1reactevnt():

return(sim)

We could do something similar with mm1glbls, placing a variable called, say, appvars as a local within dosim(). However, if we did this with sim as well, we would need to place them together in a list so that both would be returned, as in our earlier example function f(). We would then have the lists-within-lists clutter described earlier—well, lists within lists within lists in this case. On the other hand, critics of the use of global variables counter that the simplicity of the code comes at a price. They worry that it may be difficult during debugging to track down locations at which a global variable changes value, since such a change could occur anywhere in the program. This seems to be less of a concern in view of our modern text editors and integrated development tools (the original article calling for avoiding use of globals was published in 1970!), which can be used to find all instances of a variable. However, it should be taken into consideration. Another concern raised by critics involves situations in which a function is called in several unrelated parts of the overall program using different values. For example, consider using our example f() function in different parts of our program, each call with its own values of x and y, rather than just a single value of each, as assumed earlier. This could be solved by setting up vectors of x and y values, with one element for each instance of f() in your program. You would lose some of the simplicity of using globals, though. The above issues apply generally, not just to R. However, for R there is an additional concern for globals at the top level, as the user will typically have lots of variables there. The danger is that code that uses globals may accidentally overwrite an unrelated variable with the same name. This can be avoided easily, of course, by simply choosing long, very application-specific names for globals in your code. But a compromise is also available in the form of environments, such as the following for the DES example above. Within dosim(), the line

sim <<- list()

would be replaced by

assign("simenv",new.env(),envir=.GlobalEnv)

This would create a new environment, pointed to by simenv at the top level. It would serve as a package in which to encapsulate our globals. We would access them via get() and assign(). For instance, the lines

if (is.null(sim$evnts)) {
sim$evnts <<- newevnt

in schedevnt() would become

if (is.null(get("evnts",envir=simenv))) {
assign("evnts",newevnt,envir=simenv)

Yes, this is cluttered too, but at least it is not complex like lists of lists of lists. And it does protect against unwittingly writing to an unrelated variable the user has at the top level. Using the superassignment operator still yields the least cluttered code, but this compromise is worth considering. As usual, there is no single style of programming that produces the best solution in all applications. The globals approach is another option to consider for your programming tool kit.

Closures

Recall that an R closure consists of a function’s arguments and body together with its environment at the time of the call. The fact that the environment is included is exploited in a type of programming that uses a feature also known (in a slight overloading of terminology) as a closure. A closure consists of a function that sets up a local variable and then creates another function that accesses that variable. This is a very abstract description, so let’s go right to an example.

counter<-function () {
  ctr <- 0
f <- function() {
ctr <<- ctr + 1
cat("this count currently has value",ctr,"\n")
}
return(f)
}

Let’s try this out before going into the internal details:

c1 <- counter()
c2 <- counter()
c1<-function() {
ctr <- ctr + 1
cat("this count currently has value",ctr,"\n")
}

c2<-function() {
ctr <<- ctr + 1
cat("this count currently has value",ctr,"\n")
}

c1()
c1()
c2()
c2()
c2()
c1()

Here, we called counter() twice, assigning the results to c1 and c2. As expected, those two variables will consist of functions, specifically copies of f(). However, f() accesses a variable ctr through the superassignment operator, and that variable will be the one of that name that is local to counter(), as it is the first one up the environment hierarchy. It is part of the environment of f() and, as such, is packaged in what is returned to the caller of counter(). The key point is that each time counter() is called, the variable ctr will be in a different environment (in the example, the environments were at memory addresses 0x8d445c0 and 0x8d447d4). In other words, different calls to counter() will produce physically different ctrs. The result, then, is that our functions c1() and c2() serve as completely independent counters, as seen in the example, where we invoke each of them a few times.

Recursion

Once a mathematics PhD student whom I knew to be quite bright, but who had little programming background, sought my advice on how to write a certain function. I quickly said, “You don’t even need to tell me what the function is supposed to do. The answer is to use recursion.” Startled, he asked what recursion is. I advised him to read about the famous Towers of Hanoi problem. Sure enough, he returned the next day, reporting that he was able to solve his problem in just a few lines of code, using recursion. Obviously, recursion can be a powerful tool. Well then, what is it? A recursive function calls itself. If you have not encountered this concept before, it may sound odd, but the idea is actually simple. In rough terms, the idea is this: To solve a problem of type X by writing a recursive function f(): 1. Break the original problem of type X into one or more smaller problems of type X. 2. Within f(), call f() on each of the smaller problems. 3. Within f(), piece together the results of (b) to solve the original problem.

A Quicksort Implementation

A classic example is Quicksort, an algorithm used to sort a vector of numbers from smallest to largest. For instance, suppose we wish to sort the vector (5,4,12,13,3,8,88). We first compare everything to the first element, 5, to form two subvectors: one consisting of the elements less than 5 and the other consisting of the elements greater than or equal to 5. That gives us subvectors (4,3) and (12,13,8,88). We then call the function on the subvectors, returning (3,4) and (8,12,13,88). We string those together with the 5, yielding (3,4,5,8,12,13,88), as desired. R’s vector-filtering capability and its c() function make implementation of Quicksort quite easy.

qs <- function(x) {
if (length(x) <= 1) return(x)
pivot <- x[1]
therest <- x[-1]
sv1 <- therest[therest < pivot]
sv2 <- therest[therest >= pivot]
sv1 <- qs(sv1)
sv2 <- qs(sv2)
return(c(sv1,pivot,sv2))
}

Note carefully the termination condition:

if (length(x) <= 1) return(x)

Without this, the function would keep calling itself repeatedly on empty vectors, executing forever. (Actually, the R interpreter would eventually refuse to go any further, but you get the idea.) Sounds like magic? Recursion certainly is an elegant way to solve many problems. But recursion has two potential drawbacks: - It’s fairly abstract. I knew that the graduate student, as a fine mathematician, would take to recursion like a fish to water, because recursion is really just the inverse of proof by mathematical induction. But many programmers find it tough. - Recursion is very lavish in its use of memory, which may be an issue in R if applied to large problems.

Replacement Functions

Recall the following example

x <- c(1,2,4)
names(x)

## NULL

names(x) <- c("a","b","ab")
names(x)

## [1] "a"  "b"  "ab"

##  a  b ab 
##  1  2  4

Consider one line in particular:

names(x) <- c("a","b","ab")

Looks totally innocuous, eh? Well, no. In fact, it’s outrageous! How on Earth can we possibly assign a value to the result of a function call? The resolution to this odd state of affairs lies in the R notion of replacement functions. The preceding line of R code actually is the result of executing the following:

x <- "names<-"(x,value=c("a","b","ab"))

No, this isn’t a typo. The call here is indeed to a function named names<-(). (We need to insert the quotation marks due to the special characters involved.)

What’s Considered a Replacement Function?

Any assignment statement in which the left side is not just an identifier (meaning a variable name) is considered a replacement function. When encountering this:

g(u) <- v

R will try to execute this:

u <- "g<-"(u,value=v)

Note the “try” in the preceding sentence. The statement will fail if you have not previously defined g<-(). Note that the replacement function has one more argument than the original function g(), a named argument value, for reasons explained in this section. In earlier chapters, you’ve seen this innocent-looking statement:

x[3] <- 8

The left side is not a variable name, so it must be a replacement function, and indeed it is, as follows. Subscripting operations are functions. The function “[”() is for reading vector elements, and “[<-”() is used to write. Here’s an example:

x <- c(8,88,5,12,13)
x

## [1]  8 88  5 12 13

x[3]

## [1] 5

"["(x,3)

## [1] 5

x <- "[<-"(x,2:3,value=99:100)
x

## [1]   8  99 100  12  13

Again, that complicated call in this line:

x <- "[<-"(x,2:3,value=99:100)

is simply performing what happens behind the scenes when we execute this:

x[2:3] <- 99:100

We can easily verify what’s occurring like so:

x <- c(8,88,5,12,13)
x[2:3] <- 99:100

Tools for Composing Function Code

If you are writing a short function that’s needed only temporarily, a quickand- dirty way to do this is to write it on the spot, right there in your interactive terminal session. Here’s an example:

g <- function(x) {
 return(x+1)
 }

This approach obviously is infeasible for longer, more complex functions. Now, let’s look at some better ways to compose R code.

Text Editors and Integrated Development Environments

You can use a text editor such as Vim, Emacs, or even Notepad, or an editor within an integrated development environment (IDE) to write your code in a file and then read it into R from the file. To do the latter, you can use R’s source() function. For instance, suppose we have functions f() and g() in a file xyz.R. In R, we give this command:

source("xyz.R")

This reads f() and g() into R as if we had typed them using the quick-anddirty way shown at the beginning of this section. If you don’t have much code, you can cut and paste from your editor window to your R window. Some general-purpose editors have special plug-ins available for R, such as ESS for Emacs and Vim-R for Vim. There are also IDEs for R, such as the commercial one by Revolution Analytics, and open source products such as StatET, JGR, Rcmdr, and RStudio.

The edit() Function

A nice implication of the fact that functions are objects is that you can edit functions from within R’s interactive mode. Most R programmers do their code editing with a text editor in a separate window, but for a small, quick change, the edit() function can be handy.

For instance, we could edit the function f1() by typing this:

f1 <- edit(f1)

This opens the default editor on the code for f1, which we could then edit and assign back to f1. Or, we might be interested in having a function f2() very similar to f1() and thus could execute the following:

f2 <- edit(f1)

This gives us a copy of f1() to start from. We would do a little editing and then save to f2(), as seen in the preceding command.

Writing Your Own Binary Operations

You can invent your own operations! Just write a function whose name begins and ends with %, with two arguments of a certain type, and a return value of that type. For example, here’s a binary operation that adds double the second operand to the first:

"%a2b%" <- function(a,b) return(a+2*b)
3 %a2b% 5

## [1] 13