Functions

In the last tutorial I introduced a function that packaged up the data loading steps. Earlier still I briefly mentioned functional programming as a way of avoiding explicit loops. We will continue by looking at both these ideas a bit more closely. The major advantage of a statistical language over a menu-driven point-and-click application is that a series of commands can be saved as a script. Scripts are just series of commands saved as text, which we can cut and paste back into the R terminal, load from a text file with the source() command, or run in RStudio by clicking the source button in the text editor. By convention such files are given the extension '.R'. Note that if you click on a '.R' file in the RStudio file browser it will be opened in the code editing window. As you learn and use R you should at the very least be editing your command history and saving the data analysis steps as '.R' scripts to reuse later, e.g. figures for a paper or presentation may need multiple revisions before the final submission.
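
As a minimal sketch of the source() route, the snippet below writes a two-line script to a temporary file and then runs it. The temporary file and the names heights and mean_height are made up for illustration; in practice you would save the script from the editor and pass its path to source().

```r
# Write a tiny script to a temporary '.R' file. This stands in for a
# file you would normally save yourself from the text editor.
script_file = tempfile(fileext = ".R")
writeLines(c("heights = c(1.62, 1.75, 1.80)",
             "mean_height = mean(heights)"),
           script_file)

# source() runs every command in the file, just as if you had typed them
source(script_file)

# the objects created by the script are now on the workspace
mean_height
```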

However, when you get a bit more confident you will want to start writing your own functions. Unlike a script, a function is adaptable, as it can take different arguments which change its behaviour. Functions are also the components that can be used to build larger, more complex analyses from simpler building blocks. Another feature of R is that many functions are object oriented and will behave according to the class of data that is input. Finally, functions, unlike scripts, are self-contained: whilst they may create many intermediate objects (vectors, lists, data.frames), only the final result will appear on your workspace at the end.
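
To illustrate the last point, here is a small sketch. The function name standardise and the objects inside it are invented for this example. The intermediate objects exist only while the function runs; only the returned value ends up on the workspace.

```r
# Hypothetical example function: centre and scale a numeric vector
standardise = function(v) {
    centred = v - mean(v)      # intermediate object, local to the function
    scaled = centred / sd(v)   # another intermediate object
    scaled                     # only this value is returned
}

z_scores = standardise(c(2, 4, 6, 8))
z_scores

# FALSE, provided you have not created your own object called 'centred'
exists("centred")
```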

Simple Functions

Here is a very simple function borrowed from the Quick-R website, which is a taster for the R in Action book. I have also added a lot of annotation to explain the code more fully. A new function is created using the function() command like so:

mysummary = function(x, npar=TRUE, print=TRUE) 
# function example - get measures of central tendency
# and spread for a numeric vector x. The user has a
# choice of measures and whether the results are printed.
  {
  # if (!npar) is equivalent to saying if(npar==FALSE)
  # essentially ! means not  
    if (!npar)
    # if (!npar) we evaluate the block of code within the braces or else...  
    {
      center = mean(x); spread = sd(x) 
    }
    # otherwise we skip to this block and evaluate it instead:
    # the median and median absolute deviation (mad) are
    # more robust estimates of centre and spread
    else 
    {
      center = median(x); spread = mad(x) 
    }
    # the cat statement will print to the screen even within a function
    # (&& is the usual 'and' for single TRUE/FALSE values inside an if)
    if (print && !npar) 
    {
      cat('Mean=', center, '\n', 'SD=', spread, '\n')
    }
    # the else if here is redundant, it could just be an if,
    # but it is perhaps more legible as it states the programmer's intent
    else if (print && npar) 
    {
      cat('Median=', center, '\n', 'MAD=', spread, '\n')
    }
    # a result can be as complex an object as you like
    result = list(center=center,spread=spread)
    # you don't actually have to use the return() function:
    # simply typing result on the last line would have worked too.
    # However, you need it if you want to return(a_result) before the last line
    return(result)
  }

x = rpois(500, 4) 
y = mysummary(x)
## Median= 4 
##  MAD= 1.483 
y = mysummary(x, npar=FALSE, print=FALSE)

The function calculates either the mean and standard deviation of a numeric vector or the more robust (outlier-resistant) median and median absolute deviation. The first thing to note here is the many if, else, and else if statements followed by {curly braces} that control the flow of the function depending upon the argument options selected. The second thing, which will appear curious to non-coders, is the indentation I have used after each control statement. This is a coding convention that makes code more legible. Quite simply, whenever there is a condition or a loop we indent the code and surround it with braces. Complex functions may have many, many loops and conditions within other loops or conditions, at which point this convention becomes essential.
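
The comments in the function mention that return() is only strictly needed when you want to leave a function before its last line. A minimal sketch of that case follows. The helper robustCentre is hypothetical and not part of the Quick-R example; it also shows how to pull the pieces back out of the returned list with $.

```r
# Hypothetical helper: returns early for an empty vector, otherwise a list
robustCentre = function(x) {
    if (length(x) == 0) {
        # return() is required here because we leave before the last line
        return(NA)
    }
    # last expression of the function, so no return() is needed
    list(center = median(x), spread = mad(x))
}

res = robustCentre(c(1, 2, 3, 4, 100))
res$center
## [1] 3
res$spread
## [1] 1.4826
robustCentre(numeric(0))
## [1] NA
```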

Exercises

  1. Write a function that takes two numeric vectors e.g. myFun(x,y) and either adds, subtracts, or multiplies the vectors element by element.
  2. Add a line to your function that will make it quit and return an error message if your vectors are either not numeric or of different lengths (Hint: help(stop)).
  3. Force your function to print out the result to the console…backwards with tabs between each number.

Scope

Another important point to understand is the scope of a function. Generally the variables and objects within a function are distinct from those on your workspace and may hold many intermediate values without clashing. The workspace variables are called global and the variables within the function are called local.


x = 1
y = 1

simpleFun = function(x, y) {
    # local y is altered
    y = y + 5
    return(x + y)
}
z = simpleFun(x, y)
x
## [1] 1
# but global y remains the same: the function makes a different copy of y
# and alters that, not the original y variable
y
## [1] 1
z
## [1] 7

anotherFun = function(x) {
    # there is no local y yet, so the y on the right is read from the global
    # workspace; the assignment then creates a new local y within the function
    y = y + 5
    return(x + y)
}
z = anotherFun(x)
x
## [1] 1
# so again it doesn't affect global y
y
## [1] 1
z
## [1] 7

# be careful of subtleties: this function behaves as expected, although some
# find it a bit odd
i = 10
j = 10
iFun = function() {
    for (i in 1:5) {
        i = i + 1
        j = i + j
    }
    return(paste("i =", i, "j =", j))
}

iFun()
## [1] "i = 6 j = 30"
i
## [1] 10
j
## [1] 10

# but a control-flow command alone in a script can alter the global
# variable.
i = 10
j = 10
for (i in 1:5) {
    i = i + 1
    j = i + j
}
i
## [1] 6
j
## [1] 30

People get very worked up about R function scope. Generally it's only a problem in esoteric cases, with complex programs, or with lazy use of variable names that match R constants or in-built functions (e.g. pi). However, as a rule try not to reference a global variable without first passing it to a function in the arguments, if only because this will make your functions easier to understand, and they will stand alone.
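
The rule of thumb above can be sketched as follows. Both function names and the threshold variable are invented for illustration. The first function silently depends on a global variable; the second takes the same value as an argument and so stands alone.

```r
threshold = 5   # a global variable

# Relies on the global 'threshold': works, but the dependency is hidden
countAboveGlobal = function(v) {
    sum(v > threshold)
}

# Takes the threshold as an argument: everything it needs is passed in
countAbove = function(v, threshold) {
    sum(v > threshold)
}

values = c(2, 4, 6, 8, 10)
countAboveGlobal(values)
## [1] 3
countAbove(values, threshold = 5)
## [1] 3

# changing the global silently changes the first function's answer
threshold = 9
countAboveGlobal(values)
## [1] 1
countAbove(values, threshold = 5)
## [1] 3
```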

Exercises

  1. Have a look at the following functions, which have functions embedded within them, but don't run them yet.
  2. Can you work out the results of testFunA(3) or testFunB(3) or testFunC(3)?
  3. Run the functions and check if you were correct. If not.. think about it a bit.

testFunA = function(num) {
    # NB subFun doesn't need braces as it is a single expression on 1 line
    subFun = function() num * num
    num * subFun()
}

# This function returns a function!!! The returned function is called a
# 'closure'. This example is adapted from Hadley Wickham. Closures are
# seldom necessary.
funretFun = function(num) {
    function(x) x^num
}

testFunB = funretFun(2)
testFunC = funretFun(3)

Object Orientation

Working with R's built-in functions you will note that many of them seem highly adaptable and intelligent. The plot() command seems to produce the right sort of graph for your data. This is due to object orientation. Quite simply, many of the in-built functions are written with different methods for different classes of object or variable. For instance, have a look at the summary command using the methods command:

methods(summary)
##  [1] summary.Date            summary.PDF_Dictionary*
##  [3] summary.PDF_Stream*     summary.POSIXct        
##  [5] summary.POSIXlt         summary.aov            
##  [7] summary.aovlist         summary.aspell*        
##  [9] summary.connection      summary.data.frame     
## [11] summary.default         summary.ecdf*          
## [13] summary.factor          summary.glm            
## [15] summary.infl            summary.lm             
## [17] summary.loess*          summary.manova         
## [19] summary.matrix          summary.mlm            
## [21] summary.nls*            summary.packageStatus* 
## [23] summary.ppr*            summary.prcomp*        
## [25] summary.princomp*       summary.srcfile        
## [27] summary.srcref          summary.stepfun        
## [29] summary.stl*            summary.table          
## [31] summary.tukeysmooth*   
## 
##    Non-visible functions are asterisked
summary
## function (object, ...) 
## UseMethod("summary")
## <bytecode: 0x103375e08>
## <environment: namespace:base>
# When you type a function without brackets into the console the code of
# the function is returned. Here we see that summary is just one line. The
# UseMethod command just takes the object, checks its class, and dispatches
# to the appropriate method e.g. .glm or .manova
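
One way to convince yourself of the dispatch described in the comments above is to call a method directly and compare it with the generic. This is only a sketch using two methods that are visible in base R (summary.factor and summary.default); the objects f and v are invented for the example.

```r
f = factor(c("a", "b", "a"))
class(f)
## [1] "factor"

# summary() sees the class "factor" and hands over to summary.factor()
identical(summary(f), summary.factor(f))
## [1] TRUE

# a plain numeric vector has no summary method of its own, so
# summary() falls back to summary.default()
v = c(1, 2, 3)
identical(summary(v), summary.default(v))
## [1] TRUE
```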

The summary command is not so much a single function but rather a family of related functions. As a beginner or intermediate R user you probably won't be making big programs or functions with lots of different methods for different data classes…yet. However after a short time you may write a function that returns a result with multiple pieces of data in a list. If you define this object with a class name then you can add your own summary(), plot(), or print() methods that know automatically what to do with your data - then save it all in a '.R' file so they can be sourced together. It's quite easy:

x = 1:10
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00 
class(x) = "myNewClass"

# this is really not a useful example
# (a method should accept the same arguments as its generic, here object and ...)
summary.myNewClass = function(object, ...) {
    qq = quantile(object^2)
    qq = signif(c(qq[1L:3L], mean(object^2), qq[4L:5L]), 3)
    names(qq) = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
    return(qq)
}

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    10.8    30.5    38.5    60.2   100.0 
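
The same mechanism covers the case mentioned earlier, where a function returns several pieces of data in a list. The sketch below is purely illustrative: the class name spreadResult and its fields are made up. The constructor tags the list with a class, and a print method then decides how it is shown.

```r
# Hypothetical constructor: returns a list tagged with the class "spreadResult"
spreadResult = function(v) {
    out = list(center = median(v), spread = mad(v), n = length(v))
    class(out) = "spreadResult"
    out
}

# print method: takes the same arguments as the print() generic, (x, ...)
print.spreadResult = function(x, ...) {
    cat("n =", x$n, " median =", x$center, " MAD =", x$spread, "\n")
    # print methods conventionally return their argument invisibly
    invisible(x)
}

r = spreadResult(c(1, 2, 3, 4, 100))
# typing the object's name auto-prints it via print.spreadResult
r
```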

As I said, it's quite easy. For no good reason my function summarises a myNewClass object by returning the quantiles and mean of its squared values. Indeed many of the R core developers decided that this type of object orientation, called S3, was too easy or at least far too informal. For further developing the R language itself they needed something more difficult and cumbersome, so they invented a painfully rigorous system called S4. We will give that a miss for now.

Exercises

  1. Create your own class of data: like a numeric vector - but different! Call it yourNewClass.
  2. It needs to print backwards so write a new print extension method that does this.
  3. You must have a plot method for your data. Think of something daft/weird and implement it.
  4. Whilst S3 is easy, can you see why it might be a bit chaotic for development of the R language itself?

That is quite enough on functions alone for now; anything else we can deal with when we come to it. The real power of the R language (IMHO) lies in the way functions and data interact. We'll deal with that a bit later; next we will finally have a look at the plotting functions.