It's sometimes said that R is a hard language to learn. One reason is perhaps that it is based on statistical modelling, which is an advanced topic even before we try to learn the language, although this is balanced by R's extensive and integrated help system and documentation. That aside, there are stylistic differences in the language itself that will throw new users or those coming from other programming languages. The majority of first-time users of R have probably stepped up to it from Excel looking for something more powerful. Amongst scientists spreadsheets have a bad name - but really the column- and row-based formulae and calculations are a form of genius. It's actually difficult for me to imagine what science was like before computers and spreadsheets! Computer scientists may be familiar with procedural languages such as C, C++ or Java, but they will find the procedural style of those languages quite different from R - indeed if they simply transcribe their C code into R they will find it runs very slowly. R is a functional language that draws on earlier languages such as Lisp and Scheme.
What is procedural? What is functional? Here's a comparison inspired by and adapted from the Bioinformatics Zen blog.
# load the Mutagen example data from the QSARdata package
library(QSARdata)
data(Mutagen)
# keep the outcome labels as a simple character vector
mutagenVec = as.vector(Mutagen_Outcome)
rm(Mutagen_Outcome)
## Warning: object 'Mutagen_Outcome' not found
# and keep the descriptor columns as a data frame
mutagenFrame = Mutagen_Dragon
rm(Mutagen_Dragon)
## Warning: object 'Mutagen_Dragon' not found
mutagenVec[1:4]
## [1] "mutagen" "mutagen" "nonmutagen" "nonmutagen"
# this is procedural-style code that steps through each value of
# `mutagenVec` and recodes it to 1 for mutagens or 0 for non-mutagens
procRecodeFun = function(mutagenVec) {
    binary.mutagenVec = vector()
    for (i in 1:length(mutagenVec)) {
        if (mutagenVec[i] == "mutagen")
            binary.mutagenVec = c(binary.mutagenVec, 1)
        if (mutagenVec[i] == "nonmutagen")
            binary.mutagenVec = c(binary.mutagenVec, 0)
    }
    return(binary.mutagenVec)
}
# whereas this is a `functional` style function
funRecodeFun = function(mutagenVec) {
    mutagenVec[mutagenVec == "mutagen"] = 1
    mutagenVec[mutagenVec == "nonmutagen"] = 0
    return(mutagenVec)
}
# I wrapped these in functions just so we could easily time them using the
# useful system.time command. They would have been simpler just as short
# scripts
system.time((mutagenVec1 = procRecodeFun(mutagenVec)))
## user system elapsed
## 0.187 0.019 0.206
system.time((mutagenVec2 = funRecodeFun(mutagenVec)))
## user system elapsed
## 0 0 0
If your computer is anything like mine you should find that both functions appear to work instantaneously, but the system.time command shows that the functional version is roughly 70 times faster. Here it doesn't matter so much, but for larger datasets and complex analyses it can make a BIG difference. So what is the difference?
The functional version selects elements with a logical test - mutagenVec[mutagenVec == "mutagen"] - and this logical indexing is a common feature of functional languages. This style is preferable (quicker to write, quicker to run) to using if, else, or ifelse statements and jumbled curly braces ({}). The procedural version is also slow because every time c() extends binary.mutagenVec a new copy of the whole vector is made. You might have made it a bit faster by pre-allocating binary.mutagenVec at the correct length and inserting the data.
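For example, a pre-allocated version of the procedural recode might look something like this (a sketch only; the name procRecodeFun2 is just for illustration):

# sketch: the output vector is created at its full length up front,
# so it is never copied as it grows inside the loop
procRecodeFun2 = function(mutagenVec) {
    binary.mutagenVec = numeric(length(mutagenVec))
    for (i in 1:length(mutagenVec)) {
        if (mutagenVec[i] == "mutagen")
            binary.mutagenVec[i] = 1
        if (mutagenVec[i] == "nonmutagen")
            binary.mutagenVec[i] = 0
    }
    return(binary.mutagenVec)
}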
Exercise: write functions that take a vector x and return (a) the elements of x that are even, (b) x but with all even values replaced by NA, and (c) the elements of x divisible by either 3 or 7. Look up the ifelse command. Can you further improve funRecodeFun?

Actually the majority of base R commands are already partly functional in that they are vectorised. Commands like mean, sum, or diff really only make sense vectorised as they don't really work on single values. However commands like any, is.numeric, or sqrt make equal sense on single elements or whole vectors. This may seem obvious to you from the very first tutorial. In many languages, however, similar commands are scalar, meaning you would have to loop through vectors as in procRecodeFun, applying them to each element. This is why some really advanced C programmers sometimes write painfully naive R code.
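For instance, the same commands accept a single value or a whole vector without any loop:

# vectorised by default: no loop needed
sqrt(16)
sqrt(c(1, 4, 9, 16))
# any collapses a whole logical vector down to a single TRUE/FALSE
any(c(1, 4, 9, 16) > 10)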
Of course there are many scalar commands in R and you may want to apply them to each element of a vector (or list) piecewise. There are two useful R commands to vectorise scalar functions on simple vectors. The first, sapply, makes sense if you want the scalar function to operate on each element (e.g. like any or as.numeric).
# list the files in the current working directory
dir()
## [1] "cache" "tut7.Rmd" "tut7.html" "tut7.md"
# keep the three tut7 files and look at the first few lines of one of them
files = dir()[2:4]
readLines(files[1], n = 3)
## [1] "<FONT style=\"font-size:14px\">"
## [2] "#### Functional Programming"
## [3] "It's sometimes said that R is a hard language to learn. One reason is perhaps that it is based on statistical modelling and this is an advanced topic even before we try to learn the languge, alhough this is balanced by the extensive and integrated help system and documentation with R. That aside there are stylistic differences to the language itself that will throw new users or those coming from other programming languages. The majority of first time users of R have probably stepped up to it from Excel looking for something more powerful. Amongst scientists spreadsheets have a bad name- but really the column and row based formula and calculations are a form of genius. It's actually difficult for me to imagine what science was like before computers and spreadsheets! Computer scientists too maybe familiar with `procedural languages` such as C, C++ or Java but they will find the `procedural` style of these languages quite different from R - indeed if they try and simply transcribe their C code into R they will find it operates very slowly. R is a **functional language** that draws on earlier languages such as Lisp and Scheme."
# readLines is a scalar function: it accepts a single file name only
readLines(files, n = 3)
## Error: invalid 'description' argument
# we can use sapply to vectorise the readLines function. Note that the n=3
# argument is passed via sapply, NOT given directly as readLines(n=3)
sapply(files, readLines, n = 3)
## tut7.Rmd
## [1,] "<FONT style=\"font-size:14px\">"
## [2,] "#### Functional Programming"
## [3,] "It's sometimes said that R is a hard language to learn. One reason is perhaps that it is based on statistical modelling and this is an advanced topic even before we try to learn the languge, alhough this is balanced by the extensive and integrated help system and documentation with R. That aside there are stylistic differences to the language itself that will throw new users or those coming from other programming languages. The majority of first time users of R have probably stepped up to it from Excel looking for something more powerful. Amongst scientists spreadsheets have a bad name- but really the column and row based formula and calculations are a form of genius. It's actually difficult for me to imagine what science was like before computers and spreadsheets! Computer scientists too maybe familiar with `procedural languages` such as C, C++ or Java but they will find the `procedural` style of these languages quite different from R - indeed if they try and simply transcribe their C code into R they will find it operates very slowly. R is a **functional language** that draws on earlier languages such as Lisp and Scheme."
## tut7.html
## [1,] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\""
## [2,] "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
## [3,] "<!-- saved from url=(0014)about:internet -->"
## tut7.md
## [1,] "<FONT style=\"font-size:14px\">"
## [2,] "#### Functional Programming"
## [3,] "It's sometimes said that R is a hard language to learn. One reason is perhaps that it is based on statistical modelling and this is an advanced topic even before we try to learn the languge, alhough this is balanced by the extensive and integrated help system and documentation with R. That aside there are stylistic differences to the language itself that will throw new users or those coming from other programming languages. The majority of first time users of R have probably stepped up to it from Excel looking for something more powerful. Amongst scientists spreadsheets have a bad name- but really the column and row based formula and calculations are a form of genius. It's actually difficult for me to imagine what science was like before computers and spreadsheets! Computer scientists too maybe familiar with `procedural languages` such as C, C++ or Java but they will find the `procedural` style of these languages quite different from R - indeed if they try and simply transcribe their C code into R they will find it operates very slowly. R is a **functional language** that draws on earlier languages such as Lisp and Scheme."
The second command, Reduce, works when you want a function to combine all the elements into a single result (e.g. like sum or mean do).
a = 1:20
b = 5:25
c = 12:27
# the intersect is the overlap in elements of a and b
intersect(a, b)
## [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# but it only works on two inputs, not like this
intersect(a, b, c)
## Error: unused argument(s) (c)
# however we can vectorise it like so:
Reduce(intersect, list(a, b, c))
## [1] 12 13 14 15 16 17 18 19 20
# which is the equivalent of
intersect(intersect(a, b), c)
## [1] 12 13 14 15 16 17 18 19 20
A slightly different form of vectorisation is shown by the R arithmetic operators (+, -, *, /, etc.). These will add, subtract, multiply or divide two vectors in a piecewise (element-by-element) fashion, i.e.
x = 1:5
x * x
## [1] 1 4 9 16 25
Some R commands, paste for example, already work this way, but most do not. We can however use the mapply command to make them behave piecewise.
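For instance, paste recycles its shorter argument piecewise against the longer one:

# paste is already piecewise vectorised: the single string "sample" is
# recycled against each element of 1:3
paste("sample", 1:3)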
# I want to draw samples of sizes 5, 6, ..., 10 from 1:30. Procedurally I
# could build the results up in a list inside a loop
s = list()
for (i in 5:10) {
    s[[length(s) + 1]] = sample(30, i)
}
# or I can make it piecewise vectorised using `mapply`. NB as it is
# piecewise, the single element 30 is effectively recycled for each of the
# six sample sizes in 5:10
s = mapply(sample, 30, 5:10)
Note that mapply has returned your result as a list. This is useful as the samples are of different lengths, so they wouldn't have fit well in a pre-allocated matrix for instance. Plus there are special functional commands for applying functions to each element of a list! Let's say we want to calculate the standard deviation (sd) of each sample we took.
# lapply stands for list apply
lapply(s, sd)
## [[1]]
## [1] 5.404
##
## [[2]]
## [1] 10.91
##
## [[3]]
## [1] 8.255
##
## [[4]]
## [1] 9.18
##
## [[5]]
## [1] 7.858
##
## [[6]]
## [1] 9.589
##
# NB sapply also works on lists but will return the result in a simpler
# vector format
sapply(s, sd)
## [1] 5.404 10.913 8.255 9.180 7.858 9.589
Finally there are functions that you may wish to apply to groups of values within a vector, with the groups defined by a separate factor. For this we can use tapply.
# take the first descriptor column and summarise it separately for each outcome
pvar1 = mutagenFrame[, 1]
tapply(pvar1, mutagenVec, quantile)
## $mutagen
## 0% 25% 50% 75% 100%
## 30.03 177.23 230.27 281.33 1584.41
##
## $nonmutagen
## 0% 25% 50% 75% 100%
## 31.07 142.22 201.62 295.81 1550.25
##
# this is an alternative to something like
quantile(pvar1[mutagenVec == "mutagen"])
## 0% 25% 50% 75% 100%
## 30.03 177.23 230.27 281.33 1584.41
quantile(pvar1[mutagenVec == "nonmutagen"])
## 0% 25% 50% 75% 100%
## 31.07 142.22 201.62 295.81 1550.25
These are the first, and simpler, members of the *apply family of functional commands. Once you become familiar with them, code becomes simple and life becomes easy.
Exercise: can you use Reduce to find the intersect and union of the samples in list s above? Can you use the rev command to reverse the whole list s, or to reverse each sample within s? (One possible sketch is shown after the next paragraph.)

I said earlier that much of the power of R over other programming languages comes from the use of matrices and data.frames. Functional programming is even more powerful for dealing with such data, as it simplifies much of the column-wise and row-wise operations which are the meat and potatoes of data analysis.
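Here is one possible sketch for the Reduce and rev exercise above (just one way of doing it, using the list s of samples from earlier):

# values common to every sample, and values appearing in at least one sample
Reduce(intersect, s)
Reduce(union, s)
# rev on the list reverses the order of the samples; lapply(s, rev)
# reverses the values within each sample instead
rev(s)
lapply(s, rev)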
We'll quickly have a look at the basic apply function for matrices and then move on to a newer package, plyr, that attempts to standardise the behaviour of the base R apply commands. For both we will use monthly ozone averages on a very coarse 24 by 24 grid covering Central America, from Jan 1995 to Dec 2000. The data is stored in a 3d array with the first two dimensions representing latitude and longitude, and the third representing time.
# plyr provides the ozone data and its apply-style functions; MASS provides rlm
library(plyr)
library(MASS)
data(ozone)
dim(ozone)
## [1] 24 24 72
dimnames(ozone)
## $lat
## [1] "-21.2" "-18.7" "-16.2" "-13.7" "-11.2" "-8.7" "-6.2" "-3.7"
## [9] "-1.2" "1.3" "3.8" "6.3" "8.7" "11.2" "13.7" "16.2"
## [17] "18.7" "21.2" "23.7" "26.2" "28.7" "31.2" "33.7" "36.2"
##
## $long
## [1] "-113.8" "-111.3" "-108.8" "-106.3" "-103.8" "-101.3" "-98.8"
## [8] "-96.3" "-93.8" "-91.3" "-88.8" "-86.3" "-83.7" "-81.2"
## [15] "-78.7" "-76.2" "-73.7" "-71.2" "-68.7" "-66.2" "-63.7"
## [22] "-61.2" "-58.7" "-56.2"
##
## $time
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
## [57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
## [71] "71" "72"
##
# an ordered factor giving the month (1-12) for each of the 72 time points
month <- ordered(rep(1:12, length = 72))
# a corner of the grid for the first month
ozone[1:4, 1:4, 1]
## long
## lat -113.8 -111.3 -108.8 -106.3
## -21.2 260 260 260 258
## -18.7 258 258 258 254
## -16.2 258 254 254 252
## -13.7 254 256 254 250
# the full 72-month time series for a single grid cell
o = ozone[1, 1, ]
o
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 260 254 254 244 250 256 268 276 274 282 288 270 262 262 258 254 264 270
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 280 280 278 288 280 280 262 262 256 254 250 258 264 258 270 280 274 272
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 266 244 252 242 258 268 284 280 280 288 288 272 266 262 254 254 260 264
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 272 270 276 286 284 278 272 262 256 260 264 274 276 288 284 286 284 268
# we can deseasonalise this by fitting a robust linear model and
# extracting the residuals
par(mfrow = c(2, 1))
plot(o, type = "l", ylim = range(o))
plot(residuals(rlm(o ~ month - 1)), type = "l", ylim = range(o) - mean(o))
The rlm function seems to deseasonalise the data quite well. If we want to use it for all points on the grid then the obvious solution would be to loop through them using an index. However, either the base apply or the newer aaply function will do this concisely.
# the apply or aaply trick is then to do this for every spot on the grid
rlmFun = function(o) residuals(rlm(o ~ month - 1))
system.time((ozone.deseas.ap = apply(ozone, 1:2, rlmFun)))
## user system elapsed
## 3.744 0.042 3.798
system.time((ozone.deseas.aa = aaply(ozone, 1:2, rlmFun)))
## user system elapsed
## 4.023 0.042 4.114
dim(ozone.deseas.ap)
## [1] 72 24 24
dim(ozone.deseas.aa)
## [1] 24 24 72
Both apply and aaply take a similar length of time, but apply has rotated the data! This is actually a quirk of many of the base apply-like commands (sapply, lapply, etc.): the output format can be somewhat trial and error. With the plyr commands a simple naming convention dictates what input they accept (a*ply for arrays as here, l*ply for list input, d*ply for data.frames) and the output format (*aply for array output as here, *lply for lists, and you get the idea).
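You can see the rotation on something small. This is just an illustrative sketch on a tiny matrix, not part of the ozone analysis:

m = matrix(1:6, nrow = 2)
# base apply returns each function result as a column, so a 2 x 3 input
# comes back as 3 x 2
apply(m, 1, function(r) r * 10)
# aaply keeps the dimension you split over first, so rows stay rows (2 x 3)
aaply(m, 1, function(r) r * 10)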
Of course if you select a daft output then you may find the results difficult to interpret…
system.time((ozone.deseas.al = alply(ozone, 1:2, rlmFun)))
## user system elapsed
## 3.919 0.042 3.963
length(ozone.deseas.al)
## [1] 576
head(ozone.deseas.al, n = 3)
## $`1`
## 1 2 3 4 5 6 7 8
## -4.3964 -5.2036 -1.0000 -8.0052 -8.5089 -9.3387 -6.0000 -0.6724
## 9 10 11 12 13 14 15 16
## -3.0000 -3.0000 4.3964 -3.1964 -2.3964 2.7964 3.0000 1.9948
## 17 18 19 20 21 22 23 24
## 5.4911 4.6613 6.0000 3.3276 1.0000 3.0000 -3.6036 6.8036
## 25 26 27 28 29 30 31 32
## -2.3964 2.7964 1.0000 1.9948 -8.5089 -7.3387 -10.0000 -18.6724
## 33 34 35 36 37 38 39 40
## -7.0000 -5.0000 -9.6036 -1.1964 1.6036 -15.2036 -3.0000 -10.0052
## 41 42 43 44 45 46 47 48
## -0.5089 2.6613 10.0000 3.3276 3.0000 3.0000 4.3964 -1.1964
## 49 50 51 52 53 54 55 56
## 1.6036 2.7964 -1.0000 1.9948 1.4911 -1.3387 -2.0000 -6.6724
## 57 58 59 60 61 62 63 64
## -1.0000 1.0000 0.3964 4.8036 7.6036 2.7964 1.0000 7.9948
## 65 66 67 68 69 70 71 72
## 5.4911 8.6613 2.0000 11.3276 7.0000 1.0000 0.3964 -5.1964
##
## $`2`
## 1 2 3 4 5 6 7 8
## -3.3284 -2.2716 -0.6667 -7.1185 -10.0000 -5.3284 -4.5000 -1.0000
## 9 10 11 12 13 14 15 16
## -4.6716 -5.0000 4.1284 -2.8816 -1.3284 1.7284 1.3333 0.8815
## 17 18 19 20 21 22 23 24
## 2.0000 2.6716 3.5000 3.0000 1.3284 3.0000 -3.8716 7.1184
## 25 26 27 28 29 30 31 32
## 0.6716 3.7284 1.3333 0.8815 -4.0000 -3.3284 -12.5000 -15.0000
## 33 34 35 36 37 38 39 40
## -6.6716 -5.0000 -9.8716 -6.8816 -5.3284 -14.2716 -4.6667 -11.1185
## 41 42 43 44 45 46 47 48
## -2.0000 -1.3284 7.5000 3.0000 3.3284 1.0000 4.1284 -0.8816
## 49 50 51 52 53 54 55 56
## 2.6716 1.7284 -0.6667 4.8815 4.0000 0.6716 -2.5000 -5.0000
## 57 58 59 60 61 62 63 64
## 1.3284 3.0000 2.1284 9.1184 8.6716 1.7284 3.3333 6.8815
## 65 66 67 68 69 70 71 72
## 8.0000 8.6716 3.5000 9.0000 5.3284 3.0000 0.1284 -2.8816
##
## $`3`
## 1 2 3 4 5 6
## -2.964e+00 -4.964e+00 3.333e-01 -4.000e+00 -1.250e+01 -5.836e+00
## 7 8 9 10 11 12
## -3.764e+00 -5.643e-01 -5.000e+00 -5.000e+00 2.236e+00 -4.998e-01
## 13 14 15 16 17 18
## -9.643e-01 1.036e+00 3.333e-01 -2.000e+00 1.500e+00 -1.836e+00
## 19 20 21 22 23 24
## 2.357e-01 1.436e+00 1.000e+00 3.000e+00 -3.764e+00 1.500e+00
## 25 26 27 28 29 30
## 1.036e+00 5.036e+00 3.333e-01 -5.684e-14 -4.500e+00 -1.836e+00
## 31 32 33 34 35 36
## -7.764e+00 -1.656e+01 -7.000e+00 -7.000e+00 -7.764e+00 -8.500e+00
## 37 38 39 40 41 42
## -1.096e+01 -1.096e+01 -5.667e+00 -8.000e+00 -2.500e+00 2.164e+00
## 43 44 45 46 47 48
## 6.236e+00 3.436e+00 3.000e+00 1.000e+00 4.236e+00 -4.998e-01
## 49 50 51 52 53 54
## 3.036e+00 1.036e+00 3.333e-01 6.000e+00 5.500e+00 1.643e-01
## 55 56 57 58 59 60
## 2.357e-01 -2.564e+00 3.000e+00 5.000e+00 4.236e+00 7.500e+00
## 61 62 63 64 65 66
## 7.036e+00 5.036e+00 4.333e+00 8.000e+00 7.500e+00 1.216e+01
## 67 68 69 70 71 72
## 4.236e+00 5.436e+00 5.000e+00 3.000e+00 2.357e-01 -4.998e-01
##