Source file ⇒ lec18.Rmd
Last time we learned about lapply() and sapply(). These functions allow us to avoid writing explicit loops over lists, data frames and vectors.
example:
mylist <- list(1,2,3)
myfunc <- function(x,y,z) x+y+z
mylist %>% sapply(myfunc,10,20)
## [1] 31 32 33
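To make the argument matching explicit, here is a minimal sketch of the same call written without the pipe, so it runs in base R. Each list element is passed as x, and the extra arguments are forwarded to y and z. The names res_positional, res_named and res_list are only illustrative.

```r
mylist <- list(1, 2, 3)
myfunc <- function(x, y, z) x + y + z

# Each element of mylist becomes x; 10 and 20 are forwarded as y and z.
res_positional <- sapply(mylist, myfunc, 10, 20)

# Naming the extra arguments gives the same result and is easier to read.
res_named <- sapply(mylist, myfunc, y = 10, z = 20)

# lapply() makes the same calls but always returns a list.
res_list <- lapply(mylist, myfunc, y = 10, z = 20)

print(res_positional)
print(identical(res_positional, res_named))
print(is.list(res_list))
```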
example:
Here is a function that converts inches to millimeters or vice versa:
convert <- function(vals, toMM = TRUE) {
  if (toMM) {
    vals <- vals * 25.4   # inches to millimeters
  } else {
    vals <- vals / 25.4   # millimeters to inches
  }
  return(vals)
}
1:5 %>% convert(TRUE)
## [1] 25.4 50.8 76.2 101.6 127.0
Convert the following data frame from mm to inches
df <- data.frame(a=c(1,2,3),b=c(1,2,3))
Each of the columns of df is converted into a vector of length 3.
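One way to carry out this conversion with sapply() is sketched below. It repeats a compact version of convert() so the block runs on its own, and the result name inches is just an illustrative choice.

```r
convert <- function(vals, toMM = TRUE) {
  if (toMM) vals * 25.4 else vals / 25.4
}
df <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3))

# A data frame is a list of columns, so sapply() calls convert() once per column.
inches <- sapply(df, convert, toMM = FALSE)

print(inches)        # a numeric matrix with one column per column of df
print(dim(inches))
```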
We can improve the performance of sapply() by using vapply() instead. vapply() has an extra, non-optional argument, FUN.VALUE, which is a template giving the type and length of the value that FUN returns for each element.
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
df %>% vapply(convert,toMM=FALSE,numeric(3))
a | b |
---|---|
0.0393701 | 0.0393701 |
0.0787402 | 0.0787402 |
0.1181102 | 0.1181102 |
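Besides speed, FUN.VALUE also acts as a check on the output. The sketch below, which repeats a compact convert() so it is self-contained, shows vapply() stopping with an error when the template does not match what the function returns. The tryCatch() wrapper is only there so the example keeps running.

```r
convert <- function(vals, toMM = TRUE) {
  if (toMM) vals * 25.4 else vals / 25.4
}
df <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3))

# Correct template: each column yields a numeric vector of length 3.
ok <- vapply(df, convert, FUN.VALUE = numeric(3), toMM = FALSE)

# Wrong template: numeric(1) promises one value per column, so vapply() stops.
bad <- tryCatch(
  vapply(df, convert, FUN.VALUE = numeric(1), toMM = FALSE),
  error = function(e) "error"
)

print(dim(ok))
print(bad)
```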
The first rule of efficient programming in R is to make use of vectorized calculations whenever possible. If you can't vectorize your calculation, try using sapply() before writing a for loop.
Let's compare the speed of squaring a large vector in a number of different ways.
You can check how much time it takes to evaluate any expression by wrapping it in system.time(). Units are in seconds.
system.time(normal.samples <- rnorm(1000000))
## user system elapsed
## 0.070 0.004 0.077
The elapsed time is the wall clock time. The user time is the CPU time spent performing the R process (in this case calling rnorm() and creating a vector of length 1000000). The system time is CPU time spent by the operating system on behalf of the process (for example opening and closing files, doing input and output, and looking at the system clock).
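The value returned by system.time() is a named numeric vector, so individual timings can be pulled out and stored. A small sketch follows; the variable names are illustrative and the actual numbers will differ from machine to machine.

```r
# Time one call that draws a million standard normal samples.
timing <- system.time(normal.samples <- rnorm(1000000))

# Extract single entries by name.
elapsed_seconds <- timing[["elapsed"]]
cpu_seconds <- timing[["user.self"]] + timing[["sys.self"]]

print(names(timing))
print(elapsed_seconds)
print(cpu_seconds)
```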
slow
n <- 10000
system.time({
vals <- c()              # start empty, so the vector has to grow
for(i in 1:n)
vals <- c(vals, i^2)     # every append copies vals into a new block
})
## user system elapsed
## 0.009 0.000 0.010
head(vals)
## [1] 1 4 9 16 25 36
As the vector vals grows, R has to find 10,000 increasingly large chunks of memory and copy the existing entries into each one. This is time consuming. Let's look at the addresses of the first few of those chunks of memory.
library(pryr)
vals <- c()
n <- 10000
for(i in 1:n) vals <- c(vals, address(vals))
head(vals)
## [1] "0x7fa5eb807378" "0x7fa5f27ab5e8" "0x7fa5f59069f8" "0x7fa5f4384518"
## [5] "0x7fa5f4384560" "0x7fa5f5252070"
To make the loop faster, you should pre-allocate memory so that the vector entries are stored in one fixed, large chunk of memory.
n <- 10000
system.time({
vals <- rep(0, n)
for(i in 1:n)
vals[i] <- i^2
})
## user system elapsed
## 0.009 0.000 0.011
head(vals)
## [1] 1 4 9 16 25 36
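The two loop styles can be compared side by side. The sketch below wraps each in a helper function; squares_grow() and squares_prealloc() are hypothetical names introduced here, not part of the lecture code. Timings depend on the machine and the R version, so none are shown.

```r
# Hypothetical helper: grows the vector by appending inside the loop.
squares_grow <- function(n) {
  vals <- c()
  for (i in seq_len(n)) vals <- c(vals, i^2)   # copies vals on every append
  vals
}

# Hypothetical helper: allocates the full vector once, then fills it.
squares_prealloc <- function(n) {
  vals <- numeric(n)                           # all n slots allocated up front
  for (i in seq_len(n)) vals[i] <- i^2         # fill in place
  vals
}

n <- 10000
t_grow <- system.time(a <- squares_grow(n))
t_pre  <- system.time(b <- squares_prealloc(n))

print(identical(a, b))   # both loops compute the same values
print(rbind(grow = t_grow, prealloc = t_pre)[, 1:3])
```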
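With pre-allocation the memory block stays fixed. (The first address differs from the rest because storing a character string in the numeric vector vals converts it to character, which copies it once.)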
n <- 10000
vals <- rep(0, n)
for(i in 1:n)
vals[i] <- address(vals)
head(vals)
## [1] "0x7fa5f44a6c00" "0x7fa5f42e6600" "0x7fa5f42e6600" "0x7fa5f42e6600"
## [5] "0x7fa5f42e6600" "0x7fa5f42e6600"
sapply() can be faster still, since the looping itself is done in C rather than in R code.
n <- 10000
system.time(
vals <- 1:n %>% sapply(function(x) x^2)
)
## user system elapsed
## 0.007 0.001 0.009
vapply() is possibly a little faster than sapply(), since you tell the computer ahead of time what type and size each result is going to be.
n <- 10000
system.time(
vals <- 1:n %>% vapply(function(x) x^2, numeric(1))
)
## user system elapsed
## 0.007 0.000 0.007
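Declaring the output ahead of time also makes the return type predictable. As a small illustration, using an empty input and a hypothetical helper named square(), sapply() falls back to an empty list while vapply() still returns a numeric vector.

```r
square <- function(x) x^2
empty <- integer(0)

s <- sapply(empty, square)               # nothing to simplify: an empty list
v <- vapply(empty, square, numeric(1))   # still numeric, of length 0

print(class(s))
print(class(v))
```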
If you can vectorize your calculation, that is the fastest approach of all.
n <- 10000
system.time(
vals <- (1:n)^2
)
## user system elapsed
## 0 0 0
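With n = 10000 the timings are too small to separate reliably. The sketch below reruns the four approaches on a larger input (n = 100000 is an arbitrary choice) and checks that they agree before printing the timings. Results vary by machine, so no output is shown.

```r
n <- 100000
x <- seq_len(n)

# Pre-allocated for loop.
t_loop <- system.time({
  loop_vals <- numeric(n)
  for (i in x) loop_vals[i] <- i^2
})
t_sapply <- system.time(sapply_vals <- sapply(x, function(v) v^2))
t_vapply <- system.time(vapply_vals <- vapply(x, function(v) v^2, numeric(1)))
t_vector <- system.time(vector_vals <- x^2)

# All four approaches should agree before their timings are compared.
stopifnot(identical(loop_vals, vector_vals),
          identical(sapply_vals, vector_vals),
          identical(vapply_vals, vector_vals))

print(rbind(loop = t_loop, sapply = t_sapply,
            vapply = t_vapply, vectorized = t_vector)[, 1:3])
```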
In hw #6 I hope that you did this:
n <- 500
x <- runif(n,-1,1)
y <- runif(n, -1,1)
z <- x^2 +y^2<=1
head(z)
## [1] FALSE FALSE TRUE TRUE TRUE TRUE
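For comparison, here is a sketch of the element-by-element loop that the vectorized expression replaces, with a check that both give the same answer. The seed is arbitrary and only makes the run reproducible; z_loop and z_vec are illustrative names.

```r
set.seed(1)   # arbitrary seed, only so the run is reproducible
n <- 500
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)

# Loop version: test one point at a time.
z_loop <- logical(n)
for (i in seq_len(n)) z_loop[i] <- x[i]^2 + y[i]^2 <= 1

# Vectorized version: test all points in a single expression.
z_vec <- x^2 + y^2 <= 1

print(identical(z_loop, z_vec))
print(mean(z_vec))   # proportion of points inside the unit circle
```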
x <- c(NA,2,3,NA)
!is.na(x)
## [1] FALSE TRUE TRUE FALSE
x[!is.na(x)]
## [1] 2 3
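The same logical mask is useful for counting and summarising. A short sketch, with illustrative variable names, shows that subsetting by hand gives the same mean as the na.rm argument.

```r
x <- c(NA, 2, 3, NA)

keep <- !is.na(x)            # TRUE where a value is present
n_missing <- sum(is.na(x))   # logicals count as 0/1, so this counts the NAs

mean_subset <- mean(x[keep])
mean_narm <- mean(x, na.rm = TRUE)

print(n_missing)
print(c(mean_subset, mean_narm))
```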
The data verb ‘rank()’ replaces each number in a vector with its rank with respect to other numbers in the vector. The smallest number has the rank 1.
example:
c(20:11) %>% rank()
## [1] 10 9 8 7 6 5 4 3 2 1
desc() %>% rank() reverses the direction of the ranking, so the largest number in the vector has rank 1.
c(20:11) %>% desc() %>% rank()
## [1] 1 2 3 4 5 6 7 8 9 10
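The vectors above contain no repeated values. When there are ties, rank() by default gives tied entries the average of their ranks. The sketch below uses base R only, with negation standing in for desc() on a numeric vector (an equivalent for this purpose, used here to avoid loading dplyr).

```r
x <- c(10, 20, 20, 30)

r_avg <- rank(x)                        # ties share the average rank (default)
r_min <- rank(x, ties.method = "min")   # ties share the smallest rank
r_desc <- rank(-x)                      # negate to rank from largest to smallest

print(r_avg)
print(r_min)
print(r_desc)
```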
example:
Recall the BabyNames data table.
head(BabyNames)
name | sex | count | year |
---|---|---|---|
Mary | F | 7065 | 1880 |
Anna | F | 2604 | 1880 |
Emma | F | 2003 | 1880 |
Elizabeth | F | 1939 | 1880 |
Minnie | F | 1746 | 1880 |
Margaret | F | 1578 | 1880 |
Here we provide the rank of the most popular BabyNames:
BabyNames %>%
group_by(name) %>%
summarise(total=sum(count)) %>%
arrange(desc(total)) %>%
mutate(rank=rank(desc(total))) %>%
head()
name | total | rank |
---|---|---|
James | 5114325 | 1 |
John | 5095590 | 2 |
Robert | 4809858 | 3 |
Michael | 4315029 | 4 |
Mary | 4127615 | 5 |
William | 4054318 | 6 |
Suppose you want the 50th most popular name:
BabyNames %>%
group_by(name) %>%
summarise(total=sum(count)) %>%
mutate(popularity=rank(desc(total))) %>%
filter( popularity ==50)
name | total | popularity |
---|---|---|
Nicholas | 874177 | 50 |
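The same pattern can be tried on a table small enough to check by hand. The sketch below assumes dplyr is installed and uses a made-up data frame named toy. Its names and counts are invented for illustration and are not taken from BabyNames.

```r
library(dplyr)

# Invented toy data: each name appears in two rows.
toy <- data.frame(
  name  = c("Ann", "Bob", "Ann", "Cy", "Bob", "Cy"),
  count = c(5, 3, 4, 10, 1, 2)
)

ranked <- toy %>%
  group_by(name) %>%
  summarise(total = sum(count)) %>%          # totals: Ann 9, Bob 4, Cy 12
  mutate(popularity = rank(desc(total)))     # largest total gets rank 1

second <- ranked %>% filter(popularity == 2)
print(second)
```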
How would you change the above code to find, for men, the 5 most popular names in BabyNames?