Source file ⇒ lec18.Rmd

Announcements

  1. Midterm Review sheet (extra credit)
  2. Midterm Review problems (extra credit)
  3. Midterm review in lab
  4. Light homework #7

Today

  1. Wrap up discussion of apply functions
  2. Efficient programming
  3. Ranks and Ordering (chapter 12)

1. Apply functions (chapter 4 of DataCamp's Intermediate R)

Last time we learned about lapply() and sapply(). These functions allow us to avoid loops on lists, data frames and vectors.

example:

mylist <- list(1,2,3)
myfunc <- function(x,y,z) x+y+z

mylist %>% sapply(myfunc,10,20)
## [1] 31 32 33
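
The extra arguments after the function name (10 and 20) are handed to every call of myfunc, where they fill y and z. A minimal sketch of what that call is doing, written in base R without the pipe; the names with_dots and with_wrapper are only illustrative:

```r
mylist <- list(1, 2, 3)
myfunc <- function(x, y, z) x + y + z

# Extra arguments after FUN are forwarded to every call of myfunc
with_dots <- sapply(mylist, myfunc, 10, 20)

# The same computation with the forwarding written out by hand
with_wrapper <- sapply(mylist, function(x) myfunc(x, y = 10, z = 20))

# Both approaches give the same vector
identical(with_dots, with_wrapper)
## [1] TRUE
```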

example:

Here is a function that converts inches to millimeters or vice versa:

convert = function(vals, toMM = TRUE) {
  if ( toMM ) {
    vals = vals * 25.4
  } else {
    vals = vals / 25.4
  }
  return(vals)
}

1:5 %>% convert(TRUE)
## [1]  25.4  50.8  76.2 101.6 127.0

Task for you

Convert the following data frame from mm to inches

df <- data.frame(a=c(1,2,3),b=c(1,2,3))

Each of the columns of df is converted into a vector of length 3.

We can improve the performance of sapply() by using vapply() instead. vapply() has an extra required argument, FUN.VALUE, which is a template giving the type and length of the value returned by each call to FUN.

vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
df %>% vapply(convert, FUN.VALUE = numeric(3), toMM = FALSE)
        a         b
0.0393701 0.0393701
0.0787402 0.0787402
0.1181102 0.1181102
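
Besides speed, FUN.VALUE acts as a check on the results: if a call to FUN returns something that does not match the template, vapply() stops with an error instead of silently returning a different shape. A small sketch of that behavior, with the convert function restated in compact form so the block runs on its own (the message string in the error handler is made up for illustration):

```r
convert <- function(vals, toMM = TRUE) {
  if (toMM) vals * 25.4 else vals / 25.4
}
df <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3))

# Each column yields 3 numbers, so numeric(3) is the right template
ok <- vapply(df, convert, FUN.VALUE = numeric(3), toMM = FALSE)
dim(ok)
## [1] 3 2

# A template of the wrong length makes vapply() stop with an error,
# which is caught here so the script keeps running
bad <- tryCatch(
  vapply(df, convert, FUN.VALUE = numeric(1), toMM = FALSE),
  error = function(e) "vapply refused: result did not match FUN.VALUE"
)
bad
## [1] "vapply refused: result did not match FUN.VALUE"
```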

2. Efficient Programming

The first rule of efficient programming in R is to make use of vectorized calculations whenever possible. If you can’t vectorize your calculation, try using sapply() before writing a for loop.

Let's compare the speed of squaring a large vector a number of different ways.

You can check how much time it takes to evaluate any expression by wrapping it in system.time(). Units are in seconds.

system.time(normal.samples <- rnorm(1000000))
##    user  system elapsed 
##   0.070   0.004   0.077

The elapsed time is the wall clock time. The user time is the CPU time spent performing the R process (in this case calling rnorm() and creating a vector of length 1000000). The system time is CPU time spent by the operating system on behalf of the process (for example opening and closing files, doing input and output, and looking at the system clock).
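
The value returned by system.time() can also be stored and inspected. A short sketch (the variable name timing is arbitrary, and the actual numbers depend on the machine, so none are shown):

```r
# system.time() returns a named numeric vector of class "proc_time"
timing <- system.time(normal.samples <- rnorm(1000000))

# Pull out a single component by name
timing["elapsed"]

# The timed assignment still took place, so the samples are available
length(normal.samples)
## [1] 1000000
```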

slow

n <- 10000
system.time({
vals <- c()   # start empty, so the vector has to grow inside the loop
for(i in 1:n)
      vals[i] <- i^2
})
##    user  system elapsed 
##   0.009   0.000   0.010
head(vals)
## [1]  1  4  9 16 25 36

As the vector vals grows, R has to find 10,000 increasingly large chunks of memory. This is time consuming. Let's look at the address of the first block of each of those chunks of memory.

library(pryr)

vals <- c()
n <- 10000
for(i in 1:n)  vals <- c(vals, address(vals))

head(vals)
## [1] "0x7fa5eb807378" "0x7fa5f27ab5e8" "0x7fa5f59069f8" "0x7fa5f4384518"
## [5] "0x7fa5f4384560" "0x7fa5f5252070"

To make the loop faster, you should pre-allocate memory so that the vector entries are stored in one fixed, large chunk of memory.

n <- 10000
system.time({
vals <- rep(0, n)
for(i in 1:n)
      vals[i] <- i^2
})
##    user  system elapsed 
##   0.009   0.000   0.011
head(vals)
## [1]  1  4  9 16 25 36
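
To see the two strategies side by side, here is a minimal sketch that wraps each one in a function. The helper names grow and prealloc are hypothetical, and the timings are left out because they vary with the machine and the R version:

```r
n <- 10000

# Hypothetical helper: grow the result one element at a time
grow <- function(n) {
  vals <- c()
  for (i in 1:n) vals <- c(vals, i^2)
  vals
}

# Hypothetical helper: allocate the full vector once, then fill it in
prealloc <- function(n) {
  vals <- numeric(n)
  for (i in 1:n) vals[i] <- i^2
  vals
}

system.time(a <- grow(n))
system.time(b <- prealloc(n))

# Both strategies compute the same values
identical(a, b)
## [1] TRUE
```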

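The memory block is fixed. (The first address differs because storing a character string in the numeric vector vals converts it to a character vector once; after that the address no longer changes.)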

n <- 10000
vals <- rep(0, n)
for(i in 1:n)
      vals[i] <- address(vals)

head(vals)
## [1] "0x7fa5f44a6c00" "0x7fa5f42e6600" "0x7fa5f42e6600" "0x7fa5f42e6600"
## [5] "0x7fa5f42e6600" "0x7fa5f42e6600"

sapply() is even faster here, since the loop it runs internally (through lapply()) is written in C.

n <- 10000
system.time(
vals <- 1:n %>% sapply(function(x) x^2)
)
##    user  system elapsed 
##   0.007   0.001   0.009

vapply() is possibly a little faster than sapply(), since you tell R ahead of time what type and length each result will have.

n <- 10000
system.time(
vals <- 1:n %>% vapply(function(x) x^2, numeric(1))
)
##    user  system elapsed 
##   0.007   0.000   0.007

If you can vectorize your calculation, that is almost always the fastest option.

n <- 10000
system.time(
  vals <- ( 1:n)^2
)
##    user  system elapsed 
##       0       0       0
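
With n = 10000 the timings are too small to separate the methods reliably. The sketch below repeats the comparison with a larger n (the value 1e5 is an arbitrary choice) and checks that every approach returns the same vector. The variable names are illustrative, and timings are not shown since they depend on the machine:

```r
n <- 1e5

# Pre-allocated for loop
system.time({
  loop_vals <- numeric(n)
  for (i in 1:n) loop_vals[i] <- i^2
})

# sapply(), vapply(), and the vectorized calculation
system.time(sapply_vals <- sapply(1:n, function(x) x^2))
system.time(vapply_vals <- vapply(1:n, function(x) x^2, numeric(1)))
system.time(vector_vals <- (1:n)^2)

# All four approaches agree element for element
identical(loop_vals, sapply_vals)
identical(loop_vals, vapply_vals)
identical(loop_vals, vector_vals)
```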

In hw #6 I hope that you did this:

n <- 500
x <- runif(n,-1,1)
y <- runif(n, -1,1)
z <- x^2 +y^2<=1

head(z)
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
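
Because z is a logical vector, it can be summarized without a loop as well. The sketch below assumes (this is not stated in the homework text above) that the goal is the fraction of random points falling inside the unit circle; the seed is arbitrary and only makes the run reproducible:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 500
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
z <- x^2 + y^2 <= 1

# TRUE counts as 1 and FALSE as 0, so mean(z) is the fraction inside the circle
inside <- mean(z)

# The circle covers pi/4 of the square, so 4 * inside is a rough estimate of pi
4 * inside
```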

i-clickers

Review vector subsetting using logicals

x <- c(NA,2,3,NA)
!is.na(x)
## [1] FALSE  TRUE  TRUE FALSE
x[!is.na(x)]
## [1] 2 3
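
A few related idioms built on the same logical vector, as a short sketch:

```r
x <- c(NA, 2, 3, NA)

# Count the missing values: TRUE is treated as 1 when summed
sum(is.na(x))
## [1] 2

# Combine conditions with & to keep non-missing values greater than 2
x[!is.na(x) & x > 2]
## [1] 3

# which() gives the positions of the TRUE entries instead of the values
which(!is.na(x))
## [1] 2 3
```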

3. Ranks and Ordering

The data verb ‘rank()’ replaces each number in a vector with its rank with respect to other numbers in the vector. The smallest number has the rank 1.

example:

c(20:11) %>% rank()
##  [1] 10  9  8  7  6  5  4  3  2  1

desc() %>% rank() reverses the direction of the ranking so the largest number in the vector has rank 1.

c(20:11) %>% desc() %>% rank()
##  [1]  1  2  3  4  5  6  7  8  9 10
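
Ranking is easy to confuse with ordering. The base R sketch below contrasts rank() with order() on a small made-up vector and shows how rank() treats ties by default:

```r
x <- c(30, 10, 20)

# rank(): the position each value would have after sorting
rank(x)
## [1] 3 1 2

# order(): the indices that would sort the vector
order(x)
## [1] 2 3 1

# Indexing by order(x) gives the same result as sort(x)
x[order(x)]
## [1] 10 20 30

# Tied values share the average of their ranks by default
rank(c(10, 20, 20, 30))
## [1] 1.0 2.5 2.5 4.0
```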

example:

Recall the BabyNames data table.

head(BabyNames)
name      sex count year
Mary      F    7065 1880
Anna      F    2604 1880
Emma      F    2003 1880
Elizabeth F    1939 1880
Minnie    F    1746 1880
Margaret  F    1578 1880

Here we provide the rank of the most popular BabyNames:

BabyNames %>%
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  arrange(desc(total)) %>%
  mutate(rank=rank(desc(total))) %>%
  head()
name      total rank
James   5114325    1
John    5095590    2
Robert  4809858    3
Michael 4315029    4
Mary    4127615    5
William 4054318    6

Suppose you want the 50th most popular name:

BabyNames %>%
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  mutate(popularity=rank(desc(total))) %>%
  filter( popularity ==50)
name      total popularity
Nicholas 874177         50
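
Filtering on an exact rank works here because no other name is tied with Nicholas. With ties, rank() assigns the average rank, so a whole-number rank may not exist at all. The sketch below uses a small made-up table (toy, with invented names and totals) and dplyr's min_rank() as one alternative:

```r
library(dplyr)

# Made-up totals for illustration; Bea and Cal are tied
toy <- data.frame(name  = c("Ann", "Bea", "Cal", "Dee"),
                  total = c(100, 80, 80, 50))

ranked <- toy %>%
  mutate(popularity     = rank(desc(total)),      # average rank for ties: 1, 2.5, 2.5, 4
         popularity_min = min_rank(desc(total)))  # smallest rank for ties: 1, 2, 2, 4

# No row has popularity exactly 2, so this returns zero rows
ranked %>% filter(popularity == 2)

# min_rank() keeps both tied names at rank 2
ranked %>% filter(popularity_min == 2)
```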

Task for you

How would you change the above code to find, for men, the 5 most popular names in BabyNames?