Announcements

Midterm Review sheet (extra credit)
Midterm Review problems (extra credit)
midterm review in lab
light homework #7

Today

Wrap up discussion of apply functions
Efficient programming
Ranks and Ordering (chapter 12)

1. Apply functions (chapter 4 of DataCamp’s Intermediate R)

Last time we learned about lapply() and sapply(). These functions allow us to avoid loops on lists, data frames and vectors.

example:

mylist <- list(1,2,3)
myfunc <- function(x,y,z) x+y+z

mylist %>% sapply(myfunc,10,20)

## [1] 31 32 33
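To make the argument passing explicit, here is a small sketch in base R (no pipe needed). Each list element is passed as the first argument x, while the extra values 10 and 20 fill y and z. The names as_list, as_vector and by_loop are only illustrative.

```r
mylist <- list(1, 2, 3)
myfunc <- function(x, y, z) x + y + z

# lapply() always returns a list, one entry per input element.
as_list <- lapply(mylist, myfunc, 10, 20)

# sapply() does the same work but simplifies the result to a vector when it can.
as_vector <- sapply(mylist, myfunc, 10, 20)

# The explicit loop that the apply functions let us avoid.
by_loop <- numeric(length(mylist))
for (i in seq_along(mylist)) {
  by_loop[i] <- myfunc(mylist[[i]], 10, 20)
}

str(as_list)
print(as_vector)
print(by_loop)
```

All three hold the same numbers; only the container differs (a list for lapply(), a numeric vector for the other two).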

example:

Here is a function that converts inches to millimeters or vice versa:

convert = function(vals, toMM = TRUE) {
  if ( toMM ) {
    vals = vals * 25.4
  } else {
    vals = vals / 25.4
  }
  return(vals)
}

1:5 %>% convert(TRUE)

## [1]  25.4  50.8  76.2 101.6 127.0

Task for you

Convert the following data frame from mm to inches

df <- data.frame(a=c(1,2,3),b=c(1,2,3))

Each column of df is converted into a vector of length 3.

We can often improve on sapply() by using vapply() instead. vapply() has an extra required argument, FUN.VALUE, which is a template giving the type and length of the value returned for each element.

vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)

df %>% vapply(convert, FUN.VALUE = numeric(3), toMM = FALSE)

a	b
0.0393701	0.0393701
0.0787402	0.0787402
0.1181102	0.1181102
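The practical benefit of FUN.VALUE is that vapply() checks every result against the template. The sketch below uses a compact copy of convert() and a deliberately wrong template; the variable names ok and bad are only illustrative.

```r
# Compact copy of the convert() function from above.
convert <- function(vals, toMM = TRUE) {
  if (toMM) vals * 25.4 else vals / 25.4
}
df <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3))

# Correct template: each column yields a numeric vector of length 3.
ok <- vapply(df, convert, FUN.VALUE = numeric(3), toMM = FALSE)

# Wrong template: claiming length 1 makes vapply() stop with an error,
# which we catch here so the script keeps running.
bad <- tryCatch(
  vapply(df, convert, FUN.VALUE = numeric(1), toMM = FALSE),
  error = function(e) "vapply() rejected the result"
)

print(dim(ok))
print(bad)
```

sapply() performs no such check, so a function that returns an unexpected shape can go unnoticed until later in the analysis.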

2. Efficient Programming

The first rule of efficient programming in R is to make use of vectorized calculations whenever possible. If you can’t vectorize your calculation, try using sapply() before writing a for loop.

Let’s compare the speed of squaring a large vector in a number of different ways.

You can check how much time it takes to evaluate any expression by wrapping it in system.time(). Units are in seconds.

system.time(normal.samples <- rnorm(1000000))

##    user  system elapsed 
##   0.070   0.004   0.077

The elapsed time is the wall-clock time. The user time is the CPU time spent running the R process itself (in this case calling rnorm() once to create a vector of length 1000000). The system time is CPU time spent by the operating system on behalf of the process (for example opening and closing files, doing input and output, and looking at the system clock).
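The value returned by system.time() is a named numeric vector, so individual components can be pulled out and reused. A minimal sketch (the variable names timing, elapsed_seconds and cpu_seconds are only illustrative):

```r
timing <- system.time(normal.samples <- rnorm(1000000))

# Extract single components by name.
elapsed_seconds <- timing[["elapsed"]]
cpu_seconds <- timing[["user.self"]] + timing[["sys.self"]]

print(class(timing))
print(names(timing))
print(elapsed_seconds)
print(cpu_seconds)
```

The numbers will differ from run to run and from machine to machine, so compare timings only when they were measured under the same conditions.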

slow: the vector grows inside the loop

n <- 10000
system.time({
vals <- c()
for(i in 1:n)
      vals <- c(vals, i^2)
})

##    user  system elapsed 
##   0.009   0.000   0.010

head(vals)

## [1]  1  4  9 16 25 36

As the vector vals grows, R has to find 10,000 increasingly large chunks of memory and copy the vector into each one. This is time-consuming. Let’s look at the starting addresses of those chunks of memory.

library(pryr)

vals <- c()
n <- 10000
for(i in 1:n)  vals <- c(vals, address(vals))

head(vals)

## [1] "0x7fa5eb807378" "0x7fa5f27ab5e8" "0x7fa5f59069f8" "0x7fa5f4384518"
## [5] "0x7fa5f4384560" "0x7fa5f5252070"

To make the loop faster, pre-allocate memory so that the vector entries are stored in one fixed, large chunk of memory.

n <- 10000
system.time({
vals <- rep(0, n)
for(i in 1:n)
      vals[i] <- i^2
})

##    user  system elapsed 
##   0.009   0.000   0.011

head(vals)

## [1]  1  4  9 16 25 36

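With pre-allocation the memory block stays fixed. (The first address differs only because storing a character string in a numeric vector forces one conversion, and therefore one copy, to a character vector.)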

n <- 10000
vals <- rep(0, n)
for(i in 1:n)
      vals[i] <- address(vals)

head(vals)

## [1] "0x7fa5f44a6c00" "0x7fa5f42e6600" "0x7fa5f42e6600" "0x7fa5f42e6600"
## [5] "0x7fa5f42e6600" "0x7fa5f42e6600"

sapply() can be slightly faster still, since its internal loop is written in C (although it still calls the R function once per element).

n <- 10000
system.time(
vals <- 1:n %>% sapply(function(x) x^2)
)

##    user  system elapsed 
##   0.007   0.001   0.009

vapply() is possibly a little faster than sapply(), since you tell R ahead of time what type and size each result is going to be.

n <- 10000
system.time(
vals <- 1:n %>% vapply(function(x) x^2, numeric(1))
)

##    user  system elapsed 
##   0.007   0.000   0.007

If you can vectorize your calculation, this is almost always the fastest option.

n <- 10000
system.time(
  vals <- ( 1:n)^2
)

##    user  system elapsed 
##       0       0       0
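Before comparing speeds it is worth confirming that the different approaches really compute the same thing. The sketch below runs four versions side by side; the variable names are only illustrative, and no timings are shown because they depend on the machine.

```r
n <- 10000

# 1. Growing the vector inside the loop (a new, larger copy on every pass).
grown <- c()
for (i in 1:n) grown <- c(grown, i^2)

# 2. Pre-allocating the full vector, then filling it in place.
prealloc <- numeric(n)
for (i in 1:n) prealloc[i] <- i^2

# 3. vapply(), declaring that each result is a single number.
applied <- vapply(1:n, function(x) x^2, numeric(1))

# 4. Fully vectorized.
vectorized <- (1:n)^2

# All four results agree.
print(identical(grown, vectorized))
print(identical(prealloc, vectorized))
print(identical(applied, vectorized))
```

Wrapping any one of the four pieces in system.time() gives its timing on your own machine.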

In hw #6 I hope that you did this:

n <- 500
x <- runif(n,-1,1)
y <- runif(n, -1,1)
z <- x^2 +y^2<=1

head(z)

## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
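For contrast, here is a sketch of the same test written as an explicit loop. The seed is an arbitrary choice made only so the run is reproducible, and z_loop is an illustrative name.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 500
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)

# Vectorized: one comparison over the whole vectors.
z <- x^2 + y^2 <= 1

# Loop version: the same comparison, one point at a time.
z_loop <- logical(n)
for (i in seq_len(n)) {
  z_loop[i] <- x[i]^2 + y[i]^2 <= 1
}

print(identical(z, z_loop))

# Proportion of points that fall inside the unit circle.
print(mean(z))
```

Because the points are uniform on the square from -1 to 1 in each direction, the proportion inside the unit circle should be close to pi/4 (about 0.785).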

i-clickers

Review vector subsetting using logicals

x <- c(NA,2,3,NA)
!is.na(x)

## [1] FALSE  TRUE  TRUE FALSE

x[!is.na(x)]

## [1] 2 3
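A few related idioms, sketched on the same vector. The point of the last two lines is that a comparison involving NA returns NA, so it helps to combine it with !is.na(). The name keep is only illustrative.

```r
x <- c(NA, 2, 3, NA)
keep <- !is.na(x)

# TRUE counts as 1, so sum() counts the non-missing values.
print(sum(keep))

# which() turns a logical vector into positions.
print(which(keep))

# A comparison alone carries the NAs through into the result ...
print(x[x > 2])

# ... while combining it with the NA test drops them.
print(x[keep & x > 2])
```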

3. Ranks and Ordering

The data verb ‘rank()’ replaces each number in a vector with its rank with respect to other numbers in the vector. The smallest number has the rank 1.

example:

c(20:11) %>% rank()

##  [1] 10  9  8  7  6  5  4  3  2  1

desc() %>% rank() reverses the direction of the ranking so the largest number in the vector has rank 1.

c(20:11) %>% desc() %>% rank()

##  [1]  1  2  3  4  5  6  7  8  9 10
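Two details are worth seeing on a small example: how ties are ranked, and how rank() differs from order(). This sketch uses base R only, with made-up values; negating the numbers stands in for desc() here, which is an assumption that holds for numeric vectors.

```r
scores <- c(30, 10, 20, 20)  # made-up values with one tie

# By default tied values share the average of their ranks.
print(rank(scores))

# ties.method = "min" gives tied values the smallest of their ranks.
print(rank(scores, ties.method = "min"))

# Ranking the negated values puts the largest number first.
print(rank(-scores, ties.method = "min"))

# order() answers a different question: which positions, in sequence,
# would sort the vector?
print(order(scores))
print(scores[order(scores)])
```

rank() returns one rank per element in the original positions, while order() returns positions that can be used to rearrange the vector.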

example:

Recall the BabyNames data table.

head(BabyNames)

name	sex	count	year
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
Margaret	F	1578	1880

Here we provide the rank of the most popular BabyNames:

BabyNames %>%
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  arrange(desc(total)) %>%
  mutate(rank=rank(desc(total))) %>%
  head()

name	total	rank
James	5114325	1
John	5095590	2
Robert	4809858	3
Michael	4315029	4
Mary	4127615	5
William	4054318	6

Suppose you want the 50th most popular name:

BabyNames %>%
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  mutate(popularity=rank(desc(total))) %>%
  filter( popularity ==50)

name	total	popularity
Nicholas	874177	50
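To see what the group, total, rank and filter steps do, here is a sketch of the same idea in base R on a tiny made-up table. The data frame toy and its values are hypothetical and are not taken from BabyNames.

```r
# Hypothetical toy data: several rows per name, as in BabyNames.
toy <- data.frame(
  name  = c("Ann", "Bob", "Ann", "Cal", "Bob", "Ann"),
  count = c(5, 3, 4, 10, 1, 2)
)

# Group the counts by name and total them (the summarise step).
totals <- sapply(split(toy$count, toy$name), sum)

# Rank so that the largest total gets rank 1 (the rank(desc()) step).
popularity <- rank(-totals)

# Keep the name with a given rank (the filter step).
second_most_popular <- names(totals)[popularity == 2]

print(totals)
print(popularity)
print(second_most_popular)
```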

Task for you

How would you change the above code to find, for men, the 5 most popular names in BabyNames?

lec18
