Source file ⇒ 2017-lec11.Rmd

Announcement

  1. HW 4/lab 5 extension untension till Thursday due to holiday.
  2. Midterm is Tuesday March 14 (50 min test)–paper test
  3. Covers Chap 1-12, 14, 16 in book, + DataCamp Intermediate R chap 1-4
  4. HW 5 due in two weeks (not a long assignment)
  5. lab this week is practice midterm questions
  6. Will pass out MT review problems next Tuesday due in one week for EC.
  7. Will review in class T,Th week before midterm.
  8. After midterm we start learning Unix

Today

  1. Calling a function
  2. Wrap up discussion of apply functions
  3. Efficient programming
  4. Ranks and Ordering (chapter 12 DC book)
  5. Collective Properties of Cases (chap 14 DC book)

0. Calling a function:

The bulit in function base::sum() has the following documentation:

For example:

sum(1:5)
## [1] 15
sum(1:5,1:2)
## [1] 18
sum(1:5, TRUE)
## [1] 16
sum(c(1,2,NA),TRUE)
## [1] NA
sum(c(1,2,NA), na.rm=TRUE)
## [1] 3

Writing na.rm=TRUE identifies that na.rm is the value for the object name na.rm. In this example you can see it might be important to name the values you pass in a function call.

1. Apply functions (chap 4 data camp’s Int. R)

Last time we learned about lapply() and sapply(). These functions allow us to avoid loops on lists, data frames and vectors.

sapply() and vapply() are very similar to lapply() except they simplify their output to produce an atomic vector. While sapply() guesses, vapply() takes an additional argument specifying the output type.

sapply()

example:

mylist <- list(1,2,3)
myfunc <- function(x,y,z) x+y+z

mylist %>% sapply(myfunc,10,20)
## [1] 31 32 33
mylist %>% sapply(myfunc,y=10,z=20)
## [1] 31 32 33

In Class exercise

Do problem 1

http://gandalf.berkeley.edu:3838/alucas/Lecture-11-collection/

vapply()

We can improve the performance of sapply() by using vapply() instead. vapply() has an extra arguement, FUN.VALUE, which is the class of the output.

vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)

examples:

mtcars %>% sapply(is.numeric)
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
mtcars %>% vapply( is.numeric, logical(1))
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
df <- data.frame(a=c(1,2,3),b=c(1,2,3))

convert = function(vals, toMM = TRUE) {
  if ( toMM ) {
    vals = vals * 25.4
  } else {
    vals = vals / 25.4
  }
  return(vals)
}
df %>% vapply(convert,toMM=FALSE,numeric(3))
a b
0.0393701 0.0393701
0.0787402 0.0787402
0.1181102 0.1181102

2. Efficient Programming

The first rule of efficient programming in R is to make use of vectorized calculations when ever possible. If you can’t vectorize your calculation try using sapply before writing a for loop.

Lets compare the speed of squaring a large vector a number of different ways.

You can check how much time it takes to evaluate any expression by wrapping it in system.time(). Units are in seconds.

system.time(normal.samples <- rnorm(1000000))
##    user  system elapsed 
##   0.066   0.002   0.068

The elapsed time is the wall clock time. The user time is the CPU time perfomring the R process (in this case calling runif() repeatedly and creating a vector of length 1000000 ). The system time is CPU time spent by the operating system (for example opening and closing files, doing input and output, and looking at the system clock).

slow

n <- 10000
vals=c()
system.time({
for(i in 1:n)
      vals=c(vals,i^2)
})
##    user  system elapsed 
##   0.229   0.008   0.237
head(vals)
## [1]  1  4  9 16 25 36

As the vector val grows we find 10,000 increasingly large chuncks of memory (growing vector size). This is time consuming. Lets look at the addresses of the first block of those chunks of memory.

library(pryr)

vals <- c()
n <- 10000
for(i in 1:n)  vals <- c(vals, address(vals))

head(vals)
## [1] "0x7fea7b007378" "0x7fea83586008" "0x7fea82260b20" "0x7fea8235cf70"
## [5] "0x7fea825345b0" "0x7fea824a6210"

to make loop faster you should pre-allocate memory so that the vector entries are stored in a fixed large chunck of memory.

n <- 10000
system.time({
vals <- rep(0, n)
for(i in 1:n)
      vals[i] <- i^2
})
##    user  system elapsed 
##   0.009   0.000   0.009
head(vals)
## [1]  1  4  9 16 25 36

memory block is fixed

n <- 10000
vals <- rep(0, n)
for(i in 1:n)
      vals[i] <- address(vals)

head(vals)
## [1] "0x7fea8522c600" "0x7fea85270e00" "0x7fea85270e00" "0x7fea85270e00"
## [5] "0x7fea85270e00" "0x7fea85270e00"

sapply is even faster since the function is written in C language

n <- 10000
system.time(
vals <- 1:n %>% sapply(function(x) x^2)
)
##    user  system elapsed 
##   0.008   0.000   0.009

vapply() is possibly a little faster than sapply() since you tell the computer ahead of time what size the output array is going to be

n <- 10000
system.time(
vals <- 1:n %>% vapply(function(x) x^2, numeric(1))
)
##    user  system elapsed 
##   0.009   0.000   0.009

If you can vectorize your function this is always the fastest.

n <- 10000
system.time(
  vals <- ( 1:n)^2
)
##    user  system elapsed 
##       0       0       0

In hw #4 I hope that you did this:

n <- 500
x <- runif(n,-1,1)
y <- runif(n, -1,1)
z <- x^2 +y^2<=1

head(z)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

3. Ranks and Ordering

The data verb ‘rank()’ replaces each number in a vector with its rank with respect to other numbers in the vector. The smallest number has the rank 1.

example:

c(20:11) %>% rank()
##  [1] 10  9  8  7  6  5  4  3  2  1

desc() %>% rank() reverses the direction of the ranking so the largest number in the vector has rank 1.

c(20:11) %>% desc() %>% rank()
##  [1]  1  2  3  4  5  6  7  8  9 10

What happens if we have a tie?

c(4,4,4,4,4) %>% rank()  #average of 1,2,3,4,5 is 3
## [1] 3 3 3 3 3
c(2,4,2,9,9,8) %>% rank()  # average of 1,2 is 1.5,  average of 5,6 is 5.5
## [1] 1.5 3.0 1.5 5.5 5.5 4.0

example:

Recall the Babynames data table.

head(BabyNames)
name sex count year
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880

The dataverb dplyr::arrange() arranges a data frame in increasing order.

mtcars %>% 
  arrange(cyl,desc(disp)) %>%  #increasing for cyl, decreasing for disp
  head()
mpg cyl disp hp drat wt qsec vs am gear carb
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

Next provide the rank of the most popular BabyNames:

BabyNames %>%
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  arrange(desc(total)) %>%
  mutate(rank=rank(total)) %>%
  head()
name total rank
James 5114325 92600
John 5095590 92599
Robert 4809858 92598
Michael 4315029 92597
Mary 4127615 92596
William 4054318 92595
BabyNames %>%
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  arrange(desc(total)) %>%
  mutate(rank=rank(desc(total))) %>%
  head()
name total rank
James 5114325 1
John 5095590 2
Robert 4809858 3
Michael 4315029 4
Mary 4127615 5
William 4054318 6

Suppose you want the 50th most popular name:

BabyNames %>%
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  mutate(popularity=rank(desc(total))) %>%
  filter( popularity ==50)
name total popularity
Nicholas 874177 50

3. Collective Properties of Cases

Geom_denisty plots

Here is a selection of variables from the data table NCHS

NCHS %>% select(age,weight,sex,diabetic) %>%
  head(3)
age weight sex diabetic
2 12.5 female Not
77 75.4 male Not
10 32.9 female Not

Density glyphs can be used to compare distributions for different groups. You need merely indicate which variable should be used for grouping. For instance, here are the densities of body weight for each sex:

NCHS %>% ggplot( aes(x=weight) ) + 
  geom_density(aes(color=sex, fill=sex), alpha=.5) + 
  xlab("Weight (kg)") + ylab( "Density (1/kg)")

The density shows detail, but is often a fairly simple shape. This allows you to place densities side by side for comparison. For instance, here people are being divided into different age groups.

The density shows detail, but is often a fairly simple shape. This allows you to place densities side by side for comparison. For instance, here people are being divided into different age groups.

#library(mosaic)  Need library(mosaic) for ntiles()

NCHS <- 
  NCHS %>% 
  mutate( ageGroup=mosaic::ntiles(age, n=6, format="interval") )
NCHS %>%
  ggplot( aes(x=weight, group=sex) ) + 
  geom_density(aes(color=sex, fill=sex), alpha=.25) + 
  xlab("Weight (kg)") + ylab( "Density (1/kg)") + 
  facet_wrap( ~ ageGroup )

As always, the best choice of a mapping between variables and graphical attributes depends on what aspect of the data you would like to emphasize. For instance, to see how weight changes with age, you might prefer to use sex as the faceting variable and age as the grouping variable.

NCHS %>%
  ggplot( aes(x=weight) ) + 
  geom_density(aes(color=ageGroup, fill=ageGroup), alpha=.25) + 
  xlab("Weight (kg)") + ylab( "Density (1/kg)") + 
  facet_wrap( ~ sex ) + theme(text = element_text(size = 20))

box and whisker (boxplot) and violin geoms

Sometimes a more stylized depiction of the distribution is helpful. This makes it feasible to put the distributions side-by-side for comparison. The most common such depiction is the box and whisker glyph.

The lower and upper “hinges” correspond to the first and third quartiles (the 25th and 75th percentiles). The upper whisker extends from the hinge to the highest value that is within 1.5 * IQR of the hinge, where IQR is the inter-quartile range, or distance between the first and third quartiles. The lower whisker extends from the hinge to the lowest value within 1.5 * IQR of the hinge. Data beyond the end of the whiskers are outliers and plotted as point.

Here is data from the mpg data table giving fuel economy data for 38 popular models of cars

head(ggplot2::mpg,3)
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
p <- mpg %>% ggplot(aes(class, hwy))
p + geom_boxplot() 

and with geom_jitter()

p <- mpg %>% ggplot(aes(class, hwy))
p + geom_boxplot() + geom_jitter()

You can use aesethic fill in geom_boxplot() to color the inside of the box plot.

p <- mpg %>% ggplot(aes(class, hwy))
p + geom_boxplot(aes(fill=drv)) 

Here is another example with the NCHS data

NCHS %>% select(age,weight,sex,diabetic,ageGroup) %>%
  head(3)
age weight sex diabetic ageGroup
2 12.5 female Not [1,8]
77 75.4 male Not [62,85]
10 32.9 female Not [8,14]
NCHS %>%
  ggplot( aes(y=weight, x=ageGroup ) ) + 
  geom_boxplot(
    aes(color=ageGroup, fill=ageGroup), 
    alpha=.25, outlier.size=1, outlier.colour="gray") + 
  ylab("Weight (kg)") + xlab( "Age Group (yrs)") + 
  facet_wrap( ~ sex )

In Class exercise

Do problem 3:

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-11-collection/

The violin glyph provides a similar ease of comparison to the box-and-whisker glyph, but shows more detail in the density.

NCHS %>%
  ggplot( aes(y=weight, x=ageGroup) ) + 
  geom_violin( aes(color=ageGroup, fill=ageGroup), 
    alpha=.25) + 
  ylab("Weight (kg)") + xlab( "Age Group (yrs)") + 
  facet_wrap( ~ sex )

Both the box-and-whisker and violin glyphs use ink very efficiently. This allows you to show other data. For instance, here’s the same box-and-whisker plot, but with people with diabetics shown in red.

NCHS %>%
  ggplot(aes(y=weight, x=ageGroup)) + geom_boxplot(aes(fill=diabetic), position = position_dodge(width = 0.8),outlier.size = 1, outlier.colour = "gray", alpha=.75) + facet_grid(.~sex) + ylab("Weight (kg)") + xlab("Age Group (yrs)")

You can see that diabetics tend to be heavier than non-diabetics, especially for ages 19+.