Source file ⇒ 2017-lec11.Rmd

Announcement

HW 4/lab 5 extension untension till Thursday due to holiday.
Midterm is Tuesday March 14 (50 min test)–paper test
Covers Chap 1-12, 14, 16 in book, + DataCamp Intermediate R chap 1-4
HW 5 due in two weeks (not a long assignment)
lab this week is practice midterm questions
Will pass out MT review problems next Tuesday due in one week for EC.
Will review in class T,Th week before midterm.
After midterm we start learning Unix

Today

Calling a function
Wrap up discussion of apply functions
Efficient programming
Ranks and Ordering (chapter 12 DC book)
Collective Properties of Cases (chap 14 DC book)

0. Calling a function:

The bulit in function base::sum() has the following documentation:

For example:

sum(1:5)

## [1] 15

sum(1:5,1:2)

## [1] 18

sum(1:5, TRUE)

## [1] 16

sum(c(1,2,NA),TRUE)

## [1] NA

sum(c(1,2,NA), na.rm=TRUE)

## [1] 3

Writing na.rm=TRUE identifies that na.rm is the value for the object name na.rm. In this example you can see it might be important to name the values you pass in a function call.

1. Apply functions (chap 4 data camp’s Int. R)

Last time we learned about lapply() and sapply(). These functions allow us to avoid loops on lists, data frames and vectors.

sapply() and vapply() are very similar to lapply() except they simplify their output to produce an atomic vector. While sapply() guesses, vapply() takes an additional argument specifying the output type.

sapply()

example:

mylist <- list(1,2,3)
myfunc <- function(x,y,z) x+y+z

mylist %>% sapply(myfunc,10,20)

## [1] 31 32 33

mylist %>% sapply(myfunc,y=10,z=20)

## [1] 31 32 33

In Class exercise

Do problem 1

http://gandalf.berkeley.edu:3838/alucas/Lecture-11-collection/

vapply()

We can improve the performance of sapply() by using vapply() instead. vapply() has an extra arguement, FUN.VALUE, which is the class of the output.

vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)

examples:

mtcars %>% sapply(is.numeric)

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

mtcars %>% vapply( is.numeric, logical(1))

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

df <- data.frame(a=c(1,2,3),b=c(1,2,3))

convert = function(vals, toMM = TRUE) {
  if ( toMM ) {
    vals = vals * 25.4
  } else {
    vals = vals / 25.4
  }
  return(vals)
}

df %>% vapply(convert,toMM=FALSE,numeric(3))

a	b
0.0393701	0.0393701
0.0787402	0.0787402
0.1181102	0.1181102

2. Efficient Programming

The first rule of efficient programming in R is to make use of vectorized calculations when ever possible. If you can’t vectorize your calculation try using sapply before writing a for loop.

Lets compare the speed of squaring a large vector a number of different ways.

You can check how much time it takes to evaluate any expression by wrapping it in system.time(). Units are in seconds.

system.time(normal.samples <- rnorm(1000000))

##    user  system elapsed 
##   0.066   0.002   0.068

The elapsed time is the wall clock time. The user time is the CPU time perfomring the R process (in this case calling runif() repeatedly and creating a vector of length 1000000 ). The system time is CPU time spent by the operating system (for example opening and closing files, doing input and output, and looking at the system clock).

slow

n <- 10000
vals=c()
system.time({
for(i in 1:n)
      vals=c(vals,i^2)
})

##    user  system elapsed 
##   0.229   0.008   0.237

head(vals)

## [1]  1  4  9 16 25 36

As the vector val grows we find 10,000 increasingly large chuncks of memory (growing vector size). This is time consuming. Lets look at the addresses of the first block of those chunks of memory.

library(pryr)

vals <- c()
n <- 10000
for(i in 1:n)  vals <- c(vals, address(vals))

head(vals)

## [1] "0x7fea7b007378" "0x7fea83586008" "0x7fea82260b20" "0x7fea8235cf70"
## [5] "0x7fea825345b0" "0x7fea824a6210"

to make loop faster you should pre-allocate memory so that the vector entries are stored in a fixed large chunck of memory.

n <- 10000
system.time({
vals <- rep(0, n)
for(i in 1:n)
      vals[i] <- i^2
})

##    user  system elapsed 
##   0.009   0.000   0.009

head(vals)

## [1]  1  4  9 16 25 36

memory block is fixed

n <- 10000
vals <- rep(0, n)
for(i in 1:n)
      vals[i] <- address(vals)

head(vals)

## [1] "0x7fea8522c600" "0x7fea85270e00" "0x7fea85270e00" "0x7fea85270e00"
## [5] "0x7fea85270e00" "0x7fea85270e00"

sapply is even faster since the function is written in C language

n <- 10000
system.time(
vals <- 1:n %>% sapply(function(x) x^2)
)

##    user  system elapsed 
##   0.008   0.000   0.009

vapply() is possibly a little faster than sapply() since you tell the computer ahead of time what size the output array is going to be

n <- 10000
system.time(
vals <- 1:n %>% vapply(function(x) x^2, numeric(1))
)

##    user  system elapsed 
##   0.009   0.000   0.009

If you can vectorize your function this is always the fastest.

n <- 10000
system.time(
  vals <- ( 1:n)^2
)

##    user  system elapsed 
##       0       0       0

In hw #4 I hope that you did this:

n <- 500
x <- runif(n,-1,1)
y <- runif(n, -1,1)
z <- x^2 +y^2<=1

head(z)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

3. Ranks and Ordering

The data verb ‘rank()’ replaces each number in a vector with its rank with respect to other numbers in the vector. The smallest number has the rank 1.

example:

c(20:11) %>% rank()

##  [1] 10  9  8  7  6  5  4  3  2  1

desc() %>% rank() reverses the direction of the ranking so the largest number in the vector has rank 1.

c(20:11) %>% desc() %>% rank()

##  [1]  1  2  3  4  5  6  7  8  9 10

What happens if we have a tie?

c(4,4,4,4,4) %>% rank()  #average of 1,2,3,4,5 is 3

## [1] 3 3 3 3 3

c(2,4,2,9,9,8) %>% rank()  # average of 1,2 is 1.5,  average of 5,6 is 5.5

## [1] 1.5 3.0 1.5 5.5 5.5 4.0

example:

Recall the Babynames data table.

head(BabyNames)

name	sex	count	year
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
Margaret	F	1578	1880

The dataverb dplyr::arrange() arranges a data frame in increasing order.

mtcars %>% 
  arrange(cyl,desc(disp)) %>%  #increasing for cyl, decreasing for disp
  head()

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
22.8	4	140.8	95	3.92	3.150	22.90	1	0	4	2
21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2
26.0	4	120.3	91	4.43	2.140	16.70	0	1	5	2
21.5	4	120.1	97	3.70	2.465	20.01	1	0	3	1
22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1

Next provide the rank of the most popular BabyNames:

BabyNames %>%
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  arrange(desc(total)) %>%
  mutate(rank=rank(total)) %>%
  head()

name	total	rank
James	5114325	92600
John	5095590	92599
Robert	4809858	92598
Michael	4315029	92597
Mary	4127615	92596
William	4054318	92595

BabyNames %>%
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  arrange(desc(total)) %>%
  mutate(rank=rank(desc(total))) %>%
  head()

name	total	rank
James	5114325	1
John	5095590	2
Robert	4809858	3
Michael	4315029	4
Mary	4127615	5
William	4054318	6

Suppose you want the 50th most popular name:

BabyNames %>%
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  mutate(popularity=rank(desc(total))) %>%
  filter( popularity ==50)

name	total	popularity
Nicholas	874177	50

In Class exercise

Do problem 2:

http://gandalf.berkeley.edu:3838/alucas/Lecture-11-collection/

3. Collective Properties of Cases

Geom_denisty plots

Here is a selection of variables from the data table NCHS

NCHS %>% select(age,weight,sex,diabetic) %>%
  head(3)

age	weight	sex	diabetic
2	12.5	female	Not
77	75.4	male	Not
10	32.9	female	Not

Density glyphs can be used to compare distributions for different groups. You need merely indicate which variable should be used for grouping. For instance, here are the densities of body weight for each sex:

NCHS %>% ggplot( aes(x=weight) ) + 
  geom_density(aes(color=sex, fill=sex), alpha=.5) + 
  xlab("Weight (kg)") + ylab( "Density (1/kg)")

The density shows detail, but is often a fairly simple shape. This allows you to place densities side by side for comparison. For instance, here people are being divided into different age groups.

#library(mosaic)  Need library(mosaic) for ntiles()

NCHS <- 
  NCHS %>% 
  mutate( ageGroup=mosaic::ntiles(age, n=6, format="interval") )

NCHS %>%
  ggplot( aes(x=weight, group=sex) ) + 
  geom_density(aes(color=sex, fill=sex), alpha=.25) + 
  xlab("Weight (kg)") + ylab( "Density (1/kg)") + 
  facet_wrap( ~ ageGroup )

As always, the best choice of a mapping between variables and graphical attributes depends on what aspect of the data you would like to emphasize. For instance, to see how weight changes with age, you might prefer to use sex as the faceting variable and age as the grouping variable.

NCHS %>%
  ggplot( aes(x=weight) ) + 
  geom_density(aes(color=ageGroup, fill=ageGroup), alpha=.25) + 
  xlab("Weight (kg)") + ylab( "Density (1/kg)") + 
  facet_wrap( ~ sex ) + theme(text = element_text(size = 20))

box and whisker (boxplot) and violin geoms

Sometimes a more stylized depiction of the distribution is helpful. This makes it feasible to put the distributions side-by-side for comparison. The most common such depiction is the box and whisker glyph.

The lower and upper “hinges” correspond to the first and third quartiles (the 25th and 75th percentiles). The upper whisker extends from the hinge to the highest value that is within 1.5 * IQR of the hinge, where IQR is the inter-quartile range, or distance between the first and third quartiles. The lower whisker extends from the hinge to the lowest value within 1.5 * IQR of the hinge. Data beyond the end of the whiskers are outliers and plotted as point.

Here is data from the mpg data table giving fuel economy data for 38 popular models of cars

head(ggplot2::mpg,3)

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact

p <- mpg %>% ggplot(aes(class, hwy))
p + geom_boxplot()

and with geom_jitter()

p <- mpg %>% ggplot(aes(class, hwy))
p + geom_boxplot() + geom_jitter()

You can use aesethic fill in geom_boxplot() to color the inside of the box plot.

p <- mpg %>% ggplot(aes(class, hwy))
p + geom_boxplot(aes(fill=drv))

Here is another example with the NCHS data

NCHS %>% select(age,weight,sex,diabetic,ageGroup) %>%
  head(3)

age	weight	sex	diabetic	ageGroup
2	12.5	female	Not	[1,8]
77	75.4	male	Not	[62,85]
10	32.9	female	Not	[8,14]

NCHS %>%
  ggplot( aes(y=weight, x=ageGroup ) ) + 
  geom_boxplot(
    aes(color=ageGroup, fill=ageGroup), 
    alpha=.25, outlier.size=1, outlier.colour="gray") + 
  ylab("Weight (kg)") + xlab( "Age Group (yrs)") + 
  facet_wrap( ~ sex )

In Class exercise

Do problem 3:

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-11-collection/

The violin glyph provides a similar ease of comparison to the box-and-whisker glyph, but shows more detail in the density.

NCHS %>%
  ggplot( aes(y=weight, x=ageGroup) ) + 
  geom_violin( aes(color=ageGroup, fill=ageGroup), 
    alpha=.25) + 
  ylab("Weight (kg)") + xlab( "Age Group (yrs)") + 
  facet_wrap( ~ sex )

Both the box-and-whisker and violin glyphs use ink very efficiently. This allows you to show other data. For instance, here’s the same box-and-whisker plot, but with people with diabetics shown in red.

NCHS %>%
  ggplot(aes(y=weight, x=ageGroup)) + geom_boxplot(aes(fill=diabetic), position = position_dodge(width = 0.8),outlier.size = 1, outlier.colour = "gray", alpha=.75) + facet_grid(.~sex) + ylab("Weight (kg)") + xlab("Age Group (yrs)")

You can see that diabetics tend to be heavier than non-diabetics, especially for ages 19+.

2017 lec 11

Announcement

Today

0. Calling a function:

1. Apply functions (chap 4 data camp’s Int. R)

sapply()

In Class exercise

vapply()

2. Efficient Programming

3. Ranks and Ordering

In Class exercise

3. Collective Properties of Cases

Geom_denisty plots

box and whisker (boxplot) and violin geoms

In Class exercise