Source file ⇒ 2017-lec11.Rmd
The bulit in function base::sum() has the following documentation:
For example:
sum(1:5)
## [1] 15
sum(1:5,1:2)
## [1] 18
sum(1:5, TRUE)
## [1] 16
sum(c(1,2,NA),TRUE)
## [1] NA
sum(c(1,2,NA), na.rm=TRUE)
## [1] 3
Writing na.rm=TRUE
identifies that na.rm
is the value for the object name na.rm
. In this example you can see it might be important to name the values you pass in a function call.
Last time we learned about lapply()
and sapply()
. These functions allow us to avoid loops on lists, data frames and vectors.
sapply()
and vapply()
are very similar to lapply()
except they simplify their output to produce an atomic vector. While sapply()
guesses, vapply()
takes an additional argument specifying the output type.
example:
mylist <- list(1,2,3)
myfunc <- function(x,y,z) x+y+z
mylist %>% sapply(myfunc,10,20)
## [1] 31 32 33
mylist %>% sapply(myfunc,y=10,z=20)
## [1] 31 32 33
Do problem 1
http://gandalf.berkeley.edu:3838/alucas/Lecture-11-collection/
We can improve the performance of sapply()
by using vapply()
instead. vapply()
has an extra arguement, FUN.VALUE, which is the class of the output.
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
examples:
mtcars %>% sapply(is.numeric)
## mpg cyl disp hp drat wt qsec vs am gear carb
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
mtcars %>% vapply( is.numeric, logical(1))
## mpg cyl disp hp drat wt qsec vs am gear carb
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
df <- data.frame(a=c(1,2,3),b=c(1,2,3))
convert = function(vals, toMM = TRUE) {
if ( toMM ) {
vals = vals * 25.4
} else {
vals = vals / 25.4
}
return(vals)
}
df %>% vapply(convert,toMM=FALSE,numeric(3))
a | b |
---|---|
0.0393701 | 0.0393701 |
0.0787402 | 0.0787402 |
0.1181102 | 0.1181102 |
The first rule of efficient programming in R is to make use of vectorized calculations when ever possible. If you can’t vectorize your calculation try using sapply before writing a for loop.
Lets compare the speed of squaring a large vector a number of different ways.
You can check how much time it takes to evaluate any expression by wrapping it in system.time()
. Units are in seconds.
system.time(normal.samples <- rnorm(1000000))
## user system elapsed
## 0.066 0.002 0.068
The elapsed time is the wall clock time. The user time is the CPU time perfomring the R process (in this case calling runif()
repeatedly and creating a vector of length 1000000 ). The system time is CPU time spent by the operating system (for example opening and closing files, doing input and output, and looking at the system clock).
slow
n <- 10000
vals=c()
system.time({
for(i in 1:n)
vals=c(vals,i^2)
})
## user system elapsed
## 0.229 0.008 0.237
head(vals)
## [1] 1 4 9 16 25 36
As the vector val grows we find 10,000 increasingly large chuncks of memory (growing vector size). This is time consuming. Lets look at the addresses of the first block of those chunks of memory.
library(pryr)
vals <- c()
n <- 10000
for(i in 1:n) vals <- c(vals, address(vals))
head(vals)
## [1] "0x7fea7b007378" "0x7fea83586008" "0x7fea82260b20" "0x7fea8235cf70"
## [5] "0x7fea825345b0" "0x7fea824a6210"
to make loop faster you should pre-allocate memory so that the vector entries are stored in a fixed large chunck of memory.
n <- 10000
system.time({
vals <- rep(0, n)
for(i in 1:n)
vals[i] <- i^2
})
## user system elapsed
## 0.009 0.000 0.009
head(vals)
## [1] 1 4 9 16 25 36
memory block is fixed
n <- 10000
vals <- rep(0, n)
for(i in 1:n)
vals[i] <- address(vals)
head(vals)
## [1] "0x7fea8522c600" "0x7fea85270e00" "0x7fea85270e00" "0x7fea85270e00"
## [5] "0x7fea85270e00" "0x7fea85270e00"
sapply
is even faster since the function is written in C language
n <- 10000
system.time(
vals <- 1:n %>% sapply(function(x) x^2)
)
## user system elapsed
## 0.008 0.000 0.009
vapply()
is possibly a little faster than sapply()
since you tell the computer ahead of time what size the output array is going to be
n <- 10000
system.time(
vals <- 1:n %>% vapply(function(x) x^2, numeric(1))
)
## user system elapsed
## 0.009 0.000 0.009
If you can vectorize your function this is always the fastest.
n <- 10000
system.time(
vals <- ( 1:n)^2
)
## user system elapsed
## 0 0 0
In hw #4 I hope that you did this:
n <- 500
x <- runif(n,-1,1)
y <- runif(n, -1,1)
z <- x^2 +y^2<=1
head(z)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
The data verb ‘rank()’ replaces each number in a vector with its rank with respect to other numbers in the vector. The smallest number has the rank 1.
example:
c(20:11) %>% rank()
## [1] 10 9 8 7 6 5 4 3 2 1
desc() %>% rank()
reverses the direction of the ranking so the largest number in the vector has rank 1.
c(20:11) %>% desc() %>% rank()
## [1] 1 2 3 4 5 6 7 8 9 10
What happens if we have a tie?
c(4,4,4,4,4) %>% rank() #average of 1,2,3,4,5 is 3
## [1] 3 3 3 3 3
c(2,4,2,9,9,8) %>% rank() # average of 1,2 is 1.5, average of 5,6 is 5.5
## [1] 1.5 3.0 1.5 5.5 5.5 4.0
example:
Recall the Babynames data table.
head(BabyNames)
name | sex | count | year |
---|---|---|---|
Mary | F | 7065 | 1880 |
Anna | F | 2604 | 1880 |
Emma | F | 2003 | 1880 |
Elizabeth | F | 1939 | 1880 |
Minnie | F | 1746 | 1880 |
Margaret | F | 1578 | 1880 |
The dataverb dplyr::arrange()
arranges a data frame in increasing order.
mtcars %>%
arrange(cyl,desc(disp)) %>% #increasing for cyl, decreasing for disp
head()
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|
24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Next provide the rank of the most popular BabyNames:
BabyNames %>%
group_by(name) %>%
summarise(total=sum(count)) %>%
arrange(desc(total)) %>%
mutate(rank=rank(total)) %>%
head()
name | total | rank |
---|---|---|
James | 5114325 | 92600 |
John | 5095590 | 92599 |
Robert | 4809858 | 92598 |
Michael | 4315029 | 92597 |
Mary | 4127615 | 92596 |
William | 4054318 | 92595 |
BabyNames %>%
group_by(name) %>%
summarise(total=sum(count)) %>%
arrange(desc(total)) %>%
mutate(rank=rank(desc(total))) %>%
head()
name | total | rank |
---|---|---|
James | 5114325 | 1 |
John | 5095590 | 2 |
Robert | 4809858 | 3 |
Michael | 4315029 | 4 |
Mary | 4127615 | 5 |
William | 4054318 | 6 |
Suppose you want the 50th most popular name:
BabyNames %>%
group_by(name) %>%
summarise(total=sum(count)) %>%
mutate(popularity=rank(desc(total))) %>%
filter( popularity ==50)
name | total | popularity |
---|---|---|
Nicholas | 874177 | 50 |
Here is a selection of variables from the data table NCHS
NCHS %>% select(age,weight,sex,diabetic) %>%
head(3)
age | weight | sex | diabetic |
---|---|---|---|
2 | 12.5 | female | Not |
77 | 75.4 | male | Not |
10 | 32.9 | female | Not |
Density glyphs can be used to compare distributions for different groups. You need merely indicate which variable should be used for grouping. For instance, here are the densities of body weight for each sex:
NCHS %>% ggplot( aes(x=weight) ) +
geom_density(aes(color=sex, fill=sex), alpha=.5) +
xlab("Weight (kg)") + ylab( "Density (1/kg)")
The density shows detail, but is often a fairly simple shape. This allows you to place densities side by side for comparison. For instance, here people are being divided into different age groups.
The density shows detail, but is often a fairly simple shape. This allows you to place densities side by side for comparison. For instance, here people are being divided into different age groups.
#library(mosaic) Need library(mosaic) for ntiles()
NCHS <-
NCHS %>%
mutate( ageGroup=mosaic::ntiles(age, n=6, format="interval") )
NCHS %>%
ggplot( aes(x=weight, group=sex) ) +
geom_density(aes(color=sex, fill=sex), alpha=.25) +
xlab("Weight (kg)") + ylab( "Density (1/kg)") +
facet_wrap( ~ ageGroup )
As always, the best choice of a mapping between variables and graphical attributes depends on what aspect of the data you would like to emphasize. For instance, to see how weight changes with age, you might prefer to use sex
as the faceting variable and age
as the grouping variable.
NCHS %>%
ggplot( aes(x=weight) ) +
geom_density(aes(color=ageGroup, fill=ageGroup), alpha=.25) +
xlab("Weight (kg)") + ylab( "Density (1/kg)") +
facet_wrap( ~ sex ) + theme(text = element_text(size = 20))
Sometimes a more stylized depiction of the distribution is helpful. This makes it feasible to put the distributions side-by-side for comparison. The most common such depiction is the box and whisker glyph.
The lower and upper “hinges” correspond to the first and third quartiles (the 25th and 75th percentiles). The upper whisker extends from the hinge to the highest value that is within 1.5 * IQR of the hinge, where IQR is the inter-quartile range, or distance between the first and third quartiles. The lower whisker extends from the hinge to the lowest value within 1.5 * IQR of the hinge. Data beyond the end of the whiskers are outliers and plotted as point.
Here is data from the mpg
data table giving fuel economy data for 38 popular models of cars
head(ggplot2::mpg,3)
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
p <- mpg %>% ggplot(aes(class, hwy))
p + geom_boxplot()
and with geom_jitter()
p <- mpg %>% ggplot(aes(class, hwy))
p + geom_boxplot() + geom_jitter()
You can use aesethic fill
in geom_boxplot()
to color the inside of the box plot.
p <- mpg %>% ggplot(aes(class, hwy))
p + geom_boxplot(aes(fill=drv))
Here is another example with the NCHS
data
NCHS %>% select(age,weight,sex,diabetic,ageGroup) %>%
head(3)
age | weight | sex | diabetic | ageGroup |
---|---|---|---|---|
2 | 12.5 | female | Not | [1,8] |
77 | 75.4 | male | Not | [62,85] |
10 | 32.9 | female | Not | [8,14] |
NCHS %>%
ggplot( aes(y=weight, x=ageGroup ) ) +
geom_boxplot(
aes(color=ageGroup, fill=ageGroup),
alpha=.25, outlier.size=1, outlier.colour="gray") +
ylab("Weight (kg)") + xlab( "Age Group (yrs)") +
facet_wrap( ~ sex )
Do problem 3:
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-11-collection/
The violin glyph provides a similar ease of comparison to the box-and-whisker glyph, but shows more detail in the density.
NCHS %>%
ggplot( aes(y=weight, x=ageGroup) ) +
geom_violin( aes(color=ageGroup, fill=ageGroup),
alpha=.25) +
ylab("Weight (kg)") + xlab( "Age Group (yrs)") +
facet_wrap( ~ sex )
Both the box-and-whisker and violin glyphs use ink very efficiently. This allows you to show other data. For instance, here’s the same box-and-whisker plot, but with people with diabetics shown in red.
NCHS %>%
ggplot(aes(y=weight, x=ageGroup)) + geom_boxplot(aes(fill=diabetic), position = position_dodge(width = 0.8),outlier.size = 1, outlier.colour = "gray", alpha=.75) + facet_grid(.~sex) + ylab("Weight (kg)") + xlab("Age Group (yrs)")
You can see that diabetics tend to be heavier than non-diabetics, especially for ages 19+.