“Dear past-Hadley: PLEASE COMMENT YOUR CODE BETTER. Love present-Hadley.” @hadleywickham
mutate(): Creates a new variable based on a function of existing variables
summarise(): Calculates a single statistic based on a number of observations
filter(): Selects specific observations based on their values
gather(): Transforms ‘wide’ data to ‘tall’ data - see Figure 1.
knitr::include_graphics("gather.png")
Figure 1: The results of gather() in R.
spread(): Transforms ‘tall’ data to ‘wide’ data - see Figure 2.
knitr::include_graphics("spread.png")
Figure 2: The results of spread() in R.
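As a rough illustration of the two reshaping functions, the sketch below gathers a small made-up dataframe into tall form and then spreads it back to wide form. The dataframe and the column names visit and score are invented for this example; they do not come from the figures above.

```r
library(tidyr)

## Made-up wide data: one row per subject, one column per visit
wide <- data.frame(id = c(1, 2),
                   visit1 = c(5.1, 4.9),
                   visit2 = c(5.4, 5.0))

## gather(): stack the visit columns into key/value pairs (wide -> tall)
tall <- gather(wide, key = "visit", value = "score", visit1, visit2)

## spread(): undo the gather, giving one column per visit again (tall -> wide)
wide_again <- spread(tall, key = visit, value = score)

dim(tall)          # 4 rows, 3 columns
names(wide_again)  # same column names as the original wide data
```

Each row of the tall version holds one measurement, which is the shape most tidyverse functions expect.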
row.names=F). Variable names in an R dataframe are not allowed to start with a number; this is not true for tibbles.
%>%, %<>%, %T>%, %$%) used throughout the tidyverse.
knitr - Knitr allows us to “knit” figures and tables into an HTML document somewhat seamlessly.
DT - DT allows us to view an R dataframe inside an HTML document.
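The earlier note about variable names that start with a number can be checked directly. This is only a small sketch, and the column name 1st is made up for the illustration.

```r
library(tibble)

## data.frame() repairs names that are not syntactically valid
df <- data.frame(`1st` = 1:3)
names(df)   # the name is changed to "X1st"

## tibble() keeps the name exactly as given
tb <- tibble(`1st` = 1:3)
names(tb)   # the name stays "1st"
```

With the tibble, the column must then be referred to with backticks, for example tb$`1st`.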
knitr::include_graphics("pipe.png")
Figure 3: Pipes in Windows 95.
The most common pipe operator, %>%, is a very useful tool for running a sequence of functions without having to save a new R object at each intermediate step. It takes the output of one function and uses it as the input of the next function. When reading code that uses the pipe operator, you can read %>% as the word THEN.
pi %>%
sin %>%
cos
## [1] 1
This code chunk reads: pi THEN sin() THEN cos(). Alternatively, you can think of piping as the composition of functions: \((f\circ g)(x)\), or equivalently \(f(g(x))\), where here \(f\) is cos() and \(g\) is sin().
pi %>%
sin %>%
cos
## [1] 1
## The above is equivalent to:
cos(sin(pi))
## [1] 1
In these trivial examples, the benefit of piping is not obvious. Its true utility (as I see it) shows up in a number of dataframe manipulation scenarios.
In this example, we will use the built-in R dataframe ToothGrowth. This dataframe contains the results of an experiment examining the effect of vitamin C on tooth growth in guinea pigs. We have \(n=60\) observations and \(p=3\) variables. Our variables of interest are:
len - tooth length (our dependent variable)
supp - supplement delivery method, either orange juice (OJ) or ascorbic acid (VC)
dose - vitamin C dosage, either 0.5, 1, or 2 mg/day
## This code includes a package for displaying data tables, then shows the data in the html document.
## You can also view the dataset within R
library(DT)
## Warning: package 'DT' was built under R version 3.4.4
datatable(ToothGrowth,rownames=T,filter="top",options=list(pageLength=5,scrollX=T))
Here, we will show how the pipe operator, which is used to chain code together, can be especially practical when the user is performing several operations on a dataframe and does not want to save a new object at each intermediate step. Let’s say the user first wants to remove all observations corresponding to a dosage of 0.5 mg/day and then find the mean tooth length for 1 mg/day and 2 mg/day. The conventional way of doing this is shown first.
## Loading the necessary library
library(dplyr)
## This line of code filters out observations associated with a dosage of 0.5
filteredData <- filter(ToothGrowth,dose!=0.5)
## Warning: package 'bindrcpp' was built under R version 3.4.4
## This line of code groups the filtered data by the remaining dosages
groupedData <- group_by(filteredData,dose)
## Here, we calculate the mean length by the grouping specified in the group_by function
summarise(groupedData,mean(len,na.rm=TRUE))
## # A tibble: 2 x 2
## dose `mean(len, na.rm = TRUE)`
## <dbl> <dbl>
## 1 1 19.7
## 2 2 26.1
## Equivalently:
summarise(group_by(filter(ToothGrowth,dose!=0.5),dose),mean(len,na.rm=T))
## # A tibble: 2 x 2
## dose `mean(len, na.rm = T)`
## <dbl> <dbl>
## 1 1 19.7
## 2 2 26.1
It is important to note that the group_by() function does not change how the data in the tibble look. It does, however, change how the tibble behaves inside other tidyverse functions, e.g. summarise(). In the following code chunk, we perform the same operations using only the pipe operator.
ToothGrowth %>%
filter(dose!=0.5) %>%
group_by(dose) %>%
summarise(mean(len,na.rm=T))
## # A tibble: 2 x 2
## dose `mean(len, na.rm = T)`
## <dbl> <dbl>
## 1 1 19.7
## 2 2 26.1
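To make the earlier point about group_by() concrete, the sketch below compares a grouped and an ungrouped version of ToothGrowth. The column name mean_len is chosen here for readability and does not appear in the output above.

```r
library(dplyr)

## Grouping adds metadata but leaves the rows and columns unchanged
grouped <- group_by(ToothGrowth, dose)
dim(grouped)          # same dimensions as ToothGrowth
group_vars(grouped)   # the grouping variable(s)

## The same summarise() call behaves differently with and without groups
overall  <- summarise(ToothGrowth, mean_len = mean(len, na.rm = TRUE))
per_dose <- summarise(grouped, mean_len = mean(len, na.rm = TRUE))
nrow(overall)    # one row for the whole dataset
nrow(per_dose)   # one row per dose level
```

Naming the summary column, as done here, also avoids the awkward backticked column name shown in the printed tibbles.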
However, the pipe operator does not work as you may expect in all settings, as the following example shows.
## Here, we use the assign function to assign the number 10 to the variable Var1
assign('Var1',10)
Var1
## [1] 10
## Now, we use the pipe operator to try to assign 100 to the variable Var1
'Var1' %>% assign(100)
Var1
## [1] 10
## So, what is going on here?
If you think about the main purpose of the pipe operator, you may be able to reason through it. The pipe operator is used when we do not want to save objects at each intermediate step. Because of this, the pipe operator evaluates the right-hand call in a temporary environment, so assign() creates Var1 there rather than in our workspace. Thus, we must be more explicit when using the assign function: we need to specify which environment to use.
## Define your environment
Env <- environment()
## Assigning 100 to Var1
"Var1" %>% assign(100,Env)
Var1
## [1] 100
Problems will also arise when using functions that rely on lazy evaluation (or call-by-need) in R. Under lazy evaluation, an argument to a function is not evaluated until its value is actually needed inside the function. In magrittr, pipe operators use non-standard evaluation. A function is first formed using all of the individual right-hand expressions, working right to left. After this new function is formed, it is applied to the value of the expression on the left-hand side. Let’s examine the example below.
## Attempt to take the square root of a character string
sqrt('Go Dawgs')
## Error in sqrt("Go Dawgs"): non-numeric argument to mathematical function
## More specific error message
tryCatch(sqrt('Go Dawgs'), error=function(e) {"You cannot take the square root of a character string"})
## [1] "You cannot take the square root of a character string"
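Lazy evaluation itself can be seen with two tiny functions in base R. This is only a sketch; ignore_arg() and use_arg() are hypothetical helpers written for this illustration.

```r
## This function never uses its argument, so the argument is never evaluated
ignore_arg <- function(x) {
  "x was never evaluated"
}
ignore_arg(stop("this error is never raised"))

## force() evaluates the argument, so the same kind of call now fails
use_arg <- function(x) {
  force(x)
  "x was evaluated"
}
outcome <- tryCatch(use_arg(stop("now the error is raised")),
                    error = function(e) conditionMessage(e))
outcome
```

The first call returns its string without complaint, while the second call only avoids stopping the script because the error is caught and its message returned.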
In R, tryCatch() relies on lazy evaluation. Arguments within functions are only computed when the function uses them, so tryCatch() can set up its error handler first and only then evaluate the expression it was given; if that evaluation raises an error, the handler is called [if(error), then(do this)]. The pipe, in the magrittr version used here, instead evaluates the left-hand expression before passing its value to the function on the right. This means the error is raised before tryCatch() is ever called, so the handler never sees it.
## Using tryCatch() with piping
sqrt('Go Dawgs') %>%
tryCatch(error=function(e) {"You cannot take the square root of a character string"})
## Error in sqrt("Go Dawgs"): non-numeric argument to mathematical function
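One possible way around this, sketched here rather than taken from the original material, is to put the whole pipeline inside tryCatch() instead of piping into it. The handler is then in place before any part of the pipeline runs.

```r
library(magrittr)

## The whole pipeline is the first argument of tryCatch(), so the
## handler is already set up when sqrt() fails
result <- tryCatch(
  "Go Dawgs" %>% sqrt(),
  error = function(e) "You cannot take the square root of a character string"
)
result
```

The same pattern should carry over to the other functions that depend on evaluating their first argument lazily.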
Other functions with similar behavior include:
try()
suppressMessages()
suppressWarnings()
x %>% \(f(y, z = .)\)
x %>% \(f(y=nrow(.),z=ncol(.))\) is equivalent to \(f(x,y=nrow(x),z=ncol(x))\)
x %>% \(\{f(y=nrow(.),z=ncol(.))\}\) is equivalent to \(f(y=nrow(x),z=ncol(x))\)
## Standard placeholder
6 %>%
round(pi, digits=.)
## [1] 3.141593
# The nested function call with dot placeholder
1:5 %>%
paste(., letters[.])
## [1] "1 a" "2 b" "3 c" "4 d" "5 e"
# Or equivalently
1:5 %>%
paste(letters[.])
## [1] "1 a" "2 b" "3 c" "4 d" "5 e"
# Override first argument
1:5 %>%
{paste(letters[.])}
## [1] "a" "b" "c" "d" "e"
Function1 <- . %>% sin %>% cos
Function1(pi)
## [1] 1
datatable(iris,rownames=T,filter="top",options=list(pageLength=5,scrollX=T))
Say we would like to compute the square root of Sepal.Length and assign the result back to the same variable. Below we code it using the standard pipe operator as well as a shorthand alternative using the compound assignment pipe operator.
library(magrittr)
## Creating a new column variable identical to Sepal.Length
iris$Sepal.Length2 <- iris$Sepal.Length
## Compute the square root of `iris$Sepal.Length2` and assign it to the variable
iris$Sepal.Length2 <- iris$Sepal.Length2 %>% sqrt %>% round(2)
head(iris$Sepal.Length2)
## [1] 2.26 2.21 2.17 2.14 2.24 2.32
## Overwriting Sepal.Length2 to reproduce example
iris$Sepal.Length2 <- iris$Sepal.Length
## A shorthand alternative
iris$Sepal.Length2 %<>% sqrt %>% round(2)
head(iris$Sepal.Length2)
## [1] 2.26 2.21 2.17 2.14 2.24 2.32
%T>%.
## This example will plot but ends the pipeline after the plot function
set.seed(318)
rnorm(100) %>%
matrix(ncol = 2) %>%
plot(xlab='var1',ylab='var2') %>%
str()
## NULL
## This example will plot and continue the pipeline
set.seed(318)
rnorm(100) %>%
matrix(ncol = 2) %T>%
plot(xlab='Var1',ylab='Var2') %>%
str()
## num [1:50, 1:2] -1.2718 -0.7809 -0.4869 -2.2107 0.0792 ...
lm() function) commonly take in a data argument. Once a dataset is specified in the function, variable names may be referred to directly (i.e. iris$Sepal.Length is equivalent to simply Sepal.Length). Frequently, it may be helpful to expose the variables in the dataset you are using, even for functions without a dataset argument. This action can be done using the exposition operator: %$%.
## Here, we use cor() & refer to Sepal.Length and Sepal.Width directly
iris %>%
subset(Sepal.Length > mean(Sepal.Length)) %$%
cor(Sepal.Length, Sepal.Width)
## [1] 0.3361992
## Another example of the exposition operator
data.frame(z = rnorm(100)) %$%
ts.plot(z)
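For readers who know base R, the exposition operator behaves much like with(). The short sketch below compares the two on the full iris data; the object names piped and base_r are chosen only for this illustration.

```r
library(magrittr)

## Exposition pipe: column names are visible on the right-hand side
piped <- iris %$% cor(Sepal.Length, Sepal.Width)

## Base R counterpart using with()
base_r <- with(iris, cor(Sepal.Length, Sepal.Width))

all.equal(piped, base_r)
```

The advantage of %$% over with() is mainly that it can sit at the end of a longer pipeline, as in the subset() example above.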
median(x=1:10)
median(y<-1:10)
median(x=1:10)
## [1] 5.5
median(y<-1:10)
## [1] 5.5
print(x)
## Error in print(x): object 'x' not found
print(y)
## [1] 1 2 3 4 5 6 7 8 9 10
1:5 %>%
paste(letters[.])
1:5 %>%
paste(., letters[.])
1:5 %>%
paste(letters[.])
## [1] "1 a" "2 b" "3 c" "4 d" "5 e"
1:5 %>%
paste(., letters[.])
## [1] "1 a" "2 b" "3 c" "4 d" "5 e"
%>%: The standard pipe operator uses the output of one function call (left side) as the first argument for the next function call (right side). The dot (.) can also be used as a placeholder to send the left side to a different argument of the right-hand function call.
%<>%: The compound assignment operator also takes the object on its left side and uses it as the first argument in the function call on the right side. This time, however, the object on the left is reassigned to the value resulting from the final function call (be careful: it will overwrite your original object!).
%T>%: The tee operator acts as a fork in your pipeline: the right-hand function is called for its side effect, and the original left-hand value is passed on. This allows you to use functions like save() or print() in the middle of your pipeline.
%$%: Functions frequently take a dataframe argument. If a dataframe is supplied as an argument, then subsequent arguments can refer to its column variables by name (as in lm()). Many functions do not take a dataframe argument, and therefore column variables cannot be referred to directly. The exposition operator takes a dataframe on the left side of the operator and allows the user to refer to column variables directly on the right side of the operator.
See pipe operator use from this weekend!
ggplot() function. Inside of the ggplot() function, we specify the dataset of interest. Note: this function will only accept a dataframe, no matrices.
aes() within ggplot() to map variables to the main aesthetics layer.
aes():
+ character. Layers inherit the aesthetics from the original ggplot call, but may be respecified at each layer as well.
geom_bar(): barplot with base on the x-axis
geom_boxplot(): standard boxplot with boxes and whiskers
geom_errorbar(): T-shaped error bars
geom_histogram(): histogram
geom_line(): line plot
geom_point(): scatterplot
geom_ribbon(): uncertainty bands spanning y-values across a range of x-values
geom_smooth(): smooth curve (loess)
library(datasets)
## Busy example to display the layering process
ggplot(data=iris,aes(x=Sepal.Length,y=Sepal.Width)) +
geom_point() +
geom_boxplot(aes(fill=Species)) +
geom_smooth(aes(color=Species),linetype=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Where are these boxes centered?
iris %>%
group_by(Species) %>%
summarise(mean(Sepal.Length,na.rm=T))
## # A tibble: 3 x 2
## Species `mean(Sepal.Length, na.rm = T)`
## <fct> <dbl>
## 1 setosa 5.01
## 2 versicolor 5.94
## 3 virginica 6.59
ggplot(data=iris,aes(x=Sepal.Length,y=Sepal.Width,colour=Species)) +
geom_point() +
xlab('X-axis Name') +
ylab('Y-axis Name') +
xlim(3,8)
library(ggalt)
## Warning: package 'ggalt' was built under R version 3.4.4
ggplot(iris,aes(x=Petal.Length,y=Petal.Width)) +
geom_point(aes(color=Species)) +
geom_smooth(method='loess',se=F,span=0.8,aes(color=Species)) +
stat_ellipse(aes(x=Petal.Length,y=Petal.Width,color=Species),
size=1,level=0.98)
library(ggalt)
ggplot(iris,aes(x=Petal.Length,y=Petal.Width)) +
geom_point(aes(color=Species)) +
geom_smooth(method='loess',se=F,span=0.8,aes(color=Species)) +
geom_encircle(data=filter(iris,Species=='setosa'),
aes(x=Petal.Length,y=Petal.Width),
color='#F8766D',size=2)
ggmap has recently become more difficult to use. In 2018, Google changed its API requirements, and users are now required to provide an API key and enable billing (although your card is not charged). Because of this, ggmap itself is outdated and the creators are trying to have a new version available soon. To create an API key, you can follow the instructions (under step 5, Install ggmap) at the following web address: https://www.littlemissdata.com/blog/maps.
After obtaining an API key, you must register it in your R script using the function register_google(). At this point, you are ready to begin mapping in ggmap! Various map types can be seen at the following web address: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf. Below we show two examples of maps created with ggmap.
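Before the maps themselves, the registration step described above might look like the following sketch. It assumes the key is stored in an environment variable; the name GOOGLE_MAPS_API_KEY is made up for this example, and you would substitute however you store your own key.

```r
library(ggmap)

## Read the key from an environment variable (hypothetical name)
## so that it is not written into the script itself
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")

## Register the key only if one was found
if (nzchar(api_key)) {
  register_google(key = api_key)
}

## TRUE once a key has been registered for this session
has_google_key()
```

Keeping the key out of the script matters because the key is tied to a billing account.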
df <- read_csv("https://raw.githubusercontent.com/fastah/sample-data/master/FastahDatasetMapsTutorial.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## lat = col_double(),
## lon = col_double(),
## Operator = col_character(),
## Class = col_character()
## )
lat <- c(12.9,13.05)
long <- c(77.52,77.7)
bbox <- make_bbox(long,lat,f=0.05)
b <- get_map(bbox,maptype="toner-lite",source="stamen")
## Source : http://tile.stamen.com/terrain/12/2929/1898.png
## Source : http://tile.stamen.com/terrain/12/2930/1898.png
## Source : http://tile.stamen.com/terrain/12/2931/1898.png
## Source : http://tile.stamen.com/terrain/12/2932/1898.png
## Source : http://tile.stamen.com/terrain/12/2929/1899.png
## Source : http://tile.stamen.com/terrain/12/2930/1899.png
## Source : http://tile.stamen.com/terrain/12/2931/1899.png
## Source : http://tile.stamen.com/terrain/12/2932/1899.png
## Source : http://tile.stamen.com/terrain/12/2929/1900.png
## Source : http://tile.stamen.com/terrain/12/2930/1900.png
## Source : http://tile.stamen.com/terrain/12/2931/1900.png
## Source : http://tile.stamen.com/terrain/12/2932/1900.png
ggmap(b) + geom_point(data = df,aes(lon,lat,color=Operator),size=2,alpha=0.7) +
labs(x = "Longitude", y = "Latitude",
title="Ping Locations", color = "Operator")
## Warning: Removed 146 rows containing missing values (geom_point).
gganimate extends the original ggplot2 grammar of graphics by allowing for a set of new commands which describe the intended animation. These commands customize how plotted objects should change throughout the animation. Some of these commands are listed below.
transition_*(): defines how the data should be spread out and how it relates to itself across time
view_*(): defines how the positional scales should change along the animation
shadow_*(): defines how the data at other points in time should be displayed at the current point in time in the animation
enter_*()/exit_*(): defines how new data should appear and old data should disappear throughout the animation
ease_aes(): defines how different aesthetics should be eased during tweening (the process of generating intermediate frames between two images)
library(ggplot2)
library(gganimate)
ggplot(mtcars, aes(factor(cyl), mpg)) +
geom_boxplot() +
## gganimate code
transition_states(
gear,
transition_length = 2,
state_length = 1
) +
enter_fade() +
exit_shrink() +
ease_aes('linear')
In the above example we have created a GIF which alternates between three states in a loop. At each of the three states, we have three boxplots displaying miles per gallon as a function of number of cylinders. Each state corresponds to a different number of gears in the car (3-5). At each state the boxplots fade in on entry and shrink on exit, and the animation progresses linearly between states. transition_length sets the relative length of the transition between states and state_length sets the relative length of the pause at each state.
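To keep the GIF rather than only view it, the animation can be rendered and written to disk. The sketch below assumes the gifski package is installed; the frame count, frame rate and file name are arbitrary values picked for this illustration.

```r
library(ggplot2)
library(gganimate)

## Same plot as above, stored in an object instead of printed
anim_plot <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  transition_states(gear, transition_length = 2, state_length = 1) +
  enter_fade() +
  exit_shrink() +
  ease_aes('linear')

## Render a short animation; nframes and fps are arbitrary choices
rendered <- animate(anim_plot, nframes = 30, fps = 10,
                    renderer = gifski_renderer())

## Write the GIF to a temporary folder; the file name is a placeholder
out_file <- file.path(tempdir(), "boxplot_states.gif")
anim_save(out_file, animation = rendered)
```

Fewer frames render faster but give a choppier animation, so these two values are worth adjusting for your own plots.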
##Source: https://github.com/thomasp85/gganimate
library(ggplot2)
library(gganimate)
library(gapminder)
## Warning: package 'gapminder' was built under R version 3.4.4
library(gifski)
## Warning: package 'gifski' was built under R version 3.4.4
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
geom_point(alpha = 0.7, show.legend = FALSE) +
scale_colour_manual(values = country_colors) +
scale_size(range = c(2, 12)) +
scale_x_log10() +
facet_wrap(~continent) +
# Here comes the gganimate specific bits
labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
transition_time(year) +
ease_aes('linear')