Introduction to Pipes in R & ggplot2

Data Science Tweets to Live By:

“Dear past-Hadley: PLEASE COMMENT YOUR CODE BETTER. Love present-Hadley.” @hadleywickham

R Packages for Today’s class

tidyverse - Tidyverse is a diverse set of packages (most written by Hadley Wickham) that work in harmony. These packages share common data representations and a similar programming interface. Some of the more significant packages included in tidyverse are listed below:

ggplot2: Data visualization - we will discuss common ggplot2 functions later.
dplyr: A set of verbs which can solve many common data manipulation challenges. Some commonly used functions inside of dplyr include:
- mutate(): Creates a new variable based on a function of existing variables
- summarise(): Calculates a single statistic based on a number of observations
- filter(): Selects specific observations based on their values
tidyr: Used for data tidying. Tidy data will have: 1. a single variable in each column, 2. a single observation in each row, and 3. a single value in each cell.
- gather(): Transforms ‘wide’ data to ‘tall’ data - see Figure 1.
```
knitr::include_graphics("gather.png")
```
Figure 1: The results of gather() in R.
- spread(): Transforms ‘tall’ data to ‘wide’ data - see Figure 2.
```
knitr::include_graphics("spread.png")
```
Figure 2: The results of spread() in R.
tibble: A tibble object is a “modern re-imagining” of dataframes in R. Tibble objects or tibbles have an enhanced printing method which only prints the first 10 rows of observations and only the columns which fit in the console window. This makes it easier to deal with large dataframes. In general, tibbles do much less than dataframes in R. Tibbles never change the type of input (never change strings to factors) and tibbles never create row names (no more row.names=F). Variable names in an R dataframe are not allowed to start with a number; this is not true for tibbles.
magrittr: Provides the pipe operators (%>%, %<>%, %T>%, %$%) used throughout the tidyverse.
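
As a minimal sketch of the gather()/spread() round trip described under tidyr, the example below uses a small made-up dataframe. The object and column names (wide, tall, student, exam1, exam2, exam, score) are hypothetical and are not taken from the figures.

```r
library(tidyr)

# Hypothetical 'wide' data: one row per student, one column per exam
wide <- data.frame(student = c("A", "B"),
                   exam1 = c(90, 80),
                   exam2 = c(85, 70))

# gather(): 'wide' -> 'tall'; the exam columns become key/value pairs
tall <- gather(wide, key = "exam", value = "score", exam1, exam2)
tall

# spread(): 'tall' -> 'wide'; recovers one column per exam
wide_again <- spread(tall, key = exam, value = score)
wide_again
```

Both functions still work, but recent tidyr releases mark them as superseded by pivot_longer() and pivot_wider().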

knitr - Knitr allows us to “knit” figures and tables into an HTML document somewhat seamlessly.
DT - DT allows us to view an R dataframe inside an HTML document.
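
To make the tibble behavior described above concrete, here is a small sketch comparing tibble() with data.frame(). The column names `1st` and label are made up for illustration.

```r
library(tibble)

# A tibble keeps a column name that starts with a number as-is
tb <- tibble(`1st` = 1:3, label = c("a", "b", "c"))
names(tb)

# data.frame() repairs the same name by default (check.names = TRUE)
df <- data.frame(`1st` = 1:3, label = c("a", "b", "c"))
names(df)

# Strings stay character vectors in a tibble
class(tb$label)
```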

Pipes in R:

knitr::include_graphics("pipe.png")

Figure 3: Pipes in Windows 95.

The most common pipe operator, %>%, is a very useful tool for running a sequence of functions without having to save a new R object at each intermediate step. The pipe operator allows you to take the output of one function and use it as the input of the next function. When reading code that utilizes the pipe operator, you can think of the word THEN as equivalent to %>%. See the code chunk below.

pi %>%
  sin %>%
  cos
## [1] 1

This code chunk will read pi THEN sin() THEN cos(). Alternatively, you can think of piping as the composition of functions: $(f\circ g)(x)$, or equivalently $f(g(x))$, where here $g$ is sin() and $f$ is cos(). See the code chunk below.

pi %>%
  sin %>%
  cos
## [1] 1

## The above is equivalent to:
cos(sin(pi))
## [1] 1

In these trivial examples, the utility of this piping technology is not clearly visible or understood. The true utility (as I see it) of the piping technology can be observed in a number of dataframe manipulation scenarios.

Example: Tooth Growth

In this example, we will utilize the built-in R dataframe: ToothGrowth. This dataframe contains the results of an experiment examining the effect of vitamin C on tooth growth in guinea pigs. We have $n=60$ observations and $p=3$ variables. Our variables of interest are:

len - tooth length (our dependent variable)
supp - supplement delivery method, either orange juice (OJ) or ascorbic acid (VC)
dose - vitamin C dosage, either 0.5, 1, or 2 mg/day

## This code includes a package for displaying data tables, then shows the data in the html document. 
## You can also view the dataset within R

library(DT)
## Warning: package 'DT' was built under R version 3.4.4

datatable(ToothGrowth,rownames=T,filter="top",options=list(pageLength=5,scrollX=T))

Here, we will show how the pipe operator, which is used to chain code together, can be especially practical when the user is performing several operations on a dataframe and does not want to save a new object at each intermediate step. Let’s say the user first wants to remove all observations corresponding to a dosage of 0.5 mg/day and then find the mean tooth length for 1 mg/day and 2 mg/day. The conventional way of doing this is shown below.

## Loading the necessary library
library(dplyr)

## This line of code filters out observations associated with a dosage of 0.5
filteredData <- filter(ToothGrowth,dose!=0.5)
## Warning: package 'bindrcpp' was built under R version 3.4.4
## This line of code groups the filtered data by the remaining dosages
groupedData <- group_by(filteredData,dose)
## Here, we calculate the mean length by the grouping specified in the group_by function
summarise(groupedData,mean(len,na.rm=TRUE))
## # A tibble: 2 x 2
##    dose `mean(len, na.rm = TRUE)`
##   <dbl>                     <dbl>
## 1     1                      19.7
## 2     2                      26.1

## Equivalently:
summarise(group_by(filter(ToothGrowth,dose!=0.5),dose),mean(len,na.rm=T))
## # A tibble: 2 x 2
##    dose `mean(len, na.rm = T)`
##   <dbl>                  <dbl>
## 1     1                   19.7
## 2     2                   26.1

It is important to note that the group_by() function does not change how these tibble data look. It does, however, change how the tibble acts within other tidyverse functions, i.e. summarise(). In the following code chunk, we perform the same operations only using the piping technology.

ToothGrowth %>%
  filter(dose!=0.5) %>%
  group_by(dose) %>%
  summarise(mean(len,na.rm=T))
## # A tibble: 2 x 2
##    dose `mean(len, na.rm = T)`
##   <dbl>                  <dbl>
## 1     1                   19.7
## 2     2                   26.1
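
The sketch below revisits the same pipeline to illustrate two points: that group_by() only attaches grouping information, and that naming the summary avoids the back-ticked column name seen in the output above. The names grouped, mean_by_dose, and mean_len are hypothetical.

```r
library(dplyr)

# group_by() leaves the rows untouched and only records the grouping variable
grouped <- ToothGrowth %>%
  filter(dose != 0.5) %>%
  group_by(dose)
group_vars(grouped)

# Naming the summary gives a cleaner column name than `mean(len, na.rm = T)`
mean_by_dose <- grouped %>%
  summarise(mean_len = mean(len, na.rm = TRUE))
mean_by_dose
```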

Technical Notes & Potential Issues

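However, the pipe operator does not work as you may expect in all settings. The output in this section comes from the magrittr version used to build these notes (before magrittr 2.0); as far as we know, magrittr 2.0 and later evaluate piped expressions lazily in the calling environment, so these examples may behave differently on a current installation. An example is shown below.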

## Here, we use the assign function to assign the number 10 to the variable Var1
assign('Var1',10)
Var1
## [1] 10

## Now, we use the pipe operator to assign 100 to the variable Var1
'Var1' %>% assign(100)
Var1
## [1] 10

## So, what is going on here?

If you think about the main purpose of the pipe operator, you may be able to reason through it. The pipe operator is used when we do not want to save objects at each intermediate step. Because of this, the pipe operator takes advantage of a temporary environment, so assign() creates Var1 in that temporary environment rather than in our workspace and the new value is lost when the pipeline finishes. Thus, we must be more explicit when using the assign function: we need to specify which environment to use. See the code below.

## Define your environment
Env <- environment()

## Assigning 100 to Var1
"Var1" %>% assign(100,Env)
Var1
## [1] 100
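
A slightly more defensive variant, sketched below, passes the environment by name through assign()'s envir argument. The environment target_env is a hypothetical container created only for this illustration; because the destination is stated explicitly, the result should not depend on where the pipe evaluates its right-hand side.

```r
library(magrittr)

# Hypothetical environment that will hold the new variable
target_env <- new.env()

# The piped string becomes the first argument of assign();
# envir states explicitly where the variable is created
"Var1" %>% assign(100, envir = target_env)

get("Var1", envir = target_env)
```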

Problems will also arise when using functions that rely on lazy evaluation (or call-by-need) in R. Under lazy evaluation, a function argument is not evaluated until its value is actually needed inside the function. In magrittr, pipe operators use non-standard evaluation. A function is first formed using all of the individual right-hand expressions, working right to left. After this new function is formed, it is applied to the expression on the left-hand side. Let’s examine the example below.

## Attempt to take the square root of a character string
sqrt('Go Dawgs')
## Error in sqrt("Go Dawgs"): non-numeric argument to mathematical function

## More specific error message
tryCatch(sqrt('Go Dawgs'), error=function(e) {"You cannot take the square root of a character string"})
## [1] "You cannot take the square root of a character string"

In R, tryCatch() utilizes lazy evaluation. Its first argument is the expression to try, and that expression is only evaluated inside tryCatch(), after the error handler has been set up [if(error), then(do this)]. In general, arguments in R are only computed when the function uses them, so nothing is computed before the function is called. The pipe, however, computes each element of the pipeline in turn, starting with the left-hand side. The error from sqrt() is therefore raised before tryCatch() is ever called, so the handler never sees it. See the code below.

## Using tryCatch() with piping
sqrt('Go Dawgs') %>%
  tryCatch(error=function(e) {"You cannot take the square root of a character string"})
## Error in sqrt("Go Dawgs"): non-numeric argument to mathematical function

Other functions with similar behavior include:

try()
suppressMessages()
suppressWarnings()
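
One way around this, sketched here as a suggestion rather than the only option, is to wrap the whole pipeline inside tryCatch() instead of piping into it. The handler is then in place before any step of the pipeline runs.

```r
library(magrittr)

# The pipeline is the first argument of tryCatch(), so it is evaluated
# lazily inside tryCatch() with the error handler already established
result <- tryCatch(
  'Go Dawgs' %>% sqrt,
  error = function(e) "You cannot take the square root of a character string"
)
result
```

The same pattern applies to try(), suppressMessages(), and suppressWarnings(): place the pipeline inside the call.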

Other Pipe Uses & Alternative Pipe Operator

The dot (.) can be used with the standard pipe operator as an argument placeholder. For example:
- $f(y, z = x)$ can be rewritten as $x$ %>% $f(y, z = .)$
- $x$ %>% $f(y=nrow(.),z=ncol(.))$ is equivalent to $f(x,y=nrow(x),z=ncol(x))$
- $x$ %>% $\{f(y=nrow(.),z=ncol(.))\}$ is equivalent to $f(y=nrow(x),z=ncol(x))$

## Standard placeholder
6 %>%
  round(pi, digits=.)
## [1] 3.141593

# The nested function call with dot placeholder
1:5 %>%
  paste(., letters[.])
## [1] "1 a" "2 b" "3 c" "4 d" "5 e"

# Or equivalently
1:5 %>%
  paste(letters[.])
## [1] "1 a" "2 b" "3 c" "4 d" "5 e"

# Override first argument
1:5 %>%
  {paste(letters[.])}
## [1] "a" "b" "c" "d" "e"
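
A common practical use of the placeholder is a function whose data argument is not first. The sketch below uses lm() with the built-in mtcars dataframe; the model itself (mpg ~ wt) is only an illustration.

```r
library(magrittr)

# lm() expects the formula first, so the dataframe is routed to `data`
# with the dot placeholder instead of being inserted as the first argument
fit <- mtcars %>%
  lm(mpg ~ wt, data = .)

names(coef(fit))
```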

Building Unary Functions: A unary function is a function which takes one argument. The dot (.) placeholder can be used as the single argument used by the function. See example below.

Function1 <- . %>% sin %>% cos
Function1(pi)
## [1] 1
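
A function built this way can be reused like any other function. In the sketch below, sqrt_round is a hypothetical helper name; it takes a square root and rounds to two decimals.

```r
library(magrittr)

# Hypothetical helper: square root, then round to two decimals
sqrt_round <- . %>% sqrt %>% round(2)

# Apply it to a vector
sqrt_round(c(5.1, 4.9))

# Or hand it to sapply() like any other function
sapply(list(a = 16, b = 25), sqrt_round)
```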

Compound Assignment: Very frequently, a situation may arise where you would like to overwrite the left-hand side of an expression with the right-hand side. This can be done using the standard assignment operator, or more concisely using the compound assignment pipe operator: %<>%. In this example, we will utilize another built-in R dataframe: iris.

datatable(iris,rownames=T,filter="top",options=list(pageLength=5,scrollX=T))

Say we would like to compute the square root of Sepal.Length and assign it to the variable. Below we code it using the standard piping operator as well as a shorthand alternative using the compound assignment pipe operator.

library(magrittr)
## Creating a new column variable identical to Sepal.Length
iris$Sepal.Length2 <- iris$Sepal.Length

## Compute the square root of `iris$Sepal.Length2` and assign it to the variable
iris$Sepal.Length2 <- iris$Sepal.Length2 %>% sqrt %>% round(2)
head(iris$Sepal.Length2)
## [1] 2.26 2.21 2.17 2.14 2.24 2.32

## Overwriting Sepal.Length2 to reproduce example
iris$Sepal.Length2 <- iris$Sepal.Length

## A shorthand alternative
iris$Sepal.Length2 %<>% sqrt %>% round(2)
head(iris$Sepal.Length2)
## [1] 2.26 2.21 2.17 2.14 2.24 2.32

Tee Operator: Sometimes it can be useful to call a function for its side effects inside of a pipe. Maybe you would like to print, plot, or save an object at a current step in the middle of a pipeline. Functions like this often return nothing useful (for example, plot() returns NULL), so the next step in the pipeline has nothing to work with. A solution to this problem is to use the tee operator: %T>%.

## This example will plot but ends the pipeline after the plot function
set.seed(318)
rnorm(100) %>%
  matrix(ncol = 2) %>%
  plot(xlab='var1',ylab='var2') %>%
  str()

##  NULL

## This example will plot and continue the pipeline
set.seed(318)
rnorm(100) %>%
  matrix(ncol = 2) %T>%
  plot(xlab='Var1',ylab='Var2') %>%
  str()

##  num [1:50, 1:2] -1.2718 -0.7809 -0.4869 -2.2107 0.0792 ...
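
The same idea can be sketched without a plot. Here str() is called only for its printed output, and the tee operator passes the original vector (not str()'s NULL return value) on to the next step. The name total is hypothetical.

```r
library(magrittr)

# str() prints a description and returns NULL; %T>% discards that return
# value and forwards the original vector to sqrt()
total <- c(1, 4, 9) %T>%
  str() %>%
  sqrt() %>%
  sum()
total
```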

Exposition Operator: In R, many functions (such as the lm() function) commonly take in a data argument. Once a dataset is specified in the function, variable names may be referred to directly (i.e. iris$Sepal.Length is equivalent to simply Sepal.Length). Frequently, it may be helpful to expose the variables in the dataset you are using, even for functions without a dataset argument. This action can be done using the exposition operator: %$%.

## Here, we use cor() & refer to Sepal.Length and Sepal.Width directly
iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)
## [1] 0.3361992

## Another example of the exposition operator
data.frame(z = rnorm(100)) %$%
  ts.plot(z)
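
For readers familiar with base R, the exposition operator behaves much like with(). The short sketch below checks that the two give the same correlation; r_pipe and r_with are hypothetical names.

```r
library(magrittr)

# Exposition pipe: the columns of iris are visible to cor()
r_pipe <- iris %$% cor(Sepal.Length, Sepal.Width)

# Base R equivalent using with()
r_with <- with(iris, cor(Sepal.Length, Sepal.Width))

all.equal(r_pipe, r_with)
```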

Variable Assignment and Pipe Coding Questions

Is there a difference in the output of these two expressions? If so, what?

median(x=1:10)
median(y<-1:10)

median(x=1:10)
## [1] 5.5
median(y<-1:10)
## [1] 5.5

What would happen if we were to run print(x) and print(y) after running the code in 1?

print(x)
## Error in print(x): object 'x' not found
print(y)
##  [1]  1  2  3  4  5  6  7  8  9 10

Is there a difference in the output of these two expressions? If so, what?

1:5 %>%
  paste(letters[.])
1:5 %>%
  paste(., letters[.])

1:5 %>%
  paste(letters[.])
## [1] "1 a" "2 b" "3 c" "4 d" "5 e"
1:5 %>%
  paste(., letters[.])
## [1] "1 a" "2 b" "3 c" "4 d" "5 e"

Piping Summary

%>%: The standard pipe operator uses the output of one function call (left side) as the first argument for the next function call (right side). The dot (.) can be used as a placeholder to pass the left side to an argument other than the first in the right-hand function call.
%<>%: The compound assignment operator also takes the object on its left side and uses it as the first argument in the function call on the right side. This time, however, the object on the left is reassigned to the value returned at the end of the pipeline (be careful: it will overwrite your original object!).
%T>%: The tee operator acts as a fork in your pipeline: it calls the right-hand function for its side effect and passes the original left-hand value on to the next step. This allows the user to use functions like save() or print() in the middle of a pipeline.
%$%: Many functions take a dataframe argument, after which column variables can be referred to directly by name (e.g. lm()). Many other functions do not take a dataframe argument, and therefore column variables cannot be referred to directly. The exposition operator takes a dataframe on the left side of the operator and allows the user to call column variables directly on the right side of the operator.

See pipe operator use from this weekend!

Data Visualization with ggplot2

Introduction

Leland Wilkinson’s Grammar of Graphics

Data: variables mapped to aesthetic features of the graph
Geoms: objects/shapes on the graph
Stats: statistical transformations that summarize the data
Scales: legends and axes used to display variable mapping
Coordinate system: plane upon which points are mapped
Faceting: subsetting the data to create multiple variations of the same plot (paneling)

Constructing a Simple Plot with ggplot

All plots start with the ggplot() function. Inside of the ggplot() function, we specify the dataset of interest. Note: this function will only accept a dataframe, no matrices.
The first layer of any ggplot2 graph is an aesthetics layer. We will use the function aes() within ggplot() to map variables to the main aesthetics layer.
Additional aesthetics specifications may be added to subsequent layers, but these specifications will override the default aesthetics for that layer only.
A list of example aesthetics that may be specified within aes():
- x: x-axis variable
- y: y-axis variable
- color: color of objects, for 2D objects, the color of the object’s outline
- fill: fill color of 2D objects
- alpha: transparency of objects (a value between 0 (transparent) and 1 (opaque))
- linetype: solid, dashed, dotted, etc.
- shape: shape of the markers on a scatterplot
- size: how large objects will appear
ggplot2 plots are built layer-by-layer. More layers may be added with the + character. Layers inherit the aesthetics from the original ggplot() call, but these may be respecified at each layer as well.
Geom functions differ in what is produced for the plot. These geoms represent different layers that may be added to a ggplot2 figure.
- geom_bar(): barplot with base on the x-axis
- geom_boxplot(): standard boxplot with boxes and whiskers
- geom_errorbar(): T-shaped error bars
- geom_histogram(): histogram
- geom_line(): line plot
- geom_point(): scatterplot
- geom_ribbon(): uncertainty bands spanning y-values across a range of x-values
- geom_smooth(): smooth curve (loess)
Geoms differ in which aesthetic arguments they accept.
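
The layering and inheritance rules above can be sketched with a two-layer plot. This is a minimal illustration using iris; the choice of geoms and the object name p are ours.

```r
library(ggplot2)

# x and y are set once in ggplot() and inherited by every layer
p <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  # color is mapped inside this layer, so it applies to the points only
  geom_point(aes(color = Species), alpha = 0.6) +
  # no color mapping here, so a single line is fit across all species
  geom_smooth(method = "lm", se = FALSE)

# Printing the object (print(p), or typing p at the console) draws the figure
```

Moving aes(color = Species) into the ggplot() call instead would make both layers inherit it, giving one fitted line per species.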

Examples

Scatterplot/boxplot/loess curve combination

library(datasets)
## Busy example to display the layering process
ggplot(data=iris,aes(x=Sepal.Length,y=Sepal.Width)) + 
  geom_point() +
  geom_boxplot(aes(fill=Species)) +
  geom_smooth(aes(color=Species),linetype=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


## Where are these boxes centered?
iris %>%
  group_by(Species) %>%
  summarise(mean(Sepal.Length,na.rm=T))
## # A tibble: 3 x 2
##   Species    `mean(Sepal.Length, na.rm = T)`
##   <fct>                                <dbl>
## 1 setosa                                5.01
## 2 versicolor                            5.94
## 3 virginica                             6.59

Setting Axis Limits and Labeling

ggplot(data=iris,aes(x=Sepal.Length,y=Sepal.Width,colour=Species)) + 
  geom_point() +
  xlab('X-axis Name') +
  ylab('Y-axis Name') +
  xlim(3,8)
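
One point worth knowing when the limits are narrower than the data: xlim() sets the scale limits, so observations outside the range are dropped (with a warning), whereas coord_cartesian() only zooms the view. The sketch below contrasts the two; the range 5 to 7 and the object names are chosen purely for illustration.

```r
library(ggplot2)

base_plot <- ggplot(data = iris,
                    aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

# xlim() changes the scale: points with Sepal.Length outside 5-7 are removed
p_limits <- base_plot + xlim(5, 7)

# coord_cartesian() changes only the visible window: all points are kept
p_zoom <- base_plot + coord_cartesian(xlim = c(5, 7))
```

The difference matters most for layers that compute statistics, such as geom_smooth() or geom_boxplot(), since dropped points no longer contribute to the fit or summary.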

Standard Scatterplot with Clusters of Values Circled

Option A

library(ggalt)
## Warning: package 'ggalt' was built under R version 3.4.4
ggplot(iris,aes(x=Petal.Length,y=Petal.Width)) +
  geom_point(aes(color=Species)) +
  geom_smooth(method='loess',se=F,span=0.8,aes(color=Species)) +
  stat_ellipse(aes(x=Petal.Length,y=Petal.Width,color=Species),
               size=1,level=0.98)

Option B

library(ggalt)
ggplot(iris,aes(x=Petal.Length,y=Petal.Width)) +
  geom_point(aes(color=Species)) +
  geom_smooth(method='loess',se=F,span=0.8,aes(color=Species)) +
  geom_encircle(data=filter(iris,Species=='setosa'),
                aes(x=Petal.Length,y=Petal.Width),
                color='#F8766D',size=2)

Creating Matrices of Plots all Using a Shared Legend

It was difficult to find a reproducible example here, so let’s take a look at an example I have used before. In the following code, we create a matrix of monthly kernel density estimates for quality-controlled minimum and maximum temperatures. We also overlay a dotplot where dots indicate points which were removed. This plot is useful as it shows that removing the flagged temperatures did not completely remove the distribution tails.

## Loading dataframe
Total2 = read.csv('PR.40Yr.Scaled.csv')
## Loading necessary libraries
library(ks)
## Warning: package 'ks' was built under R version 3.4.4
library(graphics)
library(ggplot2)
## Visual summaries of the removed temperatures
TempsTall = cbind.data.frame(Temps = c(Total2$TMAX.Clean,Total2$TMIN.Clean),
                  Indicator = c(rep('TMAX',nrow(Total2)),rep('TMIN',nrow(Total2))),
                  Month = c(Total2$Month,Total2$Month),
                  Removed = c(Total2$TMAX.Flag,Total2$TMIN.Flag),
                  OrigTemps = c(Total2$TMAX,Total2$TMIN))
## Set params
Month = 'December'
MonthNum = 12
## Plot
Plot12 = ggplot(TempsTall[TempsTall$Month==MonthNum,]) +
  geom_density(bw=0.5,alpha=0.30,aes(x=c(Temps),fill=Indicator)) +
  geom_dotplot(data=TempsTall[TempsTall$Removed==1 & TempsTall$Month==MonthNum,],
               aes(x=OrigTemps,fill=Indicator),dotsize=0.8,alpha=0.60) +
  scale_fill_manual(values=c("red", "blue3")) +
  scale_color_manual(values=c("red", "blue3")) +
  ylab('') + xlab('Temperature (F)') + ggtitle(Month) +
  theme(plot.title = element_text(hjust=0.5)) +
  ylim(0,0.155) + xlim(-20,115)
## Plotting in tandem
library(lemon)
## Warning: package 'lemon' was built under R version 3.4.4
## 
## Attaching package: 'lemon'
## The following object is masked from 'package:purrr':
## 
##     %||%
#grid_arrange_shared_legend(Plot1,Plot2,Plot3,Plot4,Plot5,Plot6,
#                Plot7,Plot8,Plot9,Plot10,Plot11,Plot12,ncol=3,nrow=4)

knitr::include_graphics("MonthlyKDE.png")

Figure 4: Monthly kernel density estimates for minimum and maximum temperatures in Puerto Rico.

Mapping with ggmap

ggmap has recently become more difficult to use. In 2018, Google changed its API requirements, and users are now required to provide an API key and enable billing (although your card is not charged). Because of this, ggmap itself is outdated and the creators are trying to have a new version available soon. To create an API key, you can simply follow the instructions (under step 5. Install ggmap) at the following web address: https://www.littlemissdata.com/blog/maps.

After obtaining an API key, you must register it in your R script using the function register_google(). At this point, you are ready to begin mapping in ggmap! Various map types can be viewed at the following web address: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf. Below we show two examples of maps created with ggmap.

df <- read_csv("https://raw.githubusercontent.com/fastah/sample-data/master/FastahDatasetMapsTutorial.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   lat = col_double(),
##   lon = col_double(),
##   Operator = col_character(),
##   Class = col_character()
## )
lat <- c(12.9,13.05)
long <- c(77.52,77.7)
bbox <- make_bbox(long,lat,f=0.05)
b <- get_map(bbox,maptype="toner-lite",source="stamen")
## Source : http://tile.stamen.com/terrain/12/2929/1898.png
## Source : http://tile.stamen.com/terrain/12/2930/1898.png
## Source : http://tile.stamen.com/terrain/12/2931/1898.png
## Source : http://tile.stamen.com/terrain/12/2932/1898.png
## Source : http://tile.stamen.com/terrain/12/2929/1899.png
## Source : http://tile.stamen.com/terrain/12/2930/1899.png
## Source : http://tile.stamen.com/terrain/12/2931/1899.png
## Source : http://tile.stamen.com/terrain/12/2932/1899.png
## Source : http://tile.stamen.com/terrain/12/2929/1900.png
## Source : http://tile.stamen.com/terrain/12/2930/1900.png
## Source : http://tile.stamen.com/terrain/12/2931/1900.png
## Source : http://tile.stamen.com/terrain/12/2932/1900.png
ggmap(b) + geom_point(data = df,aes(lon,lat,color=Operator),size=2,alpha=0.7) +
  labs(x = "Longitude", y = "Latitude",
       title="Ping Locations", color = "Operator")
## Warning: Removed 146 rows containing missing values (geom_point).

Animations (GIFs) with gganimate

gganimate extends the original ggplot2 grammar of graphics by allowing for a set of new commands which describe the intended animation. These commands customize how plotted objects should change throughout the animation. Some of these commands are listed below.
- transition_*(): defines how the data should be spread out and how it relates to itself across time
- view_*(): defines how the positional scales should change along the animation
- shadow_*(): defines how the data at other points in time should be displayed at the current point in time in the animation
- enter_*()/exit_*(): defines how new data should appear and old data should disappear throughout the animation
- ease_aes(): defines how different aesthetics should be eased during tweening (the process of generating intermediate frames between two images)

library(ggplot2)
library(gganimate)
ggplot(mtcars, aes(factor(cyl), mpg)) + 
  geom_boxplot() + 
  ## gganimate code
  transition_states(
    gear,
    transition_length = 2,
    state_length = 1
  ) +
  enter_fade() + 
  exit_shrink() +
  ease_aes('linear')

In the above example we have created a GIF which alternates between three states in a loop. At each of the three states, we have up to three boxplots displaying miles per gallon as a function of number of cylinders (there are no 8-cylinder cars with 4 gears in mtcars). Each state indicates a different number of gears in the car (3-5). At each state the images will fade in, shrink out, and the GIF will progress linearly. transition_length indicates the relative length of the transition between states and state_length indicates the relative length of the pause at each state.

##Source: https://github.com/thomasp85/gganimate
library(ggplot2)
library(gganimate)
library(gapminder)
## Warning: package 'gapminder' was built under R version 3.4.4
library(gifski)
## Warning: package 'gifski' was built under R version 3.4.4

ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  facet_wrap(~continent) +
  # Here comes the gganimate specific bits
  labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
  transition_time(year) +
  ease_aes('linear')