Pipes

Pipes are a new tool for expressing a sequence of multiple operations.

object %>% function1() %>% function2()

The point of the pipe is to help you write code in a way that is easier to read and understand.

lapply(c("ggplot2", "tidyverse"), library, character.only = TRUE)
## [[1]]
## [1] "ggplot2"   "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## [[2]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "tidyverse" "ggplot2"   "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"
sqrt(16) # action(object)
## [1] 4
# object, then action
16 %>% sqrt()
## [1] 4
sqrt(sqrt(16))
## [1] 2
a <- sqrt(16)
sqrt(a)
## [1] 2
16 %>% sqrt() %>% sqrt()
## [1] 2
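
As a slightly longer illustration of the same idea, the sketch below compares a nested call with its piped equivalent. The vector x and the names nested and piped are made up for this example; magrittr is loaded explicitly only so the snippet runs on its own (it is already available once the tidyverse is attached).

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9, 16)  # illustrative data, not from the original notes

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: reads left to right, one step at a time
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

Both forms compute the same value; the pipe only changes the order in which the steps are written.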

Aesthetic mappings

Task: Let’s start by visualizing the relationship between displ and hwy for various classes of cars. Common practice: We can do this with a scatterplot where the numerical variables are mapped to the x and y aesthetics and the categorical variable is mapped to an aesthetic like color or shape.

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
mpg |> ggplot(aes(x=displ,y=hwy,color=class)) + geom_point() 

mpg |> ggplot(aes(x=displ,y=hwy)) + geom_point(aes(color=class))  
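
The two calls above draw the same scatterplot; they differ only in where color is mapped. A minimal sketch of why that can matter, assuming we add a geom_smooth() layer (this layer and the names p_global and p_local are not part of the original notes): aesthetics set in ggplot() are inherited by every layer, while aesthetics set inside a geom apply to that layer only.

```r
library(ggplot2)

# color mapped in ggplot(): inherited by both layers,
# so geom_smooth() fits one line per class
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# color mapped only in geom_point(): the points are colored,
# but geom_smooth() sees no grouping and fits a single line
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE)
```

Printing p_global and p_local side by side shows the difference in the fitted lines.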

Exercise

cir <- read.csv("~/Charm_City_Circulator_Ridership.csv",check.names = FALSE)
cir.long<- pivot_longer(cir,cols = -c(day,date,daily),names_to="type",values_to="rides")
cir.long <- mutate(cir.long,date=mdy(date))
cir.long <- separate(cir.long,type,into=c("a","b"),sep="[_]")
cir.long <- filter(cir.long,a=="orange") 
ggplot(cir.long,aes(x=date,y=rides)) + geom_point() 

The main downside of this form is that it forces you to name each intermediate element.

cir %>% 
  pivot_longer(cols=-c(day,date,daily),names_to="type",values_to="rides") %>%
  mutate(date=mdy(date)) %>%
  separate(type,into=c("a","b"),sep="[_]") %>%
  filter(a=="orange") %>%
  ggplot(aes(x=date,y=rides)) + geom_point()

This is my favourite form, because it focusses on verbs, not nouns. The pipe version is much more readable because it follows the flow of operations, and you don’t have to constantly read inside-out. By default, the pipe passes the left-hand side object as the first argument to the function on the right-hand side.
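
The first-argument rule can be checked directly. In this small sketch (the vector x and the names piped and direct are illustrative only), x %>% round(2) is the same call as round(x, 2): the left-hand side fills the first argument and the remaining arguments stay where they were written.

```r
library(magrittr)  # provides %>%

x <- c(3.14159, 2.71828)  # illustrative values

piped  <- x %>% round(2)  # left-hand side becomes the first argument
direct <- round(x, 2)     # the equivalent ordinary call

identical(piped, direct)
```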

Using the . Placeholder for More Flexibility

If the function you’re piping into doesn’t take the data as the first argument, you can use the dot . as a placeholder to explicitly state where the piped object should go:

mtcars %>% lm(mpg~cyl,data=.)
## 
## Call:
## lm(formula = mpg ~ cyl, data = .)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876

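You can also use the native pipe |>, built into R since version 4.1, as an alternative to %>%. Its placeholder is _ rather than ., and it must be supplied to a named argument (R 4.2 or later).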
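
For comparison with the lm() example above, here is a minimal sketch of the same model fitted through the native pipe. It assumes R 4.2 or later, where |> accepts _ as a placeholder for a named argument; the names fit_native and fit_direct are illustrative only.

```r
# Native pipe: the placeholder is `_` and must be given to a named argument
fit_native <- mtcars |> lm(mpg ~ cyl, data = _)

# The equivalent ordinary call
fit_direct <- lm(mpg ~ cyl, data = mtcars)

all.equal(coef(fit_native), coef(fit_direct))
```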

Practice

Using the |> operator, write a pipeline that:

- Filters for cars with 6 cylinders.
- Selects the relevant columns (mpg, hp, wt).
- Mutates a new column that converts weight to kilograms.

m <- mtcars |>
     filter(cyl == 6) |>
     select(mpg, hp, wt) |>
     mutate(wt_kg = wt * 453.592)  # wt is recorded in units of 1000 lb; 1000 lb = 453.592 kg
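
The factor 453.592 may look surprising at first. A short sketch of where it comes from, assuming the documented units of mtcars (wt is in thousands of pounds) and the exact definition of the pound (0.45359237 kg); the names lb_to_kg, wt_factor, and mazda_kg are illustrative only.

```r
lb_to_kg  <- 0.45359237       # one pound in kilograms (exact by definition)
wt_factor <- 1000 * lb_to_kg  # mtcars stores wt in units of 1000 lb

# Weight of one car in kilograms, computed with base R only
mazda_kg <- mtcars["Mazda RX4", "wt"] * wt_factor

round(wt_factor, 3)
```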

Case study: new insights on poverty

Time series plots

Time series plots have time on the x-axis and an outcome or measurement of interest on the y-axis. For example, here is a trend plot of United States fertility rates:

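library(dslabs)  # assumed source of gapminder: the dslabs version has the fertility and life_expectancy columns used below

gapminder |> 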
  filter(country == "United States") |> 
  ggplot(aes(year, fertility)) +
  geom_point()

We see that the trend is not linear at all. Instead, there is a sharp drop during the 1960s and 1970s to below 2. Then the trend comes back to 2 and stabilizes during the 1990s.

When the points are regularly and densely spaced, as they are here, we create curves by joining the points with lines, to convey that these data are from a single series, here a country. To do this, we use the geom_line function instead of geom_point.

gapminder |> 
  filter(country == "United States") |> 
  ggplot(aes(year, fertility)) +
  geom_line()

This is particularly helpful when we look at two countries. We can subset the data to include two countries, one from Europe and one from Asia, and then adapt the code above:

countries <- c("South Korea","Germany")

gapminder |> filter(country %in% countries) |> 
  ggplot(aes(year,fertility)) +
  geom_line()

Unfortunately, this is not the plot that we want. Rather than a line for each country, the points for both countries are joined. This is actually expected since we have not told ggplot anything about wanting two separate lines. To let ggplot know that there are two curves that need to be made separately, we assign each point to a group, one for each country:

countries <- c("South Korea","Germany")

gapminder |> filter(country %in% countries & !is.na(fertility)) |> 
  ggplot(aes(year, fertility, group = country,color=country)) +
  geom_line()

For trend plots we recommend labeling the lines rather than using legends since the viewer can quickly see which line is which country. This suggestion actually applies to most plots: labeling is usually preferred over legends.

library(geomtextpath)

gapminder |> 
  filter(country %in% countries) |> 
  ggplot(aes(year, life_expectancy, col = country, label = country)) +
  geom_textpath() +
  theme(legend.position = "none")

The plot clearly shows how an improvement in life expectancy followed the drops in fertility rates. In 1960, Germans lived 15 years longer than South Koreans, although by 2010 the gap is completely closed. It exemplifies the improvement that many non-western countries have achieved in the last 40 years.
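
The gap quoted above can be read off the data rather than the plot. This is a rough sketch, assuming gapminder is the dslabs dataset and that dplyr and tidyr are installed; the object name gaps and the gap column are made up for this example, and no output is shown because it depends on the installed version of the data.

```r
library(dslabs)  # assumed source of the gapminder data
library(dplyr)
library(tidyr)

gaps <- gapminder |>
  filter(country %in% c("South Korea", "Germany"), year %in% c(1960, 2010)) |>
  select(country, year, life_expectancy) |>
  # one column per country so the two can be subtracted
  pivot_wider(names_from = country, values_from = life_expectancy) |>
  mutate(gap = Germany - `South Korea`)

gaps
```

Each row of gaps holds one year, the life expectancy of each country, and the difference between them.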

When general audiences are asked if poor countries have become poorer and rich countries become richer, the majority answers yes. By using stratification, histograms, smooth densities, and boxplots, we will be able to understand if this is in fact the case.