STAT3000 092324

Pipes

Pipes are a new tool for expressing a sequence of multiple operations.

“object %>% function1() %>% function2()”

The point of the pipe is to help you write code in a way that is easier to read and understand.

lapply(c("ggplot2","tidyverse"),library,character.only=TRUE)

## [[1]]
## [1] "ggplot2"   "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## [[2]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "tidyverse" "ggplot2"   "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"

sqrt(16) #action object

## [1] 4

#object action
16 %>% sqrt()

## [1] 4

sqrt(sqrt(16))

## [1] 2

a <- sqrt(16)
sqrt(a)

## [1] 2

16 %>% sqrt() %>% sqrt()

## [1] 2

Aesthetic mappings

Task: Let’s start by visualizing the relationship between displ and hwy for various classes of cars. Common practice: We can do this with a scatterplot where the numerical variables are mapped to the x and y aesthetics and the categorical variable is mapped to an aesthetic like color or shape.

str(mpg)

## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

mpg |> ggplot(aes(x=displ,y=hwy,color=class)) + geom_point()
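The text above mentions shape as an alternative to color. A minimal sketch of that variant, assuming ggplot2 is loaded: it maps drv rather than class, because the default shape palette in ggplot2 handles at most six levels and class has seven. The name p_shape is only for illustration.

```r
library(ggplot2)

# drv has three levels (4, f, r), so it fits within the default shape palette
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point()
p_shape
```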

mpg |> ggplot(aes(x=displ,y=hwy)) + geom_point(aes(color=class))

Add a separate trend line for each class

mpg |> ggplot(aes(x=displ,y=hwy,color=class)) + 
  geom_point() +
  geom_smooth(method=lm)
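Because color=class is set inside ggplot(), every layer inherits it, so geom_smooth() fits one line per class. As a hedged sketch of the alternative, mapping color only inside geom_point() should leave the smoother ungrouped and give a single overall trend line. The name p_single is only for illustration.

```r
library(ggplot2)

# color is mapped only in geom_point(): points are colored by class,
# but geom_smooth() sees no grouping and fits one line to all cars
p_single <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE)
p_single
```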

Add a trend line for the suv class only.

filter(mpg,class=="suv") |> ggplot(aes(x=displ,y=hwy,color=class)) + 
  geom_point() +
  geom_smooth(method=loess,se=FALSE)

Exercise

Recreate the R code necessary to generate the following graphs for mpg data. Note that wherever a categorical variable is used in the plot, it’s drv.

cir <- read.csv("~/Charm_City_Circulator_Ridership.csv",check.names = FALSE)
cir.long<- pivot_longer(cir,cols = -c(day,date,daily),names_to="type",values_to="rides")
cir.long <- mutate(cir.long,date=mdy(date))
cir.long <- separate(cir.long,type,into=c("a","b"),sep="[_]")
cir.long <- filter(cir.long,a=="orange") 
ggplot(cir.long,aes(x=date,y=rides)) + geom_point()

The main downside of this form is that it forces you to name each intermediate element.

cir %>% 
  pivot_longer(cols=-c(day,date,daily),names_to="type",values_to="rides") %>%
  mutate(date=mdy(date)) %>%
  separate(type,into=c("a","b"),sep="[_]") %>%
  filter(a=="orange") %>%
  ggplot(aes(x=date,y=rides)) + geom_point()

This is my favourite form, because it focuses on verbs, not nouns. The pipe version is much more readable because it follows the flow of operations, and you don’t have to constantly read inside-out. By default, the pipe passes the left-hand side object as the first argument to the function on the right-hand side.
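A small sketch of the first-argument rule, using a made-up vector x and base R's round(): the piped call and the direct call should give identical results.

```r
library(magrittr)

x <- c(1.234, 5.678)

# x %>% round(digits = 1) is evaluated as round(x, digits = 1)
piped  <- x %>% round(digits = 1)
direct <- round(x, digits = 1)
identical(piped, direct)
```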

Using the . Placeholder for More Flexibility

If the function you’re piping into doesn’t take the data as the first argument, you can use the dot . as a placeholder to explicitly state where the piped object should go:

mtcars %>% lm(mpg~cyl,data=.)

## 
## Call:
## lm(formula = mpg ~ cyl, data = .)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876

You can also use the native pipe |>, which is built into R (version 4.1 and later), as an alternative.
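With the native pipe the placeholder is _ rather than ., and it must be passed to a named argument. A sketch of the same lm() call, assuming R 4.2 or later (the version that introduced the _ placeholder); the name fit is only for illustration.

```r
# Requires R >= 4.2 for the `_` placeholder; no packages needed
fit <- mtcars |> lm(mpg ~ cyl, data = _)
coef(fit)
```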

Practice

Using the |> operator, write a pipeline that:

- Filters for cars with 6 cylinders.
- Selects the relevant columns (mpg, hp, wt).
- Mutates a new column that converts weight to kilograms.

m <- mtcars |>
     filter(cyl == 6) |>
     select(mpg, hp, wt) |>
     mutate(wt_kg = wt * 453.592)  # wt is in units of 1000 lb; 1000 lb = 453.592 kg

Case study: new insights on poverty

Hans Rosling was the co-founder of the Gapminder Foundation, an organization dedicated to educating the public by using data to dispel common myths about the so-called developing world. The organization uses data to show how actual trends in health and economics contradict the narratives that emanate from sensationalist media coverage of catastrophes, tragedies, and other unfortunate events.

Specifically, in this section, we use data to attempt to answer the following two questions:

Is it a fair characterization of today’s world to say it is divided into western rich nations and the developing world in Africa, Asia, and Latin America?
Has income inequality across countries worsened during the last 40 years?

To answer these questions, we will be using the gapminder dataset provided in dslabs.

library(dslabs)
data(gapminder)
gapminder |> as_tibble()

## # A tibble: 10,545 × 9
##    country   year infant_mortality life_expectancy fertility population      gdp
##    <fct>    <int>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl>
##  1 Albania   1960            115.             62.9      6.19    1636054 NA      
##  2 Algeria   1960            148.             47.5      7.65   11124892  1.38e10
##  3 Angola    1960            208              36.0      7.32    5270844 NA      
##  4 Antigua…  1960             NA              63.0      4.43      54681 NA      
##  5 Argenti…  1960             59.9            65.4      3.11   20619075  1.08e11
##  6 Armenia   1960             NA              66.9      4.55    1867396 NA      
##  7 Aruba     1960             NA              65.7      4.82      54208 NA      
##  8 Austral…  1960             20.3            70.9      3.45   10292328  9.67e10
##  9 Austria   1960             37.3            68.8      2.7     7065525  5.24e10
## 10 Azerbai…  1960             NA              61.3      5.57    3897889 NA      
## # ℹ 10,535 more rows
## # ℹ 2 more variables: continent <fct>, region <fct>

gapminder |> 
  filter(year == 2015 & country %in% c("Sri Lanka","Turkey")) |> 
  select(country, infant_mortality)

##     country infant_mortality
## 1 Sri Lanka              8.4
## 2    Turkey             11.6

Turkey has the higher infant mortality rate.

We can reuse this code for other comparisons, for example South Korea and Poland:

gapminder |> 
  filter(year == 2015 & country %in% c("South Korea","Poland")) |> 
  select(country, infant_mortality)

##       country infant_mortality
## 1 South Korea              2.9
## 2      Poland              4.5
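Since the same filter and select steps are repeated for each pair, they could be wrapped in a function. The sketch below assumes dplyr and dslabs are installed; compare_infant_mortality() is a hypothetical helper written for these notes, not part of either package.

```r
library(dplyr)
library(dslabs)

# compare_infant_mortality() is a hypothetical helper, not a dslabs function.
# It returns the infant mortality rate of the given countries in one year.
compare_infant_mortality <- function(countries, yr = 2015) {
  dslabs::gapminder |>
    filter(year == yr, country %in% countries) |>
    select(country, infant_mortality)
}

compare_infant_mortality(c("Sri Lanka", "Turkey"))
compare_infant_mortality(c("South Korea", "Poland"))
```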

Poland, a European country, has a higher infant mortality rate than South Korea. If results like these seem surprising, the reason stems from the preconceived notion that the world is divided into two groups: the western world (Western Europe and North America), characterized by long life spans and small families, versus the developing world (Africa, Asia, and Latin America) characterized by short life spans and large families. But do the data support this dichotomous view?

The necessary data to answer this question is also available in our gapminder table. Using our newly learned data visualization skills, we will be able to tackle this challenge.

filter(gapminder, year == 1962) |>
  ggplot(aes(fertility, life_expectancy)) +
  geom_point()

Most points fall into two distinct categories:

Life expectancy around 70 years and 3 or fewer children per family.
Life expectancy lower than 65 years and more than 5 children per family.

To confirm that indeed these countries are from the regions we expect, we can use color to represent continent.

filter(gapminder, year == 1962) |>
  ggplot( aes(fertility, life_expectancy, color = continent)) +
  geom_point()

In 1962, “the West versus developing world” view was grounded in some reality. Is this still the case 50 years later?

filter(gapminder, year%in%c(1962, 2012)) |>
  ggplot(aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_grid(year~continent)

We see a plot for each continent/year pair. However, this is more than what we want, which is simply to compare 1962 and 2012. In this case there is just one faceting variable, so we put . on the row side of the formula to tell facet_grid that no variable is used for the rows:

filter(gapminder, year%in%c(1962, 2012)) |>
  ggplot(aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_grid(. ~ year)

This plot clearly shows that the majority of countries have moved from the developing world cluster to the western world one.

To explore how this transformation happened through the years, we can make the plot for several years.

years <- c(1962, 1980, 1990, 2000, 2012)
continents <- c("Europe", "Asia")
gapminder |> 
  filter(year %in% years & continent %in% continents) |>
  ggplot( aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_wrap(~year)

This plot clearly shows how most Asian countries have improved at a much faster rate than European ones.

The default choice of the range of the axes is important. When not using facet, this range is determined by the data shown in the plot. When using facet, this range is determined by the data shown in all plots and therefore kept fixed across plots. This makes comparisons across plots much easier. For comparison, setting scales = "free" lets each panel choose its own range:

filter(gapminder, year%in%c(1962, 2012)) |>
  ggplot(aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_wrap(. ~ year, scales = "free")

In the plot above, we have to pay special attention to the range to notice that the plot on the right has a larger life expectancy.

Time series plots

Time series plots have time on the x-axis and an outcome or measurement of interest on the y-axis. For example, here is a trend plot of United States fertility rates:

gapminder |> 
  filter(country == "United States") |> 
  ggplot(aes(year, fertility)) +
  geom_point()

We see that the trend is not linear at all. Instead there is a sharp drop during the 1960s and 1970s to below 2. Then the trend comes back to 2 and stabilizes during the 1990s.
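The low point read off the plot can be checked directly in the data. This is a rough sketch, assuming dplyr and dslabs are installed; us_min is a name chosen here for illustration.

```r
library(dplyr)
library(dslabs)

# Year (or years, if tied) in which US fertility was lowest
us_min <- dslabs::gapminder |>
  filter(country == "United States", !is.na(fertility)) |>
  slice_min(fertility, n = 1) |>
  select(year, fertility)
us_min
```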

When the points are regularly and densely spaced, as they are here, we create curves by joining the points with lines, to convey that these data are from a single series, here a country. To do this, we use the geom_line function instead of geom_point.

gapminder |> 
  filter(country == "United States") |> 
  ggplot(aes(year, fertility)) +
  geom_line()

This is particularly helpful when we look at two countries. We can subset the data to include two countries, one from Europe and one from Asia, and then adapt the code above:

countries <- c("South Korea","Germany")

gapminder |> filter(country %in% countries) |> 
  ggplot(aes(year,fertility)) +
  geom_line()

Unfortunately, this is not the plot that we want. Rather than a line for each country, the points for both countries are joined. This is actually expected since we have not told ggplot anything about wanting two separate lines. To let ggplot know that there are two curves that need to be made separately, we assign each point to a group, one for each country:

countries <- c("South Korea","Germany")

gapminder |> filter(country %in% countries & !is.na(fertility)) |> 
  ggplot(aes(year, fertility, group = country,color=country)) +
  geom_line()

For trend plots we recommend labeling the lines rather than using legends since the viewer can quickly see which line is which country. This suggestion actually applies to most plots: labeling is usually preferred over legends.

library(geomtextpath)

gapminder |> 
  filter(country %in% countries) |> 
  ggplot(aes(year, life_expectancy, col = country, label = country)) +
  geom_textpath() +
  theme(legend.position = "none")

The plot clearly shows how an improvement in life expectancy followed the drops in fertility rates. In 1960, Germans lived 15 years longer than South Koreans, although by 2010 the gap had completely closed. It exemplifies the improvement that many non-western countries have achieved in the last 40 years.
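The size of the gap can also be computed rather than read off the plot. The following is a sketch, assuming dplyr, tidyr, and dslabs are installed; gap_tbl and the gap column are names introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(dslabs)

# One row per year, one column per country, plus the difference in years
gap_tbl <- dslabs::gapminder |>
  filter(country %in% c("Germany", "South Korea"), year %in% c(1960, 2010)) |>
  select(country, year, life_expectancy) |>
  pivot_wider(names_from = country, values_from = life_expectancy) |>
  mutate(gap = Germany - `South Korea`)
gap_tbl
```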

When general audiences are asked if poor countries have become poorer and rich countries become richer, the majority answers yes. By using stratification, histograms, smooth densities, and boxplots, we will be able to understand if this is in fact the case.