“object %>% function1() %>% function2()”
### The point of the pipe is to help you write code in a way that is easier to read and understand.
lapply(c("ggplot2","tidyverse"),library,character.only=TRUE)
## [[1]]
## [1] "ggplot2" "stats" "graphics" "grDevices" "utils" "datasets"
## [7] "methods" "base"
##
## [[2]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "tidyverse" "ggplot2" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
sqrt(16) #action object
## [1] 4
#object action
16 %>% sqrt()
## [1] 4
sqrt(sqrt(16))
## [1] 2
a <- sqrt(16)
sqrt(a)
## [1] 2
16 %>% sqrt() %>% sqrt()
## [1] 2
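The same comparison can be checked on a longer chain. This is a minimal sketch using only base R, assuming R 4.1 or later so that the native pipe |> is available (for a simple chain like this it behaves like %>%). The vector x is made up for illustration and is not part of the original notes.

```r
# Hypothetical input vector, chosen only for illustration
x <- c(16, 81, 256)

# Nested form: has to be read inside-out
nested <- sum(sqrt(sqrt(x)))

# Piped form: reads left to right, one action at a time
piped <- x |> sqrt() |> sqrt() |> sum()

# Both forms perform the same operations in the same order
identical(nested, piped)
```

Both versions compute the same value; the pipe only changes the order in which the steps are written.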
Task: Let’s start by visualizing the relationship between displ and hwy for various classes of cars. Common practice: We can do this with a scatterplot where the numerical variables are mapped to the x and y aesthetics and the categorical variable is mapped to an aesthetic like color or shape.
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
mpg |> ggplot(aes(x=displ,y=hwy,color=class)) + geom_point()
mpg |> ggplot(aes(x=displ,y=hwy)) + geom_point(aes(color=class))
mpg |> ggplot(aes(x=displ,y=hwy,color=class)) +
geom_point() +
geom_smooth(method=lm)
filter(mpg,class=="suv") |> ggplot(aes(x=displ,y=hwy,color=class)) +
geom_point() +
geom_smooth(method=loess,se=FALSE)
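Where color is mapped matters once a second layer is added. The sketch below assumes ggplot2 is installed. Aesthetics set in ggplot() are inherited by every layer, so geom_smooth() fits one line per class. Aesthetics set inside geom_point() apply to the points only, so the smoother fits a single line. The names p_global and p_local are illustrative.

```r
library(ggplot2)

# color mapped globally: inherited by the point layer and the smooth layer
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color mapped only inside the point layer
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Count the groups seen by the smoothing layer (layer 2) in each plot
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local  <- length(unique(layer_data(p_local, 2)$group))
c(global = n_global, local = n_local)
```

With the global mapping the smoother sees one group per car class; with the local mapping it sees a single group.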
cir <- read.csv("~/Charm_City_Circulator_Ridership.csv",check.names = FALSE)
cir.long<- pivot_longer(cir,cols = -c(day,date,daily),names_to="type",values_to="rides")
cir.long <- mutate(cir.long,date=mdy(date))
cir.long <- separate(cir.long,type,into=c("a","b"),sep="[_]")
cir.long <- filter(cir.long,a=="orange")
ggplot(cir.long,aes(x=date,y=rides)) + geom_point()
cir %>%
pivot_longer(cols=-c(day,date,daily),names_to="type",values_to="rides") %>%
mutate(date=mdy(date)) %>%
separate(type,into=c("a","b"),sep="[_]") %>%
filter(a=="orange") %>%
ggplot(aes(x=date,y=rides)) + geom_point()
### This is my favourite form, because it focusses on verbs, not nouns.
The pipe version is much more readable because it follows the flow
of operations, and you don’t have to constantly read inside-out. By
default, the pipe passes the left-hand side object as the first argument
to the function on the right-hand side.
If the function you’re piping into doesn’t take the data as the first argument, you can use the dot . as a placeholder to explicitly state where the piped object should go:
mtcars %>% lm(mpg~cyl,data=.)
##
## Call:
## lm(formula = mpg ~ cyl, data = .)
##
## Coefficients:
## (Intercept) cyl
## 37.885 -2.876
You can also use the native pipe |>, built into R since version 4.1, as an alternative to %>%.
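For comparison, here is a sketch of the same model fitted with the native pipe. It assumes R 4.2 or later, where |> supports the _ placeholder. Unlike the magrittr dot, _ must be supplied to a named argument. The name fit is illustrative.

```r
# Native pipe: the placeholder is `_` and must be passed by name (here, data =)
fit <- mtcars |> lm(mpg ~ cyl, data = _)

# Coefficients, rounded for comparison with the %>% version above
round(coef(fit), 3)
```

The coefficients should match those printed for the %>% version, since the same model is fitted to the same data.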
Using the |> operator, write a pipeline that:
- Filters for cars with 6 cylinders.
- Selects the relevant columns (mpg, hp, wt).
- Mutates a new column that converts weight to kilograms.
m <- mtcars |>
filter(cyl == 6) |>
select(mpg, hp, wt) |>
mutate(wt_kg = wt * 453.592)
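The factor 453.592 follows from the units: mtcars records wt in thousands of pounds, and one pound is about 0.453592 kg. A quick base-R check of that arithmetic, using one of the 6-cylinder cars as an example (the variable names are illustrative):

```r
lb_to_kg  <- 0.453592         # kilograms per pound
factor_kg <- 1000 * lb_to_kg  # wt is stored in units of 1000 lb

# The Mazda RX4 has wt = 2.62, i.e. 2620 lb
wt_rx4    <- mtcars["Mazda RX4", "wt"]
wt_rx4_kg <- wt_rx4 * factor_kg
wt_rx4_kg  # roughly 1188 kg
```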
Specifically, in this section, we use data to attempt to answer the following two questions:
To answer these questions, we will be using the gapminder dataset provided in dslabs.
library(dslabs)
data(gapminder)
gapminder |> as_tibble()
## # A tibble: 10,545 × 9
## country year infant_mortality life_expectancy fertility population gdp
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albania 1960 115. 62.9 6.19 1636054 NA
## 2 Algeria 1960 148. 47.5 7.65 11124892 1.38e10
## 3 Angola 1960 208 36.0 7.32 5270844 NA
## 4 Antigua… 1960 NA 63.0 4.43 54681 NA
## 5 Argenti… 1960 59.9 65.4 3.11 20619075 1.08e11
## 6 Armenia 1960 NA 66.9 4.55 1867396 NA
## 7 Aruba 1960 NA 65.7 4.82 54208 NA
## 8 Austral… 1960 20.3 70.9 3.45 10292328 9.67e10
## 9 Austria 1960 37.3 68.8 2.7 7065525 5.24e10
## 10 Azerbai… 1960 NA 61.3 5.57 3897889 NA
## # ℹ 10,535 more rows
## # ℹ 2 more variables: continent <fct>, region <fct>
gapminder |>
filter(year == 2015 & country %in% c("Sri Lanka","Turkey")) |>
select(country, infant_mortality)
## country infant_mortality
## 1 Sri Lanka 8.4
## 2 Turkey 11.6
Turkey has the higher infant mortality rate.
We can use this code on all comparisons and find the following:
gapminder |>
filter(year == 2015 & country %in% c("South Korea","Poland")) |>
select(country, infant_mortality)
## country infant_mortality
## 1 South Korea 2.9
## 2 Poland 4.5
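Since the same three lines are repeated for every pair of countries, they could be wrapped in a small helper. The function compare_infant_mortality below is hypothetical; it is not part of dslabs or dplyr and is only a sketch of one way to avoid the repetition.

```r
library(dplyr)
library(dslabs)

# Hypothetical helper: infant mortality in a given year for a pair of countries,
# sorted so that the country with the higher rate comes first
compare_infant_mortality <- function(pair, yr = 2015) {
  gapminder |>
    filter(year == yr, country %in% pair) |>
    select(country, infant_mortality) |>
    arrange(desc(infant_mortality))
}

res_1 <- compare_infant_mortality(c("Sri Lanka", "Turkey"))
res_2 <- compare_infant_mortality(c("South Korea", "Poland"))
res_1
res_2
```

Sorting in descending order puts the answer to each comparison in the first row.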
The reason for this stems from the preconceived notion that the world is divided into two groups: the western world (Western Europe and North America), characterized by long life spans and small families, versus the developing world (Africa, Asia, and Latin America) characterized by short life spans and large families. But do the data support this dichotomous view?
The necessary data to answer this question is also available in our gapminder table. Using our newly learned data visualization skills, we will be able to tackle this challenge.
filter(gapminder, year == 1962) |>
ggplot(aes(fertility, life_expectancy)) +
geom_point()
Most points fall into two distinct categories:
filter(gapminder, year == 1962) |>
ggplot( aes(fertility, life_expectancy, color = continent)) +
geom_point()
In 1962, “the West versus developing world” view was grounded in some
reality. Is this still the case 50 years later?
filter(gapminder, year%in%c(1962, 2012)) |>
ggplot(aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_grid(year~continent)
We see a plot for each continent/year pair. However, this is more than
what we want, which is simply to compare 1962 and 2012. In this case,
there is just one faceting variable, so we put . in the row position of
the formula to tell facet_grid that the rows should not be split:
filter(gapminder, year%in%c(1962, 2012)) |>
ggplot(aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_grid(. ~ year)
This plot clearly shows that the majority of countries have moved from
the developing world cluster to the western world one.
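The . ~ year formula can also be written with the cols argument, which some readers find easier to follow. This is a sketch that assumes ggplot2 3.0.0 or later, where facet_grid() accepts rows and cols specified with vars(). The object names p and panel_layout are illustrative.

```r
library(dplyr)
library(ggplot2)
library(dslabs)

p <- gapminder |>
  filter(year %in% c(1962, 2012)) |>
  ggplot(aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_grid(cols = vars(year))  # same layout as facet_grid(. ~ year)

# Inspect the panel layout: one panel per year, all in a single row
panel_layout <- ggplot_build(p)$layout$layout
panel_layout
```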
years <- c(1962, 1980, 1990, 2000, 2012)
continents <- c("Europe", "Asia")
gapminder |>
filter(year %in% years & continent %in% continents) |>
ggplot( aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_wrap(~year)
This plot clearly shows how most Asian countries have improved at a much
faster rate than European ones.
The default choice of the range of the axes is important. When not using facet, this range is determined by the data shown in the plot. When using facet, this range is determined by the data shown in all plots and therefore kept fixed across plots. This makes comparisons across plots much easier. To see why, the code below sets scales = "free", which lets each panel choose its own range:
filter(gapminder, year%in%c(1962, 2012)) |>
ggplot(aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_wrap(. ~ year, scales = "free")
In the plot above, we have to pay special attention to the range to
notice that the plot on the right has a larger life expectancy.
Time series plots have time on the x-axis and an outcome or measurement of interest on the y-axis. For example, here is a trend plot of United States fertility rates:
gapminder |>
filter(country == "United States") |>
ggplot(aes(year, fertility)) +
geom_point()
We see that the trend is not linear at all. Instead there is a sharp drop
during the 1960s and 1970s to below 2. Then the trend comes back to 2
and stabilizes during the 1990s.
When the points are regularly and densely spaced, as they are here, we create curves by joining the points with lines, to convey that these data are from a single series, here a country. To do this, we use the geom_line function instead of geom_point.
gapminder |>
filter(country == "United States") |>
ggplot(aes(year, fertility)) +
geom_line()
This is particularly helpful when we look at two countries. If we subset
the data to include two countries, one from Europe and one from Asia,
then adapt the code above:
countries <- c("South Korea","Germany")
gapminder |> filter(country %in% countries) |>
ggplot(aes(year,fertility)) +
geom_line()
Unfortunately, this is not the plot that we want. Rather than a line for
each country, the points for both countries are joined. This is actually
expected since we have not told ggplot anything about wanting two
separate lines. To let ggplot know that there are two curves that need
to be made separately, we assign each point to a group, one for each
country:
countries <- c("South Korea","Germany")
gapminder |> filter(country %in% countries & !is.na(fertility)) |>
ggplot(aes(year, fertility, group = country,color=country)) +
geom_line()
For trend plots we recommend labeling the lines rather than using
legends since the viewer can quickly see which line is which country.
This suggestion actually applies to most plots: labeling is usually
preferred over legends.
library(geomtextpath)
gapminder |>
filter(country %in% countries) |>
ggplot(aes(year, life_expectancy, col = country, label = country)) +
geom_textpath() +
theme(legend.position = "none")
The plot clearly shows how an improvement in life expectancy followed
the drops in fertility rates. In 1960, Germans lived 15 years longer
than South Koreans, although by 2010 the gap is completely closed. It
exemplifies the improvement that many non-western countries have
achieved in the last 40 years.
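The gap read off the plot can also be computed directly. This is a sketch, assuming dplyr, tidyr and dslabs are installed; the name gap_by_year and the choice of 1960 and 2010 as comparison years are illustrative, taken from the years mentioned above.

```r
library(dplyr)
library(tidyr)
library(dslabs)

gap_by_year <- gapminder |>
  filter(country %in% c("South Korea", "Germany"), year %in% c(1960, 2010)) |>
  select(country, year, life_expectancy) |>
  # One column per country so the two can be subtracted
  pivot_wider(names_from = country, values_from = life_expectancy) |>
  # Positive values mean Germans lived longer in that year
  mutate(gap = Germany - `South Korea`)

gap_by_year
```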