---
title: "Course 2 - Data Science: Visualization"
output: 
  html_notebook: 
    highlight: zenburn
    theme: flatly
---

## 4.2 - Using the Gapminder Dataset

___ 
#### **Faceting**

> 🎞 [Faceting](https://edx-video.net/HARB020D2017-V002500_DTH.mp4)
> 
> 📖 [Faceting](https://rafalab.github.io/dsbook/gapminder.html#faceting)

- Faceting makes multiple side-by-side plots stratified by some variable. This is a way to ease comparisons.
- The `facet_grid()` function allows faceting by up to two variables, with rows faceted by one variable and columns faceted by the other. To facet by only one variable, put a dot (`.`) in place of the other variable, as in `facet_grid(. ~ year)`.
- The `facet_wrap()` function facets by one variable and automatically wraps the series of plots so they have readable dimensions.
- By default, faceting keeps the axes fixed across all plots, easing comparisons between plots.
- The data suggest that the developing versus Western worldview no longer makes sense in 2012.
```{r}
library(dplyr)
library(dslabs)
library(ggplot2)
data(gapminder)
# Facet by continent and year
filter(gapminder, year %in% c(1962, 2012)) %>%
    ggplot(aes(fertility, life_expectancy, col=continent)) +
    geom_point() +
    facet_grid(continent~year)
# Facet by year only
filter(gapminder, year %in% c(1962, 1980, 2012)) %>%
    ggplot(aes(fertility, life_expectancy, col=continent)) +
    geom_point() +
    facet_grid(.~year)
# Facet by year, plots wrapped onto multiple rows
years <- c(1962, 1980, 1990, 2000, 2012)
continents <- c("Europe", "Asia")
gapminder %>%
    filter(year %in% years & continent %in% continents) %>%
    ggplot(aes(fertility, life_expectancy, col=continent)) +
    geom_point() +
    facet_wrap(~year)
```
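
The fixed axes described above are a default rather than a requirement. As a minimal sketch, assuming a small made-up data frame (`toy_facets`, not taken from gapminder), the `scales` argument of `facet_wrap()` can free the axes in each panel. This makes each panel easier to read on its own but harder to compare with the others.

```{r}
library(ggplot2)

# Hypothetical toy data: two years with very different ranges
toy_facets <- data.frame(
    year = rep(c(1962, 2012), each = 4),
    fertility = c(6.5, 7.0, 5.8, 6.9, 1.5, 2.1, 1.8, 2.4),
    life_expectancy = c(45, 42, 50, 44, 80, 76, 82, 74)
)

p_base <- ggplot(toy_facets, aes(fertility, life_expectancy)) +
    geom_point()

# Default: every panel shares the same x and y ranges
p_fixed <- p_base + facet_wrap(~year)

# Free scales: each panel gets axis ranges fitted to its own data
p_free <- p_base + facet_wrap(~year, scales = "free")

p_fixed
p_free
```

With fixed scales the shift between the two years is visible at a glance; with free scales it can only be read off the axis labels.
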
___ 
#### **Time Series Plots**

> 🎞 [Time Series Plots](https://edx-video.net/HARB020D2017-V002900_DTH.mp4)
> 
> 📖 [Time series plots](https://rafalab.github.io/dsbook/gapminder.html#time-series-plots)

- Time series plots have time on the x-axis and a variable of interest on the y-axis.
- The `geom_line()` geometry connects adjacent data points to form a continuous line. A line plot is appropriate when points are regularly spaced, densely packed and from a single data series.
- You can plot multiple lines on the same graph. Remember to group or color by a variable so that the lines are plotted independently.
- Labeling is usually preferred over legends. However, legends are easier to make and appear by default. Add a label with `geom_text()`, specifying the coordinates where the label should appear on the graph.

Code: Single time series
```{r}
# Scatterplot of US fertility by year
gapminder %>%
    filter(country=="United States") %>%
    ggplot(aes(year, fertility)) +
    geom_point()
# Line plot of US fertility by year
gapminder %>%
    filter(country=="United States") %>%
    ggplot(aes(year, fertility)) +
    geom_line()
```
Code: Multiple time series
```{r}
# Line plot of fertility for two countries without grouping - both countries are joined into one line (incorrect)
countries <- c("South Korea", "Germany")
gapminder %>% filter(country %in% countries) %>%
    ggplot(aes(year, fertility)) +
    geom_line()
# Line plot fertility time series for two countries - one line per country
gapminder %>% filter(country %in% countries) %>%
    ggplot(aes(year, fertility, group=country)) +
    geom_line()
# Fertility time series for two countries - lines colored by country
gapminder %>% filter(country %in% countries) %>%
    ggplot(aes(year, fertility, col=country)) +
    geom_line()
```
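
Counting rows shows why the ungrouped plot is wrong: with two countries there are two y values for every year, and without `group` or `col`, `geom_line()` treats them as a single series. A small sketch with made-up numbers (the `toy_ts` data frame is hypothetical, not gapminder data):

```{r}
library(dplyr)

# Hypothetical fertility values for two countries over three years
toy_ts <- data.frame(
    country = rep(c("A", "B"), each = 3),
    year = rep(c(1960, 1970, 1980), times = 2),
    fertility = c(6.0, 5.0, 3.0, 2.5, 2.0, 1.5)
)

# Ungrouped view: two rows share each year, so a single line must zigzag between them
rows_per_year <- toy_ts %>% count(year)
rows_per_year

# Grouped view: one row per country and year, so each country can get its own line
rows_per_group <- toy_ts %>% count(country, year)
rows_per_group
```
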
Code: Adding text labels to a plot
```{r}
# Life expectancy time series - lines colored by country and labeled, no legend
labels <- data.frame(country=countries, x=c(1975, 1965), y=c(60, 72))
gapminder %>% filter(country %in% countries) %>%
    ggplot(aes(year, life_expectancy, col=country)) +
    geom_line() +
    geom_text(data=labels, aes(x, y, label=country), size=5) +
    theme(legend.position="none")
```
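
Hard-coded label coordinates have to be found by trial and error. One alternative, sketched here with made-up data (`toy_le` and `label_pos` are hypothetical names), is to derive the label positions from the data itself, for example by placing each label at the country's last observation.

```{r}
library(dplyr)
library(ggplot2)

# Hypothetical life expectancy values for two countries
toy_le <- data.frame(
    country = rep(c("A", "B"), each = 4),
    year = rep(c(1960, 1970, 1980, 1990), times = 2),
    life_expectancy = c(55, 60, 66, 71, 69, 71, 73, 75)
)

# One label per country, positioned at that country's last observed year
label_pos <- toy_le %>%
    group_by(country) %>%
    filter(year == max(year)) %>%
    ungroup() %>%
    select(country, x = year, y = life_expectancy)

p_labeled <- toy_le %>%
    ggplot(aes(year, life_expectancy, col = country)) +
    geom_line() +
    # hjust = 0 left-aligns the text; nudge_x moves it just past the line end
    geom_text(data = label_pos, aes(x, y, label = country), hjust = 0, nudge_x = 0.5) +
    # Extend the x-axis so the labels are not cut off
    expand_limits(x = 1995) +
    theme(legend.position = "none")
p_labeled
```

Labels placed at line ends can still overlap when the series finish close together, so manual coordinates remain useful in that case.
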
___ 
#### **Transformations**

> 🎞 [Transformations](https://edx-video.net/HARB020D2017-V002700_DTH.mp4)
> 
> 📖 [Data transformations](https://rafalab.github.io/dsbook/gapminder.html#data-transformations)
> 
> 📖 [Visualizing multimodal distributions](https://rafalab.github.io/dsbook/gapminder.html#visualizing-multimodal-distributions)

- We use GDP data to compute income in US dollars per day, adjusted for inflation.
- Log transformations convert multiplicative changes into additive changes.
- Common transformations are the log base 2 transformation and the log base 10 transformation. The choice of base depends on the range of the data. The natural log is not recommended for visualization because it is difficult to interpret.
- The mode of a distribution is the value with the highest frequency. The mode of a normal distribution is the average. A distribution can have multiple local modes.
- There are two ways to use log transformations in plots: transform the data before plotting or transform the axes of the plot. Log scales have the advantage of showing the original values as axis labels, while log-transformed values make it easier to interpret intermediate values between labels.
- Scale the x-axis using `scale_x_continuous()` or `scale_x_log10()` layers in *ggplot2*. Similar functions exist for the y-axis.
- In 1970, income distribution is bimodal, consistent with the dichotomous Western versus developing worldview.
```{r}
# Add dollars per day variable
gapminder <- gapminder %>%
    mutate(dollars_per_day=gdp/population/365)

# Histogram of dollars per day
past_year <- 1970
gapminder %>%
    filter(year==past_year & !is.na(gdp)) %>%
    ggplot(aes(dollars_per_day)) +
    geom_histogram(binwidth=1, color="black")
# Repeat histogram with log2 scaled data
gapminder %>%
    filter(year==past_year & !is.na(gdp)) %>%
    ggplot(aes(log2(dollars_per_day))) +
    geom_histogram(binwidth=1, color="black")
# Repeat histogram with log2 scaled x-axis
gapminder %>%
    filter(year==past_year & !is.na(gdp)) %>%
    ggplot(aes(dollars_per_day)) +
    geom_histogram(binwidth=1, color="black") +
    scale_x_continuous(trans="log2")
```
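
To make the "multiplicative into additive" point concrete, here is a small base R sketch using made-up income values (not gapminder data): each doubling adds one unit on the log2 scale, and each tenfold increase adds one unit on the log10 scale.

```{r}
# Hypothetical incomes in dollars per day, each double the previous one
income <- c(1, 2, 4, 8, 16, 32)

# On the original scale the steps are multiplicative: every ratio is 2
ratios <- income[-1] / income[-length(income)]
ratios

# On the log2 scale the same steps are additive: every difference is 1
log2_steps <- diff(log2(income))
log2_steps

# log10 does the same for tenfold changes
log10_steps <- diff(log10(c(1, 10, 100, 1000)))
log10_steps
```

This is why a `binwidth` of 1 on a log2 scale groups together values that differ by up to a factor of 2.
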
___ 
#### **Stratify and Boxplot**

> 🎞 [Stratify and Boxplot](https://edx-video.net/HARB020D2017-V002600_DTH.mp4)
> 
> 📖 [Comparing multiple distributions with boxplots and ridge plots](https://rafalab.github.io/dsbook/gapminder.html#comparing-multiple-distributions-with-boxplots-and-ridge-plots)

> Note that many of the boxplots in the video appear as dot plots in the textbook, which also constructs a different boxplot. Read that section as well to see an example of grouping factors with the `case_when()` function.

- Make boxplots stratified by a categorical variable using the `geom_boxplot()` geometry.
- Rotate axis labels by changing the theme through `element_text()`. You can change the angle and justification of the text labels.
- Consider ordering your factors by a meaningful value with the `reorder()` function, which changes the order of factor levels based on a related numeric vector. This is a way to ease comparisons.
- Show the data by adding data points to the boxplot with a `geom_point()` layer. This adds information beyond the five-number summary to your plot, but too many data points can obscure your message.

Code: Boxplot of income (dollars per day) by region
```{r}
# Add dollars per day variable (same definition as in the Transformations section)
gapminder <- gapminder %>%
    mutate(dollars_per_day=gdp/population/365)
# Number of regions
length(levels(gapminder$region))    # 22
# Boxplot of income (dollars per day) by region in 1970
past_year <- 1970
p <- gapminder %>%
    filter(year == past_year & !is.na(gdp)) %>%
    ggplot(aes(region, dollars_per_day))
p + geom_boxplot()
# Rotate names on x-axis
p + geom_boxplot() +
    theme(axis.text.x=element_text(angle=90, hjust=1))
```
Code: The reorder function
```{r}
# By default, factor order is alphabetical
fac <- factor(c("Asia", "Asia", "West", "West", "West"))
levels(fac)    # "Asia" "West"
# Reorder factor levels by the category means (Asia: 10.5, West: about 7.33)
value <- c(10, 11, 12, 6, 4)
fac <- reorder(fac, value, FUN=mean)
levels(fac)    # "West" "Asia"
```
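
As a rough sketch of what `reorder()` does internally, the per-level summaries can be computed with `tapply()`. The example reuses the toy vectors above under new, hypothetical names (`toy_fac`, `toy_value`) and also shows one way to get descending order: negating the values, which works for symmetric summaries such as the mean or median.

```{r}
# Same toy vectors as in the chunk above, under new names
toy_fac <- factor(c("Asia", "Asia", "West", "West", "West"))
toy_value <- c(10, 11, 12, 6, 4)

# The per-level summaries that reorder() sorts by
level_means <- tapply(toy_value, toy_fac, mean)
level_medians <- tapply(toy_value, toy_fac, median)
level_means
level_medians

# Ascending by mean (the default direction)
levels_ascending <- levels(reorder(toy_fac, toy_value, FUN = mean))
levels_ascending

# Descending by mean: negate the values before summarizing
levels_descending <- levels(reorder(toy_fac, -toy_value, FUN = mean))
levels_descending
```
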
Code: Enhanced boxplot ordered by median income, scaled, and showing data
```{r}
# Reorder by median income and color by continent
p <- gapminder %>%
    filter(year == past_year & !is.na(gdp)) %>%
    mutate(region=reorder(region, dollars_per_day, FUN=median)) %>%    # Reorder
    ggplot(aes(region, dollars_per_day, fill=continent)) +    # Color by continent
    geom_boxplot() +
    theme(axis.text.x=element_text(angle=90, hjust=1)) +
    xlab("")
p
# log2 scale y-axis
p + scale_y_continuous(trans="log2")
# Add data points
p + scale_y_continuous(trans="log2") + geom_point(show.legend=FALSE)
```
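
The five-number summary that a boxplot draws can be computed directly, which helps when deciding whether adding the points is worthwhile. The sketch below uses made-up values (`toy_income` is hypothetical) and base R's `quantile()`; the 1.5 × IQR rule for flagging outliers matches the `geom_boxplot()` default.

```{r}
# Hypothetical incomes for one region, with one extreme value
toy_income <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 40)

# Minimum, lower quartile, median, upper quartile, maximum
five_num <- quantile(toy_income, probs = c(0, 0.25, 0.5, 0.75, 1))
five_num

# Points more than 1.5 * IQR beyond the quartiles are drawn individually as outliers
iqr <- five_num[["75%"]] - five_num[["25%"]]
lower_fence <- five_num[["25%"]] - 1.5 * iqr
upper_fence <- five_num[["75%"]] + 1.5 * iqr
outliers <- toy_income[toy_income < lower_fence | toy_income > upper_fence]
outliers
```

Everything between the quartiles is hidden inside the box, which is the information a `geom_point()` layer brings back.
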

