RMarkdown Cheat Sheet

Slope Charts

When examining the relationship between two variables, it is often useful to use scatterplots. One exception might be when you are examining the value of the same variable, but at different points in time.

There is no geometry for slope charts in ggplot2, but we can construct one using geom_lines. An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines.

Leveraging the gapminder dataset, we create a utility vector we’ll call ‘west’:

west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")

Now a utility vector ‘dat’ filtering from gapminder those rows that:
* include either 2010 or 2015, AND…
* include regions in our vector west, AND…
* include only rows where life_expectancy is NOT NA, AND…
* the population is greater than 10^7, or 10 million

dat <- gapminder %>%
    filter(year %in% c(2010, 2015) & region %in% west & !is.na(life_expectancy) & population > 10^7)

With this data frame now in hand, we can start to plot the life expectancy data we have collected from 2010 to 2015 and quickly ascertain that it has risen across all considered regions:

dat %>%
    mutate(location = ifelse(year == 2010, 1, 2),
           location = ifelse(year == 2015 & country %in% c("United Kingdom", "Portugal"),
                             location + 0.22, location),
           hjust = ifelse(year == 2010, 1, 0)) %>%
    mutate(year = as.factor(year)) %>%
    ggplot(aes(year, life_expectancy, group = country)) +
    geom_line(aes(color = country), show.legend = FALSE) +
    geom_text(aes(x = location, label = country, hjust = hjust), show.legend = FALSE) +
    xlab("") +
    ylab("Life Expectancy") 

By applying a slope chart, we take advantage of common-axis, which make comparisons across a few data points easy. However, what happens when you want to evaluated more than just a few?

The Bland-Altman Plot

Since what we’re interested in is in differences, it makes sense to dedicate one of our axes to differences. The Bland-Altman Plot, also known as the Tukey Mean Different plot, and also the MA plot, shows the difference versus the average.

library(ggrepel)
dat %>%
    mutate(year = paste0("life_expectancy_", year)) %>%
    select(country, year, life_expectancy) %>% spread(year, life_expectancy) %>%
    mutate(average = (life_expectancy_2015 + life_expectancy_2010)/2,
                difference = life_expectancy_2015 - life_expectancy_2010) %>%
    ggplot(aes(average, difference, label = country)) +
    geom_point() +
    geom_text_repel() +
    geom_abline(lty = 2) +
    xlab("Average of 2010 and 2015") +
    ylab("Difference between 2015 and 2010")

Those countries that have improved the most are easy to identify as it is represented in the Y-axis. We also get an idea of the overall value from the x-axis.

The Bland-Altman plot (Tukey mean difference plot, MA plot) graphs the difference between conditions on the y-axis and the mean between conditions on the x-axis. It is more appropriate for large numbers of observations than slope charts.

Encode a Third Variable

  • Encode a categorical third variable on a scatterplot using color hue or shape. Use the shape argument to control shape.
  • Encode a continuous third variable on a using color intensity or size.

Case Study: Vaccines

Case studies on vaccines can be fueled by datasets made possible through The Tyco Project. The Tyco Project serves to help institutes and researchers to make valuable global health data available to others, contributing to open and reproducible science through the offering of open data. This data includes weekly reported counts data for 7 diseases from 1928 to 2011 from all 50 states.

The dslabs package also makes available the yearly totals:

library(tidyverse)
library(dslabs)
data(us_contagious_diseases)
str(us_contagious_diseases)
## 'data.frame':    16065 obs. of  6 variables:
##  $ disease        : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ state          : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year           : num  1966 1967 1968 1969 1970 ...
##  $ weeks_reporting: num  50 49 52 49 51 51 45 45 45 46 ...
##  $ count          : num  321 291 314 380 413 378 342 467 244 286 ...
##  $ population     : num  3345787 3364130 3386068 3412450 3444165 ...
as_tibble(us_contagious_diseases)
## # A tibble: 16,065 x 6
##    disease     state    year weeks_reporting count population
##    <fct>       <fct>   <dbl>           <dbl> <dbl>      <dbl>
##  1 Hepatitis A Alabama  1966              50   321    3345787
##  2 Hepatitis A Alabama  1967              49   291    3364130
##  3 Hepatitis A Alabama  1968              52   314    3386068
##  4 Hepatitis A Alabama  1969              49   380    3412450
##  5 Hepatitis A Alabama  1970              51   413    3444165
##  6 Hepatitis A Alabama  1971              51   378    3481798
##  7 Hepatitis A Alabama  1972              45   342    3524543
##  8 Hepatitis A Alabama  1973              45   467    3571209
##  9 Hepatitis A Alabama  1974              45   244    3620548
## 10 Hepatitis A Alabama  1975              46   286    3671246
## # ... with 16,055 more rows

This exercise will focus only on measles. So we create a utility string to sort and the vector we’ll temporarily store the data frame in we’ll call dat.
* Pipe in us_contagious_diseases in filter
* Filter out states in the data that include Hawaii and Alaska (no data until 1950) AND
* Filter only those rows where the disease is measles THEN
* Pipe the filtered results to mutate, we we add a new column called rate, which is the rate per year of measles THEN
* Reorder the table to list state by order of rate

the_disease <- "Measles"
dat <- us_contagious_diseases %>%
    filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>%
    mutate(rate = count / population * 10000 * 52/weeks_reporting) %>%
    mutate(state = reorder(state, rate))

We can now easily plot the rate of measles per year.
* We take the dat data set and filter out where state is ‘California’ and NOT NA (recall this as the data component step) THEN
* Pipe this to ggplot and map it to x and y (recall this as the aesthetic mapping component), THEN
* Select a geom_line() as the geometry (recall this as the geometry component) - 1963 is the year that measles vaccine was discovered AND
* Add some visual tweaks to assist visually with interpreting the data (recall this as the style component):

dat %>% filter(state == "California" & !is.na(rate)) %>%
    ggplot(aes(year, rate)) +
    geom_line() +
    ylab("Cases per 10,000") +
    geom_vline(xintercept=1963, col = "blue")

But this is just California, and just measles. How can we show the data for all states, in one plot?

dat %>% ggplot(aes(year, state, fill=rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Reds"), trans = "sqrt") +
    geom_vline(xintercept = 1963, col = "blue") +
    theme_minimal() + theme(panel.grid = element_blank()) +
    ggtitle(the_disease) +
    ylab("") +
    xlab("")

# compute US average measles rate by year
avg <- us_contagious_diseases %>%
    filter(disease == the_disease) %>% group_by(year) %>%
    summarize(us_rate = sum(count, na.rm = TRUE)/sum(population, na.rm = TRUE)*10000)
## `summarise()` ungrouping output (override with `.groups` argument)
# make line plot of measles rate by year by state
dat %>%
    filter(!is.na(rate)) %>%
    ggplot() +
    geom_line(aes(year, rate, group = state), color = "grey50", 
        show.legend = FALSE, alpha = 0.2, size = 1) +
    geom_line(mapping = aes(year, us_rate), data = avg, size = 1, col = "black") +
    scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
    ggtitle("Cases per 10,000 by state") +
    xlab("") +
    ylab("") +
    geom_text(data = data.frame(x = 1955, y = 50),
        mapping = aes(x, y, label = "US average"), color = "black") +
    geom_vline(xintercept = 1963, col = "blue")