When examining the relationship between two variables, it is often useful to use scatterplots. One exception might be when you are examining the value of the same variable, but at different points in time.
There is no geometry for slope charts in ggplot2, but we can construct one using geom_lines. An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines.
Leveraging the gapminder dataset, we create a utility vector we’ll call ‘west’:
west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")
Now a utility vector ‘dat’ filtering from gapminder those rows that:
* include either 2010 or 2015, AND…
* include regions in our vector west, AND…
* include only rows where life_expectancy is NOT NA, AND…
* the population is greater than 10^7, or 10 million
dat <- gapminder %>%
filter(year %in% c(2010, 2015) & region %in% west & !is.na(life_expectancy) & population > 10^7)
With this data frame now in hand, we can start to plot the life expectancy data we have collected from 2010 to 2015 and quickly ascertain that it has risen across all considered regions:
dat %>%
mutate(location = ifelse(year == 2010, 1, 2),
location = ifelse(year == 2015 & country %in% c("United Kingdom", "Portugal"),
location + 0.22, location),
hjust = ifelse(year == 2010, 1, 0)) %>%
mutate(year = as.factor(year)) %>%
ggplot(aes(year, life_expectancy, group = country)) +
geom_line(aes(color = country), show.legend = FALSE) +
geom_text(aes(x = location, label = country, hjust = hjust), show.legend = FALSE) +
xlab("") +
ylab("Life Expectancy")
By applying a slope chart, we take advantage of common-axis, which make comparisons across a few data points easy. However, what happens when you want to evaluated more than just a few?
Since what we’re interested in is in differences, it makes sense to dedicate one of our axes to differences. The Bland-Altman Plot, also known as the Tukey Mean Different plot, and also the MA plot, shows the difference versus the average.
library(ggrepel)
dat %>%
mutate(year = paste0("life_expectancy_", year)) %>%
select(country, year, life_expectancy) %>% spread(year, life_expectancy) %>%
mutate(average = (life_expectancy_2015 + life_expectancy_2010)/2,
difference = life_expectancy_2015 - life_expectancy_2010) %>%
ggplot(aes(average, difference, label = country)) +
geom_point() +
geom_text_repel() +
geom_abline(lty = 2) +
xlab("Average of 2010 and 2015") +
ylab("Difference between 2015 and 2010")
Those countries that have improved the most are easy to identify as it is represented in the Y-axis. We also get an idea of the overall value from the x-axis.
The Bland-Altman plot (Tukey mean difference plot, MA plot) graphs the difference between conditions on the y-axis and the mean between conditions on the x-axis. It is more appropriate for large numbers of observations than slope charts.
Case studies on vaccines can be fueled by datasets made possible through The Tyco Project. The Tyco Project serves to help institutes and researchers to make valuable global health data available to others, contributing to open and reproducible science through the offering of open data. This data includes weekly reported counts data for 7 diseases from 1928 to 2011 from all 50 states.
The dslabs package also makes available the yearly totals:
library(tidyverse)
library(dslabs)
data(us_contagious_diseases)
str(us_contagious_diseases)
## 'data.frame': 16065 obs. of 6 variables:
## $ disease : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num 1966 1967 1968 1969 1970 ...
## $ weeks_reporting: num 50 49 52 49 51 51 45 45 45 46 ...
## $ count : num 321 291 314 380 413 378 342 467 244 286 ...
## $ population : num 3345787 3364130 3386068 3412450 3444165 ...
as_tibble(us_contagious_diseases)
## # A tibble: 16,065 x 6
## disease state year weeks_reporting count population
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Hepatitis A Alabama 1966 50 321 3345787
## 2 Hepatitis A Alabama 1967 49 291 3364130
## 3 Hepatitis A Alabama 1968 52 314 3386068
## 4 Hepatitis A Alabama 1969 49 380 3412450
## 5 Hepatitis A Alabama 1970 51 413 3444165
## 6 Hepatitis A Alabama 1971 51 378 3481798
## 7 Hepatitis A Alabama 1972 45 342 3524543
## 8 Hepatitis A Alabama 1973 45 467 3571209
## 9 Hepatitis A Alabama 1974 45 244 3620548
## 10 Hepatitis A Alabama 1975 46 286 3671246
## # ... with 16,055 more rows
This exercise will focus only on measles. So we create a utility string to sort and the vector we’ll temporarily store the data frame in we’ll call dat.
* Pipe in us_contagious_diseases in filter
* Filter out states in the data that include Hawaii and Alaska (no data until 1950) AND
* Filter only those rows where the disease is measles THEN
* Pipe the filtered results to mutate, we we add a new column called rate, which is the rate per year of measles THEN
* Reorder the table to list state by order of rate
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>%
mutate(rate = count / population * 10000 * 52/weeks_reporting) %>%
mutate(state = reorder(state, rate))
We can now easily plot the rate of measles per year.
* We take the dat data set and filter out where state is ‘California’ and NOT NA (recall this as the data component step) THEN
* Pipe this to ggplot and map it to x and y (recall this as the aesthetic mapping component), THEN
* Select a geom_line() as the geometry (recall this as the geometry component) - 1963 is the year that measles vaccine was discovered AND
* Add some visual tweaks to assist visually with interpreting the data (recall this as the style component):
dat %>% filter(state == "California" & !is.na(rate)) %>%
ggplot(aes(year, rate)) +
geom_line() +
ylab("Cases per 10,000") +
geom_vline(xintercept=1963, col = "blue")
But this is just California, and just measles. How can we show the data for all states, in one plot?
dat %>% ggplot(aes(year, state, fill=rate)) +
geom_tile(color = "grey50") +
scale_x_continuous(expand = c(0,0)) +
scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Reds"), trans = "sqrt") +
geom_vline(xintercept = 1963, col = "blue") +
theme_minimal() + theme(panel.grid = element_blank()) +
ggtitle(the_disease) +
ylab("") +
xlab("")
# compute US average measles rate by year
avg <- us_contagious_diseases %>%
filter(disease == the_disease) %>% group_by(year) %>%
summarize(us_rate = sum(count, na.rm = TRUE)/sum(population, na.rm = TRUE)*10000)
## `summarise()` ungrouping output (override with `.groups` argument)
# make line plot of measles rate by year by state
dat %>%
filter(!is.na(rate)) %>%
ggplot() +
geom_line(aes(year, rate, group = state), color = "grey50",
show.legend = FALSE, alpha = 0.2, size = 1) +
geom_line(mapping = aes(year, us_rate), data = avg, size = 1, col = "black") +
scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
ggtitle("Cases per 10,000 by state") +
xlab("") +
ylab("") +
geom_text(data = data.frame(x = 1955, y = 50),
mapping = aes(x, y, label = "US average"), color = "black") +
geom_vline(xintercept = 1963, col = "blue")