Slope Chart

To compare two variables, a scatter plot is usually the recommended choice. To compare one variable at two time points across a relatively small number of groups, a slope chart is often the better option.

Example: comparing life expectancy between 2010 and 2015. (ggplot2 has no dedicated slope chart geometry, but we can construct one using geom_line.)

library(dplyr)
library(ggplot2)
library(dslabs)
data("gapminder")

# Regions treated as "the West" in this example
west <- c("Western Europe", "Northern Europe", "Southern Europe",
          "Northern America", "Australia and New Zealand")

# Keep 2010 and 2015 for large Western countries with known life expectancy
dat <- gapminder %>%
  filter(year %in% c(2010, 2015) & region %in% west &
           !is.na(life_expectancy) & population > 10^7)

dat %>%
  # x position of each label: 1 for 2010, 2 for 2015
  mutate(location = ifelse(year == 2010, 1, 2),
         # nudge two 2015 labels to the right so they do not overlap others
         location = ifelse(year == 2015 & country %in% c("United Kingdom", "Portugal"),
                           location + 0.22, location),
         # right-align the 2010 labels, left-align the 2015 labels
         hjust = ifelse(year == 2010, 1, 0)) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(year, life_expectancy, group = country)) +
  geom_line(aes(color = country), show.legend = FALSE) +
  geom_text(aes(x = location, label = country, hjust = hjust),
            show.legend = FALSE) +
  xlab("") + ylab("Life Expectancy")

Note: A slope chart uses the angle (slope) of each line as a visual cue for the size of the change, while position on the y-axis still lets us read the actual values.
- If there were many points, the slope chart would become cluttered and a scatter plot would be the better choice.
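
To illustrate the scatter plot alternative for many points, here is a minimal sketch that plots 2010 against 2015 life expectancy for every country in gapminder. It assumes the tidyr package is installed; the names wide, le_2010, le_2015, and p_scatter are made up for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(dslabs)
data("gapminder")

# One row per country, with 2010 and 2015 life expectancy side by side.
# No region or population filter here, so there are many points.
wide <- gapminder %>%
  filter(year %in% c(2010, 2015) & !is.na(life_expectancy)) %>%
  select(country, year, life_expectancy) %>%
  pivot_wider(names_from = year, values_from = life_expectancy,
              names_prefix = "le_")

p_scatter <- wide %>%
  ggplot(aes(le_2010, le_2015)) +
  geom_point(alpha = 0.5) +
  # identity line: points above it improved between 2010 and 2015
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  xlab("Life expectancy in 2010") +
  ylab("Life expectancy in 2015")
p_scatter
```

Points above the dashed identity line correspond to countries whose life expectancy increased.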

Bland-Altman plot

Also known as the Tukey mean-difference plot and the MA plot. This plot dedicates one axis to the differences: the y-axis shows the difference between the two values and the x-axis shows their average.

Here is an example: image:
- We can quickly see which country improved the most (y-axis).
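
Since the image is not reproduced here, the following is a sketch of how such a plot could be built from the same life expectancy data. It is one possible construction, not necessarily the code behind the original figure; it assumes tidyr is installed, and the names ba, le_2010, le_2015, and p_ba are made up for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(dslabs)
data("gapminder")

west <- c("Western Europe", "Northern Europe", "Southern Europe",
          "Northern America", "Australia and New Zealand")

# Same filter as the slope chart, reshaped to one row per country
ba <- gapminder %>%
  filter(year %in% c(2010, 2015) & region %in% west &
           !is.na(life_expectancy) & population > 10^7) %>%
  select(country, year, life_expectancy) %>%
  pivot_wider(names_from = year, values_from = life_expectancy,
              names_prefix = "le_") %>%
  # x-axis: average of the two years; y-axis: change between them
  mutate(average = (le_2015 + le_2010) / 2,
         difference = le_2015 - le_2010)

p_ba <- ba %>%
  ggplot(aes(average, difference, label = country)) +
  geom_point() +
  geom_text(vjust = -0.8, size = 3) +
  geom_hline(yintercept = 0, linetype = 2) +  # zero line: no change
  xlab("Average of 2010 and 2015") +
  ylab("Difference between 2015 and 2010")
p_ba
```

The higher a country sits on the y-axis, the more its life expectancy improved between the two years.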
 

Case Study: Vaccines

Since the 19th century, vaccination programs have prevented deaths from infectious diseases such as smallpox and polio.
However, in 1998 Andrew Wakefield published a controversial article claiming a link between the measles, mumps, and rubella (MMR) vaccine and autism and bowel disease.
Despite abundant scientific evidence contradicting this claim, sensationalist media reports and fear-mongering from conspiracy theorists led parts of the public to believe that vaccines were harmful.
The paper was later retracted by The Lancet, the journal that published it, and Wakefield was struck off the UK medical register for deliberate falsification in the research, which barred him from practicing medicine in the UK. However, the misconception persists.

The data used in these plots were collected, organized, and distributed by Project Tycho. They include weekly reported counts for 7 diseases from 1928 to 2011 from all 50 states. The dslabs package includes the yearly totals as us_contagious_diseases.

First, take a look at the data.

library(dslabs)
data("us_contagious_diseases")
str(us_contagious_diseases)
## 'data.frame':    18870 obs. of  6 variables:
##  $ disease        : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ state          : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year           : num  1966 1967 1968 1969 1970 ...
##  $ weeks_reporting: int  50 49 52 49 51 51 45 45 45 46 ...
##  $ count          : num  321 291 314 380 413 378 342 467 244 286 ...
##  $ population     : num  3345787 3364130 3386068 3412450 3444165 ...

 

Prepare for plot

Make a temporary object called “dat” that stores only the measles data.
The dat object includes a rate per 10,000 people, orders the states by their average rate, and excludes Alaska and Hawaii, since they only became states in the late 1950s.

the_disease <- "Measles"

dat <- us_contagious_diseases %>%
  # keep measles only and drop Alaska and Hawaii
  filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>%
  # cases per 10,000 people, matching the axis labels and the US average used later
  mutate(rate = count / population * 10000) %>%
  # order the state factor by average rate
  mutate(state = reorder(state, rate))
  • We can now easily plot disease rates per year.
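
The plots in this section label rates as cases per 10,000 people. As a quick check of the arithmetic, here is a worked example using the first row of the str() output above (count 321, population 3345787); the variable names are made up for illustration.

```r
# Values taken from the first row of the str() output above;
# used here only to illustrate the arithmetic.
count <- 321
population <- 3345787

# Scale the raw proportion to a rate per 10,000 and per 100,000 people
rate_per_10k  <- count / population * 10^4
rate_per_100k <- count / population * 10^5

round(c(per_10k = rate_per_10k, per_100k = rate_per_100k), 2)
```

This comes to roughly 0.96 cases per 10,000 people. The per-100,000 rate is exactly ten times larger, so the multiplier in the code and the unit in the axis label must agree.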
     

Plot Measles data for California

Add a vertical line at 1963 to show when the vaccine was introduced.

dat %>% filter(state == "California") %>%
  ggplot(aes(year, rate)) +
  geom_line() + ylab("Cases per 10,000") +
  geom_vline(xintercept = 1963, col = "blue")

  • We see a dramatic decline in measles cases after 1963.
     

Using colors to show pattern

To show all states in one plot, we have to use color to encode the rate.
The Wall Street Journal figure uses the x-axis for year, the y-axis for state, and color hue to represent rates.
image:

  • When choosing colors to quantify a numeric variable, we choose between sequential and diverging palettes.
      1. Sequential palettes are best for data that goes from high to low.
    • Example from RColorBrewer:
library(RColorBrewer)
display.brewer.all(type="seq")



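2. Diverging palettes are used to represent values that diverge from a center.
They put equal emphasis on both ends of the range, so they suit data with a meaningful midpoint, for example height expressed as a deviation from the average. They can show values above and below the center point.
- Here are some examples: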

library(RColorBrewer)
display.brewer.all(type="div")


Since we don’t have a meaningful center in our example, we will use a sequential palette.
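
As a small sketch of how these palettes can be inspected in code rather than by eye, the block below looks up the category of one sequential palette ("Reds") and one diverging palette ("RdBu", chosen here only as an example) and extracts their colors. The object names are made up for illustration.

```r
library(RColorBrewer)

# brewer.pal.info lists each palette with its maximum size and category
# ("seq" = sequential, "div" = diverging, "qual" = qualitative)
brewer.pal.info[c("Reds", "RdBu"), ]

# Extract the hex colors of each palette
seq_cols <- brewer.pal(9, "Reds")    # light to dark red
div_cols <- brewer.pal(11, "RdBu")   # red through a light center to blue

# Interpolate the nine sequential colors into a smooth 100-step gradient
reds_ramp <- colorRampPalette(seq_cols)(100)
```

A vector like seq_cols is what gets passed to the colors argument of scale_fill_gradientn().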

Use geom_tile() to tile the region with colors that represent disease rates.

dat %>% ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand=c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept=1963, col = "blue") +
  theme_minimal() + theme(panel.grid = element_blank()) +
  ggtitle(the_disease) +
  ylab("") +
  xlab("")


- The plot above makes the sharp drop in measles rates after 1963 very clear, but it encodes the rate as a color shade.
- As discussed earlier, quantities are hard to judge from color shade.

Position is a better visual cue for quantities. We will lose some information (which line belongs to which state), but showing the rate as a position on the y-axis makes the values easier to read.
First, calculate the average for the US.

avg <- us_contagious_diseases %>%
  filter(disease == the_disease) %>%
  group_by(year) %>%
  summarize(us_rate = sum(count, na.rm = TRUE) / sum(population, na.rm = TRUE) * 10000)


Next, make the plot by using geom_line().

dat %>% ggplot() +
  # one faint grey line per state
  # (linewidth replaces size for lines in ggplot2 >= 3.4.0)
  geom_line(aes(year, rate, group = state), color = "grey50",
            show.legend = FALSE, alpha = 0.2, linewidth = 1) +
  # US average drawn on top in black
  geom_line(mapping = aes(year, us_rate), data = avg, linewidth = 1, color = "black") +
  scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
  ggtitle("Cases per 10,000 by state") +
  xlab("") +
  ylab("") +
  geom_text(data = data.frame(x = 1955, y = 50), mapping = aes(x, y, label = "US average"), color = "black") +
  geom_vline(xintercept = 1963, col = "blue")
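
To put a rough number on what the plot shows, the sketch below averages the yearly US rate before and after 1963. This is an illustrative summary, not part of the original analysis: it takes an unweighted mean of yearly rates, and the names us_before_after, period, and mean_us_rate are made up for this example.

```r
library(dplyr)
library(dslabs)
data("us_contagious_diseases")

us_before_after <- us_contagious_diseases %>%
  filter(disease == "Measles") %>%
  group_by(year) %>%
  # yearly US rate per 10,000, computed the same way as avg above
  summarize(us_rate = sum(count, na.rm = TRUE) /
              sum(population, na.rm = TRUE) * 10000) %>%
  # split the years at the vaccine introduction
  mutate(period = ifelse(year < 1963, "before 1963", "1963 and later")) %>%
  group_by(period) %>%
  summarize(mean_us_rate = mean(us_rate), n_years = n())

us_before_after
```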

Avoid Pseudo and Gratuitous 3D Plots

Do not use pseudo-3D plots like the one below:
image:

Do not use gratuitous 3D plots like the one below:
image: