Section 5 Overview

Section 5 covers some general principles that can serve as guides for effective data visualization.

After completing Section 5, you will:

There are 3 assignments that use the DataCamp platform for you to practice your coding skills. There is also 1 assignment on the edX platform to allow you to practice exploratory data analysis.

Introduction to Data Visualization Principles

Key points

Encoding Data Using Visual Cues

Key points

Know When to Include Zero

Key points

Do Not Distort Quantities

Key points

Order by a Meaningful Value

Key points

Show the Data

Key points

Code

library(tidyverse)
library(dslabs)
data(heights)

# dot plot showing the data
heights %>% ggplot(aes(sex, height)) + geom_point()

# jittered, alpha blended point plot
heights %>% ggplot(aes(sex, height)) + geom_jitter(width = 0.1, alpha = 0.2)

Ease Comparisons: Use Common Axes

Key points

Consider Transformations

Key points

Ease Comparisons: Compared Visual Cues Should Be Adjacent

Key points

Code

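library(tidyverse)

# a palette whose colors remain distinguishable under common forms of color blindness
color_blind_friendly_cols <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# eight points, one per color, drawn with the palette above
p1 <- data.frame(x = 1:8, y = 1:8, col = as.character(1:8)) %>%
    ggplot(aes(x, y, color = col)) +
    geom_point(size = 5)
p1 + scale_color_manual(values = color_blind_friendly_cols)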

Slope Charts

Transcript

In every single instance in which we have examined the relationship between two variables, total murders versus population size, life expectancy versus fertility rates, and child mortality versus income, we have used scatterplots. This is the plot we generally recommend.

One exception where another type of plot may be more informative is when you are comparing variables of the same type but at different time points and for a relatively small number of comparisons. For example, comparing life expectancy between 2010 and 2015. In this case, we might consider a slope chart.

There’s no geometry for slope charts in ggplot2, but we can construct one using geom_line. We need to do some tinkering to add labels and some other changes. The code looks something like this. This piece of code produces the following slope chart.

An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines. Note that we’re using angle as a visual cue, but we also have position to determine the exact values.

Comparing the improvement is a bit harder when we use the scatterplot. Note that in the scatterplot, we have followed the principle "use common axes", since we are comparing values before and after.

Now, note that when we have many points, the slope chart stops being useful because it becomes too cluttered, and in this case, we would use a scatterplot.
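The scatterplot alternative mentioned above is not included in the code for this section, so here is a minimal sketch of one way it might be drawn. The object names wide and lims, the use of pivot_wider, and the label placement are assumptions made for illustration, not the course's own code. Both axes share the same limits, so any point above the dashed identity line is a country whose life expectancy improved.

```r
library(tidyverse)
library(dslabs)
data(gapminder)

west <- c("Western Europe", "Northern Europe", "Southern Europe",
          "Northern America", "Australia and New Zealand")

# one row per country, with 2010 and 2015 life expectancy side by side
# (wide is a hypothetical helper object, not part of the course code)
wide <- gapminder %>%
    filter(year %in% c(2010, 2015) & region %in% west &
               !is.na(life_expectancy) & population > 10^7) %>%
    select(country, year, life_expectancy) %>%
    pivot_wider(names_from = year, values_from = life_expectancy,
                names_prefix = "life_expectancy_")

# common limits for both axes, taken from the pooled range of the two years
lims <- range(c(wide$life_expectancy_2010, wide$life_expectancy_2015))

wide %>%
    ggplot(aes(life_expectancy_2010, life_expectancy_2015, label = country)) +
    geom_point() +
    geom_text(vjust = -0.7, size = 3) +
    geom_abline(lty = 2) +                      # identity line: no change
    coord_fixed(xlim = lims, ylim = lims) +     # same scale on both axes
    xlab("2010") +
    ylab("2015")
```

With only a handful of countries the labels stay readable; with many more points, labels would likely need to be dropped or repelled.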

Finally, we’re going to describe the Bland-Altman plot. Since we are interested in differences, it makes sense to dedicate one of our axes to differences. The Bland-Altman plot, also known as the Tukey mean-difference plot and as the MA plot, shows the difference versus the average.

Here’s an example. Here we quickly see which countries have improved the most as it’s represented in the y-axis. We also get an idea of the overall value from the x-axis.

Key points

Code: Slope chart

library(tidyverse)
library(dslabs)
data(gapminder)

west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")

dat <- gapminder %>%
    filter(year %in% c(2010, 2015) & region %in% west & !is.na(life_expectancy) & population > 10^7)

dat %>%
    mutate(location = ifelse(year == 2010, 1, 2),
           location = ifelse(year == 2015 & country %in% c("United Kingdom", "Portugal"),
                             location + 0.22, location),
           hjust = ifelse(year == 2010, 1, 0)) %>%
    mutate(year = as.factor(year)) %>%
    ggplot(aes(year, life_expectancy, group = country)) +
    geom_line(aes(color = country), show.legend = FALSE) +
    geom_text(aes(x = location, label = country, hjust = hjust), show.legend = FALSE) +
    xlab("") +
    ylab("Life Expectancy") 

Code: Bland-Altman plot

library(ggrepel)
dat %>%
    mutate(year = paste0("life_expectancy_", year)) %>%
    select(country, year, life_expectancy) %>% spread(year, life_expectancy) %>%
    mutate(average = (life_expectancy_2015 + life_expectancy_2010)/2,
                difference = life_expectancy_2015 - life_expectancy_2010) %>%
    ggplot(aes(average, difference, label = country)) +
    geom_point() +
    geom_text_repel() +
    geom_abline(lty = 2) +
    xlab("Average of 2010 and 2015") +
    ylab("Difference between 2015 and 2010")

Encoding a Third Variable

Transcript

We previously showed a scatterplot showing the relationship between infant survival rates and average income. Here’s a version of this plot where we encode three more variables, OPEC membership, region, and population size.

Note that we encode categorical variables with color hue and shape. These shapes can be controlled with a shape argument. Here are the shapes available for use in R. Note that for the last five, the color goes inside.

For continuous variables, we can use color, intensity, or size. In the next video, we’re going to show a case study that demonstrates how to do this.
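To see the available shapes and which of them take an interior color, a small sketch like the following can help. It assumes the ggplot2 package; the data frame shapes and the grid layout are made up for illustration. Shapes 21 to 25 are the five that accept a fill, so the orange fill below appears only inside those.

```r
library(ggplot2)

# the 26 point shapes built into R, numbered 0 to 25, laid out on a grid
# (shapes is a hypothetical helper object used only for this illustration)
shapes <- data.frame(shape = 0:25,
                     x = 0:25 %% 9,
                     y = -(0:25 %/% 9))

p <- ggplot(shapes, aes(x, y)) +
    # fill is only used by shapes 21 to 25; the others ignore it
    geom_point(aes(shape = shape), size = 5, color = "black", fill = "orange") +
    geom_text(aes(label = shape), nudge_y = -0.3) +
    scale_shape_identity() +    # use the numbers in the data as shape codes
    theme_void()
p
```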

Key points

Case Study: Vaccines

Transcript

Vaccines have helped save millions of lives. In the 19th century, before herd immunity was achieved through vaccination programs, deaths from infectious diseases such as smallpox and polio were common. Today, however, despite all the scientific evidence for their importance, vaccination programs have become somewhat controversial. The controversy started with a paper published in 1998, led by Andrew Wakefield, claiming a link between the administration of the measles, mumps, and rubella (MMR) vaccine and the appearance of autism and bowel disease. Despite much scientific evidence contradicting this finding, sensationalist media reports and fear mongering from conspiracy theorists led parts of the public to believe that vaccines were harmful. Some parents even stopped vaccinating their children.

This dangerous practice can be potentially disastrous, given that the Centers for Disease Control and Prevention (CDC) estimates that vaccination will prevent more than 21 million hospitalizations and 732,000 deaths among children born in the last 20 years. The 1998 paper has since been retracted, and Andrew Wakefield was eventually struck off the UK medical register, with a statement identifying deliberate falsification in the research published in The Lancet, and was thereby barred from practicing medicine in the UK. Yet misconceptions persist, in part due to self-proclaimed activists who continue to spread misinformation about vaccines. Effective communication of data is a strong antidote to misinformation and fear mongering.

Earlier we showed an example provided by the Wall Street Journal showing data related to the impact of vaccines on battling infectious diseases. Here we reconstruct that example. The data used in these plots were collected, organized, and distributed by the Tycho project. They include weekly reported counts for seven diseases from 1928 to 2011 from all 50 states. We include the yearly totals in the dslabs package. You can get it like this and look at the structure using this command. For the plot we’re going to make in this video, we create a temporary object called dat that stores all the measles data. It includes a per 10,000 rate, orders states by average value of disease, and removes Alaska and Hawaii, since they only became states in the late 1950s.

Here’s the code where we define dat. We can now easily plot disease rates per year. Here are the measles data for California. We can use this simple code to show it. We add a vertical line at 1963, since this is when the vaccine was introduced.

Now can we show data for all states in one plot? We have three variables to show: year, state, and rate. In the Wall Street Journal figure, they use the x-axis for year, the y-axis for state, and color hue to represent rates. We’re using color to represent a continuous variable. However, the color scale they use, which goes from yellow to blue to green to orange to red, can be improved.

When choosing colors to quantify a numeric variable, we choose between two options, sequential and diverging. Sequential palettes are suited for data that goes from high to low. High values are clearly distinguished from the low values. Here are some examples offered by the RColorBrewer package.

On the other hand, diverging colors are used to represent values that diverge from a center. We put equal emphasis on both ends of the data range, higher than the center and lower than the center. An example of when we would use a diverging palette would be if we were to show heights in standard deviations away from the average. Here is an example of diverging palettes available from RColorBrewer.
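The palettes referred to above can be browsed with a short sketch along these lines, assuming the RColorBrewer package is installed. The particular palettes extracted at the end ("Reds" and "RdBu") and the names reds and rdbu are just illustrative choices.

```r
library(RColorBrewer)

# sequential palettes: light-to-dark ramps for values that run from low to high
display.brewer.all(type = "seq")

# diverging palettes: two hues that meet at a neutral midpoint
display.brewer.all(type = "div")

# extract the hex codes of one palette of each kind
reds <- brewer.pal(9, "Reds")    # sequential, 9 colors
rdbu <- brewer.pal(11, "RdBu")   # diverging, 11 colors
```

The resulting character vectors can be passed to a ggplot2 scale, for example through the colors argument of scale_fill_gradientn.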

In our example, we want to use a sequential palette since there is no meaningful center, just low and high rates. We use the geometry geom_tile to tile the region with colors representing disease rates. We use a square root transformation to avoid having the really high counts dominate the plot. Here’s the code that generates a very nice and impactful plot. This plot makes a very striking argument for the contribution of vaccines.

However, one limitation of this plot is that it uses color to represent quantity, which we earlier explained makes it a bit harder to know exactly how high it is going. Position and length are better cues.

If we are willing to lose state information, we can make a version of the plot that shows the values with position. We can also show the average for the US, which we compute like this. Now to make the plot, we simply use the geom_line geometry. We are going to make every state the same color, because it is hard to choose 50 distinct colors. Even so, the plot is very impactful. It shows very clearly how, after the vaccine was introduced, the rates went down across all states. It shows the same information as our previous plot, but now we can actually see what the values are.

Key points

Code: Tile plot of measles rate by year and state

# import data and inspect
library(tidyverse)
library(dslabs)
data(us_contagious_diseases)
str(us_contagious_diseases)
## 'data.frame':    16065 obs. of  6 variables:
##  $ disease        : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ state          : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year           : num  1966 1967 1968 1969 1970 ...
##  $ weeks_reporting: num  50 49 52 49 51 51 45 45 45 46 ...
##  $ count          : num  321 291 314 380 413 378 342 467 244 286 ...
##  $ population     : num  3345787 3364130 3386068 3412450 3444165 ...
# assign dat to the per 10,000 rate of measles, removing Alaska and Hawaii and adjusting for weeks reporting
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
    filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>%
    mutate(rate = count / population * 10000 * 52/weeks_reporting) %>%
    mutate(state = reorder(state, rate))

# plot disease rates per year in California
dat %>% filter(state == "California" & !is.na(rate)) %>%
    ggplot(aes(year, rate)) +
    geom_line() +
    ylab("Cases per 10,000") +
    geom_vline(xintercept=1963, col = "blue")

# tile plot of disease rate by state and year
dat %>% ggplot(aes(year, state, fill=rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Reds"), trans = "sqrt") +
    geom_vline(xintercept = 1963, col = "blue") +
    theme_minimal() + theme(panel.grid = element_blank()) +
    ggtitle(the_disease) +
    ylab("") +
    xlab("")

Code: Line plot of measles rate by year and state

# compute US average measles rate by year
avg <- us_contagious_diseases %>%
    filter(disease == the_disease) %>% group_by(year) %>%
    summarize(us_rate = sum(count, na.rm = TRUE)/sum(population, na.rm = TRUE)*10000)

# make line plot of measles rate by year by state
dat %>%
    filter(!is.na(rate)) %>%
    ggplot() +
    geom_line(aes(year, rate, group = state), color = "grey50", 
        show.legend = FALSE, alpha = 0.2, size = 1) +
    geom_line(mapping = aes(year, us_rate), data = avg, size = 1, col = "black") +
    scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
    ggtitle("Cases per 10,000 by state") +
    xlab("") +
    ylab("") +
    geom_text(data = data.frame(x = 1955, y = 50),
        mapping = aes(x, y, label = "US average"), color = "black") +
    geom_vline(xintercept = 1963, col = "blue")

Avoid Pseudo and Gratuitous 3D Plots

Transcript

Here we describe an important data visualization principle: avoid pseudo three-dimensional plots. The figure we show here was taken from the scientific literature. It shows three variables: dose, drug type, and survival. Although when you look at a plot, you’re almost always looking at a screen or a book page, which are both flat and two-dimensional, this plot tries to imitate three dimensions and assigns a dimension to each variable.

Humans are not good at seeing in three dimensions. Think about how hard it is to parallel park. And our limitation is even worse when it’s pseudo three-dimensional, as it is when you put it on a page or a web page. To see this, try to determine the value of the survival variable in the plot. Can you tell when the purple ribbon intersects the red one?

This is an example in which it’s easy to use color to represent the categorical variable. We can make the plot like this. This plot demonstrates that using color is more than enough to distinguish the three lines.

Pseudo 3D is sometimes used completely gratuitously. Plots are made to look 3D even when the third dimension does not represent any quantity. This only adds confusion and makes it harder to relay your message. Here is an example. This is a three-dimensional bar plot. The third dimension adds nothing, only confusion. So in general, avoid pseudo 3D plots, and even more so, avoid gratuitous 3D plots.

Key point

Avoid Too Many Significant Digits

Transcript

By default, statistical software like R returns many significant digits. The principle we’re about to discuss relates to tables, not graphs, and it’s to avoid too many significant digits. The default behavior in R is to show seven significant digits. So many digits often add no information, and the visual clutter makes it hard for the consumer of your table to understand the message.

As an example, here are the per 10,000 disease rates for California across five decades. We are reporting precision of up to 0.00001 cases per 10,000, a very small value in the context of the changes that are occurring across the dates. In this case, two significant figures is more than enough and makes the point that the rates are decreasing.

Useful functions in R to change the number of significant digits or to round numbers are signif and round. You can define the number of significant digits to use globally by setting an option. You can do it like this.

Another principle related to displaying tables is to place the values being compared in columns rather than rows. Here’s what the table would look like if we placed the numbers being compared horizontally. It’s a little bit harder to make the comparison.
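As a minimal sketch of the functions named above, using only base R: the value x is a made-up rate chosen for illustration, not a figure from the data, and old is just a helper name for saving the previous option.

```r
x <- 37.44867    # a hypothetical rate, in cases per 10,000

signif(x, 2)     # keep two significant digits: 37
round(x, 1)      # keep one decimal place: 37.4

# set the number of significant digits printed by default for this session
old <- options(digits = 3)
print(x)         # now printed with three significant digits
options(old)     # restore the previous setting
```

Note that signif and round change the stored value, whereas options(digits = 3) only changes how numbers are printed.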

Key points