Section 5 covers some general principles that can serve as guides for effective data visualization.
After completing Section 5, you will:
There are 3 assignments that use the DataCamp platform for you to practice your coding skills. There is also 1 assignment on the edX platform to allow you to practice exploratory data analysis.
Key points
Visual cues for encoding data include position, length, angle, area, brightness and color hue.
Position and length are the preferred way to display quantities, followed by angles, which are preferred over area. Brightness and color are even harder to quantify but can sometimes be useful.
Pie charts encode quantities as both angles and areas, while donut charts use only area. Humans are not good at visually quantifying angles and are even worse at quantifying area. Therefore, pie and donut charts should be avoided; use a bar plot instead. If you must make a pie chart, include percentages as labels.
Bar plots represent visual cues as position and length. Humans are good at visually quantifying linear measures, making bar plots a strong alternative to pie or donut charts.
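As a minimal sketch of the bar plot alternative, the example below uses geom_col() so that each value is encoded as bar length from a zero baseline, with the percentages added as labels. The data frame `browser_share`, its columns, and its numbers are made up for illustration and do not come from the course.

```r
library(ggplot2)

# Hypothetical data for illustration only; not from the course.
browser_share <- data.frame(
  browser = c("A", "B", "C", "D"),
  percent = c(40, 30, 20, 10)
)

# Bars encode each share as a length measured from zero;
# the text labels give the exact percentages.
p_bar <- ggplot(browser_share, aes(x = browser, y = percent)) +
  geom_col() +
  geom_text(aes(label = paste0(percent, "%")), vjust = -0.5) +
  ylab("Percent of users")

p_bar
```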
Key points
When using bar plots, always start at 0. It is deceptive not to start at 0 because bar plots imply length is proportional to the quantity displayed. Cutting off the y-axis can make differences look bigger than they actually are.
When using position rather than length, it is not necessary to include 0 (scatterplot, dot plot, boxplot).
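The contrast between length and position can be sketched as follows. The data frame `dat_zero` and its values are hypothetical. geom_col() draws each bar from 0, while the dot plot lets ggplot2 choose an axis range around the data.

```r
library(ggplot2)

# Hypothetical values for illustration only.
dat_zero <- data.frame(group = c("a", "b", "c"), value = c(82, 85, 88))

# Bar plot: geom_col() draws every bar from 0,
# so bar length is proportional to the value.
p_bars <- ggplot(dat_zero, aes(group, value)) + geom_col()

# Dot plot: position encodes the value,
# so the axis can zoom in on the range of the data.
p_dots <- ggplot(dat_zero, aes(group, value)) + geom_point(size = 3)

# Inspect the computed bar baselines.
layer_data(p_bars)$ymin
```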
Key points
It is easiest to visually extract information from a plot when categories are ordered by a meaningful value. The exact value on which to order will depend on your data and the message you wish to convey with your plot.
The default ordering for categories is alphabetical if the categories are strings or by factor level if factors. However, we rarely want alphabetical order.
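One way to replace the alphabetical default is base R's reorder(), which the measles case study later in this section also uses. The sketch below assumes a small made-up data frame (`rates`, with hypothetical values) and orders the categories by a numeric column.

```r
# Hypothetical rates for illustration only.
rates <- data.frame(
  state = c("Ohio", "Texas", "Maine"),
  rate  = c(3.1, 1.2, 2.4)
)

# Default: a factor built from strings has alphabetical levels.
levels(factor(rates$state))

# reorder() sorts the levels by a summary of a second variable
# (the mean by default), here from lowest to highest rate.
rates$state <- reorder(rates$state, rates$rate)
levels(rates$state)
```

Because ggplot2 draws categories in factor-level order, a plot of `rates` would now show the states sorted by rate.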
Key points
A dynamite plot - a bar graph of group averages with error bars denoting standard errors - provides almost no information about a distribution.
By showing the data, you provide viewers extra information about distributions.
Jitter is adding a small random shift to each point in order to minimize the number of overlapping points. To add jitter, use the geom_jitter() geometry instead of geom_point(). (See example below.)
Alpha blending is making points somewhat transparent, helping visualize the density of overlapping points. Add an alpha argument to the geometry.
Code
library(tidyverse)
library(dslabs)
data(heights)
# dot plot showing the data
heights %>% ggplot(aes(sex, height)) + geom_point()
# jittered, alpha blended point plot
heights %>% ggplot(aes(sex, height)) + geom_jitter(width = 0.1, alpha = 0.2)
Key points
Ease comparisons by keeping axes the same when comparing data across multiple plots.
Align plots vertically to see horizontal changes. Align plots horizontally to see vertical changes.
Bar plots are useful for showing one number but not useful for showing distributions.
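A minimal sketch of the alignment principle, using simulated data (the names `sim`, `grp`, and `value` are hypothetical): facet_grid(grp ~ .) stacks the panels vertically, and its default fixed scales keep the x-axis common to both panels.

```r
library(ggplot2)

# Simulated data for illustration only.
set.seed(1)
sim <- data.frame(
  grp   = rep(c("group 1", "group 2"), each = 200),
  value = c(rnorm(200, mean = 64, sd = 3), rnorm(200, mean = 69, sd = 3))
)

# Panels are stacked vertically and share one x-axis,
# so a horizontal shift between the groups is easy to see.
p_aligned <- ggplot(sim, aes(value)) +
  geom_histogram(binwidth = 1, color = "black") +
  facet_grid(grp ~ .)

p_aligned
```

Using facet_grid(. ~ grp) instead would place the panels side by side with a common y-axis, which suits comparisons of vertical changes.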
Key points
Use transformations when warranted to ease visual interpretation.
The log transformation is useful for data with multiplicative changes. The logistic transformation is useful for fold changes in odds. The square root transformation is useful for count data.
We learned how to apply transformations earlier in the course.
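The effect of each transformation can be checked numerically. This is a small sketch in base R with made-up vectors. It treats the logistic transformation as the logit, log(p / (1 - p)), which qlogis() computes.

```r
# Multiplicative changes: each value is 10 times the previous one.
x <- c(1, 10, 100, 1000)
diff(log10(x))   # equal steps on the log scale

# Odds that double at each step, converted to proportions.
odds <- c(1, 2, 4, 8)
p <- odds / (1 + odds)
diff(qlogis(p))  # each step equals log(2) on the logit scale

# Counts: the square root compresses the largest values.
counts <- c(0, 1, 4, 100, 400)
sqrt(counts)
```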
Key points
When two groups are to be compared, it is optimal to place them adjacent in the plot.
Use color to encode groups to be compared.
Consider using a color blind friendly palette like the one in this video.
Code
library(tidyverse)
color_blind_friendly_cols <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# plot eight points, one per color, to preview the palette
p1 <- data.frame(x = 1:8, y = 1:8, col = as.character(1:8)) %>%
  ggplot(aes(x, y, color = col)) +
  geom_point(size = 5)
p1 + scale_color_manual(values = color_blind_friendly_cols)
Transcript
In every single instance in which we have examined the relationship between two variables, total murders versus population size, life expectancy versus fertility rates, and child mortality versus income, we have used scatterplots. This is the plot we generally recommend.
One exception where another type of plot may be more informative is when you are comparing variables of the same type but at different time points and for a relatively small number of comparisons. For example, comparing life expectancy between 2010 and 2015. In this case, we might consider a slope chart.
There’s no geometry for slope charts in ggplot2, but we can construct one using geom_line. We need to do some tinkering to add labels and some other changes. The code looks something like this. This piece of code produces the following slope chart.
An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines. Note that we’re using angle as a visual cue, but we also have position to determine the exact values.
Comparing the improvement is a bit harder when we use the scatterplot. Note that in the scatterplot, we have followed the principle of using common axes, since we are comparing values before and after.
Now, note that when we have many points, the slope chart stops being useful because it becomes too cluttered, and in this case, we would use a scatterplot.
Finally, we’re going to describe the Bland-Altman plot. Since what we’re interested in is differences, it makes sense to dedicate one of our axes to differences. The Bland-Altman plot, also known as the Tukey mean-difference plot and the MA plot, shows the difference versus the average.
Here’s an example. Here we quickly see which countries have improved the most as it’s represented in the y-axis. We also get an idea of the overall value from the x-axis.
Key points
Consider using a slope chart or Bland-Altman plot when comparing one variable at two different time points, especially for a small number of observations.
Slope charts use angle to encode change. Use geom_line() to create slope charts. They are useful when comparing a small number of observations.
The Bland-Altman plot (Tukey mean difference plot, MA plot) graphs the difference between conditions on the y-axis and the mean between conditions on the x-axis. It is more appropriate for large numbers of observations than slope charts.
Code: Slope chart
library(tidyverse)
library(dslabs)
data(gapminder)
west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")
dat <- gapminder %>%
  filter(year %in% c(2010, 2015) & region %in% west & !is.na(life_expectancy) & population > 10^7)
dat %>%
  mutate(location = ifelse(year == 2010, 1, 2),
         location = ifelse(year == 2015 & country %in% c("United Kingdom", "Portugal"),
                           location + 0.22, location),
         hjust = ifelse(year == 2010, 1, 0)) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(year, life_expectancy, group = country)) +
  geom_line(aes(color = country), show.legend = FALSE) +
  geom_text(aes(x = location, label = country, hjust = hjust), show.legend = FALSE) +
  xlab("") +
  ylab("Life Expectancy")
Code: Bland-Altman plot
library(ggrepel)
dat %>%
  mutate(year = paste0("life_expectancy_", year)) %>%
  select(country, year, life_expectancy) %>%
  spread(year, life_expectancy) %>%
  mutate(average = (life_expectancy_2015 + life_expectancy_2010)/2,
         difference = life_expectancy_2015 - life_expectancy_2010) %>%
  ggplot(aes(average, difference, label = country)) +
  geom_point() +
  geom_text_repel() +
  geom_abline(lty = 2) +
  xlab("Average of 2010 and 2015") +
  ylab("Difference between 2015 and 2010")
Transcript
We previously showed a scatterplot showing the relationship between infant survival rates and average income. Here’s a version of this plot where we encode three more variables, OPEC membership, region, and population size.
Note that we encode categorical variables with color hue and shape. These shapes can be controlled with a shape argument. Here are the shapes available for use in R. Note that for the last five, the color goes inside.
For continuous variables, we can use color intensity or size. In the next video, we’re going to show a case study that demonstrates how to do this.
Key points
Encode a categorical third variable on a scatterplot using color hue or shape. Use the shape argument to control shape.
Encode a continuous third variable on a scatterplot using color intensity or size.
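These encodings can be sketched on a small made-up data frame. The object `countries` and every column and value in it are hypothetical stand-ins for the income, survival, region, OPEC membership, and population variables described in the transcript.

```r
library(ggplot2)

# Hypothetical data for illustration only.
countries <- data.frame(
  income     = c(1, 2, 4, 8, 16, 32),
  survival   = c(0.90, 0.93, 0.95, 0.97, 0.98, 0.99),
  region     = rep(c("East", "West"), times = 3),
  member     = rep(c("Yes", "No"), each = 3),
  population = c(5, 50, 20, 80, 10, 300)
)

# Categorical variables map to color hue and shape;
# the continuous population variable maps to point size.
p_encoded <- ggplot(countries, aes(income, survival)) +
  geom_point(aes(color = region, shape = member, size = population)) +
  scale_x_continuous(trans = "log2")

p_encoded
```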
Transcript
Vaccines have helped save millions of lives. In the 19th century, before herd immunity was achieved through vaccination programs, deaths from infectious diseases, like smallpox and polio, were common. However, today, despite all the scientific evidence for their importance, vaccination programs have become somewhat controversial.
The controversy started with a paper published in 1998 and led by Andrew Wakefield claiming there was a link between the administration of the measles, mumps, and rubella (MMR) vaccine and the appearance of autism and bowel disease. Despite much scientific evidence contradicting this finding, sensationalist media reports and fear mongering from conspiracy theorists led parts of the public to believe that vaccines were harmful. Some parents even stopped vaccinating their children. This dangerous practice can be potentially disastrous, given that the Centers for Disease Control (CDC) estimates that vaccination will prevent more than 21 million hospitalizations and 732,000 deaths among children born in the last 20 years.
The 1998 paper has since been retracted, and Andrew Wakefield was eventually struck off the UK medical register, with a statement identifying deliberate falsification in the research published in The Lancet, and was thereby barred from practicing medicine in the UK. Yet misconceptions persist, in part due to self-proclaimed activists who continue to spread misinformation about vaccines. Effective communication of data is a strong antidote to misinformation and fear mongering.
Earlier we showed an example provided by the Wall Street Journal showing data related to the impact of vaccines on battling infectious diseases. Here we reconstruct that example. The data used in these plots were collected, organized, and distributed by the Tycho project. They include weekly reported count data for 7 diseases from 1928 to 2011 from all 50 states. We include the yearly totals in the dslabs package. You can get it like this and look at the structure using this command. For the plot we’re going to make in this video, we create a temporary object called dat that stores all the measles data. It includes a per 10,000 rate, orders states by average value of disease, and removes Alaska and Hawaii, since they only became states in the late 50s.
Here’s the code where we define dat. We can now easily plot disease rates per year. Here are the measles data for California. We can use this simple code to show it. We add a vertical line at 1963, since this is when the vaccine was introduced.
Now can we show data for all states in one plot? We have three variables to show: year, state, and rate. In the Wall Street Journal figure, they use the x-axis for year, the y-axis for state, and color hue to represent rates. We’re using color to represent a continuous variable. However, the color scale they use, which goes from yellow to blue to green to orange to red, can be improved.
When choosing colors to quantify a numeric variable, we choose between two options, sequential and diverging. Sequential palettes are suited for data that goes from high to low. High values are clearly distinguished from the low values. Here are some examples offered by the package RColorBrewer.
On the other hand, diverging colors are used to represent values that diverge from a center. We put equal emphasis on both ends of the data range, higher than the center and lower than the center. An example of when we would use a diverging palette would be if we were to show heights in standard deviations away from the average. Here is an example of diverging palettes available from RColorBrewer.
In our example, we want to use a sequential palette since there is no meaningful center, just low and high rates. We use the geometry geom_tile to tile the region with colors representing disease rates. We use square root transformation to avoid having the really high counts dominate the plot. Here’s the code that generates a very nice and impactful plot. This plot makes a very striking argument for the contribution of vaccines.
However, one limitation of this plot is that it uses color to represent quantity, which we earlier explained makes it a bit harder to know exactly how high it is going. Position and length are better cues.
If we are willing to lose state information, we can make a version of the plot that shows the values with position. We can also show the average for the US, which we compute like this. Now to make the plot, we simply use the geom_line geometry. We are going to make every state the same color. This is because it’s harder to choose 50 distinct colors. However, the plot is very impactful. It shows very clearly how after the vaccine was introduced the rates went down across all states. It shows the same trend as our previous plot, but now we can actually see what the values are.
Key points
Vaccines save millions of lives, but misinformation has led some to question the safety of vaccines. The data support vaccines as safe and effective. We visualize data about measles incidence in order to demonstrate the impact of vaccination programs on disease rate.
The RColorBrewer package offers several color palettes. Sequential color palettes are best suited for data that span from high to low. Diverging color palettes are best suited for data that are centered and diverge towards high or low values.
The geom_tile() geometry creates a grid of colored tiles.
Position and length are stronger cues than color for numeric values, but color can be appropriate sometimes.
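To make the two palette types concrete, the sketch below pulls one sequential and one diverging palette from RColorBrewer. The palette names "Reds" and "RdBu" are just two of the palettes the package ships. display.brewer.all() draws the available palettes of a given type.

```r
library(RColorBrewer)

# Sequential palette: light to dark, for values running from low to high.
seq_cols <- brewer.pal(9, "Reds")

# Diverging palette: two hues meeting at a neutral middle,
# for values that spread in both directions from a center.
div_cols <- brewer.pal(11, "RdBu")

# Draw all sequential palettes, then all diverging palettes.
display.brewer.all(type = "seq")
display.brewer.all(type = "div")
```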
Code: Tile plot of measles rate by year and state
# import data and inspect
library(tidyverse)
library(dslabs)
data(us_contagious_diseases)
str(us_contagious_diseases)
## 'data.frame': 16065 obs. of 6 variables:
## $ disease : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num 1966 1967 1968 1969 1970 ...
## $ weeks_reporting: num 50 49 52 49 51 51 45 45 45 46 ...
## $ count : num 321 291 314 380 413 378 342 467 244 286 ...
## $ population : num 3345787 3364130 3386068 3412450 3444165 ...
# assign dat to the per 10,000 rate of measles, removing Alaska and Hawaii and adjusting for weeks reporting
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
  filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>%
  mutate(rate = count / population * 10000 * 52/weeks_reporting) %>%
  mutate(state = reorder(state, rate))
# plot disease rates per year in California
dat %>%
  filter(state == "California" & !is.na(rate)) %>%
  ggplot(aes(year, rate)) +
  geom_line() +
  ylab("Cases per 10,000") +
  geom_vline(xintercept = 1963, col = "blue")
# tile plot of disease rate by state and year
dat %>%
  ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand = c(0,0)) +
  scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept = 1963, col = "blue") +
  theme_minimal() + theme(panel.grid = element_blank()) +
  ggtitle(the_disease) +
  ylab("") +
  xlab("")
Code: Line plot of measles rate by year and state
# compute US average measles rate by year
avg <- us_contagious_diseases %>%
  filter(disease == the_disease) %>%
  group_by(year) %>%
  summarize(us_rate = sum(count, na.rm = TRUE)/sum(population, na.rm = TRUE)*10000)
# make line plot of measles rate by year by state
dat %>%
  filter(!is.na(rate)) %>%
  ggplot() +
  geom_line(aes(year, rate, group = state), color = "grey50",
            show.legend = FALSE, alpha = 0.2, size = 1) +
  geom_line(mapping = aes(year, us_rate), data = avg, size = 1, col = "black") +
  scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
  ggtitle("Cases per 10,000 by state") +
  xlab("") +
  ylab("") +
  geom_text(data = data.frame(x = 1955, y = 50),
            mapping = aes(x, y, label = "US average"), color = "black") +
  geom_vline(xintercept = 1963, col = "blue")
Transcript
Here we describe an important data visualization principle. Avoid pseudo three-dimensional plots. The figure we show here was taken from the scientific literature. It shows three variables: dose, drug type, and survival. Although when you look at a plot, you’re almost always looking at a screen or a book page, which are both flat and two-dimensional, this plot tries to imitate three dimensions and assigns a dimension to each variable. Humans are not good at seeing in three dimensions. Think about how hard it is to parallel park. And our limitation is even worse when it’s pseudo three-dimensional, as it is when you put it on a page or a web page. To see this, try to determine the value of the survival variable in the plot. Can you tell when the purple ribbon intersects the red one? This is an example in which it’s easy to use color to represent the categorical variable. We can make the plot like this. This plot demonstrates that using color is more than enough to distinguish the three lines. Pseudo 3D is sometimes used completely gratuitously. Plots are made to look 3D, even when the third dimension does not represent any quantity. This only adds confusion and makes it harder to relay your message. Here is an example. This is a three-dimensional bar plot. The third dimension adds nothing, only confusion. So in general, avoid pseudo 3D plots, and even more, avoid gratuitous 3D plots.
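Key point
In general, pseudo-3D plots and gratuitous 3D plots only add confusion; use regular 2D plots instead.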
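The two-dimensional alternative described in the transcript can be sketched as a line plot in which color distinguishes the categorical variable. The data frame `survival_dat` and all of its dose, drug, and survival values are invented for illustration. They are not the data from the published figure.

```r
library(ggplot2)

# Hypothetical dose-response values for illustration only.
doses <- c(0, 1, 2, 4, 8)
survival_dat <- data.frame(
  dose     = rep(doses, times = 3),
  drug     = rep(c("A", "B", "C"), each = length(doses)),
  survival = c(100, 90, 75, 50, 20,
               100, 95, 85, 70, 55,
               100, 80, 55, 30, 10)
)

# Two axes carry dose and survival; color hue carries drug type,
# so no third dimension is needed.
p_2d <- ggplot(survival_dat, aes(dose, survival, color = drug)) +
  geom_line() +
  geom_point() +
  ylab("Survival (%)")

p_2d
```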
Transcript
By default, statistical software like R returns many significant digits. The principle we’re about to discuss relates to tables, not graphs, and it’s to avoid too many significant digits. The default behavior in R is to show seven significant digits. So many digits often add no information, and the visual clutter makes it hard for the consumer of your table to understand the message. As an example, here are the per 10,000 disease rates for California across five decades. We are reporting precision of up to 0.00001 cases per 10,000, a very small value in the context of the changes that are occurring across the dates. In this case, two significant figures is more than enough and makes the point that the rates are decreasing. Useful functions in R to change the number of significant digits or to round numbers are signif and round. You can define the number of significant digits to use globally by setting an option. You can do it like this. Another principle related to displaying tables is to place values being compared in columns rather than rows. Here’s what the table would look like if we placed the numbers being compared horizontally. It’s a little bit harder to make the comparison.
Key points
In tables, avoid using too many significant digits. Too many digits can distract from the meaning of your data.
Reduce the number of significant digits globally by setting an option. For example, options(digits = 3) will cause all subsequent output in that session to be printed with 3 significant digits.
Reduce the number of digits locally using round() or signif().
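A short base R sketch of the local and global approaches, using an arbitrary made-up number `x`. Note that options(digits = ) only changes how values are printed, so the stored values keep their full precision.

```r
x <- 123.456789

signif(x, 2)   # keep 2 significant digits
round(x, 1)    # keep 1 decimal place

# Global setting: affects printing for the rest of the session.
old <- options(digits = 3)
print(pi)
print(x)

# Restore the previous setting.
options(old)
```

Saving the return value of options() and restoring it afterwards keeps the change from leaking into later code.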