Data Visualization

This document was composed from Dr. Snopkowski’s ANTH 504 Week 4 lecture and from Introduction to Data Science: Data Analysis and Prediction Algorithms with R by Rafael A. Irizarry (Irizarry 2019).

Ch.10 Principles

data.table

Sometimes you will see |> instead of %>%.

The native pipe |> (built into R since version 4.1) works the same as %>% for the code in these notes:

murders |> head()
##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65
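
As a quick check of that equivalence, the sketch below runs the same two-step computation through both pipes. The vector x is made up for illustration. It assumes R 4.1 or later and that the magrittr package, which supplies %>%, is installed.

```r
library(magrittr)  # provides %>%; dplyr re-exports the same operator

# A made-up vector, used only for illustration
x <- c(1, 4, 9, 16)

# Native pipe (base R, version 4.1 or later)
native_result <- x |> sqrt() |> sum()

# magrittr pipe
magrittr_result <- x %>% sqrt() %>% sum()

# Both pipes pass the left-hand side as the first argument of the next call
identical(native_result, magrittr_result)
```

The two pipes do differ in some advanced uses (placeholders, for example), but none of those appear in these notes.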

Data around the world

In each pair, which country has the higher child mortality?
  • Sri Lanka or Turkey
  • Poland or South Korea
  • Malaysia or Russia
  • Pakistan or Vietnam
  • Thailand or South Africa

Ted Talk 2006

Gapminder Data

data(gapminder)
gapminder %>% as_tibble()
## # A tibble: 10,545 × 9
##    country          year infan…¹ life_…² ferti…³ popul…⁴      gdp conti…⁵ region
##    <fct>           <int>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl> <fct>   <fct> 
##  1 Albania          1960   115.     62.9    6.19  1.64e6 NA       Europe  South…
##  2 Algeria          1960   148.     47.5    7.65  1.11e7  1.38e10 Africa  North…
##  3 Angola           1960   208      36.0    7.32  5.27e6 NA       Africa  Middl…
##  4 Antigua and Ba…  1960    NA      63.0    4.43  5.47e4 NA       Americ… Carib…
##  5 Argentina        1960    59.9    65.4    3.11  2.06e7  1.08e11 Americ… South…
##  6 Armenia          1960    NA      66.9    4.55  1.87e6 NA       Asia    Weste…
##  7 Aruba            1960    NA      65.7    4.82  5.42e4 NA       Americ… Carib…
##  8 Australia        1960    20.3    70.9    3.45  1.03e7  9.67e10 Oceania Austr…
##  9 Austria          1960    37.3    68.8    2.7   7.07e6  5.24e10 Europe  Weste…
## 10 Azerbaijan       1960    NA      61.3    5.57  3.90e6 NA       Asia    Weste…
## # … with 10,535 more rows, and abbreviated variable names ¹​infant_mortality,
## #   ²​life_expectancy, ³​fertility, ⁴​population, ⁵​continent

Let’s examine the 2015 infant mortality rates for two of the pairs shown in the video: Sri Lanka vs. Turkey, and Thailand vs. South Africa. The relevant columns are country and infant_mortality. Write the code to get these values.
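
One way to write this lookup is sketched below. It assumes the dslabs package (which provides gapminder) and dplyr are installed; the names video_countries and im_2015 are ours, not from the lecture.

```r
library(dplyr)
library(dslabs)  # assumed source of the gapminder data set
data(gapminder)

# Countries from the two pairs named above
video_countries <- c("Sri Lanka", "Turkey", "Thailand", "South Africa")

# Keep only 2015 and those countries, then show the two columns of interest
im_2015 <- gapminder %>%
  filter(year == 2015 & country %in% video_countries) %>%
  select(country, infant_mortality)
im_2015
```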

Filtering Sri Lanka and Turkey for 1960 returns:

gapminder %>% 
  filter(year == 1960 & country %in% c("Sri Lanka", "Turkey")) 
##     country year infant_mortality life_expectancy fertility population
## 1 Sri Lanka 1960             72.7           59.76      5.54    9896172
## 2    Turkey 1960            166.0           46.91      6.30   27553280
##           gdp continent        region
## 1  2708601390      Asia Southern Asia
## 2 44566050533      Asia  Western Asia

Plot a scatterplot of fertility (x-axis) vs. life_expectancy (y-axis) for 1960.

Add unique colors by continent:

p1 <- gapminder |>
  filter(year  == 1960) |>
  ggplot(aes(fertility, life_expectancy, color = continent)) +
  geom_point()
p1

Compare this to 2015

p2 <- gapminder |>
  filter(year  == 2015) |>
  ggplot(aes(fertility, life_expectancy, color = continent)) +
  geom_point()
p2
## Warning: Removed 1 rows containing missing values (`geom_point()`).

Can you make side-by-side plots? grid.arrange() is part of the gridExtra package.

Notice that the plots were saved as p1 and p2, so they can be passed to grid.arrange():

grid.arrange(p1, p2, ncol = 2)
## Warning: Removed 1 rows containing missing values (`geom_point()`).

Faceting

Faceting allows us to show multiple plots by stratifying the data by some variable. We add a layer with the function facet_grid(), which separates our plots. You can facet by up to 2 variables, written as rows ~ columns.

gapminder %>% 
  filter(year %in% c(1962, 2015)) %>% 
  ggplot(aes(fertility, life_expectancy, col=continent)) + 
    geom_point() + 
    facet_grid(continent ~ year)
## Warning: Removed 1 rows containing missing values (`geom_point()`).

year %in% c(1962, 2015) specifies which years to include. In facet_grid(continent ~ year), continent defines the rows and year defines the columns.

If instead we just want to have 2 side-by-side graphs separated by year (as in the last slide), we can use facet_grid like this:

gapminder %>% filter(year %in% c(1962, 2015)) %>% 
  ggplot(aes(fertility, life_expectancy, col=continent)) + 
  geom_point() + 
  facet_grid(. ~ year)
## Warning: Removed 1 rows containing missing values (`geom_point()`).

In . ~ year, the dot means there is no row variable, so the plots are separated by year only, with each year in its own column.
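
To see how the formula maps to the panel layout, here is a small sketch on a made-up data frame (toy is hypothetical, not gapminder). It builds both forms and inspects the layout that ggplot2 computes. Reading the layout through ggplot_build() is just one convenient way to check this, and it assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical toy data: 2 continents x 2 years, values invented
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), each = 2),
  year = rep(c(1962, 2015), times = 2),
  fertility = c(6.0, 2.1, 2.6, 1.6),
  life_expectancy = c(50, 75, 70, 81)
)

base_plot <- ggplot(toy, aes(fertility, life_expectancy)) + geom_point()

# rows ~ columns: one row per continent, one column per year
grid_layout <- ggplot_build(base_plot + facet_grid(continent ~ year))$layout$layout

# . ~ columns: a single row, one column per year
row_layout <- ggplot_build(base_plot + facet_grid(. ~ year))$layout$layout

c(rows = max(grid_layout$ROW), cols = max(grid_layout$COL))
c(rows = max(row_layout$ROW), cols = max(row_layout$COL))
```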

Perhaps using the term “developing world” no longer makes sense.

facet_wrap

If we want to include several charts (for instance to see how these values have changed over time), we may not want to use facet_grid because they won’t all fit on one row. facet_wrap() permits us to wrap the graphs across multiple rows and columns so that each plot is viewable.

years <- c(1960, 1970, 1980, 1990, 2000, 2010, 2015)
continents <- c("Europe", "Asia")
gapminder %>% 
  filter(year %in% years & continent %in% continents) %>% 
  ggplot(aes(fertility, life_expectancy, col=continent)) + 
    geom_point() + facet_wrap(~year) 

facet_wrap() simply wraps the panels onto a new row when it runs out of space. Defining the years in a separate vector makes the code easier to reuse and edit later, which is good practice when you expect to rerun the analysis with different values. The names year and years are distinct but easy to confuse, so a more distinctive name for the vector might be better.

By default, R will keep the scales the same (to compare across charts). If you don’t want the scales to be the same you can use:

gapminder %>% 
  filter(year %in% years & continent %in% continents) %>% 
  ggplot(aes(fertility, life_expectancy, col=continent)) + 
    geom_point() + 
    facet_wrap(~year, scales = "free")

This is not best practice, because readers expect the axes to be the same across panels.

Time series plots

We may be interested in understanding how fertility and life expectancy changed over time. To look at factors that change over time, we can use time series plots (where time is on the x-axis).

For instance, let’s look at fertility rates in the US across time:

gapminder %>% 
  filter(country == "United States") %>% 
  ggplot(aes(year, fertility)) + 
    geom_point() 
## Warning: Removed 1 rows containing missing values (`geom_point()`).

Or, if you would like time to run from the present back to the past:

gapminder %>% 
  filter(country == "United States") %>% 
  ggplot(aes(year, fertility)) + 
    geom_point() + 
    scale_x_continuous(trans="reverse")
## Warning: Removed 1 rows containing missing values (`geom_point()`).

This reverses the x-axis scale, which is not very useful when looking at time. Going back to the original plot: since the points are close together, we might want to connect them with a line.

gapminder %>% 
  filter(country == "United States") %>% 
  ggplot(aes(year, fertility)) + 
    geom_line() 
## Warning: Removed 1 row containing missing values (`geom_line()`).

What interpretations would you draw from this chart? This plot shows that the fertility rate dropped drastically during the 1960s and 1970s. By about 1975 it hit a low; it then increased slightly but has remained roughly stable to the present.

Plotting 2 lines

We may be interested in comparing 2 countries over time. So how can we create 2 lines?

Let’s start with the points:

countries <- c("South Korea", "Germany")
gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility)) + 
    geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Then we can change it to lines:

countries <- c("South Korea", "Germany")
gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility)) + 
    geom_line()
## Warning: Removed 2 rows containing missing values (`geom_line()`).

But . . . this isn’t what we want. We need to tell R that we want separate lines, so we add a group argument.

gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility, group = country)) + 
    geom_line()
## Warning: Removed 2 rows containing missing values (`geom_line()`).

When you tell R that the data are grouped, it knows to draw the lines separately. But you still can’t identify which country is which. We can indicate which line is which by using the color argument instead of group.

gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility, col = country)) + 
    geom_line()
## Warning: Removed 2 rows containing missing values (`geom_line()`).

Setting col = country gives each country its own color and adds a legend.

Labels instead of legends

Labeling may be preferred over legends whenever possible since labels are easier for readers.

To include labels we need to decide where on the chart we want to place them. For South Korea, we may want to put the label at (1970, 5). For Germany, we may want to put the label at (1960, 2.5).

First, we have to create our labels:

labels <- data.frame(country = countries, x=c(1970, 1960), y=c(5, 2.5)) 
head(labels)
##       country    x   y
## 1 South Korea 1970 5.0
## 2     Germany 1960 2.5

It is best to do this step by step and check your work.

gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility, col=country)) + 
    geom_line() + 
    geom_text(data=labels, aes(x,y,label=country))
## Warning: Removed 2 rows containing missing values (`geom_line()`).

You can add a theme() layer to control the legend position; here it removes the legend. The second version also enlarges the labels with size = 5.

gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility, col=country)) + 
    geom_line() + 
    geom_text(data=labels, aes(x,y,label=country)) + 
    theme(legend.position = "none")
## Warning: Removed 2 rows containing missing values (`geom_line()`).

gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility, col=country)) + 
    geom_line() + 
    geom_text(data=labels, aes(x,y,label=country), size=5) + 
    theme(legend.position = "none")
## Warning: Removed 2 rows containing missing values (`geom_line()`).

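We need theme(legend.position = "none"); otherwise a legend will appear. The labels data frame tells R where to put each label. It can also be passed to geom_text() as the second argument, after aes(), without naming it: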

gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility, col=country)) + 
    geom_line() + 
    geom_text(aes(x,y,label=country), labels, size=5) + 
    theme(legend.position = "none")
## Warning: Removed 2 rows containing missing values (`geom_line()`).

Getting rid of legends can also be done like this, but requires more code:

gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility, col=country)) + 
    geom_line(show.legend=FALSE) + 
    geom_text(data=labels, aes(x,y,label=country), size = 5, show.legend=FALSE)
## Warning: Removed 2 rows containing missing values (`geom_line()`).

geomtextpath

That last method was long-winded. Maybe we can do this a bit more easily. The geomtextpath package offers an alternative way to label your lines:

#install.packages("geomtextpath")
library(geomtextpath)
p <- gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility, col=country, label=country)) + 
    geom_textpath() +
    theme(legend.position = "none")
p
## Warning: Removed 2 rows containing missing values (`geom_textpath()`).

theme(legend.position = "none") gets rid of the legend.

countries <- c("South Korea", "Germany", "Haiti")
gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, fertility, col=country, label=country)) + 
    geom_textpath() +
    theme(legend.position = "none")
## Warning: Removed 3 rows containing missing values (`geom_textpath()`).

Country-level wealth

How has the wealth distribution changed across the world over time?

We can use GDP (gross domestic product), a measure of the market value of goods and services produced by a country in a year. Dividing GDP by the country’s population gives a rough estimate of wealth per person. We can then divide by 365 to get wealth per person per day. People living on less than $2/day on average are considered to live in absolute poverty.

Let’s create this variable: dollars_per_day (Note: this is measured in current US dollars)

gapminder <- gapminder %>% 
  mutate(dollars_per_day = gdp/population/365)
head(gapminder)
##               country year infant_mortality life_expectancy fertility
## 1             Albania 1960           115.40           62.87      6.19
## 2             Algeria 1960           148.20           47.50      7.65
## 3              Angola 1960           208.00           35.98      7.32
## 4 Antigua and Barbuda 1960               NA           62.97      4.43
## 5           Argentina 1960            59.87           65.39      3.11
## 6             Armenia 1960               NA           66.86      4.55
##   population          gdp continent          region dollars_per_day
## 1    1636054           NA    Europe Southern Europe              NA
## 2   11124892  13828152297    Africa Northern Africa        3.405458
## 3    5270844           NA    Africa   Middle Africa              NA
## 4      54681           NA  Americas       Caribbean              NA
## 5   20619075 108322326649  Americas   South America       14.393153
## 6    1867396           NA      Asia    Western Asia              NA
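
As a sanity check on the formula, the Argentina row for 1960 in the output above can be reproduced by hand. The numbers are copied from that output; the variable names are ours.

```r
# Values copied from the Argentina 1960 row of the output above
gdp <- 108322326649
population <- 20619075

gdp_per_person <- gdp / population        # yearly GDP per person
dollars_per_day <- gdp_per_person / 365   # spread evenly over the year

round(dollars_per_day, 2)  # about 14.39, in line with the table
```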

Take a look at the distribution. Let’s plot a histogram of dollars per day, first across all years:

hist(gapminder$dollars_per_day)

Then look at just one year, 1970:

gapminder |>
  filter(year == 1970) %>%
  ggplot(aes(dollars_per_day)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 72 rows containing non-finite values (`stat_bin()`).

Log transformations

As expected, the distribution of wealth is non-normal. We may want to examine this on a log scale. If we use log base 2, then each doubling of a value turns into an increase of 1. (This turns multiplicative changes into additive ones.)
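
A tiny base R sketch of the doubling rule, using made-up dollar amounts:

```r
# Each value is double the previous one
dollars <- c(1, 2, 4, 8, 16)

log2(dollars)        # 0 1 2 3 4
diff(log2(dollars))  # every doubling adds exactly 1
```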

gapminder %>% 
  filter(year == 1970) %>% 
  ggplot(aes(log2(dollars_per_day))) + 
    geom_histogram(binwidth = 1, color = "black")
## Warning: Removed 72 rows containing non-finite values (`stat_bin()`).

Or, filtering out the missing values first so that the warning disappears:

gapminder %>% 
  filter(year == 1970 & !is.na(gdp)) %>% 
  ggplot(aes(log2(dollars_per_day))) + 
    geom_histogram(binwidth = 1, color = "black")

Which base?
  • Common choices for bases include: e, 2, 10.

  • Recommendations: Don’t use e for data exploration / visualization because it’s hard to do these calculations in our head. What is e^3 or e^4? It’s much easier to do 2^3, 2^4, or 10^3, 10^4.

  • In the previous example, we used base 2 instead of base 10 because the range was easier to interpret – going from -2 to 6, which are relatively easy for us to calculate. If we chose base 10, our range would include only 0-2, so our range is small, and we need to choose a binwidth other than 1.

  • With base 2, a binwidth of 1 translates to a bin with range x to 2x.

  • But . . . let’s look at population sizes. Create a histogram. Would it be best to use base e, 2, or 10 to display these data? Explain your choice.
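
The base-2 versus base-10 point in the list above can be checked numerically. The sketch takes the base-2 range quoted there (-2 to 6), converts it back to dollars per day, and re-expresses it in base 10. Only base R is used, and the variable names are ours.

```r
# The base-2 range mentioned above, as axis values
log2_range <- c(-2, 6)

# Back-transform to dollars per day: 2^-2 = 0.25 and 2^6 = 64
dollar_range <- 2^log2_range

# The same two endpoints on a base-10 scale span only about 2.4 units,
# so a binwidth of 1 would give very few bins
log10_range <- log10(dollar_range)
round(log10_range, 2)

# With base 2 and binwidth 1, each bin covers x to 2x, e.g. 4 to 8 dollars
2^c(2, 3)
```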

Transform the scale or transform the variable?

Pros & Cons

  1. If we transform the variable, then the scale is still interpretable (you can easily determine the value on the scale), but we have to do calculations on those values (2^axis_value).

  2. If we use logged scales, it may be harder to determine the value on the scale, but when you do, it’s easier to interpret since it’s the original value. For example, you see that it’s 32 dollars/day instead of 5 log base 2 dollars per day.

Transforming the scale can be done with scale_x_continuous

gapminder %>% 
  filter(year == 1970) %>% 
  ggplot(aes(dollars_per_day)) + 
    geom_histogram(binwidth= 1, color="black") + 
    scale_x_continuous(trans="log2")
## Warning: Removed 72 rows containing non-finite values (`stat_bin()`).

There are 2 modes, so this distribution is bimodal.

Multimodal Wealth Distribution

Does this mean that there are 2 groups of countries in terms of wealth?

Let’s examine the data by region. Create a plot of dollars_per_day by region

p <- gapminder %>% 
  filter(year == 1962) %>% 
  ggplot(aes(region, dollars_per_day)) + 
  geom_point()
p
## Warning: Removed 89 rows containing missing values (`geom_point()`).

It would be nice if we could actually read those labels. We can rotate them with theme():
  • theme(): check ?theme for details.
  • axis.text.x is the text on the x-axis.
  • element_text() controls text, a non-data component of the plot.
  • hjust is horizontal justification from 0 to 1, where 1 = right-justified.

gapminder %>% 
  filter(year == 1962) %>% 
  ggplot(aes(region, dollars_per_day)) + 
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 89 rows containing missing values (`geom_point()`).

Reorder in ggplot

Is there a difference between “West” and “not West”?

These are ordered alphabetically. It’d be nice if they were ordered in a meaningful way.

What might be a meaningful way?

p <- gapminder %>% 
  filter(year == 1962 & !is.na(dollars_per_day)) %>% 
  mutate(region = reorder(region, dollars_per_day, FUN=median)) %>% 
  ggplot(aes(region, dollars_per_day)) + 
    geom_point()
p + theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_continuous(trans="log2")

  • !is.na(): we need to exclude rows where dollars_per_day is missing (NA) because we are calculating the median. When you calculate statistics like this, you can’t have any NA in the data.
  • reorder(): reorders region by median dollars_per_day. The FUN argument is optional; if you omit it, reorder() uses the mean.
  • scale_y_continuous(trans="log2"): adds a log scale to spread out the dots a little more.

The regions are now ordered by dollars_per_day.
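
A minimal base R sketch of what reorder() does to the factor levels, using a made-up vector of regions "A", "B", "C" with invented values:

```r
# Hypothetical toy data: three regions with two countries each
region <- factor(c("A", "A", "B", "B", "C", "C"))
dollars_per_day <- c(10, 12, 1, 3, 5, 7)

levels(region)  # alphabetical by default: "A" "B" "C"

# Reorder the levels by the median of dollars_per_day within each region
# (medians: A = 11, B = 2, C = 6)
region_by_median <- reorder(region, dollars_per_day, FUN = median)
levels(region_by_median)  # "B" "C" "A"

# Why NA rows are filtered out first: one NA makes the median NA
median(c(10, NA, 12))
```

ggplot then draws the x-axis categories in level order, which is why the plot ends up sorted.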

Interpretation: With countries grouped by region, wealth is similar within each region, except that Eastern Asia has a wider spread.

p <- gapminder %>% 
  filter(year == 2010 & !is.na(dollars_per_day)) %>% 
  mutate(region = reorder(region, dollars_per_day, FUN=median)) %>% 
  ggplot(aes(region, dollars_per_day)) + 
    geom_point()
p + theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_continuous(trans="log2")

There are more regions on this second graph (2010). Eastern and Western Africa are still very poor.

Data Manipulation & boxplots

Let’s re-categorize our regions so that we don’t have so many. We can use the case_when() function to create this new variable

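gapminder <- gapminder %>% 
  mutate(group = case_when(
    region %in% c("Western Europe", "Northern Europe", "Southern Europe", 
                  "Northern America", "Australia and New Zealand") ~ "West", 
    region %in% c("Eastern Asia", "South- Eastern Asia") ~ "East Asia", 
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America", 
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa", 
    TRUE ~ "Others")) 

case_when() lets you build a new variable from conditions on existing ones, such as region and continent.

You can check that it worked correctly by cross-tabulating region against group. In the table below, South-Eastern Asia lands in Others rather than East Asia because the string "South- Eastern Asia" in the code contains a stray space and never matches; delete the space if you want those countries in East Asia.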

table(gapminder$region, gapminder$group)
##                            
##                             East Asia Latin America Others Sub-Saharan Africa
##   Australia and New Zealand         0             0      0                  0
##   Caribbean                         0           741      0                  0
##   Central America                   0           456      0                  0
##   Central Asia                      0             0    285                  0
##   Eastern Africa                    0             0      0                912
##   Eastern Asia                    342             0      0                  0
##   Eastern Europe                    0             0    570                  0
##   Melanesia                         0             0    285                  0
##   Micronesia                        0             0    114                  0
##   Middle Africa                     0             0      0                456
##   Northern Africa                   0             0    342                  0
##   Northern America                  0             0      0                  0
##   Northern Europe                   0             0      0                  0
##   Polynesia                         0             0    171                  0
##   South America                     0           684      0                  0
##   South-Eastern Asia                0             0    570                  0
##   Southern Africa                   0             0      0                285
##   Southern Asia                     0             0    456                  0
##   Southern Europe                   0             0      0                  0
##   Western Africa                    0             0      0                912
##   Western Asia                      0             0   1026                  0
##   Western Europe                    0             0      0                  0
##                            
##                             West
##   Australia and New Zealand  114
##   Caribbean                    0
##   Central America              0
##   Central Asia                 0
##   Eastern Africa               0
##   Eastern Asia                 0
##   Eastern Europe               0
##   Melanesia                    0
##   Micronesia                   0
##   Middle Africa                0
##   Northern Africa              0
##   Northern America           171
##   Northern Europe            570
##   Polynesia                    0
##   South America                0
##   South-Eastern Asia           0
##   Southern Africa              0
##   Southern Asia                0
##   Southern Europe            684
##   Western Africa               0
##   Western Asia                 0
##   Western Europe             399
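
A stripped-down sketch of how case_when() evaluates its conditions, using a handful of hypothetical region/continent pairs rather than the full data set (it assumes dplyr is installed):

```r
library(dplyr)

# Hypothetical inputs, chosen to hit each branch below
region <- c("Western Europe", "Eastern Asia", "Western Africa",
            "Northern Africa", "Polynesia")
continent <- c("Europe", "Asia", "Africa", "Africa", "Oceania")

# Conditions are checked top to bottom; the first TRUE one wins
group <- case_when(
  region %in% c("Western Europe", "Northern America") ~ "West",
  region == "Eastern Asia" ~ "East Asia",
  continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
  TRUE ~ "Others"  # fallback for everything not matched above
)

data.frame(region, group)
```

Anything that matches no earlier condition, including a misspelled region name, falls through to the TRUE ~ "Others" line, so a cross-tabulation like the one above is a useful check.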

Next, turn the variable group into a factor so that we can control the order of its levels:

gapminder <- gapminder %>% 
  mutate(group = factor(group, levels = c("Others", "Latin America", "East Asia", "Sub-Saharan Africa", "West")))
table(gapminder$region, gapminder$group)
##                            
##                             Others Latin America East Asia Sub-Saharan Africa
##   Australia and New Zealand      0             0         0                  0
##   Caribbean                      0           741         0                  0
##   Central America                0           456         0                  0
##   Central Asia                 285             0         0                  0
##   Eastern Africa                 0             0         0                912
##   Eastern Asia                   0             0       342                  0
##   Eastern Europe               570             0         0                  0
##   Melanesia                    285             0         0                  0
##   Micronesia                   114             0         0                  0
##   Middle Africa                  0             0         0                456
##   Northern Africa              342             0         0                  0
##   Northern America               0             0         0                  0
##   Northern Europe                0             0         0                  0
##   Polynesia                    171             0         0                  0
##   South America                  0           684         0                  0
##   South-Eastern Asia           570             0         0                  0
##   Southern Africa                0             0         0                285
##   Southern Asia                456             0         0                  0
##   Southern Europe                0             0         0                  0
##   Western Africa                 0             0         0                912
##   Western Asia                1026             0         0                  0
##   Western Europe                 0             0         0                  0
##                            
##                             West
##   Australia and New Zealand  114
##   Caribbean                    0
##   Central America              0
##   Central Asia                 0
##   Eastern Africa               0
##   Eastern Asia                 0
##   Eastern Europe               0
##   Melanesia                    0
##   Micronesia                   0
##   Middle Africa                0
##   Northern Africa              0
##   Northern America           171
##   Northern Europe            570
##   Polynesia                    0
##   South America                0
##   South-Eastern Asia           0
##   Southern Africa              0
##   Southern Asia                0
##   Southern Europe            684
##   Western Africa               0
##   Western Asia                 0
##   Western Europe             399

Now that we have 5 groups, we can compare their distributions (as boxplots)

p <- gapminder %>% 
  filter(year == 1962 & !is.na(dollars_per_day)) %>% 
  ggplot(aes(group, dollars_per_day)) + 
    geom_boxplot() + 
    theme(axis.text.x = element_text(angle=90, hjust=1)) + 
    scale_y_continuous(trans="log2") +
    xlab("")
p 

Now let’s show the data as well!

p + geom_point(alpha=0.5)

Interpretation: The West has a small range, East Asia has a large range, . . .

Ridge Plots

If we have so much data that there is over-plotting, showing the data can be counterproductive.

Ridge plots are stacked smooth densities or histograms. We will use the ggridges package.

p2 <- gapminder %>% 
  filter(year == 1962 & !is.na(dollars_per_day)) %>% 
  ggplot(aes(dollars_per_day, group)) + 
    scale_x_continuous(trans="log2")
p2

What are the differences here compared to the last code for this graph?

p2 + geom_density_ridges(scale = 2)
## Picking joint bandwidth of 0.64

scale controls how much the ridges overlap: scale = 1 means no overlap, and larger values such as scale = 2 allow overlap.

Rug representation

We can also show the data using jittered points (left) or rug representation (right).

Note: the jittered points plot can be confusing because the height of the points is not indicative of anything. See the text: section 10.7.2 for code

Comparing 1962 with 2010

How might we want to compare income distributions across time to see if there are still 2 groups?

Potentially histograms? Box plots? Density plots?

We could categorize the countries into “West” and “not West”.

What does the following code do?

gapminder %>% 
  filter(year %in% c(1962, 2010) & !is.na(dollars_per_day)) %>% 
  mutate(west=ifelse(group == "West", "West", "not West")) %>% 
  ggplot(aes(dollars_per_day)) + 
    geom_histogram(binwidth = 1, color="black") + 
    scale_x_continuous(trans="log2") + 
    facet_grid(year~ west)

ifelse() creates a new column called west: if the current value of group is West, the new column gets the value West; if not, it gets not West. The lowercase west is the name of the column.

ifelse() needs three parts. The first part is the “if” test, the second part is what to do if that is TRUE, and the third part is what to do if that is FALSE.
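
The three parts can be seen on a short made-up vector (the group values here are hypothetical stand-ins for the column in gapminder):

```r
# Hypothetical values of the group column
group <- c("West", "East Asia", "Latin America", "West")

# test, value where the test is TRUE, value where it is FALSE
west <- ifelse(group == "West", "West", "not West")
west
```

ifelse() is vectorized, so it applies the test to every element and returns a vector of the same length.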

intersect

There are more data points in 2010 than in 1962, so we may want to plot only those countries for which we have data in both years (to keep the total number the same).

country_list_1962 <- gapminder %>% 
  filter(year== 1962 & !is.na(dollars_per_day)) %>% 
  pull(country)
country_list_2010 <- gapminder %>% 
  filter(year == 2010 & !is.na(dollars_per_day)) %>% 
  pull(country)
country_list <- intersect(country_list_1962, country_list_2010)
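
intersect() keeps only the elements that appear in both vectors. A small sketch with made-up country lists (the names are placeholders, not the actual contents of the two lists):

```r
# Hypothetical country lists for two years
countries_early <- c("Chile", "Ghana", "Japan")
countries_late <- c("Ghana", "Japan", "Latvia", "Chile")

# Keep only countries present in both lists
both_years <- intersect(countries_early, countries_late)
both_years

# Latvia is dropped because it has no early-year entry
"Latvia" %in% both_years
```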

How does the code from the previous slide change to incorporate only these countries?

gapminder %>% 
  filter(year %in% c(1962, 2010) & country %in% country_list) %>% 
  mutate(west=ifelse(group == "West", "West", "not West")) %>% 
  ggplot(aes(dollars_per_day)) + 
    geom_histogram(binwidth = 1, color="black") + 
    scale_x_continuous(trans="log2") + 
    facet_grid(year~ west)

Boxplots

Now, let’s create boxplots that compare 1962 to 2010. Here is our code from before. How do we have to modify it?

p <- gapminder %>% 
  filter(year == 1962 & !is.na(dollars_per_day)) %>% 
  ggplot(aes(group, dollars_per_day)) + 
    geom_boxplot() +
    theme(axis.text.x = element_text(angle=90, hjust=1)) + 
    scale_y_continuous(trans="log2") + 
    xlab("")
p

Up to 3 changes:
  1. year %in% c(1962, 2010)
  2. facet_grid(. ~ year)
  3. If you want to limit to countries that have data in both years, add: country %in% country_list

This is the code we had before, changed to create the plots side by side. We need facet_grid and both years. Remember that facet_grid takes the rows, then ~, then the columns.

gapminder %>% 
  filter(year %in% c(1962, 2010) & country %in% country_list) %>% 
  ggplot(aes(group, dollars_per_day)) + 
    geom_boxplot() + 
    theme(axis.text.x = element_text(angle=90, hjust=1)) + 
    scale_y_continuous(trans="log2") + 
    xlab("") + 
    facet_grid(. ~year)

But this makes it a bit hard to compare dollars_per_day. It would be nicer if the boxes for each region (in the 2 years) were next to each other.

Instead of faceting

We can color (fill) the boxes depending on year.

Before we can fill based on year, we need to convert year to a factor so that R can assign a color to each level.

To group them together, we get rid of facet_grid and put fill = year in aes().

gapminder %>% 
  filter(year %in% c(1962, 2010) & country %in% country_list) %>% 
  mutate(year = factor(year)) %>% 
  ggplot(aes(group, dollars_per_day, fill=year)) + 
    geom_boxplot() + 
    theme(axis.text.x = element_text(angle=90, hjust=1)) + 
    scale_y_continuous(trans="log2") + 
    xlab("")  

Density plots

Let’s examine our histogram code again. What changes would we need to make to create density plots (shown below)?

gapminder %>% 
  filter(year %in% c(1962, 2010) & country %in% country_list) %>%
  mutate(west=ifelse(group == "West", "West", "not West")) %>% 
  ggplot(aes(dollars_per_day)) + 
    geom_histogram(binwidth = 1, color="black") + 
    scale_x_continuous(trans="log2") + 
    facet_grid(year~ west)

Changes:
  1. Remove geom_histogram()
  2. Add geom_density(alpha = 0.2)
  3. If we want the two groups of countries to be on the same graph, add fill = west to aes() in the ggplot function and use facet_grid(year ~ .), since the panels are no longer separated by “West vs. not West”

But . . . a density curve is scaled so that its area is 1, while the number of countries is not the same in these two groups, so the curves hide the difference in group size.

Instead of using “density”, use “count” on the y-axis

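From the help file, we can see that we can access the computed variable “count” in geom_density.

To access these computed variables, we surround the name with 2 dots on each side: aes(x=dollars_per_day, y=..count..). As the warning after the code notes, ggplot2 3.4.0 and later prefer after_stat(count).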

gapminder %>% 
  filter(year %in% c(1962, 2010) & country %in% country_list) %>% 
  mutate(west=ifelse(group == "West", "West", "not West")) %>% 
  ggplot(aes(dollars_per_day, y=..count.., fill=west)) + 
    geom_density(alpha = 0.2) + 
    scale_x_continuous(trans="log2") + 
    facet_grid(year~ .)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
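
Following the warning above, the same mapping can be written with after_stat(). The sketch below uses a made-up data frame (toy, with invented values) rather than gapminder so that it runs on its own; it assumes ggplot2 3.3.0 or later, where after_stat() is available.

```r
library(ggplot2)

# Hypothetical data: two groups of unequal size, values invented
set.seed(1)
toy <- data.frame(
  dollars_per_day = c(2^rnorm(60, mean = 1), 2^rnorm(20, mean = 5)),
  west = rep(c("not West", "West"), times = c(60, 20))
)

# after_stat(count) replaces the deprecated ..count.. notation;
# count is density times the number of observations, so the larger
# group gets a larger area
p_count <- ggplot(toy, aes(dollars_per_day, y = after_stat(count), fill = west)) +
  geom_density(alpha = 0.2) +
  scale_x_continuous(trans = "log2")
p_count
```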

Density plot by region

We may want to know how individual regions influence this shift over time.

Here was our previous ridge code. What needs to change?

p2 <- gapminder %>% filter(year == 1962 & !is.na(dollars_per_day)) %>% 
  ggplot(aes(dollars_per_day, group)) + 
    scale_x_continuous(trans="log2") 
p2 + geom_density_ridges(scale = 2)
## Picking joint bandwidth of 0.64

Changes: 1) year %in% c(1962, 2010) 2) country %in% country_list

Stacking Densities

Let’s examine this code:

gapminder %>% 
  filter(year %in% c(1962,2010) & country %in% country_list) %>% 
  group_by(year) %>% 
  mutate(weight = population/sum(population)) %>% 
  ungroup() %>% 
  ggplot(aes(dollars_per_day, fill=group, weight = weight)) + 
    scale_x_continuous(trans="log2", limit = c(0.05, 300)) + 
    geom_density(alpha = 0.2, position = "stack", bw=0.75) + 
    facet_grid(year ~.)
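
The group_by() / mutate() step gives each country a weight equal to its share of that year's total population, so the weights within a year sum to 1. Below is a base R sketch of the same calculation on made-up numbers; ave() stands in for the grouped mutate, and the data are invented.

```r
# Hypothetical populations for three countries in two years
toy <- data.frame(
  year = c(1962, 1962, 1962, 2010, 2010, 2010),
  population = c(10, 30, 60, 20, 30, 50)
)

# Share of each year's total population, computed within year
toy$weight <- toy$population / ave(toy$population, toy$year, FUN = sum)
toy

# Within each year the weights sum to 1
tapply(toy$weight, toy$year, sum)
```

Passing these weights to geom_density() makes populous countries count for more in the density estimate than small ones.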

Ch 11: Data Visualization Principles

Start Audio recording 230207_002 at 1:09:00

Don’t use pie charts

Never use pie charts. Barplots and tables are always better.

“…humans are not good at precisely quantifying angles and are even worse when area is the only available visual cue.”

If for some reason you need to make a pie chart, label each pie slice with its respective percentage.

Include zero when presenting bar charts

The axis does not start at 0. Judging by the length of the bars, border apprehensions appear to have doubled when they had not. A truncated axis makes differences look larger than they actually are.

“When using position rather than length, it is then not necessary to include 0.” Here they are comparing the differences between groups relative to the within-group variability.
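
A quick calculation shows how much a truncated axis can exaggerate a difference between bars. The numbers below are hypothetical and are not the border-apprehension data from the slide.

```r
# Hypothetical values for two bars
values <- c(a = 100, b = 120)

# Ratio a reader should perceive
true_ratio <- values[["b"]] / values[["a"]]

# If the axis starts at 90 instead of 0, only the part above 90 is drawn
axis_start <- 90
visible <- values - axis_start
apparent_ratio <- visible[["b"]] / visible[["a"]]

print(true_ratio)
print(apparent_ratio)
```

Because bar length is the visual cue, the second bar looks several times longer even though the underlying value is only 20% larger.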

When using position (not length)

It is not necessary to include 0

Don’t distort quantities

Ordering plots

Alphabetical order has nothing to do with the disease; by ordering according to the actual rate, we quickly see the states with the highest and lowest rates. Plots ordered by value (or by a meaningful category) are easier to read and interpret. The “reorder” function lets us plot values in order.
To create this chart use this code:

data(murders)
murders %>% mutate(murder_rate = total / population * 100000) %>% 
  mutate(state=reorder(state, murder_rate)) %>% 
  ggplot(aes(state, murder_rate)) + geom_bar(stat="identity") + 
  coord_flip() + 
  theme(axis.text.y = element_text(size=6)) + 
  xlab("")
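
What reorder does can be seen on a small factor. The sketch below uses three made-up states and rates (hypothetical values, not the murders data) to show that the factor levels, which control plotting order, change from alphabetical to ordered by value.

```r
# Hypothetical rates for three made-up states
state <- factor(c("A", "B", "C"))
rate  <- c(3.2, 0.9, 1.7)

# Levels are alphabetical by default
print(levels(state))

# reorder() sorts the levels by the associated value (increasing)
state_ordered <- reorder(state, rate)
print(levels(state_ordered))
```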

Show the Data

In the past, we may have created plots like this, showing the mean with the standard error, but plots that show the data are much more informative.

heights %>% 
  ggplot(aes(sex, height)) + 
  geom_point()

If we want to “jitter” our points, we can update the code to:

heights %>% 
  ggplot(aes(sex, height)) + 
  geom_jitter(width = 0.1, alpha = 0.2)
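
Jittering adds a small random shift to each point so that overlapping points become visible. A minimal base R sketch of the horizontal part, assuming uniform noise of at most 0.1 in either direction (which is how the width argument is documented); the vector x is a made-up stand-in for the x position of one category:

```r
set.seed(42)
# Ten points that would otherwise be drawn at the same x position
x <- rep(1, 10)

# Shift each point by uniform noise between -0.1 and 0.1
x_jittered <- x + runif(length(x), min = -0.1, max = 0.1)

print(round(x_jittered, 3))
```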

It may be hard to compare these distributions. Why?

1) The axes have different ranges – the “Male” range extends past 80
2) The plots are side by side, but if they were arranged vertically, we could better compare horizontal changes.

heights %>% 
  ggplot(aes(height, ..density..)) + 
  geom_histogram(binwidth = 1, color="black") + 
  facet_grid(sex~.)

Note the computed variable ..density.. (spelled after_stat(density) from ggplot2 3.4.0 onward).

When values differ by an order of magnitude or more, it may make sense to show the comparison on a log scale.
We can encode categorical variables with color and shape. The shapes can be controlled with the shape argument.

# Note: this column is named OPEC, but it is filled from group == "West"
gapminder <- gapminder %>% 
  mutate(OPEC = ifelse(group == "West", "Yes", "No"))

gapminder %>% 
  filter(year == 2010 & !is.na(dollars_per_day) & 
           !is.na(infant_mortality)) %>% 
  ggplot(aes(dollars_per_day, 1-infant_mortality/1000, col=region, shape=OPEC, size=population)) + 
  geom_point() + 
  scale_x_continuous(trans="log2")

logit
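
The heading above presumably refers to the logit transformation, log(p / (1 - p)), which spreads out proportions that sit close to 0 or 1, such as the survival proportion 1-infant_mortality/1000 plotted earlier. This reading is an assumption about what the slide covered. A minimal sketch with made-up proportions (the logit helper is defined here for illustration):

```r
# Hypothetical survival proportions close to 1
p <- c(0.90, 0.99, 0.999)

# Logit (log-odds) transformation written out by hand
logit <- function(p) log(p / (1 - p))

print(logit(p))

# Base R's qlogis() computes the same quantity
print(all.equal(logit(p), qlogis(p)))
```

In a ggplot2 plot the same idea would likely be applied through the scale, for example scale_y_continuous(trans = "logit").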

Avoid pseudo-three-dimensional plots

We are bad at seeing in three dimensions, particularly on a 2-D screen. Refer to the pie chart of frequency of use. The whole point of a graph is to communicate something.

Rounding in tables

Avoid too many significant digits
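
In R, signif() and round() are the usual tools for trimming digits before values go into a table. A small sketch with made-up numbers (hypothetical, not from any dataset in these notes):

```r
# Hypothetical summary values with more digits than a reader needs
x <- c(1234.5678, 12.345678, 0.012345678)

# Keep two significant digits
x_signif <- signif(x, 2)

# Keep one decimal place
x_round <- round(x, 1)

print(x_signif)
print(x_round)
```

signif() is often the better fit when the values span several orders of magnitude, because round() with a fixed number of decimals wipes out the small values.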

Vaccines have helped save millions of lives.

Prior to the vaccination programs, deaths from infectious diseases (like polio and smallpox) were common. Let’s examine rates of disease before & after the initiation of vaccination programs.

library(RColorBrewer)
data(us_contagious_diseases)
names(us_contagious_diseases)
## [1] "disease"         "state"           "year"            "weeks_reporting"
## [5] "count"           "population"

Compute the measles rate per 100,000 people, removing Alaska & Hawaii and adjusting for weeks_reporting.

the_disease <- "Measles"
dat <- us_contagious_diseases %>% 
  filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>% 
  mutate(rate = count / population * 100000 * 52 / weeks_reporting) %>% 
  mutate(state = reorder(state, rate))
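
The rate formula has two parts: cases per 100,000 people, and a scaling factor of 52 / weeks_reporting that extrapolates partial-year reports to a full year. A worked example with made-up numbers (hypothetical, not taken from us_contagious_diseases):

```r
# Hypothetical state-year: 500 cases reported over 26 weeks,
# in a population of 1,000,000
count           <- 500
population      <- 1e6
weeks_reporting <- 26

# Cases per 100,000 people during the weeks that were reported
raw_rate <- count / population * 100000

# Scale up to a full 52-week year
rate <- raw_rate * 52 / weeks_reporting

print(raw_rate)
print(rate)
```

With only half the year reported, the adjusted rate is twice the raw rate.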

Let’s plot the disease rates per year for California

dat %>% 
  filter(state == "California" & !is.na(rate)) %>% 
  ggplot(aes(year, rate)) + 
  geom_line() + 
  ylab("Cases per 100,000") + 
  geom_vline(xintercept=1963, col = "blue")

The CDC introduced the measles vaccine in 1963; the blue vertical line marks that year.

dat %>% 
  ggplot(aes(year, state, fill = rate)) + 
  geom_tile(color = "grey50") + 
  scale_x_continuous(expand = c(0,0)) + 
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") + 
  geom_vline(xintercept = 1963, col = "blue") + 
  theme_minimal() + 
  theme(panel.grid = element_blank(), 
        legend.position="bottom", text = element_text(size = 8)) + 
  labs(title = the_disease, x = "", y = "")

expand = c(0,0) removes the padding around the x-axis so there isn’t a gap between the state names and the plot. xintercept sets where geom_vline draws the vertical line (here, 1963). More details on ColorBrewer palettes: https://rdrr.io/cran/RColorBrewer/man/ColorBrewer.html

colnames(dat)
## [1] "disease"         "state"           "year"            "weeks_reporting"
## [5] "count"           "population"      "rate"

avg <- us_contagious_diseases |>
  filter(disease==the_disease) |> group_by(year) |>
  # use the same per-100,000 scale as dat$rate
  summarize(us_rate = sum(count, na.rm = TRUE) / 
              sum(population, na.rm = TRUE) * 100000)
dat |> 
  filter(!is.na(rate)) |>
  ggplot() +
  # linewidth replaces the size aesthetic for lines (ggplot2 3.4.0 and later)
  geom_line(aes(year, rate, group = state),  color = "grey50", 
            show.legend = FALSE, alpha = 0.2, linewidth = 1) +
  geom_line(mapping = aes(year, us_rate),  data = avg, linewidth = 1) +
  ggtitle("Cases per 100,000 by state") + 
  xlab("") + ylab("") +
  geom_text(data = data.frame(x = 1955, y = 500), 
            mapping = aes(x, y, label="US average"), 
            color="black") + 
  geom_vline(xintercept=1963, col = "blue") +
  scale_y_continuous(trans = "sqrt", breaks = c(50, 250, 1250, 3000))
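
Note that the US average in the code above is a pooled rate: total cases divided by total population, not the mean of the state rates. The two can differ a lot when states vary in size. A sketch with two made-up states (hypothetical numbers; the per-100,000 multiplier is only a choice of unit):

```r
# Hypothetical states with very different populations
count      <- c(small = 50,  large = 1000)
population <- c(small = 1e5, large = 1e7)

state_rates <- count / population * 100000

# Pooled rate: total cases over total population,
# the same form as the summarize() call for the US average
pooled_rate <- sum(count) / sum(population) * 100000

# Unweighted mean of the state rates, for comparison
mean_of_rates <- mean(state_rates)

print(state_rates)
print(pooled_rate)
print(mean_of_rates)
```

The pooled rate stays close to the rate of the large state, while the unweighted mean is pulled up by the small one.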

Reference

Irizarry, Rafael A. 2019. Introduction to Data Science. Chapman and Hall/CRC. https://doi.org/10.1201/9780429341830.