1. When to summarize?

When designing any visualization, you need to think about the story you want to get across. What is the most important point you are trying to make? Find that idea, and design your visualization around it.

Often, you will need to reshape or summarize your data before it is in a format that makes that point clearly.

For example, imagine we wanted to plot the total US population over time by using the state population dataset from last time:

library(tidyverse)
state <- read_csv("state_population.csv")
state
## # A tibble: 6,020 x 5
##    state  year population region after2000
##    <chr> <dbl>      <dbl> <chr>  <lgl>    
##  1 AK     1950     135000 West   FALSE    
##  2 AK     1951     158000 West   FALSE    
##  3 AK     1952     189000 West   FALSE    
##  4 AK     1953     205000 West   FALSE    
##  5 AK     1954     215000 West   FALSE    
##  6 AK     1955     222000 West   FALSE    
##  7 AK     1956     224000 West   FALSE    
##  8 AK     1957     231000 West   FALSE    
##  9 AK     1958     224000 West   FALSE    
## 10 AK     1959     224000 West   FALSE    
## # … with 6,010 more rows

We could plot every state at once, but this would not show the total US population in each year. Viewers would have to mentally add up all of the points for a given year:

# One point per state per year; the yearly US total is not shown directly
state %>%
  ggplot(aes(x = year, y = population)) +
    geom_point()
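
One way to avoid that mental arithmetic is to sum the states within each year before plotting. The block below is a minimal sketch of that idea, not necessarily the approach used later in this lesson. To keep it self-contained it uses a tiny made-up table, `toy_state` (a hypothetical name with placeholder numbers), that shares the `state`, `year`, and `population` columns of the real data. The names `us_total` and `total_plot` are also just illustrative.

```r
library(tidyverse)

# Toy stand-in for `state`: made-up numbers, same column names as the real data
toy_state <- tibble(
  state      = c("AA", "AA", "BB", "BB"),
  year       = c(1950, 1951, 1950, 1951),
  population = c(100, 110, 200, 220)
)

# Collapse to one row per year by summing population across states
us_total <- toy_state %>%
  group_by(year) %>%
  summarize(population = sum(population))

# One value per year, so the total can be read straight off the y-axis
total_plot <- us_total %>%
  ggplot(aes(x = year, y = population)) +
    geom_line()

total_plot
```

With the real data, the same pipeline should work by starting from `state` instead of `toy_state`, assuming `population` has no missing values (otherwise `sum()` would need `na.rm = TRUE`).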