1. When to summarize?

When designing any visualization, think about the story you want to tell. What is the most important point you are trying to make? Find that idea, and design your visualization around it.

Often, you will need to change or summarize your data before it is in the format you need to make that point.

For example, imagine we wanted to plot the total US population over time by using the state population dataset from last time:

library(tidyverse)
state <- read_csv("state_population.csv")
state

## # A tibble: 6,020 x 5
##    state  year population region after2000
##    <chr> <dbl>      <dbl> <chr>  <lgl>    
##  1 AK     1950     135000 West   FALSE    
##  2 AK     1951     158000 West   FALSE    
##  3 AK     1952     189000 West   FALSE    
##  4 AK     1953     205000 West   FALSE    
##  5 AK     1954     215000 West   FALSE    
##  6 AK     1955     222000 West   FALSE    
##  7 AK     1956     224000 West   FALSE    
##  8 AK     1957     231000 West   FALSE    
##  9 AK     1958     224000 West   FALSE    
## 10 AK     1959     224000 West   FALSE    
## # … with 6,010 more rows

We could plot every state at once, but this would not tell us what the total US population was by year. Viewers would have to mentally add up the points:

state %>%
  ggplot(aes(x = year, y = population)) + 
    geom_point()

What if we want to know the total population by year? Well, this is what the functions group_by() and summarise() are for. When used together, group_by() defines a group and summarise() will perform a calculation within each of those groups. For example, this code will run the sum() function on the population column within each group (year):

# calculate total (sum) population by year
pop_by_year <- state %>%
  group_by(year) %>%
  summarise(pop = sum(population))

## `summarise()` ungrouping output (override with `.groups` argument)

pop_by_year

## # A tibble: 120 x 2
##     year      pop
##    <dbl>    <dbl>
##  1  1900 76095000
##  2  1901 77588000
##  3  1902 79160000
##  4  1903 80631000
##  5  1904 82165000
##  6  1905 83818000
##  7  1906 85439000
##  8  1907 87001000
##  9  1908 88706000
## 10  1909 90490000
## # … with 110 more rows
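
To see exactly what group_by() and summarise() are doing, it can help to try them on a dataset small enough to check by hand. The sketch below uses a made-up tibble named toy (it is not part of the state population data, and the numbers are illustrative only). It also uses n(), which counts the rows in each group:

```r
library(tidyverse)

# a tiny made-up dataset: two states, two years (values are illustrative only)
toy <- tibble(
  state = c("A", "B", "A", "B"),
  year = c(2000, 2000, 2001, 2001),
  population = c(10, 20, 11, 22)
)

# one row per year: the total population and the number of rows in that group
toy_by_year <- toy %>%
  group_by(year) %>%
  summarise(pop = sum(population),
            n_states = n())

toy_by_year
```

For the year 2000, pop should be 10 + 20 = 30, which is easy to confirm by looking at the four rows of toy.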

Now that we have that object, we can plot total US population by year directly:

pop_by_year %>%
  ggplot(aes(x = year, y = pop)) + 
    geom_col() + 
    labs(x = "Year", 
         y = "Population",
         title = "US Population by Year") + 
    theme_bw()

# coord_flip() will flip axes
pop_by_year %>%
  ggplot(aes(x = year, y = pop)) + 
    geom_col() + 
    labs(x = "Year", 
         y = "Population",
         title = "US Population by Year") + 
    theme_bw() + 
    coord_flip()

# a "lollipop" plot 
# combines straight line segment with point
pop_by_year %>%
  ggplot(aes(x = year, y = pop)) + 
    geom_segment(aes(x = year, xend = year, y = 0, yend = pop), col = "grey") + 
    geom_point() + 
    labs(x = "Year", 
         y = "Population",
         title = "US Population by Year") + 
    theme_bw()

You can also define groups by multiple variables. For example, grouping by region AND year calculates the population of each region in each year, such as the population of the US South in 1932:

region_year <- state %>%
  group_by(region, year) %>%
  summarise(pop = sum(population))

## `summarise()` regrouping output by 'region' (override with `.groups` argument)

region_year

## # A tibble: 480 x 3
## # Groups:   region [4]
##    region   year      pop
##    <chr>   <dbl>    <dbl>
##  1 Midwest  1900 26359000
##  2 Midwest  1901 26722000
##  3 Midwest  1902 27126000
##  4 Midwest  1903 27446000
##  5 Midwest  1904 27830000
##  6 Midwest  1905 28203000
##  7 Midwest  1906 28524000
##  8 Midwest  1907 28868000
##  9 Midwest  1908 29187000
## 10 Midwest  1909 29530000
## # … with 470 more rows
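
The message printed by summarise() says that the result is still grouped by region. If you would rather get back an ungrouped tibble (and silence the message), summarise() accepts a .groups argument in dplyr 1.0.0 and later. Here is a minimal sketch on a made-up tibble named toy_regions (the values are invented for illustration):

```r
library(tidyverse)

# made-up data: two rows per region, all in the same year
toy_regions <- tibble(
  region = c("South", "South", "West", "West"),
  year = c(1932, 1932, 1932, 1932),
  population = c(5, 7, 3, 4)
)

# .groups = "drop" returns an ungrouped tibble
region_totals <- toy_regions %>%
  group_by(region, year) %>%
  summarise(pop = sum(population), .groups = "drop")

region_totals

# pull out a single group, such as the South in 1932
region_totals %>%
  filter(region == "South", year == 1932)
```

The same filter() call should work on the real region_year object to look up one region in one year.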

region_year %>%
  ggplot(aes(x = year, y = pop)) + 
    geom_segment(aes(x = year, xend = year, y = 0, yend = pop), col = "grey") + 
    geom_point(size = 0.5) + 
    labs(x = "Year", 
         y = "Population",
         title = "US Population by Region and Year") + 
    facet_wrap(~region) + 
    theme_linedraw()

# you can also use geom_col here
region_year %>%
  ggplot(aes(x = year, y = pop)) + 
    geom_col() + 
    labs(x = "Year", 
         y = "Population",
         title = "US Population by Region and Year") + 
    facet_wrap(~region) + 
    theme_linedraw()

Exercises

Here is a dataset on historical World Cup results from 1930-2006. Download it, create a new folder named data, and save the file in that folder. Read the dataset into an object called cups.

cups <- read_csv("data/world_cups.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   year = col_double(),
##   team = col_character(),
##   scored = col_double(),
##   conceded = col_double(),
##   penalties = col_double(),
##   matches = col_double(),
##   shots_on_goal = col_double(),
##   shots_wide = col_double(),
##   free_kicks = col_double(),
##   offside = col_double(),
##   corners = col_double(),
##   won = col_double(),
##   drawn = col_double(),
##   lost = col_double(),
##   wc_winner = col_logical()
## )

Look at the data with View(). Using group_by() and summarise(), calculate the total number of goals that each team has scored across all World Cups. Save this to an object named scored.

scored <- cups %>%
  group_by(team) %>%
  summarise(goals = sum(scored))

## `summarise()` ungrouping output (override with `.groups` argument)

Select only teams that have scored at least 50 goals in total from the scored object. Save this in an object called best_teams.

# >= keeps teams with exactly 50 goals too ("at least 50")
best_teams <- scored %>% filter(goals >= 50)

Design a plot using best_teams.

# reorder(x, order_variable) will sort the columns (bars) by order_variable
best_teams %>%
  ggplot(aes(x = reorder(team, goals), y = goals)) + 
    geom_col() + 
    coord_flip()

Using groups for other statistics

The sum is one interesting measurement you will want to calculate, but it is certainly not the only one. The mean (average) is another. Thankfully, calculating the mean for the complete sample or within each group is very simple. All you need to do is change the function that you use in summarise().

# find average goals scored by team in 1930
cups %>%
  filter(year == 1930) %>%
  summarise(scored = mean(scored))

## # A tibble: 1 x 1
##   scored
##    <dbl>
## 1   5.38

# find average goals scored by team per match in 1930
cups %>%
  filter(year == 1930) %>%
  summarise(scored = mean(scored / matches))

## # A tibble: 1 x 1
##   scored
##    <dbl>
## 1   1.64

# find average goals scored by team in every year
cups %>%
  group_by(year) %>%
  summarise(scored = mean(scored))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 18 x 2
##     year scored
##    <dbl>  <dbl>
##  1  1930   5.38
##  2  1934   4.38
##  3  1938   5.6 
##  4  1950   6.77
##  5  1954   8.75
##  6  1958   7.88
##  7  1962   5.56
##  8  1966   5.56
##  9  1970   5.94
## 10  1974   6.06
## 11  1978   6.38
## 12  1982   6.08
## 13  1986   5.5 
## 14  1990   4.79
## 15  1994   5.88
## 16  1998   5.34
## 17  2002   5.03
## 18  2006   4.59

The median of a set of numbers is found by ordering them from smallest to largest and finding the value in the middle. Means and medians are both ways of estimating the center of a dataset, but they do have important differences in some cases. For example:

numbers <- c(1, 3, 7, 9, 2000)

# notice how the mean is sensitive to outliers
mean(numbers)

## [1] 404

# the median is not as sensitive
median(numbers)

## [1] 7
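
Because median() is an ordinary R function, it can be used inside summarise() just like sum() and mean(). The sketch below computes both statistics within each group, using a made-up tibble named scores rather than the World Cup data:

```r
library(tidyverse)

# made-up values: group "a" contains one large outlier
scores <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 3, 2000, 4, 5, 6)
)

# several statistics can be calculated in one summarise() call
center_by_group <- scores %>%
  group_by(group) %>%
  summarise(mean_value = mean(value),
            median_value = median(value))

center_by_group
```

In group "a" the outlier pulls the mean far above the median, while in group "b" the two statistics agree.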

2. Visualizing distributions

Statistics like sums, means, and medians are one way of summarizing data, but they still only tell you one number. Sometimes, you will want to visualize all of the values in a column or two of your dataset - the entire distribution.

Histograms

For example, a histogram will visualize all of the values in one column. Taller bars tell you that more values fall in that range:

# visualize shots on goal after 1950
cups %>%
  filter(year > 1950) %>%
  ggplot(aes(x = shots_on_goal)) + 
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# just like any other geom, can use with groups and facets
cups %>%
  filter(year > 1950) %>%
  ggplot(aes(x = shots_on_goal)) + 
    geom_histogram() + 
    facet_wrap(~year)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This tells you that most teams had between 0 and 10 shots on goal, while a few had a lot (more than 50). Notice the message you get when creating this plot: “stat_bin() using bins = 30. Pick better value with binwidth.” This means that ggplot() chose a default number of “bins” (rectangles) for the histogram, 30, but that you can override it with the binwidth argument, which sets how wide each bin is. The default will often do a good job, but you can choose a value that best fits your data.

# very small binwidth
cups %>%
  filter(year > 1950) %>%
  ggplot(aes(x = shots_on_goal)) + 
    geom_histogram(binwidth = 1)

# very large binwidth - not as informative!
cups %>%
  filter(year > 1950) %>%
  ggplot(aes(x = shots_on_goal)) + 
    geom_histogram(binwidth = 25)
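
Besides binwidth, geom_histogram() also accepts a bins argument, which sets the number of bins rather than their width. The sketch below uses simulated data so that it runs without the World Cup file; the shots tibble and its Poisson-distributed values are assumptions made up for illustration:

```r
library(tidyverse)

# simulated counts standing in for shots on goal (not real data)
set.seed(1)
shots <- tibble(shots_on_goal = rpois(200, lambda = 12))

# ask for 10 bins instead of the default 30
p_bins <- shots %>%
  ggplot(aes(x = shots_on_goal)) + 
    geom_histogram(bins = 10)

p_bins
```

Use whichever argument is easier to reason about for your data: binwidth when the units matter (for example, one bar per shot), and bins when you only care about how detailed the plot looks.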

Densities

Histograms are good for individual discrete values (like the number of shots or games won), but bins can be too coarse for continuous values. For example, the number of goals scored per match would have decimal values:

# histogram looks okay here, but you may want a smoother distribution
cups %>%
  filter(year > 1950) %>%
  # mutate() adds one per-match value for each team in each World Cup
  mutate(per_match = scored / matches) %>%
  ggplot(aes(x = per_match)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# geom_density() will try to smooth it out for you
cups %>%
  filter(year > 1950) %>%
  # mutate() adds one per-match value for each team in each World Cup
  mutate(per_match = scored / matches) %>%
  ggplot(aes(x = per_match)) +
    geom_density()

# can facet just like always
cups %>%
  group_by(year, team) %>%
  summarise(per_match = scored / matches) %>%
  ggplot(aes(x = per_match)) + 
    geom_density() + 
    facet_wrap(~year)

## `summarise()` regrouping output by 'year' (override with `.groups` argument)
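
How smooth a density curve looks can be tuned with the adjust argument of geom_density(), which multiplies the default smoothing bandwidth. The sketch below uses simulated values; the goals tibble and the gamma distribution behind it are made up for illustration and are not the World Cup data:

```r
library(tidyverse)

# simulated goals-per-match values (not real data)
set.seed(1)
goals <- tibble(per_match = rgamma(150, shape = 4, rate = 3))

# adjust above 1 gives a smoother curve, below 1 a more detailed one
p_smooth <- goals %>%
  ggplot(aes(x = per_match)) + 
    geom_density(adjust = 2)

p_smooth
```

Much like binwidth for histograms, there is no single correct value; try a few and keep the one that shows the shape of your data most honestly.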

Exercises

Read in the elections data from the other day. Then, create a histogram to visualize Democratic performance in all elections since and including 1992 (>= 1992) with one plot for each region.

For an extra challenge, try adding a vertical line to the plot at 50 with the geom_vline() function (don’t forget that you can read the documentation for a function with ?geom_vline, then scroll down to “Examples” to see code examples using the function).

Finally, explain the general trend that you see.

elections <- read_csv("pres_elections.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   state = col_character(),
##   abb = col_character(),
##   democrat = col_double(),
##   year = col_double(),
##   region = col_character()
## )

elections %>%
  filter(year >= 1992) %>%
  ggplot(aes(x = democrat)) + 
    geom_histogram(binwidth = 2) + 
    facet_wrap(~region) + 
    geom_vline(xintercept = 50, col = "blue", lty = "dashed") + 
    theme_bw()

Using only results from states in the South, create a density plot for democrat with one plot per year. Try adding a vertical line at 50 as described above.

Describe the general trend you see over time.

elections %>%
  filter(region == "South") %>%
  ggplot(aes(x = democrat)) + 
    geom_density() + 
    facet_wrap(~year) + 
    geom_vline(xintercept = 50, col = "blue", lty = "dashed") + 
    theme_bw()

Think about the trend you saw in Exercise 2. Clearly describe it, then discuss a way you might visualize that trend. If you have time, try designing a plot to do so.

Boxplots

Histograms and densities are great, but sometimes you want your visualization to clearly show particular values in your dataset, like the median. Boxplots do this by drawing the quartiles of your dataset.

What is a quartile? Quartiles split your data into four parts:

25% of the values in your dataset are less than the First Quartile.
50% of the values in your dataset are less than the Second Quartile (this is the median!).
75% of the values in your dataset are less than the Third Quartile.

For example:

# quantile() defaults to quartiles
seq(0, 100, by = 1) %>% quantile()

##   0%  25%  50%  75% 100% 
##    0   25   50   75  100

seq(0, 50, by = 1) %>% quantile()

##   0%  25%  50%  75% 100% 
##  0.0 12.5 25.0 37.5 50.0

elections %>%
  ggplot(aes(x = democrat)) + 
  geom_boxplot()

# put a variable on the other axis to draw one boxplot per group
# (adding fill = region inside aes() would also color the boxes)
elections %>%
  ggplot(aes(x = region, y = democrat)) + 
  geom_boxplot()

# works both ways! or use coord_flip()
elections %>%
  ggplot(aes(x = democrat, y = region)) + 
  geom_boxplot()

Now, let’s think about the trend you saw in Exercise 2. How could we use boxplots to visualize this?

# check the warning, what is it trying to tell us?  
# continuous x aesthetic -- did you forget aes(group=...)?
elections %>%
  filter(region == "South") %>%
  ggplot(aes(x = year, y = democrat)) + 
  geom_boxplot()

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

# trust the hints
elections %>%
  filter(region == "South") %>%
  ggplot(aes(x = year, y = democrat, group = year)) + 
  geom_boxplot() + 
  geom_hline(yintercept = 50, col = "blue", lty = "dashed")

# notice the large outlier in the South in later years
# how could we find out which point that is?
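
One possible way to answer that question is to filter down to a single year and sort the rows so the highest value comes first. The sketch below is only an illustration: toy_elections is a made-up tibble with the same columns as the elections data, and the states, values, and the year 2008 are all invented:

```r
library(tidyverse)

# made-up rows with the same columns as the elections data (values are not real)
toy_elections <- tibble(
  state = c("State A", "State B", "State C", "State D"),
  abb = c("AA", "BB", "CC", "DD"),
  democrat = c(41, 45, 92, 48),
  year = c(2008, 2008, 2008, 2008),
  region = c("South", "South", "South", "South")
)

# sort one year's results from highest to lowest; the outlier ends up on top
highest <- toy_elections %>%
  filter(region == "South", year == 2008) %>%
  arrange(desc(democrat))

highest
```

Running the same filter() and arrange() steps on the real elections object should reveal which state sits far above the rest.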

Boxplots use quartiles (since they split data into 4 groups), but other measurements like percentiles can be useful as well. Percentiles split your data into 100 parts, so the 13th percentile is greater than or equal to 13% of your values, the 99th percentile is greater than or equal to 99% of your values, etc.
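
quantile() can return any percentile you ask for through its probs argument, which takes proportions between 0 and 1. A small sketch on the whole numbers from 0 to 100:

```r
values <- seq(0, 100, by = 1)

# probs = 0.13 is the 13th percentile, probs = 0.99 is the 99th
pcts <- quantile(values, probs = c(0.13, 0.99))

pcts
```

Because values runs evenly from 0 to 100, the two percentiles here should come out as 13 and 99.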

Exercises

Using the World Cup cups dataset, create a boxplot of goals scored for every year.

cups %>%
  filter(year >= 1950) %>%
  ggplot(aes(x = year, y = scored, group = year)) + 
    geom_boxplot()

# this is technically right too - why is it maybe less desirable?
cups %>%
  filter(year >= 1950) %>%
  ggplot(aes(x = scored)) + 
    geom_boxplot() + 
    facet_wrap(~year)

Lesson 4: Summarizing Data

Tyler Simko

21 June, 2021
