When designing any visualization, you need to think about the story you want to get across. What is the most important point you are trying to make? Find that idea, and design your visualization around it.
Often, you will need to make changes to your data before it is in the format you need to make the point that you want.
For example, imagine we wanted to plot the total US population over time by using the state population dataset from last time:
## # A tibble: 6,020 x 5
## state year population region after2000
## <chr> <dbl> <dbl> <chr> <lgl>
## 1 AK 1950 135000 West FALSE
## 2 AK 1951 158000 West FALSE
## 3 AK 1952 189000 West FALSE
## 4 AK 1953 205000 West FALSE
## 5 AK 1954 215000 West FALSE
## 6 AK 1955 222000 West FALSE
## 7 AK 1956 224000 West FALSE
## 8 AK 1957 231000 West FALSE
## 9 AK 1958 224000 West FALSE
## 10 AK 1959 224000 West FALSE
## # … with 6,010 more rows
We could plot every state at once, but this would not tell us what the total US population was by year. Viewers would have to mentally add up the points:
What if we want to know the total population by year? Well, this is what the functions group_by() and summarise() are for. When used together, group_by() defines a group and summarise() will perform a calculation within each of those groups. For example, this code will run the sum() function on the population column within each group (year):
# calculate total (sum) population by year
pop_by_year <- state %>%
group_by(year) %>%
summarise(pop = sum(population))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 120 x 2
## year pop
## <dbl> <dbl>
## 1 1900 76095000
## 2 1901 77588000
## 3 1902 79160000
## 4 1903 80631000
## 5 1904 82165000
## 6 1905 83818000
## 7 1906 85439000
## 8 1907 87001000
## 9 1908 88706000
## 10 1909 90490000
## # … with 110 more rows
Now that we have that object, we can plot total US population by year directly:
pop_by_year %>%
ggplot(aes(x = year, y = pop)) +
geom_col() +
labs(x = "Year",
y = "Population",
title = "US Population by Year") +
theme_bw()
# coord_flip() will flip axes
pop_by_year %>%
ggplot(aes(x = year, y = pop)) +
geom_col() +
labs(x = "Year",
y = "Population",
title = "US Population by Year") +
theme_bw() +
coord_flip()
# a "lollipop" plot
# combines straight line segment with point
pop_by_year %>%
ggplot(aes(x = year, y = pop)) +
geom_segment(aes(x = year, xend = year, y = 0, yend = pop), col = "grey") +
geom_point() +
labs(x = "Year",
y = "Population",
title = "US Population by Year") +
theme_bw()
You can also define groups by multiple variables, for example to calculate population by region and year. That lets you look up values such as the population of the US South in 1932:
## `summarise()` regrouping output by 'region' (override with `.groups` argument)
## # A tibble: 480 x 3
## # Groups: region [4]
## region year pop
## <chr> <dbl> <dbl>
## 1 Midwest 1900 26359000
## 2 Midwest 1901 26722000
## 3 Midwest 1902 27126000
## 4 Midwest 1903 27446000
## 5 Midwest 1904 27830000
## 6 Midwest 1905 28203000
## 7 Midwest 1906 28524000
## 8 Midwest 1907 28868000
## 9 Midwest 1908 29187000
## 10 Midwest 1909 29530000
## # … with 470 more rows
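The code that built the region_year table above is not shown. A minimal sketch of how such a table could be produced, assuming a two-variable group_by(). The state_toy tibble and its numbers are invented for illustration and are not the real dataset:

```r
library(dplyr)

# hypothetical toy data standing in for the state dataset (values are invented)
state_toy <- tibble(
  state      = c("AL", "GA", "AL", "GA", "OH", "OH"),
  year       = c(1931, 1931, 1932, 1932, 1931, 1932),
  population = c(100, 200, 110, 210, 300, 310),
  region     = c("South", "South", "South", "South", "Midwest", "Midwest")
)

# one row per region-year combination
# .groups = "drop" returns an ungrouped result and silences the regrouping message
region_year_toy <- state_toy %>%
  group_by(region, year) %>%
  summarise(pop = sum(population), .groups = "drop")

# look up a single region in a single year
south_1932 <- region_year_toy %>%
  filter(region == "South", year == 1932)
```

Note that the output above is still grouped by region, which suggests the original call did not set .groups. Dropping the groups is a choice made here to keep the result simple.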
region_year %>%
ggplot(aes(x = year, y = pop)) +
geom_segment(aes(x = year, xend = year, y = 0, yend = pop), col = "grey") +
geom_point(size = 0.5) +
labs(x = "Year",
y = "Population",
title = "US Population by Year") +
facet_wrap(~region) +
theme_linedraw()
# you can also use geom_col here
region_year %>%
ggplot(aes(x = year, y = pop)) +
geom_col() +
labs(x = "Year",
y = "Population",
title = "US Population by Year") +
facet_wrap(~region) +
theme_linedraw()
data, and save the file in that folder. Read the dataset into an object called cups.
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## year = col_double(),
## team = col_character(),
## scored = col_double(),
## conceded = col_double(),
## penalties = col_double(),
## matches = col_double(),
## shots_on_goal = col_double(),
## shots_wide = col_double(),
## free_kicks = col_double(),
## offside = col_double(),
## corners = col_double(),
## won = col_double(),
## drawn = col_double(),
## lost = col_double(),
## wc_winner = col_logical()
## )
View(). Using group_by() and summarise(), calculate the total number of goals that each team has scored across all World Cups. Save this to an object named scored.
## `summarise()` ungrouping output (override with `.groups` argument)
scored object. Save this in an object called best_teams.
best_teams.
# reorder(x, order_variable) will reorder the bars (the levels of x) by order_variable
best_teams %>%
ggplot(aes(x = reorder(team, goals), y = goals)) +
geom_col() +
coord_flip()
The sum is one interesting measurement you will want to calculate, but it is certainly not the only one. The mean (average) is another. Thankfully, calculating the mean for the complete sample or within each group is very simple. All you need to do is change the function that you use in summarise().
# find average goals scored by team in 1930
cups %>%
filter(year == 1930) %>%
summarise(scored = mean(scored))
## # A tibble: 1 x 1
## scored
## <dbl>
## 1 5.38
# find average goals scored by team per match in 1930
cups %>%
filter(year == 1930) %>%
summarise(scored = mean(scored / matches))
## # A tibble: 1 x 1
## scored
## <dbl>
## 1 1.64
# find average goals scored by team in every year
cups %>%
group_by(year) %>%
summarise(scored = mean(scored))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 18 x 2
## year scored
## <dbl> <dbl>
## 1 1930 5.38
## 2 1934 4.38
## 3 1938 5.6
## 4 1950 6.77
## 5 1954 8.75
## 6 1958 7.88
## 7 1962 5.56
## 8 1966 5.56
## 9 1970 5.94
## 10 1974 6.06
## 11 1978 6.38
## 12 1982 6.08
## 13 1986 5.5
## 14 1990 4.79
## 15 1994 5.88
## 16 1998 5.34
## 17 2002 5.03
## 18 2006 4.59
The median of a set of numbers is found by ordering them from smallest to largest and finding the value in the middle. Means and medians are both ways of estimating the center of a dataset, but they do have important differences in some cases. For example:
## [1] 404
## [1] 7
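The code behind the two numbers above is not shown. As a sketch with made-up values (not necessarily the original ones), a single large outlier is enough to pull the mean far away from the median:

```r
# hypothetical values: four small numbers and one very large outlier
x <- c(1, 5, 7, 9, 1998)

mean_x <- mean(x)      # the outlier pulls the mean far above most of the values
median_x <- median(x)  # the middle value is not affected by the outlier

print(mean_x)
print(median_x)
```

Here the mean is 404 even though four of the five values are below 10, while the median stays at 7. When a column has a few extreme values, the median is often the better description of a "typical" value.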
Statistics like sums, means, and medians are one way of summarizing data, but they still only tell you one number. Sometimes, you will want to visualize all of the values in a column or two of your dataset - the entire distribution.
For example, a histogram will visualize all of the values in one column. Taller bars tell you that more data fit in that range:
# visualize shots on goal in World Cups after 1950
cups %>%
filter(year > 1950) %>%
ggplot(aes(x = shots_on_goal)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# just like any other geom, can use with groups and facets
cups %>%
filter(year > 1950) %>%
ggplot(aes(x = shots_on_goal)) +
geom_histogram() +
facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This tells you that most teams had between 0 and 10 shots on goal, while a few had a lot (>50). Notice the message you get when creating this plot: stat_bin() using bins = 30. Pick better value with binwidth. This means that ggplot() fell back on its default of 30 “bins” (rectangles) for the histogram, but that you can override it with the binwidth argument. The default will often do a good job, but you can choose the value that best fits your data.
# very small binwidth
cups %>%
filter(year > 1950) %>%
ggplot(aes(x = shots_on_goal)) +
geom_histogram(binwidth = 1)
# very large binwidth - not as informative!
cups %>%
filter(year > 1950) %>%
ggplot(aes(x = shots_on_goal)) +
geom_histogram(binwidth = 25)
Histograms are good for individual discrete values (like the number of shots or games won), but bins can be too coarse for continuous values. For example, the number of goals scored per match would have decimal values:
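# histogram looks okay here, but you may want a smoother distribution
# note: summarise() returns several rows per team here; dplyr >= 1.1 deprecates
# this in favor of reframe() (mutate() would also work)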
cups %>%
filter(year > 1950) %>%
group_by(team) %>%
summarise(per_match = scored / matches) %>%
ggplot(aes(x = per_match)) +
geom_histogram()
## `summarise()` regrouping output by 'team' (override with `.groups` argument)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# geom_density() will try to smooth it out for you
cups %>%
filter(year > 1950) %>%
group_by(team) %>%
summarise(per_match = scored / matches) %>%
ggplot(aes(x = per_match)) +
geom_density()
## `summarise()` regrouping output by 'team' (override with `.groups` argument)
# can facet just like always
cups %>%
group_by(year, team) %>%
summarise(per_match = scored / matches) %>%
ggplot(aes(x = per_match)) +
geom_density() +
facet_wrap(~year)
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
elections data from the other day. Then, create a histogram to visualize Democratic performance in all elections since and including 1992 (>= 1992) with one plot for each region.
For an extra challenge, try adding a vertical line to the plot at 50 with the geom_vline() function (don’t forget that you can read the documentation for a function with ?geom_vline, then scroll down to “Examples” to see code examples using the function).
Finally, explain the general trend that you see.
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## state = col_character(),
## abb = col_character(),
## democrat = col_double(),
## year = col_double(),
## region = col_character()
## )
elections %>%
filter(year >= 1992) %>%
ggplot(aes(x = democrat)) +
geom_histogram(binwidth = 2) +
facet_wrap(~region) +
geom_vline(xintercept = 50, col = "blue", lty = "dashed") +
theme_bw()
democrat with one plot per year. Try adding a vertical line at 50 as described above.
Describe the general trend you see over time.
elections %>%
filter(region == "South") %>%
ggplot(aes(x = democrat)) +
geom_density() +
facet_wrap(~year) +
geom_vline(xintercept = 50, col = "blue", lty = "dashed") +
theme_bw()
Histograms and densities are great, but sometimes you want your visualization to clearly show particular values in your dataset, like the median. Boxplots do this by drawing the quartiles of your data.
What is a quartile? Quartiles split your data into four parts:
For example:
## 0% 25% 50% 75% 100%
## 0 25 50 75 100
## 0% 25% 50% 75% 100%
## 0.0 12.5 25.0 37.5 50.0
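The calls that produced this output are not shown. A sketch that gives the same numbers, assuming the inputs were the whole numbers from 0 to 100 and from 0 to 50:

```r
# with no other arguments, quantile() returns the minimum, the three quartiles,
# and the maximum (0%, 25%, 50%, 75%, 100%)
q_100 <- quantile(0:100)
q_50 <- quantile(0:50)

print(q_100)
print(q_50)
```

The 50% value is the median, and half of the data falls between the 25% and 75% values. That middle half is what the box in a boxplot covers.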
# you can use fill in aes() to color by a variable
elections %>%
ggplot(aes(x = region, y = democrat)) +
geom_boxplot()
# works both ways! or use coord_flip()
elections %>%
ggplot(aes(x = democrat, y = region)) +
geom_boxplot()
Now, let’s think about the trend you saw in Exercise 2. How could we use boxplots to visualize this?
# check the warning, what is it trying to tell us?
# continuous x aesthetic -- did you forget aes(group=...)?
elections %>%
filter(region == "South") %>%
ggplot(aes(x = year, y = democrat)) +
geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
# trust the hints
elections %>%
filter(region == "South") %>%
ggplot(aes(x = year, y = democrat, group = year)) +
geom_boxplot() +
geom_hline(yintercept = 50, col = "blue", lty = "dashed")
Boxplots use quartiles (which split data into 4 groups), but other measurements like percentiles can be useful as well. Percentiles split your data into 100 parts, so the 13th percentile is greater than or equal to 13% of your values, the 99th percentile is greater than or equal to 99% of your values, etc.
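As a small sketch of how percentiles can be computed in R, the quantile() function accepts a probs argument. The input 0:100 is just an illustrative choice that makes the results easy to check by eye:

```r
# probs takes proportions between 0 and 1, so the 13th percentile is probs = 0.13
pct <- quantile(0:100, probs = c(0.13, 0.50, 0.99))

print(pct)
```

For the whole numbers from 0 to 100, the 13th, 50th, and 99th percentiles come out as 13, 50, and 99. With real data such as the democrat column, the same call would return the values that sit at those positions in the sorted data.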