Data Viz: Chapter 4 - Histograms and Density Plots

Set Up Your Project and Load Libraries
Plots for Quantitative Data
Histograms and Density Plots for Groups

Set Up Your Project and Load Libraries

## Set the default size of figures and hide the R code by default
knitr::opts_chunk$set(echo = FALSE, 
                      fig.width = 8, 
                      fig.height = 5)  

## Load the libraries we will be using
pacman::p_load(tidyverse, skimr)

## Changing the default theme to black/white instead of grey
theme_set(theme_bw())

## Reading in the data: 
counties <- read.csv("us counties.csv")

Plots for Quantitative Data

Histogram of Population

If you want to make a histogram, the geom is geom_histogram() (not geom_hist(), which doesn’t exist).

geom_histogram() will default to creating 30 bins and will recommend you override how many bins to create.

You can specify how many bins to have in your histogram with:

bins = number: how many bars should be in the histogram
binwidth = number: how wide to make each bin
breaks = vector: a vector of bin edges (where each bar starts and stops), usually equally spaced

Create a histogram for population, one each way:

bins = 25
binwidth = 200000
breaks = seq(from = 0, to = 10000000, by = 250000)

Assign the fill and color aesthetics a good choice other than the default
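Here is a minimal sketch of the three approaches. The counties data frame below is simulated as a stand-in (the real CSV isn't read here), and the steelblue/white colors are just one reasonable choice, not the only good one.

```r
library(ggplot2)

# Hypothetical stand-in for the counties data: right-skewed populations
set.seed(1)
counties <- data.frame(
  population = round(10^pmin(rnorm(500, mean = 4.4, sd = 0.6), 6.9))
)

base_plot <- ggplot(counties, aes(x = population))

# 1) Fix the number of bins
p_bins <- base_plot +
  geom_histogram(bins = 25, fill = "steelblue", color = "white")

# 2) Fix the width of each bin
p_width <- base_plot +
  geom_histogram(binwidth = 200000, fill = "steelblue", color = "white")

# 3) Give the bin edges explicitly
p_breaks <- base_plot +
  geom_histogram(breaks = seq(from = 0, to = 10000000, by = 250000),
                 fill = "steelblue", color = "white")

p_bins
```

Note that fill and color are set outside aes() because they are fixed choices, not mapped to a column.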

It’s kind of hard to see what is going on in the histogram because population is extremely right skewed. Is there a way we can fix that?

Instead of mapping x = population, map x = log10(population) and see what happens:
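A sketch of that mapping, again using simulated populations in place of the real counties file (an assumption made so the code runs on its own):

```r
library(ggplot2)

# Hypothetical stand-in for the counties data
set.seed(1)
counties <- data.frame(
  population = round(10^pmin(rnorm(500, mean = 4.4, sd = 0.6), 6.9))
)

# The transformation happens inside aes(), so the axis shows log10 units
p_log <- ggplot(counties, aes(x = log10(population))) +
  geom_histogram(bins = 25, fill = "steelblue", color = "white")

p_log
```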

The histogram looks so much better! Unfortunately, the numbers on the x-axis aren’t very intuitive.

A 2 means the population is 100 (\(10^2\)), a 5 means the population is 100,000 (\(10^5\)), and a 7 means the population is 10,000,000!

Is there a way we can change the scale of x but keep the numbers the same?

Changing the numeric scale of a quantitative variable

Since we want to change the scale of the x aesthetic, we need the scale_ family of functions. But which one?

First, we need the aesthetic, x, and the type of variable, continuous, so we’ll use scale_x_continuous()!

The argument we need wasn’t covered in our earlier example code for the scale_ functions. The continuous members of the family have an argument trans = (short for transformation): give it the name of a math function and ggplot() will apply that transformation to the assigned aesthetic.

Try adding scale_x_continuous(trans = "log10") to one of the original histograms that assigned population to x, not log10(population)
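One way this could look, with simulated data standing in for the real counties file. The bars are computed on the log10 scale, but the axis labels stay in the original population units.

```r
library(ggplot2)

# Hypothetical stand-in for the counties data
set.seed(1)
counties <- data.frame(
  population = round(10^pmin(rnorm(500, mean = 4.4, sd = 0.6), 6.9))
)

# x is still population; the scale does the transforming.
# Assumption: in ggplot2 3.5.0 and later this argument is named `transform`,
# and `trans` is expected to still work there with a deprecation notice.
p_trans <- ggplot(counties, aes(x = population)) +
  geom_histogram(bins = 25, fill = "steelblue", color = "white") +
  scale_x_continuous(trans = "log10")

p_trans
```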

Since log transformations are a very common way of reshaping a variable, there is a little bit of a shortcut we can use. Instead of calling scale_x_continuous() and using trans = "log10", we can use

scale_x_log10()

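Note that this shortcut only exists for the position aesthetics: scale_x_continuous() can be replaced by scale_x_log10() and scale_y_continuous() by scale_y_log10(). For any other scale_aesthetic_continuous(), keep using the trans = argument if you want to transform the variable with log10()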

When using a transformation of the original variable, it is important to indicate it somewhere in the graph. A common place is the caption:
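A sketch with simulated data in place of the real counties file; the wording of the caption and axis label is only a suggestion.

```r
library(ggplot2)

# Hypothetical stand-in for the counties data
set.seed(1)
counties <- data.frame(
  population = round(10^pmin(rnorm(500, mean = 4.4, sd = 0.6), 6.9))
)

p_caption <- ggplot(counties, aes(x = population)) +
  geom_histogram(bins = 25, fill = "steelblue", color = "white") +
  scale_x_log10() +                       # shortcut for the log10 transformation
  labs(x = "Population",
       caption = "Population is displayed on a log10 scale")

p_caption
```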

Density plot of Population

A downside of histograms is we have to specify how many bins to use.

If we pick too many, it can highlight random features specific to the sample and not the larger population.

Too few and it can hide important characteristics that generalize to larger groups.

Try changing binwidth to 1000 or 1000000 and see what you get.

So how many bins should we have in our histogram? That’s a good question!

It would be nice if we could use a different plot that displays the same information without requiring us to make a choice.

Does such a plot exist? Of course, we wouldn’t mention it otherwise!

Instead of a histogram, we can build a density plot! It’s created much like a histogram: we swap geom_histogram() for geom_density() and don’t specify any of the bin arguments!
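A minimal sketch, assuming simulated populations instead of the real counties file and keeping the log10 scale from earlier:

```r
library(ggplot2)

# Hypothetical stand-in for the counties data
set.seed(1)
counties <- data.frame(
  population = round(10^pmin(rnorm(500, mean = 4.4, sd = 0.6), 6.9))
)

# No bins, binwidth, or breaks needed
p_dens <- ggplot(counties, aes(x = population)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  scale_x_log10()

p_dens
```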

So what’s the downside of a density plot? Trying to explain what density represents about the data! Try explaining what a value of 0.6 on the y-axis means.
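One way to think about it: the height of the curve is not a proportion by itself, but the area under the curve over an interval approximates the proportion of cases in that interval, and the total area is about 1. The sketch below checks this with base R's density() on simulated log10 populations (the data and the interval from 4 to 4.5 are assumptions for illustration).

```r
# Hypothetical log10 populations
set.seed(1)
log_pop <- pmin(rnorm(500, mean = 4.4, sd = 0.6), 6.9)

# Kernel density estimate on an evenly spaced grid of x values
d <- density(log_pop)
step <- d$x[2] - d$x[1]

# Total area under the curve (simple Riemann sum): should be close to 1
total_area <- sum(d$y) * step

# Area between 4 and 4.5 versus the actual share of cases in that range
in_range <- d$x >= 4 & d$x <= 4.5
area_4_to_4.5 <- sum(d$y[in_range]) * step
share_4_to_4.5 <- mean(log_pop >= 4 & log_pop <= 4.5)

c(total_area = total_area, area = area_4_to_4.5, share = share_4_to_4.5)
```

So a height like 0.6 over a stretch of width 0.5 would cover roughly 0.6 × 0.5 = 0.3, or about 30% of the cases.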

Imposing a density curve on top of a histogram

You can create a histogram and add a density curve to the same graph. For the density curve to lie on top of the bars, you’ll need to specify binwidth in geom_histogram() and then map the y aesthetic in geom_density() to y = binwidth * after_stat(count), replacing binwidth with the number you used.

For example, looking at the bachelor degree percentage:
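A sketch of the overlay. The column name bachelor_pct and its simulated values are hypothetical, since the real column isn't shown here, and the bin width of 2.5 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical stand-in: percent of adults with a bachelor's degree
set.seed(2)
counties <- data.frame(bachelor_pct = 100 * rbeta(500, 4, 14))

bw <- 2.5  # must match the binwidth used in geom_histogram()

p_overlay <- ggplot(counties, aes(x = bachelor_pct)) +
  geom_histogram(binwidth = bw, boundary = 0,
                 fill = "grey80", color = "white") +
  # Rescale the curve from density to the histogram's count scale
  geom_density(aes(y = bw * after_stat(count)), color = "firebrick")

p_overlay
```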

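after_stat(count) changes the summary stat displayed on the y-axis from density (the default of geom_density()) to count (the default of geom_histogram()). Multiplying by the bin width puts the curve on the same scale as the bars, since each bar’s height is roughly density × number of cases × bin width.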

Another common option for after_stat() is prop if you want to display a proportion instead of a count, often used in geom_bar().
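A small sketch of after_stat(prop) in a bar chart. The codes 1 through 6 are simulated placeholders for rural_urban_code, and group = 1 is there so the proportions are taken over all bars together rather than within each bar.

```r
library(ggplot2)

# Hypothetical stand-in: six rural/urban codes with unequal counts
counties <- data.frame(
  rural_urban_code = rep(1:6, times = c(300, 200, 150, 100, 60, 40))
)

p_prop <- ggplot(counties, aes(x = factor(rural_urban_code))) +
  geom_bar(aes(y = after_stat(prop), group = 1), fill = "steelblue") +
  labs(x = "Rural/urban code", y = "Proportion of counties")

p_prop
```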

Histograms and Density Plots for Groups

We can also use fill to create histogram and density plots for different groups. Make a histogram for each different economy type by mapping fill to rural_urban_code (make sure to turn it into a factor!)
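A sketch of both plots, assuming simulated data with six codes (100 counties each) in place of the real file. The log10 scale is carried over from the earlier section.

```r
library(ggplot2)

# Hypothetical stand-in: population shrinks as the code increases
set.seed(3)
counties <- data.frame(rural_urban_code = rep(1:6, each = 100))
counties$population <- round(
  10^rnorm(600, mean = 5.5 - 0.3 * counties$rural_urban_code, sd = 0.4)
)

# factor() makes ggplot treat the code as groups, not as a number
grouped <- ggplot(counties,
                  aes(x = population, fill = factor(rural_urban_code))) +
  scale_x_log10() +
  labs(fill = "Rural/urban code")

p_group_hist <- grouped + geom_histogram(bins = 25, color = "white")
p_group_dens <- grouped + geom_density(alpha = 0.4)

p_group_hist
p_group_dens
```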

Neither of these plots shows the differences that well since there are 6 total groups. Density plots can work if there are fewer groups (say 3, 4 at most), but they can get crowded and messy quickly.

Faceted Histograms and Density Plots

Let’s compare the population of counties by their rural/urban code separately, but in the same overall plot. How do we do that?

By creating small multiples again using facet_wrap()!
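A sketch of the faceted histogram, again on simulated data with six hypothetical codes:

```r
library(ggplot2)

# Hypothetical stand-in for the counties data
set.seed(3)
counties <- data.frame(rural_urban_code = rep(1:6, each = 100))
counties$population <- round(
  10^rnorm(600, mean = 5.5 - 0.3 * counties$rural_urban_code, sd = 0.4)
)

# One panel per rural/urban code, all sharing the same axes
p_facet <- ggplot(counties, aes(x = population)) +
  geom_histogram(bins = 25, fill = "steelblue", color = "white") +
  scale_x_log10() +
  facet_wrap(vars(rural_urban_code))

p_facet
```

Swapping geom_histogram() for geom_density() gives the faceted density version.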

One issue is that density plots will “mask” or hide how many cases are in each group, because the total area under the curve is the same (1) for each one. If we want to show the differences in the number of cases in each group as well, we can map y to the computed variable after_stat(count). Older code writes this as ..count.., but the dot-dot notation was deprecated in ggplot2 3.4.0 and now triggers a warning.
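A sketch using after_stat(count) in faceted density plots. The group sizes below are made up so that the difference is easy to see: the panel for code 1 (300 counties) should end up much taller than the panel for code 6 (40 counties).

```r
library(ggplot2)

# Hypothetical stand-in: six codes with very different numbers of counties
set.seed(4)
sizes <- c(300, 200, 150, 100, 60, 40)
counties <- data.frame(rural_urban_code = rep(1:6, times = sizes))
counties$population <- round(
  10^rnorm(nrow(counties),
           mean = 5.5 - 0.3 * counties$rural_urban_code, sd = 0.4)
)

# y = after_stat(count) scales each curve by its group's number of cases
p_count <- ggplot(counties, aes(x = population)) +
  geom_density(aes(y = after_stat(count)), fill = "steelblue", alpha = 0.5) +
  scale_x_log10() +
  facet_wrap(vars(rural_urban_code))

p_count
```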