## Set the default size of figures and hide the R code by default
knitr::opts_chunk$set(echo = FALSE,
                      fig.width = 8,
                      fig.height = 5)
## Load the libraries we will be using
pacman::p_load(tidyverse, skimr)
## Changing the default theme to black/white instead of grey
theme_set(theme_bw())
## Reading in the data:
counties <- read.csv("us counties.csv")
If you want to make a histogram, the geom is `geom_histogram()` (not `geom_hist()`, that one doesn’t exist).
`geom_histogram()` will default to creating 30 bins and will recommend you override how many bins to create.
You can specify how many bins to have in your histogram with:
- `bins =` number: how many bars should be in the histogram
- `binwidth =` number: how wide to make each bin
- `breaks =` vector: a vector of breaks (where to start and stop each bar) that are equally spaced

Create a histogram for population, one each way:
- `bins = 25`
- `binwidth = 200000`
- `breaks = seq(from = 0, to = 10000000, by = 250000)`
Set the fill and color aesthetics to something other than the defaults.
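The three ways of setting the bins can be sketched as follows. This is only an illustration: `toy_counties` is simulated, right-skewed data standing in for the real counties file, and the fill and color choices are arbitrary.

```r
library(ggplot2)

# Simulated stand-in for the counties data (hypothetical values, not the real file)
set.seed(1)
toy_counties <- data.frame(
  population = round(10^rnorm(500, mean = 4.5, sd = 0.6))
)

# 1) Choose the number of bins
p_bins <- ggplot(toy_counties, aes(x = population)) +
  geom_histogram(bins = 25, fill = "steelblue", color = "black")

# 2) Choose the width of each bin
p_width <- ggplot(toy_counties, aes(x = population)) +
  geom_histogram(binwidth = 200000, fill = "steelblue", color = "black")

# 3) Choose the exact break points (41 breaks give 40 bars)
p_breaks <- ggplot(toy_counties, aes(x = population)) +
  geom_histogram(breaks = seq(from = 0, to = 10000000, by = 250000),
                 fill = "steelblue", color = "black")
```

Print any of the three objects (for example `p_breaks`) to draw the plot.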
Kinda hard to see what is going on from the histogram because population is extremely right skewed. Is there a way we can fix that?
Instead of mapping `x = population`, map `x = log10(population)` and see what happens:
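A minimal sketch of that change, again on simulated data (the `toy_counties` data frame is a hypothetical stand-in for the real file):

```r
library(ggplot2)

# Simulated stand-in for the counties data (hypothetical values, not the real file)
set.seed(1)
toy_counties <- data.frame(
  population = round(10^rnorm(500, mean = 4.5, sd = 0.6))
)

# The transformation happens inside aes(), so the axis shows log10 values
p_log <- ggplot(toy_counties, aes(x = log10(population))) +
  geom_histogram(bins = 25, fill = "steelblue", color = "black")
```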
The histogram looks so much better! Unfortunately, the numbers on the x-axis aren’t very intuitive.
A 2 means the population is 100 (\(10^2\)), a 5 means the population is 100,000 (\(10^5\)), and a 7 means the population is 10,000,000 (\(10^7\))!
Is there a way we can change the scale of `x` but keep the axis labeled with the original population values?
Since we want to change the scale of the `x` aesthetic, we need to use the `scale_` family of functions. But which one?
First, we need the aesthetic, `x`, and the type of variable, `continuous`, so we’ll use `scale_x_continuous()`!
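The argument we need wasn’t discussed in our example code about the `scale_` functions. But the members of the continuous scale family have an argument `trans =` (short for transformation). We give it the name of a transformation, such as `"log10"`, and `ggplot()` will apply that transformation to the assigned aesthetic. (In ggplot2 3.5.0 and later the argument is named `transform =`; `trans =` still works but is deprecated.)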
Try adding `scale_x_continuous(trans = "log10")` to one of the original histograms that assigned `population` to `x`, not `log10(population)`.
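One possible version of that exercise is sketched here on simulated data (`toy_counties` is a hypothetical stand-in for the real file). The bars are computed on the log10 scale, but the axis labels stay in population units.

```r
library(ggplot2)

# Simulated stand-in for the counties data (hypothetical values, not the real file)
set.seed(1)
toy_counties <- data.frame(
  population = round(10^rnorm(500, mean = 4.5, sd = 0.6))
)

# x is still population; the scale does the transformation.
# Note: ggplot2 >= 3.5.0 prefers transform = "log10" and warns about trans =.
p_trans <- ggplot(toy_counties, aes(x = population)) +
  geom_histogram(bins = 25, fill = "steelblue", color = "black") +
  scale_x_continuous(trans = "log10")
```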
Since log transformations are a very common way of reshaping a variable, there is a little bit of a shortcut we can use. Instead of calling `scale_x_continuous()` and using `trans = "log10"`, we can use `scale_x_log10()`.
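Note that this shortcut exists for the position aesthetics: `scale_x_continuous()` can be replaced by `scale_x_log10()` and `scale_y_continuous()` by `scale_y_log10()` if you want to transform the variable with `log10()`. Other continuous scales, such as fill or size, have no `_log10()` version, so use the transformation argument for those.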
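The shortcut can be sketched like this, using simulated data in place of the real file (`toy_counties` is hypothetical). It should give the same bars as the longer `scale_x_continuous()` call.

```r
library(ggplot2)

# Simulated stand-in for the counties data (hypothetical values, not the real file)
set.seed(1)
toy_counties <- data.frame(
  population = round(10^rnorm(500, mean = 4.5, sd = 0.6))
)

# scale_x_log10() is shorthand for a continuous x scale with a log10 transformation
p_short <- ggplot(toy_counties, aes(x = population)) +
  geom_histogram(bins = 25, fill = "steelblue", color = "black") +
  scale_x_log10()
```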
When using a transformation of the original variable, it is important to indicate it somewhere in the graph. A common place is the caption:
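For instance, a caption can be added with `labs()`. This is a sketch on simulated data (`toy_counties` is hypothetical), and the caption wording is only a suggestion.

```r
library(ggplot2)

# Simulated stand-in for the counties data (hypothetical values, not the real file)
set.seed(1)
toy_counties <- data.frame(
  population = round(10^rnorm(500, mean = 4.5, sd = 0.6))
)

p_caption <- ggplot(toy_counties, aes(x = population)) +
  geom_histogram(bins = 25, fill = "steelblue", color = "black") +
  scale_x_log10() +
  labs(x = "Population",
       caption = "Note: population is shown on a log10 scale")
```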
A downside of histograms is we have to specify how many bins to use.
If we pick too many, it can highlight random features specific to the sample and not the larger population.
Too few and it can hide important characteristics that generalize to larger groups.
Try changing binwidth to 1000 or 1000000 and see what you get.
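To see how much the choice matters, the sketch below counts the bars produced by the two bin widths on simulated data (`toy_counties` is a hypothetical stand-in, so the exact counts will differ for the real counties).

```r
library(ggplot2)

# Simulated stand-in for the counties data (hypothetical values, not the real file)
set.seed(1)
toy_counties <- data.frame(
  population = round(10^rnorm(500, mean = 4.5, sd = 0.6))
)

p_narrow <- ggplot(toy_counties, aes(x = population)) +
  geom_histogram(binwidth = 1000)
p_wide <- ggplot(toy_counties, aes(x = population)) +
  geom_histogram(binwidth = 1000000)

# layer_data() returns one row per bar, so nrow() is the number of bins
n_narrow <- nrow(layer_data(p_narrow))
n_wide <- nrow(layer_data(p_wide))
```

Very narrow bins give many bars that mostly hold zero or one county, while very wide bins collapse nearly everything into the first bar.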
So how many bins should we have in our histogram? That’s a good question!
It would be nice if we could use a different plot that displays the same information without requiring us to make a choice.
Does such a plot exist? Of course, we wouldn’t mention it otherwise!
Instead of a histogram, we can build a density plot! It’s similar to `geom_histogram()` to create, just we use `geom_density()` and don’t specify any of the bin arguments!
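A minimal density plot might look like this. The data are simulated (`toy_counties` is hypothetical), and the log10 scale is carried over from the histograms above.

```r
library(ggplot2)

# Simulated stand-in for the counties data (hypothetical values, not the real file)
set.seed(1)
toy_counties <- data.frame(
  population = round(10^rnorm(500, mean = 4.5, sd = 0.6))
)

# No bins, binwidth, or breaks needed
p_density <- ggplot(toy_counties, aes(x = population)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  scale_x_log10()
```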
So what’s the downside of a density plot? Trying to explain what density represents about the data! Try explaining what the 0.6 means.
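One way to get a feel for density: the heights are scaled so that the total area under the curve is 1, so a height on its own is neither a count nor a percentage. The sketch below checks this numerically with base R's `density()` on simulated values (the numbers are hypothetical, not the real counties).

```r
# Simulated log10(population) values (hypothetical, not the real file)
set.seed(1)
log_pop <- rnorm(500, mean = 4.5, sd = 0.6)

# Kernel density estimate on a grid of x values
dens <- density(log_pop)

# Trapezoid rule: width of each grid step times the average height across it
widths <- diff(dens$x)
heights <- (dens$y[-1] + dens$y[-length(dens$y)]) / 2
area <- sum(widths * heights)
```

Here `area` comes out very close to 1. The share of cases in a narrow interval is roughly the curve's height times the interval's width.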
You can create a histogram and add a density curve on the graph. In order to have the density curve lie on top of the bars, you’ll need to specify `binwidth` in `geom_histogram()` and then map the `y` aesthetic in `geom_density()` as `y = {binwidth}*after_stat(count)`.
For example, looking at the bachelor degree percentage:
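A sketch of the overlay, assuming a column of bachelor degree percentages. The column name `bachelor_pct`, the simulated values, and the bin width of 2.5 are all hypothetical choices for illustration.

```r
library(ggplot2)

# Simulated stand-in data; bachelor_pct is a hypothetical column name
set.seed(1)
toy_counties <- data.frame(
  bachelor_pct = rnorm(300, mean = 22, sd = 8)
)

bin_width <- 2.5  # must match the binwidth used in geom_histogram()

p_overlay <- ggplot(toy_counties, aes(x = bachelor_pct)) +
  geom_histogram(binwidth = bin_width, fill = "grey80", color = "black") +
  # Rescale the curve from density to the histogram's count scale
  geom_density(aes(y = bin_width * after_stat(count)), color = "red")
```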
What `after_stat(count)` is doing is changing the summary stat displayed on the y-axis from density (default of `geom_density()`) to count (default of `geom_histogram()`).
Another common option for `after_stat()` is `prop` if you want to display a proportion instead of a count, often used in `geom_bar()`.
We can also use fill to create histogram and density plots for different groups. Make a histogram for each different economy type by mapping `fill` to `rural_urban_code` (make sure to turn it into a factor!)
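Both grouped plots can be sketched as below. The data are simulated: `toy_counties` is hypothetical, and its six codes and group sizes are made up to mirror the six groups in the text.

```r
library(ggplot2)

# Simulated stand-in for the counties data (hypothetical values, not the real file)
set.seed(1)
toy_counties <- data.frame(
  population = round(10^rnorm(600, mean = 4.5, sd = 0.6)),
  rural_urban_code = rep(1:6, times = c(200, 150, 100, 80, 50, 20))
)

# factor() makes ggplot treat the numeric code as categories, one fill color each
p_fill_hist <- ggplot(toy_counties,
                      aes(x = population, fill = factor(rural_urban_code))) +
  geom_histogram(bins = 25, color = "black") +
  scale_x_log10() +
  labs(fill = "Rural/urban code")

p_fill_dens <- ggplot(toy_counties,
                      aes(x = population, fill = factor(rural_urban_code))) +
  geom_density(alpha = 0.4) +
  scale_x_log10() +
  labs(fill = "Rural/urban code")
```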
Neither of these plots shows the differences that well since there are 6 total groups. Density plots can work if there are fewer groups (say 3, 4 at most), but they can get crowded and messy quickly.
Let’s compare the population of counties by their rural/urban code separately, but in the same overall plot. How do we do that?
By creating small multiples again using `facet_wrap()`!
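A sketch of the small multiples, again with simulated data (`toy_counties` and its six codes are hypothetical):

```r
library(ggplot2)

# Simulated stand-in for the counties data (hypothetical values, not the real file)
set.seed(1)
toy_counties <- data.frame(
  population = round(10^rnorm(600, mean = 4.5, sd = 0.6)),
  rural_urban_code = rep(1:6, times = c(200, 150, 100, 80, 50, 20))
)

# One panel per rural/urban code, all sharing the same axes
p_facet <- ggplot(toy_counties, aes(x = population)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  scale_x_log10() +
  facet_wrap(vars(rural_urban_code))
```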
One issue is that density plots will “mask” or hide how many cases are in each group, because the total area under the curve will be the same for each one. If we want to show the differences in the number of cases in each group as well, we can map `y` to the computed variable `after_stat(count)`. Older code writes this as `..count..`, but the dot-dot notation was deprecated in ggplot2 3.4.0 and now produces a warning asking you to use `after_stat(count)` instead.
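The count-scaled version of the faceted density plot can be sketched as follows. The data are simulated, with deliberately unequal (made-up) group sizes so the difference shows.

```r
library(ggplot2)

# Simulated stand-in for the counties data (hypothetical values, not the real file)
set.seed(1)
toy_counties <- data.frame(
  population = round(10^rnorm(600, mean = 4.5, sd = 0.6)),
  rural_urban_code = rep(1:6, times = c(200, 150, 100, 80, 50, 20))
)

# after_stat(count) is density times the number of cases in the group,
# so larger groups get taller curves
p_count <- ggplot(toy_counties, aes(x = population, y = after_stat(count))) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  scale_x_log10() +
  facet_wrap(vars(rural_urban_code))
```

With this mapping, the panel for the largest group has a visibly taller curve than the panel for the smallest one.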