Welcome to Part 4 of my Stats in R Tutorial series. If you missed the last tutorial, you can check it out at the following link. After we clear today’s tutorial, we can move back to learning the basics of statistics and how to put everything we have used together. For now, let’s learn how to make some nice plots!
The ggplot2 package is another set of functions from the
tidyverse. It’s namesake is somewhat self-evident, as it is
obviously used for plotting. However, the “gg” here stands for “grammar
of graphics.” This is actually an important distinction that may help
you understand what you are doing when constructing plots from this
package. So far, we have learned to run really simple plots. As an
example, a simple histogram can be made with this code on the
airquality dataset.
hist(airquality$Solar.R)
However, this sort of plotting is fairly limited. What if, for example, we want to add a density plot to the histograms? We can plot the density and plot the histogram bars, but we can’t really do both unless we use a lot of separate code like below. Notice that we have to remove missing data before we even plug these functions in because vectorized functions in R cannot handle missing data unless specified.
#### Drop NA Values ####
drop <- na.omit(airquality$Solar.R)
#### Plot Histogram ####
hist(drop,
probability = T)
#### Save Density Function ####
dens <- density(drop)
#### Overlay Density ####
lines(dens,
lwd = 3)
However the same can be done with one line of code in the
ggplot function:
ggplot(airquality,
aes(x=Solar.R))+
geom_histogram(aes(y = ..density..))+
geom_density()
You may have just noticed a couple oddities here. First, what is a
geom? These are layers of geometry that you are applying to
the plot. In essence, ggplot serves as a blank canvas, and
these geoms paint plots on this canvas. You may also be asking yourself:
what are those + signs doing? Those “stack” the
geom or plot layers on top of each other. You can think of
ggplot like a hamburger. The ggplot function
is the meat and each geom is another layer of toppings,
whether that be bread, tomatoes, lettuce, or otherwise. The beauty of
hamburgers and plots is that you can add as much as you want, but adding
too much can also cause a giant mess. When plotting, one must always
balance information transmission and interpretability.
A delicious ggplot hamburger.
We will cover a lot of ground today with the ggplot2
package, but keep in mind that this package has a bazillion types of
plots and functions. While we cannot cover everything, we can cover most
of the foundational content so you get an idea of how it works.
Recall from our previous tutorials that histograms are plots that visualize frequency values. If you have a given variable and want to explain the shape of it’s distribution, as well as where the data is most frequently counted, we can use a histogram to visualize this relationship.
First, let’s try to make our first ggplot step by step.
A typical ggplot function includes the data first, then
followed by the aes function, which basically says “take
this data and arrange it the way I say here.” For example, it is typical
for histograms to require one variable, your \(x\) value, but ggplot doesn’t
know yet which \(x\) to use.
You would have to specify with aes(x = variable), the
“variable” part here being the name of your \(x\). Let’s try it out on the
airquality data once more, using the Solar.R
variable as our \(x\).
ggplot(airquality,
aes(x=Solar.R))
Oh no! All it did was show white space where our plot should be. I
showed this on purpose. Recall previously that I mentioned that
ggplot serves as a blank canvas that you must build on.
Here you can see that it has taken our data values (seen on the axes),
but it hasn’t “painted” anything yet. That’s because we have only
supplied the data to ggplot without any geoms. To create a
histogram plot layer, we just add geom_histogram, and we
make sure to use the + operator to connect it to the
plot.
ggplot(airquality,
aes(x=Solar.R))+
geom_histogram()
This likely isn’t that enthralling. First off, let’s fill the bars
with a different color. You can choose any color you want from this color wheel or
just use one of the preset colors in R. For now, let’s use
hotpink because why not?
ggplot(airquality,
aes(x=Solar.R))+
geom_histogram(fill = "hotpink")
Beautiful. We may want to also color the lines in so the bars are
more distinct from each other. Let’s now use color and
supply the character “black” this time.
ggplot(airquality,
aes(x=Solar.R))+
geom_histogram(fill = "hotpink",
color = "black")
Not bad. Let’s try one more of these arguments. Let’s now change the number of bins, so we have more data stacked into each.
ggplot(airquality,
aes(x=Solar.R))+
geom_histogram(fill = "hotpink",
color = "black",
bins = 10)
Not bad. Each geom will have it’s own arguments, and
even when a geom uses similar arguments, it may apply them
in different ways, so try to experiment and see how they differ (you can
always inspect the help page by running ?geom_histogram for
example). One nice way that we can add to this plot is by relabeling a
bit. Let’s change our x-axis, y-axis, and title. We can do this easily
with the labs function, which is added on just like
geom_histogram.
ggplot(airquality,
aes(x=Solar.R))+
geom_histogram(fill = "hotpink",
color = "black",
bins = 10)+
labs(x="Solar Radiation",
y="Count",
title="Solar Radiation Histogram (Black Pink Version)")
Something else that can spice things up is to add a theme. There are
quite a few plot themes in R that are not limited to the
tidyverse package, but one is theme_bw, which
applies a simple black and white scheme to the plot. Usually the themes
don’t require any arguments and can be simply run as-is with the
ggplot.
ggplot(airquality,
aes(x=Solar.R))+
geom_histogram(fill = "hotpink",
color = "black",
bins = 10)+
labs(x="Solar Radiation",
y="Count",
title="Solar Radiation Histogram (Black Pink Version)")+
theme_bw()
Let’s add just one more function here so we can expand our
vocabulary. A useful function is facet_wrap, which takes a
plot and splits it up by a factor. Our airquality data has
solar radiation data for different months. Let’s check out how each
month differs. We can do this by writing
facet_wrap(~ Month). The tilda states that the data
(implicitly to the left of the tilda) is going to be disaggregated by
everything to the right (here the Month variable). You can
also specify how many columns and rows you use. Here we will use 5
columns since that is how many months we have. We will also change the
color of the plot to dark gray with the same fill argument.
Finally, we will change the title to be more specific about what we are
doing.
ggplot(airquality,
aes(x=Solar.R))+
geom_histogram(fill = "darkgray",
color = "black",
bins = 10)+
labs(x="Solar Radiation",
y="Count",
title="Solar Radiation by Month")+
theme_bw()+
facet_wrap(~Month,
ncol = 5)
There are several functions we can add here, but you can probably see the pattern now. The functions which we add “stack” on top of each other, further customizing the plot. This in turn gives us a very detailed plot compared to base R plots. Let’s move on to another plot to learn more.
Remember that boxplots allow us to check the average variation in a
given variable according to some grouping variable. It provides us with
the IQR, range, median, and even outliers. The boxplots in
ggplot are much nicer than the bare ones native to R. Let’s
check monthly averages of wind speed by using the code below. This time,
we have to specify an \(x\) and \(y\) value. The x here will be
our categorical variable, whereas the data variation we are looking to
observe will be our y. Then we do as before and add a
geom, this time geom_boxplot.
ggplot(airquality,
aes(x=Month,
y=Wind))+
geom_boxplot()
Oh no! We have hit another speed bump. If you are using a
non-categorical variable as a factor in ggplot, R will get
confused and turn everything into one boxplot. We will use the
factor function we have used before to transform this
numeric data into categorical data.
ggplot(airquality,
aes(x=factor(Month),
y=Wind))+
geom_boxplot()
Much better! But you can already see another problem. R has labeled our data quite literally on the x-axis. We already know how to fix this. Let’s give our boxplot some better labels.
ggplot(airquality,
aes(x=factor(Month),
y=Wind))+
geom_boxplot()+
labs(x="Month",
y="Wind Speed (mph)",
title="Monthly Variation in Wind Speed")
Now we are going to introduce another concept. Recall that I
mentioned how aes takes your data and explicitly tells R
what to do with it. We are going to use fill again, but
this time instead of using it as an argument in our geom,
we are going to add it to aes instead. We want to fill in
color for each month. To do this, we will tell aes that we
want fill=Month. This tells aes to “fill the
geom with colors by month”. This may be too abstract, so let’s just see
it in action instead.
ggplot(airquality,
aes(x=factor(Month),
y=Wind,
fill=factor(Month)))+
geom_boxplot()+
labs(x="Month",
y="Wind Speed (mph)",
title="Monthly Variation in Wind Speed")
Fantastic. You may notice that now R has supplied a legend to the right. This is very useful in cases where the categorical labeling is not obvious. However, our plot already has an x-axis label for month, so this isn’t necessary. We will now change the theme of the plot manually to fix this issue.
The theme function takes whatever plot layers you have
and edits in minor fixes to each part you provide. To get rid of the
legend, we have to specify that we want our legend blank. To do this, we
will use the function theme and tell the
legend.position argument we don’t want it by supplying
"none" as a character vector switching this off.
ggplot(airquality,
aes(x=factor(Month),
y=Wind,
fill=factor(Month)))+
geom_boxplot()+
labs(x="Month",
y="Wind Speed (mph)",
title="Monthly Variation in Wind Speed")+
theme(legend.position = "none")
Not bad so far. But notice that the y axis has a limited range of
values. Perhaps we want to change the intervals so there is more
information we can go off of. We can use scale_y_continuous
(or scale_x_continuous for the x-axis) to redistribute the
intervals. All we have to do is add n.breaks and how many
that we want.
ggplot(airquality,
aes(x=factor(Month),
y=Wind,
fill=factor(Month)))+
geom_boxplot()+
labs(x="Month",
y="Wind Speed (mph)",
title="Monthly Variation in Wind Speed")+
theme(legend.position = "none")+
scale_y_continuous(n.breaks = 20)
We can see now that wind speed doesn’t vary that much in general,
though some months have higher/lower wind speeds compared to others. You
can also see two dots above and below the range for June (Month 6).
Remember that these dots represent outliers above and below the IQR. To
finish off this plot, let’s consider changing the palette of colors we
have here. Perhaps the natural shade isn’t that exciting. We can supply
our own manually into a character vector, then add them to our plot with
scale_fill_manual using the value
argument.
#### Create Colors ####
colors <- c("darkgray",
"lightblue",
"steelblue",
"royalblue",
"steelblue1")
#### Create Plot ####
ggplot(airquality,
aes(x=factor(Month),
y=Wind,
fill=factor(Month)))+
geom_boxplot()+
labs(x="Month",
y="Wind Speed (mph)",
title="Monthly Variation in Wind Speed")+
theme(legend.position = "none")+
scale_y_continuous(n.breaks = 20)+
scale_fill_manual(values = colors)
Lookin’ good! Let’s move on.
We will learn a couple more plot types, but now that we have learned
a bit of ggplot, it is now time to also revisit
dplyr. Recall from the last tutorial that
dplyr allows you to manipulate data in some rather complex
and useful ways. Combining these two packages will leverage a lot of
power in your data science skills, and it is recommended to get familiar
with using both if you want to excel at R.
You have seen that ggplot uses a similar method of code
chunking. Rather than use the %>% operator, it uses a
+ operator, but the idea is the same. They however do not
mix well. Make sure that your dplyr functions distinctly
have the %>% after each function and ggplot
has a + after each function. When you finish writing a
chain of functions, do not use either operator. Doing so will
make R think that you are not finished with your code and will wig
out.
Anyways, we will create a scatter plot of values using this
principle. We will explore these in much more depth once we tackle
regression in a future tutorial, but they can be useful for many other
statistical explorations as well. For starting our plot off, we do so in
a similar fashion to geom_boxplot, only this time we use
only numeric values for \(x\) and \(y\), followed by
geom_point.
ggplot(airquality,
aes(x=Month,
y=Wind))+
geom_point()
Notice that our scatter looks…pretty stupid so far. That’s because we
essentially binned repeated values into our x-axis. Before we fix this,
let’s start off with a really basic combo of dplyr and
ggplot. This time, let’s simply pipe our data into the
plot, and omit the data from the ggplot function.
airquality %>%
ggplot(aes(x=Month,
y=Wind))+
geom_point()
Look at that. We essentially achieved the same thing, but piped our
data into the plot before adding layers. Now we are going to add some
complexity. The data needs to be aggregated into unique values for each
month, so what we can do is group our data by month and take it’s mean.
The dplyr way is shown below, which we have already
learned.
airquality %>%
group_by(Month) %>%
summarise(Mean_Wind = mean(Wind, na.rm = T))
## # A tibble: 5 × 2
## Month Mean_Wind
## <int> <dbl>
## 1 5 11.6
## 2 6 10.3
## 3 7 8.94
## 4 8 8.79
## 5 9 10.2
Now we just have to add this to the ggplot function.
Since we have created a new variable, our \(y\) will be called Mean_Wind
this time.
airquality %>%
group_by(Month) %>%
summarise(Mean_Wind = mean(Wind, na.rm = T)) %>%
ggplot(aes(x=Month,
y=Mean_Wind))+
geom_point()
This is starting to look better. We can now clearly see the average
wind speed for each month. We can begin to tell a story with this plot
by adding a line between the points. Simply add geom_line
to add one more layer to the plot, and with it, a line between our
scatter plot points.
airquality %>%
group_by(Month) %>%
summarise(Mean_Wind = mean(Wind, na.rm = T)) %>%
ggplot(aes(x=Month,
y=Mean_Wind))+
geom_point()+
geom_line()
Go ahead and fix the labels and theme of the plot, this time using
theme_classic, which makes a more minimalist look for your
plot.
airquality %>%
group_by(Month) %>%
summarise(Mean_Wind = mean(Wind, na.rm = T)) %>%
ggplot(aes(x=Month,
y=Mean_Wind))+
geom_point()+
geom_line()+
labs(x="Month",
y="Mean Wind Speed (mph)",
title = "Mean Wind Speed by Month")+
theme_classic()
This still doesn’t pop yet. Let’s add three more arguments to
geom_point. We will set size = 5 to increase
the size of the points, alpha = .5 to make them
semi-translucent, and write color = "steelblue" to change
it’s color.
airquality %>%
group_by(Month) %>%
summarise(Mean_Wind = mean(Wind, na.rm = T)) %>%
ggplot(aes(x=Month,
y=Mean_Wind))+
geom_point(size = 5,
alpha = .5,
color = "steelblue")+
geom_line()+
labs(x="Month",
y="Mean Wind Speed (mph)",
title = "Mean Wind Speed by Month")+
theme_classic()
The line still looks pretty weak. Let’s set the linewidth to 1 to make it bulkier. Let’s also scale our y-axis again so there are more values to look at.
airquality %>%
group_by(Month) %>%
summarise(Mean_Wind = mean(Wind, na.rm = T)) %>%
ggplot(aes(x=Month,
y=Mean_Wind))+
geom_point(size = 5,
alpha = .5,
color = "steelblue")+
geom_line(linewidth = 1)+
labs(x="Month",
y="Mean Wind Speed (mph)",
title = "Mean Wind Speed by Month")+
theme_classic()+
scale_y_continuous(n.breaks = 10)
Our plot is looking better. Clearly we have a curvilinear relationship going on here, with wind speed fluctuating between months. However, maybe we are only interested in the wind speed from May to July. Let’s filter for this.
airquality %>%
group_by(Month) %>%
summarise(Mean_Wind = mean(Wind, na.rm = T)) %>%
filter(Month < 8) %>%
ggplot(aes(x=Month,
y=Mean_Wind))+
geom_point(size = 5,
alpha = .5,
color = "steelblue")+
geom_line(linewidth = 1)+
labs(x="Month",
y="Mean Wind Speed (mph)",
title = "Mean Wind Speed by Month")+
theme_classic()+
scale_y_continuous(n.breaks = 10)
Here we can see the trend is very linear if we do not look at
subsequent months. Filtering before plotting can narrow down your search
of information when you have a lot of data to work with, making
dplyr extra helpful.
Let’s do one last chain of functions to highlight this functionality
more. We are going to use pivot_longer to compare trends
with the the same scatter plot / line graph. This time we want to
compare ozone, wind speed, and solar radiation levels by month. Let’s
first select our variables, group them by month, and get their mean.
Let’s save this as air.avg.
air.avg <- airquality %>%
select(Month, Wind, Ozone, Solar.R) %>%
group_by(Month) %>%
summarise(Mean_Wind = mean(Wind, na.rm = T),
Mean_Ozone = mean(Ozone, na.rm = T),
Mean_Solar = mean(Solar.R, na.rm = T))
air.avg
## # A tibble: 5 × 4
## Month Mean_Wind Mean_Ozone Mean_Solar
## <int> <dbl> <dbl> <dbl>
## 1 5 11.6 23.6 181.
## 2 6 10.3 29.4 190.
## 3 7 8.94 59.1 216.
## 4 8 8.79 60.0 172.
## 5 9 10.2 31.4 167.
Let’s then pivot this data and call it air.pivot. Use
the contains function to select all of the variables with
“mean” in their name.
air.pivot <- air.avg %>%
pivot_longer(cols = contains("Mean"),
names_to = "Measure",
values_to = "Value")
air.pivot
## # A tibble: 15 × 3
## Month Measure Value
## <int> <chr> <dbl>
## 1 5 Mean_Wind 11.6
## 2 5 Mean_Ozone 23.6
## 3 5 Mean_Solar 181.
## 4 6 Mean_Wind 10.3
## 5 6 Mean_Ozone 29.4
## 6 6 Mean_Solar 190.
## 7 7 Mean_Wind 8.94
## 8 7 Mean_Ozone 59.1
## 9 7 Mean_Solar 216.
## 10 8 Mean_Wind 8.79
## 11 8 Mean_Ozone 60.0
## 12 8 Mean_Solar 172.
## 13 9 Mean_Wind 10.2
## 14 9 Mean_Ozone 31.4
## 15 9 Mean_Solar 167.
Now let’s plot this pivoted data. We will use the color
argument in the aes function again to color the data in by
each measure we labeled.
air.pivot %>%
ggplot(aes(x=Month,
y=Value,
color=Measure))+
geom_point()+
geom_line()
We are getting something closer to what we want. However, you can see
the variables have much different values from each other and are not as
easily comparable. We could scale them using scale, but
that would make it slightly more difficult to interpret. Instead, we can
use facet_wrap again and use the scales=free
argument to help us visualize it better. We can also remove the legend
and add our own colors like last time. Finally, we can give ourselves a
title with labs. Also, I have added
theme_classic before theme. If you swap the
order, R will overwrite your theme changes with the “defaults” of
theme_classic. This may seem like a lot of steps, but you
already know most of this from what we have covered so far, so you can
just apply what you have learned.
air.pivot %>%
ggplot(aes(x=Month,
y=Value,
color=Measure))+
geom_point()+
geom_line()+
facet_wrap(~Measure,
scales = "free")+
theme_classic()+
theme(legend.position = "none")+
scale_color_manual(values = c("darkred",
"darkorange",
"darkblue"))+
labs(title = "Weather Measurements by Month")
Now we can clearly see some interesting trends! Not all of these variables have the same relationship across time, as ozone and wind are almost inverses of each other. Just for reference, we could have done all of this with one code…
airquality %>%
select(Month, Wind, Ozone, Solar.R) %>%
group_by(Month) %>%
summarise(Mean_Wind = mean(Wind, na.rm = T),
Mean_Ozone = mean(Ozone, na.rm = T),
Mean_Solar = mean(Solar.R, na.rm = T)) %>%
pivot_longer(cols = contains("Mean"),
names_to = "Measure",
values_to = "Value") %>%
ggplot(aes(x=Month,
y=Value,
color=Measure))+
geom_point()+
geom_line()+
facet_wrap(~Measure,
scales = "free")+
theme_classic()+
theme(legend.position = "none")+
scale_color_manual(values = c("darkred",
"darkorange",
"darkblue"))
…but in this case it is better to break it up into chunks since it is
quite a lot. Now you understand why leveraging dplyr and
ggplot can be really helpful, as well as how saving objects
can help break up your analysis into meaningful steps.
We did quite a bit today, and I would recommend going through the
code here and experimenting with it on different data (remember that R
has many built-in datasets to utilize). I will once again recommend the
book R for Data Science by Hadley Wickham if you want to learn
much more about plotting in the tidyverse, as it’s coverage
is quite deep compared to here. Unrelated to R, there is also a
fantastic book called Storytelling with Data by Cole Nussbaumer
Knaflic which really does a great job of explaining how to effectively
visualize data. I recommend reading it if you want to be efficient with
how you explain data with plots.
Thank you for reading this tutorial. If you felt that any part of this tutorial was helpful, and you would like to support me, please consider buying me a cup of coffee.
Check out my other RPubs if you want to learn more about stats in R.
Now that we have a foundation covered for the tidyverse,
it is safe to move back to learning our core statistics. In the next
tutorial, I will explore some new statistical concepts and show how
object-based programming can help us better understand these ideas. Take
care and happy coding!