Introduction

Welcome to Part 4 of my Stats in R Tutorial series. If you missed the last tutorial, you can check it out at the following link. After we clear today’s tutorial, we can move back to learning the basics of statistics and how to put everything we have used together. For now, let’s learn how to make some nice plots!

The Grammar of Graphics

The ggplot2 package is another set of functions from the tidyverse. It’s namesake is somewhat self-evident, as it is obviously used for plotting. However, the “gg” here stands for “grammar of graphics.” This is actually an important distinction that may help you understand what you are doing when constructing plots from this package. So far, we have learned to run really simple plots. As an example, a simple histogram can be made with this code on the airquality dataset.

hist(airquality$Solar.R)

However, this sort of plotting is fairly limited. What if, for example, we want to add a density plot to the histograms? We can plot the density and plot the histogram bars, but we can’t really do both unless we use a lot of separate code like below. Notice that we have to remove missing data before we even plug these functions in because vectorized functions in R cannot handle missing data unless specified.

#### Drop NA Values ####
drop <- na.omit(airquality$Solar.R)

#### Plot Histogram ####
hist(drop, 
     probability = T)

#### Save Density Function ####
dens <- density(drop)

#### Overlay Density ####
lines(dens,
      lwd = 3)

However the same can be done with one line of code in the ggplot function:

ggplot(airquality,
       aes(x=Solar.R))+
  geom_histogram(aes(y = ..density..))+
  geom_density()

You may have just noticed a couple oddities here. First, what is a geom? These are layers of geometry that you are applying to the plot. In essence, ggplot serves as a blank canvas, and these geoms paint plots on this canvas. You may also be asking yourself: what are those + signs doing? Those “stack” the geom or plot layers on top of each other. You can think of ggplot like a hamburger. The ggplot function is the meat and each geom is another layer of toppings, whether that be bread, tomatoes, lettuce, or otherwise. The beauty of hamburgers and plots is that you can add as much as you want, but adding too much can also cause a giant mess. When plotting, one must always balance information transmission and interpretability.

A delicious ggplot hamburger.

A delicious ggplot hamburger.

We will cover a lot of ground today with the ggplot2 package, but keep in mind that this package has a bazillion types of plots and functions. While we cannot cover everything, we can cover most of the foundational content so you get an idea of how it works.

Histograms Revisited

Recall from our previous tutorials that histograms are plots that visualize frequency values. If you have a given variable and want to explain the shape of it’s distribution, as well as where the data is most frequently counted, we can use a histogram to visualize this relationship.

First, let’s try to make our first ggplot step by step. A typical ggplot function includes the data first, then followed by the aes function, which basically says “take this data and arrange it the way I say here.” For example, it is typical for histograms to require one variable, your \(x\) value, but ggplot doesn’t know yet which \(x\) to use. You would have to specify with aes(x = variable), the “variable” part here being the name of your \(x\). Let’s try it out on the airquality data once more, using the Solar.R variable as our \(x\).

ggplot(airquality,
       aes(x=Solar.R))

Oh no! All it did was show white space where our plot should be. I showed this on purpose. Recall previously that I mentioned that ggplot serves as a blank canvas that you must build on. Here you can see that it has taken our data values (seen on the axes), but it hasn’t “painted” anything yet. That’s because we have only supplied the data to ggplot without any geoms. To create a histogram plot layer, we just add geom_histogram, and we make sure to use the + operator to connect it to the plot.

ggplot(airquality,
       aes(x=Solar.R))+
  geom_histogram()

This likely isn’t that enthralling. First off, let’s fill the bars with a different color. You can choose any color you want from this color wheel or just use one of the preset colors in R. For now, let’s use hotpink because why not?

ggplot(airquality,
       aes(x=Solar.R))+
  geom_histogram(fill = "hotpink")

Beautiful. We may want to also color the lines in so the bars are more distinct from each other. Let’s now use color and supply the character “black” this time.

ggplot(airquality,
       aes(x=Solar.R))+
  geom_histogram(fill = "hotpink",
                 color = "black")

Not bad. Let’s try one more of these arguments. Let’s now change the number of bins, so we have more data stacked into each.

ggplot(airquality,
       aes(x=Solar.R))+
  geom_histogram(fill = "hotpink",
                 color = "black",
                 bins = 10)

Not bad. Each geom will have it’s own arguments, and even when a geom uses similar arguments, it may apply them in different ways, so try to experiment and see how they differ (you can always inspect the help page by running ?geom_histogram for example). One nice way that we can add to this plot is by relabeling a bit. Let’s change our x-axis, y-axis, and title. We can do this easily with the labs function, which is added on just like geom_histogram.

ggplot(airquality,
       aes(x=Solar.R))+
  geom_histogram(fill = "hotpink",
                 color = "black",
                 bins = 10)+
  labs(x="Solar Radiation",
       y="Count",
       title="Solar Radiation Histogram (Black Pink Version)")

Something else that can spice things up is to add a theme. There are quite a few plot themes in R that are not limited to the tidyverse package, but one is theme_bw, which applies a simple black and white scheme to the plot. Usually the themes don’t require any arguments and can be simply run as-is with the ggplot.

ggplot(airquality,
       aes(x=Solar.R))+
  geom_histogram(fill = "hotpink",
                 color = "black",
                 bins = 10)+
  labs(x="Solar Radiation",
       y="Count",
       title="Solar Radiation Histogram (Black Pink Version)")+
  theme_bw()

Let’s add just one more function here so we can expand our vocabulary. A useful function is facet_wrap, which takes a plot and splits it up by a factor. Our airquality data has solar radiation data for different months. Let’s check out how each month differs. We can do this by writing facet_wrap(~ Month). The tilda states that the data (implicitly to the left of the tilda) is going to be disaggregated by everything to the right (here the Month variable). You can also specify how many columns and rows you use. Here we will use 5 columns since that is how many months we have. We will also change the color of the plot to dark gray with the same fill argument. Finally, we will change the title to be more specific about what we are doing.

ggplot(airquality,
       aes(x=Solar.R))+
  geom_histogram(fill = "darkgray",
                 color = "black",
                 bins = 10)+
  labs(x="Solar Radiation",
       y="Count",
       title="Solar Radiation by Month")+
  theme_bw()+
  facet_wrap(~Month,
             ncol = 5)

There are several functions we can add here, but you can probably see the pattern now. The functions which we add “stack” on top of each other, further customizing the plot. This in turn gives us a very detailed plot compared to base R plots. Let’s move on to another plot to learn more.

Boxplots Revisited

Remember that boxplots allow us to check the average variation in a given variable according to some grouping variable. It provides us with the IQR, range, median, and even outliers. The boxplots in ggplot are much nicer than the bare ones native to R. Let’s check monthly averages of wind speed by using the code below. This time, we have to specify an \(x\) and \(y\) value. The x here will be our categorical variable, whereas the data variation we are looking to observe will be our y. Then we do as before and add a geom, this time geom_boxplot.

ggplot(airquality,
       aes(x=Month,
           y=Wind))+
  geom_boxplot()

Oh no! We have hit another speed bump. If you are using a non-categorical variable as a factor in ggplot, R will get confused and turn everything into one boxplot. We will use the factor function we have used before to transform this numeric data into categorical data.

ggplot(airquality,
       aes(x=factor(Month),
           y=Wind))+
  geom_boxplot()

Much better! But you can already see another problem. R has labeled our data quite literally on the x-axis. We already know how to fix this. Let’s give our boxplot some better labels.

ggplot(airquality,
       aes(x=factor(Month),
           y=Wind))+
  geom_boxplot()+
  labs(x="Month",
       y="Wind Speed (mph)",
       title="Monthly Variation in Wind Speed")

Now we are going to introduce another concept. Recall that I mentioned how aes takes your data and explicitly tells R what to do with it. We are going to use fill again, but this time instead of using it as an argument in our geom, we are going to add it to aes instead. We want to fill in color for each month. To do this, we will tell aes that we want fill=Month. This tells aes to “fill the geom with colors by month”. This may be too abstract, so let’s just see it in action instead.

ggplot(airquality,
       aes(x=factor(Month),
           y=Wind,
           fill=factor(Month)))+
  geom_boxplot()+
  labs(x="Month",
       y="Wind Speed (mph)",
       title="Monthly Variation in Wind Speed")

Fantastic. You may notice that now R has supplied a legend to the right. This is very useful in cases where the categorical labeling is not obvious. However, our plot already has an x-axis label for month, so this isn’t necessary. We will now change the theme of the plot manually to fix this issue.

The theme function takes whatever plot layers you have and edits in minor fixes to each part you provide. To get rid of the legend, we have to specify that we want our legend blank. To do this, we will use the function theme and tell the legend.position argument we don’t want it by supplying "none" as a character vector switching this off.

ggplot(airquality,
       aes(x=factor(Month),
           y=Wind,
           fill=factor(Month)))+
  geom_boxplot()+
  labs(x="Month",
       y="Wind Speed (mph)",
       title="Monthly Variation in Wind Speed")+
  theme(legend.position = "none")

Not bad so far. But notice that the y axis has a limited range of values. Perhaps we want to change the intervals so there is more information we can go off of. We can use scale_y_continuous (or scale_x_continuous for the x-axis) to redistribute the intervals. All we have to do is add n.breaks and how many that we want.

ggplot(airquality,
       aes(x=factor(Month),
           y=Wind,
           fill=factor(Month)))+
  geom_boxplot()+
  labs(x="Month",
       y="Wind Speed (mph)",
       title="Monthly Variation in Wind Speed")+
  theme(legend.position = "none")+
  scale_y_continuous(n.breaks = 20)

We can see now that wind speed doesn’t vary that much in general, though some months have higher/lower wind speeds compared to others. You can also see two dots above and below the range for June (Month 6). Remember that these dots represent outliers above and below the IQR. To finish off this plot, let’s consider changing the palette of colors we have here. Perhaps the natural shade isn’t that exciting. We can supply our own manually into a character vector, then add them to our plot with scale_fill_manual using the value argument.

#### Create Colors ####
colors <- c("darkgray", 
                      "lightblue",
                      "steelblue",
                      "royalblue",
                      "steelblue1")

#### Create Plot ####
ggplot(airquality,
       aes(x=factor(Month),
           y=Wind,
           fill=factor(Month)))+
  geom_boxplot()+
  labs(x="Month",
       y="Wind Speed (mph)",
       title="Monthly Variation in Wind Speed")+
  theme(legend.position = "none")+
  scale_y_continuous(n.breaks = 20)+
  scale_fill_manual(values = colors)

Lookin’ good! Let’s move on.

Combining DPLYR and GGPLOT

We will learn a couple more plot types, but now that we have learned a bit of ggplot, it is now time to also revisit dplyr. Recall from the last tutorial that dplyr allows you to manipulate data in some rather complex and useful ways. Combining these two packages will leverage a lot of power in your data science skills, and it is recommended to get familiar with using both if you want to excel at R.

You have seen that ggplot uses a similar method of code chunking. Rather than use the %>% operator, it uses a + operator, but the idea is the same. They however do not mix well. Make sure that your dplyr functions distinctly have the %>% after each function and ggplot has a + after each function. When you finish writing a chain of functions, do not use either operator. Doing so will make R think that you are not finished with your code and will wig out.

Anyways, we will create a scatter plot of values using this principle. We will explore these in much more depth once we tackle regression in a future tutorial, but they can be useful for many other statistical explorations as well. For starting our plot off, we do so in a similar fashion to geom_boxplot, only this time we use only numeric values for \(x\) and \(y\), followed by geom_point.

ggplot(airquality,
       aes(x=Month,
           y=Wind))+
  geom_point()

Notice that our scatter looks…pretty stupid so far. That’s because we essentially binned repeated values into our x-axis. Before we fix this, let’s start off with a really basic combo of dplyr and ggplot. This time, let’s simply pipe our data into the plot, and omit the data from the ggplot function.

airquality %>% 
  ggplot(aes(x=Month,
             y=Wind))+
  geom_point()

Look at that. We essentially achieved the same thing, but piped our data into the plot before adding layers. Now we are going to add some complexity. The data needs to be aggregated into unique values for each month, so what we can do is group our data by month and take it’s mean. The dplyr way is shown below, which we have already learned.

airquality %>% 
  group_by(Month) %>% 
  summarise(Mean_Wind = mean(Wind, na.rm = T))
## # A tibble: 5 × 2
##   Month Mean_Wind
##   <int>     <dbl>
## 1     5     11.6 
## 2     6     10.3 
## 3     7      8.94
## 4     8      8.79
## 5     9     10.2

Now we just have to add this to the ggplot function. Since we have created a new variable, our \(y\) will be called Mean_Wind this time.

airquality %>% 
  group_by(Month) %>% 
  summarise(Mean_Wind = mean(Wind, na.rm = T)) %>% 
  ggplot(aes(x=Month,
             y=Mean_Wind))+
  geom_point()

This is starting to look better. We can now clearly see the average wind speed for each month. We can begin to tell a story with this plot by adding a line between the points. Simply add geom_line to add one more layer to the plot, and with it, a line between our scatter plot points.

airquality %>% 
  group_by(Month) %>% 
  summarise(Mean_Wind = mean(Wind, na.rm = T)) %>% 
  ggplot(aes(x=Month,
             y=Mean_Wind))+
  geom_point()+
  geom_line()

Go ahead and fix the labels and theme of the plot, this time using theme_classic, which makes a more minimalist look for your plot.

airquality %>% 
  group_by(Month) %>% 
  summarise(Mean_Wind = mean(Wind, na.rm = T)) %>% 
  ggplot(aes(x=Month,
             y=Mean_Wind))+
  geom_point()+
  geom_line()+
  labs(x="Month",
       y="Mean Wind Speed (mph)",
       title = "Mean Wind Speed by Month")+
  theme_classic()

This still doesn’t pop yet. Let’s add three more arguments to geom_point. We will set size = 5 to increase the size of the points, alpha = .5 to make them semi-translucent, and write color = "steelblue" to change it’s color.

airquality %>% 
  group_by(Month) %>% 
  summarise(Mean_Wind = mean(Wind, na.rm = T)) %>% 
  ggplot(aes(x=Month,
             y=Mean_Wind))+
  geom_point(size = 5,
             alpha = .5,
             color = "steelblue")+
  geom_line()+
  labs(x="Month",
       y="Mean Wind Speed (mph)",
       title = "Mean Wind Speed by Month")+
  theme_classic()

The line still looks pretty weak. Let’s set the linewidth to 1 to make it bulkier. Let’s also scale our y-axis again so there are more values to look at.

airquality %>% 
  group_by(Month) %>% 
  summarise(Mean_Wind = mean(Wind, na.rm = T)) %>% 
  ggplot(aes(x=Month,
             y=Mean_Wind))+
  geom_point(size = 5,
             alpha = .5,
             color = "steelblue")+
  geom_line(linewidth = 1)+
  labs(x="Month",
       y="Mean Wind Speed (mph)",
       title = "Mean Wind Speed by Month")+
  theme_classic()+
  scale_y_continuous(n.breaks = 10)

Our plot is looking better. Clearly we have a curvilinear relationship going on here, with wind speed fluctuating between months. However, maybe we are only interested in the wind speed from May to July. Let’s filter for this.

airquality %>% 
  group_by(Month) %>% 
  summarise(Mean_Wind = mean(Wind, na.rm = T)) %>% 
  filter(Month < 8) %>% 
  ggplot(aes(x=Month,
             y=Mean_Wind))+
  geom_point(size = 5,
             alpha = .5,
             color = "steelblue")+
  geom_line(linewidth = 1)+
  labs(x="Month",
       y="Mean Wind Speed (mph)",
       title = "Mean Wind Speed by Month")+
  theme_classic()+
  scale_y_continuous(n.breaks = 10)

Here we can see the trend is very linear if we do not look at subsequent months. Filtering before plotting can narrow down your search of information when you have a lot of data to work with, making dplyr extra helpful.

Let’s do one last chain of functions to highlight this functionality more. We are going to use pivot_longer to compare trends with the the same scatter plot / line graph. This time we want to compare ozone, wind speed, and solar radiation levels by month. Let’s first select our variables, group them by month, and get their mean. Let’s save this as air.avg.

air.avg <- airquality %>% 
  select(Month, Wind, Ozone, Solar.R) %>% 
  group_by(Month) %>% 
  summarise(Mean_Wind = mean(Wind, na.rm = T),
            Mean_Ozone = mean(Ozone, na.rm = T),
            Mean_Solar = mean(Solar.R, na.rm = T))
air.avg
## # A tibble: 5 × 4
##   Month Mean_Wind Mean_Ozone Mean_Solar
##   <int>     <dbl>      <dbl>      <dbl>
## 1     5     11.6        23.6       181.
## 2     6     10.3        29.4       190.
## 3     7      8.94       59.1       216.
## 4     8      8.79       60.0       172.
## 5     9     10.2        31.4       167.

Let’s then pivot this data and call it air.pivot. Use the contains function to select all of the variables with “mean” in their name.

air.pivot <- air.avg %>% 
  pivot_longer(cols = contains("Mean"),
               names_to = "Measure",
               values_to = "Value")
air.pivot
## # A tibble: 15 × 3
##    Month Measure     Value
##    <int> <chr>       <dbl>
##  1     5 Mean_Wind   11.6 
##  2     5 Mean_Ozone  23.6 
##  3     5 Mean_Solar 181.  
##  4     6 Mean_Wind   10.3 
##  5     6 Mean_Ozone  29.4 
##  6     6 Mean_Solar 190.  
##  7     7 Mean_Wind    8.94
##  8     7 Mean_Ozone  59.1 
##  9     7 Mean_Solar 216.  
## 10     8 Mean_Wind    8.79
## 11     8 Mean_Ozone  60.0 
## 12     8 Mean_Solar 172.  
## 13     9 Mean_Wind   10.2 
## 14     9 Mean_Ozone  31.4 
## 15     9 Mean_Solar 167.

Now let’s plot this pivoted data. We will use the color argument in the aes function again to color the data in by each measure we labeled.

air.pivot %>% 
  ggplot(aes(x=Month,
             y=Value,
             color=Measure))+
  geom_point()+
  geom_line()

We are getting something closer to what we want. However, you can see the variables have much different values from each other and are not as easily comparable. We could scale them using scale, but that would make it slightly more difficult to interpret. Instead, we can use facet_wrap again and use the scales=free argument to help us visualize it better. We can also remove the legend and add our own colors like last time. Finally, we can give ourselves a title with labs. Also, I have added theme_classic before theme. If you swap the order, R will overwrite your theme changes with the “defaults” of theme_classic. This may seem like a lot of steps, but you already know most of this from what we have covered so far, so you can just apply what you have learned.

air.pivot %>% 
  ggplot(aes(x=Month,
             y=Value,
             color=Measure))+
  geom_point()+
  geom_line()+
  facet_wrap(~Measure,
             scales = "free")+
  theme_classic()+
  theme(legend.position = "none")+
  scale_color_manual(values = c("darkred",
                                         "darkorange",
                                         "darkblue"))+
                                           labs(title = "Weather Measurements by Month")

Now we can clearly see some interesting trends! Not all of these variables have the same relationship across time, as ozone and wind are almost inverses of each other. Just for reference, we could have done all of this with one code…

airquality %>% 
  select(Month, Wind, Ozone, Solar.R) %>% 
  group_by(Month) %>% 
  summarise(Mean_Wind = mean(Wind, na.rm = T),
            Mean_Ozone = mean(Ozone, na.rm = T),
            Mean_Solar = mean(Solar.R, na.rm = T)) %>% 
  pivot_longer(cols = contains("Mean"),
               names_to = "Measure",
               values_to = "Value") %>% 
  ggplot(aes(x=Month,
             y=Value,
             color=Measure))+
  geom_point()+
  geom_line()+
  facet_wrap(~Measure,
             scales = "free")+
  theme_classic()+
  theme(legend.position = "none")+
  scale_color_manual(values = c("darkred",
                                         "darkorange",
                                         "darkblue"))

…but in this case it is better to break it up into chunks since it is quite a lot. Now you understand why leveraging dplyr and ggplot can be really helpful, as well as how saving objects can help break up your analysis into meaningful steps.

Conclusion

We did quite a bit today, and I would recommend going through the code here and experimenting with it on different data (remember that R has many built-in datasets to utilize). I will once again recommend the book R for Data Science by Hadley Wickham if you want to learn much more about plotting in the tidyverse, as it’s coverage is quite deep compared to here. Unrelated to R, there is also a fantastic book called Storytelling with Data by Cole Nussbaumer Knaflic which really does a great job of explaining how to effectively visualize data. I recommend reading it if you want to be efficient with how you explain data with plots.

My Other RPubs

Thank you for reading this tutorial. If you felt that any part of this tutorial was helpful, and you would like to support me, please consider buying me a cup of coffee.

Check out my other RPubs if you want to learn more about stats in R.

Now that we have a foundation covered for the tidyverse, it is safe to move back to learning our core statistics. In the next tutorial, I will explore some new statistical concepts and show how object-based programming can help us better understand these ideas. Take care and happy coding!