Today we are going to learn how to plot data in R! While there are some simple plotting functions built into base R (you will often see tutorials that use the plot() command), I encourage you to produce your plots and data visualizations using the ggplot2 package. This package takes a little getting used to, but once you understand the syntax you will be making effective graphs and visualizations in no time! Visualizing data is an important part of the data analysis process: it helps us to better understand the data and its distribution, it allows us to identify and communicate patterns in simple and visually appealing ways, and it enables us to condense a large amount of technical information into a single diagram.

This plot isn’t bad, but it isn’t very nice looking either. The ggplot2 package gives us so much flexibility to customize our plots - we’ll make a much nicer version of this soon. Before we get to that, we first need to learn a bit about the syntax of ggplot2.
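As a rough sketch of that syntax, a ggplot2 call has three parts: the data, the aesthetic mappings inside aes(), and at least one geom layer. The data frame wide_data below is a hypothetical stand-in for the weather data, and its values are invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the tutorial's weather data (values are made up)
wide_data <- data.frame(
  date = as.Date("2014-07-01") + 0:5,
  actual_mean_temp = c(81, 82, 79, 77, 80, 84),
  actual_min_temp  = c(72, 73, 70, 68, 71, 75),
  actual_max_temp  = c(90, 91, 88, 86, 89, 93)
)

# 1. data, 2. aesthetic mappings (which column goes on which axis),
# 3. a geom layer that decides how the data is drawn
p <- ggplot(data = wide_data, aes(x = date, y = actual_mean_temp)) +
  geom_point(color = "blue")

p  # draw the plot
```

Layers are joined with +, so you can keep adding geoms, labels, and themes to the same plot object.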

There are a few important things to point out about the code above. First, I have put the color = argument in the geom_point layer. This tells R to use the color blue for the points in that layer - when we create more complex graphs, being able to customize each geom layer individually becomes really important. Second, the color that I choose comes next, in quotation marks. What happens if we leave the quotation marks out?
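Here is a small sketch of that experiment. It assumes a made-up wide_data data frame (hypothetical values) and that no object named blue exists in your session. The tryCatch() wrapper is only there so the script keeps running and we can look at the error message.

```r
library(ggplot2)

# Hypothetical stand-in for the tutorial's weather data (values are made up)
wide_data <- data.frame(
  date = as.Date("2014-07-01") + 0:5,
  actual_mean_temp = c(81, 82, 79, 77, 80, 84)
)

# Without quotes, R looks for an object named blue. None exists here,
# so building the layer fails; tryCatch() captures the error message.
msg <- tryCatch(
  {
    ggplot(wide_data, aes(x = date, y = actual_mean_temp)) +
      geom_point(color = blue)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

msg  # the error message mentions the missing object
```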

As you can see, without the quotes, R thinks that we are using an object called blue to set the color of the graph! You could actually do that, like this:
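A minimal sketch, again using a made-up wide_data data frame (hypothetical values): once an object called blue holds a color name, the unquoted version works.

```r
library(ggplot2)

# Hypothetical stand-in for the tutorial's weather data (values are made up)
wide_data <- data.frame(
  date = as.Date("2014-07-01") + 0:5,
  actual_mean_temp = c(81, 82, 79, 77, 80, 84)
)

# Store the color name in an object...
blue <- "blue"

# ...and pass the object (no quotes) to the geom layer
p <- ggplot(wide_data, aes(x = date, y = actual_mean_temp)) +
  geom_point(color = blue)

p  # draw the plot
```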

Being able to set the color scheme of a graph using an object is much more useful when you are working with a color palette (that is, when you need to use multiple colors to symbolize a graph). We will see an example like that soon. Let’s look at some other customization features!

Hopefully that makes sense now! So, the graph of multiple variables looks pretty nice! But there is no legend on our graph! How will people know what each color represents? This is a somewhat annoying limitation of ggplot2, and it’s a problem that you’ll come across somewhat frequently: ggplot2 only draws a legend for aesthetics that are mapped inside aes(), so colors set directly in a geom layer don’t get one. There are two ways to fix it. First, we can add a legend manually. I’ll show you how to do that first. Second, we can reshape the data - this is a somewhat more complicated method, but it ends up being extremely useful when you have more than a few variables to graph. I’ll explain that method second.
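One common way to add a legend manually is sketched below. It may differ in detail from the approach used in this tutorial. The idea is to move color inside aes() with a label for each layer, then tell ggplot2 which color each label should get with scale_color_manual(). The wide_data data frame and its values are made up, and the label names and colors are my own choices.

```r
library(ggplot2)

# Hypothetical stand-in for the tutorial's weather data (values are made up)
wide_data <- data.frame(
  date = as.Date("2014-07-01") + 0:5,
  actual_mean_temp = c(81, 82, 79, 77, 80, 84),
  actual_min_temp  = c(72, 73, 70, 68, 71, 75),
  actual_max_temp  = c(90, 91, 88, 86, 89, 93)
)

# Inside aes(), color = "Mean" is a label, not a color name.
# scale_color_manual() then maps each label to an actual color.
p <- ggplot(wide_data, aes(x = date)) +
  geom_point(aes(y = actual_mean_temp, color = "Mean")) +
  geom_point(aes(y = actual_min_temp, color = "Min")) +
  geom_point(aes(y = actual_max_temp, color = "Max")) +
  scale_color_manual(
    name = "Temperature",
    values = c(Mean = "black", Min = "blue", Max = "red")
  ) +
  labs(x = "Date", y = "Temperature")

p  # draw the plot, now with a legend
```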

Doesn’t that look nice? Let’s talk about reshaping your data now. This is a very helpful skill to have, and you will find yourself having to reshape data frequently. Right now, each variable has its own column in the dataframe; in our case, we’re working with four columns of data. Since three of the variables are temperature data, wouldn’t it make sense to put them all in one column instead? This is what we call transforming data from wide to long format - wide data has more columns, while long data will usually have more rows instead. Here’s a diagram of what it looks like to transform from long to wide: https://i.stack.imgur.com/i1Dne.jpg. You can look at this diagram later if you’re confused about what wide and long format look like. For whatever reason, the ggplot2 package tends to work better with long data. We’ll reshape the data using a command from the tidyr package, pivot_longer.
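Here is a sketch of what that reshaping might look like, assuming a made-up wide_data data frame with one date column and three temperature columns (hypothetical values). The short names mean, min, and max are chosen to match the temp_type categories used in the rest of this tutorial. Renaming the columns before pivoting is just one way to get them.

```r
library(dplyr)  # provides %>% and rename()
library(tidyr)  # provides pivot_longer()

# Hypothetical stand-in for the tutorial's weather data (values are made up)
wide_data <- data.frame(
  date = as.Date("2014-07-01") + 0:5,
  actual_mean_temp = c(81, 82, 79, 77, 80, 84),
  actual_min_temp  = c(72, 73, 70, 68, 71, 75),
  actual_max_temp  = c(90, 91, 88, 86, 89, 93)
)

long_data <- wide_data %>%
  # Shorter column names become the category labels in temp_type
  rename(mean = actual_mean_temp, min = actual_min_temp, max = actual_max_temp) %>%
  # Stack the three temperature columns into two: the category and the value
  pivot_longer(cols = c(mean, min, max),
               names_to = "temp_type",
               values_to = "temp")

head(long_data)
```

Each date now appears three times, once per temperature type, so 6 rows of wide data become 18 rows of long data.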

# What if I wanted to go from long to wide data? Here's what that looks like, for your reference:
library(dplyr)  # provides the %>% pipe
library(tidyr)  # provides pivot_wider()

wide_data1 <- long_data %>%
  pivot_wider(names_from = temp_type, values_from = temp)

# Run this code to verify that the original wide_data and our new wide_data1 dataframe are the same. Logical data is helpful in this case!
# I'm setting the number of values that R prints out to 20 - otherwise, R would print the whole dataset on the screen, and it would take up a lot of space!
options(max.print = 20)
wide_data1 == wide_data

Do you see the difference? Now, the category (temp_type, or mean, min, and max) is in one column, while each temperature that corresponds to the temp_type and day is in the temp column. The wide and long data sets are just different ways of storing the same data! Now let’s see how this works in ggplot2.
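A minimal sketch of plotting long data, assuming a made-up long_data data frame with date, temp_type, and temp columns (hypothetical values). Because color is mapped to temp_type inside aes(), one geom layer covers all three variables and the legend comes for free.

```r
library(ggplot2)

# Hypothetical long-format data (values are made up)
long_data <- data.frame(
  date = rep(as.Date("2014-07-01") + 0:5, times = 3),
  temp_type = rep(c("mean", "min", "max"), each = 6),
  temp = c(81, 82, 79, 77, 80, 84,
           72, 73, 70, 68, 71, 75,
           90, 91, 88, 86, 89, 93)
)

# One layer, three colors: ggplot2 splits the points by temp_type
p <- ggplot(long_data, aes(x = date, y = temp, color = temp_type)) +
  geom_point()

p  # draw the plot
```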

Do you see why I changed the column names? R uses the categories in the temp_type column to add names to the legend. Keeping the “actual_mean_temp” (and so on) labels would not have been nearly as clear in a legend. In our graphs, we should aim to show complex information in the simplest way possible - having clear legend and axis titles is key to that. Now, in this case, the colors aren’t quite right! Let’s set them manually.
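One way to set the colors manually is sketched below, assuming the same kind of made-up long_data data frame (hypothetical values). The object temp_colors is a name I made up for the palette, and the colors themselves are just examples.

```r
library(ggplot2)

# Hypothetical long-format data (values are made up)
long_data <- data.frame(
  date = rep(as.Date("2014-07-01") + 0:5, times = 3),
  temp_type = rep(c("mean", "min", "max"), each = 6),
  temp = c(81, 82, 79, 77, 80, 84,
           72, 73, 70, 68, 71, 75,
           90, 91, 88, 86, 89, 93)
)

# A named vector works as a palette: names are categories, values are colors
temp_colors <- c(mean = "black", min = "blue", max = "red")

p <- ggplot(long_data, aes(x = date, y = temp, color = temp_type)) +
  geom_point() +
  scale_color_manual(values = temp_colors)

p  # draw the plot
```

Naming the vector means each category gets its intended color regardless of the order in which the categories appear in the data.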

Do you see how much simpler and shorter the ggplot2 code is now? Reshaping data definitely takes some getting used to, but it’ll save you time in the future. Before we move on, I’ll show you one more way to set the colors, now that we have the data in long format. This third version will introduce you to a function that is helpful within dplyr functions, ifelse.
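As a rough sketch of how ifelse can be used here (not necessarily the exact version this tutorial has in mind), we can build a column of color names with mutate() and ask ggplot2 to use those values as-is with scale_color_identity(). The long_data data frame, the point_color column name, and the colors are all assumptions for illustration.

```r
library(ggplot2)
library(dplyr)  # provides %>% and mutate()

# Hypothetical long-format data (values are made up)
long_data <- data.frame(
  date = rep(as.Date("2014-07-01") + 0:5, times = 3),
  temp_type = rep(c("mean", "min", "max"), each = 6),
  temp = c(81, 82, 79, 77, 80, 84,
           72, 73, 70, 68, 71, 75,
           90, 91, 88, 86, 89, 93)
)

# ifelse(test, yes, no) is applied row by row; nesting handles a third case
colored_data <- long_data %>%
  mutate(point_color = ifelse(temp_type == "max", "red",
                       ifelse(temp_type == "min", "blue", "black")))

# scale_color_identity() uses the values in point_color directly as colors
p <- ggplot(colored_data, aes(x = date, y = temp, color = point_color)) +
  geom_point() +
  scale_color_identity()

p  # draw the plot
```

Note that scale_color_identity() does not draw a legend by default, which is one reason the palette approach is usually more convenient.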

library(tidyr)

Overall, the second method of adding color for multiple variables (reshaping the data and using a color palette object to set colors) is probably the most flexible method. Often, you’ll be able to set palettes using functions, which makes the process even faster; it’s unlikely that you’ll need to manually specify colors. In the next chunk of code, I’ll give an example using a package that contains some really nice color palettes inspired by US national parks. More info on the palettes is here: https://github.com/katiejolly/nationalparkcolors. You can use the code from the GitHub link to install the package, too. Other packages, like RColorBrewer, also provide a range of palettes to choose from.
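To illustrate setting a palette with a function, here is a small sketch that uses the RColorBrewer palettes through ggplot2’s scale_color_brewer() rather than the national parks package. The long_data data frame is made up (hypothetical values), and "Dark2" is just one of the available palette names.

```r
library(ggplot2)

# Hypothetical long-format data (values are made up)
long_data <- data.frame(
  date = rep(as.Date("2014-07-01") + 0:5, times = 3),
  temp_type = rep(c("mean", "min", "max"), each = 6),
  temp = c(81, 82, 79, 77, 80, 84,
           72, 73, 70, 68, 71, 75,
           90, 91, 88, 86, 89, 93)
)

# scale_color_brewer() picks one palette color per category automatically
p <- ggplot(long_data, aes(x = date, y = temp, color = temp_type)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")

p  # draw the plot
```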

So now you know the basics of graphing with ggplot2! There are just a few more topics to cover that you will find helpful. First, what if I don’t want to use a scatter plot? The ggplot2 package comes with a wide range of geom possibilities! It’s so easy to produce different kinds of plots of your data. Let’s make a line plot with the data we already have.
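A minimal sketch of a line plot, assuming a made-up long_data data frame (hypothetical values): the only change from the scatter plot is swapping geom_point() for geom_line().

```r
library(ggplot2)

# Hypothetical long-format data (values are made up)
long_data <- data.frame(
  date = rep(as.Date("2014-07-01") + 0:5, times = 3),
  temp_type = rep(c("mean", "min", "max"), each = 6),
  temp = c(81, 82, 79, 77, 80, 84,
           72, 73, 70, 68, 71, 75,
           90, 91, 88, 86, 89, 93)
)

# Mapping color to temp_type also groups the data, so we get one line per type
p <- ggplot(long_data, aes(x = date, y = temp, color = temp_type)) +
  geom_line()

p  # draw the plot
```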

The line plot doesn’t look nice and smooth because we’re working with daily data - there are a lot of data points, and the temperatures move around a lot! But as you can see, switching to a line plot was so easy. Next, we can look at a bar plot. Line plots, scatter plots, and bar plots will be the most common plots you’ll use.
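Here is one possible bar plot, sketched with a made-up long_data data frame (hypothetical values). I’m assuming geom_col(), which draws bars whose heights are the values in the data. The position = "dodge" argument places the three temperature types side by side instead of stacking them.

```r
library(ggplot2)

# Hypothetical long-format data (values are made up)
long_data <- data.frame(
  date = rep(as.Date("2014-07-01") + 0:5, times = 3),
  temp_type = rep(c("mean", "min", "max"), each = 6),
  temp = c(81, 82, 79, 77, 80, 84,
           72, 73, 70, 68, 71, 75,
           90, 91, 88, 86, 89, 93)
)

# Bars are filled with fill = rather than color =, which only sets the outline
p <- ggplot(long_data, aes(x = date, y = temp, fill = temp_type)) +
  geom_col(position = "dodge")

p  # draw the plot
```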

Let’s think about some other ways you can customize your plots. One way that we can make our plot look nice is by adding a visual theme. There are a number of themes you can add to your plot - we’ll download a new package now that contains some additional themes as well.

If you type theme_ in an editor with autocomplete (such as RStudio), you should get a drop-down menu of all of the possible themes to choose from - I would encourage you to play around with them, and see which one you like best!
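As a small sketch, here is a theme that ships with ggplot2 itself, so no extra package is needed. The long_data data frame is made up (hypothetical values), and theme_minimal() is just one example. Try swapping in theme_bw() or theme_classic().

```r
library(ggplot2)

# Hypothetical long-format data (values are made up)
long_data <- data.frame(
  date = rep(as.Date("2014-07-01") + 0:5, times = 3),
  temp_type = rep(c("mean", "min", "max"), each = 6),
  temp = c(81, 82, 79, 77, 80, 84,
           72, 73, 70, 68, 71, 75,
           90, 91, 88, 86, 89, 93)
)

# A theme is added like any other layer, with +
p <- ggplot(long_data, aes(x = date, y = temp, color = temp_type)) +
  geom_point() +
  theme_minimal()

p  # draw the plot
```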

Finally, can you use more than one plot type in the same plot? You sure can! The code below shows what this looks like by adding a trend line to the data using a smoothing method (don’t worry about how it works, this is just an example). In this example, because we are using the smoothing function to find the average values of each temperature type, we don’t need to change the aesthetics. If, for example, you wanted to make a line graph with one variable over time and a point graph with a different variable over time, you would need to manually include the aesthetics in each geom layer (like we did the first time we added color - in that example, each variable had its own geom layer, and we had to map the aesthetics for each one).
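A sketch of combining two geoms is below. The data is made up: 30 days of invented temperatures following a smooth curve, so the smoother has enough points to work with. I’m assuming a loess smoother here, which may not be the method used in this tutorial. Both layers inherit the aesthetics set in ggplot(), so neither needs its own aes().

```r
library(ggplot2)

# Hypothetical long-format data: 30 days of invented temperatures
days <- as.Date("2014-07-01") + 0:29
mean_temp <- 78 + 6 * sin(seq_along(days) / 4)

long_data <- data.frame(
  date = rep(days, times = 3),
  temp_type = rep(c("mean", "min", "max"), each = length(days)),
  temp = c(mean_temp, mean_temp - 9, mean_temp + 9)
)

# Points and a trend line in the same plot; se = FALSE hides the shaded band
p <- ggplot(long_data, aes(x = date, y = temp, color = temp_type)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

p  # draw the plot
```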

And there you have it! You are now a pro at using ggplot2. You should have all the tools you need to make beautiful and effective visualizations in R. If you want more information on different types of graphs, or you just want a helpful reference to come back to as you progress through the course, you can find an excellent ggplot2 cheat sheet here: https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf.

Resources

FiveThirtyEight (2014). US Weather History. [Data Set]. Retrieved from: https://github.com/fivethirtyeight/data/tree/master/us-weather-history.

Prabhakaran, S. (2017). The Complete ggplot2 Tutorial - Part1 | Introduction To ggplot2. Retrieved from: http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html.

