Today we are going to learn how to plot data in R! While there are
some simple plotting functions built into base R (you will often see
tutorials that use the plot() command), I encourage you to produce your
plots and data visualizations using the ggplot2 package in R. This
package takes a little getting used to, but once you understand the
syntax you will be making effective graphs and visualizations in no
time! Visualizing data is such an important part of the data analysis
process: it helps us to better understand the data and its distribution,
it allows us to identify and communicate patterns in simple and visually
appealing ways, and it enables us to condense a large amount of
technical information into a diagram or visual.
# Make sure you install ggplot2 first! Let's load the packages and set the working directory.
setwd("C:/Users/melha/OneDrive/Documents/Binghamton/geog380")
library(dplyr)
library(tidyr)
library(ggplot2)
# We'll start by loading in some data to play with! We'll use NYC temperature data for this tutorial.
temps <- read.csv("temps_nyc.csv")
#Take a look at this dataset! It contains mean, min, and max temperatures in NYC for an entire year (2014).
#What if we wanted to plot the temperatures over time? We could plot it using base R like so:
plot(temps$day, temps$actual_mean_temp)

# For all plots, the syntax is usually (x = , y = ) - we'll put time (days) on the horizontal axis, and temperatures on the vertical axis. Putting the time variable on the x-axis is pretty standard.
This plot isn’t bad, but it isn’t very nice looking either. The
ggplot2 package gives us so much flexibility to customize our plots -
we’ll make a much nicer version of this soon. Before we get to that, we
first need to learn a bit about the syntax of ggplot2.
#Let's look at the first line of a basic ggplot graph:
ggplot(temps, aes(x = day, y = actual_mean_temp))

#When you use the ggplot() command, you need to supply a few key arguments. The first is the dataset - in this case, we will be using the temps data (as shown). The next part, called the aesthetic mapping or aes of the plot, tells us what we will be plotting from the dataset. Later, we will also include some characteristics of the plot in the aes() section. Can we plot the graph now? Not just yet! We need to add a geom layer - the geom layer tells ggplot2 what kind of visualization to produce with the data. We use a + sign to indicate a new layer in the plot like this (here I'm using geom_point to tell ggplot2 to draw a scatter plot):
ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point()


There are a few important things to point out about the code above.
First, I have put the color = argument in the geom_point layer. This tells R to
use the color blue for the points - when we create more complex graphs,
being able to customize each geom layer individually becomes really
important. Second, the color that I choose comes next in quotation
marks. What happens if we leave them out?
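
Here is a minimal sketch of both cases, in case you want to try it yourself. The temps_demo data frame is a made-up stand-in for the temps data (the values are invented for illustration), and tryCatch() is only there so the script keeps running when the unquoted version fails.

```r
library(ggplot2)

# Hypothetical stand-in for the temps data (values are made up)
temps_demo <- data.frame(
  day = 1:5,
  actual_mean_temp = c(30, 34, 29, 41, 38)
)

# With quotes: "blue" is a fixed color name applied to every point in the layer
p_quoted <- ggplot(temps_demo, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = "blue")

# Without quotes: R searches for an object named blue. None exists here,
# so the call fails; tryCatch() stores the error message instead of stopping.
unquoted_result <- tryCatch(
  {
    ggplot(temps_demo, aes(x = day, y = actual_mean_temp)) +
      geom_point(color = blue)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(unquoted_result)
```

Printing p_quoted draws the scatter plot, while unquoted_result holds the error message that R produced for the version without quotes.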

As you can see, without the quotes, R thinks that we are using an
object called blue to set the color of the graph! You could actually do
that, like this:
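
A small sketch of that idea, assuming a made-up temps_demo data frame in place of the real temps data: we first create an object named blue that stores a color name, and then pass the object (no quotes) to geom_point.

```r
library(ggplot2)

# Hypothetical stand-in for the temps data (values are made up)
temps_demo <- data.frame(
  day = 1:5,
  actual_mean_temp = c(30, 34, 29, 41, 38)
)

# An object called blue that stores a color name as text
blue <- "blue"

# No quotes this time: R finds the object blue and uses the value inside it
p_object <- ggplot(temps_demo, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = blue)

p_object
```

The object could hold any valid color name; the points take whatever value is stored in it.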

Being able to set the color scheme of a graph using an object is much
more useful when you are working with a color palette (that is, when you
need to use multiple colors to symbolize a graph). We will see an
example like that soon. Let’s look at some other customization
features!




Hopefully that makes sense now! So, the graph of multiple variables
looks pretty nice! But there is no legend on our graph! How will people
know what each color represents? This is a somewhat annoying limitation
of ggplot2, and it’s a problem that you’ll come across somewhat
frequently. There are two ways to fix it: first, we can add a legend
manually. I’ll show you how to do that first. Second, we can reshape the
data - this is a somewhat more complicated method, but it ends up being
extremely useful when you have more than a few variables to graph. I’ll
explain that method second.
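
One common way to add a legend manually is sketched below. This is an illustration under assumptions, not necessarily the exact code used in this tutorial: temps_demo is a made-up data frame, and the column names actual_min_temp and actual_max_temp are assumed. The trick is to put a label inside aes() for each layer, and then tell scale_color_manual() which color belongs to each label.

```r
library(ggplot2)

# Hypothetical stand-in for the temps data (values and the min/max
# column names are assumptions for illustration)
temps_demo <- data.frame(
  day = 1:5,
  actual_mean_temp = c(30, 34, 29, 41, 38),
  actual_min_temp  = c(24, 28, 22, 35, 31),
  actual_max_temp  = c(36, 40, 35, 47, 44)
)

# Each layer maps color to a label inside aes(), which makes ggplot2
# build a legend; scale_color_manual() assigns a color to each label
p_legend <- ggplot(temps_demo, aes(x = day)) +
  geom_point(aes(y = actual_mean_temp, color = "Mean")) +
  geom_point(aes(y = actual_min_temp, color = "Min")) +
  geom_point(aes(y = actual_max_temp, color = "Max")) +
  scale_color_manual(
    name = "Temperature",
    values = c("Mean" = "black", "Min" = "blue", "Max" = "red")
  ) +
  labs(x = "Day", y = "Temperature")

p_legend
```

Because the labels live inside aes(), ggplot2 treats them as data and draws a legend entry for each one.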


Doesn’t that look nice? Let’s talk about reshaping your data now.
This is a very helpful skill to have, and you will find yourself having
to reshape data frequently. Right now, each variable has its own
column in the dataframe; in our case, we’re working with four columns of
data. Since three of the variables are temperature data, wouldn’t it
make sense to put them all in one column instead? This is what we call
transforming data from wide to long format - wide data has more columns,
while long data will usually have more rows instead. Here’s a diagram of
the transformation between the two formats: https://i.stack.imgur.com/i1Dne.jpg. You can look at
this diagram later if you’re confused about what wide and long format
look like. The ggplot2 package tends to work better with long data
because each aesthetic, such as color, is mapped to a single column.
We’ll reshape the data using a command from the tidyr package,
pivot_longer.
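
Here is a minimal sketch of a wide-to-long reshape with pivot_longer. It uses a made-up wide_demo data frame (the min and max column names are assumptions), renames the temperature columns to mean, min, and max, and then stacks them into a temp_type column and a temp column.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per day, one column per temperature type
wide_demo <- data.frame(
  day = 1:3,
  actual_mean_temp = c(30, 34, 29),
  actual_min_temp  = c(24, 28, 22),
  actual_max_temp  = c(36, 40, 35)
)

long_demo <- wide_demo %>%
  # Shorter names make for clearer category labels later on
  rename(mean = actual_mean_temp,
         min  = actual_min_temp,
         max  = actual_max_temp) %>%
  # Stack every column except day: the old column names go into temp_type,
  # and the values go into temp
  pivot_longer(cols = -day, names_to = "temp_type", values_to = "temp")

print(long_demo)
```

The three rows of wide data become nine rows of long data: one row for each combination of day and temperature type.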
#What if I wanted to go from long to wide data? Here's what that looks like, for your reference:
wide_data1 <- long_data %>%
  pivot_wider(names_from = temp_type, values_from = temp)
#Run this code to verify that the original wide_data and our new wide_data1 dataframe are the same. Logical data is helpful in this case!
#I'm setting the number of values that R prints out to 20 - otherwise, R would print the whole dataset on the screen, and it would take up a lot of space!
options(max.print = 20)
wide_data1 == wide_data
Do you see the difference? Now, the category (temp_type, or mean,
min, and max) is in one column, while each temperature that corresponds
to the temp_type and day is in the temp column. The wide and long data
sets are just different ways of storing the same data! Now let’s see how
this works in ggplot2.
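
As a rough sketch, plotting long data might look like the code below. The long_demo data frame is made up for illustration; the key idea is that color = temp_type goes inside aes(), so a single geom layer handles all three temperature types.

```r
library(ggplot2)

# Hypothetical long data: one row per day and temperature type
long_demo <- data.frame(
  day = rep(1:3, each = 3),
  temp_type = rep(c("mean", "min", "max"), times = 3),
  temp = c(30, 24, 36, 34, 28, 40, 29, 22, 35)
)

# Mapping color to temp_type gives each category its own color
# and produces a legend automatically
p_long <- ggplot(long_demo, aes(x = day, y = temp, color = temp_type)) +
  geom_point()

p_long
```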

Do you see why I changed the column names? R uses the categories in
the temp_type column to add names to the legend. Keeping the
“actual_mean_temp” (and so on) labels would not have been nearly as
clear in a legend. In our graphs, we should aim to show complex
information in the simplest way possible - having clear legend and axis
titles is key to that. Now, in this case, the colors aren’t quite right!
Let’s set them manually.
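
One way to set the colors manually is with a named vector and scale_color_manual(), as sketched here. The long_demo data and the temp_colors object are made up for illustration; the names in the vector need to match the categories in temp_type.

```r
library(ggplot2)

# Hypothetical long data: one row per day and temperature type
long_demo <- data.frame(
  day = rep(1:3, each = 3),
  temp_type = rep(c("mean", "min", "max"), times = 3),
  temp = c(30, 24, 36, 34, 28, 40, 29, 22, 35)
)

# A small palette object: each category name is paired with a color
temp_colors <- c(mean = "black", min = "blue", max = "red")

p_manual <- ggplot(long_demo, aes(x = day, y = temp, color = temp_type)) +
  geom_point() +
  scale_color_manual(values = temp_colors)

p_manual
```

Storing the palette in an object means you can reuse the same colors across several plots.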

Do you see how much simpler and shorter the ggplot2 code is now?
Reshaping data definitely takes some getting used to, but it’ll save you
time in the future. Before we move on, I’ll show you one more way to set
the colors, now that we have the data in long format. This third version
will introduce you to a function that is helpful within dplyr functions,
ifelse.
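
The sketch below shows one possible way to use ifelse inside the dplyr function mutate; it is an assumption about the approach, not necessarily the original code. It adds a hypothetical point_color column that stores a color name for each row, and scale_color_identity() tells ggplot2 to use those stored names directly as colors.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long data: one row per day and temperature type
long_demo <- data.frame(
  day = rep(1:3, each = 3),
  temp_type = rep(c("mean", "min", "max"), times = 3),
  temp = c(30, 24, 36, 34, 28, 40, 29, 22, 35)
)

# ifelse(test, value_if_true, value_if_false) works row by row;
# nesting a second ifelse handles the third category
long_colored <- long_demo %>%
  mutate(point_color = ifelse(temp_type == "mean", "black",
                       ifelse(temp_type == "min", "blue", "red")))

# scale_color_identity() uses the color names in point_color as-is
p_ifelse <- ggplot(long_colored, aes(x = day, y = temp, color = point_color)) +
  geom_point() +
  scale_color_identity()

p_ifelse
```

Note that an identity scale does not draw a legend by default, so this version trades the automatic legend for direct control over each row's color.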

Overall, the second method of adding color for multiple variables
(reshaping the data and using a color palette object to set colors) is
probably the most flexible method. Often, you’ll be able to set palettes
using functions, which makes the process even faster; it’s unlikely that
you’ll need to manually specify colors. In the next chunk of code, I’ll
give an example using a package that contains some really nice color
palettes inspired by US national parks. More info on the palettes is
here: https://github.com/katiejolly/nationalparkcolors. You
can use the code from the GitHub link to install the package, too. Other
packages, like RColorBrewer, also provide a range of palettes to choose
from.
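
As a small illustration of setting a palette with a function, here is a sketch that uses RColorBrewer instead of the national parks package (an assumption made so the example relies only on a CRAN package). The long_demo data is made up, and "Dark2" is just one of the palette names that RColorBrewer provides.

```r
library(ggplot2)
library(RColorBrewer)

# Hypothetical long data: one row per day and temperature type
long_demo <- data.frame(
  day = rep(1:3, each = 3),
  temp_type = rep(c("mean", "min", "max"), times = 3),
  temp = c(30, 24, 36, 34, 28, 40, 29, 22, 35)
)

# brewer.pal(n, name) returns n colors from the named palette
park_like_pal <- brewer.pal(3, "Dark2")

p_palette <- ggplot(long_demo, aes(x = day, y = temp, color = temp_type)) +
  geom_point() +
  scale_color_manual(values = park_like_pal)

p_palette
```

Swapping in a different palette only requires changing the line that creates the palette object.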

So now you know the basics of graphing with ggplot2! There are just a
few more topics to cover that you will find helpful. First, what if I
don’t want to use a scatter plot? ggplot2 comes with a wide range of
geom possibilities! It’s so easy to produce different kinds of plots of
your data. Let’s make a line plot with the data we already have.
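
A minimal sketch of a line plot, using made-up long_demo data: the only change from a scatter plot is swapping geom_point() for geom_line().

```r
library(ggplot2)

# Hypothetical long data: one row per day and temperature type
long_demo <- data.frame(
  day = rep(1:3, each = 3),
  temp_type = rep(c("mean", "min", "max"), times = 3),
  temp = c(30, 24, 36, 34, 28, 40, 29, 22, 35)
)

# geom_line() connects the points within each temp_type category
p_line <- ggplot(long_demo, aes(x = day, y = temp, color = temp_type)) +
  geom_line()

p_line
```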

The line plot doesn’t look nice and smooth because we’re working with
daily data - there are a lot of data points, and the temperatures move
around a lot! But as you can see, switching to a line plot was so easy.
Next, we can look at a bar plot. Line plots, scatter plots, and bar
plots will be the most common plots you’ll use.
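
Here is one possible bar plot, sketched with made-up data (the temps_demo values are invented, and the choice of geom_col() is an assumption about how you might build it). geom_col() draws one bar per row, with the bar height taken from the y variable.

```r
library(ggplot2)

# Hypothetical stand-in for the temps data (values are made up)
temps_demo <- data.frame(
  day = 1:5,
  actual_mean_temp = c(30, 34, 29, 41, 38)
)

# fill sets the inside color of the bars (color would set the outline)
p_bar <- ggplot(temps_demo, aes(x = day, y = actual_mean_temp)) +
  geom_col(fill = "steelblue")

p_bar
```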

Let’s think about some other ways you can customize your plots. One
way that we can make our plot look nice is by adding a visual theme.
There are a number of themes you can add to your plot - we’ll download a
new package now that contains some additional themes as well.



If you type theme_ you should get a drop-down menu of all of the
possible themes to choose from - I would encourage you to play around
with them, and see which one you like best!
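
As a quick sketch, a theme is added like any other layer, with a + sign. This example uses two themes that ship with ggplot2 itself (theme_minimal and theme_bw), so it does not depend on any additional theme package; the temps_demo data is made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the temps data (values are made up)
temps_demo <- data.frame(
  day = 1:5,
  actual_mean_temp = c(30, 34, 29, 41, 38)
)

# Save the basic plot once, then try different themes on top of it
p_base <- ggplot(temps_demo, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = "blue")

p_minimal <- p_base + theme_minimal()  # light grid, no background box
p_bw <- p_base + theme_bw()            # white background, dark border

p_minimal
```

Saving the base plot in an object makes it easy to compare themes without retyping the whole plot.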

Finally, can you use more than one plot type in the same plot? You
sure can! The code below shows what this looks like by adding a trend
line to the data using a smoothing method (don’t worry about how it
works, this is just an example). In this example, because we are using
the smoothing function to find the average values of each temperature
type, we don’t need to change the aesthetics. If, for example, you
wanted to make a line graph with one variable over time and a point
graph with a different variable over time, you would need to manually
include the aesthetics in each geom layer (like we did the first time we
added color - in that example, each variable had its own geom layer, and
we had to map the aesthetics for each one).
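
A rough sketch of combining two geoms in one plot is below. The long_demo data is generated with a made-up formula just to have enough points for smoothing, and the loess method is an assumption about which smoother to use. Both layers inherit the aesthetics from ggplot(), so nothing needs to be repeated inside the geoms.

```r
library(ggplot2)

# Hypothetical long data: 30 days for each of three temperature types,
# generated from an invented formula purely for illustration
days <- 1:30
long_demo <- data.frame(
  day = rep(days, times = 3),
  temp_type = rep(c("mean", "min", "max"), each = length(days)),
  temp = c(40 + 0.5 * days + 3 * sin(days),
           32 + 0.5 * days + 3 * sin(days),
           48 + 0.5 * days + 3 * sin(days))
)

# geom_point() draws the raw data; geom_smooth() adds one trend line
# per temp_type, using the same x, y, and color mappings
p_combined <- ggplot(long_demo, aes(x = day, y = temp, color = temp_type)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

p_combined
```

Setting se = FALSE hides the shaded confidence band so the trend lines are easier to see.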

And there you have it! You are now a pro at using ggplot2. You should
have all the tools you need to make beautiful and effective
visualizations in R. If you want more information on different types of
graphs, or you just want a helpful reference to use as you progress
through the course, you can find an excellent ggplot2 cheat sheet here:
https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf.
Resources
FiveThirtyEight (2014). US Weather History. [Data Set]. Retrieved
from: https://github.com/fivethirtyeight/data/tree/master/us-weather-history.
Prabhakaran, S. (2017). The Complete ggplot2 Tutorial - Part 1:
Introduction To ggplot2. Retrieved from: http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html.
