Today we are going to learn how to plot data in R! While there are some simple plotting functions built into base R (you will often see tutorials that use the plot() command), I encourage you to produce your plots and data visualizations using the ggplot2 package in R. This package takes a little getting used to, but once you understand the syntax you will be making effective graphs and visualizations in no time! Visualizing data is such an important part of the data analysis process: it helps us to better understand the data and its distribution, it allows us to identify and communicate patterns in simple and visually appealing ways, and it enables us to condense a large amount of technical information into a diagram or visual.


#Make sure you install ggplot2 first! Let's load the packages and set the working directory.
setwd("~/Binghamton/harp130")
library(dplyr)
library(tidyr)
library(ggplot2)

#we'll start by loading in some data to play with! We'll use NYC temperature data for this tutorial. 

temps <- read.csv("temps_nyc.csv")

#Take a look at this dataset! It contains mean, min, and max temperatures in NYC for an entire year (2014).
#What if we wanted to plot the temperatures over time? We could plot it using base R like so:

plot(temps$day, temps$actual_mean_temp)


#for all plots, the syntax is usually (x = , y = ) - we'll put time (days) on the horizontal axis, and temperatures on the vertical axis. Putting the time variable on the x-axis is pretty standard. 

This plot isn’t bad, but it isn’t very nice looking either. The ggplot2 package gives us so much flexibility to customize our plots - we’ll make a much nicer version of this soon. Before we get to that, we first need to learn a bit about the syntax of ggplot2.

#Let's look at the first line of a basic ggplot graph:

ggplot(temps, aes(x = day, y = actual_mean_temp))


#When you use the ggplot() command, you need to supply a few key arguments. The first is the dataset - in this case, we will be using the temps data (as shown). The next part, called the aesthetic mapping or aes of the plot, tells us what we will be plotting from the dataset. Later, we will also include some characteristics of the plot in the aes() section. Can we plot the graph now? Not just yet! We need to add a geom layer - the geom layer tells ggplot2 what kind of visualization to produce with the data. We use a + sign to indicate a new layer in the plot like this (here I'm using geom_point to tell ggplot2 to draw a scatter plot):

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point()
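
To see why the geom layer matters, here is a minimal sketch on a tiny made-up data frame (the name toy and its values are invented for illustration, they are not taken from temps_nyc.csv). It stores the plot in an object and counts its layers before and after adding geom_point().

```r
library(ggplot2)

# Made-up data for illustration only (not the NYC temperature file)
toy <- data.frame(day = 1:5, temp = c(30, 32, 28, 35, 31))

# Data and aesthetic mapping only: this draws empty axes, with no points
base_plot <- ggplot(toy, aes(x = day, y = temp))
length(base_plot$layers)

# Adding a geom layer tells ggplot2 how to draw the data
scatter_plot <- base_plot + geom_point()
length(scatter_plot$layers)

scatter_plot
```

The first count should be 0 and the second 1: the plot has nothing to draw until a geom layer is added.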

#One of the nice things about ggplot2 is its flexibility. We can easily customize the plot. Once you get used to the syntax of ggplot2, customization is very simple. For example, let's start by changing the color of the points:

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = "mediumpurple")

There are a few important things to point out about the code above. First, I have put the color = argument in the geom_point layer. This tells R to use the color mediumpurple for the points - when we create more complex graphs, being able to customize each geom layer individually becomes really important. Second, the color that I chose goes in quotation marks. What happens if we leave them out? Let’s try it with the color blue:

#If no object named blue exists yet, this stops with an error along the lines of: object 'blue' not found

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = blue)

As you can see, without the quotes, R thinks that we are using an object called blue to set the color of the graph! You could actually do that, like this:

blue <- "blue"

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = blue)

Being able to set the color scheme of a graph using an object is much more useful when you are working with a color palette (that is, when you need to use multiple colors to symbolize a graph). We will see an example like that soon. Let’s look at some other customization features!

#What if we want to add labels to our plot? This is very easy to do with the labs() function, like so (remember that the x-axis is the horizontal axis, while the y-axis is the vertical axis):

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = "blue") +
  labs(y = "Mean Temperature", x = "Day")

#This is starting to look pretty nice! What if we wanted to add a title too?

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = "blue") +
  labs(y = "Mean Temperature", x = "Day", 
       title = "Mean Daily Temperature in New York City, 2014")


#You'll notice that I like to put each new argument after a "+" on a new line - you don't have to do this, but I prefer to because it makes my code much easier to follow. I also like to put longer label names on a new line - again, this won't affect how the code runs, it just makes it more readable. 
#But what if we also wanted to graph the minimum and maximum temperatures on the same plot? This is also very easy to do! We just need to use a geom_point layer for each variable we want to plot. 

#here we start by telling R that we want to use the temps data for our plot
ggplot(temps) +
  #for each new geom_point layer, I need to include a new aesthetic mapping
  #this tells R which variable to use in the plot
  geom_point(aes(x = day, y = actual_mean_temp), color = "gray") +
  geom_point(aes(x = day, y = actual_min_temp), color = "blue") +
  geom_point(aes(x = day, y = actual_max_temp), color = "red") +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014")


#Notice here that the color arguments are outside of the aes() argument. This is intentional - only arguments that depend on variables in the dataset should be in the aes() argument. What does that mean? In this case, the color "gray" doesn't depend on anything in the data - for example, the color doesn't change for lower or higher values. The entire geom_point layer is just gray. If we had a variable called "color" in the dataset, or if we wanted the colors to change based on temperature values, we could put color inside the aes(). I'll show an example next.
ggplot(temps) +
  geom_point(aes(x = day, y = actual_mean_temp, color = actual_mean_temp)) +
  labs(y = "Mean Temperature", x = "Day", 
       title = "Mean Daily Temperature in New York City, 2014",
       color = "Mean Temperature (F)")
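
If the difference between setting a color and mapping a color is still fuzzy, the sketch below may help. It uses a small made-up data frame (toy is a hypothetical name, and the values are invented) and ggplot2's layer_data() function to look at the colors that ggplot2 actually computes for each version.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(day = 1:4, temp = c(30, 32, 28, 35))

# Color set outside aes(): every point is drawn in gray
fixed <- ggplot(toy) +
  geom_point(aes(x = day, y = temp), color = "gray")

# Color placed inside aes(): "gray" is treated as a data value (a category),
# so ggplot2 picks a color from its default palette and adds a legend
mapped <- ggplot(toy) +
  geom_point(aes(x = day, y = temp, color = "gray"))

# layer_data() returns the values ggplot2 computed for a layer
fixed_colors <- unique(layer_data(fixed, 1)$colour)
mapped_colors <- unique(layer_data(mapped, 1)$colour)
fixed_colors
mapped_colors
```

Only the first version comes back as "gray". In the second, the word "gray" is just a label, which is the behavior we rely on when building a legend by hand.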

Hopefully that makes sense now! So, the graph of multiple variables looks pretty nice! But there is no legend on our graph! How will people know what each color represents? This happens because we set each color outside of aes(), and ggplot2 only builds a legend for aesthetics that are mapped inside aes(). It’s a problem that you’ll come across somewhat frequently. There are two ways to fix it: first, we can add a legend manually. I’ll show you how to do that first. Second, we can reshape the data - this is a somewhat more complicated method, but it ends up being extremely useful when you have more than a few variables to graph. I’ll explain that method second.


#We'll first manually set the colors in the legend using scale_color_manual. ggplot2 only draws a legend for aesthetics that are mapped inside aes(): if you set each color directly in the geom_point layer (outside aes()), no legend is generated. So when you set the colors manually, you have to tell R what label you'd like to use for each geom layer. Here, I've set the label names using color = "" inside the aesthetic mapping in the geom layer. 

ggplot(temps) +
  geom_point(aes(x = day, y = actual_mean_temp, color = "Mean")) +
  geom_point(aes(x = day, y = actual_min_temp, color = "Min")) +
  geom_point(aes(x = day, y = actual_max_temp, color = "Max")) +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       #since the legend is based on the color mapping, use color = to set the legend title
       color = "Temperature Values")+
  scale_color_manual(labels = c("Mean", "Min", "Max"), values = c("gray", "blue", "red"))


#In scale_color_manual, we start by telling R which labels to use to generate the color scheme; in this case, it's the same labels we just set above. Then, we have to tell R which colors to use for each label. Because there are three color values to set, note that we have to use c() around the list of variable names and colors. 

#scale_color_manual can involve some guessing and checking with the order of the colors. When the labels are text, R sorts them alphabetically, so it used the first color for the max temperature, the second for the mean, and the third for the min. If you notice that the colors in your graph don't match up, the easiest fix is to change the order in which you list the colors and label names so that it matches R's default (alphabetical) ordering. That's what I did below. 
#Here's the correct plot!
ggplot(temps) +
  geom_point(aes(x = day, y = actual_mean_temp, color = "Mean")) +
  geom_point(aes(x = day, y = actual_min_temp, color = "Min")) +
  geom_point(aes(x = day, y = actual_max_temp, color = "Max")) +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature Values")+
  scale_color_manual(labels = c("Max", "Mean", "Min"), values = c("red", "gray", "blue"))
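
Another way to avoid guessing the order is to give scale_color_manual a named vector, so that each color is matched to its label by name rather than by position. This is a sketch on made-up data (toy, temp_colors, and the numbers are invented for illustration), not a replacement for the code above.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(day = 1:3,
                  mean_temp = c(33, 35, 30),
                  min_temp = c(27, 29, 24),
                  max_temp = c(40, 42, 37))

# Names are the legend labels used in aes(); values are the colors
temp_colors <- c(Mean = "gray", Min = "blue", Max = "red")

named_plot <- ggplot(toy) +
  geom_point(aes(x = day, y = mean_temp, color = "Mean")) +
  geom_point(aes(x = day, y = min_temp, color = "Min")) +
  geom_point(aes(x = day, y = max_temp, color = "Max")) +
  labs(y = "Temperature", x = "Day", color = "Temperature Values") +
  scale_color_manual(values = temp_colors)

named_plot
```

Because the colors are matched by name, the order in which they are listed in temp_colors no longer matters.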

Doesn’t that look nice? Let’s talk about reshaping your data now. This is a very helpful skill to have, and you will find yourself having to reshape data frequently. Right now, each variable has its own column in the dataframe; in our case, we’re working with four columns of data. Since three of the variables are temperature data, wouldn’t it make sense to put them all in one column instead? This is what we call transforming data from wide to long format - wide data has more columns, while long data will usually have more rows instead. Here’s a diagram of what it looks like to transform from long to wide: https://i.stack.imgur.com/i1Dne.jpg. You can look at this diagram later if you’re confused about what wide and long format look like. For whatever reason, the ggplot2 package tends to work better with long data. We’ll reshape the data using a command from the tidyr package, pivot_longer.

wide_data <- temps %>% 
  #we'll only select the variables of interest to do this
  select(c(day, actual_mean_temp, actual_min_temp, actual_max_temp))

#let's change the column names to make them a bit nicer:
#I'll show you why this matters soon. 
colnames(wide_data) <- c("Day", "Mean", "Min", "Max")

head(wide_data)
#Now we'll reshape it!

long_data <- wide_data %>% 
  #The column titles become the categories in a new column after the reshaping
  #I'm naming this new column temp_type
  #The temperature values are put in a new values column, which I'm calling temp
  pivot_longer(!Day, names_to = "temp_type", values_to = "temp")

head(long_data)
#What if I wanted to go from long to wide data? Here's what that looks like, for your reference:

wide_data1 <- long_data %>% 
  pivot_wider(names_from = temp_type, values_from = temp)

#Run this code to verify that the original wide_data and our new wide_data1 dataframe are the same. Logical data is helpful in this case!
#I'm setting the number of values that R prints out to 20 - otherwise, R would print the whole dataset on the screen, and it would take up a lot of space!
options(max.print = 20)
wide_data1 == wide_data
        Day Mean  Min  Max
  [1,] TRUE TRUE TRUE TRUE
  [2,] TRUE TRUE TRUE TRUE
  [3,] TRUE TRUE TRUE TRUE
  [4,] TRUE TRUE TRUE TRUE
  [5,] TRUE TRUE TRUE TRUE
 [ reached getOption("max.print") -- omitted 360 rows ]

Do you see the difference? Now, the category (temp_type, or mean, min, and max) is in one column, while each temperature that corresponds to the temp_type and day is in the temp column. The wide and long data sets are just different ways of storing the same data! Now let’s see how this works in ggplot2.
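
If it helps to see the reshaping on something small, here is a sketch with a two-day made-up data frame (wide_toy and its values are invented for illustration). It also collapses the cell-by-cell comparison into a single TRUE or FALSE with all(), which is easier to read than a full table of logical values.

```r
library(tidyr)

# Made-up wide data for illustration only: one row per day
wide_toy <- data.frame(Day = 1:2,
                       Mean = c(33, 35),
                       Min = c(27, 29),
                       Max = c(40, 42))

# Wide to long: the three temperature columns are stacked into one
long_toy <- pivot_longer(wide_toy, !Day,
                         names_to = "temp_type", values_to = "temp")
dim(wide_toy)  # 2 rows, 4 columns
dim(long_toy)  # 6 rows, 3 columns

# Long back to wide, then collapse the comparison into one value
wide_again <- pivot_wider(long_toy, names_from = temp_type, values_from = temp)
round_trip_ok <- all(as.data.frame(wide_again) == wide_toy)
round_trip_ok
```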


#Now, instead of three geom layers, we will need to plot the data by three groups: mean, min, and max. Because we are grouping the data in the dataframe by the type of temperature recorded, we need to assign the temp_type column to the group argument. Because each type of temperature will also have a different color, we will assign temp_type to the color argument as well. Let's see what this looks like!

ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = temp_type)) +
  geom_point() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")

Do you see why I changed the column names? R uses the categories in the temp_type column to add names to the legend. Keeping the “actual_mean_temp” (and so on) labels would not have been nearly as clear in a legend. In our graphs, we should aim to show complex information in the simplest way possible - having clear legend and axis titles is key to that. Now, in this case, the colors aren’t quite right! Let’s set them manually.

#create the color palette for the data
#Remember, order matters! Based on the order of the legend in the last graph, I will include the color for the max temp, then the mean, then the min. 

colors <- c("red", "gray", "blue")

ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = temp_type)) +
  geom_point() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = colors)

Do you see how much simpler and shorter the ggplot2 code is now? Reshaping data definitely takes some getting used to, but it’ll save you time in the future. Before we move on, I’ll show you one more way to set the colors, now that we have the data in long format. This third version will introduce you to a function that is helpful within dplyr functions, ifelse.

#This version involves creating a new column in the long dataframe with the color values in it. Each color will correspond to the correct temp_type. 
library(dplyr)
library(tidyr)
long_data <- long_data %>% 
  mutate(colors = ifelse(temp_type == "Max", "red",
                         ifelse(temp_type == "Mean", "gray", "blue")))

#Let's walk through the syntax of the ifelse function (note that it stands for "if else")
#You can read the code like this: if temp_type is equal to Max, set the value to red
#if temp type is equal to Mean, set the color to gray. 
#For all others, set the color to blue. 
#If you look up the documentation to ifelse, you'll see just how simple it is:
#ifelse(test, yes, no)
#you give the function the test, and set a value to correspond to yes (True) and no (False) answers. In this code, I have nested an ifelse function inside of another one, basically telling R that no or false values based on the first call to ifelse are subject to another ifelse statement. Values that are no or false in both calls to ifelse will be blue (note that these are the minimum values).
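
Here is the same nested ifelse logic on a short made-up vector, so you can see the result without the dataframe around it. The second half uses case_when from dplyr, which is a different function that I am swapping in as an alternative; it is not what the code above uses, but some people find it easier to read when there are several conditions.

```r
library(dplyr)

# Made-up vector of categories for illustration only
temp_type <- c("Max", "Mean", "Min", "Mean")

# ifelse(test, yes, no) works through the vector one element at a time
point_colors <- ifelse(temp_type == "Max", "red",
                       ifelse(temp_type == "Mean", "gray", "blue"))
point_colors

# Alternative: case_when checks each condition in order;
# TRUE acts as the catch-all for everything left over
point_colors_cw <- case_when(temp_type == "Max" ~ "red",
                             temp_type == "Mean" ~ "gray",
                             TRUE ~ "blue")
point_colors_cw
```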
#Let's graph this!

ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = colors)) +
  geom_point() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_identity(guide = "legend", labels = c("Min", "Mean", "Max"))


#If you want to use this method, you also need to include the scale_color_identity() function - this tells R to use the colors stored in the data. I also had to specify the legend label names using the labels = argument, in the same order that R uses in the legend. Here that order is alphabetical by color name (blue, gray, red), which is why the labels are listed as Min, Mean, Max.

Overall, the second method of adding color for multiple variables (reshaping the data and using a color palette object to set colors) is probably the most flexible method. Often, you’ll be able to set palettes using functions, which makes the process even faster; it’s unlikely that you’ll need to manually specify colors. In the next chunk of code, I’ll give an example using a package that contains some really nice color palettes inspired by US national parks. More info on the palettes is available here: https://github.com/katiejolly/nationalparkcolors. You can use the code from the github link to install the package, too. Other packages, like RColorBrewer, also provide a range of palettes to choose from.

#I already have this installed - I like using these colors for presentations. 
install.packages("devtools")
devtools::install_github("katiejolly/nationalparkcolors")
library(nationalparkcolors)

#I've just picked a random palette, these colors aren't meaningful. 
palette <- park_palette("SmokyMountains", n = 3)

library(ggplot2)
#Now graph it with the palette!
ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = temp_type)) +
  geom_point() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = palette)
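
If you would rather not install a palette package from GitHub, ggplot2 also has built-in scale functions that draw on the RColorBrewer palettes. The sketch below uses scale_color_brewer with the "Dark2" palette on made-up long-format data (long_toy and its values are invented for illustration); the palette choice is arbitrary.

```r
library(ggplot2)

# Made-up long-format data for illustration only
long_toy <- data.frame(Day = rep(1:3, each = 3),
                       temp_type = rep(c("Mean", "Min", "Max"), times = 3),
                       temp = c(33, 27, 40, 35, 29, 42, 30, 24, 37))

# scale_color_brewer picks one color per category from the named palette
brewer_plot <- ggplot(long_toy, aes(x = Day, y = temp, color = temp_type)) +
  geom_point() +
  labs(y = "Temperature", x = "Day", color = "Temperature (F)") +
  scale_color_brewer(palette = "Dark2")

brewer_plot
```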

So now you know the basics of graphing with ggplot2! There are just a few more topics to cover that you will find helpful. First, what if I don’t want to use a scatter plot? The ggplot2 package comes with a wide range of geom possibilities! It’s so easy to produce different kinds of plots of your data. Let’s make a line plot with the data we already have.


ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = temp_type)) +
  geom_line() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = palette)

The line plot doesn’t look nice and smooth because we’re working with daily data - there are a lot of data points, and the temperatures move around a lot! But as you can see, switching to a line plot was so easy. Next, we can look at a bar plot. Line plots, scatter plots, and bar plots will be the most common plots you’ll use.

#Remember that bar plots don't require two numeric variables - we just need one numeric y variable (in this case, temperature) and categories for the x-axis (in this case, temp_type). Let's make a simple bar plot for one day:

day1 <- long_data %>% filter(Day == 1)

ggplot(day1, aes(y = temp, x = temp_type, group = temp_type, fill = temp_type))+
  geom_bar(stat = "Identity")+
  labs(x = "Temperature Type", y = "Temperature",
       title = "Temperatures in One Day in NYC") +
  theme(legend.position="none")+
  scale_fill_manual(values = palette)


#Some things to note about this code: because I am supplying the bar heights myself (the temp values) rather than asking R to count rows, I need to include the argument stat = "Identity" in the geom_bar layer. Without going into too much detail, this argument tells R that the height of the columns should be equal to the temp values. Note also that instead of color, we use the fill = argument here - the color argument is for lines, points, and outlines, while solid polygons need to be assigned colors using the fill argument. As an exercise, try seeing what happens when you use color = instead! When you use fill, scale_color_manual also changes to scale_fill_manual to set the color palette. Finally, R will automatically generate a legend when you map a variable to fill (or color) inside aes() - I didn't need a legend in this graph, so I used the theme() function to set the legend position to "none". This removes the legend from the plot, and is worth remembering. 
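
As a side note, ggplot2 also provides geom_col(), which is a shortcut for a bar layer whose heights come straight from the data, so no stat argument is needed. The sketch below uses made-up values for a single day (one_day is a hypothetical name) and should give the same kind of plot as the geom_bar code above.

```r
library(ggplot2)

# Made-up values for a single day, for illustration only
one_day <- data.frame(temp_type = c("Mean", "Min", "Max"),
                      temp = c(33, 27, 40))

# geom_col() uses the y values as the bar heights
col_plot <- ggplot(one_day, aes(x = temp_type, y = temp, fill = temp_type)) +
  geom_col() +
  labs(x = "Temperature Type", y = "Temperature") +
  theme(legend.position = "none")

col_plot
```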

Let’s think about some other ways you can customize your plots. One way that we can make our plot look nice is by adding a visual theme. There are a number of themes you can add to your plot - we’ll download a new package now that contains some additional themes as well.

#Install ggthemes once with install.packages("ggthemes") if you don't already have it
library(ggthemes)

#I prefer the minimal theme that comes loaded with ggplot2 - it makes plots look very sleek 
ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = temp_type)) +
  geom_line() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = palette) +
  theme_minimal()

#The ggthemes package comes with some other useful themes. Let's try a few more:
#If you want your plot to look like the plots in the Economist magazine, you might use this theme:

ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = temp_type)) +
  geom_line() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = palette)+
  theme_economist()


#This theme mimics plots drawn by the Wall Street Journal:
ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = temp_type)) +
  geom_line() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = palette)+
  theme_wsj()

If you type theme_ you should get a drop down menu of all of the possible themes to choose from - I would encourage you to play around with them, and see which one you like best!

#Finally, you can use the theme function to change the centering of the title and other text. By default, everything is left-aligned. 

ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = temp_type)) +
  geom_line() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = palette) +
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))


#Note that the element_text(hjust = 0.5) argument is telling R to center-align the title text
#You can play around with the hjust argument to manually change the title position (although I don't really see any reason to do this)

Finally, can you use more than one plot type in the same plot? You sure can! The code below shows what this looks like by adding a trend line to the data using a smoothing method (don’t worry about how it works, this is just an example). In this example, because we are using the smoothing function to find the average values of each temperature type, we don’t need to change the aesthetics. If, for example, you wanted to make a line graph with one variable over time and a point graph with a different variable over time, you would need to manually include the aesthetics in each geom layer (like we did the first time we added color - in that example, each variable had its own geom layer, and we had to map the aesthetics for each one).


ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = temp_type)) +
  geom_point() +
  geom_smooth(color = "black")+
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = palette) +
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))
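
For the other case described above (a line for one variable and points for a different variable), each geom layer gets its own y aesthetic. Here is a minimal sketch on made-up wide-format data (toy and its values are invented for illustration).

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(day = 1:5,
                  mean_temp = c(33, 35, 30, 31, 36),
                  max_temp = c(40, 42, 37, 39, 44))

# x is shared by both layers, so it goes in ggplot();
# y differs, so it is mapped inside each geom layer
mixed_plot <- ggplot(toy, aes(x = day)) +
  geom_line(aes(y = mean_temp), color = "gray") +
  geom_point(aes(y = max_temp), color = "red") +
  labs(y = "Temperature", x = "Day") +
  theme_minimal()

mixed_plot
```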

And there you have it! You are now a pro at using ggplot2. You should have all the tools you need to make beautiful and effective visualizations in R. If you want more information on different types of graphs, or you just want a helpful reference to refer to as you progress through the course, you can find an excellent ggplot2 cheat sheet here: https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf.

Resources

FiveThirtyEight (2014). US Weather History. [Data Set]. Retrieved from: https://github.com/fivethirtyeight/data/tree/master/us-weather-history.

Prabhakaran, S. (2017). The Complete ggplot2 Tutorial - Part1 | Introduction To ggplot2. Retrieved from: http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html.

