crab_data <- read.csv("pie_crab.csv", stringsAsFactors = TRUE) # load the data
head(crab_data) # inspect the data
str(crab_data) # 392 rows, 5 columns (site, latitude, size, air_temp, water_temp)Plotting Exercises
Open R Sessions 2023
Welcome to the plotting exercises! We suggest you work in pairs or small groups for these exercises. In this session we will work with real published ecological datasets as it will mirror the workflow you will likely go through when it comes to making plots of your own data. We will start by importing a CSV file into a dataframe, and then explore each dataset with a series of plots, with the goal to create at least one plot for each dataset that visualises a result or prediction from each study. All CSV files can be found on canvas.
If you get stuck or need a hint on how to proceed, you can always click the “Show potential solution button” to see one way to solve the problem (there’s always more than one way). You can copy that code to your clipboard by clicking the clipboard icon in the right of the code box. Make use of the R helpfiles using ?, as well as the resources listed in Canvas. Also feel free to search the internet for code examples of what you want to do. Learning to efficiently search the internet is a big part of learning a programming language. However, everyone makes mistakes, including people with high scores on StackOverflow, so be sure you can understand what the code is doing line-by-line, especially if it is important.
Remember to annotate your code with comments to help you remember what you did and why. To write a comment in R, use the # symbol. Anything after the # symbol will be ignored by R.
It’s up to you how you want to organise your R scripts. We would suggest to create a new R script for each dataset, all saved in a single folder somewhere safe that you can find again. Remember to set your working directory to that folder for each script (do you remember the command?).
1 Bergmann’s crabs
1.1 Dataset
The Atlantic marsh fiddler crab, Minuca pugnax, lives in salt marshes throughout the eastern coast of the United States. Historically, M. pugnax were distributed from northern Florida to Cape Cod, Massachusetts, but like other species have expanded their range northward due to ocean warming.
The pie_crab.csv data sample is from a study by Johnson and colleagues at the Plum Island Ecosystem Long Term Ecological Research site.
Data sampling overview:
- 13 marshes were sampled on the Atlantic coast of the United States in summer 2016
- Spanning > 12 degrees of latitude, from northeast Florida to northeast Massachusetts
- Between 25 and 37 adult male fiddler crabs were collected, and their sizes recorded
Data columns:
site: a string that identifies each site sampledlatitude: latitude for each site (does not change within site)size: carapace width measurements (mm) for male crabs in the studyair_temp: mean air temperature for each site (does not change within site)water_temp: a mean water temperature for each site (does not change within site)
1.2 Exercises
1.2.1
The data is stored as a comma-separated values (CSV) file. You can load the data into your R session using the read.csv() function. Remember, if you are unsure about how to use a function, you can use the ? operator to access the help documentation (e.g. ?read.csv).
Load the data into your R session and assign it to an object called crab_data. Inspect the data using the head() and str() functions. What are the column names? What are the data types? How many rows of data are there?
1.2.2
Let’s get a better understanding of each of our variables, and check for any obvious errors. We can use the summary() function to get a quick summary of each variable. What is the mean carapace width of a crab (size)? What are the min and max air and water temperatures? Plot histograms of carapace widths (size) using the hist() function. Does anything stand out to you?
summary(crab_data) # get a summary of each variable
hist(crab_data$size) # plot a histogram of carapace widthsThat looks a bit weird right? Outliers are expected in real-world data, but we should check to make sure it isn’t a mistake. size here is measured in mm. Does a fiddler crab with a carapace width >1m seem resonable here? I hope not. It’s probably a data entry mistake. Let’s remove it from our dataset. There’s a few ways to do this. We can of course open the CSV file in excel, and edit it there, but we can also do this in R.
We can either locate the outline by row number, and then remove that row, or we can use a condition to remove all rows that meet that condition. For the former, we can use which() to find the row number of the outlier, and then use the - operator to remove that row.
# find the outlier as the max value in the size vector
outlier <- max(crab_data$size)
# find the row number of the outlier
outlier_row <- which(crab_data$size == outlier) # row 270
# remove the outlier
crab_data <- crab_data[-outlier_row, ]For the latter, we can use the subset() function to remove all rows that meet a condition. Re-plot the histogram of size again.
crab_data <- subset(crab_data, size < 1000)
hist(crab_data$size) # plot a histogram of carapace widthsThat looks better!
1.2.3
Let’s look at the range of carapace widths recorded at each site. What sort of plot would be best for this? Give it a try! Add a descriptive title and some axis labels. Do you agree with how we’ve done it?
Show potential solution
boxplot(size ~ site, data = crab_data, main = "Sizes at each site", xlab = "Site", ylab = "Carapace width (mm)")We also might want to only compare means at each site. What sort of plot would be best for this? Here’s how you can calculate means for each site using the aggregate() function.
# FUN = mean tells the aggregate function to use the mean() function
mean_sizes <- aggregate(size ~ site, data = crab_data, FUN = mean)Have a look at the format of the data now, and make a plot with it. Add a descriptive title and some axis labels, and use the ? helpfiles if you get stuck.
Show potential solution
barplot(
mean_sizes$size, # the heights of the bars
names.arg = mean_sizes$site, # the names of the bars
main = "Mean sizes at each site",
xlab = "Site",
ylab = "Carapace width (mm)"
)BONUS: how to add error bars
# Really, don't stress about it, it's just often people will ask about this
# you might be thinking that error bars would be nice
# error bars can be made using the arrows() function in a bit of a hacky way
# here's an example:
# calculate the standard error of the mean for each site
mean_sizes$se <- aggregate(size ~ site, data = crab_data, FUN = function(x) sd(x)/sqrt(length(x)))$size
# we need to assign it to a variable for later
(my_barplot <-
barplot(
mean_sizes$size, # the heights of the bars
names.arg = mean_sizes$site, # the names of the bars
ylim = c(0, 30), # set the y axis limits
main = "Mean sizes at each site",
xlab = "Site",
ylab = "Carapace width (mm)"
)
)
# add the error bars
arrows(
x0 = my_barplot, # the x position of the error bars
x1 = my_barplot, # the x position of the error bars
y0 = mean_sizes$size - mean_sizes$se, # the lower y position of the error bars
y1 = mean_sizes$size + mean_sizes$se, # the upper y position of the error bars
code = 3, # the type of arrow to use
angle = 90, # the angle of the arrow
length = 0.1 # the length of the arrow
)1.2.4
The dataset was originally collected to test Bergmann’s rule:
“One of the best-known patterns in biogeography is Bergmann’s rule. It predicts that organisms at higher latitudes are larger than ones at lower latitudes. Many organisms follow Bergmann’s rule, including insects, birds, snakes, marine invertebrates, and terrestrial and marine mammals (Johnson et al. 2019).”
Can you make a plot that would help us to visualise the main prediciton from Bergmann’s rule in M. pugnax? Give it a try! Some helpful functions might be plot(), abline(), and lm(). Check the the potential solution, do you disagree? There’s more than one way to do this. If you have a plot in mind, but don’t know how to make it, ask for help! Does Bergmann’s rule hold for M. pugnax?
Show potential solution
# I chose to use a scatterplot with a linear model fit
plot(crab_data$latitude, crab_data$size, main = "Bergmann's rule in Minuca pugnax", xlab = "Latitude", ylab = "Carapace width (mm)")
model <- lm(crab_data$size ~ crab_data$latitude) # fit a linear model
abline(model, col = "red") # abline will use the intercept and slope from the linear model to plot a line on the figure1.2.5
Hopefully you’ve made a plot you are happy with. Suppose you wanted to put this in your thesis, how would export it? You can use the Rstudio “Export” button above your plot, and this will work alright for most things. However, if you want to have more control over the size of your figure, one method is to use the pdf() function to create a pdf file, then plot your plot, and then use the dev.off() function to “close” the PDF file.
pdf("my_plot.pdf")
#
# plot your plot here
#
dev.off()Using a format like PDF is nearly always preferable to something like a PNG file, because it is a vector format. This means that the file contains information about shapes in the plot, rather than just a bunch of pixels. This means that you can zoom in on the plot without it getting blurry.
Congratulations! First dataset done! If you want to see the actual figure that was used in the publication, click here. How does your plot compare? Feel free to make additional plots you have in mind for this dataset (we didn’t use air_temp or water_temp yet), or move onto the next one.
Johnson, D. S., C. Crowley, K. Longmire, J. Nelson, B. Williams, and S. Wittyngham. 2019. The fiddler crab, Minuca pugnax, follows Bergmann’s rule. Ecology and Evolution 9:14489–14497.
2 Frozen lakes
2.1 Dataset
The dataset contains information aboout the ice cover of two lakes in Madison, WI, USA. The dataset starts in the 19th century, so has the potential to be informative about how the climate has changed on long time scales.
Data columns:
lakeid: name of the lakeyear: year that the record was fromice_duration: the number of days that the lake was covered with ice
2.2 Exercises
2.2.1
Load the data into your R session and assign it to an object called lake_data. Inspect the data using the head() and str() functions as before.
Show potential solution
lake_data <- read.csv("ntl_icecover.csv", stringsAsFactors = TRUE)
head(lake_data)
str(lake_data)2.2.2
Plot a histogram of ice_duration.
Show potential solution
hist(
x = lake_data$ice_duration,
xlab = "Ice duration (days)",
main = "" # note that providing an empty string is one way to not plot a title or axis.
)Say we wanted to plot separately histograms for each lake. To do that we need to learn a bit about par(). par() hold global (i.e. effects everything) plotting parameters. Most of these are the default values for plot shape, size, colour, etc that are used when you do no specifiy what they should be. par() also controls how many panels the plotting window has. To plot two plots one above the other, we need to modify the mfrow argument (remember, use ?par to find out more). We can then make our two plots, separating the data for each lake.
# Set the layout of the plotting window to have two rows
par(mfrow = c(2, 1))
# Plot the histogram of ice duration for Lake Mendota
hist(
lake_data$ice_duration[lake_data$lake == "Lake Mendota"],
main = "Lake Mendota",
xlab = "Ice duration (days)"
)
# Plot the histogram of ice duration for Lake Monona
hist(
lake_data$ice_duration[lake_data$lake == "Lake Monona"],
main = "Lake Monona",
xlab = "Ice duration (days)"
)
# Set the layout back to a single plot per window
par(mfrow = c(1, 1))For some of you, the fact that the breaks (bin sizes) are not the same, and that x axis does not match, will be very annoying. We can fix this by changing some arguments! Pass a vector of two numbers (the min and max value) to the argument xlim = in each plot, to see if you can fix this issue. To enforce the breaks to be the same, you can use the breaks = argument. Try to figure this out with ?hist before looking at the solution.
Show potential solution
# to save typing it twice, we can define our xlim and breaks vector before hand
# we can use the functions min() and max() to find a suitble range
min(lake_data$ice_duration) # 21
max(lake_data$ice_duration) # 161
x_limit <- c(10,170)
n_breaks <- 20
# Set the layout of the plotting window to have two rows
par(mfrow = c(2, 1))
# Plot the histogram of ice duration for Lake Mendota
hist(
lake_data$ice_duration[lake_data$lake == "Lake Mendota"],
main = "Lake Mendota",
xlab = "Ice duration (days)",
xlim = x_limit,
breaks = n_breaks
)
# Plot the histogram of ice duration for Lake Monona
hist(
lake_data$ice_duration[lake_data$lake == "Lake Monona"],
main = "Lake Monona",
xlab = "Ice duration (days)",
xlim = x_limit,
breaks = n_breaks
)
# Set the layout back to a single plot per window
par(mfrow = c(1, 1))You might have noticed that the code for our histograms is almost identical, apart from which lake is selected. We could make our code much shorter using for loops, where we just replace which lake is used and the same code is run twice. This will be covered in a dedicated session on the 19th October! For now, let’s move on.
2.2.3
The big question here is probably surrounding how ice_duration has changed over time (that is, the number of days a year that the lakes are covered in ice). How could we visualise that? One option might be to plot the relationship between ice_duration and year. Try to do that using the plot() function. Give the plot descripte axis labels and a title. Check the helpfiles if you are unsure.
Show potential solution
# using the formula notation
plot(
ice_duration ~ year,
data = lake_data,
main = "Ice duration over time",
xlab = "Year",
ylab = "Ice duration (days)"
)
# OR
# providing just the vectors
plot(
x = lake_data$year,
y = lake_data$ice_duration,
main = "Ice duration over time",
xlab = "Year",
ylab = "Ice duration (days)"
)
# Both make the same plotThis sort of data is often called timeseries data, and we usually see it presented using a line, not individual points. We can use the type = l argument for the plot() function to acheive this. However, this will “join” the dots in order they appear in the dataframe. We also have two lakes, and therefore two data points for each value of the x axis. If you don’t see what the issue is, you can try plot it! Let’s deal with the two lake problem first. Try to adapt the code we used in the previous dataset to calculate a mean ice_duration for each year across the two lakes.
Show potential solution
mean_ice_duration <- aggregate(ice_duration ~ year, data = lake_data, FUN = mean)Now let’s ensure our data is in the correct order. We can do that by ordering the dataset by the year column. Use ? to find out how order() works.
Code
mean_ice_duration <- mean_ice_duration[order(mean_ice_duration$year), ]Ok, now we are ready to make the plot. Use the type = l argument for the plot() function to make the timeseries plot with a line connecting the data points. Change the colour of the line to one of your preference. As an additional execrise, how could you change the Y axis to be in weeks, rather than days?
Show potential solution
plot(
ice_duration ~ year,
data = mean_ice_duration,
type = "l",
main = "Mean ice duration over time",
xlab = "Year",
ylab = "Mean Ice duration (days)",
col = "red"
)Change days to weeks
plot(
ice_duration/7 ~ year, # we can make the edit here, without editing the dataframe object. diving by 7 will give weeks.
data = mean_ice_duration,
type = "l",
main = "Mean ice duration over time",
xlab = "Year",
ylab = "Mean Ice duration (weeks)",
col = "red"
)Now, modify your plot to instead show both lines and points (look at ?plot for help). Make the points solid instead of having no fill.
Show potential solution
plot(
ice_duration ~ year,
data = mean_ice_duration,
type = "b",
main = "Mean ice duration over time",
xlab = "Year",
ylab = "Mean Ice duration (days)",
col = "red",
pch = 16 # 19 will also work
)Use lm() and abline() to add a trendline to the figure. Make the line dashed instead of solid, and increase it’s width to 2. Maybe you also need to change the colour so it stands out on your plot?
Show potential solution
abline(lm(ice_duration ~ year, data = lake_data), lty = 5, lwd = 2)2.2.4
From what we have covered in the last two datasets, do you think you can make a similar plot as in the last section, but have a separate line for each lake? There’s a few ways to do it. You will need to use the lines() function to add a second line to the plot. Add in a legend to show which line is which. You don’t need to add a trendline with abline(). Good luck!
Show potential solution
# separate the data for each lake
mendota <- lake_data[lake_data$lakeid == "Lake Mendota", ]
monona <- lake_data[lake_data$lakeid == "Lake Monona", ]
# make sure both datasets are in order
mendota <- mendota[order(mendota$year), ]
monona <- monona[order(monona$year), ]
# make the plot with the first lake
plot(
ice_duration ~ year,
data = mendota,
type = "l",
main = "Mean ice duration over time",
xlab = "Year",
ylab = "Mean Ice duration (days)",
col = "red"
)
# add the second lake
lines(
ice_duration ~ year,
data = monona,
col = "blue"
)
# add a legend
legend(
"topright",
legend = c("Lake Mendota", "Lake Monona"),
col = c("red", "blue"),
lty = 1
)Awesome work! Congratulations for getting this far! The next section is quite open for you to do what you like, so feel free to use the Sugar Maple dataset, or to use one of your own, or to keep working on other plots from the two datasets above.
3 Maples and ggplot2
3.1 Dataset
The hbr_maples.csv dataset contains observations on sugar maple seedlings in untreated and calcium-amended watersheds at Hubbard Brook Experimental Forest in New Hampshire, USA. To investigate the impact of soil calcium supplementation on sugar maple seedling growth, the authors recorded “general sugar maple germinant health by height, leaf area, biomass, and chlorophyll content” for seedlings in untreated and previously calcium-treated watersheds. By comparing seedling growth in calcium-treated (W1) versus untreated (Reference) watersheds, calcium impacts on sugar maple seedling growth can be explored.
Sugar maple seedlings were examined on calcium-treated and reference sites during August 2003 and June 2004. Seedlings were sampled every ten steps in transects.
year: a number denoting the year that the sample was takenwatershed: a factor denoting the watershed where the sample was collected; W1 = calcium-treated, Reference = referenceelevation: a factor describing the Elevation of transect; Low = low elevation, Mid = mid elevationtransect: a factor denoting the transect number within the watershedsample: a factor denoting the sample number within transect. There are twenty samples in each transectstem_length: a number denoting the height of the seedling in millimetersleaf1area: a number denoting the area of the first sampled leaf in square centimetersleaf2area: a number denoting the area of the second sampled leaf in square centimetersleaf_dry_mass: a number denoting the dry mass of both sampled leaves in gramsstem_dry_mass: a number denoting the dry mass of the stem in gramscorrected_leaf_area: a number denoting the area of both leaves in addition to the area removed for chlorophyll measurements in square centimeters
Field observations suggest that the seedlings in the calcium-treated watershed appeared to be more abundant and healthier than in the reference areas. They looked larger, greener, and had bigger root systems that held more soil on the roots when they were plucked out of the ground.
3.2 Exercises
Based on those observations, you should produce a set of plots that examine the differences between the sugar maple seedlings across treatments.
We will use ggplot2 to do this and at the end of this exercise you should produce a bar plot, scatterplot, or boxplot using ggplot2 based on the field observations above. Be creative! Try and export your figures to PDF files at the end.
3.2.1 Read the data
Load the ggplot2 package and import the data set you’ll be working with. If we have never used ggplot2 before we can use the function install.packages("") to get it. If you have used ggplot2 before you can call it with the function library().
Show potential solution
library(ggplot2)
plants <- read.csv("hbr_maples.csv")3.2.2 Scatterplot
Use ggplot() to plot stem_length on the x-axis and leaf_dry_mass on the y-axis. Make sure the plot shows each observation as a point.
Show potential solution
ggplot(plants, aes(x = stem_length, y = leaf_dry_mass)) + geom_point()
#plants -> data frame we’re plotting.
#aes(...) -> aesthetics—it tells R which variables map to which visual axes.
#x -> stem_length means the x-axis shows stem length.
#y -> leaf_dry_mass means the y-axis shows leaf dry mass.
#+ -> the + adds a “layer” to the plot.
#geom_point() -> says the geometry of this layer is points, so we’re making a scatterplot where points differ in position, color, and sizeNow that you know how to do a scatterplot using ggplot(), can you do it for corrected_leaf_area in the x-axis and stem_dry_mass on the y-axis?
3.2.3 Map Aesthetics: Color and Size
Update your scatterplot so that points are colored by watershed and sized by stem_dry_mass. This will help you visualize how these variables vary together
Show potential solution
ggplot(plants, aes(x = stem_length, y = leaf_dry_mass, color = watershed, size = stem_dry_mass)) + geom_point()
#color -> the variable by which we will be coloring the dots
#size -> the variable by which we will be having different dot sizesCould you color your scatterplot by year and have the dot size by transect?
3.2.4 Customize Labels and Theme
Using the same scatterplot, add descriptive axis labels and a clear title using labs(), and apply a ggplot2 theme (e.g., theme_minimal()) so the plot looks polished and easy to read. You can find theme variations in https://ggplot2.tidyverse.org/reference/ggtheme.html
Show potential solution
ggplot(plants, aes(stem_length, leaf_dry_mass, color = watershed)) +
geom_point() + labs(title = "Stem length vs Leaf dry mass", x = "Stem
length (cm)", y = "Leaf dry mass (g)") + theme_minimal()
#labs(...) -> function to add titles and axis labels
#title -> the title of the graph
#x -> the x-axis label
#y -> the y-axis label
# + theme_minimal() -> applies a minimalist theme to the graphCould you try applying a different theme to your graphic?
3.2.5 Add a Smoother (Trend Line)
Add a linear trend line to the scatterplot (for example, with geom_smooth(method = "lm")) to show the overall relationship between stem length and leaf dry mass.
Show potential solution
ggplot(plants, aes(stem_length, leaf_dry_mass)) + geom_point() +
geom_smooth(method = "lm", se = FALSE)
#+ geom_smooth(method = "lm", se = FALSE) -> adds a smooth trend line.
#method = "lm" -> tells ggplot to fit a linear model (a straight regression line).
#se = FALSE -> removes the shaded confidence band around the line.How could you modify this command so you can see the confidence interval of the trend line?
3.2.6 Use Faceting to Compare Elevations
Split the scatterplot into separate panels for each elevation level using facet_wrap() or facet_grid(). This lets you compare patterns across elevations.
Show potential solution
ggplot(plants, aes(stem_length, leaf_dry_mass)) + geom_point() +
facet_wrap(~ elevation)
#+ facet_wrap(~ elevation) -> splits the plot into multiple small panels (facets).
#~ elevation -> each panel shows data for one level of the variable elevation, side by side.Could you use facet_grid() to split the scatterplot by years?
3.2.7 Summarize Means with a Bar Plot (using dplyr)
We will use dplyr to calculate the mean leaf_dry_mass for each watershed. For that, first we need to load the dplyr package using library() or, if we have never used dplyr before, we can get the package with install.packages ("").
Then we need to use summarise() to create a new summary table with the mean leaf dry mass values for each watershed. Afterwards we can create a bar plot of these means to show differences between watersheds.
Show potential solution
library(dplyr)
mean_mass <- plants %>% group_by(watershed) %>%
summarise(mean_leaf_mass = mean(leaf_dry_mass, na.rm = TRUE))
#plants -> data frame we’re plotting
#%>% -> (the “pipe”) passes the result of one step to the next.
#group_by(watershed) -> groups the data by the variable watershed.
#summarise(...) -> creates a new summary table:
#mean_leaf_mass = mean(leaf_dry_mass, na.rm = TRUE) -> calculates the mean leaf dry mass for each watershed, ignoring missing values (na.rm = TRUE). The result is saved as a new data frame called mean_mass.
ggplot(mean_mass, aes(x = watershed, y = mean_leaf_mass)) + geom_col()
#Starts a plot using the summarized data, mapping watershed to the x-axis and mean_leaf_mass to the y-axis.
#+ geom_col() -> Adds a bar chart layer where the heights of the columns represent the mean_leaf_mass values.3.2.8 Make a Boxplot
Create a boxplot to show the distribution of stem_length (or another variable of interest) across different groups, such as transect or elevation.
Show potential solution
ggplot(plants, aes(x = transect, y = stem_length)) + geom_boxplot() +
geom_jitter(alpha = 0.3)
#ggplot(plants, aes(x = transect, y = stem_length)) -> Starts the plot with the plants data. Maps transect to the x-axis (categories) and stem_length to the y-axis (numeric values).
#+ geom_boxplot() -> Adds a boxplot layer, showing the median, quartiles, and possible outliers of stem length for each transect.
#+ geom_jitter(alpha = 0.3) -> Adds individual data points scattered (or “jittered”) slightly so they don’t overlap.
#alpha = 0.3 -> makes the points partly transparent so dense areas are easier to see.Could you modify this command to see the distribution of corrected_leaf_area across elevation, and make the dots solid black?
3.2.9 Combine Multiple Variables in One Plot
Design a plot that displays at least three variables at once. For example, let’s take a look at the distribution of the values of leaf1area and leaf2area by mapping color to the watershed, shape to the elevation, and size to stem_dry_mass. This way you can explore more complex relationships.
Show a potential solution
ggplot(plants, aes(x = leaf1area, y = leaf2area, color = watershed,
shape = elevation, size = stem_dry_mass)) + geom_point()3.2.10 Apply Custom Scales and Colors
Refine one of your previous plots by setting custom color palettes, axis scales, or legend labels to make the visualization clearer and more visually appealing.
Show a potential solution
ggplot(plants, aes(x = stem_length, y = leaf_dry_mass, color =
stem_length)) + geom_point() + scale_color_gradient(low = "blue", high =
"red")
#+ scale_color_gradient(low = "blue", high = "red") -> Adjusts the color scale for the color aesthetic. Points with the smallest stem lengths are blue; those with the largest are red, with a smooth gradient in between.
#OPTIONAL
# You can use scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = avg_stem) to add a mid color point3.2.11 Open ended project
Create a final visualization of your choice using the maple data set. Decide which question you want to answer or which relationship you want to highlight, and use the plotting techniques you’ve learned. Try and export your figures as PDF files, add a title to them and a small description of which question you tried to answer.