Plotting Exercises

Open R Sessions 2023

Authors

Lucia Ximena Alva Caballero

Violeta Caballero López

Laura Hildesheim

Simon Jacobsen Ellerstrand

Iain Moodie

Pedro Rosero

Welcome to the plotting exercises! We suggest you work in pairs or small groups for these exercises. In this session we will work with real published ecological datasets as it will mirror the workflow you will likely go through when it comes to making plots of your own data. We will start by importing a CSV file into a dataframe, and then explore each dataset with a series of plots, with the goal to create at least one plot for each dataset that visualises a result or prediction from each study. All CSV files can be found on canvas.

If you get stuck or need a hint on how to proceed, you can always click the “Show potential solution button” to see one way to solve the problem (there’s always more than one way). You can copy that code to your clipboard by clicking the clipboard icon in the right of the code box. Make use of the R helpfiles using ?, as well as the resources listed in Canvas. Also feel free to search the internet for code examples of what you want to do. Learning to efficiently search the internet is a big part of learning a programming language. However, everyone makes mistakes, including people with high scores on StackOverflow, so be sure you can understand what the code is doing line-by-line, especially if it is important.

Remember to annotate your code with comments to help you remember what you did and why. To write a comment in R, use the # symbol. Anything after the # symbol will be ignored by R.

It’s up to you how you want to organise your R scripts. We would suggest to create a new R script for each dataset, all saved in a single folder somewhere safe that you can find again. Remember to set your working directory to that folder for each script (do you remember the command?).

1 Bergmann’s crabs

1.1 Dataset

The Atlantic marsh fiddler crab, Minuca pugnax, lives in salt marshes throughout the eastern coast of the United States. Historically, M. pugnax were distributed from northern Florida to Cape Cod, Massachusetts, but like other species have expanded their range northward due to ocean warming.

The pie_crab.csv data sample is from a study by Johnson and colleagues at the Plum Island Ecosystem Long Term Ecological Research site.

Data sampling overview:

13 marshes were sampled on the Atlantic coast of the United States in summer 2016
Spanning > 12 degrees of latitude, from northeast Florida to northeast Massachusetts
Between 25 and 37 adult male fiddler crabs were collected, and their sizes recorded

Data columns:

site: a string that identifies each site sampled
latitude: latitude for each site (does not change within site)
size: carapace width measurements (mm) for male crabs in the study
air_temp: mean air temperature for each site (does not change within site)
water_temp: a mean water temperature for each site (does not change within site)

1.2 Exercises

1.2.1

The data is stored as a comma-separated values (CSV) file. You can load the data into your R session using the read.csv() function. Remember, if you are unsure about how to use a function, you can use the ? operator to access the help documentation (e.g. ?read.csv).

Load the data into your R session and assign it to an object called crab_data. Inspect the data using the head() and str() functions. What are the column names? What are the data types? How many rows of data are there?

crab_data <- read.csv("pie_crab.csv", stringsAsFactors = TRUE) # load the data

head(crab_data) # inspect the data

str(crab_data)  # 392 rows, 5 columns (site, latitude, size, air_temp, water_temp)

1.2.2

Let’s get a better understanding of each of our variables, and check for any obvious errors. We can use the summary() function to get a quick summary of each variable. What is the mean carapace width of a crab (size)? What are the min and max air and water temperatures? Plot histograms of carapace widths (size) using the hist() function. Does anything stand out to you?

summary(crab_data) # get a summary of each variable
hist(crab_data$size) # plot a histogram of carapace widths

That looks a bit weird right? Outliers are expected in real-world data, but we should check to make sure it isn’t a mistake. size here is measured in mm. Does a fiddler crab with a carapace width >1m seem resonable here? I hope not. It’s probably a data entry mistake. Let’s remove it from our dataset. There’s a few ways to do this. We can of course open the CSV file in excel, and edit it there, but we can also do this in R.

We can either locate the outline by row number, and then remove that row, or we can use a condition to remove all rows that meet that condition. For the former, we can use which() to find the row number of the outlier, and then use the - operator to remove that row.

# find the outlier as the max value in the size vector
outlier <- max(crab_data$size)
# find the row number of the outlier
outlier_row <- which(crab_data$size == outlier) # row 270
# remove the outlier
crab_data <- crab_data[-outlier_row, ]

For the latter, we can use the subset() function to remove all rows that meet a condition. Re-plot the histogram of size again.

crab_data <- subset(crab_data, size < 1000)
hist(crab_data$size) # plot a histogram of carapace widths

That looks better!

1.2.3

Let’s look at the range of carapace widths recorded at each site. What sort of plot would be best for this? Give it a try! Add a descriptive title and some axis labels. Do you agree with how we’ve done it?

Show potential solution

boxplot(size ~ site, data = crab_data, main = "Sizes at each site", xlab = "Site", ylab = "Carapace width (mm)")

We also might want to only compare means at each site. What sort of plot would be best for this? Here’s how you can calculate means for each site using the aggregate() function.

# FUN = mean tells the aggregate function to use the mean() function
mean_sizes <- aggregate(size ~ site, data = crab_data, FUN = mean)

Have a look at the format of the data now, and make a plot with it. Add a descriptive title and some axis labels, and use the ? helpfiles if you get stuck.

Show potential solution

barplot(
  mean_sizes$size, # the heights of the bars
  names.arg = mean_sizes$site, # the names of the bars
  main = "Mean sizes at each site", 
  xlab = "Site", 
  ylab = "Carapace width (mm)"
  )

BONUS: how to add error bars

# Really, don't stress about it, it's just often people will ask about this

# you might be thinking that error bars would be nice
# error bars can be made using the arrows() function in a bit of a hacky way
# here's an example:

# calculate the standard error of the mean for each site
mean_sizes$se <- aggregate(size ~ site, data = crab_data, FUN = function(x) sd(x)/sqrt(length(x)))$size

# we need to assign it to a variable for later
(my_barplot <-
barplot(
  mean_sizes$size, # the heights of the bars
  names.arg = mean_sizes$site, # the names of the bars
  ylim = c(0, 30), # set the y axis limits
  main = "Mean sizes at each site", 
  xlab = "Site", 
  ylab = "Carapace width (mm)"
  )
)

# add the error bars
arrows(
  x0 = my_barplot, # the x position of the error bars
  x1 = my_barplot, # the x position of the error bars
  y0 = mean_sizes$size - mean_sizes$se, # the lower y position of the error bars
  y1 = mean_sizes$size + mean_sizes$se, # the upper y position of the error bars
  code = 3, # the type of arrow to use
  angle = 90, # the angle of the arrow
  length = 0.1 # the length of the arrow
  )

1.2.4

The dataset was originally collected to test Bergmann’s rule:

“One of the best-known patterns in biogeography is Bergmann’s rule. It predicts that organisms at higher latitudes are larger than ones at lower latitudes. Many organisms follow Bergmann’s rule, including insects, birds, snakes, marine invertebrates, and terrestrial and marine mammals (Johnson et al. 2019).”

Can you make a plot that would help us to visualise the main prediciton from Bergmann’s rule in M. pugnax? Give it a try! Some helpful functions might be plot(), abline(), and lm(). Check the the potential solution, do you disagree? There’s more than one way to do this. If you have a plot in mind, but don’t know how to make it, ask for help! Does Bergmann’s rule hold for M. pugnax?

Show potential solution

# I chose to use a scatterplot with a linear model fit
plot(crab_data$latitude, crab_data$size, main = "Bergmann's rule in Minuca pugnax", xlab = "Latitude", ylab = "Carapace width (mm)")

model <- lm(crab_data$size ~ crab_data$latitude) # fit a linear model

abline(model, col = "red") # abline will use the intercept and slope from the linear model to plot a line on the figure

1.2.5

Hopefully you’ve made a plot you are happy with. Suppose you wanted to put this in your thesis, how would export it? You can use the Rstudio “Export” button above your plot, and this will work alright for most things. However, if you want to have more control over the size of your figure, one method is to use the pdf() function to create a pdf file, then plot your plot, and then use the dev.off() function to “close” the PDF file.

pdf("my_plot.pdf")
#
# plot your plot here
#
dev.off()

Using a format like PDF is nearly always preferable to something like a PNG file, because it is a vector format. This means that the file contains information about shapes in the plot, rather than just a bunch of pixels. This means that you can zoom in on the plot without it getting blurry.

Congratulations! First dataset done! If you want to see the actual figure that was used in the publication, click here. How does your plot compare? Feel free to make additional plots you have in mind for this dataset (we didn’t use air_temp or water_temp yet), or move onto the next one.

Johnson, D. S., C. Crowley, K. Longmire, J. Nelson, B. Williams, and S. Wittyngham. 2019. The fiddler crab, Minuca pugnax, follows Bergmann’s rule. Ecology and Evolution 9:14489–14497.

2 Frozen lakes

2.1 Dataset

The dataset contains information aboout the ice cover of two lakes in Madison, WI, USA. The dataset starts in the 19th century, so has the potential to be informative about how the climate has changed on long time scales.

Data columns:

lakeid: name of the lake
year: year that the record was from
ice_duration: the number of days that the lake was covered with ice

2.2 Exercises

2.2.1

Load the data into your R session and assign it to an object called lake_data. Inspect the data using the head() and str() functions as before.

Show potential solution

lake_data <- read.csv("ntl_icecover.csv", stringsAsFactors = TRUE)

head(lake_data)

str(lake_data)

2.2.2

Plot a histogram of ice_duration.

Show potential solution

hist(
  x = lake_data$ice_duration,
  xlab = "Ice duration (days)",
  main = "" # note that providing an empty string is one way to not plot a title or axis.
  )

Say we wanted to plot separately histograms for each lake. To do that we need to learn a bit about par(). par() hold global (i.e. effects everything) plotting parameters. Most of these are the default values for plot shape, size, colour, etc that are used when you do no specifiy what they should be. par() also controls how many panels the plotting window has. To plot two plots one above the other, we need to modify the mfrow argument (remember, use ?par to find out more). We can then make our two plots, separating the data for each lake.

# Set the layout of the plotting window to have two rows
par(mfrow = c(2, 1))

# Plot the histogram of ice duration for Lake Mendota
hist(
  lake_data$ice_duration[lake_data$lake == "Lake Mendota"],
  main = "Lake Mendota",
  xlab = "Ice duration (days)"
  )

# Plot the histogram of ice duration for Lake Monona
hist(
  lake_data$ice_duration[lake_data$lake == "Lake Monona"],
  main = "Lake Monona",
  xlab = "Ice duration (days)"
  )

# Set the layout back to a single plot per window
par(mfrow = c(1, 1))

For some of you, the fact that the breaks (bin sizes) are not the same, and that x axis does not match, will be very annoying. We can fix this by changing some arguments! Pass a vector of two numbers (the min and max value) to the argument xlim = in each plot, to see if you can fix this issue. To enforce the breaks to be the same, you can use the breaks = argument. Try to figure this out with ?hist before looking at the solution.

Show potential solution

# to save typing it twice, we can define our xlim and breaks vector before hand
# we can use the functions min() and max() to find a suitble range

min(lake_data$ice_duration) # 21
max(lake_data$ice_duration) # 161

x_limit <- c(10,170)
n_breaks <- 20

# Set the layout of the plotting window to have two rows
par(mfrow = c(2, 1))

# Plot the histogram of ice duration for Lake Mendota
hist(
  lake_data$ice_duration[lake_data$lake == "Lake Mendota"],
  main = "Lake Mendota",
  xlab = "Ice duration (days)",
  xlim = x_limit,
  breaks = n_breaks
  )

# Plot the histogram of ice duration for Lake Monona
hist(
  lake_data$ice_duration[lake_data$lake == "Lake Monona"],
  main = "Lake Monona",
  xlab = "Ice duration (days)",
  xlim = x_limit,
  breaks = n_breaks
  )

# Set the layout back to a single plot per window
par(mfrow = c(1, 1))

You might have noticed that the code for our histograms is almost identical, apart from which lake is selected. We could make our code much shorter using for loops, where we just replace which lake is used and the same code is run twice. This will be covered in a dedicated session on the 19th October! For now, let’s move on.

2.2.3

The big question here is probably surrounding how ice_duration has changed over time (that is, the number of days a year that the lakes are covered in ice). How could we visualise that? One option might be to plot the relationship between ice_duration and year. Try to do that using the plot() function. Give the plot descripte axis labels and a title. Check the helpfiles if you are unsure.

Show potential solution

# using the formula notation
plot(
  ice_duration ~ year, 
  data = lake_data,
  main = "Ice duration over time", 
  xlab = "Year", 
  ylab = "Ice duration (days)"
  )

# OR

# providing just the vectors
plot(
  x = lake_data$year,
  y = lake_data$ice_duration, 
  main = "Ice duration over time", 
  xlab = "Year", 
  ylab = "Ice duration (days)"
  )

# Both make the same plot

This sort of data is often called timeseries data, and we usually see it presented using a line, not individual points. We can use the type = l argument for the plot() function to acheive this. However, this will “join” the dots in order they appear in the dataframe. We also have two lakes, and therefore two data points for each value of the x axis. If you don’t see what the issue is, you can try plot it! Let’s deal with the two lake problem first. Try to adapt the code we used in the previous dataset to calculate a mean ice_duration for each year across the two lakes.

Show potential solution

mean_ice_duration <- aggregate(ice_duration ~ year, data = lake_data, FUN = mean)

Now let’s ensure our data is in the correct order. We can do that by ordering the dataset by the year column. Use ? to find out how order() works.

Code

mean_ice_duration <- mean_ice_duration[order(mean_ice_duration$year), ]

Ok, now we are ready to make the plot. Use the type = l argument for the plot() function to make the timeseries plot with a line connecting the data points. Change the colour of the line to one of your preference. As an additional execrise, how could you change the Y axis to be in weeks, rather than days?

Show potential solution

plot(
  ice_duration ~ year, 
  data = mean_ice_duration,
  type = "l",
  main = "Mean ice duration over time", 
  xlab = "Year", 
  ylab = "Mean Ice duration (days)",
  col = "red"
  )

Change days to weeks

plot(
  ice_duration/7 ~ year, # we can make the edit here, without editing the dataframe object. diving by 7 will give weeks.
  data = mean_ice_duration,
  type = "l",
  main = "Mean ice duration over time", 
  xlab = "Year", 
  ylab = "Mean Ice duration (weeks)",
  col = "red"
  )

Now, modify your plot to instead show both lines and points (look at ?plot for help). Make the points solid instead of having no fill.

Show potential solution

plot(
  ice_duration ~ year, 
  data = mean_ice_duration,
  type = "b",
  main = "Mean ice duration over time", 
  xlab = "Year", 
  ylab = "Mean Ice duration (days)",
  col = "red",
  pch = 16 # 19 will also work
  )

Use lm() and abline() to add a trendline to the figure. Make the line dashed instead of solid, and increase it’s width to 2. Maybe you also need to change the colour so it stands out on your plot?

Show potential solution

abline(lm(ice_duration ~ year, data = lake_data), lty = 5, lwd = 2)

2.2.4

From what we have covered in the last two datasets, do you think you can make a similar plot as in the last section, but have a separate line for each lake? There’s a few ways to do it. You will need to use the lines() function to add a second line to the plot. Add in a legend to show which line is which. You don’t need to add a trendline with abline(). Good luck!

Show potential solution

# separate the data for each lake
mendota <- lake_data[lake_data$lakeid == "Lake Mendota", ]
monona <- lake_data[lake_data$lakeid == "Lake Monona", ]

# make sure both datasets are in order
mendota <- mendota[order(mendota$year), ]
monona <- monona[order(monona$year), ]

# make the plot with the first lake
plot(
  ice_duration ~ year, 
  data = mendota,
  type = "l",
  main = "Mean ice duration over time", 
  xlab = "Year", 
  ylab = "Mean Ice duration (days)",
  col = "red"
  )

# add the second lake

lines(
  ice_duration ~ year, 
  data = monona,
  col = "blue"
  )

# add a legend
legend(
  "topright", 
  legend = c("Lake Mendota", "Lake Monona"), 
  col = c("red", "blue"), 
  lty = 1
  )

Awesome work! Congratulations for getting this far! The next section is quite open for you to do what you like, so feel free to use the Sugar Maple dataset, or to use one of your own, or to keep working on other plots from the two datasets above.

3 Maples and ggplot2

3.1 Dataset

The hbr_maples.csv dataset contains observations on sugar maple seedlings in untreated and calcium-amended watersheds at Hubbard Brook Experimental Forest in New Hampshire, USA. To investigate the impact of soil calcium supplementation on sugar maple seedling growth, the authors recorded “general sugar maple germinant health by height, leaf area, biomass, and chlorophyll content” for seedlings in untreated and previously calcium-treated watersheds. By comparing seedling growth in calcium-treated (W1) versus untreated (Reference) watersheds, calcium impacts on sugar maple seedling growth can be explored.

Sugar maple seedlings were examined on calcium-treated and reference sites during August 2003 and June 2004. Seedlings were sampled every ten steps in transects.

year: a number denoting the year that the sample was taken
watershed: a factor denoting the watershed where the sample was collected; W1 = calcium-treated, Reference = reference
elevation: a factor describing the Elevation of transect; Low = low elevation, Mid = mid elevation
transect: a factor denoting the transect number within the watershed
sample: a factor denoting the sample number within transect. There are twenty samples in each transect
stem_length: a number denoting the height of the seedling in millimeters
leaf1area: a number denoting the area of the first sampled leaf in square centimeters
leaf2area: a number denoting the area of the second sampled leaf in square centimeters
leaf_dry_mass: a number denoting the dry mass of both sampled leaves in grams
stem_dry_mass: a number denoting the dry mass of the stem in grams
corrected_leaf_area: a number denoting the area of both leaves in addition to the area removed for chlorophyll measurements in square centimeters

Field observations suggest that the seedlings in the calcium-treated watershed appeared to be more abundant and healthier than in the reference areas. They looked larger, greener, and had bigger root systems that held more soil on the roots when they were plucked out of the ground.

3.2 Exercises

Based on those observations, you should produce a set of plots that examine the differences between the sugar maple seedlings across treatments.

We will use ggplot2 to do this and at the end of this exercise you should produce a bar plot, scatterplot, or boxplot using ggplot2 based on the field observations above. Be creative! Try and export your figures to PDF files at the end.

3.2.1 Read the data

Load the ggplot2 package and import the data set you’ll be working with. If we have never used ggplot2 before we can use the function install.packages("") to get it. If you have used ggplot2 before you can call it with the function library().

Show potential solution

library(ggplot2) 
plants <- read.csv("hbr_maples.csv")

3.2.2 Scatterplot

Use ggplot() to plot stem_length on the x-axis and leaf_dry_mass on the y-axis. Make sure the plot shows each observation as a point.

Show potential solution

ggplot(plants, aes(x = stem_length, y = leaf_dry_mass)) + geom_point()

#plants -> data frame we’re plotting.

#aes(...) -> aesthetics—it tells R which variables map to which visual axes.

#x -> stem_length means the x-axis shows stem length.

#y -> leaf_dry_mass means the y-axis shows leaf dry mass.

#+  -> the + adds a “layer” to the plot.

#geom_point() -> says the geometry of this layer is points, so we’re making a scatterplot where points differ in position, color, and size

Now that you know how to do a scatterplot using ggplot(), can you do it for corrected_leaf_area in the x-axis and stem_dry_mass on the y-axis?

3.2.3 Map Aesthetics: Color and Size

Update your scatterplot so that points are colored by watershed and sized by stem_dry_mass. This will help you visualize how these variables vary together

Show potential solution

ggplot(plants, aes(x = stem_length, y = leaf_dry_mass, color = watershed, size = stem_dry_mass)) + geom_point()

#color -> the variable by which we will be coloring the dots

#size -> the variable by which we will be having different dot sizes

Could you color your scatterplot by year and have the dot size by transect?

3.2.4 Customize Labels and Theme

Using the same scatterplot, add descriptive axis labels and a clear title using labs(), and apply a ggplot2 theme (e.g., theme_minimal()) so the plot looks polished and easy to read. You can find theme variations in https://ggplot2.tidyverse.org/reference/ggtheme.html

Show potential solution

ggplot(plants, aes(stem_length, leaf_dry_mass, color = watershed)) +
geom_point() + labs(title = "Stem length vs Leaf dry mass", x = "Stem
length (cm)", y = "Leaf dry mass (g)") + theme_minimal()

#labs(...) -> function to add titles and axis labels
#title -> the title of the graph
#x -> the x-axis label
#y -> the y-axis label
# + theme_minimal() -> applies a minimalist theme to the graph

Could you try applying a different theme to your graphic?

3.2.5 Add a Smoother (Trend Line)

Add a linear trend line to the scatterplot (for example, with geom_smooth(method = "lm")) to show the overall relationship between stem length and leaf dry mass.

Show potential solution

ggplot(plants, aes(stem_length, leaf_dry_mass)) + geom_point() +
geom_smooth(method = "lm", se = FALSE)

#+ geom_smooth(method = "lm", se = FALSE) -> adds a smooth trend line.

#method = "lm" -> tells ggplot to fit a linear model (a straight regression line).
#se = FALSE -> removes the shaded confidence band around the line.

How could you modify this command so you can see the confidence interval of the trend line?

3.2.6 Use Faceting to Compare Elevations

Split the scatterplot into separate panels for each elevation level using facet_wrap() or facet_grid(). This lets you compare patterns across elevations.

Show potential solution

ggplot(plants, aes(stem_length, leaf_dry_mass)) + geom_point() +
facet_wrap(~ elevation)

#+ facet_wrap(~ elevation) -> splits the plot into multiple small panels (facets).

#~ elevation -> each panel shows data for one level of the variable elevation, side by side.

Could you use facet_grid() to split the scatterplot by years?

3.2.7 Summarize Means with a Bar Plot (using dplyr)

We will use dplyr to calculate the mean leaf_dry_mass for each watershed. For that, first we need to load the dplyr package using library() or, if we have never used dplyr before, we can get the package with install.packages ("").

Then we need to use summarise() to create a new summary table with the mean leaf dry mass values for each watershed. Afterwards we can create a bar plot of these means to show differences between watersheds.

Show potential solution

library(dplyr)

mean_mass <- plants %>% group_by(watershed) %>%
summarise(mean_leaf_mass = mean(leaf_dry_mass, na.rm = TRUE))

#plants -> data frame we’re plotting

#%>%  -> (the “pipe”) passes the result of one step to the next.

#group_by(watershed) -> groups the data by the variable watershed.

#summarise(...) -> creates a new summary table:

#mean_leaf_mass = mean(leaf_dry_mass, na.rm = TRUE) -> calculates the mean leaf dry mass for each watershed, ignoring missing values (na.rm = TRUE). The result is saved as a new data frame called mean_mass.

ggplot(mean_mass, aes(x = watershed, y = mean_leaf_mass)) + geom_col()

#Starts a plot using the summarized data, mapping watershed to the x-axis and mean_leaf_mass to the y-axis.

#+ geom_col() -> Adds a bar chart layer where the heights of the columns represent the mean_leaf_mass values.

3.2.8 Make a Boxplot

Create a boxplot to show the distribution of stem_length (or another variable of interest) across different groups, such as transect or elevation.

Show potential solution

ggplot(plants, aes(x = transect, y = stem_length)) + geom_boxplot() +
geom_jitter(alpha = 0.3)

#ggplot(plants, aes(x = transect, y = stem_length)) -> Starts the plot with the plants data. Maps transect to the x-axis (categories) and stem_length to the y-axis (numeric values).

#+ geom_boxplot() -> Adds a boxplot layer, showing the median, quartiles, and possible outliers of stem length for each transect.

#+ geom_jitter(alpha = 0.3) -> Adds individual data points scattered (or “jittered”) slightly so they don’t overlap.

#alpha = 0.3 -> makes the points partly transparent so dense areas are easier to see.

Could you modify this command to see the distribution of corrected_leaf_area across elevation, and make the dots solid black?

3.2.9 Combine Multiple Variables in One Plot

Design a plot that displays at least three variables at once. For example, let’s take a look at the distribution of the values of leaf1area and leaf2area by mapping color to the watershed, shape to the elevation, and size to stem_dry_mass. This way you can explore more complex relationships.

Show a potential solution

ggplot(plants, aes(x = leaf1area, y = leaf2area, color = watershed,
shape = elevation, size = stem_dry_mass)) + geom_point()

3.2.10 Apply Custom Scales and Colors

Refine one of your previous plots by setting custom color palettes, axis scales, or legend labels to make the visualization clearer and more visually appealing.

Show a potential solution

ggplot(plants, aes(x = stem_length, y = leaf_dry_mass, color =
stem_length)) + geom_point() + scale_color_gradient(low = "blue", high =
"red")

#+ scale_color_gradient(low = "blue", high = "red") -> Adjusts the color scale for the color aesthetic. Points with the smallest stem lengths are blue; those with the largest are red, with a smooth gradient in between.

#OPTIONAL
# You can use scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = avg_stem) to add a mid color point

3.2.11 Open ended project

Create a final visualization of your choice using the maple data set. Decide which question you want to answer or which relationship you want to highlight, and use the plotting techniques you’ve learned. Try and export your figures as PDF files, add a title to them and a small description of which question you tried to answer.

--- search: false title: "Plotting Exercises" subtitle: "Open R Sessions 2023" author: - Lucia Ximena Alva Caballero - Violeta Caballero López - Laura Hildesheim - Simon Jacobsen Ellerstrand - Iain Moodie - Pedro Rosero format: html: code-fold: true code-tools: true code-overflow: wrap toc: true toc-depth: 4 self-contained: true anchor-sections: true smooth-scroll: true number-sections: true number-depth: 3 execute: echo: true warning: false eval: false --- Welcome to the plotting exercises! We suggest you work in pairs or small groups for these exercises. In this session we will work with real published ecological datasets as it will mirror the workflow you will likely go through when it comes to making plots of your own data. We will start by importing a CSV file into a dataframe, and then explore each dataset with a series of plots, with the goal to create at least one plot for each dataset that visualises a result or prediction from each study. All CSV files can be found on canvas. If you get stuck or need a hint on how to proceed, you can always click the "Show potential solution button" to see one way to solve the problem (there's always more than one way). You can copy that code to your clipboard by clicking the clipboard icon in the right of the code box. Make use of the R helpfiles using `?`, as well as the resources listed in Canvas. Also feel free to search the internet for code examples of what you want to do. Learning to efficiently search the internet is a big part of learning a programming language. However, everyone makes mistakes, including people with high scores on StackOverflow, so be sure you can understand what the code is doing line-by-line, especially if it is important. Remember to annotate your code with comments to help you remember what you did and why. To write a comment in `R`, use the `#` symbol. Anything after the `#` symbol will be ignored by `R`. It's up to you how you want to organise your `R` scripts. We would suggest to create a new `R` script for each dataset, all saved in a single folder somewhere safe that you can find again. Remember to set your working directory to that folder for each script (do you remember the command?). # Bergmann's crabs ![](https://bugguide.net/images/cache/5HT/HXH/5HTHXHDH6H6ZMLAZMLNZXLGZ7LRR7LBZ9HJHXL9Z5L1Z4LJHIH8ZIHEZ0LVHPHHR5LEZGH5ZSL2ZHLGZSLAZ8LJH8L.jpg){fig-align="center"} ## Dataset The Atlantic marsh fiddler crab, *Minuca pugnax*, lives in salt marshes throughout the eastern coast of the United States. Historically, *M. pugnax* were distributed from northern Florida to Cape Cod, Massachusetts, but like other species have expanded their range northward due to ocean warming. The `pie_crab.csv` data sample is from a study by Johnson and colleagues at the Plum Island Ecosystem Long Term Ecological Research site. Data sampling overview: - 13 marshes were sampled on the Atlantic coast of the United States in summer 2016 - Spanning \> 12 degrees of latitude, from northeast Florida to northeast Massachusetts - Between 25 and 37 adult male fiddler crabs were collected, and their sizes recorded Data columns: - `site`: a string that identifies each site sampled - `latitude`: latitude for each site (does not change within site) - `size`: carapace width measurements (mm) for male crabs in the study - `air_temp`: mean air temperature for each site (does not change within site) - `water_temp`: a mean water temperature for each site (does not change within site) ## Exercises ### The data is stored as a comma-separated values (CSV) file. You can load the data into your `R` session using the `read.csv()` function. Remember, if you are unsure about how to use a function, you can use the `?` operator to access the help documentation (e.g. `?read.csv`). Load the data into your `R` session and assign it to an object called `crab_data`. Inspect the data using the `head()` and `str()` functions. What are the column names? What are the data types? How many rows of data are there? ```{r} #| code-fold: false crab_data <- read.csv("pie_crab.csv", stringsAsFactors = TRUE) # load the data head(crab_data) # inspect the data str(crab_data) # 392 rows, 5 columns (site, latitude, size, air_temp, water_temp) ``` ------------------------------------------------------------------------ ### Let's get a better understanding of each of our variables, and check for any obvious errors. We can use the `summary()` function to get a quick summary of each variable. What is the mean carapace width of a crab (`size`)? What are the min and max air and water temperatures? Plot histograms of carapace widths (`size`) using the `hist()` function. Does anything stand out to you? ```{r} #| code-fold: false summary(crab_data) # get a summary of each variable hist(crab_data$size) # plot a histogram of carapace widths ``` That looks a bit weird right? Outliers are expected in real-world data, but we should check to make sure it isn't a mistake. `size` here is measured in mm. Does a fiddler crab with a carapace width \>1m seem resonable here? I hope not. It's probably a data entry mistake. Let's remove it from our dataset. There's a few ways to do this. We can of course open the CSV file in excel, and edit it there, but we can also do this in `R`. We can either locate the outline by row number, and then remove that row, or we can use a condition to remove all rows that meet that condition. For the former, we can use `which()` to find the row number of the outlier, and then use the `-` operator to remove that row. ```{r} #| code-fold: false # find the outlier as the max value in the size vector outlier <- max(crab_data$size) # find the row number of the outlier outlier_row <- which(crab_data$size == outlier) # row 270 # remove the outlier crab_data <- crab_data[-outlier_row, ] ``` For the latter, we can use the `subset()` function to remove all rows that meet a condition. Re-plot the histogram of `size` again. ```{r} #| code-fold: false crab_data <- subset(crab_data, size < 1000) hist(crab_data$size) # plot a histogram of carapace widths ``` That looks better! ------------------------------------------------------------------------ ### Let's look at the range of carapace widths recorded at each site. What sort of plot would be best for this? Give it a try! Add a descriptive title and some axis labels. Do you agree with how we've done it? ```{r} #| code-summary: "Show potential solution" boxplot(size ~ site, data = crab_data, main = "Sizes at each site", xlab = "Site", ylab = "Carapace width (mm)") ``` We also might want to only compare means at each site. What sort of plot would be best for this? Here's how you can calculate means for each site using the `aggregate()` function. ```{r} #| code-fold: false # FUN = mean tells the aggregate function to use the mean() function mean_sizes <- aggregate(size ~ site, data = crab_data, FUN = mean) ``` Have a look at the format of the data now, and make a plot with it. Add a descriptive title and some axis labels, and use the `?` helpfiles if you get stuck. ```{r} #| code-summary: "Show potential solution" barplot( mean_sizes$size, # the heights of the bars names.arg = mean_sizes$site, # the names of the bars main = "Mean sizes at each site", xlab = "Site", ylab = "Carapace width (mm)" ) ``` ```{r} #| code-summary: "BONUS: how to add error bars" # Really, don't stress about it, it's just often people will ask about this # you might be thinking that error bars would be nice # error bars can be made using the arrows() function in a bit of a hacky way # here's an example: # calculate the standard error of the mean for each site mean_sizes$se <- aggregate(size ~ site, data = crab_data, FUN = function(x) sd(x)/sqrt(length(x)))$size # we need to assign it to a variable for later (my_barplot <- barplot( mean_sizes$size, # the heights of the bars names.arg = mean_sizes$site, # the names of the bars ylim = c(0, 30), # set the y axis limits main = "Mean sizes at each site", xlab = "Site", ylab = "Carapace width (mm)" ) ) # add the error bars arrows( x0 = my_barplot, # the x position of the error bars x1 = my_barplot, # the x position of the error bars y0 = mean_sizes$size - mean_sizes$se, # the lower y position of the error bars y1 = mean_sizes$size + mean_sizes$se, # the upper y position of the error bars code = 3, # the type of arrow to use angle = 90, # the angle of the arrow length = 0.1 # the length of the arrow ) ``` ------------------------------------------------------------------------ ### The dataset was originally collected to test Bergmann's rule: > “One of the best-known patterns in biogeography is Bergmann’s rule. It predicts that organisms at higher latitudes are larger than ones at lower latitudes. Many organisms follow Bergmann’s rule, including insects, birds, snakes, marine invertebrates, and terrestrial and marine mammals (Johnson et al. 2019).” Can you make a plot that would help us to visualise the main prediciton from Bergmann's rule in *M. pugnax*? Give it a try! Some helpful functions might be `plot()`, `abline()`, and `lm()`. Check the the potential solution, do you disagree? There's more than one way to do this. If you have a plot in mind, but don't know how to make it, ask for help! Does Bergmann's rule hold for *M. pugnax*? ```{r} #| code-summary: "Show potential solution" # I chose to use a scatterplot with a linear model fit plot(crab_data$latitude, crab_data$size, main = "Bergmann's rule in Minuca pugnax", xlab = "Latitude", ylab = "Carapace width (mm)") model <- lm(crab_data$size ~ crab_data$latitude) # fit a linear model abline(model, col = "red") # abline will use the intercept and slope from the linear model to plot a line on the figure ``` ------------------------------------------------------------------------ ### Hopefully you've made a plot you are happy with. Suppose you wanted to put this in your thesis, how would export it? You can use the Rstudio "Export" button above your plot, and this will work alright for most things. However, if you want to have more control over the size of your figure, one method is to use the `pdf()` function to create a pdf file, then plot your plot, and then use the `dev.off()` function to "close" the PDF file. ```{r} #| code-fold: false pdf("my_plot.pdf") # # plot your plot here # dev.off() ``` Using a format like PDF is nearly always preferable to something like a PNG file, because it is a vector format. This means that the file contains information about shapes in the plot, rather than just a bunch of pixels. This means that you can zoom in on the plot without it getting blurry. ------------------------------------------------------------------------ Congratulations! First dataset done! If you want to see the actual figure that was used in the publication, [click here](https://onlinelibrary.wiley.com/cms/asset/06b04a64-f803-4d76-af33-38e8c51df66e/ece35883-fig-0002-m.jpg). How does your plot compare? Feel free to make additional plots you have in mind for this dataset (we didn't use `air_temp` or `water_temp` yet), or move onto the next one. Johnson, D. S., C. Crowley, K. Longmire, J. Nelson, B. Williams, and S. Wittyngham. 2019. The fiddler crab, Minuca pugnax, follows Bergmann’s rule. Ecology and Evolution 9:14489–14497. # Frozen lakes ![](https://lter.github.io/lterdatasampler/reference/figures/ntl_icecover_img.jpg){fig-align="center"} ## Dataset The dataset contains information aboout the ice cover of two lakes in Madison, WI, USA. The dataset starts in the 19th century, so has the potential to be informative about how the climate has changed on long time scales. Data columns: - `lakeid`: name of the lake - `year`: year that the record was from - `ice_duration`: the number of days that the lake was covered with ice ## Exercises ### Load the data into your `R` session and assign it to an object called `lake_data`. Inspect the data using the `head()` and `str()` functions as before. ```{r} #| code-summary: "Show potential solution" lake_data <- read.csv("ntl_icecover.csv", stringsAsFactors = TRUE) head(lake_data) str(lake_data) ``` ------------------------------------------------------------------------ ### Plot a histogram of `ice_duration`. ```{r} #| code-fold: true #| code-summary: "Show potential solution" hist( x = lake_data$ice_duration, xlab = "Ice duration (days)", main = "" # note that providing an empty string is one way to not plot a title or axis. ) ``` Say we wanted to plot separately histograms for each lake. To do that we need to learn a bit about `par()`. `par()` hold global (i.e. effects everything) plotting parameters. Most of these are the default values for plot shape, size, colour, etc that are used when you do no specifiy what they should be. `par()` also controls how many panels the plotting window has. To plot two plots one above the other, we need to modify the `mfrow` argument (remember, use `?par` to find out more). We can then make our two plots, separating the data for each lake. ```{r} #| code-fold: false # Set the layout of the plotting window to have two rows par(mfrow = c(2, 1)) # Plot the histogram of ice duration for Lake Mendota hist( lake_data$ice_duration[lake_data$lake == "Lake Mendota"], main = "Lake Mendota", xlab = "Ice duration (days)" ) # Plot the histogram of ice duration for Lake Monona hist( lake_data$ice_duration[lake_data$lake == "Lake Monona"], main = "Lake Monona", xlab = "Ice duration (days)" ) # Set the layout back to a single plot per window par(mfrow = c(1, 1)) ``` For some of you, the fact that the breaks (bin sizes) are not the same, and that x axis does not match, will be very annoying. We can fix this by changing some arguments! Pass a vector of two numbers (the min and max value) to the argument `xlim =` in each plot, to see if you can fix this issue. To enforce the breaks to be the same, you can use the `breaks =` argument. Try to figure this out with `?hist` before looking at the solution. ```{r} #| code-summary: "Show potential solution" # to save typing it twice, we can define our xlim and breaks vector before hand # we can use the functions min() and max() to find a suitble range min(lake_data$ice_duration) # 21 max(lake_data$ice_duration) # 161 x_limit <- c(10,170) n_breaks <- 20 # Set the layout of the plotting window to have two rows par(mfrow = c(2, 1)) # Plot the histogram of ice duration for Lake Mendota hist( lake_data$ice_duration[lake_data$lake == "Lake Mendota"], main = "Lake Mendota", xlab = "Ice duration (days)", xlim = x_limit, breaks = n_breaks ) # Plot the histogram of ice duration for Lake Monona hist( lake_data$ice_duration[lake_data$lake == "Lake Monona"], main = "Lake Monona", xlab = "Ice duration (days)", xlim = x_limit, breaks = n_breaks ) # Set the layout back to a single plot per window par(mfrow = c(1, 1)) ``` You might have noticed that the code for our histograms is almost identical, apart from which lake is selected. We could make our code much shorter using `for` loops, where we just replace which lake is used and the same code is run twice. This will be covered in a dedicated session on the 19th October! For now, let's move on. ------------------------------------------------------------------------ ### The big question here is probably surrounding how `ice_duration` has changed over time (that is, the number of days a year that the lakes are covered in ice). How could we visualise that? One option might be to plot the relationship between `ice_duration` and `year`. Try to do that using the `plot()` function. Give the plot descripte axis labels and a title. Check the helpfiles if you are unsure. ```{r} #| code-summary: "Show potential solution" # using the formula notation plot( ice_duration ~ year, data = lake_data, main = "Ice duration over time", xlab = "Year", ylab = "Ice duration (days)" ) # OR # providing just the vectors plot( x = lake_data$year, y = lake_data$ice_duration, main = "Ice duration over time", xlab = "Year", ylab = "Ice duration (days)" ) # Both make the same plot ``` This sort of data is often called timeseries data, and we usually see it presented using a line, not individual points. We can use the `type = l` argument for the `plot()` function to acheive this. However, this will "join" the dots **in order they appear in the dataframe**. We also have two lakes, and therefore two data points for each value of the x axis. If you don't see what the issue is, you can try plot it! Let's deal with the two lake problem first. Try to adapt the code we used in the previous dataset to calculate a mean `ice_duration` for each year across the two lakes. ```{r} #| code-summary: "Show potential solution" mean_ice_duration <- aggregate(ice_duration ~ year, data = lake_data, FUN = mean) ``` Now let's ensure our data is in the correct order. We can do that by ordering the dataset by the year column. Use `?` to find out how `order()` works. ```{r} mean_ice_duration <- mean_ice_duration[order(mean_ice_duration$year), ] ``` Ok, now we are ready to make the plot. Use the `type = l` argument for the `plot()` function to make the timeseries plot with a line connecting the data points. Change the colour of the line to one of your preference. As an additional execrise, how could you change the Y axis to be in weeks, rather than days? ```{r} #| code-summary: "Show potential solution" plot( ice_duration ~ year, data = mean_ice_duration, type = "l", main = "Mean ice duration over time", xlab = "Year", ylab = "Mean Ice duration (days)", col = "red" ) ``` ```{r} #| code-summary: "Change days to weeks" plot( ice_duration/7 ~ year, # we can make the edit here, without editing the dataframe object. diving by 7 will give weeks. data = mean_ice_duration, type = "l", main = "Mean ice duration over time", xlab = "Year", ylab = "Mean Ice duration (weeks)", col = "red" ) ``` Now, modify your plot to instead show both lines and points (look at `?plot` for help). Make the points solid instead of having no fill. ```{r} #| code-summary: "Show potential solution" plot( ice_duration ~ year, data = mean_ice_duration, type = "b", main = "Mean ice duration over time", xlab = "Year", ylab = "Mean Ice duration (days)", col = "red", pch = 16 # 19 will also work ) ``` Use `lm()` and `abline()` to add a trendline to the figure. Make the line dashed instead of solid, and increase it's width to `2`. Maybe you also need to change the colour so it stands out on your plot? ```{r} #| code-summary: "Show potential solution" abline(lm(ice_duration ~ year, data = lake_data), lty = 5, lwd = 2) ``` ------------------------------------------------------------------------ ### From what we have covered in the last two datasets, do you think you can make a similar plot as in the last section, but have a separate line for each lake? There's a few ways to do it. You will need to use the `lines()` function to add a second line to the plot. Add in a legend to show which line is which. You don't need to add a trendline with `abline()`. Good luck! ```{r} #| code-summary: "Show potential solution" # separate the data for each lake mendota <- lake_data[lake_data$lakeid == "Lake Mendota", ] monona <- lake_data[lake_data$lakeid == "Lake Monona", ] # make sure both datasets are in order mendota <- mendota[order(mendota$year), ] monona <- monona[order(monona$year), ] # make the plot with the first lake plot( ice_duration ~ year, data = mendota, type = "l", main = "Mean ice duration over time", xlab = "Year", ylab = "Mean Ice duration (days)", col = "red" ) # add the second lake lines( ice_duration ~ year, data = monona, col = "blue" ) # add a legend legend( "topright", legend = c("Lake Mendota", "Lake Monona"), col = c("red", "blue"), lty = 1 ) ``` ------------------------------------------------------------------------ Awesome work! Congratulations for getting this far! The next section is quite open for you to do what you like, so feel free to use the Sugar Maple dataset, or to use one of your own, or to keep working on other plots from the two datasets above. # Maples and ggplot2 ![](https://lter.github.io/lterdatasampler/reference/figures/hbr_maples_img_3.jpg){fig-align="center"} ## Dataset The `hbr_maples.csv` dataset contains observations on sugar maple seedlings in untreated and calcium-amended watersheds at Hubbard Brook Experimental Forest in New Hampshire, USA. To investigate the impact of soil calcium supplementation on sugar maple seedling growth, the authors recorded “general sugar maple germinant health by height, leaf area, biomass, and chlorophyll content” for seedlings in untreated and previously calcium-treated watersheds. By comparing seedling growth in calcium-treated (W1) versus untreated (Reference) watersheds, calcium impacts on sugar maple seedling growth can be explored. Sugar maple seedlings were examined on calcium-treated and reference sites during August 2003 and June 2004. Seedlings were sampled every ten steps in transects. - `year`: a number denoting the year that the sample was taken - `watershed`: a factor denoting the watershed where the sample was collected; W1 = calcium-treated, Reference = reference - `elevation`: a factor describing the Elevation of transect; Low = low elevation, Mid = mid elevation - `transect`: a factor denoting the transect number within the watershed - `sample`: a factor denoting the sample number within transect. There are twenty samples in each transect - `stem_length`: a number denoting the height of the seedling in millimeters - `leaf1area`: a number denoting the area of the first sampled leaf in square centimeters - `leaf2area`: a number denoting the area of the second sampled leaf in square centimeters - `leaf_dry_mass`: a number denoting the dry mass of both sampled leaves in grams - `stem_dry_mass`: a number denoting the dry mass of the stem in grams - `corrected_leaf_area`: a number denoting the area of both leaves in addition to the area removed for chlorophyll measurements in square centimeters Field observations suggest that the seedlings in the calcium-treated watershed appeared to be more *abundant and healthier* than in the reference areas. They looked *larger, greener, and had bigger root systems* that held more soil on the roots when they were plucked out of the ground. ------------------------------------------------------------------------ ## Exercises Based on those observations, you should produce a set of plots that examine the differences between the sugar maple seedlings across treatments. We will use ggplot2 to do this and at the end of this exercise you should produce a bar plot, scatterplot, or boxplot using ggplot2 based on the field observations above. Be creative! Try and export your figures to PDF files at the end. ------------------------------------------------------------------------ ### Read the data Load the `ggplot2` package and import the data set you’ll be working with. If we have never used `ggplot2` before we can use the function `install.packages("")` to get it. If you have used `ggplot2` before you can call it with the function `library()`. ```{r} #| code-summary: "Show potential solution" library(ggplot2) plants <- read.csv("hbr_maples.csv") ``` ------------------------------------------------------------------------ ### Scatterplot Use `ggplot()` to plot *stem_length* on the x-axis and *leaf_dry_mass* on the y-axis. Make sure the plot shows each observation as a point. ```{r} #| code-summary: "Show potential solution" ggplot(plants, aes(x = stem_length, y = leaf_dry_mass)) + geom_point() #plants -> data frame we’re plotting. #aes(...) -> aesthetics—it tells R which variables map to which visual axes. #x -> stem_length means the x-axis shows stem length. #y -> leaf_dry_mass means the y-axis shows leaf dry mass. #+ -> the + adds a “layer” to the plot. #geom_point() -> says the geometry of this layer is points, so we’re making a scatterplot where points differ in position, color, and size ``` Now that you know how to do a scatterplot using `ggplot()`, can you do it for *corrected_leaf_area* in the x-axis and *stem_dry_mass* on the y-axis? ------------------------------------------------------------------------ ### Map Aesthetics: Color and Size Update your scatterplot so that points are *colored by watershed* and *sized by stem_dry_mass*. This will help you visualize how these variables vary together ```{r} #| code-summary: "Show potential solution" ggplot(plants, aes(x = stem_length, y = leaf_dry_mass, color = watershed, size = stem_dry_mass)) + geom_point() #color -> the variable by which we will be coloring the dots #size -> the variable by which we will be having different dot sizes ``` Could you color your scatterplot by *year* and have the dot size by *transect*? ------------------------------------------------------------------------ ### Customize Labels and Theme Using the same scatterplot, add descriptive axis labels and a clear title using `labs()`, and apply a ggplot2 theme (e.g., `theme_minimal()`) so the plot looks polished and easy to read. You can find theme variations in https://ggplot2.tidyverse.org/reference/ggtheme.html ```{r} #| code-summary: "Show potential solution" ggplot(plants, aes(stem_length, leaf_dry_mass, color = watershed)) + geom_point() + labs(title = "Stem length vs Leaf dry mass", x = "Stem length (cm)", y = "Leaf dry mass (g)") + theme_minimal() #labs(...) -> function to add titles and axis labels #title -> the title of the graph #x -> the x-axis label #y -> the y-axis label # + theme_minimal() -> applies a minimalist theme to the graph ``` Could you try applying a different theme to your graphic? ------------------------------------------------------------------------ ### Add a Smoother (Trend Line) Add a linear trend line to the scatterplot (for example, with `geom_smooth(method = "lm")`) to show the overall relationship between stem length and leaf dry mass. ```{r} #| code-summary: "Show potential solution" ggplot(plants, aes(stem_length, leaf_dry_mass)) + geom_point() + geom_smooth(method = "lm", se = FALSE) #+ geom_smooth(method = "lm", se = FALSE) -> adds a smooth trend line. #method = "lm" -> tells ggplot to fit a linear model (a straight regression line). #se = FALSE -> removes the shaded confidence band around the line. ``` How could you modify this command so you can see the confidence interval of the trend line? ------------------------------------------------------------------------ ### Use Faceting to Compare Elevations Split the scatterplot into separate panels for each elevation level using `facet_wrap()` or `facet_grid()`. This lets you compare patterns across elevations. ```{r} #| code-summary: "Show potential solution" ggplot(plants, aes(stem_length, leaf_dry_mass)) + geom_point() + facet_wrap(~ elevation) #+ facet_wrap(~ elevation) -> splits the plot into multiple small panels (facets). #~ elevation -> each panel shows data for one level of the variable elevation, side by side. ``` ------------------------------------------------------------------------ Could you use `facet_grid()` to split the scatterplot by years? ### Summarize Means with a Bar Plot (using dplyr) We will use `dplyr` to calculate the mean *leaf_dry_mass* for each watershed. For that, first we need to load the `dplyr` package using `library()` or, if we have never used `dplyr` before, we can get the package with `install.packages ("")`. Then we need to use `summarise()` to create a new summary table with the mean leaf dry mass values for each watershed. Afterwards we can create a bar plot of these means to show differences between watersheds. ```{r} #| code-summary: "Show potential solution" library(dplyr) mean_mass <- plants %>% group_by(watershed) %>% summarise(mean_leaf_mass = mean(leaf_dry_mass, na.rm = TRUE)) #plants -> data frame we’re plotting #%>% -> (the “pipe”) passes the result of one step to the next. #group_by(watershed) -> groups the data by the variable watershed. #summarise(...) -> creates a new summary table: #mean_leaf_mass = mean(leaf_dry_mass, na.rm = TRUE) -> calculates the mean leaf dry mass for each watershed, ignoring missing values (na.rm = TRUE). The result is saved as a new data frame called mean_mass. ggplot(mean_mass, aes(x = watershed, y = mean_leaf_mass)) + geom_col() #Starts a plot using the summarized data, mapping watershed to the x-axis and mean_leaf_mass to the y-axis. #+ geom_col() -> Adds a bar chart layer where the heights of the columns represent the mean_leaf_mass values. ``` ------------------------------------------------------------------------ ### Make a Boxplot Create a boxplot to show the distribution of *stem_length* (or another variable of interest) across different groups, such as *transect* or *elevation*. ```{r} #| code-summary: "Show potential solution" ggplot(plants, aes(x = transect, y = stem_length)) + geom_boxplot() + geom_jitter(alpha = 0.3) #ggplot(plants, aes(x = transect, y = stem_length)) -> Starts the plot with the plants data. Maps transect to the x-axis (categories) and stem_length to the y-axis (numeric values). #+ geom_boxplot() -> Adds a boxplot layer, showing the median, quartiles, and possible outliers of stem length for each transect. #+ geom_jitter(alpha = 0.3) -> Adds individual data points scattered (or “jittered”) slightly so they don’t overlap. #alpha = 0.3 -> makes the points partly transparent so dense areas are easier to see. ``` Could you modify this command to see the distribution of *corrected_leaf_area* across *elevation*, and make the dots solid black? ------------------------------------------------------------------------ ### Combine Multiple Variables in One Plot Design a plot that displays at least three variables at once. For example, let's take a look at the distribution of the values of *leaf1area* and *leaf2area* by mapping color to the *watershed*, shape to the *elevation*, and size to *stem_dry_mass*. This way you can explore more complex relationships. ```{r} #| code-summary: "Show a potential solution" ggplot(plants, aes(x = leaf1area, y = leaf2area, color = watershed, shape = elevation, size = stem_dry_mass)) + geom_point() ``` ------------------------------------------------------------------------ ### Apply Custom Scales and Colors Refine one of your previous plots by setting custom color palettes, axis scales, or legend labels to make the visualization clearer and more visually appealing. ```{r} #| code-summary: "Show a potential solution" ggplot(plants, aes(x = stem_length, y = leaf_dry_mass, color = stem_length)) + geom_point() + scale_color_gradient(low = "blue", high = "red") #+ scale_color_gradient(low = "blue", high = "red") -> Adjusts the color scale for the color aesthetic. Points with the smallest stem lengths are blue; those with the largest are red, with a smooth gradient in between. #OPTIONAL # You can use scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = avg_stem) to add a mid color point ``` ------------------------------------------------------------------------ ### Open ended project Create a final visualization of your choice using the maple data set. Decide which question you want to answer or which relationship you want to highlight, and use the plotting techniques you’ve learned. Try and export your figures as PDF files, add a title to them and a small description of which question you tried to answer.