Assignment 2: dplyr and ggplot (8 marks)

Feel free to mark this on https://rpubs.com/PontSatyre11119/eeb313-assignment-2

To submit this assignment, upload the full document to Quercus, including the original questions, your code, and the output. Submit your assignment as a knitted .pdf (preferred), .docx, or .html file.

Plotting (1 mark)

Run the code below to create a categorical variable of the activ column. This will make dplyr recognize that there are only two levels of activity (0 and 1), rather than a continuous range 0-1, which will facilitate plotting.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

beaver1_f <- beaver1 %>%
  mutate(factor_activ = factor(activ)) %>% 
  group_by(day) %>% 
  mutate(count = n_distinct(activ))

(unique(beaver1_f$activ)) # here we have only two factors now

## [1] 0 1

In the previous assignment, we saw that the beaver’s body temperature was the highest when the beaver was outside the retreat. However, we did not explore the distribution of temperatures for the active and inactive conditions. Create a histogram with the temperature on the x-axis and color the bins corresponding to the activity variable. (0.25 marks)

Hint: You need to use the fill parameter rather than color.

library(ggplot2)
library(dplyr)

ggplot(beaver1_f, aes(fill = factor_activ)) +
  geom_histogram(aes(x = temp), bins = 20) + # this makes a bar plot using the factors made beforehand
  labs(title = "Beaver Temperature vs Activity", x = "Temperature C", y = "Count", fill = "Activity") +
  theme_classic()

We already know that the beaver’s body temperature is correlated with whether it is outside the retreat or not. However, we did not control for the time of day, maybe the beaver’s temperature is even better predicted by knowing what time of day it is. To satisfactorily answer this question, we should perform a regression analysis, but we can easily get a good overview by plotting the data. Make a scatter plot with the time of day on the x-axis and the body temperature on the y-axis. Color the scatter points according the beaver’s activity level and separate the measurements into one plot per day. (0.75 marks)

Hint: To separate measurements per day, you could use filter() and two chunks of code, but try the more efficient way of facetting into subplots, which we talked about in the lecture.

ggplot(beaver1_f, aes(color = factor_activ)) +
  geom_point(aes(x = time, y = temp)) + # this makes a scatterplot with x-axis being time and y-axis being temperature
  facet_wrap(vars(day), scales = "free_x") + # we are facetting by day
  theme_classic() +
  labs(title = "Beaver Temperature over Time", subtitle = "For days 346 and 347", color = "Activity", x = "Time (hhmm)", y = "Temperature C")

Read in and pre-process data (1 mark)

Ok, that’s enough about beaver body temperatures. Now you will apply your data wrangling skills on the yearly change in biomass of plants in the beautiful Abisko national park in northern Sweden. We have preprocessed this data and made it available as a csv-file via this link. You can find the original data and a short readme on figshare and dryad. The original study¹ is available un an open access license. Reading through the readme on figshare, and the study abstract will increase your understanding for working with the data.

Read the data directly from the provided URL, into a variable called plant_biomass and display the first six rows. (0.25 mark)

plant_biomass <- read.csv("plant-biomass-preprocess.csv") # reads the csv
head(plant_biomass) # displays the top 6

Convert the Latin column names into their common English names: lingonberry, bilberry, bog bilberry, dwarf birch, crowberry, and wavy hair grass. After this, display all column names. (0.25 marks)

Hint: Search online to find out which Latin and English names pair up.

colnames(plant_biomass) <- (c("Year", "Site", "Habitat", "Treatment", "Bilberry", "Bog Bilberry", "Crowberry", "Dwarf Birch", "Lingonberry", "Wavy Hair Grass")) # Here I changed the column names to the common names with correct formatting. It makes it easier for plotting later on.

plant_biomass

This is a wide data frame (species make up the column names). A long format is easier to analyze, so gather the species names into one column (species) and the measurement values into another column (biomass). Assign it to the variable plant_long. (0.5 marks)

plant_long <- pivot_longer(plant_biomass, names_to = "species", values_to = "biomass", cols = -c(Year, Site, Habitat, Treatment)) # I am making the data from wide to long format

head(plant_long) # here is the top 6

Data exploration (5 marks)

Now that our data is in a tidy format, we can start exploring it!

What is the average biomass in g/m² for all observations in the study? (0.25 marks)

bio_filt <- plant_long %>% # I removed all NAs and values less than 0. You can't have biomass that is less than 0.
  filter(!is.na(biomass) & biomass > 0)

bio_filt %>% # this is my new df
  summarize(mean(biomass)) # this calculates the mean for everything

How does the average biomass compare between the grazed control sites and those that were protected from herbivores? (0.25 marks)

bio_filt %>% 
  group_by(Treatment) %>% # we want to calculate by treatment
  summarize(mean(biomass)) # this performs the mean calculation

Display a table of the average plant biomass for each year. (0.25 marks)

bio_filt %>% 
  group_by(Year) %>% # we want to calculate by year
  summarize(mean = mean(biomass)) # i am mean

What is the mean plant biomass per year for the grazedcontrol and rodentexclosure groups (spread these variables as separate columns in a table)? (0.5 marks)

bio_filt %>% 
  group_by(Treatment, Year, species) %>% # this makes summarize work
  summarize(mean = mean(biomass)) %>% # calculate means biomass for each species based on treatment
  pivot_wider(names_from = Treatment, values_from = mean) # makes the long df wider with treatment as two separate columns

## `summarise()` has grouped output by 'Treatment', 'Year'. You can override using
## the `.groups` argument.

Compare the biomass for grazedcontrol with that of rodentexclosure graphically in a line plot. What could explain the big dip in biomass year 2012? (0.75 marks)

library(ggplot2)
bio_filt %>% 
  group_by(Year, Treatment) %>% # we want to compare the biomass over time separated by treatment
  summarize(mean = mean(biomass)) %>% # calculates mean biomass
  ggplot(aes(x = Year, y = mean, color = Treatment)) +
  geom_line() + # makes a line plot with our requested analysis
  labs(title = "Average Biomass by Treatment", x = "Year", y = "Yearly Biomas", color = "Treatment", subtitle = "The dip in biomass is explained as a comment.") +
  scale_colour_discrete(labels = c("Grazed Control", "Rodent Exclosure")) +
  theme_classic()

## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

# Note the dip in 2012 is caused by an outbreak of geometrid moths causing herbivory on plant buds and leaves. An increase in pathogenic fungus also ensued, killing many plants.

Hint: The published study might be able to help with the second question…

Check whether there is an equal number of observations per species. (0.25 marks)

bio_filt %>% 
  group_by(species) %>% 
  tally()

# There is not an equal number of observations per species. However, the data is good enough for graphing. We might want to do an ANOVA or calculate variances.

Compare the yearly change in biomass for each species in a lineplot. (0.5 marks)

bio_filt %>% 
  group_by(Year, species) %>% # makes the summarize function usable
  summarize(mean = mean(biomass)) %>% # calculates mean biomass
  ggplot(aes(x = Year, y = mean, color = species)) +
  geom_line() + # make a line diagram for each species's mean biomass over time
  labs(title = "Change in Biomass per Species", x = "Year", y = "Mean Biomass", color = "Species") +
  scale_colour_discrete(labels = c("Bilberry", "Bog Bilberry", "Crowberry", "Dwarf Birch", "Lingonberry", "Wavy Hair Grass")) +
  theme_classic()

## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

From the previous two questions, we found that the biomass is higher in the sites with rodent exclosures (especially in recent years), and that the crowberry is the dominant species. Notice how the lines for rodentexclosure and crowberry are of similar shape. Coincidence? Let’s find out! Use a facetted line plot to explore whether all plant species are impacted equally by grazing. (0.75 mark)

bio_filt %>% 
  group_by(Year, species, Treatment) %>% # makes the summarize function usable
  summarize(mean = mean(biomass)) %>% # calculates the mean
  ggplot(aes(x = Year, y = mean, color = Treatment)) +
  geom_line() + # make x-axis time and y-axis mean biomass, with points being treatment
  facet_wrap(vars(species), scales = "free_x") +
  labs(title = "Change in Biomass by Treatment", x = "Year", y = "Mean Biomass", color = "Species") +
  scale_colour_discrete(labels = c("Grazed Control", "Rodent Exclosure")) +
  theme_classic()

## `summarise()` has grouped output by 'Year', 'species'. You can override using
## the `.groups` argument.

The habitat could also be affecting the biomass of different species. Explore this graphically. (0.75 marks)

bio_filt %>% 
  group_by(Year, species, Habitat) %>% # again makes the summarize function usable
  summarize(mean = mean(biomass)) %>% # calculates the mean
  ggplot(aes(x = Year, y = mean, color = Habitat)) +
  geom_line() + # make x-axis time and y-axis mean biomass, with points being habitat
  facet_wrap(vars(species), scales = "free_x") +
  labs(title = "Change in Biomass by Habitat", x = "Year", y = "Mean Biomass", color = "Habitat") +
  scale_colour_discrete(labels = c("Forest", "Tundra")) +
  theme_classic()

## `summarise()` has grouped output by 'Year', 'species'. You can override using
## the `.groups` argument.

It looks like both habitat and treatment have an effect on most of the species! Let’s dissect the data further by visualizing the effect of both the habitat and treatment on each species by facetting the plot accordingly. (0.75 mark)

bio_filt %>% 
  group_by(Year, species, Habitat, Treatment) %>% # essentially treating this as select() by this point
  summarize(mean = mean(biomass)) %>% # add a mean biomass column
  ggplot(aes(x = Year, y = mean, color = species), labels = (c("Grazed Control", "Rodent Exclosure"))) +
  geom_line() + # plot x-axis as time and y-axis as mean biomass, with species as points
  facet_grid(Habitat ~ Treatment) + # compare habitat with treatment
  labs(title = "Change in Biomass by Habitat & Treatment", x = "Year", y = "Mean Biomass", color = "Species") +
  theme_classic()

## `summarise()` has grouped output by 'Year', 'species', 'Habitat'. You can
## override using the `.groups` argument.

Hint: This is a hard one! You may want to explore R’s documentation for ggplot’s facet_grid

Create a new column that represents the square of the biomass. Display the three largest squared_biomass observations in descending order. Only include the columns year, squared_biomass and species and only observations between the years 2003 and 2008 from the forest habitat.

Hint: Break this down into single criteria and add one at a time. You will be able to obtain the desired result with five operations (1 mark)

(bio_filt %>% 
  mutate(squared_biomass = biomass^2) %>% # add a column of squared biomass
  filter(Year == c(2003:2008) & Habitat == "Forest")  %>% # filter by years and habitat
  select(Year, squared_biomass, species) %>% # only show the three columns
  arrange(desc(squared_biomass)) # arrange by ascending and sort by descending
 )[1:3,] # select the first three

Olofsson J, te Beest M, Ericson L (2013) Complex biotic interactions drive long-term vegetation dynamics in a subarctic ecosystem. Philosophical Transactions of the Royal Society B 368(1624): 20120486. https://dx.doi.org/10.1098/rstb.2012.0486 ↩︎

Assignment 2: dplyr and ggplot (8 marks)

Matthew Ho