The aim of this code was to examine any temporal patterns in the species seen, This produces the table of species as a heatmap - the list of species ordered by temporal mean. According to google gemini sorting species by their “temporal center of gravity” is a common way to visualize shifts in biodiversity over time.
I first sorted the data into a structure with years as columns (wide format), the code uses the tidyverse suite to calculate the “weighted average year” for each species. This tells us, on average, when a species was most prevalent.
No correction in this data for effort as I don’t think it would be good when some years have only one or two dives. It would be better to do this from data from a wider area though - when there are several dives per year so less likely to get biases created by low dive years!
Then I tried clustering by ‘temporal similarity’ so using Jaccard similarity in whether species were present in similar patterns of years, but didn’t seem so useful fo this data. Think it would be if we looked at a wider area though and then could probably try clustering which factors in effort correction. So not included here.
Make sure the libraries are installed!
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl) # to read workbook rather than csv format
library(ggdendro) # For visualizing the similarity tree
This has the first column as ‘Species’ and then the following columns are years. In this example the data is presence absence (1,0). So no abundance or correction for dive effort - this does need to be addressed.
Polk <- read_excel("Drawna_Polkerris_Bay.xlsx", sheet = "Present_Absent")
View the first few rows:
head(Polk)
## # A tibble: 6 × 17
## Species `2003` `2004` `2005` `2008` `2009` `2010` `2011` `2012` `2013` `2014`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Acantho… 0 0 0 0 0 0 1 0 1 0
## 2 Actinia… 0 0 0 0 0 0 1 0 1 0
## 3 Actinia… 0 0 0 1 0 0 1 0 1 0
## 4 Aequorea 0 0 0 0 0 0 0 0 0 0
## 5 Aglaoph… 0 0 0 0 0 0 0 0 1 0
## 6 Aglaoph… 0 0 0 0 0 0 0 0 1 0
## # ℹ 6 more variables: `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2019` <dbl>,
## # `2020` <dbl>, `2021` <dbl>
Change the table to a long format with three columns: Species, year,
present.
Note as the years are column headings in the original format they will
come in as text strings so mutate is used to convert to numerical.
Polk_long <- Polk %>%
pivot_longer(cols = -Species, names_to = "year", values_to = "present") %>%
mutate(year = as.numeric(year))
From Gemini: Before calculating an average year, we have to get rid
of the “zeros.” If a species was absent in 2005, that year shouldn’t
influence its average. This line ensures we only look at the years the
species was actually spotted.
.groups = ‘drop’ -
This is just a bit of “coding housekeeping.” It tells R to forget the
grouping we did in step 2 so that future calculations don’t get
confused.
Then we join these stats back to the original database and sort by avg_year to group early-peak species first and late-peak last. The last line gets rid of the two new columns (note here we haven’t used ‘count’ anyway - this coloumn tells you how many years the species were seen in)
species_stats <- Polk_long %>%
filter(present == 1) %>%
group_by(Species) %>%
summarize(
avg_year = mean(year),
count = n(),
.groups = 'drop'
)
Polk_sorted <- Polk %>%
left_join(species_stats, by = "Species") %>%
arrange(avg_year) %>%
select(-avg_year, -count) # Clean up helper columns
Finally produce a heatmap / table of the data - this needs to be pretty long so probably best to export as a png and view after! This reaaranges the data into long format and makes sure the Species become a factor with the specified order (ie the avg year used to sort it previously) otherwise the plot would revert to alphabetical order.
# 1. Prepare the Plotting Data
# pivot to long format and ensure 'species' is a factor
# ordered by our 'avg_year' calculation
plot_data <- Polk_sorted %>%
pivot_longer(cols = -Species, names_to = "year", values_to = "presence") %>%
mutate(
year = as.numeric(year),
# This line is crucial: it locks in the order of the species
Species = factor(Species, levels = rev(Polk_sorted$Species))
)
# 2. Create the Heatmap
p <- ggplot(plot_data, aes(x = year, y = Species, fill = factor(presence))) +
geom_tile(color = "white", linewidth = 0.1) +
# Custom colors: 0 is light grey, 1 is a distinct color (e.g., dark blue)
scale_fill_manual(values = c("0" = "#f0f0f0", "1" = "#2c7bb6"),
labels = c("Absent", "Present"),
name = "Status") +
scale_x_continuous(breaks = 2003:2021) +
labs(
title = "Species Temporal Distribution (2003-2021)",
subtitle = "Sorted by Mean Year of Occurrence",
x = "Year",
y = "Species"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank(),
legend.position = "bottom"
)
p
You can export the figure as a png
ggsave(
filename = "PolkPlot.png",
plot = p,
width = 720 / 72, # 10 inches
height = 3000 / 72, # 41.67 inches
dpi = 72, # Dots Per Inch
limitsize = FALSE # Necessary because the height is very large
)