Ideas

The aim of this code was to examine any temporal patterns in the species seen, This produces the table of species as a heatmap - the list of species ordered by temporal mean. According to google gemini sorting species by their “temporal center of gravity” is a common way to visualize shifts in biodiversity over time.

I first sorted the data into a structure with years as columns (wide format), the code uses the tidyverse suite to calculate the “weighted average year” for each species. This tells us, on average, when a species was most prevalent.

No correction in this data for effort as I don’t think it would be good when some years have only one or two dives. It would be better to do this from data from a wider area though - when there are several dives per year so less likely to get biases created by low dive years!

Then I tried clustering by ‘temporal similarity’ so using Jaccard similarity in whether species were present in similar patterns of years, but didn’t seem so useful fo this data. Think it would be if we looked at a wider area though and then could probably try clustering which factors in effort correction. So not included here.

Libraries

Make sure the libraries are installed!

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl) # to read workbook rather than csv format
library(ggdendro) # For visualizing the similarity tree

Data

This has the first column as ‘Species’ and then the following columns are years. In this example the data is presence absence (1,0). So no abundance or correction for dive effort - this does need to be addressed.

Polk <- read_excel("Drawna_Polkerris_Bay.xlsx", sheet = "Present_Absent")

View the first few rows:

head(Polk)
## # A tibble: 6 × 17
##   Species  `2003` `2004` `2005` `2008` `2009` `2010` `2011` `2012` `2013` `2014`
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Acantho…      0      0      0      0      0      0      1      0      1      0
## 2 Actinia…      0      0      0      0      0      0      1      0      1      0
## 3 Actinia…      0      0      0      1      0      0      1      0      1      0
## 4 Aequorea      0      0      0      0      0      0      0      0      0      0
## 5 Aglaoph…      0      0      0      0      0      0      0      0      1      0
## 6 Aglaoph…      0      0      0      0      0      0      0      0      1      0
## # ℹ 6 more variables: `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2019` <dbl>,
## #   `2020` <dbl>, `2021` <dbl>

Reformatting the data

Change the table to a long format with three columns: Species, year, present.
Note as the years are column headings in the original format they will come in as text strings so mutate is used to convert to numerical.

Polk_long <- Polk %>%
  pivot_longer(cols = -Species, names_to = "year", values_to = "present") %>%
  mutate(year = as.numeric(year))

Calculate the weighted mean year per species.

From Gemini: Before calculating an average year, we have to get rid of the “zeros.” If a species was absent in 2005, that year shouldn’t influence its average. This line ensures we only look at the years the species was actually spotted.
.groups = ‘drop’ -
This is just a bit of “coding housekeeping.” It tells R to forget the grouping we did in step 2 so that future calculations don’t get confused.

Then we join these stats back to the original database and sort by avg_year to group early-peak species first and late-peak last. The last line gets rid of the two new columns (note here we haven’t used ‘count’ anyway - this coloumn tells you how many years the species were seen in)

species_stats <- Polk_long %>%
  filter(present == 1) %>%
  group_by(Species) %>%
  summarize(
    avg_year = mean(year),
    count = n(),
    .groups = 'drop'
  )

Polk_sorted <- Polk %>%
  left_join(species_stats, by = "Species") %>%
  arrange(avg_year) %>%
  select(-avg_year, -count) # Clean up helper columns

Visualise the data

Finally produce a heatmap / table of the data - this needs to be pretty long so probably best to export as a png and view after! This reaaranges the data into long format and makes sure the Species become a factor with the specified order (ie the avg year used to sort it previously) otherwise the plot would revert to alphabetical order.

# 1. Prepare the Plotting Data
# pivot to long format and ensure 'species' is a factor 
# ordered by our 'avg_year' calculation
plot_data <- Polk_sorted %>%
  pivot_longer(cols = -Species, names_to = "year", values_to = "presence") %>%
  mutate(
    year = as.numeric(year),
    # This line is crucial: it locks in the order of the species
    Species = factor(Species, levels = rev(Polk_sorted$Species))
  )
# 2. Create the Heatmap
p <- ggplot(plot_data, aes(x = year, y = Species, fill = factor(presence))) +
  geom_tile(color = "white", linewidth = 0.1) +
  # Custom colors: 0 is light grey, 1 is a distinct color (e.g., dark blue)
  scale_fill_manual(values = c("0" = "#f0f0f0", "1" = "#2c7bb6"), 
                    labels = c("Absent", "Present"),
                    name = "Status") +
  scale_x_continuous(breaks = 2003:2021) +
  labs(
    title = "Species Temporal Distribution (2003-2021)",
    subtitle = "Sorted by Mean Year of Occurrence",
    x = "Year",
    y = "Species"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid = element_blank(),
    legend.position = "bottom"
  )

p

You can export the figure as a png

ggsave(
  filename = "PolkPlot.png",
  plot = p,
  width = 720 / 72,   # 10 inches
  height = 3000 / 72,  # 41.67 inches
  dpi = 72,            # Dots Per Inch
  limitsize = FALSE    # Necessary because the height is very large
)