Analyzing biodiversity metrics in R:

My Data set: SFCN Coral Reef Monitoring Data Package Version 2 (1999-2024): CDR, Disease, Species List, and Video Datasets

Description: The coral reef monitoring dataset from the South Florida/Caribbean Network (SFCN) focuses on reefs in five national parks including Biscayne, Dry Tortugas, Virgin Islands, Buck Island, and Salt River Bay. Researchers collected information on stony coral cover, algae, sponges, gorgonians, and substrate, as well as coral species diversity, reef structure, and rugosity (surface complexity). They also tracked coral bleaching, coral disease, and the abundance of the long-spined sea urchin Diadema antillarum, since these factors strongly influence reef health. The purpose of collecting this data is to understand how reef ecosystems change over time and across different management zones, and to evaluate the effects of unusual events like hurricanes or bleaching episodes.

Loading in the Data:

library(readr)
CoralData <- read_csv("/Users/jacobo/Desktop/CoralData.csv")
Rows: 48889 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (12): TripName, StartDate, Park, SiteFullName, Site, Project, Purpose, ...
dbl   (9): TripID, Year, SiteLatitude, SiteLongitude, ProtocolVersion, Trans...
lgl   (1): SensitiveTaxon
date  (1): VerifiedDate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Site Names Across All years:

Sites<- unique(CoralData$SiteFullName)
print(Sites)
 [1] "Tektite"                 "BAR_01"                 
 [3] "BAR_02"                  "BAR_03"                 
 [5] "BAR_04"                  "Yawzi"                  
 [7] "Ball Buoy"               "Bird Key"               
 [9] "Loggerhead Forest 1"     "Loggerhead Forest 2"    
[11] "Loggerhead Forest 3"     "Loggerhead Forest 4"    
[13] "Loggerhead Forest 5"     "Santa's Village 1"      
[15] "Santa's Village 2"       "Santa's Village 3"      
[17] "South Fore Reef"         "Salt River"             
[19] "Newfound"                "Haulover"               
[21] "Mennebeck"               "Amanda's Reef"          
[23] "I_26"                    "I_36"                   
[25] "I_38"                    "I_41"                   
[27] "I_44"                    "O_42"                   
[29] "O_56"                    "O_72"                   
[31] "O_82"                    "O_84"                   
[33] "O_94"                    "Western Spur and Groove"
[35] "Bird Key North"          "I_33"                   
[37] "I_63"                    "O_12"                   
[39] "O_27"                    "O_92"                   

Number of Sites (This is the total number of sites across all years without taking into account the variation in the number of sites per year):

num_sites <- length(unique(CoralData$SiteFullName))
print(num_sites)
[1] 40

Number of Sites Sampled in 2024:

num_sites_2024 <- length(unique(CoralData$SiteFullName[CoralData$Year == 2024]))
print(num_sites_2024)
[1] 20

Only half of the total number of sites across all years were sampled (or resampled) in 2024.

All of the Unique Taxons Across All Years:

Taxons<- unique(CoralData$TaxonScientificName)
print(Taxons)
 [1] "Agaricia lamarcki"         "Siderastrea siderea"      
 [3] "Montastraea cavernosa"     "Scolymia sp."             
 [5] "Stephanocoenia intersepta" "Colpophyllia natans"      
 [7] "Orbicella faveolata"       "Orbicella franksi"        
 [9] "Porites astreoides"        "Porites porites"          
[11] "Millepora alcicornis"      "Undaria agaricites"       
[13] "Mycetophyllia aliciae"     "Madracis auretenra"       
[15] "Madracis decactis"         "Orbicella annularis"      
[17] "Pseudodiploria strigosa"   "Eusmilia fastigiata"      
[19] "Helioseris cucullata"      "Agaricia fragilis"        
[21] "Meandrina meandrites"      "Diploria labyrinthiformis"
[23] "Siderastrea radians"       "Stylaster roseus"         
[25] "Mussa angulosa"            "Dichocoenia stokesii"     
[27] "Solenastrea bournoni"      "Mycetophyllia ferox"      
[29] "Mycetophyllia lamarckiana" NA                         
[31] "Agaricia undata"           "Manicina areolata"        
[33] "Pseudodiploria clivosa"    "Millepora complanata"     
[35] "Acropora cervicornis"      "Favia fragum"             
[37] "Isophyllia rigida"         "Scolymia cubensis"        
[39] "Acropora palmata"          "Porites furcata"          
[41] "Agaricia sp."              "Dendrogyra cylindrus"     
[43] "Mycetophyllia sp."         "Meandrina jacksoni"       
[45] "Isophyllia sinuosa"        "Porites divaricata"       
[47] "Orbicella sp."             "Agaricia grahamae"        
[49] "Scolymia lacera"           "Undaria tenuifolia"       
[51] "Undaria humilis"           "Madracis sp."             

Species Richness:

Number of Taxons Across All Sites Across All years: (Gamma Diversity)

num_taxons<-length(unique(CoralData$TaxonScientificName))
print(num_taxons)
[1] 52

Number of taxons present in 2024 (Gamma Diversity for 2024)

num_taxons_2024 <- length(unique(CoralData$TaxonScientificName[CoralData$Year == 2024]))
print(num_taxons_2024)
[1] 35

Average Number of Taxons per plot in 2024:

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
avg_taxa_per_plot_2024 <- CoralData %>%
  filter(Year == 2024) %>%
  group_by(SiteFullName) %>%
  summarise(num_taxa = n_distinct(TaxonScientificName)) %>%
  summarise(avg_taxa = mean(num_taxa)) %>%
  pull(avg_taxa)
print(avg_taxa_per_plot_2024)
[1] 16.55

Are the Differences in species richness Statistically Significant?

T-Test: Comparing the Average Species Richness Across all years vs 2024

richness_by_plot_year <- CoralData %>%
  group_by(Year, SiteFullName) %>%
  summarise(richness = n_distinct(TaxonScientificName), .groups = "drop")

richness_2024 <- richness_by_plot_year %>%
  filter(Year == 2024) %>%
  pull(richness)

richness_all_years <- richness_by_plot_year %>%
  filter(Year != 2024) %>%
  pull(richness)

t_test_result <- t.test(richness_2024, richness_all_years)

t_test_result

    Welch Two Sample t-test

data:  richness_2024 and richness_all_years
t = -2.9558, df = 21.489, p-value = 0.007425
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.249403 -1.091591
sample estimates:
mean of x mean of y 
  16.5500   20.2205 

The average species richness in 2024 (16.55 taxa per plot) was significantly lower than in other years (20.22 taxa per plot), with a p‑value of 0.007 (assuming our significance level is 0.05) indicating a reliable difference of about 1 to 6 fewer taxa in 2024.

library(dplyr)

Individuals per Taxon Across All Years:

individuals_per_taxon <- CoralData %>%
  group_by(TaxonScientificName) %>%
  summarise(TotalIndividuals = n()) %>%
  arrange(desc(TotalIndividuals))
print(individuals_per_taxon)
# A tibble: 52 × 2
   TaxonScientificName       TotalIndividuals
   <chr>                                <int>
 1 Porites astreoides                    3861
 2 Siderastrea siderea                   3751
 3 Millepora alcicornis                  3729
 4 Porites porites                       3547
 5 Undaria agaricites                    3483
 6 Montastraea cavernosa                 3083
 7 Orbicella franksi                     2942
 8 Orbicella faveolata                   2639
 9 Stephanocoenia intersepta             2627
10 Orbicella annularis                   2092
# ℹ 42 more rows

Individuals per Taxon Across all Sites Per Year:

library(dplyr)
library(ggplot2)
taxon_year_counts <- CoralData %>%
  group_by(Year, TaxonScientificName) %>%
  summarise(Count = n()) %>%
  ungroup()
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
ggplot(taxon_year_counts, aes(x = Year, y = Count, color = TaxonScientificName)) +
  geom_line(size = 1) +
  labs(title = "Individuals per Taxon Over Time Across All Sites",
       x = "Year",
       y = "Number of Individuals",
       color = "Taxon") +
  theme_minimal() +
  theme(legend.position = "right",
        legend.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

A Better View of the Graph:

My year of focus is 2024. The reason why theres a huge dip in the general graph in 2020 for all species is because they didn’t sample all of the plots that they were sampling in the first couple of years, so the data is incomplete. This dip in 2020 is primarily due to COVID-19 pandemic and the lack of work during the period due to quarantine. In the years following 2020, the number of plots being sampled is still less than the number of plots sampled in the first couple of years but there is an upward trend.

Individuals Per Taxon Per Site 2024:

library(dplyr)
library(ggplot2)

taxon_site_2024 <- CoralData %>%
  filter(Year == 2024) %>%
  group_by(SiteFullName, TaxonScientificName) %>%
  summarise(Count = n(), .groups = "drop")

ggplot(taxon_site_2024, aes(x = SiteFullName, y = Count, fill = TaxonScientificName)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Individuals per Taxon per Site (2024)",
       x = "Site",
       y = "Number of Individuals",
       fill = "Taxon") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")

Sites with the Greatest Amount of Taxa in 2024:

taxa_richness_2024 <- CoralData %>%
  filter(Year == 2024) %>%
  group_by(SiteFullName) %>%
  summarise(num_taxa = n_distinct(TaxonScientificName), .groups = "drop") %>%
  arrange(desc(num_taxa))
head(taxa_richness_2024)
# A tibble: 6 × 2
  SiteFullName    num_taxa
  <chr>              <int>
1 Haulover              25
2 Newfound              25
3 Salt River            25
4 Yawzi                 25
5 Tektite               23
6 South Fore Reef       21

Now that we know the sites with the greatest amount of taxa for 2024 we want to compare these sites to see how similar or different their species compositions are to eachother. To do that we will calculate the average Sorensen dissimilarity index across these six sites.

Average Sorensen Dissimilarity Index across the 6 sites with the Greatest Biodiversity (2024):

library(vegan)
Loading required package: permute
sites_of_interest <- c("Haulover", "Newfound", "Salt River", 
                       "Yawzi", "Tektite", "South Fore Reef")

Coral2024_sites <- CoralData %>%
  filter(Year == 2024, SiteFullName %in% sites_of_interest)

comm_matrix <- table(Coral2024_sites$SiteFullName, Coral2024_sites$TaxonScientificName)
comm_matrix[comm_matrix > 0] <- 1

sorensen_dist <- vegdist(comm_matrix, method = "bray", binary = TRUE)
sorensen_matrix <- as.matrix(sorensen_dist)

avg_sorensen <- mean(sorensen_dist)
print(avg_sorensen)
[1] 0.1270754

We got a value of 0.127 indicating that our sites share a lot of their taxa. Although they are sites which show the largest amount of biodiversity across all sites, they share the same kinds of species.

Average Individuals per Species per Site Across All Years:

library(dplyr)

yearly_counts <- CoralData %>%
  group_by(Year, SiteFullName, TaxonScientificName) %>%
  summarise(Count = n(), .groups = "drop")

avg_counts <- yearly_counts %>%
  group_by(SiteFullName, TaxonScientificName) %>%
  summarise(AvgIndividuals = mean(Count), .groups = "drop")
ggplot(avg_counts, aes(x = SiteFullName, y = AvgIndividuals, fill = TaxonScientificName)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Individuals per Species per Site Across All Years",
       x = "Site",
       y = "Average Individuals",
       fill = "Taxon") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")

The Most Abundant Species in 2024:

library(dplyr)
species_site_2024 <- CoralData %>%
  filter(Year == 2024) %>%
  group_by(SiteFullName, TaxonScientificName) %>%
  summarise(Count = n(), .groups = "drop")

avg_species_2024 <- species_site_2024 %>%
  group_by(TaxonScientificName) %>%
  summarise(AvgIndividuals = mean(Count), .groups = "drop")

most_abundant_2024 <- avg_species_2024 %>%
  arrange(desc(AvgIndividuals)) %>%
  slice(1)

print(most_abundant_2024)
# A tibble: 1 × 2
  TaxonScientificName AvgIndividuals
  <chr>                        <dbl>
1 Orbicella annularis           15.4

The Most Abundant Species On Average Across All Years:

library(dplyr)
yearly_counts <- CoralData %>%
  group_by(Year, SiteFullName, TaxonScientificName) %>%
  summarise(Count = n(), .groups = "drop")

avg_counts <- yearly_counts %>%
  group_by(TaxonScientificName) %>%
  summarise(AvgIndividuals = mean(Count), .groups = "drop")


most_dominant <- avg_counts %>%
  arrange(desc(AvgIndividuals)) %>%
  slice(1)

print(most_dominant)
# A tibble: 1 × 2
  TaxonScientificName AvgIndividuals
  <chr>                        <dbl>
1 Orbicella annularis           15.2

So overall the most abundant species per site per year is Orbicella annularis with about 15 individuals. This is consistent with 2024’s results as well. But there’s an explanation for this. Orbicella annularis is the most abundant coral species in South Florida and Caribbean reefs because it is a dominant reef-builder that creates large, massive colonies forming the backbone of reef structures. Its ability to adapt to different depths, currents, and light conditions allows it to thrive across diverse reef zones (de Matas, T.-J. ,2016). Compared to more vulnerable species like Acropora palmata or Dendrogyra cylindrus, historically Orbicella annularis has shown greater resilience to disease and bleaching, helping it maintain high population levels (de Matas, T.-J. ,2016). These reefs fall within its natural geographic range, making it consistently present in monitoring datasets across all years. As a keystone species, its established colonies and ecological importance explain why it remains the most abundant even in 2024.

Shannon diversity Index Per Site (2024)

library(vegan)
library(dplyr)
library(tidyr)
library(vegan)
library(tibble)

df_2024 <- CoralData %>%
  filter(Year == 2024)

species_matrix <- df_2024 %>%
  count(SiteFullName, TaxonScientificName) %>%
  pivot_wider(names_from = TaxonScientificName, values_from = n, values_fill = 0) %>%
  column_to_rownames("SiteFullName")

shannon_index <- diversity(species_matrix, index = "shannon")

print(shannon_index)
             BAR_01              BAR_02              BAR_03              BAR_04 
           2.346210            2.458612            2.414747            2.369317 
          Ball Buoy            Bird Key            Haulover Loggerhead Forest 1 
           2.491169            2.560116            2.791024            2.465730 
Loggerhead Forest 2 Loggerhead Forest 3 Loggerhead Forest 4 Loggerhead Forest 5 
           2.481937            2.519405            2.461686            2.379311 
           Newfound          Salt River   Santa's Village 1   Santa's Village 2 
           2.895545            2.914256            2.515604            2.366342 
  Santa's Village 3     South Fore Reef             Tektite               Yawzi 
           2.374493            2.728783            2.915705            3.039291 

Shannon Index Per Site Graph (2024):

shannon_df <- data.frame(
  Site = names(shannon_index),
  Shannon = as.numeric(shannon_index))

ggplot(shannon_df, aes(x = reorder(Site, Shannon), y = Shannon, fill = Shannon)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Shannon Diversity Index per Site (2024)",
       x = "Site",
       y = "Shannon Index",
       fill = "Diversity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Average Shannon Index per site for 2024:

avg_shannon_index2024 <- (sum(shannon_index))/20
print(avg_shannon_index2024)
[1] 2.574464

Our average Shannon diversity value per site demonstrates that our biodiversity is more moderate. A Shannon Diversity index value of 2.57 is “moderate” because it indicates a community with a fair number of species and some balance, but not the highest possible richness or evenness.

Simpson Eveness Per Site(2024)

simpson_diversity <- diversity(species_matrix, index = "simpson")
print(simpson_diversity)
             BAR_01              BAR_02              BAR_03              BAR_04 
          0.9013879           0.9086925           0.9072978           0.8996540 
          Ball Buoy            Bird Key            Haulover Loggerhead Forest 1 
          0.9073854           0.9064760           0.9249056           0.9093878 
Loggerhead Forest 2 Loggerhead Forest 3 Loggerhead Forest 4 Loggerhead Forest 5 
          0.9116143           0.9130752           0.9089506           0.9013841 
           Newfound          Salt River   Santa's Village 1   Santa's Village 2 
          0.9370129           0.9384177           0.9127424           0.8984375 
  Santa's Village 3     South Fore Reef             Tektite               Yawzi 
          0.9013879           0.9256973           0.9406413           0.9477003 

Simpson Diversity per Site Graph (2024)

simpson_df <- data.frame(
  Site = names(simpson_diversity),
 Simpson = as.numeric(simpson_diversity))

ggplot(simpson_df, aes(x = reorder(Site, Simpson), y = Simpson, fill = Simpson)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Simpson Diversity per Site (2024)",
       x = "Site",
       y = "Simpson Diversity",
       fill = "Diversity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Simpson Diversity Average Per Site (2024)

avg_simpson_diversity2024 <- (sum(simpson_diversity))/20
print(avg_simpson_diversity2024)
[1] 0.9151124

Since our Simpsons Index value average is 0.915 that means that most of our Simpson values are very close to 1 so that means we have low species diversity and high dominance.

Gini-Simpson’s Index Average Per Site:

"Gini-Simpson_Value" <- (1-(avg_simpson_diversity2024))
print(`Gini-Simpson_Value`)
[1] 0.08488758

0.084 is the probability that two randomly selected individuals from each site will be different from eachother indicating that we have low species diversity and high dominance within sites.

Justification for Not Using Jaccard, Euclidian, or Bray-Curtis Metrics:

For my specific Data Set I don’t believe using Jaccard, Euclidian, or Bray-Curtis metrics would be useful. These indices emphasize differences in species presence or abundance across sites, but my data show consistently low diversity and high dominance. Because most sites are dominated by only a few taxa, Jaccard values would be artificially high and fail to capture meaningful variation. Euclidean distance would mostly reflect differences in absolute counts of dominant taxa rather than true ecological differences. Similarly, Bray–Curtis would highlight shifts in dominance patterns rather than overall community structure, so measures like Simpsons, Shannon Diversity, and Sorensen dissimilarity index are more appropriate for interpreting this dataset. Also when i was trying to use euclidian distance I was getting really huge values per site which were non representative because the average number of species per site is around 15 and those values were in the hundreds. I will admit that part of that is due to human error, but generally for the information outlined on the rubric the metrics used suffice.

What overall hypotheses or conclusions can you draw from your analysis of this dataset?

The dataset analysis shows that reef biodiversity in 2024 experienced a statistically significant decline in species richness, dropping to an average of 16.55 taxa per plot compared to 20.22 in prior years. Despite this loss, overall diversity remained moderate, with a Shannon Index of 2.57 indicating a fair but not maximal balance of species. The reef’s structural resilience is maintained by the dominance of Orbicella annularis, a keystone species that continues to anchor the ecosystem and resist disease and bleaching. Among the most biodiverse sites, species composition was highly similar, reflected in a low Sorensen dissimilarity index of 0.127, suggesting widespread overlap in taxa. Nevertheless, due to sampling limitations, especially the reduced effort in 2020 due to the COVID-19 pandemic and fewer sites surveyed in 2024, our year to year comparisons might look a bit different than how they could be.

Citations:

de Matas, T.-J. (2016). Orbicella annularis – Boulder star coral. The Online Guide to the Animals of Trinidad and Tobago, Department of Life Sciences, University of the West Indies. Retrieved from https://sta.uwi.edu/fst/lifesciences/sites/default/files/lifesciences/documents/ogatt/Orbicella_annularis%20-%20Boulder%20Star%20Coral.pdf

National Park Service. (2024). SFCN coral reef monitoring data package, version 2 (1999–2024): CDR disease, species list, and video transects. Data.gov. . Retrieved from https://catalog.data.gov/dataset/sfcn-coral-reef-monitoring-data-package-version-2-1999-2024-cdr-disease-species-list-and-v