Rows: 85 Columns: 21
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Species
dbl (20): Yosemite Valley, Big Sur Coast, Mojave Desert, Sierra Foothills, Point Reyes, Lake Tahoe, Death Valley, Sa...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
kable(head(species_data, 10),
      caption = "First 10 Species in the California Wildlife Dataset",
      align = "l") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                font_size = 14,
                position = "left")
First 10 Species in the California Wildlife Dataset
cat("Number of species:", nrow(species_data), "\n")
Number of species: 85
cat("Number of sites:", ncol(species_data) -1, "\n")
Number of sites: 20
summary(species_data)
Species Yosemite Valley Big Sur Coast Mojave Desert Sierra Foothills Point Reyes
Length:85 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
Class :character 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 3.000 1st Qu.: 2.000
Mode :character Median : 4.000 Median : 4.000 Median : 4.000 Median : 5.000 Median : 5.000
Mean : 5.024 Mean : 4.788 Mean : 5.271 Mean : 5.306 Mean : 5.106
3rd Qu.: 7.000 3rd Qu.: 7.000 3rd Qu.: 7.000 3rd Qu.: 7.000 3rd Qu.: 8.000
Max. :14.000 Max. :14.000 Max. :15.000 Max. :15.000 Max. :13.000
Lake Tahoe Death Valley Santa Monica Mountains Channel Islands Central Valley Wetlands
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 3.000 1st Qu.: 3.000
Median : 5.000 Median : 5.000 Median : 4.000 Median : 5.000 Median : 5.000
Mean : 5.318 Mean : 5.388 Mean : 4.553 Mean : 5.141 Mean : 5.318
3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 7.000 3rd Qu.: 7.000 3rd Qu.: 8.000
Max. :15.000 Max. :13.000 Max. :12.000 Max. :17.000 Max. :13.000
San Gabriel Mountains Anza-Borrego Redwood National Park Salton Sea Lassen Volcanic Park Elkhorn Slough
Min. : 0.000 Min. : 0.0 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 2.000 1st Qu.: 3.0 1st Qu.: 2.000 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 2.000
Median : 4.000 Median : 5.0 Median : 5.000 Median : 6.000 Median : 4.000 Median : 4.000
Mean : 4.482 Mean : 4.8 Mean : 5.424 Mean : 5.541 Mean : 5.424 Mean : 5.012
3rd Qu.: 6.000 3rd Qu.: 7.0 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 7.000
Max. :12.000 Max. :13.0 Max. :14.000 Max. :13.000 Max. :14.000 Max. :14.000
Carrizo Plain Mount Shasta San Diego Chaparral Lake Berryessa
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 2.000 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 3.000
Median : 4.000 Median : 5.000 Median : 5.000 Median : 5.000
Mean : 5.353 Mean : 5.635 Mean : 5.294 Mean : 5.424
3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.000
Max. :16.000 Max. :16.000 Max. :13.000 Max. :14.000
Exploratory Visualizations
Histogram of Species Abundances
species_long_temp <- species_data %>%
  pivot_longer(cols = -Species, names_to = "Site", values_to = "Abundance")

ggplot(species_long_temp, aes(x = Abundance)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Species Abundances Across All Sites",
       x = "Abundance Count",
       y = "Frequency") +
  theme_minimal(base_size = 16) +
  theme(plot.title = element_text(face = "bold", size = 20),
        axis.title = element_text(size = 16, face = "bold"),
        axis.text = element_text(size = 14))
Total Abundance by Site
# Calculate total abundance per site
site_totals <- species_data %>%
  pivot_longer(cols = -Species, names_to = "Site", values_to = "Abundance") %>%
  group_by(Site) %>%
  summarize(Total_Abundance = sum(Abundance)) %>%
  arrange(desc(Total_Abundance))

# Bar plot of total abundance per site
ggplot(site_totals, aes(x = reorder(Site, Total_Abundance), y = Total_Abundance)) +
  geom_bar(stat = "identity", fill = "forestgreen", alpha = 0.7) +
  coord_flip() +
  labs(title = "Total Species Abundance by California Site",
       x = "Site",
       y = "Total Abundance Count") +
  theme_minimal(base_size = 16) +
  theme(plot.title = element_text(face = "bold", size = 20),
        axis.title = element_text(size = 16, face = "bold"),
        axis.text = element_text(size = 14),
        axis.text.y = element_text(size = 13))
Species Richness by Site
# Calculate species richness (number of species present) per site
species_richness <- species_data %>%
  pivot_longer(cols = -Species, names_to = "Site", values_to = "Abundance") %>%
  filter(Abundance > 0) %>%
  group_by(Site) %>%
  summarize(Species_Richness = n()) %>%
  arrange(desc(Species_Richness))

# Bar plot of species richness per site
ggplot(species_richness, aes(x = reorder(Site, Species_Richness), y = Species_Richness)) +
  geom_bar(stat = "identity", fill = "darkorange", alpha = 0.7) +
  coord_flip() +
  labs(title = "Species Richness by California Site",
       subtitle = "Number of Species Recorded at Each Location",
       x = "Site",
       y = "Number of Species") +
  theme_minimal(base_size = 16) +
  theme(plot.title = element_text(face = "bold", size = 20),
        plot.subtitle = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        axis.text = element_text(size = 14),
        axis.text.y = element_text(size = 13))
Species Richness vs Total Abundance
# Create a comparison plot
comparison_data <- site_totals %>%
  left_join(species_richness, by = "Site")

ggplot(comparison_data, aes(x = Species_Richness, y = Total_Abundance)) +
  geom_point(size = 5, color = "darkblue", alpha = 0.7) +
  geom_text(aes(label = Site), hjust = -0.1, vjust = 0.5, size = 4.5, check_overlap = TRUE) +
  labs(title = "Species Richness vs Total Abundance",
       subtitle = "Relationship between diversity and abundance across California sites",
       x = "Species Richness (Number of Species)",
       y = "Total Abundance") +
  theme_minimal(base_size = 16) +
  theme(plot.title = element_text(face = "bold", size = 20),
        plot.subtitle = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        axis.text = element_text(size = 14))
Data Description
This dataset contains abundance records for 85 species collected from 20 sites across California. Each row corresponds to a species and each column represents a sampling location. The sites span a wide range of ecosystems, from coastal regions like Big Sur, Point Reyes, and the Channel Islands to desert environments like Death Valley, the Mojave Desert, and Anza-Borrego. High-elevation areas like Yosemite Valley, Mount Shasta, and Lassen Volcanic Park contrast with wetlands like the Salton Sea, Central Valley Wetlands, and Elkhorn Slough.

The species represented at these sites are equally diverse. The full dataset spans California wildlife including birds, reptiles, amphibians, and fish. However, the first 10 species (the subset previewed in the table above) are all birds: the California Quail, American Kestrel, Western Bluebird, Acorn Woodpecker, Red-tailed Hawk, Great Egret, Snowy Plover, Peregrine Falcon, Northern Flicker, and Western Meadowlark.

Looking for trends within this dataset, I noticed mainly that species abundances vary widely across sites. Some species, such as the Western Bluebird and Snowy Plover, show relatively high counts at multiple sites, while others have patchier distributions. That said, no species is consistently abundant across all sites, and a few sites show many low or zero values, potentially indicating unsuitable habitat or low detectability.

The data come from a combination of wildlife survey methods, including direct observation, tracking, tagging, camera traps, and other standardized techniques used to record how many individuals of each species are present at each site. The dataset is well structured but needs to be reformatted for analysis with the vegan package. Specifically, it needs to be converted from wide format to long format for some of the analyses and checked for missing values or other data quality issues that could affect diversity calculations.
Question 2: Clean and Wrangle Data
Create Long Format Dataset
species_long <- species_data %>%
  pivot_longer(cols = -Species,
               names_to = "Site",
               values_to = "Abundance") %>%
  filter(!is.na(Abundance))  # Remove any NA values

cat("Long format dimensions:", dim(species_long), "\n\n")
Long format dimensions: 1700 3
kable(head(species_long, 15),
      caption = "Long Format Data Structure (First 15 Rows)",
      align = "lrr",
      col.names = c("Species", "Site", "Abundance")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                font_size = 14,
                position = "left")
Long Format Data Structure (First 15 Rows)
Species           Site                     Abundance
California Quail  Yosemite Valley                  5
California Quail  Big Sur Coast                    0
California Quail  Mojave Desert                    4
California Quail  Sierra Foothills                 8
California Quail  Point Reyes                      9
California Quail  Lake Tahoe                       5
California Quail  Death Valley                     6
California Quail  Santa Monica Mountains           7
California Quail  Channel Islands                 10
California Quail  Central Valley Wetlands          8
California Quail  San Gabriel Mountains            1
California Quail  Anza-Borrego                     3
California Quail  Redwood National Park            4
California Quail  Salton Sea                       2
California Quail  Lassen Volcanic Park             2
Why this step: I converted the dataset to long format because it makes the analysis much easier. Instead of having each site as its own column, the long format gives one row per species-site combination, with separate columns for the site name and the abundance value. I used pivot_longer() to reshape the data. I also removed any NA values so the dataset is clean before moving on to the rest of the analysis.
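To make the reshaping concrete, here is a minimal sketch on a toy table. The object names toy_wide and toy_long are made up for illustration, and the counts are copied from the first two species and three sites of the preview tables in this report.

```r
library(tidyr)

# Toy wide table: one row per species, one column per site
# (toy_wide is a hypothetical name; values come from the preview tables)
toy_wide <- data.frame(
  Species = c("California Quail", "American Kestrel"),
  `Yosemite Valley` = c(5, 8),
  `Big Sur Coast` = c(0, 9),
  `Mojave Desert` = c(4, 0),
  check.names = FALSE
)

# Reshape to long format: one row per species-site combination
toy_long <- pivot_longer(toy_wide,
                         cols = -Species,
                         names_to = "Site",
                         values_to = "Abundance")

print(toy_long)
```

Two species across three sites give six rows, which is the same logic that turns 85 species by 20 sites into the 1700 rows reported above.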
kable(head(species_wide[, 1:8], 10),
      caption = "Wide Format Data Structure (First 10 Species, First 7 Sites)",
      align = "l") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE,
                font_size = 13,
                position = "left") %>%
  scroll_box(width = "100%")
Wide Format Data Structure (First 10 Species, First 7 Sites)
Species             Yosemite Valley  Big Sur Coast  Mojave Desert  Sierra Foothills  Point Reyes  Lake Tahoe  Death Valley
California Quail    5                0              4              8                 9            5           6
American Kestrel    8                9              0              9                 5            5           2
Western Bluebird    6                2              4              7                 2            11          8
Acorn Woodpecker    3                11             4              6                 3            2           8
Red-tailed Hawk     4                3              7              0                 2            3           4
Great Egret         3                4              2              12                9            3           0
Snowy Plover        2                10             8              10                5            7           4
Peregrine Falcon    7                12             3              1                 1            5           7
Northern Flicker    3                8              12             7                 6            0           11
Western Meadowlark  3                4              2              7                 5            3           0
Why this step: I kept a wide version of the dataset because it preserves the original layout while making sure everything is clean and consistent. In this format, each species stays as a row and each site is a column. I used values_fill = 0 to replace any missing values with zeros (since a blank cell usually means the species wasn’t observed).
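The code that builds species_wide is not shown in this report, so the following is only a sketch of how values_fill = 0 behaves, assuming the wide table was produced with tidyr's pivot_wider(). The toy data and object names are hypothetical.

```r
library(tidyr)

# Toy long table with one species-site combination deliberately missing
# (American Kestrel has no Mojave Desert row)
toy_long <- data.frame(
  Species = c("California Quail", "California Quail", "American Kestrel"),
  Site = c("Yosemite Valley", "Mojave Desert", "Yosemite Valley"),
  Abundance = c(5, 4, 8)
)

# Without values_fill the missing cell would be NA; with it, the cell becomes 0
toy_wide <- pivot_wider(toy_long,
                        names_from = Site,
                        values_from = Abundance,
                        values_fill = 0)

print(toy_wide)
```

Filling with zero is a modeling choice: it treats "not recorded" as "not present", which is reasonable for survey counts but would be wrong if a blank meant the site was never surveyed.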
Community Data Matrix (First 10 Sites, First 8 Species)
Site                     California Quail  American Kestrel  Western Bluebird  Acorn Woodpecker  Red-tailed Hawk  Great Egret  Snowy Plover  Peregrine Falcon
Yosemite Valley          5                 8                 6                 3                 4                3            2             7
Big Sur Coast            0                 9                 2                 11                3                4            10            12
Mojave Desert            4                 0                 4                 4                 7                2            8             3
Sierra Foothills         8                 9                 7                 6                 0                12           10            1
Point Reyes              9                 5                 2                 3                 2                9            5             1
Lake Tahoe               5                 5                 11                2                 3                3            7             5
Death Valley             6                 2                 8                 8                 4                0            4             7
Santa Monica Mountains   7                 3                 4                 2                 2                1            3             3
Channel Islands          10                2                 0                 3                 12               8            9             4
Central Valley Wetlands  8                 7                 9                 5                 10               0            11            1
Why this step: I created the community data matrix because it’s the format the vegan package needs for running diversity analyses. The package expects sites as rows and species as columns, which is the transpose of how the dataset originally started. In this matrix, each cell shows the abundance of a given species at a given site.
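The code that creates community_matrix is not shown here, so this is one possible way to do the transpose in base R, using a toy table with hypothetical names and counts copied from the preview tables.

```r
# Toy wide table: species as rows, sites as columns
toy_wide <- data.frame(
  Species = c("California Quail", "American Kestrel"),
  `Yosemite Valley` = c(5, 8),
  `Big Sur Coast` = c(0, 9),
  `Mojave Desert` = c(4, 0),
  check.names = FALSE
)

# Drop the Species column, convert to a numeric matrix, and transpose
# so that sites become rows and species become columns
toy_community <- t(as.matrix(toy_wide[, -1]))
colnames(toy_community) <- toy_wide$Species

print(toy_community)
```

Keeping site names as row names (rather than as a column) matters because the matrix handed to the diversity functions should contain only numeric abundances.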
Question 3: Calculate Diversity Metrics for 3 Sites
kable(richness_selected,
      caption = "Species Richness for Three Selected California Sites",
      align = "lr",
      col.names = c("Site", "Species Richness")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 16,
                position = "left") %>%
  column_spec(1, bold = TRUE, width = "15em") %>%
  column_spec(2, width = "10em")
Species Richness for Three Selected California Sites
Site             Species Richness
Lake Tahoe                     79
Death Valley                   78
Channel Islands                80
Shannon Diversity Index
shannon_selected <- data.frame(
  Site = selected_sites,
  Shannon_Index = sapply(selected_sites, function(site) {
    diversity(community_matrix[site, ], index = "shannon")
  })
)

cat("Shannon Diversity Index (H') for Selected Sites:\n\n")
Shannon Diversity Index (H') for Selected Sites:
kable(shannon_selected,
      caption = "Shannon Diversity Index for Three Selected California Sites",
      align = "lr",
      col.names = c("Site", "Shannon Index (H')"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 16,
                position = "left") %>%
  column_spec(1, bold = TRUE, width = "15em") %>%
  column_spec(2, width = "10em")
Shannon Diversity Index for Three Selected California Sites
Site             Shannon Index (H')
Lake Tahoe                   4.1765
Death Valley                 4.1851
Channel Islands              4.2261
Simpson’s Diversity Index
simpson_selected <- data.frame(
  Site = selected_sites,
  Simpson_Index = sapply(selected_sites, function(site) {
    diversity(community_matrix[site, ], index = "simpson")
  })
)

cat("Simpson's Diversity Index (D) for Selected Sites:\n\n")
Simpson's Diversity Index (D) for Selected Sites:
kable(simpson_selected,
      caption = "Simpson's Diversity Index (Gini-Simpson) for Three Selected California Sites",
      align = "lr",
      col.names = c("Site", "Simpson Index (1-D)"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 16,
                position = "left") %>%
  column_spec(1, bold = TRUE, width = "15em") %>%
  column_spec(2, width = "10em")
Simpson's Diversity Index (Gini-Simpson) for Three Selected California Sites
Site             Simpson Index (1-D)
Lake Tahoe                    0.9826
Death Valley                  0.9831
Channel Islands               0.9836
simpson_original <- data.frame(
  Site = selected_sites,
  Simpson_Dominance = sapply(selected_sites, function(site) {
    diversity(community_matrix[site, ], index = "invsimpson")
  })
)

cat("\n\nInverse Simpson's Index (1/D) for Selected Sites:\n\n")
Inverse Simpson's Index (1/D) for Selected Sites:
kable(simpson_original,
      caption = "Inverse Simpson's Index for Three Selected California Sites",
      align = "lr",
      col.names = c("Site", "Inverse Simpson (1/D)"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 16,
                position = "left") %>%
  column_spec(1, bold = TRUE, width = "15em") %>%
  column_spec(2, width = "10em")
Inverse Simpson's Index for Three Selected California Sites
Site             Inverse Simpson (1/D)
Lake Tahoe                     57.6153
Death Valley                   59.1885
Channel Islands                60.9930
Combined Comparison Table
diversity_comparison <- richness_selected %>%
  left_join(shannon_selected, by = "Site") %>%
  left_join(simpson_selected, by = "Site")

cat("\nCombined Diversity Metrics for All Three Sites:\n\n")
Combined Diversity Metrics for All Three Sites:
kable(diversity_comparison,
      caption = "Combined Diversity Metrics: Comparison Across Three Sites",
      align = "lrrr",
      col.names = c("Site", "Species Richness", "Shannon Index (H')", "Simpson Index (1-D)"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "bordered"),
                full_width = FALSE,
                font_size = 16,
                position = "center") %>%
  column_spec(1, bold = TRUE, width = "15em") %>%
  column_spec(2:4, width = "10em") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3498db")
Combined Diversity Metrics: Comparison Across Three Sites
Site             Species Richness  Shannon Index (H')  Simpson Index (1-D)
Lake Tahoe                     79              4.1765               0.9826
Death Valley                   78              4.1851               0.9831
Channel Islands                80              4.2261               0.9836
Visualization: Comparison of Metrics
diversity_long_comparison <- diversity_comparison %>%
  pivot_longer(cols = -Site,
               names_to = "Metric",
               values_to = "Value")

ggplot(diversity_long_comparison, aes(x = Site, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  facet_wrap(~Metric, scales = "free_y") +
  labs(title = "Comparison of Diversity Metrics Across Three California Sites",
       subtitle = "Species Richness, Shannon Index, and Simpson's Index",
       x = "Site",
       y = "Index Value") +
  theme_minimal(base_size = 16) +
  theme(plot.title = element_text(face = "bold", size = 20),
        plot.subtitle = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        axis.text = element_text(size = 13),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        strip.text = element_text(size = 15, face = "bold"),
        legend.position = "none")
Observations and Interpretation
The diversity metrics calculated for the three California sites show slight differences in species diversity and abundance distribution. Species richness, which simply counts the number of species present, was highest at Channel Islands (80 species), followed by Lake Tahoe (79) and Death Valley (78). The Shannon Diversity index (H′), which accounts for both the number of species and how evenly individuals are distributed among them, was also highest at Channel Islands (4.2261), indicating not only slightly higher richness but fairly even abundances. Death Valley (4.1851) and Lake Tahoe (4.1765) were slightly lower. The Simpson Diversity index (1 – D) measures the probability that two randomly chosen individuals belong to different species, emphasizing the evenness of abundances. All sites had values near 1 (Channel Islands: 0.9836; Death Valley: 0.9831; Lake Tahoe: 0.9826), reflecting very even species distributions. Overall, all three metrics consistently rank Channel Islands as the most diverse site and Lake Tahoe as the least, though differences are small, indicating that all sites support similarly diverse communities.
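To show what these indices measure, here is a minimal base R sketch of the standard formulas: H' = -sum(p * ln p), Gini-Simpson = 1 - sum(p^2), and inverse Simpson = 1 / sum(p^2), where p is each species' share of all individuals at a site. The helper names below are made up for illustration; they are meant to mirror the usual textbook definitions, not to replace the vegan calls used above.

```r
# Hypothetical helpers implementing the textbook formulas
shannon_index <- function(counts) {
  p <- counts[counts > 0] / sum(counts)  # drop zeros so log(0) never occurs
  -sum(p * log(p))
}

gini_simpson_index <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

inverse_simpson_index <- function(counts) {
  p <- counts / sum(counts)
  1 / sum(p^2)
}

# A perfectly even community of 4 species: every index hits its maximum
even_counts <- c(10, 10, 10, 10)
shannon_index(even_counts)          # equals log(4)
gini_simpson_index(even_counts)     # equals 1 - 1/4
inverse_simpson_index(even_counts)  # equals 4

# First 8 species at Yosemite Valley, copied from the community matrix preview
yosemite_counts <- c(5, 8, 6, 3, 4, 3, 2, 7)
shannon_index(yosemite_counts)
```

For S equally abundant species the Shannon index equals ln(S), so values just above 4.1 for sites with roughly 80 species (ln(80) is about 4.38) are consistent with the fairly even communities described above.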
Yosemite Valley Big Sur Coast Mojave Desert Sierra Foothills Point Reyes Lake Tahoe
Yosemite Valley 0.00000000 0.1428571 0.12941176 0.1309524 0.16470588 0.14117647
Big Sur Coast 0.14285714 0.0000000 0.15294118 0.1764706 0.16666667 0.16470588
Mojave Desert 0.12941176 0.1529412 0.00000000 0.1411765 0.10843373 0.12941176
Sierra Foothills 0.13095238 0.1764706 0.14117647 0.0000000 0.15476190 0.13095238
Point Reyes 0.16470588 0.1666667 0.10843373 0.1547619 0.00000000 0.16470588
Lake Tahoe 0.14117647 0.1647059 0.12941176 0.1309524 0.16470588 0.00000000
Death Valley 0.15294118 0.1764706 0.14117647 0.1428571 0.17647059 0.15294118
Santa Monica Mountains 0.11764706 0.1411765 0.10588235 0.1071429 0.14117647 0.11764706
Channel Islands 0.10714286 0.1309524 0.11764706 0.1411765 0.15294118 0.12941176
Central Valley Wetlands 0.15476190 0.2000000 0.14285714 0.1666667 0.15662651 0.17647059
San Gabriel Mountains 0.14117647 0.1647059 0.10714286 0.1309524 0.12048193 0.14117647
Anza-Borrego 0.12941176 0.1529412 0.11764706 0.1411765 0.15294118 0.12941176
Redwood National Park 0.14117647 0.1647059 0.10714286 0.1529412 0.14285714 0.14117647
Salton Sea 0.11764706 0.1411765 0.10588235 0.1294118 0.14117647 0.09523810
Lassen Volcanic Park 0.09411765 0.1176471 0.08235294 0.1058824 0.11764706 0.09411765
Elkhorn Slough 0.12941176 0.1309524 0.11764706 0.1411765 0.13095238 0.12941176
Carrizo Plain 0.11764706 0.1411765 0.08333333 0.1071429 0.09638554 0.11764706
Mount Shasta 0.12941176 0.1529412 0.11764706 0.1411765 0.13095238 0.12941176
San Diego Chaparral 0.11764706 0.1190476 0.10588235 0.1294118 0.14117647 0.09523810
Lake Berryessa 0.10714286 0.1309524 0.11764706 0.1411765 0.13095238 0.10714286
Death Valley Santa Monica Mountains Channel Islands Central Valley Wetlands
Yosemite Valley 0.15294118 0.11764706 0.10714286 0.1547619
Big Sur Coast 0.17647059 0.14117647 0.13095238 0.2000000
Mojave Desert 0.14117647 0.10588235 0.11764706 0.1428571
Sierra Foothills 0.14285714 0.10714286 0.14117647 0.1666667
Point Reyes 0.17647059 0.14117647 0.15294118 0.1566265
Lake Tahoe 0.15294118 0.11764706 0.12941176 0.1764706
Death Valley 0.00000000 0.12941176 0.14117647 0.1445783
Santa Monica Mountains 0.12941176 0.00000000 0.08333333 0.1529412
Channel Islands 0.14117647 0.08333333 0.00000000 0.1647059
Central Valley Wetlands 0.14457831 0.15294118 0.16470588 0.0000000
San Gabriel Mountains 0.15294118 0.09523810 0.12941176 0.1764706
Anza-Borrego 0.14117647 0.10588235 0.11764706 0.1428571
Redwood National Park 0.10843373 0.11764706 0.12941176 0.1097561
Salton Sea 0.10714286 0.09411765 0.10588235 0.1529412
Lassen Volcanic Park 0.10588235 0.04761905 0.05952381 0.1294118
Elkhorn Slough 0.14117647 0.10588235 0.11764706 0.1647059
Carrizo Plain 0.08433735 0.09411765 0.10588235 0.1309524
Mount Shasta 0.11904762 0.10588235 0.11764706 0.1428571
San Diego Chaparral 0.12941176 0.09411765 0.08333333 0.1529412
Lake Berryessa 0.14117647 0.08333333 0.11764706 0.1647059
San Gabriel Mountains Anza-Borrego Redwood National Park Salton Sea Lassen Volcanic Park
Yosemite Valley 0.14117647 0.12941176 0.14117647 0.11764706 0.09411765
Big Sur Coast 0.16470588 0.15294118 0.16470588 0.14117647 0.11764706
Mojave Desert 0.10714286 0.11764706 0.10714286 0.10588235 0.08235294
Sierra Foothills 0.13095238 0.14117647 0.15294118 0.12941176 0.10588235
Point Reyes 0.12048193 0.15294118 0.14285714 0.14117647 0.11764706
Lake Tahoe 0.14117647 0.12941176 0.14117647 0.09523810 0.09411765
Death Valley 0.15294118 0.14117647 0.10843373 0.10714286 0.10588235
Santa Monica Mountains 0.09523810 0.10588235 0.11764706 0.09411765 0.04761905
Channel Islands 0.12941176 0.11764706 0.12941176 0.10588235 0.05952381
Central Valley Wetlands 0.17647059 0.14285714 0.10975610 0.15294118 0.12941176
San Gabriel Mountains 0.00000000 0.12941176 0.11904762 0.11764706 0.09411765
Anza-Borrego 0.12941176 0.00000000 0.12941176 0.10588235 0.08235294
Redwood National Park 0.11904762 0.12941176 0.00000000 0.09523810 0.09411765
Salton Sea 0.11764706 0.10588235 0.09523810 0.00000000 0.07058824
Lassen Volcanic Park 0.09411765 0.08235294 0.09411765 0.07058824 0.00000000
Elkhorn Slough 0.12941176 0.11764706 0.12941176 0.10588235 0.08235294
Carrizo Plain 0.09523810 0.10588235 0.09523810 0.07142857 0.07058824
Mount Shasta 0.12941176 0.11764706 0.12941176 0.10588235 0.05952381
San Diego Chaparral 0.09523810 0.10588235 0.11764706 0.07142857 0.07058824
Lake Berryessa 0.12941176 0.09523810 0.12941176 0.10588235 0.08235294
Elkhorn Slough Carrizo Plain Mount Shasta San Diego Chaparral Lake Berryessa
Yosemite Valley 0.12941176 0.11764706 0.12941176 0.11764706 0.10714286
Big Sur Coast 0.13095238 0.14117647 0.15294118 0.11904762 0.13095238
Mojave Desert 0.11764706 0.08333333 0.11764706 0.10588235 0.11764706
Sierra Foothills 0.14117647 0.10714286 0.14117647 0.12941176 0.14117647
Point Reyes 0.13095238 0.09638554 0.13095238 0.14117647 0.13095238
Lake Tahoe 0.12941176 0.11764706 0.12941176 0.09523810 0.10714286
Death Valley 0.14117647 0.08433735 0.11904762 0.12941176 0.14117647
Santa Monica Mountains 0.10588235 0.09411765 0.10588235 0.09411765 0.08333333
Channel Islands 0.11764706 0.10588235 0.11764706 0.08333333 0.11764706
Central Valley Wetlands 0.16470588 0.13095238 0.14285714 0.15294118 0.16470588
San Gabriel Mountains 0.12941176 0.09523810 0.12941176 0.09523810 0.12941176
Anza-Borrego 0.11764706 0.10588235 0.11764706 0.10588235 0.09523810
Redwood National Park 0.12941176 0.09523810 0.12941176 0.11764706 0.12941176
Salton Sea 0.10588235 0.07142857 0.10588235 0.07142857 0.10588235
Lassen Volcanic Park 0.08235294 0.07058824 0.05952381 0.07058824 0.08235294
Elkhorn Slough 0.00000000 0.10588235 0.11764706 0.10588235 0.09523810
Carrizo Plain 0.10588235 0.00000000 0.10588235 0.09411765 0.10588235
Mount Shasta 0.11764706 0.10588235 0.00000000 0.10588235 0.11764706
San Diego Chaparral 0.10588235 0.09411765 0.10588235 0.00000000 0.10588235
Lake Berryessa 0.09523810 0.10588235 0.11764706 0.10588235 0.00000000
NMDS with Bray-Curtis Distances
set.seed(123)  # For reproducibility
nmds_bray <- metaMDS(community_matrix, distance = "bray", k = 2, trymax = 100)
Wisconsin double standardization
Run 0 stress 0.281007
Run 1 stress 0.2923922
Run 2 stress 0.3105088
Run 3 stress 0.3218972
Run 4 stress 0.2941504
Run 5 stress 0.2990044
Run 6 stress 0.308303
Run 7 stress 0.3032368
Run 8 stress 0.2929312
Run 9 stress 0.3011717
Run 10 stress 0.2868759
Run 11 stress 0.3007704
Run 12 stress 0.2873156
Run 13 stress 0.283034
Run 14 stress 0.2912786
Run 15 stress 0.3027634
Run 16 stress 0.3125567
Run 17 stress 0.2868142
Run 18 stress 0.2980751
Run 19 stress 0.2768438
... New best solution
... Procrustes: rmse 0.1402713 max resid 0.3215578
Run 20 stress 0.2787651
Run 21 stress 0.2911912
Run 22 stress 0.3057632
Run 23 stress 0.2831251
Run 24 stress 0.2853592
Run 25 stress 0.3240949
Run 26 stress 0.291665
Run 27 stress 0.2967328
Run 28 stress 0.2937119
Run 29 stress 0.2803224
Run 30 stress 0.2934303
Run 31 stress 0.2779701
Run 32 stress 0.2892283
Run 33 stress 0.2831959
Run 34 stress 0.3007912
Run 35 stress 0.2881029
Run 36 stress 0.2959458
Run 37 stress 0.3012513
Run 38 stress 0.3028554
Run 39 stress 0.3053827
Run 40 stress 0.2982422
Run 41 stress 0.3041563
Run 42 stress 0.3112826
Run 43 stress 0.2982777
Run 44 stress 0.2773268
... Procrustes: rmse 0.04924779 max resid 0.1441582
Run 45 stress 0.2823682
Run 46 stress 0.2844833
Run 47 stress 0.2802228
Run 48 stress 0.2850497
Run 49 stress 0.3073537
Run 50 stress 0.3130785
Run 51 stress 0.2742365
... New best solution
... Procrustes: rmse 0.1419059 max resid 0.3145031
Run 52 stress 0.298485
Run 53 stress 0.2823502
Run 54 stress 0.3042179
Run 55 stress 0.2884774
Run 56 stress 0.2844712
Run 57 stress 0.3074574
Run 58 stress 0.2862099
Run 59 stress 0.2875857
Run 60 stress 0.2842428
Run 61 stress 0.2749927
Run 62 stress 0.2840932
Run 63 stress 0.292112
Run 64 stress 0.2800357
Run 65 stress 0.2859932
Run 66 stress 0.3105873
Run 67 stress 0.2895685
Run 68 stress 0.3063066
Run 69 stress 0.2900662
Run 70 stress 0.2917791
Run 71 stress 0.2908405
Run 72 stress 0.2939239
Run 73 stress 0.2982702
Run 74 stress 0.2810184
Run 75 stress 0.3155024
Run 76 stress 0.2874488
Run 77 stress 0.2924123
Run 78 stress 0.2987606
Run 79 stress 0.3024156
Run 80 stress 0.2888699
Run 81 stress 0.2918204
Run 82 stress 0.2990798
Run 83 stress 0.2986676
Run 84 stress 0.2867671
Run 85 stress 0.2914776
Run 86 stress 0.2912989
Run 87 stress 0.2990439
Run 88 stress 0.2828081
Run 89 stress 0.2912644
Run 90 stress 0.2923718
Run 91 stress 0.2936975
Run 92 stress 0.295154
Run 93 stress 0.2912083
Run 94 stress 0.2992323
Run 95 stress 0.2998503
Run 96 stress 0.2795509
Run 97 stress 0.2916514
Run 98 stress 0.290269
Run 99 stress 0.2832592
Run 100 stress 0.2866166
*** Best solution was not repeated -- monoMDS stopping criteria:
5: no. of iterations >= maxit
95: stress ratio > sratmax
set.seed(123)  # For reproducibility
nmds_jaccard <- metaMDS(community_matrix, distance = "jaccard", binary = TRUE, k = 2, trymax = 100)
Wisconsin double standardization
Run 0 stress 0.1809183
Run 1 stress 0.1779049
... New best solution
... Procrustes: rmse 0.06125251 max resid 0.1754351
Run 2 stress 0.1865218
Run 3 stress 0.2266042
Run 4 stress 0.180501
Run 5 stress 0.192617
Run 6 stress 0.1866004
Run 7 stress 0.1997077
Run 8 stress 0.1931261
Run 9 stress 0.1904562
Run 10 stress 0.2037118
Run 11 stress 0.1776164
... New best solution
... Procrustes: rmse 0.04339829 max resid 0.1781998
Run 12 stress 0.2045054
Run 13 stress 0.2467821
Run 14 stress 0.2351829
Run 15 stress 0.2131532
Run 16 stress 0.1887383
Run 17 stress 0.1881437
Run 18 stress 0.1773501
... New best solution
... Procrustes: rmse 0.02562705 max resid 0.0826354
Run 19 stress 0.1811889
Run 20 stress 0.1903775
Run 21 stress 0.1903773
Run 22 stress 0.18782
Run 23 stress 0.2045201
Run 24 stress 0.1870151
Run 25 stress 0.1773502
... Procrustes: rmse 9.402431e-05 max resid 0.0002231769
... Similar to previous best
*** Best solution repeated 1 times
These distance matrices and NMDS ordinations reveal important patterns in species composition across California sites.

First, Euclidean distance measures the straight-line difference between sites based on species abundances, so overall abundance differences make sites appear more or less similar. In the matrix, larger values indicate that two sites differ more in abundance across all species, while smaller values indicate more similar abundances. Because Euclidean distance is sensitive to absolute abundance, it often produces larger and more variable values than Bray-Curtis or Jaccard distances. And because it emphasizes absolute abundance rather than ecological composition, it is not typically visualized with NMDS plots.

Bray-Curtis dissimilarity, on the other hand, scales abundance differences by the combined abundance of the two sites and so compares both species identity and relative abundances. Values near 0 indicate very similar communities, while values near 1 indicate very different ones. The Bray-Curtis NMDS shows clear clustering by habitat type. For example, coastal habitats like Point Reyes, Big Sur Coast, and the Channel Islands occupy the left side of the plot, showing they share more species with each other than with, say, desert systems. The stress value for this NMDS is 0.2742, which is above the commonly cited rule-of-thumb threshold of 0.2, and the run log shows that the best solution was not repeated within 100 tries. This means the broad patterns may still be informative, but fine-scale distances between sites should be interpreted cautiously.

Finally, Jaccard distance only considers species presence or absence, ignoring abundance. The Jaccard NMDS converged on a repeated best solution with a lower stress of about 0.177. It shows slightly different clusters than Bray-Curtis because it treats all species equally, regardless of how many individuals are present. Clustering in the Jaccard NMDS is more compact than in the Bray-Curtis NMDS. For example, both analyses show clustering of deserts like the Mojave and Death Valley, but in Bray-Curtis they are very far apart from other sites because abundance exaggerates separation. In Jaccard, these deserts remain distinct but sit closer to the center.
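The difference between the two measures is easiest to see on a single pair of sites. The sketch below computes both by hand in base R using the standard formulas: Bray-Curtis = sum(|x - y|) / sum(x + y), and binary Jaccard = 1 - (shared species / species present at either site). The function names are hypothetical, and the counts are the first 8 species at Yosemite Valley and Big Sur Coast from the community matrix preview. Note that metaMDS() standardizes the data before computing distances (the log above reports a Wisconsin double standardization), so these raw values are illustrative only and will not match the ordination inputs.

```r
# Hypothetical helper: Bray-Curtis dissimilarity between two abundance vectors
bray_curtis <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}

# Hypothetical helper: Jaccard distance on presence/absence only
jaccard_binary <- function(x, y) {
  shared <- sum(x > 0 & y > 0)  # species present at both sites
  either <- sum(x > 0 | y > 0)  # species present at one or both sites
  1 - shared / either
}

# First 8 species, copied from the community matrix preview
yosemite <- c(5, 8, 6, 3, 4, 3, 2, 7)
big_sur  <- c(0, 9, 2, 11, 3, 4, 10, 12)

bray_curtis(yosemite, big_sur)     # 33 / 89, driven by count differences
jaccard_binary(yosemite, big_sur)  # 1 - 7/8, driven by one absent species
```

In this toy pair the sites share 7 of 8 species, so Jaccard sees them as similar, while Bray-Curtis is larger because the counts of the shared species differ substantially.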
Question 5: Comprehensive Diversity Indices for All Sites
Calculate All Diversity Indices
richness_all <- specnumber(community_matrix)
shannon_all <- diversity(community_matrix, index = "shannon")
gini_simpson_all <- diversity(community_matrix, index = "simpson")
inverse_simpson <- diversity(community_matrix, index = "invsimpson")
simpson_evenness <- inverse_simpson / richness_all

diversity_all <- data.frame(
  Site = rownames(community_matrix),
  Shannon_Index = shannon_all,
  Gini_Simpson_Index = gini_simpson_all,
  Simpson_Evenness = simpson_evenness
)

cat("Complete Diversity Metrics for All 20 California Sites:\n\n")
Complete Diversity Metrics for All 20 California Sites:
kable(diversity_all,
      caption = "Comprehensive Diversity Analysis: All 20 California Sites",
      align = "lrrr",
      col.names = c("Site", "Shannon Index (H')", "Gini-Simpson Index (1-D)", "Simpson's Evenness"),
      digits = 4,
      row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed"),
                full_width = TRUE,
                font_size = 14,
                position = "center") %>%
  column_spec(1, bold = TRUE, width = "12em") %>%
  column_spec(2:4, width = "8em") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#2c3e50", font_size = 15) %>%
  scroll_box(width = "100%", height = "500px")
Comprehensive Diversity Analysis: All 20 California Sites
Site                     Shannon Index (H')  Gini-Simpson Index (1-D)  Simpson's Evenness
Yosemite Valley                      4.1599                    0.9821              0.7069
Big Sur Coast                        4.1405                    0.9816              0.7070
Mojave Desert                        4.1693                    0.9821              0.6988
Sierra Foothills                     4.1769                    0.9827              0.7419
Point Reyes                          4.1560                    0.9825              0.7426
Lake Tahoe                           4.1765                    0.9826              0.7293
Death Valley                         4.1851                    0.9831              0.7588
Santa Monica Mountains               4.2122                    0.9831              0.7288
Channel Islands                      4.2261                    0.9836              0.7624
Central Valley Wetlands              4.1821                    0.9832              0.7810
San Gabriel Mountains                4.1878                    0.9829              0.7394
Anza-Borrego                         4.2302                    0.9838              0.7712
Redwood National Park                4.2027                    0.9832              0.7546
Salton Sea                           4.2434                    0.9842              0.7809
Lassen Volcanic Park                 4.2045                    0.9829              0.7029
Elkhorn Slough                       4.1744                    0.9823              0.7071
Carrizo Plain                        4.1603                    0.9818              0.6785
Mount Shasta                         4.2119                    0.9834              0.7534
San Diego Chaparral                  4.2050                    0.9832              0.7331
Lake Berryessa                       4.1848                    0.9827              0.7229
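As a quick consistency check on the Simpson's Evenness column, the short sketch below redoes the arithmetic for the three sites whose inverse Simpson index and richness were reported in Question 3. It assumes, as the calculation code indicates, that evenness is the inverse Simpson index divided by species richness; the vector names are hypothetical.

```r
# Values copied from the Question 3 tables for the three selected sites
inv_simpson_q3 <- c("Lake Tahoe" = 57.6153,
                    "Death Valley" = 59.1885,
                    "Channel Islands" = 60.9930)
richness_q3 <- c("Lake Tahoe" = 79,
                 "Death Valley" = 78,
                 "Channel Islands" = 80)

# Simpson's evenness: effective number of species relative to observed richness
evenness_q3 <- inv_simpson_q3 / richness_q3
round(evenness_q3, 4)
```

The results round to 0.7293, 0.7588, and 0.7624, matching the table. An evenness of 1 would mean every species is equally abundant, so values around 0.7 to 0.78 indicate fairly even but not perfectly even communities.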
Summary Statistics
cat("\n=== Summary Statistics for Diversity Indices ===\n")
=== Summary Statistics for Diversity Indices ===
cat("\nShannon Index:\n")
Shannon Index:
print(summary(diversity_all$Shannon_Index))
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.140 4.173 4.185 4.189 4.207 4.243
cat("\nGini-Simpson Index:\n")
Gini-Simpson Index:
print(summary(diversity_all$Gini_Simpson_Index))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9816 0.9825 0.9829 0.9828 0.9832 0.9842
cat("\nSimpson's Evenness:\n")
Simpson's Evenness:
print(summary(diversity_all$Simpson_Evenness))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.6785 0.7071 0.7363 0.7351 0.7557 0.7810
Visualization: Multi-Panel Comparison
diversity_all_ordered <- diversity_all %>%
  arrange(desc(Shannon_Index))

p1 <- ggplot(diversity_all_ordered, aes(x = reorder(Site, Shannon_Index), y = Shannon_Index)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7) +
  coord_flip() +
  labs(title = "Shannon Diversity Index", x = "Site", y = "Shannon Index (H')") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18),
    axis.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    axis.text.y = element_text(size = 12)
  )

p2 <- ggplot(diversity_all_ordered, aes(x = reorder(Site, Shannon_Index), y = Gini_Simpson_Index)) +
  geom_bar(stat = "identity", fill = "darkgreen", alpha = 0.7) +
  coord_flip() +
  labs(title = "Gini-Simpson Index", x = "Site", y = "Gini-Simpson Index (1-D)") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18),
    axis.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    axis.text.y = element_text(size = 12)
  )

p3 <- ggplot(diversity_all_ordered, aes(x = reorder(Site, Shannon_Index), y = Simpson_Evenness)) +
  geom_bar(stat = "identity", fill = "darkorange", alpha = 0.7) +
  coord_flip() +
  labs(title = "Simpson's Evenness", x = "Site", y = "Simpson's Evenness (E)") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18),
    axis.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    axis.text.y = element_text(size = 12)
  )

p1 / p2 / p3 +
  plot_annotation(
    title = "Diversity Indices Across All California Sites",
    theme = theme(plot.title = element_text(face = "bold", size = 22))
  )
Visualization: Relationships Between Metrics
ggplot(diversity_all, aes(x = Shannon_Index, y = Gini_Simpson_Index)) +
  geom_point(size = 5, color = "darkblue", alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed", linewidth = 1.2) +
  geom_text(aes(label = Site), hjust = -0.1, vjust = 0.5, size = 4.5, check_overlap = TRUE) +
  labs(
    title = "Shannon Index vs Gini-Simpson Index",
    subtitle = "Relationship between two diversity metrics",
    x = "Shannon Index (H')",
    y = "Gini-Simpson Index (1-D)"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(face = "bold", size = 20),
    plot.subtitle = element_text(size = 14),
    axis.title = element_text(size = 16, face = "bold"),
    axis.text = element_text(size = 14)
  )
`geom_smooth()` using formula = 'y ~ x'
ggplot(diversity_all, aes(x = Shannon_Index, y = Simpson_Evenness)) +
  geom_point(size = 5, color = "darkgreen", alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed", linewidth = 1.2) +
  geom_text(aes(label = Site), hjust = -0.1, vjust = 0.5, size = 4.5, check_overlap = TRUE) +
  labs(
    title = "Shannon Index vs Simpson's Evenness",
    subtitle = "Does higher diversity correlate with higher evenness?",
    x = "Shannon Index (H')",
    y = "Simpson's Evenness (E)"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(face = "bold", size = 20),
    plot.subtitle = element_text(size = 14),
    axis.title = element_text(size = 16, face = "bold"),
    axis.text = element_text(size = 14)
  )
`geom_smooth()` using formula = 'y ~ x'
cor_shannon_gini <- cor(diversity_all$Shannon_Index, diversity_all$Gini_Simpson_Index)
cor_shannon_evenness <- cor(diversity_all$Shannon_Index, diversity_all$Simpson_Evenness)
cat("\nCorrelation between Shannon and Gini-Simpson:", round(cor_shannon_gini, 3), "\n")
Correlation between Shannon and Gini-Simpson: 0.932
cat("Correlation between Shannon and Evenness:", round(cor_shannon_evenness, 3), "\n")
Correlation between Shannon and Evenness: 0.629
Observations
Across all 20 California sites, the Shannon Index, Gini-Simpson Index, and Simpson’s Evenness describe community structure in ways that species counts alone cannot. This analysis extends Question 3, which focused on three specific sites, to the dataset as a whole.

The Gini-Simpson Index measures the probability that two randomly chosen individuals belong to different species, so values near 1 indicate high diversity. Every site scored high and within a narrow range (0.9816 to 0.9842), meaning no single species dominates at any site. Simpson’s Evenness shows how evenly individuals are distributed among species, with higher values indicating more balanced communities. Evenness varies more than the other two metrics (0.6785 to 0.7810): Central Valley Wetlands and Salton Sea are the most even, followed by Anza-Borrego and Channel Islands, while Carrizo Plain is the least even. Finally, the Shannon Index combines richness and evenness, increasing when a site has many species with relatively equal abundances. It ranges from 4.1405 at Big Sur Coast to 4.2434 at Salton Sea. Shannon and Gini-Simpson values are strongly correlated (r = 0.932), while Shannon and evenness are only moderately correlated (r = 0.629).

Taken together, these metrics show that California’s ecosystems are very diverse and generally even, with small differences between sites that may reflect variation in habitat and environmental conditions.
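To make the three definitions above concrete, here is a minimal base-R sketch that computes them by hand for one site. The abundance vector `counts` and its species names are made up for illustration and are not taken from the dataset. The formulas are meant to mirror what `specnumber()` and `diversity()` report (natural-log Shannon, 1 - D for "simpson", 1/D for "invsimpson").

```r
# Hypothetical counts for one site (illustrative only, not from the dataset)
counts <- c(sp_a = 10, sp_b = 10, sp_c = 5, sp_d = 0)

# Species with zero individuals do not count toward richness or the indices
present <- counts[counts > 0]
p <- present / sum(present)          # relative abundance of each species

richness        <- length(present)   # S: number of species present
shannon         <- -sum(p * log(p))  # H' = -sum(p_i * ln(p_i))
gini_simpson    <- 1 - sum(p^2)      # 1 - D: chance two random individuals differ
inverse_simpson <- 1 / sum(p^2)      # 1 / D: "effective" number of species
simpson_even    <- inverse_simpson / richness  # E = (1 / D) / S

round(c(
  richness = richness,
  shannon = shannon,
  gini_simpson = gini_simpson,
  inverse_simpson = inverse_simpson,
  simpson_evenness = simpson_even
), 4)
```

For this toy site the relative abundances are 0.4, 0.4, and 0.2, so the Gini-Simpson Index is 1 - 0.36 = 0.64 and Simpson’s Evenness is about 0.93. A perfectly even community would have an evenness of exactly 1, which is why the site values between 0.68 and 0.78 count as "fairly even" rather than perfectly even.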
Question 6: Overall Conclusions and Most Applicable Metrics
Based on these analyses of the California Species Abundance dataset, California’s ecosystems are, in general, highly diverse. Most sites support many species that are fairly evenly distributed, though dominance and abundance patterns differ somewhat by habitat type. Within this overall pattern of high diversity, a reasonable hypothesis is that ecosystems with more extreme or variable environmental conditions will have slightly lower diversity, while more stable habitats will support more balanced and diverse communities. Desert, coastal, and montane sites each show subtle differences, reflecting how environmental conditions shape community structure. I found these patterns easiest to understand through the NMDS plots, which show sites clustered by ecological and geographic patterns. The Bray-Curtis and Jaccard distances in particular were very effective for comparing sites by composition and abundance. Among the diversity metrics, the Shannon Index was particularly useful in this report because it captures both richness and evenness. Simpson diversity complemented it by highlighting patterns of species dominance.
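As a rough illustration of why Bray-Curtis and Jaccard distances answer slightly different questions, the base-R sketch below computes both by hand for two made-up sites. The vectors `site_a` and `site_b` are hypothetical, and the Jaccard version shown is the presence/absence form, which is an assumption here and may differ from the settings used for the NMDS plots in this report.

```r
# Two hypothetical sites with counts for the same four species (illustrative only)
site_a <- c(sp_a = 6, sp_b = 4, sp_c = 0, sp_d = 2)
site_b <- c(sp_a = 3, sp_b = 4, sp_c = 5, sp_d = 0)

# Bray-Curtis dissimilarity uses abundances:
# sum of absolute differences divided by the total count across both sites
bray_curtis <- sum(abs(site_a - site_b)) / sum(site_a + site_b)

# Jaccard dissimilarity (presence/absence form) ignores abundances:
# 1 - (species shared by both sites) / (species found at either site)
present_a <- site_a > 0
present_b <- site_b > 0
jaccard <- 1 - sum(present_a & present_b) / sum(present_a | present_b)

round(c(bray_curtis = bray_curtis, jaccard = jaccard), 4)
```

Here the two sites share two of the four species, so the Jaccard dissimilarity is 0.5, while Bray-Curtis is 10/24, or about 0.42, because it also accounts for how many individuals of each species were counted.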