Introduction

In the study of biodiversity, or taxonomy, patterns of speciation can be of great interest and can reveal answers to questions like ‘are there more fish because there is more algae?’ or ‘Is the population of X increasing due to no presence of species Y?’. These questions can be effectively answered through the collection of species data in an area and using statistics to find patterns between those various species. Biodiversity relies heavily on the health and stability of a given environment. In the United States, National Parks are meant to be a reserve of the natural landscape, and should be representative of the most common species in that given area. Because National Parks contain high levels of biodiversity, this data can also be used to determine where species are present, changing patterns of population growth, and is crucial to tracking species of concern and determining where conservation efforts may be needed.

Biodiversity of 56 US National Parks is provided from two datasets from Kaggle. The first dataset contains the biodiversity data; 119,248 observations of 14 variables including levels of taxonomy and conservation status. The second data set includes specific data for the 56 parks, including longitude and latitude. For the purposes of analysis, these two data sets were combined to create a dataframe to work with that contained all 119,248 observations with 19 variables including those from the parks data set.

Questions that ecologists and biologists ask about biodiversity are primarily focussed on patterns and relationships between groups in a given environment. The National Parks dataset provides taxonomic information as well as location data, meaning that exact species can be tied to an approximate location (given by the latitude and longitude of each National Park). This gives an idea of overall species spread by various taxonomic group, and therefore can reveal how species are interacting and influencing one another.

Data Application / Methods

In order to use the data to answer questions about speciation and location, it needs cleaned up a bit. This was done through data wrangling from the tidyverse in R. This data frame can be used to map a spread of species diversity across the US. An additional column was added to the data frame to calculate ‘total diversity’ of species for that national park. This was done via the mutate function in R to calculate how many observations each park contributed of each Category by filtering for park name and grouping observations together with their location. Category was the ideal variable for this sorting because it only contained 14 levels of outputs.A plot of Category observations for each park is also displayed. In order to classify the data using simpler variable outputs later on, a separate date frame was created with the column “Plant_or_Animal” to distinguish between the spread of just animals and plants.

library(tidyverse)
## ── Attaching packages ───────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# Data frames obtained via Kaggle website
Parks <- read.csv("parks.csv")
Species <- read.csv("species.csv")

# Two data frames joined together to add variables of location and total diversity
Biodiv_set1 <- Parks %>%
  right_join(Species, by='Park.Name') %>%
  group_by(Park.Name) %>%
  mutate(total_div = n())
 
## This removes 'national park' from the variable name to make plotting more readable' 

Biodiv_set1 <- Biodiv_set1 %>%
  mutate(park_name = str_replace(`Park.Name`, " Na.*", ""))
  
#Separate 'plant/animal' data frame; This separates the original data frame into separate sets based on animal or plant category. Then, a new coloumn denoting "Animal" or "Plant" is passed into the set and then the data frames are recombined with the new varaible. Also, for Conservation Status, a 'safe' output was added to make comparing the data easier. 

status<-function(x){
  if(x==''){return('Safe')}
  else {return(as.character(x))}
}
Species$Status<-sapply(Species$Conservation.Status,status)

newdata <- Parks %>%
  right_join(Species, by='Park.Name')



plants <- newdata %>%
  select(Category, Scientific.Name, Park.Code) %>%
  filter(Category %in% c("Algae", "Fungi", "Nonvascular Plant", "Vascular Plant"))
plants$Plant_or_Animal="Plant"

animals <- newdata %>%
  select(Category, Scientific.Name, Park.Code) %>%
  filter(Category %in% c("Bird", "Amphibian", "Crab/Lobster/Shrimp", "Fish", "Insect", "Invertebrate", "Mammal", "Reptile", "Slug/Snail", "Spider/Scoripion"))

animals$Plant_or_Animal="Animal"


newdata2 <- inner_join(newdata, plants)
## Joining, by = c("Park.Code", "Category", "Scientific.Name")
newdata3 <- inner_join(newdata, animals)  
## Joining, by = c("Park.Code", "Category", "Scientific.Name")
final_data <- full_join(newdata2, newdata3) 
## Joining, by = c("Park.Code", "Park.Name", "State", "Acres",
## "Latitude", "Longitude", "Species.ID", "Category", "Order", "Family",
## "Scientific.Name", "Common.Names", "Record.Status", "Occurrence",
## "Nativeness", "Abundance", "Seasonality", "Conservation.Status", "X",
## "Status", "Plant_or_Animal")
# Map of species density spread on USA Map (no filter)

Biodiv_set1 %>%
  filter(State != "AK" & State != "HI") %>% 
  ggplot(aes(Longitude, Latitude)) +
  borders("state") +
  geom_point(
    aes(
      size=total_div, 
      color=total_div
    )
  ) +
  coord_quickmap() +   
  labs(title="U.S. National Parks",               
       subtitle="Species Biodiversity"
  ) + 
  scale_size_continuous(
    range=c(3,9),
    guide=FALSE) +
  scale_colour_distiller(name = "Total Species",
                         breaks= ,
                         palette="Spectral")

## Plot of plant categories

plant_biodiv <- Biodiv_set1 %>%
  filter(Category == "Algae" | Category == "Fungi" | Category == "Nonvascular Plant" | Category == "Vascular Plant") %>%
  group_by(park_name) %>%
  mutate(total_div = n())

ggplot(plant_biodiv) + 
  geom_bar(
    mapping = aes(x = fct_reorder(park_name, total_div), fill = Category),   
    width = 0.8,                       
    color = "black",                 
    alpha = 0.8
  ) + 
  labs(title="Plant Biodiversity Within U.S. National Parks",                   
       caption="source: Kaggle biodiversity dataset", 
       x = "National Park",
       y = "Number of Species"
  ) + 
  coord_flip()

## plot of Animal Categories
animal_biodiv <- Biodiv_set1 %>%
  filter(Category == "Amphibian" | Category == "Reptile" | Category == "Bird" | Category == "Fish" | Category ==  "Mammal") %>%
  group_by(Park.Name) %>%
  mutate(total_div = n())

ggplot(animal_biodiv) + 
  geom_bar(
    mapping = aes(x = fct_reorder(park_name, total_div), fill = Category), 
    width = 0.8,                       
    alpha = 0.8
  ) + 
  labs(title="Animal Biodiversity Within U.S. National Parks",                         
       caption="Source: Kaggle biodiversity dataset", 
       x = "National Park",
       y = "Number of Species"
  ) + coord_flip()

In this map of species diversity, National Parks are represented by their relative acrage (circle size) with number of observations for each park represented as colors. From this map one can see that there is a reasonable spread of observations across the country with high levels at the Redwood Forest National Park Shenadoah National Park. The plot of Categories makes it easier to distinguish between species in the data set and on the plot itself. To show patterns of species spread more precisely, one can look at the relationship between a specific taxonmic group and the park location. Most species diveristy occurs in warmer climates and lower latitudes, so we would expect diversity to increase at areas with low latitudes and decrease as latitude increases. Similarly, patterns of speciation are typically observed as increasing as longitude increases towards 0. For this model, Category was used as the taxonimic

# Number of Species with Latitude 

final_data %>% 
  select(Latitude, Category, Scientific.Name, Park.Code, Longitude) %>% 
  unique() %>%
  count(Category, Latitude) %>%
  group_by(Latitude) %>%
  mutate(total = sum(n)) %>%
  ggplot(aes(x = Latitude)) + 
  geom_point(mapping = aes(y = n, color = Category), size = 2) + 
  geom_point(mapping = aes(y = total),
             color = "grey", alpha = 0.2, size = 4, shape = 15) + 
  geom_smooth(mapping = aes(y = total),
              se = FALSE, color = "black") + 
  ggtitle("Latitude -- Number of species") +
  ylab("number of species") + 
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Number of Species with Longitude

final_data %>% 
  select(Latitude, Category, Scientific.Name, Park.Code, Longitude) %>% 
  unique() %>%
  count(Category, Longitude) %>%
  group_by(Longitude) %>%
  mutate(total = sum(n)) %>%
  ggplot(aes(x = Longitude)) + 
  geom_point(mapping = aes(y = n, color = Category), size = 2) + 
  geom_point(mapping = aes(y = total),
             color = "grey", alpha = 0.2, size = 4, shape = 15) + 
  geom_smooth(mapping = aes(y = total),
              se = FALSE, color = "black") + 
  ggtitle("Longitude -- Number of species") +
  ylab("number of species") + 
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Both models were fit using a LOESS line due to the flexible nature of the data and to not ‘overfit’ due to multiple variable outputs. Each category for each park is represented by different colors, with the average diversity of the park represented as grey squares. From the plot of latitude, the assumption from above holds true. There is an slight increase in biodiversity as latitude decreases, with the most notable peak in the data around 40 degrees. based on the plot of longitude, there appears to be pockets of biodiveristy at -160 to -140 degrees, -120 to -100 degrees, and -90 to -75 degrees of longitude. Most species are found at around -120 to -100 degrees of longitude.

It is also useful to visualize the data based on the variable of ‘Plant_or_Animal’, which was obtained by filtering the data by category into plant and animal groups and then mutating the categorical data into a new column. Below is a plot of this variable against latitude. This showcases how many plants and animals are in each park based on their location and highlights where there might be differences in expected plant or animal populations.

final_data %>% 
  select(Latitude, Category, Scientific.Name, Park.Code, Longitude, Plant_or_Animal) %>% 
  unique() %>%
  count(Category, Latitude, Plant_or_Animal) %>%
  group_by(Latitude) %>%
  mutate(total = sum(n)) %>%
  ggplot(aes(x = Latitude)) + 
  geom_point(mapping = aes(y = n, color = Plant_or_Animal), size = 2) + 
  geom_point(mapping = aes(y = total),
             color = "grey", alpha = 0.2, size = 4, shape = 15) + 
  geom_smooth(mapping = aes(y = total),
              se = FALSE, color = "black") + 
  ggtitle("Latitude -- Number of species") +
  ylab("number of species") + 
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

From this plot one can note that there for a majority of the parks listed, there are much lower populations of animal species versus plant species. From the fit of the LOESS lines, there is evidence for a relationship between species population and location, the fits are nearly identical, but the benefits from this sorting shine through more in the readability of the graph. Sorting the variables in this way also highlights points in which there is a close relationship between the number of plants and the number of animals in a given park.

In addition to modeling the location of species, for the purpose of this investigation it is also important to model species status. Species status determines how the population is doing; is it endangered, threatened, in recovery etc.

#Conservation status table; those observations without a status listed are assumed to be 'safe' and filtered out of this table. 


test_set <- final_data %>% 
  select(Park.Name, Conservation.Status, Scientific.Name, Category, Park.Name) %>%
  filter(Conservation.Status == c("Endangered", "Threatened", "Proposed Threatened", "Proposed Endangered", "In Recovery", "Species of Concern", "Under Review"))
## Warning in `==.default`(Conservation.Status, c("Endangered",
## "Threatened", : longer object length is not a multiple of shorter object
## length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
test_set %>%
  ggplot(aes(x = reorder(Category, Category, function(x) length(x)),  
             fill = Conservation.Status)) + 
  geom_bar()  +
  scale_y_continuous(expand = c(0, 20)) +
  ggtitle("Species in conservation") +
  xlab(" ")   + 
  theme(axis.text.y = element_text(size = 10, hjust=1),
        axis.text.x = element_text(size = 10, hjust=1),
        legend.text = element_text(size = 10), 
        plot.title = element_text(size = 12))+ coord_flip()

## Not sure why the data here is flipped for Vasular plants and Birds, (when r markdown was originally compiled Vascular plants were listed as the top category)

Based on this table, and data from above, Vascular plants appear to be the group with the highest population, also the category with the most species with some kind of conservation status. Tracking endangered and threatened species can be helpful in understanding human impact on wildlife and conservation efforts. From this data we can also model which parks contain the most endangered vascular plant species in each category.

parks_endanger <- Biodiv_set1 %>%
  select(Conservation.Status, Category, Park.Name, Scientific.Name, park_name) %>%
  filter(Conservation.Status == "Endangered", Category == "Vascular Plant")

ggplot(parks_endanger,aes(park_name,fill=Category))+
  geom_bar(stat="count")+theme(axis.text.x = element_text(angle=90,vjust=0.5),plot.title=element_text(family=,hjust=0.5),legend.position = "bottom")+labs(x="Park",y="Species Count")+ggtitle("Parks where endangered species are found")+coord_flip()

Results

Based on this data, there appears to be a positive relationship between the variables of location and species diversity, particularly with the variable of Category and conservation status. Many of the parks that also contain high levels of diversity also contain populations of species that have a level of conservation status, and this can point towards environmental concerns such as a change in habitat or human influence. The LOESS models showcased how predicted patterns of speciation are observed where predicted along the longtitude / Latitude scale. Species status also appears to have a relatioship to location as well, as those with large concentrations of vascular plants are more likely to contain species of concern. Because Vascular plants also take up a majority of the data, it is expected that they would be most impacted by changes in their environment and it is not surprising that many species have a conservation status, especially when those populations of Vascular plants is also associated with high levels of animal populations. From this, it can be easy to assess where conservation efforts and investigations into diversity should be done, at parks and regions where there are high levels of differing plant and animal populations, as well as high concentrations of vascular plant species.

Conclusion

Biodiversity of multiple locations and muliple taxonomic groups can be difficult to parse through. When organized in a way that’s easier to understand, realtionships can be drawn between the location data and how the species are categorized. Based on Logitude and Latutide, it is possible to model the observations and see where pockets of diversity are in the United States. The most surprising aspect of deal with biodiversity data is primarily the size. It can be difficult to wrangle hundreds of thousands of observations, expecially if there are categorical and non ordinal variables. This data provided columns that were not always complete, which presents another challange to data wrangling as somtimes dummy variables need to be inputted into the column, or because there is no numeric value to the variables, a new column needs to be created to further sort the data into a completed column. The relationship between species diversity and location was also surprising. Through the longitide and latitude plots, it is easier to see how dense each categorical population of species is, and how that related to the spread of species across the country. The relationship is positive within the predicted bands along each scale. There also appears to be a significant relationship between animal and plant populations based off of that plot, pointing towards positive between the two variables. This can be used to track different species all across the country if more data is collected and it is sorted appropriatly.

Appendix

additional R code and plots

# Required Packages
library(tidyverse) #data wrangling
library(maps) #gives us map
library(kableExtra) #latex data organizer to make tables

# Additional plots

## total species in each category

cat_graph = final_data %>% group_by(Category) %>% summarise(Category.Count=n()) %>% arrange(-Category.Count)
ggplot(cat_graph,aes(factor(Category,levels=Category),Category.Count,fill=Category))+geom_bar(stat="identity")+theme(axis.text.x = element_text(angle=90,vjust=0.5),plot.title=element_text(family= ,hjust=0.5),legend.position = "none")+labs(x="Category",y="Total Species")+ggtitle("Total species in each category")

# distribution of each category per park
data_category <-as.data.frame(Biodiv_set1 %>% group_by(Category, park_name) %>% summarise(number = n()) %>% arrange(-number))

data_dist <- as.data.frame(Biodiv_set1) %>% 
                             group_by(Category, park_name) %>% 
                             summarise(number = n()) %>% 
                             arrange(-number)

dist_graph <-ggplot(data=data_dist, aes(Category,park_name)) + 
  geom_tile(aes(fill = number),colour = "white") + 
  scale_fill_gradient(low = "blue" ,high = "black" ) + 
  theme(legend.position='top',axis.text.y = element_text(size=8),axis.text.x = element_text(size=8,angle=45, hjust=1)) + 
  xlab('') + ylab('')
print(dist_graph)