Questions

How does the number of species recorded by FrogID vary among ecoregions?
Is the number of species recorded by FrogID affected by the remoteness of the sampling area?
Does the data on species richness of frogs in a grid cell obtained from FrogID accurately predict the known species richness from previous expert data sets?

The packages that are required for these questions are contained below

# Packages

    install.packages("tidyverse")
        library(tidyverse)
    
# Contains useful packages such as dpylr and ggplot2 which will be used later in the notebook

    install.packages("car", dependencies=TRUE)
        library(car)

# An optional package which is used to check the assumptions of one-way ANOVA using data tests

    install.packages("ggpmisc")
        library(ggpmisc)
        
# Package used to combine table onto a plot (required for a legend in Question 1)

The dataset folder that we will be using for the questions

# Datasets

    setwd("~/R/BEES2041/FrogID datasets")

# Datasets will be accessible from this specific folder

Question 1 - How does the number of species recorded by FrogID vary among ecoregions?

Null hypothesis - Ecoregions does not affect the amount of species recorded (species richness) in the FrogID dataset

Alternative hypothesis - Ecoregions does affect the amount of species recorded (species richness) in the FrogID dataset

Code for Question 1 - The statistical test that is appropriate for calculating the probability of a relationship between an independent categorical and dependent quantitative variable is a one-way ANOVA. This calculates an f statistic and p value which indicates the likelihood of the null hypothesis being true.

Preparation for One-way ANOVA

# Read in dataset
    Frog_richness <- read.csv(file = "frogid_richness_per_grid.csv",
                              header = TRUE)

# Check whether dataset has been successfully loaded in
    summary(Frog_richness) 
    
# Create a boxplot to view the factors of Ecoregion against species richness. 
    Speciesrichness_vs_Ecoregion <- ggplot(
                                           Frog_richness, aes(Ecoregion_acronym, Species_richness_frogid)) + 
                                           labs(title = "Species richness in different Ecoregions", 
                                           y = "Species Richness", x = "Ecoregion") + 
                                           geom_boxplot() +
                                           theme(axis.text.x = element_text(angle = 90))
 


# Abbreviation for Ecoregions to act as a legend for the boxplot
    full_names_of_ecoregion <- data.frame(
                                          abb =c("D & XS","MF","MG & S", "T & SG", "T & SMBF", "TB & MF", "TG"),
                                          fname = c(
                                          "Deserts & Xeric Shrublands ","Mediterranean Forests",
                                          "Montane Grasslands & Shrublands", "Tropical & Subtropical Grasslands", 
                                          "Tropical & Subtropical Moist Broadleaf Forests", 
                                          "Temperate Broadlead & Mixed Forests", "Temperate Grasslands" ))
                                                
# Run the variables using the + annotate function to add the legend to the boxplot (to the right of the graph)
    Speciesrichness_vs_Ecoregion + annotate(geom = "table", 
                                            x = 16, y = 10, 
                                            label = list(full_names_of_ecoregion))
    ## This is where the package "ggpismc" is used

# Finally ensure that invalid rows and columns have been taken out to declutter data. Has already been done in this example.

Perform the ANOVA test on the dataset

# Store the ANOVA data into another variable which is easily identifiable as the ANOVA data
    Frog_richness.ANOVA <- aov(
                               Species_richness_frogid ~ Ecoregion_acronym, 
                               data = Frog_richness)

# Obtain a statistical summary of the ANOVA data, in particular the f statistic & p-value
    summary(Frog_richness.ANOVA)

[F(6, 569) = 22.07, p = <2e-16]

This p value indicates that there is a difference between the Species richness at different Ecoregions.

Check whether assumptions of one-way ANOVA are met

# Scatterplot for homogeneity of variance
    plot(Frog_richness.ANOVA$residuals)
        ## Have tried log and sqaure root transformation to fit heteroscedasticity but is unable to fit

# There is also the option for the Levenes Test for constant variance which gives a numeric summary of the heteroscedasticity
    leveneTest(Frog_richness.ANOVA)
        ## In this case there is a violation of constant variance as we have obtained a value <0.05, which means that our data is                     likely to possess very vague and incorrect confidence intervals. 


# Normal distribution of each group with no major outliers. Could look at a histogram, normal Q-Q plot or a boxplot to assess the normality
    hist(Frog_richness.ANOVA$residuals)
    qqnorm(Frog_richness.ANOVA$residuals)
        qqline(Frog_richness.ANOVA$residuals)

    ## Data does not follow a normal distribution. Transformations have also been experimented but it is still unable to produce a normal plot from the data

The data obtained from this test will most likely have larger confidence intervals then it is supposed due to some of the assumptions not being met for the One-way ANOVA test.

Preparing a Tukey’s test

If the null hypothesis is rejected and there is found to be a relationship between the variables then a Tukey test is appropriate to assess the similarities and differences within groups.

# Find the critical value of q for an error rate of 0.05 ##
    qtukey(0.95, 7, 569)
  
# Perform the Tukey Test to find the relationship (similarity/ difference) between each of the Ecoregions and print the data 
    Tukey_Species_richness <- TukeyHSD(Frog_richness.ANOVA)
        print(Tukey_Species_richness)
  
# It would be better to display this information graphically, and therefore we can print a 95% family-wise confidence level to display the relationships between Ecoregions richness levels. 

# cex.axis (Character expression language) function reduces the font size of the y axis in order to read the entire name of the ecoregion, whilst las rotates the text so that it is displayed horizontally rather than vertically ###
    plot(
         Tukey_Species_richness, 
         las=1, 
         cex.axis = 0.40)

The graph displays the means of each relationship (indicated by the indent in each line), the 95% confidence interval (range of each bar) and a 0 vertical line which indicates the true mean of all the relationships. Any confidence intervals that do not overlap the 0 vertical line are considered as statistically different. For example, the relationship between TG-TS & MF at the bottom of the 95% family-wise confidence level graph are statistically different as they do not pass through the middle vertical line.

Results

From this analysis of the FrogID dataset it is obvious that there exists a relationship where Species richness is influenced by Ecoregions [F(6, 569) = 22.07, p = <2e-16]. Whilst some of the assumptions are not met for the post-HOC test we were able to produce a 95% confidence interval graph (inprecise confidence intervals due to some assumptions not being met) which catalgoues the relationships between each Ecoregion, and how dis/similar they are.

Question 2 - Is the number of species recorded by FrogID affected by the remoteness of the sampling area?

Null Hypothesis - there is no relationship between remoteness of the sampling area and the number of species recorded by FrogID

Alternative hypothesis - The number of species recorded by FrogID was affected by the remoteness of the sampling area

Code for Question 2 - In order to answer this question it is appropriate to use a simple linear regression test to build a mathematical formula where y is a function of x. As a result we’ll obtain a t value which determines the precision of the regression coefficient, and a p value that indicates the likelihood for the null hypothesis being true.

Preparation for linear regression

# Read in the dataset and apply a variable name that relates well to the dataset 
    Frog_richness <- read.csv(file = "frogid_richness_per_grid.csv", header = TRUE)

Check assumptions of the linear regression test

# Create a scatterplot to visualise number of species recorded against the remoteness of sampling area to check assumption of linearity of x and y variables)
    ggplot(Frog_richness, aes(Remoteness, Species_richness_frogid)) + geom_point() + geom_smooth(method="lm")

# Transform the data if necessary 
    Frog_richness["Remoteness_log"] <- log(Frog_richness$Remoteness)

# Create the scatterplot with the transformed data
    ggplot(Frog_richness, aes(Remoteness_log, Species_richness_frogid)) + geom_point() + geom_smooth(method="lm")

# Are Y values normal and evenly distributed about the regression line. Check a residual plots homoscedasticity to do this.
      Richness_vs_Remoteness.lm <- lm(Species_richness_frogid ~ Remoteness_log, data = Frog_richness)
          plot(Richness_vs_Remoteness.lm, caption = ("Residuals vs Fitted"))

Linearity is maintained as the red line is roughly level. There is a slight increasing trend in the residual points indicating that error variance increases with the independent variable, but this is minimal. Increasing error variance means that standard error is increased.

# Checking a histogram and a boxplot are also good methods to check normality around 0
    hist(residuals(Richness_vs_Remoteness.lm))
    boxplot(residuals(Richness_vs_Remoteness.lm))

)

Normality of the dataset is roughly followed, although the data is slightly positive skewed as indicated by the boxplot

R statistical output of scatterplot

# Obtain t and p values
summary(Richness_vs_Remoteness.lm)

Results

Through linear regression testing of the FrogId dataset we have obtained the following output[F(1, 574) = 148.3, p = <2e-16] to determine the relationship between Remoteness of sample and number of species recorded. With this p value we can reject the null hypothesis in favour of the alternative hypothesis that there does exist a relationship. It should be noted though that there is likely a high level of standard error due to the increasing variance with the independent variable (remoteness), meaning that this test in particular may not be the most accurate.

Question 3 - Does the data on species richness of frogs in a grid cell obtained from FrogID accurately predict the known species richness from previous expert data sets?

Null hypothesis - The species richness data obtained by Richard cogger does not significantly relate to the frogid data of species richness

Alternative hypothesis - Both species richness data obtained by Richard cogger and frogid are significantly related and predict eachother

Code for Question 3 - To answer this question we will need to compare the species richness from the two separate datasets under one dataset and determine whether there is a significant difference or not. Therefore it is again appropriate to use a linear regression model to quantify the relationship between these two variables.

Preparation for linear regression

# Read both data tables into the notebook
    Frog_richness <- read.csv(file = "frogid_richness_per_grid.csv", header = TRUE)
    Cogger_richness <- read.csv(file = "cogger_richness_per_grid.csv", header = TRUE)


# We will need to combine both of the datasets into a singular dataset in order to proceed with a linear regression test. The merge function will suffice for this
    Combined_data <- merge(Cogger_richness, Frog_richness, by = "grid_id", all.x = TRUE, all.y = TRUE)

Check assumptions of the linear regression test

# Check the assumptions of linearity of y ~ x data with scatterplot

    ggplot(Combined_data, aes(Species_richness_frogid, Species_richness_hal)) + geom_point() + geom_smooth(method="lm")

# Use log transformation of Species_richness_hal to create a stronger linear model
    Combined_data["Hal_richness_log"] <- log(Combined_data$Species_richness_hal)

# ScatterPlot of log transformed cogger richness data vs non-transformed frogid richness data
    ggplot(Combined_data, aes(Species_richness_frogid, Hal_richness_log)) + geom_point() + geom_smooth(method="lm")

# Normal distribution about the regression line
    Combined_data.lm <- lm(Hal_richness_log ~ Species_richness_frogid, data = Combined_data)
        plot(Combined_data.lm, caption = ("Residuals vs Fitted"))

Linearity is maintained as the red line is level. There is a decreasing trend in the residual points indicating that error variance decreases with the independent variable.

Normality is maintained in this plot, which is further proven by the boxplot and histogram below

# Histogram and boxplot to check normality graphically
          hist(residuals(Combined_data.lm))
          boxplot(residuals(Combined_data.lm))

R statistical output of scatterplot

# Obtain t and p values
    summary(Combined_data.lm)

Results Through linear regression testing of the FrogId and Cogger datasets we have obtained the following output[F(1, 574) = 265.8, p = <2e-16] to find a relationship between the species recorded by both dataset. We have used the linear regression test to test for a relationship between both samples. If there is a relationship between the two samples then it becomes possible to predict the outcome of one sample with data from the other one. The p value output that we obtained disproves the null hypothesis of no relationship between the two richness datasets, and therefore it is possible to predict the outcome of past datasets (such as cogger_richness) using the frogid dataset.

Use of citizen science to monitor frogs across Australia - BEES2041

Questions

Question 1 - How does the number of species recorded by FrogID vary among ecoregions?

Question 2 - Is the number of species recorded by FrogID affected by the remoteness of the sampling area?

Question 3 - Does the data on species richness of frogs in a grid cell obtained from FrogID accurately predict the known species richness from previous expert data sets?