Investigating Biovolumes in SE Australian Waters

The question

Is there a correlation between biovolume and cloud-free solar radiation?

In this vignette I will use R programming to test for a correlation between biovolume and solar radiation as recorded by FRV “Warreen” in SE Australian waters over the period of May 1938 to June 1942.

The data

Biovolume observations were made by towing plankton collection nets from the vessel. Biovolume [cc] is the abundance of organisms given in counts of 1/10th of the tow volume. Solar radiation is cloud-free surface irradiance [W/m2] at local mean time or sundial time (i.e. accounting for midday at 149° E being 4 minutes after midday at 150° E). More information can be found at Warreen hydrology and plankton data.

The source data used in this discussion has been ‘munged’ in Excel rather than R. The original source is warren4csiro.xls

Limitations

The consistency of this data set could be affected by a number of variables that are weather-dependent and hence cannot be controlled. These include the number of collections taken on each voyage, the duration of each collection (drag time) and currents. One example is the sea state, a combination of wind and wave height that is measured on the Beaufort Wind Scale. A high sea state can make it difficult to take a reading, reducing the available drag time or even preventing the vessel’s crew from performing the drag.

Setting up to answer the question …

Loading the raw data

The raw data set for this exercise consists of a single file in csv format with one row of column headings.

To load the raw data into R, use the read.csv() function.

  • a new data object called warreen is created,

  • data is read from a file called "warreen_cleaned.csv" into warreen, and

  • the header row is detected with header=TRUE

# Step 1 - Read in the csv file
warreen <- read.csv("warreen_cleaned.csv", header=TRUE)

Next, confirm that the data object can be viewed and check for anomalies. Tip: This code was also useful for troubleshooting when I made changes to the data structure later on.

To do this, use head(). This function displays the first n rows of the new data object, for example head(warreen, 6). As the data is so wide, use kableExtra::kable() together with scroll_box() to place the results in a scroll box. Tip: this is intended for HTML output and will not print nicely.

# Step 2 - Inspect the result

library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box() and the %>% pipe

head_warreen <- head(warreen, 6)
kable(cbind(head_warreen, head_warreen), "html") %>%
  kable_styling() %>%
  scroll_box(width = "500px")

| ï..Net.type | St.no. | Time.of.day | day | month | mth | Season | year | latitude | longitude | CloudFree | Biovolume | DominantOrganism | euphausiids | thaliacea | larvaceans | chaetognaths | Counter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N70_V50_0 | 6 | 651 | 7 | 1 | Jan | Summer | 1939 | -35.10 | 150.83 | 95.2145 | 5 | Type 2 | 11 | 0 | 3 | 5 | 54 |
| N70_V50_0 | 24 | 1220 | 19 | 1 | Jan | Summer | 1939 | -43.22 | 148.07 | 777.6114 | 10 | Type 2 | 3 | 1 | 0 | 0 | 55 |
| N70_V50_0 | 26 | 740 | 20 | 1 | Jan | Summer | 1939 | -41.23 | 148.45 | 237.8405 | 100 | Type 2 | 0 | 0 | 0 | 3 | 56 |
| N70_V50_0 | 1 | 1802 | 9 | 1 | Jan | Summer | 1940 | -34.92 | 151.25 | 24.9927 | 30 | Type 2 | 0 | 84 | 106 | 0 | 151 |
| N70_V50_0 | 2 | 37 | 9 | 1 | Jan | Summer | 1940 | -34.23 | 151.67 | 0.0000 | 20 | Type 2 | 0 | 0 | 0 | 0 | 152 |
| N70_V50_0 | 4 | 1457 | 11 | 1 | Jan | Summer | 1940 | -32.78 | 152.23 | 544.7137 | 20 | Type 2 | 0 | 7 | 146 | 2 | 153 |

(Because the code calls cbind(head_warreen, head_warreen), the rendered table repeats these 18 columns a second time; only one copy is shown here.)
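The first column heading, ï..Net.type, is typically what R produces when a CSV file saved from Excel starts with a UTF-8 byte order mark (BOM). A minimal sketch of one way to avoid it is shown below, assuming the BOM really is the cause here. The temporary file and its two-column content are hypothetical stand-ins so that the example runs without warreen_cleaned.csv; on the real file the same fileEncoding argument would be added to the read.csv() call in Step 1.

```r
# Write a tiny CSV that starts with a UTF-8 byte order mark (bytes EF BB BF).
# The file content is a made-up stand-in for the real data file.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("Net.type,St.no.\nN70_V50_0,6\n")), tmp)

# fileEncoding = "UTF-8-BOM" asks R to drop the mark while reading,
# so the first column name is not prefixed with stray characters.
with_fix <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(with_fix))

# Clean up the temporary file
unlink(tmp)
```

If the column names still look wrong after this change, the cause is probably something other than a BOM and the file encoding should be checked directly.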

Note the following columns: month (month of the observation), mth, Season, year, CloudFree (Cloud-free Solar Radiation), Biovolume and DominantOrganism (as found in the biovolume sample).

First glance at the data set

First create a scatter plot of the entire data set. Use the ggplot2::ggplot() function and call the warreen data object to create the chart itself. Then use the ggplot2::geom_point() function to add the point layer on the chart. The points represent the distribution of Biovolume against CloudFree.

# Step 3 - Create a scatter plot of Biovolume vs Cloud-free solar radiation

library(ggplot2)

ggplot(data = warreen) +
  geom_point(mapping = aes(x=CloudFree, y=Biovolume))

The result is not very meaningful as it does not take into account the impact of seasonal variations over the collection period.

What type of Organism are the outliers?

There are two outliers. I’m interested to know if these are both the same type of dominant organism.

To find out I will use colours to identify each point by type of Dominant Organism.

Add the colour aesthetic colour=DominantOrganism to the geom_point() function.

# Step 4 - Add the colour aesthetic to set a colour for each DominantOrganism and apply it to the points

ggplot(data = warreen) + 
  geom_point(mapping = aes(x=CloudFree, y=Biovolume, colour=DominantOrganism))

The chart now shows that the two outliers are different types of organism and also which clusters of points they belong to.
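To see exactly which rows are behind the outlying points, one option is to sort the data by Biovolume and look at the top two rows. The sketch below uses base R only. The sample_rows data frame is a hypothetical stand-in, re-typed from the six rows of the head(warreen, 6) output above so that the example runs without the CSV file; on the real data you would use warreen in its place.

```r
# Six rows re-typed from the head(warreen, 6) output (a stand-in for warreen)
sample_rows <- data.frame(
  St.no.           = c(6, 24, 26, 1, 2, 4),
  year             = c(1939, 1939, 1939, 1940, 1940, 1940),
  CloudFree        = c(95.2145, 777.6114, 237.8405, 24.9927, 0.0000, 544.7137),
  Biovolume        = c(5, 10, 100, 30, 20, 20),
  DominantOrganism = rep("Type 2", 6)
)

# Sort by Biovolume, largest first, and keep the two highest rows
row_order <- order(sample_rows$Biovolume, decreasing = TRUE)
top_two <- head(sample_rows[row_order, ], 2)
print(top_two)
```

Printing the rows alongside the chart makes it easy to check the station, year and dominant organism of each outlier rather than reading them off the colours.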

Selecting an appropriate sub-set of data

As this data is impacted by seasonal factors, such as cloud cover, it is important to take into consideration the year and month in which the samples were taken.

The next step is to understand the distribution and consistency of samples by time. To do this I will split the data set first by year and then by month.

Distribution by year

The function facet_wrap() splits the plot into separate charts and renders them across the page, wrapping them into the number of rows specified.

One chart is rendered for each value of the specified variable, in this case year, and the charts are laid out in rows, as determined by the argument nrow=2. (You can play with the layout further, but all that is needed at this stage is a high-level distribution across the years.)

# Step 5 - review distribution by year

ggplot(data = warreen) + 
  geom_point(mapping = aes(x=CloudFree, y=Biovolume, colour=DominantOrganism)) +
    facet_wrap(~ year, nrow = 2)

As expected, the samples for 1938 to 1940 are more dense. I will focus on these years.

Distribution by year and month

The next step is to confirm the distribution of collections across the selected years.

To do this use:

  • filter() to limit the data set to years 1938-1940

  • facet_grid() to show distribution by month for each year

# Step 6 - review distribution by year and month

library(dplyr)  # filter(), between()

ggplot(data = filter(warreen, between(year, 1938, 1940))) +
  geom_point(mapping = aes(x=CloudFree, y=Biovolume, colour=DominantOrganism)) +
    facet_grid(month ~ year, shrink=TRUE)

You can see that 1939 is the only year with collections performed in all twelve months. However, looking across the years there is reasonable coverage, with valid explanations for the gaps in early 1938 (no collections) and throughout 1940 (impact of WWII).
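The coverage read off the chart can also be checked with numbers. The following is a minimal base R sketch that counts collections per month and year and then counts how many distinct months were sampled in each year. The sample_rows data frame is a hypothetical stand-in holding only the year and month values of the six rows shown by head(warreen, 6), so its output says nothing about the full data set; on the real data you would use warreen instead.

```r
# year and month of the six rows shown by head(warreen, 6) (a stand-in for warreen)
sample_rows <- data.frame(
  year  = c(1939, 1939, 1939, 1940, 1940, 1940),
  month = c(1, 1, 1, 1, 1, 1)
)

# Cross-tabulate the number of collections: months in rows, years in columns
collections <- xtabs(~ month + year, data = sample_rows)
print(collections)

# Number of distinct months sampled in each year (12 would mean full coverage)
months_covered <- tapply(sample_rows$month, sample_rows$year,
                         function(m) length(unique(m)))
print(months_covered)
```

On the full data set, a year with a value of 12 in months_covered has collections in every month, which is the property the chart suggests for 1939.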

Answering the question

It’s time to test for the correlation, using average monthly biovolumes and solar radiation and taking seasonal variations into account.

To do this:

  • Use group_by() to create a new data object by_month of grouped records.

  • Use summarise() with by_month to create a second new data object that stores the collection count, mean cloud-free solar radiation and mean biovolume for each month.

Finally, plot the results with ggplot()

  • colour the points by season geom_point(aes(colour=Season), alpha = 1/2)

  • label the points geom_text(aes(label=mth),hjust=0, vjust=0)

  • overlay the line of best fit geom_smooth()

  • scale point area in proportion to the value mapped to size scale_size_area()

# Step 7 - calculate the average biovolumes and cloud-free solar radiation for each month in the data set 

by_month <- group_by(filter(warreen, between(year,1938,1940)), Season,month, mth)
biov <- summarise(by_month,
  count = n(),
  CF = mean(CloudFree, na.rm = TRUE),
  biov = mean(Biovolume, na.rm = TRUE))

ggplot(biov, aes(CF, biov)) +
  geom_point(aes(colour=Season), alpha = 1/2) +
  geom_text(aes(label=mth),hjust=0, vjust=0) +
  geom_smooth() +
  scale_size_area()
## `geom_smooth()` using method = 'loess'

This result supports the hypothesis that there is a correlation between biovolume and cloud-free solar radiation. It also demonstrates the expected matching trend in monthly averages when grouped by season. Specifically, there is greater biovolume in spring and summer (longer days and warmer water) and lower biovolume in autumn and winter (cooler days and cooler water).
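The chart gives a visual impression only. If a numeric measure is wanted, one possible approach (an assumption here, not something done in the original analysis) is a Pearson correlation test with cor.test() from base R. The sketch below runs on sample_rows, a hypothetical stand-in re-typed from the six rows of the head(warreen, 6) output so that it works without the CSV file. Those rows are all January samples, so the numbers it produces are not a finding; on the real data the call would be cor.test(biov$CF, biov$biov) using the monthly means from Step 7.

```r
# Six raw rows re-typed from the head(warreen, 6) output.
# They only make the sketch runnable; they are not the monthly means.
sample_rows <- data.frame(
  CloudFree = c(95.2145, 777.6114, 237.8405, 24.9927, 0.0000, 544.7137),
  Biovolume = c(5, 10, 100, 30, 20, 20)
)

# Pearson correlation test between the two variables
result <- cor.test(sample_rows$CloudFree, sample_rows$Biovolume,
                   method = "pearson")

print(result$estimate)  # correlation coefficient r
print(result$p.value)   # p-value for the null hypothesis r = 0
```

With only twelve monthly means the test has little power, so the coefficient is best read together with the chart rather than on its own.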

Why is March an outlier?

March appears to be an outlier. Is this due to sample sparsity for March over the collection years? One way to explore this idea is to look at the number of collections by month.

To do this, add the size aesthetic aes(size = count, colour=Season) to geom_point(). The count of collections will then determine the size of each point.

# Step 8 - overlay the collection count in the chart

by_month <- group_by(filter(warreen, between(year,1938,1940)), Season,month, mth)
biov <- summarise(by_month,
  count = n(),
  CF = mean(CloudFree, na.rm = TRUE),
  biov = mean(Biovolume, na.rm = TRUE))

ggplot(biov, aes(CF, biov)) +
  geom_point(aes(size = count, colour=Season), alpha = 1/2) +
  geom_text(aes(label=mth),hjust=0, vjust=0) +
  geom_smooth() +
  scale_size_area()
## `geom_smooth()` using method = 'loess'

The size of the points now shows that there were significantly fewer collections performed in March. This is supported by the sparsity observed in Step 5.