Lab, Feburary 25th, 2022

Getting Back into Rstudio

Go to rstudio.bowdoin.edu and log in with your bowdoin username (your email without the @bowdoin.edu) and password.

Unless you’ve used R since our last class, all your work from last week should pop up. To separate the work we’re doing this week from last, we’re going to want to create a new file. Go to the top left corner and click on the white square with the green plus to create a new R script.

Once that’s open, you’re going to want to make sure that all our libraries are loaded. Remember to type everything into your file, not the console. If you type directly into the console, your work won’t save.

library(seacarb) #our library for working with carbonate chemistry

Picking up where we left off

First things first, we need to check that all our data from last time is still in our environment in the top right. If you scroll up and down in that box, you should see both of the following: scsc (with 4 observations of 11 variables) and scsc_carb (with 4 observations of 19 variables). If you have more than that, don’t worry, that’s totally okay. We are going to be calling on scsc and scsc_carb though, so we need to make sure that they’re ready to go.

New Meadows Data

We are extremely lucky that in 2017 and 2019, the University of Maine had a buoy in the New Meadows River off of Bombazine Island, which is quite close to Merritt Island and Jordan Kramer’s clam farm site! The buoy monitored a suite of environmental data from July-December of both years. While our field sampling is constricted to the winter/spring, this data can help provide insight into the biogeochemical conditions during the late summer and fall growing season - especially for the clams that were suspended in the water!

Let’s import this data into R - click upload in the middle right of the screen and upload the csv like we did last week. The csv we’re using fist is called buoy. We’ll then

buoy <- read.csv("buoy.csv")

Wow - 6,724 observations! Let’s have a look at the structure of this new dataframe.

str(buoy)

Oof - date and time are being read as factor variables. That means R thinks each unique date is it’s own category. All the various date/time combos (and timezones!) are notoriously irritating to deal with in the coding world. Luckily in R there is a package called “Lubridate” that provides functions to “Make Dealing with Dates a Little Easier”. Let’s download and load the package:

library(lubridate)

Now - lubridate has a function mdy_hm() that tells R to read a date like ours as month/day/year hours:minutes. Let’s use it, and then check out the structure of our dataframe!

buoy$datetime_eastern <- mdy_hm(buoy$datetime_eastern)
str(buoy)

Hooray! “POSIXct format” means R is now interpreting the values in this variable as calendar dates/times, measured in the number of seconds since 1970. That means we can now plot datetime_eastern as a variable, and R will know it is a date and not a million different categories ordered alphabetically! Give it a try with the base-R plotting function - pick a bouy variable (y) to plot over time (x)! For some plots, you may wish to use the ylim argument to change the y-axis limits to exclude outliers.

plot(buoy$datetime_eastern, buoy$chl_ug_L)
plot(buoy$datetime_eastern, buoy$pH) 
plot(buoy$datetime_eastern, buoy$sal_psu) 
plot(buoy$datetime_eastern, buoy$temp_C)

Cool! Real data from two years in the New Meadows River. . . Can you already see crudely that there are some seasonal patterns, and also some differences between the years?

While we are thinking about time, let’s also make a new column for “month” and a new column for “year” to make it easy to subset data over different timescales. The lubridate package also has two easy functions for this:

buoy$year<- year(buoy$datetime_eastern)
buoy$month <- month(buoy$datetime_eastern)

OK! Now we are set up to calculate means and standard deviations for the conditions in different months in both years.

Calculating Summary Statistics and Visualizing Data

You could write your own script to calculate means and standard deviation, but the nice thing about R is that in addition to packages, users in the community have already written many great, freely-available scripts that may be useful to your research. summarySE() is a script developed by users in the R community that provides a nice shortcut to calculate the mean, standard deviation, standard error and 95% confidence interval for subsets of data. I’ve copied the code to create this function below. The easiest way to add the function to your RStudio is to create a new file and paste the code there. Then run it all at once. In your environment pane on the top right of your Rstudio window, you should be able to see, Functions and then summarySE.

The code can be found on the website: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/ by scrolling down to the “Helper Functions” section.

Here’s the code to run:

## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summariezed
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                      conf.interval=.95, .drop=TRUE) {
    library(plyr)

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # This does the summary. For each group's data frame, return a vector with
    # N, mean, and sd
    datac <- ddply(data, groupvars, .drop=.drop,
      .fun = function(xx, col) {
        c(N    = length2(xx[[col]], na.rm=na.rm),
          mean = mean   (xx[[col]], na.rm=na.rm),
          sd   = sd     (xx[[col]], na.rm=na.rm)
        )
      },
      measurevar
    )

    # Rename the "mean" column    
    datac <- rename(datac, c("mean" = measurevar))

    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval: 
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult

    return(datac)
}

After running the script, you will see a new category called “Functions” appear in the R Studio Environment window, with the summarySE function and the arguments that it takes.

Now - let’s use the summarySE function to create a new data table with the summary statistics for one of our variables of interest - like the pH over time in the New Meadows River. Let’s call the new dataframe pH_stats. The “measurevar” argument indicates which column of which dataframe you want the mean and standard deviation to be calculated from (pH from the buoy dataframe, in this case). The “groupvars” designates the groups within which you want to analyze the variable of interest. In this case, we are interested in grouping BOTH by Month and by Year (e.g., we want to find the mean for July/2017 and July/2018 - not just all of July.

pH_stats <- summarySE(data=buoy, measurevar = "pH", groupvars=c("year","month"))

#If you're getting an error here, try this:
library(tidyverse)
pH_stats=buoy%>%
  group_by(year, month)%>%
  summarize(mean=mean(pH), sd=sd(pH),  n = n(), se = sd / sqrt(n))
head(pH_stats)

Open the pH_stats to consider these summary statistics.

“N” is the number of samples for each category, and you can see there are not an equal number of measurements for each month. That’s because the buoy went into the water at different times, and some months are shorter than others.
pH is the mean pH for that year/month subset of data, and sd is the standard deviation. If there are extreme outlier data points (for example, sensor broken and then repaired) these will be reflected in the standard deviation and we may want to investigate/exclude.

Calculating Aragonite

“All models are wrong, but some are useful”

You probably noticed there is only ONE of our carbonate chemistry unknowns in the buoy dataset (pH), and we need at least TWO (typically measured - alkalinty or DIC) to solve for our carbonate chemistry unkowns. Sadly these are not easily measured by sensors and so we don’t have the same kind of time-series data for them.

However, we do know from the literature that alkalinity is often strongly correlated to salinity in coastal environments. And last summer there was a Gulf of Maine-wide effort called “Shell Day” to measure alkalinity and a suite of environmental variables in seawater up and down the coast to see if there are any strong relationships that can be used to predict alkalinity. Michele’s lab has been involved in the alkalinity measurements, as as such we are going to have a sneak peak of that data! We have provided a reduced datasheet showing high-tide data salinity and alkalinity measurements.

Load the data into R and plot it using ggplot. . .

library(ggplot2)
shellday <- read.csv("shellday_TA.csv")

To begin with, let’s see if Salinity and Alkalinity are correlated in this dataset using the function cor.test(). Correlation determines if one variable varies systematically as another variable changes. It does not specify that one variable is the dependent variable and the other is the independent variable. Often, it is useful to look at which variables are correlated to others in a data set, and it is especially useful to see which variables correlate to a particular variable of interest. As you write your lab report, you may be interested in looking into whether correlations exist among other the variables in the Buoy dataset.

cor.test(shellday$Salinity, shellday$TA)

Correlations vary from -1 to 1, with 0 indicating no association. A strong positive correlation, like we observe with r=0.89, indicates that both increases in salinity and alkalinity are associated with one another in this dataset. The degrees of freedom reflects our sample size, and is used to help calculate the p-value (more samples -> less chance of random association). Traditionally most people use a cutoff of p<0.05 for statistical significance - you can see our p-value is very much lower than 0.05, indicating highly significant correlation! If you were reporting this result, you would say that on Shell Day, high tide salinity and alkalinity measurements were significantly, strongly positively correlated (Pearson’s product-moment correlation, r=0.89, p<0.001). Now, let’s visualize this data with a ggplot, plotting the data as points.

ggplot(shellday, aes(Salinity,TA))+ 
  geom_point()+
  ylab("Total Alkalinity (umol/kg)")+ 
  xlab("Salinity (PSU)")

As our correlation indicated, there does appear to be a linear relationship. You can add the linear regression with just one more line of code to your ggplot - the geom_smooth() function, with the method calling for ‘lm’ (a ‘linear model’). This plots the regression in blue and the 95% confidence interval around the line in gray.

ggplot(shellday, aes(Salinity,TA))+ 
  geom_point()+ 
  geom_smooth(method=lm)+
  ylab("Total Alkalinity (umol/kg)")+ 
  xlab("Salinity (PSU)")

Now, let’s calculate the equation of that line such that we can predict alkalinity based on salinity. The linear model function in R takes the variables in the format: lm(response variable ~ predictor variable, data). By saving your linear model as an object (let’s call it alk for simplicity), you can get detailed information about the model’s performance using the summary() command.

alk <-lm(TA~Salinity, data=shellday) 
summary(alk)

What does this model output mean?

Formula call. The first item shown in the output is the formula R used to fit the data: the response variable (TA) and the predictor variable (Salinity), together with the data being used (shellday).
Residuals. Residuals are essentially the difference between the actual observed response values (Total Alkalinity measured in the lab, in our case) and the response values that the model predicted. The Residuals section of the model output breaks it down into 5 summary points. When assessing how well the model fit the data, you should look for a symmetrical distribution across these points on the mean value zero (0). In our example, we can see that the distribution of the residuals do not appear to be strongly symmetrical. That means that the model predicts certain points that fall far away from the actual observed points. We could take this further consider plotting the residuals to see whether this normally distributed, etc. but will skip this for this example because with the wide range of waterbodies included in our dataset we expect to have some outliers.
Coefficients. In a simple linear regression (y=mx+b), the coefficient estimates are two unknown constants that represent the slope (m) and intercept(b) terms in the linear model. So for our model, Total Alkalinity increases by 52.183 umol/kg with each increase in PSU. The standard error measures the average amount that the coefficient estimates vary from the actual average value of the response variable. Ideally, we want this to be a lower number relative to our estimate. The standard error can be used to compute confidence intervals and statistically test the hypothesis that a relationship exists between alkalinity and salinity. The t values are how many standard deviations away our coefficient estimate is from zero, and are used to compute the p-value. A small Pr(>|t|) or p-value indicates that it is unlikely we will observe a relationship between the predictor (salinity) and response variable (TA) due to chance - we have very small/highly significant p-values associated with both our intercept and slope, indicating we can conclude there is a linear relationship between salinity and alkalinity.
Multiple R-squared. This is a proportion of variance statistic providing a measure of how well the model is fitting the data. It always lies between 0 and 1, with 0 representing a regression that does not explain the variance in the response variable, and 1 being a perfect fit of the data to the line. With an Rˆ2 of 0.8, we can say that in our model 80% of variance observed in TA can be explained by our predictor variable, Salinity. If you add more than one predictor variable to your model, the Adjusted Rˆ2 reflects the additional variables improve your model or not.
F-statistic. Another indicator of whether there is statistically significant relationship between our variables, which takes into account the number of data points (degrees of freedom). The further it is from 1, the better it is. You can see ours is also highly significant.

So - now we have a model predicting TA (y) based on Salinity (x).

y = 52.183x+428.281

Let’s apply this to our buoy data and model TA for the New Meadows River! Make a new column called TA in our buoy dataframe - the contents should equal 52.183*(salinity)+428.281.

buoy$TA <- 52.183*buoy$sal_psu + 428.281

Excellent! Let’s see the range of alkalinity values generated by this equation. The summary() command can be run on a single variable or entire dataset to see the minimum, median, mean and max values for a variable.

summary(buoy$TA)

Our minimum indicates there are some values of 0 in our salinity measurements, which are likely errors with the sensor which we may want to exclude. In general, our range in TA is small, reflecting the relative stability of salinity in the New Meadows River time series.

You can check on how well the model performs with our data from the field by entering our field salinity value from Site 2 into the equation!

52.183*31.8+428.281

The predicted alkalinity is a little bit lower than what we measured in lab for that particular seawater sample (2230 umol/kg). So what caveats might you discuss when applying/interpreting this model? Think about the data it was based on.

Ok - let’s use seacarb to calculate the saturation state with respect to aragonite, using our modeled TA and measured pH, temperature, and salinity from the buoy!

Remember - the seacarb carb() function requires alkalinity in units of mol/kg, so we have to make a second alkalinity variable in the correct units.

buoy$Alk_mol <- buoy$TA/1000000

Now let’s solve for the carbonate chemistry parameters! Use the same code as before, but provide the buoy dataset variables.

buoy_carb <- carb(flag=8, var1 = buoy$pH, var2=buoy$Alk_mol, S=buoy$sal_psu, T=buoy$temp_C)

Nice! We now have carbonate variables for all 6,724 buoy observations. There is a warning that some observations had temperatures/salinities outside the ranges given in seacarb equations (you can see what the ranges are by looking up Ks or Kspa in help). Way too many numbers to look through visually, like we did before. So let’s extract OmegaAragonite from the seacarb results, and add it back to the buoy dataframe, to calculate some summary statistics just like we did before.

buoy$OmegaA <- buoy_carb$OmegaAragonite
nm_omega_stats <- summarySE(data=buoy, measurevar = "OmegaA", groupvars=c("year","month"))

We’re also going to need the salinity data to make our graphs. So we’ll use the same code to get the averages, but with Salinity as the variable we’re isolating.

nm_salinity_stats<-summarySE(data=buoy, measurevar = c("sal_psu"), groupvars=c("year","month"))

Look at your new omega_stats and salinity_stats dataframes. Notice that year and month are once again numeric variables, so be sure to tell R to make these categorical factors before you plot the data.

nm_omega_stats$month <- as.factor(nm_omega_stats$month)
nm_omega_stats$year <- as.factor(nm_omega_stats$year)
nm_salinity_stats$month <- as.factor(nm_salinity_stats$month)
nm_salinity_stats$year <- as.factor(nm_salinity_stats$year)

We’re going to plot the data in Excel so we’ll need to export our csv.

write.csv(nm_omega_stats, file="nm_omega.csv")#the nm stands for new meadows, where the buoy data originated from
write.csv(nm_salinity_stats, file="nm_salinity.csv")

We’ll then need to export the file from the cloud. In the lower right of your R window, click the check boxs next to nm_omega.csv and nm_salinity.csv. Then, in the middle right, click More, then export, then download. This will export the file to your downloads folder where we’ll eventually convert it to an excel file.

SCSC Data

Now we’re going to need to do the same things we did for the buoy data set to a Schiller data set. However, we need to download that data from the online Schiller data first. Luckily, the Schiller data is being consistently uploaded to an R repository. We’re still going to need to download that data from the repository. This is a complicated bit of coding, and there’s an Rstudio file on blackboard that will walk you through the code.

scsc_new <- Licor_Hcat

Unlike last time the timestamp is already in POSIXct, but let’s double check

str(scsc_new)

SCSC Stats

#new columns for year and month
scsc_new$year<- year(scsc_new$timestamp_EST_rounded)
scsc_new$month <- month(scsc_new$timestamp_EST_rounded)

#making our omega aragonite columns
scsc_new$Salinity.psu.<-as.numeric(scsc_new$Salinity.psu.)#the data is currently a character, not a number, so this code fixes that
scsc_new$TA <- 52.183*scsc_new$Salinity.psu. + 428.281
summary(scsc_new$TA)

scsc_new$Alk_mol <- scsc_new$TA/1000000 #we need to be in mol/kg
scsc_new$pH.pH.<-as.numeric(scsc_new$pH.pH.)
scsc_new$Temperature.C.<-as.numeric(scsc_new$Temperature.C.)
scsc_new_carb <- carb(flag=8, var1 = scsc_new$pH.pH., var2=scsc_new$Alk_mol, S=scsc_new$Salinity.psu., T=scsc_new$Temperature.C.)#running our seacarb package
scsc_new$OmegaA<- scsc_new_carb$OmegaAragonite#adding the omega aragonite values to our original data frame

#summarize omega aragonite values. This is another way to summarize values that doesn't require summaryse() but does require tidyverse

library(tidyverse)

scsc_omega_stats=scsc_new%>%
  group_by(year, month)%>%
  summarize(mean=mean(OmegaA), sd=sd(OmegaA),  n = n(), se = sd / sqrt(n))
head(scsc_omega_stats)#check it out, the data is in the same format as the new meadows data!

scsc_salinity_stats=scsc_new%>%
  group_by(year, month)%>%
  summarize(mean=mean(Salinity.psu.), sd=sd(Salinity.psu.),  n = n(), se = sd / sqrt(n))
head(scsc_salinity_stats) #summarize salinity values

#make month and year factors
scsc_salinity_stats$month <- as.factor(scsc_salinity_stats$month) 
scsc_salinity_stats$year <- as.factor(scsc_salinity_stats$year)
scsc_omega_stats$month <- as.factor(scsc_omega_stats$month) 
scsc_omega_stats$year <- as.factor(scsc_omega_stats$year)

#write our csv data frames
write.csv(scsc_omega_stats, file="scsc_omega.csv")
write.csv(scsc_salinity_stats, file="scsc_salinity.csv")

SAVE in your code!!! This will also save your project and prevent you from needing to rerun all your code later!