Lab 6 RStudio Tutorial

Today, we’re going to be using the online version of R. There’s a package that we’ll use later that doesn’t work on Macs, so it will be easier to use online R. This also means that you can change computers or iPads freely as all your files are saved in the R Cloud.

Go to rstudio.bowdoin.edu and log in with your bowdoin username (your email without the @bowdoin.edu) and password.

Setting up our working directory for later

To make it easy for us to begin work in class and finish it later, we want to have all our files in our working directory in an easily accessible place on the R Server. One way to do this is by making a Project.In RStudio, a Project is a great way to keep all your different coding projects organized. Each project has its own working directory, workspace, history, and documents.

To create a project, open R Studio and click on the little Project icon (R in a cube) in the upper left hand corner of RStudio. You will be prompted to select a “Project Type” - select New Directory.

Name your project whatever you would like, but something that will be meaningful to you later down the road. (Lab6 is pretty straightforward.) This creates a new folder that will be your project’s working directory.

Leave “Create project as a subdirectory of:”, blank.

You will now see the name of your project up in the top right-hand corner of RStudio. You will see you have a clean working directory with no files.

RSwirl tutorial

We’re now going to start an RSwirl tutorial, which will help you learn the basics of R. If you’ve had experience with R before, feel free to skip this tutorial. However, if you feel like you need to return to the basics or haven’t used R before, this will help you learn!

First, we’re going to download the tutorial. Go to https://swirlstats.com/scn/A_(very)_short_introduction_to_R.html and dowload the file. Then upload the file to RStudio. You can do this by going to files in the middle center of the page and clicking “Upload.”

In the area called Console, run the line of code that is given on the website. To “run” a piece of code, copy it into the console and then click enter.

Next, we’re going to download a package that’s going to help us run the tutorial. Either type the following into the console, or copy it. Every time you write in a new line, press enter to run the code.

utils::install.packages('swirl')

A bunch of code is going to show up and scroll through your console. Don’t worry. That’s supposed to happen.

library(swirl)

Follow the instructions that begin to appear, pressing enter each time to confirm your choices. Open A (very) short introduction to R by selecting option 1. Select option 1 again to run module 1. Everytime you see a set of 3 dots, press enter to move to the next step. You’ve already set your working directory, so don’t worry about that! When it asks you to install a new package, ignore it, we’ll get to installing more packages later.

Sometimes your might click enter and nothing will appear to happen. Pause. Is there a small stop sign in the upper left of the console? Then R is just thinking. Wait a minute and your code should compute!

Once you’re done with the first module, do modules 2 and 3 by selecting A (very) short introduction to R and then the next module.

In module 2, it’s going to insist that you set your working directory again. Follow the instructions and set your working directory ‘to project directory.’ Open a new file and save it as ‘firstscript’ when module 2 instructs you to.

In module 3, you will need to read in files. To make a file as the swirl program requests, use either notepad (on PCs) or TextEdit (on Macs) and copy the text into the file. Then Import that file into your working directory online by clicking the upload folder about halfway down the window on the right. For Mac users: Make sure that you have a PLAIN TEXT document when you are making your text document. Click format in the upper left control bar. If you see “make rich text” then you are in plain text mode and are all set.

Learning seacarb

We’re going to start learning how to use a package that is going to help us calculate all the components of carbonate chemistry. In our carbonate chemistry unit at the beginning of the semester, you learned to solve for all of the carbonate system unknowns (pC02, [CO2], [HCO3-], [CO32-], [H+]) by hand, using the three equilibrium constant (KH, K1 and K2) equations, and the equations for DIC and Alkalinity. Likewise, in lab we have learned two methods for measuring alkalinity in our seawater and porewater samples, and have been measuring pH, temperature and salinity in our samples. With this data we are all set up to solve for [CO32-], and subsequently the saturation state with respect to aragonite (which as we know from our Green et al. discussion and annotated bibliography research, is a very important biogeochemical factor for juvenile clam growth!).

But having done the calculation for a single sample by hand, you can now appreciate how hard it would be for a marine biogeochemist to look up all the k values for varied conditions, and solve all the equations for many samples. . . say, for example, if you had two months of tank data from our experiment at the Schiller Coastal Studies Center! Or if you had two years of daily water measurements from the New Meadows River, where Jordan Kramer’s clams were growing over the course of a summer, and you wanted to know how saturation state changed over time!

Luckily, a number of computer programs have been developed to calculate all the seawater carbonate chemistry parameters from your given measurements. Today we will be learning to use one of them so we can more easily address one of the essential questions of our collaborative research project: At the New Meadows River experimental clam aquaculture site, is seawater saturation state with respect to aragonite likely favorable for juvenile clam growth throughout the growing season?

Getting Data

Download the scsc.csv from the class’s blackboard page. Then upload it, following the same procedure we used for the swirl file and the text file.

Setting up a Script

Now, let’s create a new script where we will write, run and save all of our code from today. In the upper left hand corner, click on the down arrow next to the little green plus icon and select R Script. Save this script by clicking on the floppy disk icon. You should automatically open up to your Project folder. Navigate to scripts and save this as Lab6.

New coders tend to start their code right away with no comments or explanations of what they are doing. But, again, we want to produce transparent and reproducible research. Anything in your script preceded by the #-sign tells R that this is a comment, not an expression to be evaluated. So, it’s good coding etiquette to have comments that explain why you have everything you have in your code. It’s helpful to have the your script organized in the following manner:

Section 1: Start with a short intro of what your intend to do with your script. Say who you are, the date you created it, and when you last edited it.

Section 2: Load the package libraries. Here, I load all the libraries that I need for my code to run. (Usually you do not need to install.packages() every time if you are working on the same machine - but if you will be working on different machines, it might be a good idea to comment that you may need to do this.)

Section 3: Load data into R. This is where you load all your data files.

Section 4: Analyses, which can also be subset into different sections.

In this tutorial, we will be working through several different types of data as examples. You may find it more intuitive to load data/run analyses for each example sequentially. As you work through the tutorial, be sure to subdivide your code with comments when we begin a new analysis! The important thing is that the code in your script is ordered so that the whole script can run without errors at the end (for example: if your code for a plot is saved before the data is loaded, R will have no idea what your code refers to!) It is also helpful practice to use comments to explain/remind yourself what the code is doing.

SCSC Tank Data

Let’s start off with a simple example - data from the first week of our tank experiments at the marine lab. As you will remember from our titrations in Lab 3, we had some highly variable results leading to some uncertainty in alkalinity values. Luckily, Michele’s team of student researchers were also sampling for alkalinity, and research technician Katherine Guay has already run a few of these samples on the autotitrator in Michele’s lab. This is a very precise instrument being run by an expert - Katherine’s triplicate samples had standard deviations of less than 5 umol/kg (~0.2% variation around the mean!).

How do we get the data into R? You have a file called scsc.csv saved in your working directory. (Check this by clicking on the Files tab in the lower right-hand corner.) To import a csv file as a new dataframe, you need to tell R what to call this new object, and also give R the path to where it is saved so you can find it. This will be the first line of code for “Section 3” of your script.

Let’s call this new object “scsc”, since the data is from the SCSC tanks. Our current working directory is the R project we made at the beginning of class (you can double check this with the getwd() command), and the data we want is in our data folder. So the path to your csv should be:

scsc <- read.csv("scsc.csv")

To run a line of code or a chunk of code in your script, select it or place your cursor on that line. Then press the Run button in the upper right hand corner, or hold down the command button and hit return button.

Great! You should now see a new dataframe called scsc in your upper right hand Environment window. Clicking on scsc opens it up to view in the editor window. You can see the top line of the table, called the header, which contains the column names. Each horizontal line afterward denotes a data row. Have a look at the alkalinity data!

Another handy command is str(), which gives you the structure of an object including what kind of data - numerical, categorical, etc. - each column of a dataframe is, and how many observations (rows) there are.

str(scsc)

Remember from the “A (very) short intro to R” tutorial that columns in a dataframe can be of different classes. You will see all our measurements (Temp_C, Sal, pH, TA) are numerical, which is good, because we want R to be able to use them in calculations. Our treatment variables are recoginzed as categorical “factors”, which is also good. But there are a few changes we want to make:

Date is being considered a factor, with every day it’s own unique category! R needs to know that this is a Date variable, with the format of months/day/2-digit year. The as.Date() command will tell R to encode this as calendar dates, just as as.numeric() or as.factor() commands would encode your data vector as integers or factors. Remember, to tell R to work with a column/variable in a dataframe, the syntax in R is dataframe$variable.

scsc$Date <- as.Date(scsc$Date, format='%m/%d/%y')

Now try the str() command again and see how the class has been changed for Date, from Factor to Date.

Another handy command is summary(). This is a generic function that can help you see the categories and/or range of numbers in each column of a data frame. Give it a try!

summary(scsc)

Using our data to explore the seacarb package

Begin by installing the package seacarb. You may need to type yes into the console and then hit enter to get everything started. Like the last time we installed a package, there’s going to be a lot of text, and some of it is going to be red. Run each line separately by writing it in your file and then pressing command enter.

utils::install.packages('gsw')
utils::install.packages('seacarb')

Remember, even after it has been installed, you always have to load a package into R before you can begin using it. This should be the first line of code in “Section 2” of your script. Add a comment beforehand indicating install.packages() should be run first.

library(seacarb)

Before we jump in with our data, to give you an idea of some of the capabilities of seacarb, there are a number of examples you can run. The examples print out first the function and it’s arguments, and then the output. Run the code below one line at a time in the Console to observe how you can calculate the solubility product for aragonite given temperature and salinity (and therefore have freedom from the textbook tables!), or calculate the density of seawater (rho).

example(Kspa) 
example(rho)

Then, try running both functions for the “high temperature” treatment of 9C and Salinity of 33 that we observed in our SCSC experimental tank samples.

Kspa(S=33, T=9) 
rho(S=33, T=9)

Note: you can use the up arrow to repeat or edit previous commands in the console. These examples are just a few of the functions available to us in seacarb. Look at the help page to see a more complete list. Can you find the functions you just used in the examples?

help(package="seacarb")

What important information about the acceptable ranges of temperature and salinity for Kspa() is found in the help documentation? The carb() function will be the most important one for us in this class. In the help index, click on the “carb” hyperlink to learn about what arguments this function requires. Or if you can do the same thing with code by typing:

?carb()

You will see that the first argument to the function is a “flag”, which designates which two of the carbonate unknowns are being used for the carbonate calculations. There is also important information about units, as well as citations for the k values being used in the calculations. There are several default values set - for example, if you don’t include arguments specifying values for pressure or nutrients, then those values are automatically set to 0 (generally a fine assumption for our surface water samples).

  1. Note - which “flag” number should we use for our seawater data?

  2. Double check - what units do our alkalinity measurements need to be in for these calculations?

Aha! Our spreadsheet alkalinity measurements are in micromoles/kg, rather than mol/kg. So let’s tell R to make a new column in our dataframe, called Alk_mol (by giving the name of the dataframe, and using the $ to designate the name of the column we are creating). This column will be equal to our existing alkalinity measurement divided by 1*10ˆ6.

scsc$Alk_mol <- scsc$TA_avg_umol_kg/1000000

Check out your scsc dataframe - do you have a new column?

Let’s practice using the carb() function with the SCSC tank data.

  1. The argument flag=8 tells seacarb that variable 1 will be a pH measurement and variable 2 will be an Alkalinity (mol/kg) measurement. This means that for a single sample you would enter your measurements for the var1 and var2 arguments (eg var1 = 8.0), OR for many samples you would tell R what column to use for each argument (var=scsc$pH).
  2. Same goes for the arguments salinity (S=) and temperature (T=) Let’s try it out! Name the results of the carb function “scsc_carb”.
scsc_carb <- carb(flag=8, var1 = scsc$pH, var2=scsc$Alk_mol, S=scsc$Sal, T=scsc$Temp_C)

Wow! Now you have a new dataframe in your upper right-hand corner Environment window, named scsc_carb, containing 4 observations (each tank) with 19 columns. Click on scsc to view it as a spreadsheet tab in the upper left-hand window. You can see the first columns are the parameters you gave the function; the following are the carbonate chemsistry parameters that were calculated for you!

Which tanks were undersaturated or supersaturated with respect to Aragonite?

Because we are most interested in the OmegaA value for our tanks, let’s extract the OmegaAragonite column from scsc_carb dataframe and add it back to the dataframe with all the other tank data. To do this, just create a new column in our original dataframe:

scsc$OmegaA <- scsc_carb$OmegaAragonite

Fantastic! Now you can see ‘scsc’ has 11 variables, with the final variable being OmegaA.

Perhaps you would like to save these results as a .csv file in your project folder, to access later (like when you are writing your paper!) without rerunning all the scripts. To save your scsc dataframe with OmegaA values, use the command write.csv() and tell R where you want to save it. This is one output from our analyses, so let’s save it our Project’s working directory:

write.csv(scsc, file="scsc_oa")

If you want to download this file to your desktop, click the check box next to the file, then click more (it’s near the upload button). Finally, export the file.

Before we move on, let’s quickly visualize how the saturation varied between pH treatments, using base R’s plot() function. Nothing fancy, but this can be informative for data exploration! Tell plot() to make ph.Treatment the x variable and OmegaA the y variable.

scsc$pH.treatment=as.numeric(scsc$pH.treatment) #This line is making sure that the infinite decimal places don't mess up our plot
scsc$OmegaA=as.numeric(scsc$OmegaA) #Same with this line
plot(scsc$pH.treatment, scsc$OmegaA)

You can easily add in additional arguments to the plot() function to add things like axis labels or change the scale of y with ylim=c(ymin,ymax), as shown below:

# How to specify many axis options within base-R's plot()
plot(x, y, main="title", sub="subtitle", xlab="X-axis label", ylab="y-axix label", xlim=c(xmin, xmax), ylim=c(ymin, ymax))

Saving a plot is very easy in RStudio. Simply click on the “Export”" button in the Plots window, and select “Save as Image”. You can change the size of the figure at this point. Navigate to your Output folder and save your .jpg as Figure 1.

Of course, this data represents just one point in time (the first day of the experiment) - by the end of the experiment we would want to show the mean and standard devation for each of the tanks! As we get more measurements, we can update the spreadsheet in our data folder, and just rerun the code you have written to update the results. That is the great thing about organized coding and “Reproducible Work”!

Time-Series Data

Now, you are ready to move on to a bigger data set to learn some more methods for calculating summary statistics and visualizing data.

We are extremely lucky that in 2017 and 2019, the University of Maine had a buoy in the New Meadows River off of Bombazine Island, which is quite close to Merritt Island and Jordan Kramer’s clam farm site! The buoy monitored a suite of environmental data from July-December of both years. While our field sampling is constricted to the winter/spring, this data can help provide insight into the biogeochemical conditions during the late summer and fall growing season - especially for the clams that were suspended in the water!

Let’s import this data into R - either under the “load data” section of your Lab 6 script, or at the beginning of your next analysis section. The data file is called ‘buoy.csv’, and is located in your data folder.

buoy <- read.csv("buoy.csv")

Wow - 6,724 observations! Let’s have a look at the structure of this new dataframe.

str(buoy)

Oof - date and time are being read as factor variables. That means R thinks each unique date is it’s own category. All the various date/time combos (and timezones!) are notoriously irritating to deal with in the coding world. Luckily in R there is a package called “Lubridate” that provides functions to “Make Dealing with Dates a Little Easier”. Let’s download and load the package:

library(lubridate)

Now - lubridate has a function mdy_hm() that tells R to read a date like ours as month/day/year hours:minutes. Let’s use it, and then check out the structure of our dataframe!

buoy$datetime_eastern <- mdy_hm(buoy$datetime_eastern)
str(buoy)

Hooray! “POSIXct format” means R is now interpreting the values in this variable as calendar dates/times, measured in the number of seconds since 1970. That means we can now plot datetime_eastern as a variable, and R will know it is a date and not a million different categories ordered alphabetically! Give it a try with the base-R plotting function - pick a bouy variable (y) to plot over time (x)! For some plots, you may wish to use the ylim argument to change the y-axis limits to exclude outliers.

plot(buoy$datetime_eastern, buoy$chl_ug_L)
plot(buoy$datetime_eastern, buoy$pH) 
plot(buoy$datetime_eastern, buoy$sal_psu) 
plot(buoy$datetime_eastern, buoy$temp_C)

Cool! Real data from two years in the New Meadows River. . . Can you already see crudely that there are some seasonal patterns, and also some differences between the years?

While we are thinking about time, let’s also make a new column for “month” and a new column for “year” to make it easy to subset data over different timescales. The lubridate package also has two easy functions for this:

buoy$year<- year(buoy$datetime_eastern) 
buoy$month <- month(buoy$datetime_eastern)

OK! Now we are set up to calculate means and standard deviations for the conditions in different months in both years.

Calculating Summary Statistics and Visualizing Data

You could write your own script to calculate means and standard deviation, but the nice thing about R is that in addition to packages, users in the community have already written many great, freely-available scripts that may be useful to your research. summarySE() is a script developed by users in the R community that provides a nice shortcut to calculate the mean, standard deviation, standard error and 95% confidenceinterval for subsets of data. To open the summarySE() script, go to File>Open File>scripts and select ‘summarySE()’. It should open in a new tab in RStudio. (You can see in the script that the authors have commented their code with explanations of what everything is doing - should you want to modify it down the road.) Run the script by selecting all and clicking “Run”.

I have no idea what this is, is there a file called “summarySE()” in the microwave??

You will want to make a comment in your Lab6 script to remind yourself to run summarySE(), or actually copy and paste the code into your script for future use.

After running the script, you will see a new category called “Functions” appear in the R Studio Environment window, with the summarySE function and the arguments that it takes.

Now - let’s use the summarySE function to create a new data table with the summary statistics for one of our variables of interest - like the pH over time in the New Meadows River. Let’s call the new dataframe pH_stats. The “measurevar” argument indicates which column of which dataframe you want the mean and standard deviation to be calculated from (pH from the buoy dataframe, in this case). The “groupvars” designates the groups within which you want to analyze the variable of interest. In this case, we are interested in grouping BOTH by Month and by Year (e.g., we want to find the mean for July/2017 and July/2018 - not just all of July.

pH_stats <- summarySE(data=buoy, measurevar = "pH", groupvars=c("year","month"))

Open the new dataframe to consider these summary statistics.

  1. “N” is the number of samples for each category, and you can see there are not an equal number of measurements for each month. That’s because the buoy went into the water at different times, and some months are shorter than others.

  2. pH is the mean pH for that year/month subset of data, and sd is the standard deviation. If there are extreme outlier data points (for example, sensor broken and then repaired) these will be reflected in the standard deviation and we may want to investigate/exclude.

Beautiful, Customizable Plotting

Now it’s time to make some nicer, publication-quality plots. For this we will use a package called ggplot2. The “gg” stands for the “grammar of graphics”, which was developed to be a beautiful and flexible way to visualize data by R “superstar” developer Hadley Wickham. If you find you enjoy the logic of ggplot, he has gone on to create additional powerful packages, collectively known as the “tidyverse”, for cleaning and visualizing data.

Load ggplot - add this code to the Section 2 “Libraries” chunk of your script.

install.packages("ggplot2")
library(ggplot2)

In our case, we will would like to plot the monthly mean pH in the New Meadows River, using the means and standard errors we calculated in the “pH_stats” summary statistics dataframe. The x-axis will be month (we are thinking about this as a category), and the y-axis will be mean pH for that month. But we also want to distinguish between months from 2017 vs months from 2019, so we want to color our points by the year they belong to (meaning that year will be another category).

Looking at our pH_stats dataframe, however, year and month are being read as numbers, not categorical factors. So let’s change that!

pH_stats$month <- as.factor(pH_stats$month) 
pH_stats$year <- as.factor(pH_stats$year)

Alrighty - let’s make our first ggplot!

All plots begin with the ggplot function, which first needs to be told what dataframe you want to use. This is followed by the “aesthetics” (aes()) of the plot - at a minimum, what columns from your specified dataframe should be x and y. We will also add an argument about how to color-code the data.

Next, add (+) layers to your plot, providing a unique layer and argument for each thing you want to add, like:

  1. Do you want your data displayed as lines (geom_line()), points (geom_point()), bargraphs (geom_bar()), box and whisker plots(geom_boxplot()), etc?

  2. Do you want to add error bars? Where should R find that data for the max and mins? (In our case, from the standard deviation sd column in our pH_stats dataframe)

  3. Do you want to add axis labels? (Of course!)

  4. Do you want to add a title? (Maybe…)

ggplot(pH_stats, aes(month, pH, color=year))+
geom_point(position=position_dodge(0.1))+
geom_errorbar(aes(ymin=pH-sd, ymax=pH+sd), width=0.2, position=position_dodge(0.1))+ xlab("Month")+
ggtitle("New Meadows River, Monthly Mean pH")

The “position dodge” arguments in our geom_point() and geom_errorbar() layers offset the two year’s means from eachother, so they aren’t blocking eachother if they have the exact same value. You might also prefer to change the x axis label to “Month”, which can easily be done by adding the layer + xlab(“Month”) to your plot.

To save your figure, click on the Export button in the Plots tab. Save it to the output folder in your project.

“All models are wrong, but some are useful”

You probably noticed there is only ONE of our carbonate chemistry unknowns in the buoy dataset (pH), and we need at least TWO (typically measured - alkalinty or DIC) to solve for our carbonate chemistry unkowns. Sadly these are not easily measured by sensors and so we don’t have the same kind of time-series data for them.

However, we do know from the literature that alkalinity is often strongly correlated to salinity in coastal environments. And last summer there was a Gulf of Maine-wide effort called “Shell Day” to measure alkalinity and a suite of environmental variables in seawater up and down the coast to see if there are any strong relationships that can be used to predict alkalinity. Michele’s lab has been involved in the alkalinity measurements, as as such we are going to have a sneak peak of that data! We have provided a reduced datasheet showing high-tide data salinity and alkalinity measurements.

Load the data into R and plot it using ggplot. . .

shellday <- read.csv("shellday_TA.csv")

To begin with, let’s see if Salinity and Alkalinity are correlated in this dataset using the function cor.test(). Correlation determines if one variable varies systematically as another variable changes. It does not specify that one variable is the dependent variable and the other is the independent variable. Often, it is useful to look at which variables are correlated to others in a data set, and it is especially useful to see which variables correlate to a particular variable of interest. As you write your lab report, you may be interested in looking into whether correlations exist among other the variables in the Buoy dataset.

cor.test(shellday$Salinity, shellday$TA)

Correlations vary from -1 to 1, with 0 indicating no association. A strong positive correlation, like we observe with r=0.89, indicates that both increases in salinity and alkalinity are associated with one another in this dataset. The degrees of freedom reflects our sample size, and is used to help calculate the p-value (more samples -> less chance of random association). Traditionally most people use a cutoff of p<0.05 for statistical significance - you can see our p-value is very much lower than 0.05, indicating highly significant correlation! If you were reporting this result, you would say that on Shell Day, high tide salinity and alkalinity measurements were significantly, strongly positively correlated (Pearson’s product-moment correlation, r=0.89, p<0.001). Now, let’s visualize this data with a ggplot, plotting the data as points.

ggplot(shellday, aes(Salinity,TA))+ 
  geom_point()+
  ylab("Total Alkalinity (umol/kg)")+ 
  xlab("Salinity (PSU)")

As our correlation indicated, there does appear to be a linear relationship. You can add the linear regression with just one more line of code to your ggplot - the geom_smooth() function, with the method calling for ‘lm’ (a ‘linear model’). This plots the regression in blue and the 95% confidence interval around the line in gray.

ggplot(shellday, aes(Salinity,TA))+ 
  geom_point()+ 
  geom_smooth(method=lm)+
  ylab("Total Alkalinity (umol/kg)")+ 
  xlab("Salinity (PSU)")

Now, let’s calculate the equation of that line such that we can predict alkalinity based on salinity. The linear model function in R takes the variables in the format: lm(response variable ~ predictor variable, data). By saving your linear model as an object (let’s call it alk for simplicity), you can get detailed information about the model’s performance using the summary() command.

alk <-lm(TA~Salinity, data=shellday) 
summary(alk)

What does this model output mean?

  1. Formula call. The first item shown in the output is the formula R used to fit the data: the response variable (TA) and the predictor variable (Salinity), together with the data being used (shellday).

  2. Residuals. Residuals are essentially the difference between the actual observed response values (Total Alkalinity measured in the lab, in our case) and the response values that the model predicted. The Residuals section of the model output breaks it down into 5 summary points. When assessing how well the model fit the data, you should look for a symmetrical distribution across these points on the mean value zero (0). In our example, we can see that the distribution of the residuals do not appear to be strongly symmetrical. That means that the model predicts certain points that fall far away from the actual observed points. We could take this further consider plotting the residuals to see whether this normally distributed, etc. but will skip this for this example because with the wide range of waterbodies included in our dataset we expect to have some outliers.

  3. Coefficients. In a simple linear regression (y=mx+b), the coefficient estimates are two unknown constants that represent the slope (m) and intercept(b) terms in the linear model. So for our model, Total Alkalinity increases by 52.183 umol/kg with each increase in PSU. The standard error measures the average amount that the coefficient estimates vary from the actual average value of the response variable. Ideally, we want this to be a lower number relative to our estimate. The standard error can be used to compute confidence intervals and statistically test the hypothesis that a relationship exists between alkalinity and salinity. The t values are how many standard deviations away our coefficient estimate is from zero, and are used to compute the p-value. A small Pr(>|t|) or p-value indicates that it is unlikely we will observe a relationship between the predictor (salinity) and response variable (TA) due to chance - we have very small/highly significant p-values associated with both our intercept and slope, indicating we can conclude there is a linear relationship between salinity and alkalinity.

  4. Multiple R-squared. This is a proportion of variance statistic providing a measure of how well the model is fitting the data. It always lies between 0 and 1, with 0 representing a regression that does not explain the variance in the response variable, and 1 being a perfect fit of the data to the line. With an Rˆ2 of 0.8, we can say that in our model 80% of variance observed in TA can be explained by our predictor variable, Salinity. If you add more than one predictor variable to your model, the Adjusted Rˆ2 reflects the additional variables improve your model or not.

  5. F-statistic. Another indicator of whether there is statistically significant relationship between our variables, which takes into account the number of data points (degrees of freedom). The further it is from 1, the better it is. You can see ours is also highly significant.

So - now we have a model predicting TA (y) based on Salinity (x).

y = 52.183x+428.281

Let’s apply this to our buoy data and model TA for the New Meadows River! Make a new column called TA in our buoy dataframe - the contents should equal 52.183*(salinity)+428.281.

buoy$TA <- 52.183*buoy$sal_psu + 428.281

Excellent! Let’s see the range of alkalinity values generated by this equation. The summary() command can be run on a single variable or entire dataset to see the minimum, median, mean and max values for a variable.

summary(buoy$TA)

Our minimum indicates there are some values of 0 in our salinity measurements, which are likely errors with the sensor which we may want to exclude. In general, our range in TA is small, reflecting the relative stability of salinity in the New Meadows River time series.

You can check on how well the model performs with our data from the field by entering our field salinity value from Site 2 into the equation!

52.183*31.8+428.281

The predicted alkalinity is a little bit lower than what we measured in lab for that particular seawater sample (2230 umol/kg). So what caveats might you discuss when applying/interpreting this model? Think about the data it was based on.

Putting it all together

Ok - let’s use seacarb to calculate the saturation state with respect to aragonite, using our modeled TA and measured pH, temperature, and salinity from the buoy!

Remember - the seacarb carb() function requires alkalinity in units of mol/kg, so we have to make a second alkalinity variable in the correct units.

buoy$Alk_mol <- buoy$TA/1000000

Now let’s solve for the carbonate chemistry parameters! Use the same code as before, but provide the buoy dataset variables.

buoy_carb <- carb(flag=8, var1 = buoy$pH, var2=buoy$Alk_mol, S=buoy$sal_psu, T=buoy$temp_C)

Nice! We now have carbonate variables for all 6,724 buoy observations. There is a warning that some observations had temperatures/salinities outside the ranges given in seacarb equations (you can see what the ranges are by looking up Ks or Kspa in help). Way too many numbers to look through visually, like we did before. So let’s extract OmegaAragonite from the seacarb results, and add it back to the buoy dataframe, to calculate some summary statistics just like we did before.

buoy$OmegaA <- buoy_carb$OmegaAragonite
omega_stats <- summarySE(data=buoy, measurevar = "OmegaA", groupvars=c("year","month"))

Look at your new omega_stats dataframe. Notice that year and month are once again numeric variables, so be sure to tell R to make these categorical factors before you plot the data.

omega_stats$month <- as.factor(omega_stats$month) 
omega_stats$year <- as.factor(omega_stats$year)

Now return to your ggplot code for mean pH over time. Change the dataframe and y variable arguments to now plot mean omega over time, with error bars!

ggplot(omega_stats, aes(month, OmegaA, color=year))+
  geom_point(position=position_dodge(0.1))+
  geom_errorbar(aes(ymin=OmegaA-sd, ymax=OmegaA+sd), width=0.2, position=position_dodge(0.1))+
  ylab(expression(Omega))+
  ggtitle("Monthly Mean Saturation State with Respect to Aragonite")

In a figure like this, you might want to add in a line to indicate a saturation state “threshold” below which calcifying is more difficult for clams. Our Green et al. paper suggested “death by dissolution” for larval clams when omega was < 1.6, or you might have another value from the literature you would like to use. It’s easy to add a horizonal line (+geom_hline()) or vertical line (+geom_vline) to a ggplot as another layer. Below are some examples for changing line color and type, too.

# Add horizontal line at y = 1.6
+ geom_hline(yintercept=1.6)
# Change line type and color
+ geom_hline(yintercept=1.6, linetype="dashed", color = "red") # Change line size
sp + geom_hline(yintercept=1.6, linetype="dashed", color = "red", size=2)

SAVE in your code!!! This will also save your project and prevent you from needing to rerun all your code later!