R for Absolute Beginners

Hands-on Exercises

This tutorial uses a dataset retrieved from the study Reversal of ocean acidification enhances net coral reef calcification by Albright et al., 2016.

Exercise 0. Understand the context of your data (15 min)

Please read the following introduction.

A recent in situ experiment (Albright et al., 2016) has found that reducing the acidity of the seawater surrounding a natural coral reef in the southern Great Barrier Reef significantly increases calcification. In this experiment the acidity levels have been reverted to those characteristic of the pre-industrial era. By “turning back time”, the authors demonstrate that, all else being equal, net coral-reef calcification would have been around 7% higher than current observations, suggesting that ocean acidification may already be diminishing coral-reef growth.

Figure 1 | One Tree Reef in the southern Great Barrier Reef, Australia (Janice M. Lough, 2016).

One Tree Reef encloses three lagoons, two of which are hydrologically distinct (i.e., separated by reef walls). At low tide, the water level drops below the outer reef crest, and the lagoons are effectively isolated from the ocean (Figure 2c). Since First Lagoon sits approximately 30 cm higher than Third Lagoon, gravity-driven, unidirectional flow results from First Lagoon over the reef flat separating the two lagoons, ending up in Third Lagoon. The study site is situated along a section of the reef wall separating First and Third Lagoons (Figure 2d).

Figure 2 | One Tree Reef in the southern Great Barrier Reef, Australia (Albright et al., 2016).

Exercise 1. Get the data (10 min)

Download the Supplementary Table 1 file containing the raw data for chemical and physical parameters measured (or calculated) for all days and station locations. Save it in a directory of your choice, and please keep the file’s original name.

Exercise 2. Format conversion (10 min)

Open the downloaded file (should be named nature17155-s2.xlsx) with a spreadsheet software program (e.g. Microsoft Excel or OpenOffice Calc) and export it as CSV (comma separated value). Save the exported file in the same folder and name it nature17155-s2.csv.

Exercise 3. Set working directory (15 min)

Open RStudio (if you have not done so already) and from the R console run a command (or a combination of commands) that confirms that the file nature17155-s2.csv is “visible” from R. If the working directory is not the one containing nature17155-s2.csv, then change to it accordingly.

Hint: the functions getwd, setwd, list.files, dir and file.exists are your friends.

Exercise 4. Checking the data file structure (10 min)

The file nature17155-s2.csv has been saved as a CSV, which is a particular type of text file, where each value (i.e. column) is separated by a comma character (,). However, it is possible to have a few variations, the most relevant being: (i) the specific character used as value separator (e.g. comma, semi-colon (;), tab); (ii) whether values are quoted (**“**), (iii) whether the first line is a header (i.e. column names) or not. These details are decisive in order to correctly import the data into R.

So, in order to have a glance at how the data are formatted/organized in the dataset file nature17155-s2.csv, read (i.e. show) the first 3 lines in the R console:

readLines("nature17155-s2.csv", n = 3)

[1] "Station ID,Transect,Date,Type,T (C) in situ,Salinity,Alkalinity (umol/kg), Rhodamine (ppb),..."
[2] "D-16,Down,20140916,Control ,22.507,35.8542,2280.58,0.0507,8.1131,298.44,2273.95,..."
[3] "D-16,Down,20140917,Experiment ,22.995,35.8287,2253.42,0.1940,8.1206,298.4,2248.47,..."

Exercise 5. Load data into R (20 min)

From the output of the last command one can see that the comma character is indeed separating the different columns. Notice however that data-fields with text containing commas are quoted, i.e. enclosed in quotation marks, so that those free-text-commas are not mistakingly used as column separators. Additionally, notice that the first line is the header of the dataset (column names) and not an observation/data point.

Exercise 5.1

Now, import the dataset using the function read.table with appropriate arguments, and save it in an R object named reef:

reef <- read.table("nature17155-s2.csv", header = TRUE, sep = ",", strip.white = TRUE)
# the strip.white argument removes trailing white spaces (spaces in the
# begining and end of each column)

Exercise 5.2

Next, let us inspect some of the imported data which has been saved in the variable reef.

class(reef)
head(reef)  # inspect the first lines of reef
tail(reef)  # inspect the last lines of reef

Exercise 6. Inspecting an R data frame (20 min)

The imported data has been loaded as a data frame having several columns, such as Station.ID, Transect, Date, etc.. Notice that special characters like white spaces or parenthesis in the column names have been converted by R to dots (.).

Please note that in RStudio, data frames can be graphically inspected; by clicking on its name in the environment panel, a new tab opens in the text editor panel, showing its first 1000 lines; so try that too!

Examine other relevant information about the reef data frame.

# Dimensions
dim(reef)

[1] 526  16

# Column names
colnames(reef)

 [1] "Station.ID"                    "Transect"                     
 [3] "Date"                          "Type"                         
 [5] "T..C..in.situ"                 "Salinity"                     
 [7] "Alkalinity..umol.kg."          "Rhodamine..ppb."              
 [9] "Spec.pH..total."               "T..K..Spec.pH"                
[11] "Alk..S.Normalized..umol.kg."   "Rhodamine..S.Normalized..ppb."
[13] "in.situ.pH..total."            "in.situ.pCO2..uatm."          
[15] "in.situ.CT..umol.kg."          "in.situ.aragonite"

# Row names
head(rownames(reef))

[1] "1" "2" "3" "4" "5" "6"

# To find more details about the reef object
str(reef)

'data.frame':   526 obs. of  16 variables:
 $ Station.ID                   : Factor w/ 24 levels "D-1","D-10","D-16",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Transect                     : Factor w/ 2 levels "Down","Up": 1 1 1 1 1 1 1 1 1 1 ...
 $ Date                         : int  20140916 20140917 20140918 20140919 20140920 20140921 20140924 20140925 20140926 20140927 ...
 $ Type                         : Factor w/ 2 levels "Control","Experiment": 1 2 2 2 1 2 2 2 2 2 ...
 $ T..C..in.situ                : num  22.5 23 23 24.5 24.6 ...
 $ Salinity                     : num  35.9 35.8 35.8 35.9 35.9 ...
 $ Alkalinity..umol.kg.         : num  2281 2253 2245 2189 2186 ...
 $ Rhodamine..ppb.              : num  0.0507 0.194 0.6468 0.3877 0.1912 ...
 $ Spec.pH..total.              : num  8.11 8.12 8.16 8.13 8.18 ...
 $ T..K..Spec.pH                : num  298 298 298 299 299 ...
 $ Alk..S.Normalized..umol.kg.  : num  2274 2248 2243 2182 2178 ...
 $ Rhodamine..S.Normalized..ppb.: num  0.0505 0.1935 0.6462 0.3864 0.1905 ...
 $ in.situ.pH..total.           : num  8.15 8.15 8.18 8.15 8.2 ...
 $ in.situ.pCO2..uatm.          : num  287 284 260 276 241 ...
 $ in.situ.CT..umol.kg.         : num  1932 1903 1879 1834 1801 ...
 $ in.situ.aragonite            : num  3.78 3.79 3.95 3.82 4.12 3.92 3.6 3.75 3.39 3.13 ...

Exercise 7. Tidying-up the data (50 min)

From the output of the previous commands it can be seen that there are 16 variables (columns). Each row refers to an observation. In this context, observations correspond to sampling stations where sets of measurements were taken in the reef-flat study area.

The first column of the reef data frame is the Station.ID, an ID reference that identifies each location. This ID is composed of two parts: (i) the first character, either U (referring to the upstream transect) or D (for downstream transect); (ii) the following chars indicate the position (in metres) of the sampling location relative to the tank. Since this information is pivotal for later analyses, it is useful to save the station positions in its own column, formatted as a numeric vector.

The following exercises (7.*) show one possible way of transforming the Station.ID column into a station position vector: (i) spliting the string in two parts, (ii) extracting the position (second part) as a numeric vector and (iii) creating a new data frame (reef2) containing all original reef data plus the position as a new column. Try it for yourself, and make sure that you understand all the steps and code involved.

Exercise 7.1

Extract the Station.ID column as a character vector.

station.id <- as.character(reef$Station.ID)

Exercise 7.2

Use the strsplit function to split the ID string into its two relevant parts. The split argument indicates which characters are to be used to split the string. Please note that the characters used for the splitting are omitted (removed) leaving an empty string (“”).

split.result.list <- strsplit(station.id, split = "[U,D]")

Exercise 7.3

For each element in station.id we obtained two strings: the empty string "" and a string with the position in meters. Since strsplit returns a list, we will unlist it to converts it to a character vector.

split.result.vector <- unlist(split.result.list)

Exercise 7.4

Next we must remove the empty strings "" in order to get the positions nicely arranged in a single vector.

station.position <- split.result.vector[split.result.vector != ""]

Exercise 7.5

Since the positions are distances in meters, and we would like to use those values for future calculations, we must convert them from characters (text) to numeric.

station.position <- as.numeric(station.position)

Exercise 7.6

Finally, to include the new vector in the reef data frame, we will create a new data frame (reef2) by combining columns (cbind) of the reef data frame with the newly created column (named Station.Position) as column number 2, followed by all other columns from the reef data frame (removing column 1 which is already present in position 1).

# create the new reef2 data frame by binding the first column of reef, with the
 # station.position vector and the rest of the reef data frame (without the 1st column)
reef2 <- cbind(reef[, 1], station.position, reef[,-1])

# check the column names attribute
names(reef2)

 [1] "reef[, 1]"                     "station.position"             
 [3] "Transect"                      "Date"                         
 [5] "Type"                          "T..C..in.situ"                
 [7] "Salinity"                      "Alkalinity..umol.kg."         
 [9] "Rhodamine..ppb."               "Spec.pH..total."              
[11] "T..K..Spec.pH"                 "Alk..S.Normalized..umol.kg."  
[13] "Rhodamine..S.Normalized..ppb." "in.situ.pH..total."           
[15] "in.situ.pCO2..uatm."           "in.situ.CT..umol.kg."         
[17] "in.situ.aragonite"

# assign meaningfull column names to the first two columns
names(reef2)[c(1,2)] <- c("Station.ID","Station.Position")
# check the final column names
names (reef2)

 [1] "Station.ID"                    "Station.Position"             
 [3] "Transect"                      "Date"                         
 [5] "Type"                          "T..C..in.situ"                
 [7] "Salinity"                      "Alkalinity..umol.kg."         
 [9] "Rhodamine..ppb."               "Spec.pH..total."              
[11] "T..K..Spec.pH"                 "Alk..S.Normalized..umol.kg."  
[13] "Rhodamine..S.Normalized..ppb." "in.situ.pH..total."           
[15] "in.situ.pCO2..uatm."           "in.situ.CT..umol.kg."         
[17] "in.situ.aragonite"

Exercise 8. Exploring the data (1h30m)

Exercise 8.1

This reef experiment is a case-control study. How many observations (rows) are there for Control and Experiment days?

(Hint: reef2$Type contains dates which are Control or Experiment days; the table function discussed yesterday can be useful for counting).

(Answer: There are 166 and 360 observations for Control and Experiment days, respectively.)

Exercise 8.2

What is the time interval for this study?

(Hint: The column Date contains this information; min, max and/or range functions might help.)

(Answer: The study took place between 16/09/2014 and 10/10/2014.)

Exercise 8.3

This study comprises Control days (when no alkalinity is added to the solution pumped to the reef flat) and Experiment days (when 600 gram of NaOH is added). From the time interval of the study, how many days were “Control days”, and how many days were “Experimental days”?

(Hint: unique and length functions will be useful. The Date and Type columns are pivotal).

(Answer: 7 were control days and 15 were experimental days.)

Exercise 8.4

Measurements were taken along two transects: upstream and downstream of the reef flat (Figure 3). Compare the spreading of the locations of the sampling stations up- and downstream of the reef flat.

Figure 3 | Sampling stations’ locations (blue circles).

Are the two spreads alike?
Do you think there is an experimental design reason that explains this difference?
The standard deviation and the inter-quartile range seem to give contradictory results on which transect has its sampling stations more spread out. Which metric, in your opinion, best reflects your visual impression of spread?

# Assessing the spread of stations' positions at the upstream transect by
# looking at the standard deviation, inter-quartile range and range

# Upstream transect
up.pos <- unique(reef2$Station.Position[reef2$Transect == "Up"])
# mean position of the upstream transect stations' positions
mean(up.pos)

[1] 0

# standard deviation of the upstream transect stations' positions
sd(up.pos)

[1] 8.284021

sqrt(var(up.pos))  # the standard deviation is the square root of the variance!

[1] 8.284021

# inter-quartile range of the upstream transect stations' positions
IQR(up.pos)

[1] 3

# range of the upstream transect stations' positions
range(up.pos)

[1] -16  16

# Downstream transect
dn.pos <- unique(reef2$Station.Position[reef2$Transect == "Down"])
# mean position of the upstream transect stations' positions
mean(dn.pos)
# standard deviation of the downstream transect stations' positions
sd(dn.pos)
# inter-quartile range of the downstream transect stations' positions
IQR(dn.pos)
# range of the downstream transect station positions
range(dn.pos)

Answer: The spread, as measured by the standard deviation $\sigma$, is surprisingly similar: upstream is $\sigma$~8.3 and downstream is $\sigma$~7.96. Accordingly, judging by the standard deviation alone, one might think that the sampling stations would be slightly more spread out along the upstream transect. However, judging from the picture, this contradicts our intuition, since the majority of upstream stations are very close to the centre. This could be explained by the fact that the standard deviation is known to be very sensitive to outliers; however, both transects have “outlier” stations at the edges: in positions -16 and 16 metres, as one can observe from the output of range. For this case, the inter-quartile range (IQR) proves to be a more robust metric (IQR=3 upstream; IQR=8 downstream), working best at mathematically describing our intuition when observing Figure 3. This difference in spread between the two transects was probably a choice taken during the experimental design phase, reflecting the antecipated mixing and dilution of the solution as it flowed from upstream to downstream. Therefore it made sense to concentrate the sampling effort close to the source (upstream) and spread it out more at the downstream transect.

Exercise 8.5

summary is a very useful function that outputs the summary statistics for an R object. Try it on a subset of the reef2 data frame: run summary for the variables: Date, Type, Station.Position and Transect. The output should look like this:

      Date                  Type     Station.Position    Transect  
 Min.   :20140916   Control   :166   Min.   :-16.00000   Down:328  
 1st Qu.:20140921   Experiment:360   1st Qu.: -3.75000   Up  :198  
 Median :20140928                    Median :  0.00000             
 Mean   :20140960                    Mean   : -0.01141             
 3rd Qu.:20141005                    3rd Qu.:  3.00000             
 Max.   :20141010                    Max.   : 16.00000

Appreciate how the output is differently presented for Date and Station.Position compared to Type and Transect. Why is it differently presented?

Exercise 9. Exploring the data graphically (1h30m)

By default, R base alone allows the plotting of several, highly customizable graphics. There are however many graphical packages developed by the community that greatly expand its plotting potential (e.g., ggplot2). Nevertheless, in this tutorial we will focus only on a few of the most common plots that can be generated with functions included in the base installation of R.

R graphics are created using a series of high- and low-level plotting commands. High-level commands create new plots via functions such as plot, hist, boxplot, or curve, whereas low-level functions add to an existing plot created with a high-level plotting function; examples are points, lines, text, axis, arrows, etc..

Graphical parameters are customizable via the function par, containing over 70 different customizable fields (for details, see ?par). In this exercise we will look into a few of the common plotting functions: plot, hist, boxplot and curve; as well as several parameters that allow you to tweak the look ’n feel of your graphics.

Exercise 9.0 Introduction

Ocean acidification is the ongoing decrease in the pH of the Earth’s oceans, caused by the uptake of carbon dioxide (CO$_2$) from the atmosphere. Seawater is slightly alkaline (pH > 8), and this acidification is a shift towards less alkaline conditions rather than acidic conditions (pH < 7). An estimated 30–40% of the carbon dioxide from human activity released into the atmosphere dissolves into oceans, rivers and lakes. To achieve chemical equilibrium, some of it reacts with the water to form carbonic acid. Some of these extra carbonic acid molecules react with a water molecule to give a bicarbonate ion and a hydronium ion, thus increasing ocean acidity (H+ ion concentration).

Figure 4 | Ocean acidification and the resulting reduction in carbonate ions (climatecommission.angrygoats.net).

Aragonite is a carbonate mineral, one of the two common, naturally occurring, crystal forms of calcium carbonate, CaCO$_3$ (the other form being the mineral calcite). CaCO$_3$ saturation state $\Omega_{\text{arag}}$ was one of the chemical parameters measured at the sampling stations.

The saturation state of seawater with respect to aragonite can be defined as the product of the concentrations of dissolved calcium and carbonate ions in seawater divided by their product at equilibrium:

\[\Omega_{\text{arag}} = \frac{[\text{Ca}^{2+}] [\text{CO}_{3}^{2-}]}{[\text{CaCO}_3]} \]

Exercise 9.1 Scatterplots: Basic plotting with `plot`

Lets see if the Albright et al. study recapitulates the reported effect of acidic conditions leading to lower levels of CaCO$_3$ (corresponding to a lower $\Omega_{\text{arag}}$). To this end we will plot the aragonite saturation state $\Omega_{\text{arag}}$ (reef2$in.situ.aragonite) vs pH (reef2$in.situ.pH..total.).

# xvalues: pH
xvalues <- reef2$in.situ.pH..total.
# yvalues: aragonite saturation state
yvalues <- reef2$in.situ.aragonite

plot(xvalues, yvalues)

From the generated plot it is clear that the two variables are indeed correlated. The higher the pH, the higher the $\Omega_{\text{arag}}$. Notice how the axes’ labels were automatically set based on the name of the variables passed as arguments to the plot function.

To change the axes’ labels, you may specify them explicitly by setting plot’s arguments: xlab and ylab.

Generate the following plot:

Exercise 9.1.1 More parameters for `plot`

There are many parameters that allow you to customise your plot (see ?par). Here are some of the most commonly used:

Argument	Description
`main`	an overall title for the plot
`type`	what type of plot should be drawn: `"p"` points, `"l"` lines, `"n"` no plotting (see `?plot`)
`sub`	a sub title for the plot
`xlab`	a title for the x axis
`ylab`	a title for the y axis
`asp`	the y/x aspect ratio
`cex`	plotting text and symbols magnification factor relative to the default
`cex.axis`	magnification to be used for axis annotation relative to the current setting of cex
`axes`	whether to draw axes (`TRUE`) or not (`FALSE`)
`xlim`	x axis range (should be a vector of two numbers: xmin and xmax, respectively)
`ylim`	y axis range (should be a vector of two numbers: ymin and ymax, respectively)
`pch`	either an integer or a single character to be used as the defaults symbol in plotting points (see `?points`)
`col`	set plotting color of each point (see named colors with `colors()`)

Note: The demo(“graphics”) command shows examples of available plots in R, together with the R code that can be used to generate it. The colors () command shows the names of the available colors.

Try them out! Start by adding a main title, changing the type of points and their color.

Here is a contrived example using many parameters at once.

# define logical vector based on experiment type: TRUE ("Control"), FALSE ("Experiment")
type.logical <- reef2$Type == "Control"
# check if "violet" is a named color: "violet" %in% colors()
# define colors according to experiment type (either "Control" or "Experiment")
plot.colours <- ifelse(type.logical, "orange", "violet")
# define type of symbol for plotting points (see more options with ?points)
plot.points <- ifelse(type.logical, 22, 1)

# draw contrived plot example
plot(xvalues, yvalues, xlab = "pH", ylab = "aragonite saturation state",
     main = "Arag Sat. State vs pH", sub = "an R for Absolute Beginners Contrived Plot Example",
     xlim = c(7.5, 8.5), ylim = c(1, 7), cex = 1.5, cex.axis = 1.5, pch = plot.points,
     col = plot.colours)

9.2 Histograms and Boxplots

The function hist () shows the frequency (number of occurrences) of each observation; and the function boxplot () shows the distribution of the occurrences in each category (agegp, alcgp and tobg).

# basic histogram, with labels, title and bars each with a different color
# (rainbow function), using ~ 20 breaks
hist(reef2$in.situ.aragonite, breaks = 20, xlab = "Aragonite", main = "Aragonite Histogram", 
    col = rainbow(20))

# basic boxplot of the cases per age group
boxplot(reef2$in.situ.aragonite ~ reef2$Type, main = "Aragonite in Controls and Experiments", 
    border = "gray", lwd = 1, col = c("orange", "green"))

9.3 Curves

These are continuous plots (usually of known statistical distributions, like the Gaussian (dnorm), gamma, beta, etc). Here we will see how to add lines and text to the plot (in specific locations/coordinates), as well as an extra axis on top with a different color.

# multiple normal distribution curves, different mean and sd, and plot them
# in the same plot (add = TRUE)
curve(dnorm, from = -3, to = 5, lwd = 2, col = "red")
curve(dnorm(x, mean = 2), lwd = 2, col = "blue", add = TRUE)
curve(dnorm(x, mean = -1), lwd = 2, col = "green", add = TRUE)
curve(dnorm(x, mean = 0, sd = 1.5), lwd = 2, lty = 2, col = "red", add = TRUE)
# add a vertical line at the mean of the standard 'red' distribution
lines(c(0, 0), c(0, dnorm(0)), lty = 1, col = "red")
# add free text to the plot, in coordinates x=4, y=0.2
text(4, 0.2, "Gaussian distributions")
# add extra axis, on top (side 3), from -3 to 5, with tick-marks from -3 to
# 5, and colored violet
axis(3, -3:5, seq(-3, 5), col.axis = "violet")

9.4 Multiple Graphs

Different types of graphs can be combined in the same plotting area. Start by trying to plot an histogram of the temperature (in Kelvin) together with the gaussian distribution that best fits it (with appropriate mean and standard deviation).

# plot the histogram
hist(reef$T..K..Spec.pH, col = "red", xlab = "Temp (K)", freq = F, main = "Hist with Normal curve")
# calculate the x and y values
temp.x <- seq(min(reef$T..K..Spec.pH), max(reef2$T..K..Spec.pH), length = 100)
temp.y <- dnorm(temp.x, mean = mean(reef2$T..K..Spec.pH), sd = sd(reef2$T..K..Spec.pH))
# plot the normal curve using the function lines
lines(temp.x, temp.y, col = "blue", lwd = 2)

Exercise 9.4.1 Arranging several plots in one page

To create a page with several plots located in side-by-side panels, we can use the function par with one of the following parameters: par(mfrow=c(r,c)) or par(mfcol=c(r,c)). mfrow adds images per line, from left to right, and mfcol adds per column, from top to bottom.

Make sure you understand how mfrow parameter is specifying the order by which the plots fill up the layout.

# make a 2 by 2 array of plot panels
# fill up row by row
par(mfrow = c(2,2))
# create a new plot, type="n" means plot none
# first plot
#plot(c(0.5),type="n", axes = FALSE, ann=FALSE)
plot.new()
text(0.5, 0.5, "1", cex = 5)
box()
# second plot
plot.new()
#plot(c(0.5),type="n", axes = FALSE, ann=FALSE)
text(0.5, 0.5, "2", cex = 5)
box()
# third plot
plot.new()
#plot(c(0.5),type="n", axes = FALSE, ann=FALSE)
text(0.5, 0.5, "3", cex = 5)
box()
# fourth plot
plot.new()
#plot(c(0.5),type="n", axes = FALSE, ann=FALSE)
text(0.5, 0.5, "4", cex = 5)
box()

Here is the same example but with mfcol.

# make a 2 by 2 array of plot panels
# fill up column by column
par(mfcol = c(2,2))
# create a new plot, type="n" means plot none
# first plot
plot(c(1),type="n", axes = FALSE, ann=FALSE)
text(1, 1, "1", cex = 5)
box()
# second plot
plot(c(1),type="n", axes = FALSE, ann=FALSE)
text(1, 1, "2", cex = 5)
box()
# third plot
plot(c(1),type="n", axes = FALSE, ann=FALSE)
text(1, 1, "3", cex = 5)
box()
# fourth plot
plot(c(1),type="n", axes = FALSE, ann=FALSE)
text(1, 1, "4", cex = 5)
box()

Once terminated the panel plots, we must revert the graphical parameters to its default values, so that we can go back to plotting one chart per page.

# reset the graphical display parameters to 1 row and 1 column
par(mfrow = c(1, 1))

Now lets try to plot the real data from our tutorial dataset.

# set the graphical display parameters to 3 rows and 2 columns
par(mfrow = c(3, 2))  # mfrow adds plots per row, from left to right
# draw boxplots for experiments and controls, per each group
boxplot(reef2$Salinity ~ reef2$Type, xlab = "Salinity", border = "gray", lwd = 1, 
    col = c("violet", "magenta"))
boxplot(reef2$Alkalinity..umol.kg. ~ reef2$Type, xlab = "Alkalinity", border = "gray", 
    lwd = 1, col = c("yellow", "yellow2"))
boxplot(reef2$Spec.pH..total. ~ reef2$Type, border = "gray", xlab = "pH", lwd = 1, 
    col = c("green", "limegreen"))
boxplot(reef2$Rhodamine..ppb. ~ reef2$Type, border = "gray", xlab = "Rhodamine", 
    lwd = 1, col = c("blue", "lightskyblue"))
boxplot(reef2$T..K..Spec.pH ~ reef2$Type, border = "gray", xlab = "Temp (K)", 
    lwd = 1, col = c("tan", "tan4"))
boxplot(reef2$in.situ.pCO2..uatm. ~ reef2$Type, border = "gray", xlab = "pCO2", 
    lwd = 1, col = c("orange", "orange3"))
# add a title outside of the plotting area
title("Boxplots of Experiments and Controls", outer = TRUE, line = -2, cex.main = 2)

Exercise 9.5 Export and save plots

RStudio allows the visualization of the plots before exporting/saving them to an image file (i.e. a bitmap such as a jpeg or png which becomes pixelated when zoomed in), or as pdf or svg (vectorial formats that can be zoomed and stretched to infinity, without loosing image quality). However, pdf files don’t always export correctly, so they require some “trial and error” until one can get the “perfect” image.

To save plots you can quickly go to the Plots tab in the Workspace window and click on Export. However, this approach is not easily reproducible and can be time-consuming if the plots need to be generated many times. Alternatively, in order to save the plots programmatically, you can just wrap the plotting functions between two commands: pdf (or other depending on file format desired) and dev.off.

# pdf(...) or jpeg(...), png(...), svg(...), etc..

# Start graphics device
pdf("aragonite_ctrl_exp.pdf", width=7, height=5)
# draw boxplots for experiments and controls, per each group 
boxplot (reef2$in.situ.aragonite ~ reef2$Type, main="Aragonite in Controls and Experiments",
         border="gray", lwd=1, col=c("orange","blue"))
dev.off () # close graphics device (stop writing to file)

Save one of the previous plots as pdf and open it with your pdf viewer (e.g. Acrobat Reader).

Exercise 10. Extra study

Exercise 10.1 Write a function of your own

To assess the fraction of alkalinity taken up by the reef, a passive tracer, i.e. the non-reactive dye Rhodamine WT, was mixed with ambient sea water in the tank. Rhodamine WT concentration was then measured fluorometrically. Given that this measurement is temperature dependent, it needs to be corrected. The following formula provides this correction:

\[F_r = F_s e^{k (T_s-T_r)}\]

where $F_r$ and $F_s$ are the fluorescences at the reference and sample temperatures, $T_r$ and $T_s$ (in Kelvin), and $k=0.026$ per Kelvin (2.6% correction per Kelvin).

Challenge: Write a function named f.r that returns the $F_r$ value given as input $F_s$, $T_r$ and $T_s$. Also, include an argument in the function that allows changing the accepted temperature units from Kelvin (default) to Celsius.

Use your function f.r to calculate the temperature-corrected Rhodamine concentrations. Hint: $F_s$ is reef2$Rhodamine..ppb., $T_r$ and $T_s$ are T..C..in.situ and T..K..Spec.pH, respectively. Plot the calculated temperature-corrected Rhodamine concentrations versus reef2$Rhodamine..S.Normalized..ppb..

Exercise 10.2 Scatterplots with Error Bars

R base does not provide plots with error bars. However, it is easy (however laborious) to take advantage of the arrows function to mimic the same effect.

arrows(x, avg - sdev, x, avg + sdev, length = 0.05, angle = 90, code = 3)

In the code above x is a vector of x-positions, and avg-sdev and avg+sdev are vectors of the lower and upper y-positions of the error bars.

Using many of the elements you have seen today try to generate this set of two plots. See how these plots compare with those of Figure 2c-d from Albright et al., 2016.

END

R for Absolute Beginners - Part 2/2

Summary Statistics and Graphics in R

Authors: Isabel Duarte & Ramiro Magno | Collaborators: Bruno Louro & Rui Machado

31 May 2016

Introduction

Hands-on Exercises

Exercise 0. Understand the context of your data (15 min)

Exercise 1. Get the data (10 min)

Exercise 2. Format conversion (10 min)

Exercise 3. Set working directory (15 min)

Exercise 4. Checking the data file structure (10 min)

Exercise 5. Load data into R (20 min)

Exercise 5.1

Exercise 5.2

Exercise 6. Inspecting an R data frame (20 min)

Exercise 7. Tidying-up the data (50 min)

Exercise 7.1

Exercise 7.2

Exercise 7.3

Exercise 7.4

Exercise 7.5

Exercise 7.6

Exercise 8. Exploring the data (1h30m)

Exercise 8.1

Exercise 8.2

Exercise 8.3

Exercise 8.4

Exercise 8.5

Exercise 9. Exploring the data graphically (1h30m)

Exercise 9.0 Introduction

Exercise 9.1 Scatterplots: Basic plotting with `plot`

Exercise 9.1.1 More parameters for `plot`

9.2 Histograms and Boxplots

9.3 Curves

9.4 Multiple Graphs

Exercise 9.4.1 Arranging several plots in one page

Exercise 9.5 Export and save plots

Exercise 10. Extra study

Exercise 10.1 Write a function of your own

Exercise 10.2 Scatterplots with Error Bars

R for Absolute Beginners - Part 2/2

Summary Statistics and Graphics in R

Authors: Isabel Duarte & Ramiro Magno | Collaborators: Bruno Louro & Rui Machado

31 May 2016

Introduction

Hands-on Exercises

Exercise 0. Understand the context of your data (15 min)

Exercise 1. Get the data (10 min)

Exercise 2. Format conversion (10 min)

Exercise 3. Set working directory (15 min)

Exercise 4. Checking the data file structure (10 min)

Exercise 5. Load data into R (20 min)

Exercise 5.1

Exercise 5.2

Exercise 6. Inspecting an R data frame (20 min)

Exercise 7. Tidying-up the data (50 min)

Exercise 7.1

Exercise 7.2

Exercise 7.3

Exercise 7.4

Exercise 7.5

Exercise 7.6

Exercise 8. Exploring the data (1h30m)

Exercise 8.1

Exercise 8.2

Exercise 8.3

Exercise 8.4

Exercise 8.5

Exercise 9. Exploring the data graphically (1h30m)

Exercise 9.0 Introduction

Exercise 9.1 Scatterplots: Basic plotting with plot

Exercise 9.1.1 More parameters for plot

9.2 Histograms and Boxplots

9.3 Curves

9.4 Multiple Graphs

Exercise 9.4.1 Arranging several plots in one page

Exercise 9.5 Export and save plots

Exercise 10. Extra study

Exercise 10.1 Write a function of your own

Exercise 10.2 Scatterplots with Error Bars

Exercise 9.1 Scatterplots: Basic plotting with `plot`

Exercise 9.1.1 More parameters for `plot`