R for Marine Science Workshop 1

Author

Miroslava Guerrero

Published

June 3, 2026

Workshop 1 - Foundations of Data Science; Wrangling and Plotting.

This workshop aims to establish a foundational workflow for data science in R, with a focus on data wrangling and visualisation. The session covers best practices for project management, environment hygiene, and the use of tidyverse tools to extract ecological insights from tabular datasets.

Clear out environment

Load packages

Source script

Project initialisation and workspace architecture

Establish a new R project, independent of previous work, by pulling down the class repository.

There are two ways to do this: a ‘fork’, which is a connected copy of the class repository, or a ‘standalone copy’, which is disconnected from the class repository.

On this occasion the standalone copy is the recommended pathway. The repository is cloned and its connection to the original is removed with the ‘usethis’ package, which then uploads an independent copy to your personal GitHub.

Clone class repository steps:

1. Copy the HTTPS URL from the class repository.
2. Open RStudio, create a new version control project, paste the URL, and save the project where you want it on your computer.
3. Click Create Project.

Remove the original connection and establish a new one:

1. Run the following code in the console.
2. Push a fresh copy to your own profile.
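The two steps above could be sketched as follows. This is a minimal sketch rather than the exact workshop code: it assumes the usethis package is installed, that the cloned project still has its remote named origin, and that GitHub credentials are already configured. The variable name remote_to_drop is illustrative only.

```r
# Minimal sketch: detach a cloned project from the class repository
# and publish it as a standalone repository on your own GitHub profile.
# Assumptions: usethis is installed, the remote is named "origin",
# and a GitHub personal access token is already configured.
remote_to_drop <- "origin"

if (interactive()) {
  usethis::git_remotes()                   # inspect the current remotes
  usethis::use_git_remote(remote_to_drop,  # remove the link to the class repository
                          url = NULL,
                          overwrite = TRUE)
  usethis::use_github()                    # create a new GitHub repository and push to it
}
```

The interactive() guard keeps these calls from firing when the document is rendered; they are meant to be run once, by hand, in the console.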

Memory allocation and environment hygiene

An R session’s active memory can retain objects from earlier runs; these linger out of sight and can cause hard-to-trace errors in the current analysis.

To eliminate this issue, a baseline workflow can be established at the top of every fresh script file using:

objects() # List all active objects in the environment

[1] "acoustic_stream"  "benthic_cover"    "fisheries_annual" "mangrove_data"

rm(list = ls()) # Clear all objects from the environment 
objects() # Confirm that the environment is clean

character(0)

Breaking the ‘Save Workspace’ reflex and purging .RData

In data science, train yourself to break the reflex of clicking Yes at the ‘Save Workspace’ prompt when closing R. Saving makes RStudio write a hidden binary snapshot of the session’s current memory into the project folder as a .RData file. The next time the project opens, that file reloads all the old variables, data frames and broken code attempts back into memory. A script that depends on these leftovers might work on your computer, but it will break when co-authors, advisors or reviewers try to run it on theirs.

To permanently adjust your settings and protect yourself from this trap:

1. Go to Tools > Global Options > General and find the Workspace heading.
2. Uncheck the box labelled Restore .RData into workspace at startup.
3. In the dropdown menu next to Save workspace to .RData on exit, select Never.
4. Click Apply and then OK.

Advanced troubleshooting note: if RStudio is opened without a specific project and variables or data frames are mysteriously floating in the Environment tab, an .RData file has been accidentally saved to the computer’s global user home directory.

To wipe it out entirely from inside R, execute the following in the console:
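One way this could be done is sketched below, using only base R. The helper name purge_rdata() is hypothetical and not part of any package, and the sketch assumes the stray file sits in the directory that R reports as the home directory (path.expand("~")), which on Windows is usually the Documents folder.

```r
# Minimal sketch: delete a stray .RData file from a directory.
# purge_rdata() is a hypothetical helper name, not part of any package.
purge_rdata <- function(dir = path.expand("~")) {
  rdata_path <- file.path(dir, ".RData")
  if (file.exists(rdata_path)) {
    file.remove(rdata_path)  # returns TRUE when the file was deleted
  } else {
    FALSE                    # nothing to delete
  }
}

purge_rdata()  # defaults to the user home directory
```

Restart R afterwards so that objects already loaded from the deleted file are cleared from the session.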

Tibbles vs. legacy tables

When tabular assets are imported using modern tidyverse commands, the resulting object is stored in the environment memory as a specialized structure (tibble; tbl_df).

A tibble is still a base R data frame underneath, but it is stricter about subsetting (for example, it never partially matches column names) and prints only the first rows and the columns that fit the console, which prevents common data corruption errors and console performance issues.

To see the behavioural difference between a modern tidyverse tibble and a legacy base R data.frame, load a native tidyverse data object and convert it for comparison.

First, we force a modern tibble to degrade into a legacy base R data frame structure:

source("Workshop1.R") # load data into active memory
benthic_cover_df <- as.data.frame(benthic_cover)

Then we compare the two printouts:

# Print old-style dataframe structure to view
print(benthic_cover_df)

       site_id transect_no       date depth_m hard_coral_pct macroalgae_pct
1    Nelly_Bay           1 2026-05-10     4.5           45.2           12.1
2    Nelly_Bay           2 2026-05-10     5.1           38.5           15.4
3 Geoffrey_Bay           1 2026-05-11     3.8           52.1            8.3
4 Geoffrey_Bay           2 2026-05-11     4.2           48.9           10.2
5 Florence_Bay           1 2026-05-12     6.0           61.3            5.1
6 Florence_Bay           2 2026-05-12     5.8           58.7            6.4
  bare_substrate_pct
1               42.7
2               46.1
3               39.6
4               40.9
5               33.6
6               34.9

# Compare with tibble alternative
print(benthic_cover)

# A tibble: 6 × 7
  site_id      transect_no date       depth_m hard_coral_pct macroalgae_pct
  <chr>              <dbl> <date>       <dbl>          <dbl>          <dbl>
1 Nelly_Bay              1 2026-05-10     4.5           45.2           12.1
2 Nelly_Bay              2 2026-05-10     5.1           38.5           15.4
3 Geoffrey_Bay           1 2026-05-11     3.8           52.1            8.3
4 Geoffrey_Bay           2 2026-05-11     4.2           48.9           10.2
5 Florence_Bay           1 2026-05-12     6             61.3            5.1
6 Florence_Bay           2 2026-05-12     5.8           58.7            6.4
# ℹ 1 more variable: bare_substrate_pct <dbl>
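Printing is not the only difference. The sketch below illustrates two subsetting behaviours that differ between the two structures, using a toy two-row table whose values are borrowed from the printout above. It assumes the tibble package is installed; the object names are illustrative only.

```r
library(tibble)

# Toy two-row table (values borrowed from the printout above for illustration)
reef_df <- data.frame(site_id = c("Nelly_Bay", "Geoffrey_Bay"),
                      depth_m = c(4.5, 3.8))
reef_tbl <- as_tibble(reef_df)

# 1. Partial matching: a data.frame silently completes "site" to "site_id"
df_partial <- reef_df$site
# A tibble refuses, returning NULL with a warning (suppressed here)
tbl_partial <- suppressWarnings(reef_tbl$site)

# 2. Single-column subsetting: a data.frame drops to a bare vector,
#    whereas a tibble always stays a tibble
df_column  <- reef_df[, "depth_m"]
tbl_column <- reef_tbl[, "depth_m"]

class(df_column)
class(tbl_column)
```

Both data.frame behaviours are convenient at the console but can silently change what a script returns when a column is renamed or a selection shrinks to one column.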

Wrangling out ecological signals using Palmer Penguins

To show how data wrangling reveals information about underlying ecology, we will work with the built-in Palmer Penguins dataset.

library(palmerpenguins)
data("penguins")

When loading a new dataset, always examine its structure.

glimpse(penguins) # tidyverse version (from dplyr package)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

str(penguins) # base R version

tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

The glimpse() command maps out the entire anatomy of the dataset, with the vector types listed within the angle brackets:

<fct> is a factor, referring to categorical groupings with fixed levels.
<dbl> is a double, a continuous numeric measurement that can contain decimals.
<int> is an integer, referring to whole-number variables such as counts.
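The same three types can be created and inspected by hand in base R. The short sketch below uses toy vectors with illustrative values, not the penguins data itself.

```r
# Toy vectors mirroring the three column types reported by glimpse()
species_toy <- factor(c("Adelie", "Gentoo", "Adelie"))  # factor
bill_toy    <- c(39.1, 47.5, 40.3)                      # double
flipper_toy <- c(181L, 217L, 195L)                      # integer; the L suffix marks integers

class(species_toy)   # "factor"
levels(species_toy)  # "Adelie" "Gentoo"
typeof(bill_toy)     # "double"
typeof(flipper_toy)  # "integer"
```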

A statistical overview maps out missing observations and the range of each variable.

summary() is an immediate diagnostic tool that reports the number of missing values (NA) within each biological metric.

# Generating an exploratory summary matrix
summary(penguins)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NAs    :2       NAs    :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NAs   : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NAs    :2         NAs    :2
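When only the missing-value counts are of interest, they can be extracted directly. The following is a minimal base R sketch on a toy table with illustrative values; the same two calls should work unchanged on penguins.

```r
# Toy table standing in for penguins (values are illustrative only)
toy_penguins <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3),
  body_mass_g    = c(3750L, NA, 3250L),
  sex            = factor(c("male", NA, "female"))
)

# Count missing values per column
na_per_column <- colSums(is.na(toy_penguins))
na_per_column

# Count rows that have no missing values at all
complete_rows <- sum(complete.cases(toy_penguins))
complete_rows
```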

Foundational grammar: Slicing, filtering, sorting, and transforming

The dplyr package introduces the foundational verbs required to shape tables manually.

Isolating attributes with select()

select() slices datasets vertically, isolating or dropping columns based on variable names:

# Vertically slice specific morphometric variables by explicit name
morphology_metrics <- select(penguins, species, bill_length_mm, bill_depth_mm, body_mass_g)
glimpse(morphology_metrics)

Rows: 344
Columns: 4
$ species        <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie,…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.…
$ bill_depth_mm  <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.…
$ body_mass_g    <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 425…

# Retain a continuous block of attributes using the colon operator
spatial_block <- select(penguins, species:island)

# Discard logistics tracking attributes while preserving everything else using the minus sign
clean_scientific_fields <- select(penguins, -year)

Sifting rows with filter()

This isolates records horizontally based on targeted conditional parameters. R evaluates every row against the logical expression, retaining TRUE entries and dropping FALSE or NA entries.

# Isolate observations belonging to a single categorical target group
adelie_cohort <- filter(penguins, species == "Adelie")

# Sift out individuals using continuous numerical boundary thresholds
# Preserves only large penguins with mass above 4500 g
heavy_penguins <- filter(penguins, body_mass_g > 4500)

# Combine multiple conditional parameters across separate attributes
# Preserve records matching Gentoo penguins sampled on Biscoe Island
biscoe_gentoo <- filter(penguins, species == "Gentoo" & island == "Biscoe")

# Sift records matching multiple targeting flags within an explicit set
sub_islands <- filter(penguins, island %in% c("Dream", "Torgersen"))
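The handling of NA is the part of filter() that most often surprises people. The sketch below contrasts it with base R logical subsetting on a toy table with illustrative values; it assumes the dplyr package is installed.

```r
library(dplyr)

# Toy table with one missing mass (values are illustrative only)
toy <- data.frame(id = 1:4, body_mass_g = c(3750, NA, 4675, 5200))

# filter() keeps only rows where the condition is TRUE; the NA row is dropped
kept_dplyr <- filter(toy, body_mass_g > 4500)

# Base R logical subsetting keeps the NA comparison as an all-NA row
kept_base <- toy[toy$body_mass_g > 4500, ]

nrow(kept_dplyr)  # 2
nrow(kept_base)   # 3
```

To keep rows with a missing mass deliberately, the condition has to say so, for example body_mass_g > 4500 | is.na(body_mass_g).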

Ordering sequences with arrange()

This alters the sorting configuration of rows within the data frame without changing individual cell values.

# Sort penguins by ascending body mass (smallest mass first)
lightest_first <- arrange(penguins, body_mass_g)

# Sort penguins in descending sequence (heaviest first)
heaviest_first <- arrange(penguins, desc(body_mass_g))

# Execute nested sorting criteria: sort by species, then by descending bill length within each species
stratified_morphology <- arrange(penguins, species, desc(bill_length_mm))

Introducing the Pipe (|>)

The pipe is used to chain operations together cleanly. It passes the data through a series of transformations in a single, linear flow.

With a pipe you take your data, then you filter it, and mutate it to create a new column. Then you select the columns you need.

The native R pipe, available from R 4.1 onwards, is written |>. The tidyverse also provides the magrittr pipe, written %>%.

Instead of writing:

penguins_subset <- mutate(penguins, bill_ratio = bill_length_mm / bill_depth_mm)

penguins_final <- filter(penguins_subset, species == "Adelie")

Using the pipe:

penguins_final <- penguins |>
  mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
  filter(species == "Adelie")
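Under the hood, the pipe simply inserts the left-hand side as the first argument of the call on its right. The base R sketch below, on illustrative values, shows that a nested call and its piped form give the same result (R 4.1 or later is assumed).

```r
# The native pipe passes the left-hand side as the first argument
# of the call on the right (requires R 4.1 or later).
masses_g <- c(3750, 3800, 3250)  # illustrative values

# Nested form: read from the inside out
nested <- round(mean(masses_g / 1000), 2)

# Piped form: read from top to bottom
piped <- (masses_g / 1000) |>
  mean() |>
  round(2)

identical(nested, piped)  # TRUE
```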

Computing new attributes with mutate()

mutate() modifies existing attributes or appends new vectors (columns) to the data frame.

# Calculate a new morphological ratio in our environment
penguin_ratios <- penguins |>
  mutate(body_mass_kg = body_mass_g / 1000, # convert g to kg
         bill_ratio = bill_length_mm / bill_depth_mm) #bill ratio 

# view newly engineered variables appended to the far right columns
glimpse(penguin_ratios)

Rows: 344
Columns: 10
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ body_mass_kg      <dbl> 3.750, 3.800, 3.250, NA, 3.450, 3.650, 3.625, 4.675,…
$ bill_ratio        <dbl> 2.090909, 2.270115, 2.238889, NA, 1.901554, 1.907767…

Data aggregation and ecological summarisation

Isolating and sorting individual rows or columns produces a clean dataset, but it does not reveal the broad underlying ecological story.

To extract meaningful biological conclusions, the individual observations need to be compressed into explicit population-level summary metrics.

# Grouping active memory penguins by species
grouped_penguins <- group_by(penguins, species)

# The table looks identical, but the header now notes Groups: species [3]
print(grouped_penguins)

# A tibble: 344 × 8
# Groups:   species [3]
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

# Collapsing the buckets into explicit summary metrics
species_mass_summary <- summarise(grouped_penguins, mean_mass_g = mean(body_mass_g))
print(species_mass_summary) # returns NA for groups that contain missing values

# A tibble: 3 × 2
  species   mean_mass_g
  <fct>           <dbl>
1 Adelie            NA 
2 Chinstrap       3733.
3 Gentoo            NA

The NA values returned for certain groups are caused by the missing value trap: if a column contains even a single NA, summary functions such as mean() and sd() return NA rather than silently calculating a metric on incomplete data.

To override this safely, you must explicitly declare that missing cells should be dropped during calculation with na.rm = TRUE. The example below uses the magrittr pipe to wrangle the data.
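The propagation of NA can be seen on a plain vector, without any grouping. The base R sketch below uses illustrative values.

```r
masses_g <- c(3750, NA, 3250)  # illustrative values with one missing cell

mean_mass  <- mean(masses_g)   # NA: the mean of an unknown value is unknown
total_mass <- sum(masses_g)    # NA for the same reason
is_heavy   <- NA > 4500        # NA: the comparison itself is unknown

anyNA(masses_g)                # TRUE: a quick check before summarising
```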

# Overcoming the missing value trap using na.rm = TRUE
biological_signal <- penguins %>%
  group_by(species, sex) %>%
  summarise(
    sample_size = n(), # total individuals per category
    mean_mass_g = mean(body_mass_g, na.rm = TRUE), # mean ignoring missing cells
    sd_mass_g = sd(body_mass_g, na.rm = TRUE) # SD calculation
  )

print(biological_signal)

# A tibble: 8 × 5
# Groups:   species [3]
  species   sex    sample_size mean_mass_g sd_mass_g
  <fct>     <fct>        <int>       <dbl>     <dbl>
1 Adelie    female          73       3369.      269.
2 Adelie    male            73       4043.      347.
3 Adelie    <NA>             6       3540       477.
4 Chinstrap female          34       3527.      285.
5 Chinstrap male            34       3939.      362.
6 Gentoo    female          58       4680.      282.
7 Gentoo    male            61       5485.      313.
8 Gentoo    <NA>             5       4588.      338.

Integrating data grammar with visual diagnostics in qmd

In marine science, data is wrangled specifically to prepare it for visual display.

Wrangling and plotting in parallel

The pipe and the basics of ggplot2 can be combined into a single reproducible block.

Because the .qmd already loads cleaned data via source("Workshop1.R"), you can build visualizations and summary tables in real time.

Challenge: Visualizing body size and shape

Using the loaded data, create a code chunk that performs the following two tasks together:

1. Summary table: pipe the data into group_by() and summarise() to calculate the mean body mass for each species and island.
2. Visualization: pipe the same dataset into ggplot() to create a boxplot of body_mass_g by species, with the island variable mapped to fill or split into panels with facet_wrap().

# Summary Table
penguins |>
  group_by(species, island) |>
  summarise(mean_mass_g = mean(body_mass_g, na.rm = TRUE)) |>
  print() # Print summary table

# A tibble: 5 × 3
# Groups:   species [3]
  species   island    mean_mass_g
  <fct>     <fct>           <dbl>
1 Adelie    Biscoe          3710.
2 Adelie    Dream           3688.
3 Adelie    Torgersen       3706.
4 Chinstrap Dream           3733.
5 Gentoo    Biscoe          5076.

# Visualization
penguins |>
  ggplot(aes(x = species, y = body_mass_g, fill = island)) +
  geom_boxplot() +
  labs(title = "Body Mass of Penguins by Species and Island",
       x = "Species",
       y = "Body Mass (g)") +
  theme_minimal()

Piping directly to visualisation

Sometimes a summary table does not need to be saved as an object. If the goal is to visualise patterns and uncertainty within your aggregated data, the pipeline can be extended into ggplot2.

Example of piping directly into a visualization

# pipe directly from aggregation to plotting with error bars
mass_compare_plot <- penguins |>
  group_by(species, island) |>
  summarise(
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |>
  ggplot(aes(x = species, y = mean_mass, colour = island)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_mass - sd_mass,
                    ymax = mean_mass + sd_mass),
                width = 0.2) +
  labs(title = "Mean Body Mass by Species and Island",
       subtitle = "Error bars represent standard deviation",
       y = "Mean Body Mass (g)",
       x = "Species") +
  theme_minimal()

mass_compare_plot

Challenge: Create a new plot, swapping sd_mass for the standard error (sd / sqrt(n)). Present the summary table alongside the associated figure.

mass_compare_plot_SE <- penguins |>
  group_by(species, island) |>
  summarise(
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    n = sum(!is.na(body_mass_g)), # count only individuals with a recorded mass
    se_mass = sd_mass / sqrt(n), # Calculate standard error
    .groups = "drop"
  ) |>
  ggplot(aes(x = species, y = mean_mass, colour = island)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_mass - se_mass,
                    ymax = mean_mass + se_mass),
                width = 0.2) +
  labs(title = "Mean Body Mass by Species and Island",
       subtitle = "Error bars represent standard error",
       y = "Mean Body Mass (g)",
       x = "Species") +
  theme_minimal()

mass_compare_plot_SE
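When a group contains missing masses, the number of rows and the number of measured individuals differ, so a standard error is usually based on the non-missing count. The sketch below wraps that calculation in a small helper; standard_error() is a hypothetical name, not a base R or dplyr function, and the values are illustrative.

```r
# standard_error() is a hypothetical helper, not part of base R or dplyr
standard_error <- function(x) {
  n_obs <- sum(!is.na(x))             # number of non-missing observations
  sd(x, na.rm = TRUE) / sqrt(n_obs)   # SE = SD / sqrt(n)
}

masses_g <- c(3000, 3500, NA, 4000)   # illustrative values with one missing cell

se_correct <- standard_error(masses_g)
se_naive   <- sd(masses_g, na.rm = TRUE) / sqrt(length(masses_g))  # counts the NA row too

se_correct > se_naive  # TRUE: counting the NA row understates the uncertainty
```

A helper like this could be called inside summarise() in place of the separate sd_mass and n columns.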

Saving, exporting, and version milestones

Check that the correct folders are available, and create them if not:
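A minimal base R sketch of this check is given below. The folder names are hypothetical; substitute whatever layout your project uses.

```r
# Hypothetical folder layout; rename to match your own project
required_dirs <- c(file.path("output", "tables"),
                   file.path("output", "figures"))

for (dir_path in required_dirs) {
  if (!dir.exists(dir_path)) {
    dir.create(dir_path, recursive = TRUE)  # also creates "output" if needed
  }
}

dir.exists(required_dirs)  # confirm both folders are now available
```

Paths are relative to the working directory, which is the project root when the script is run from inside an RStudio project.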

There are several options for saving objects.

Tables can be saved as a universal CSV text file, which is ideal for sharing with collaborators who don’t use R, or as a native R binary RDS object, which preserves the exact vector configurations and won’t need to be re-formatted when reloaded.
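The difference between the two formats can be seen by saving and reloading a small table. The sketch below uses base R functions (write.csv() and saveRDS()); the readr equivalents write_csv() and write_rds() behave similarly. The table values, folder name and file names are illustrative assumptions.

```r
# Toy summary table (values are illustrative only)
toy_summary <- data.frame(
  species     = factor(c("Adelie", "Gentoo")),
  mean_mass_g = c(3700.5, 5076.0)
)

out_dir <- file.path("output", "tables")  # hypothetical folder name
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

csv_path <- file.path(out_dir, "toy_summary.csv")
rds_path <- file.path(out_dir, "toy_summary.rds")

write.csv(toy_summary, csv_path, row.names = FALSE)  # plain text, readable anywhere
saveRDS(toy_summary, rds_path)                       # R-only binary, keeps column types

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

class(from_csv$species)  # "character": the factor was flattened to text
class(from_rds$species)  # "factor": restored exactly as saved
```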

Figures can be saved using ggsave(), a function built into ggplot2.
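A minimal sketch of a ggsave() call is shown below. It assumes ggplot2 is installed; the toy plot, folder name, file name and figure dimensions are illustrative choices rather than workshop requirements.

```r
library(ggplot2)

# Toy data and plot (values are illustrative only)
toy_masses <- data.frame(species = c("Adelie", "Gentoo"),
                         mean_mass_g = c(3700, 5076))
toy_plot <- ggplot(toy_masses, aes(x = species, y = mean_mass_g)) +
  geom_col()

fig_dir <- file.path("output", "figures")  # hypothetical folder name
dir.create(fig_dir, recursive = TRUE, showWarnings = FALSE)
fig_path <- file.path(fig_dir, "toy_plot.png")

# Passing the plot object explicitly avoids relying on the last plot displayed
ggsave(filename = fig_path, plot = toy_plot,
       width = 16, height = 10, units = "cm", dpi = 300)
```

The output format is taken from the file extension, so changing .png to .pdf would produce a vector figure instead.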