R for Marine Science Workshop 2

Author

Miroslava Guerrero

Published

July 1, 2026

Workshop 2 - Advanced data wrangling: Extracting ecological signals from noisy systems

Load packages and housekeeping

Tidying data

Learning how to wrangle data into tidy formats is a critical skill for any data analyst. Tidy data is a standardized way of organizing data that makes it easier to work with and analyze.In this section we will tidy our data with tools in tidyverse.

While there are multiple ways of displaying a dataset, not all are easy to use in an analysis. For instance, some forms might be useful to display the data, they might not be the best format for analysis.

The process of rearranging the data to an appropriate format for analysis is known as ‘tidying’.

For instance the following tables display the same dataset but only the first table is ‘tidy’ and ready for analysis. The second and third tables are not tidy and would require some wrangling before they can be used for analysis.

table1

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

table3

How we make our dataset tidier follows three rules: 1. Each variable must have its own column 2. Each observation must have its own row 3. Each value must have its own cell

These rules are interrelated due to it being impossible to only satisfy two of the three. This leads to a simpler set of practical instructions. 1. Put each dataset in a tibble 2. Put each variable in a column

All packages in tidyverse are design to work with tidy data.

By having tidy data we have a consistent structure which makes it easier to learn and work with the tools in tidyverse. And having variables in columns allows R to work with vectors of values and makes transformations easier to do.

These are some examples of how to work with tidy data (table1), which allows filtering and summarizing data, and using functions in ggplot2. Which would not be possible if it were not to be in this format.

# Compute rate per 10,000
table1 %>%
  mutate(rate = cases / population * 10000)

# Compute cases per year
table1 %>%
  count(year, wt = cases)

# A tibble: 2 × 2
   year      n
  <dbl>  <dbl>
1  1999 250740
2  2000 296920

# Visualise changes over time
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country))

Therefore understanding the structure of your data frame is a fundamental skill for data analysis. This will improve your skills and make it easier to work with.

Pivoting data to make it tidy

Most data one encounters will most likely be untidy as most of the times data collection doesn’t consider the future analysis which will be conducted.

so what needs to be done to the collected datset and what transformations should be done to it to make it tidy?

Firstly we need to understand what each variable and observation mean. Secondly the two most common problems with untidy data are: 1. One variable is spread across multiple columns 2. One observation is scattered across multiple rows

To fix these problems we pivot the data (move it around) so it becomes tidy. The functions used are ‘tidyr::pivot_longer()’ and ‘tidyr::pivot_wider()’.

Lengthening datasets - pivot_longer()

This is the most common tidying issue one faces. The function pivot_longer() makes datasets longer, by increasing the number of rows and decreasing the number of columns.

For instance ‘table4’ needs to be wrangled into a long format to satisfy the 3 rules of tidy data. The issue with this table is that we want to group it by year, or use it as a factor. Though ggplot2 and most statistical packages cannot do this with data in columns, only variables should be in columns.

Including the variables we want to group by year we pass the variable name to any grouping function, like ‘facet_wrap()’

The function pivot_longer() splits the dataset by column and re-formats it into a tidy format of observations as rows, columns as variables and values as cell entries.

For instance, using the billboard dataset records. The billboard rank of songs in the year 2000:

billboard

The first step is to understand the meaning of each observation. Here, each observation is a song, and the first three columns (artist, track and date.entered) are variables that describe the song. The remaining columns (wk1 - wk76) are the rank of the song in the billboard chart for each week.

To tidy the dataset we use ’pivot_longer(). In this case actual data values (wk1 - wk76) are in the column name with each observation being a song. We need to have the data in a format where each row is an observation (long format)

billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

# A tibble: 24,092 × 5
   artist track                   date.entered week   rank
   <chr>  <chr>                   <date>       <chr> <dbl>
 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
 7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
 8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
 9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
# ℹ 24,082 more rows

There are three key arguments in the pivot_longer() function: 1. cols: specifies columns you want to pivot which are not variables 2. names_to: names the variables stored in column names 3. values_to: names the variable stored in cell values

When creating new variables they need to be quoted like “week” and “rank”. If not, R will try to find objects with those names and will return an error if it cannot find them.

When we lengthened the data, the weeks where it wasn’t on the charts, it became NA, these NAs were forced to exist because of the structure of the dataset, not because they are not unknown.

We can fix this by asking pivot_longer to remove them by adding them by doing the following:

billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE #removing NA values
  )

# A tibble: 5,307 × 5
   artist  track                   date.entered week   rank
   <chr>   <chr>                   <date>       <chr> <dbl>
 1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
 2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
 3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
 4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
 5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
 6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
 7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
 8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
 9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
# ℹ 5,297 more rows

This lead to fewer rows in total, meaning a lot of NAs were dropped. Now the data is tidy. However some other things can be done to improve the format to make future computation easier. This includes converting some values from string numbers using ‘mutate()’ and ‘parse_number()’ to extract the number from the string.

To further investigate what pivoting actually does to our dataset, we can use the ‘billboard’ dataset and see how it changes when we pivot it.

df <- tribble(
  ~id, ~bp1, ~bp2,
  "A", 100, 120,
  "B",  140,  115,
  "C",  120,  125
)

df

# A tibble: 3 × 3
  id      bp1   bp2
  <chr> <dbl> <dbl>
1 A       100   120
2 B       140   115
3 C       120   125

This created a dataset called ‘df’ with 3 variables and associated values. We want this dataset to have three variables: 1. id - already exists 2. measurement - the column names 3. value - the values in the cells

To make this happen we need to pivot df longer:

df|>
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )

# A tibble: 6 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 A     bp2           120
3 B     bp1           140
4 B     bp2           115
5 C     bp1           120
6 C     bp2           125

When using pivot_loner() the data pivoted and repeated an id for each measurement. The original column names in df (bp1 and bp2) now become values in a new variable, whole name is defined by the names_to argument and these values end to be repeated once for each row in the original dataset. The values in the cell also became a new variable, with the name defined by the values_to argument.

Widening datasets

In some instances, we will need to widen the dataset rather than lengthening it. widening is essentially the opposite of lengthening, it increases the number of columns and decreases the number of rows. The function used to widen datasets is ‘pivot_wider()’.

Using an example from R4DS to explore pivot_wider() looking at the following dataset from the centers of Medicare and Medicaid.

cms_patient_experience

In this format each organization is spread across six rows with one row for each measurement taken in the survey organisation. By using ‘distinct()’ we can see the complete set of values for measure_cd and measure_title.

cms_patient_experience |>
  distinct(measure_cd, measure_title)

# A tibble: 6 × 2
  measure_cd   measure_title                                                    
  <chr>        <chr>                                                            
1 CAHPS_GRP_1  CAHPS for MIPS SSM: Getting Timely Care, Appointments, and Infor…
2 CAHPS_GRP_2  CAHPS for MIPS SSM: How Well Providers Communicate               
3 CAHPS_GRP_3  CAHPS for MIPS SSM: Patient's Rating of Provider                 
4 CAHPS_GRP_5  CAHPS for MIPS SSM: Health Promotion and Education               
5 CAHPS_GRP_8  CAHPS for MIPS SSM: Courteous and Helpful Office Staff           
6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources

measure_cd will be used as the source for the new column names, but in real analysis we would want to create our own variable names which are short and meaningful.

With pivot_wider() we need to provide the existing columns that define the values (values_from) and the column name (names_from).

cms_patient_experience |> 
  pivot_wider(
    names_from = measure_cd,
    values_from = prf_rate
  )

The above output would not be ideal, as we still have multiple rows for each organisation. Because we still need to tell pivot_wider() which column or columns have values that uniquely identify each row. In this case variables starting with “org”

cms_patient_experience |> 
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )

# A tibble: 95 × 8
   org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5 CAHPS_GRP_8
   <chr>      <chr>        <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 0446157747 USC C…          63          87          86          57          85
 2 0446162697 ASSOC…          59          85          83          63          88
 3 0547164295 BEAVE…          49          NA          75          44          73
 4 0749333730 CAPE …          67          84          85          65          82
 5 0840104360 ALLIA…          66          87          87          64          87
 6 0840109864 REX H…          73          87          84          67          91
 7 0840513552 SCL H…          58          83          76          58          78
 8 0941545784 GRITM…          46          86          81          54          NA
 9 1052612785 COMMU…          65          84          80          58          87
10 1254237779 OUR L…          61          NA          NA          65          NA
# ℹ 85 more rows
# ℹ 1 more variable: CAHPS_GRP_12 <dbl>

The following example of pivoting wider will allow us to understand what the function does to our data. In this case two patients with id A and B, have 3 blood pressure (bp) measurements.

df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115, 
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

df

# A tibble: 5 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 B     bp1           140
3 B     bp2           115
4 A     bp2           120
5 A     bp3           105

Using names_from() and values_from() we’ll make the names from the measurement column and the values from the value column respectively.

df |> 
  pivot_wider(
    names_from = measurement,
    values_from = value
  )

# A tibble: 2 × 4
  id      bp1   bp2   bp3
  <chr> <dbl> <dbl> <dbl>
1 A       100   120   105
2 B       140   115    NA

To start pivoting, the function pivot_wider() needs to figure out what will go in the rows and columns.

df |> 
  distinct(measurement) |> #new column names will be unique values of measurement
  pull()

By default the rows in the output are determined by all the variables that are not going into the new names or values, called id_cols.

df |> 
  select(-measurement, -value) |> 
  distinct()

Pivot wider then combines these results to generate an empty data frame. Which then fills in all the missing values using data in the input. In this case, not every cell has a corresponding value, so it remains missing. This will be covered later in the workshop.

df |> 
  select(-measurement, -value) |> 
  distinct() |> 
  mutate(x = NA, y = NA, z = NA)

# A tibble: 2 × 4
  id    x     y     z    
  <chr> <lgl> <lgl> <lgl>
1 A     NA    NA    NA   
2 B     NA    NA    NA

Pivoting using Palmer Penguins

Lengthening data with pivot_longer()

To create a single plot that shows the distribution of all morpho-metric measurements (bill length and depth, flipper length, and body mass) for the penguins. In the current tidy format, these variables are spread across four columns.

To plot them using facet_wrap(), we need them stacked into just two columns, one to identify the measurement name, and one with the numerical value. This can be achieved my making it into a long format:

library(palmerpenguins)

# Create long version
penguins_long <- penguins |>
  pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
    names_to = "measurement_type",
    values_to = "value"
  )

head(penguins_long) # View the result

# A tibble: 6 × 6
  species island    sex     year measurement_type   value
  <fct>   <fct>     <fct>  <int> <chr>              <dbl>
1 Adelie  Torgersen male    2007 bill_length_mm      39.1
2 Adelie  Torgersen male    2007 bill_depth_mm       18.7
3 Adelie  Torgersen male    2007 flipper_length_mm  181  
4 Adelie  Torgersen male    2007 body_mass_g       3750  
5 Adelie  Torgersen female  2007 bill_length_mm      39.5
6 Adelie  Torgersen female  2007 bill_depth_mm       17.4

This new format is what ggplot expects for multi-panel faceting.

penguins_long |>
  drop_na(value) |>
  ggplot(aes(x = value, fill = species)) +
  geom_histogram(bins = 30, alpha = 0.7, colour = "black") +
  facet_wrap(~ measurement_type, scales = "free_x") +
  theme_minimal() +
  labs(
    title = "Morphometric distributions across penguin species",
    x = "Measurement value",
    y = "Frequency"
  )

Widening data with pivot_wider()

Often in marine science literature, we need to present summary statistics across a cross-tabulated matrix. It is easier to parse a wide table than a long repetitive list.

First calculating the mean body mass for each species on each island group using group_by() and summarise()

mass_summary <- penguins |>
  drop_na(body_mass_g) |>
  group_by(species, island) |>
  summarise(mean_mass = mean(body_mass_g))

head(mass_summary)

# A tibble: 5 × 3
# Groups:   species [3]
  species   island    mean_mass
  <fct>     <fct>         <dbl>
1 Adelie    Biscoe        3710.
2 Adelie    Dream         3688.
3 Adelie    Torgersen     3706.
4 Chinstrap Dream         3733.
5 Gentoo    Biscoe        5076.

While the output is technically tidy and useful, it is not ideal for publication. Using pivot_wider() we can pull the island names into columns and create a clean metric of species by island.

mass_matrix <- mass_summary |>
  pivot_wider(
    names_from = island, # indicates which column provides new column header
    values_from = mean_mass # dictates which column populates the cells beneath them
  )

head(mass_matrix)

# A tibble: 3 × 4
# Groups:   species [3]
  species   Biscoe Dream Torgersen
  <fct>      <dbl> <dbl>     <dbl>
1 Adelie     3710. 3688.     3706.
2 Chinstrap    NA  3733.       NA 
3 Gentoo     5076.   NA        NA

NA values automatically populate where a specific species doesn’t occur on a particular island.

Separating and uniting data tables

In table3 we can see one column that contains two variables. This needs to be addressed by using the function separate(), which separates one column into multiple ones.

table3

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

We need to split the rate column into two variables (cases and population). the function separte() will take the name of the column we want to split and the name of the columns we want to split into.

table3 %>% 
  separate(rate, into = c("cases", "population"))

By default this will split values wherever it sees a character that is not a number or letter. to use a specific character to separate a column, the character can be passed to a step argument of separate().

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")

# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

in the table above, both cases and population are listed as character types. This is a default by using the separate function. However since the values in those columns are numbers we want separate() to convert the into better types.

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE #to convert into better data type
           )

You can also pass a vector of integers to separate, the function will interpret the integers as positions to split at. The following arrangement can be used to separate the last two digits of each year, making the data less tidy, but useful in other cases.

table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)

# A tibble: 6 × 4
  country     century year  rate             
  <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

Using the function unite()

This performs the inverse of separate(). It combines multiple columns into a single column. In the following example we use it to rejoin century and year. this function takes the data frame, the name of the new variable and a set of columns to combine using ‘dplyr::select()’

table5 %>% 
  unite(new, century, year, sep = "" # to not add a separator
        )

# A tibble: 6 × 3
  country     new   rate             
  <chr>       <chr> <chr>            
1 Afghanistan 1999  745/19987071     
2 Afghanistan 2000  2666/20595360    
3 Brazil      1999  37737/172006362  
4 Brazil      2000  80488/174504898  
5 China       1999  212258/1272915272
6 China       2000  213766/1280428583

Wrangling strings and dates

When combining data from different sources, we run into formatting mismatches. For instance “Nelly Bay”, “nelly_bay” and “NELLY BAY”, which are the same location. Additionally, date formats can also vary depending on the instrument or regional settings.

Before merging datasets or plotting timelines, we must standardize our text strings and date-times. We can do this by using ‘stringr’ and ‘lubridate’ from ‘tidyverse’.

Standardizing text with ‘stringr’

All functions begin with ‘str_’ making them easy to find. Here we will be cleaning up a messy vector of site names before conducting a data summary.

library(tidyverse)

# Messy dataframe of field sites
messy_sites <- tibble(
  site_id = c(" Nelly Bay", "nelly_bay", "NELLY BAY", " Geoffrey_Bay ", "geoffrey bay")
)

# Using stringr within mutate to standarize the text
clean_sites <- messy_sites |>
  mutate(
    # 1. Convert everything to lowercase
    site_clean = str_to_lower(site_id),
    # 2. Replace any spaces with underscores
    site_clean = str_replace_all(site_clean, pattern = " ", replacement = "_"),
    # 3. Trim any leading or trailing whitespace (invisible spaces at the ends)
    site_clean = str_trim(site_clean)
  )

print(clean_sites)

# A tibble: 5 × 2
  site_id          site_clean    
  <chr>            <chr>         
1 " Nelly Bay"     _nelly_bay    
2 "nelly_bay"      nelly_bay     
3 "NELLY BAY"      nelly_bay     
4 " Geoffrey_Bay " _geoffrey_bay_
5 "geoffrey bay"   geoffrey_bay

Doing this will have converted five uniquely messy inputs into just two perfectly standardised categories.

Mastering time with lubridate

The lubridate package simplifies dates and times by simply spelling out the order of the components into year, month, day, hour, minute and second.

library(lubridate)

# Parsing different date formats
date_1 <- dmy("25/12/2026") # use this function for day/month/year
date_2 <- ymd("2026-12-25") # this function for year/month/day

# Now R recognises these as idential date objects
date_1 == date_2

[1] TRUE

Oftentimes we also need to parse exact time stamps, which is done by appending _hms (hours, minutes, seconds) to our functions

# A tibble of raw sensor data with a messy character timestamp.
sensor_data <- tibble(
  raw_time = c("14-05-2026 08:30:00", "14-05-2026 08:45:00", "14-05-2026 09:00:00"),
  temperature = c(24.5, 24.6, 24.4)
)

# Converting character strings to true POSIXct datefime objects
sensor_clean <- sensor_data |>
  mutate(
    true_time = dmy_hms(raw_time)
  )

print(sensor_clean)

# A tibble: 3 × 3
  raw_time            temperature true_time          
  <chr>                     <dbl> <dttm>             
1 14-05-2026 08:30:00        24.5 2026-05-14 08:30:00
2 14-05-2026 08:45:00        24.6 2026-05-14 08:45:00
3 14-05-2026 09:00:00        24.4 2026-05-14 09:00:00

Once a column is a recognized datetime, ggplot2 will automatically understand how to format it chronologically. And specific components can be easily extracted, for instance using month(true_time) to get the month integer.

Relational data: Joining tables

In robust data architecture, information is often split across multiple tables to reduce redundancy.

To model the following data, the tables must be brought together based on a common identifying column, known as a key.

# Table 1: Biological observation data
observations <- tibble(
  site_code = c("NB", "GB", "MI", "NB", "HB"),
  species = c("Trout", "Snapper", "Trout", "Cod", "Trout"),
  count = c(5, 2, 1, 3, 8)
)

# Table 2: Spatial metadata
site_metadata <- tibble(
  site_code = c("NB", "GB", "MI", "RP", "WP"),
  zone = c("Marine National Park", "Conservation Park", "Habitat Protection", "General Use", "Other Use"),
  lat = c(-19.16, -19.15, -19.14, -19.12, -19.11)
)

left_joint() and right_joint()

These functions are the most common and safest joins to use. left_join() keeps all the rows from the left table (observations) and matches data from the right table (site_metadata) wherever the key matches. Vice versa for right_joint(). If a match is missing, the gap is filled with NA

# Joining metadata to our observations
joined_data <- observations |>
  left_join(site_metadata, by = join_by(site_code))

print(joined_data)

# A tibble: 5 × 5
  site_code species count zone                   lat
  <chr>     <chr>   <dbl> <chr>                <dbl>
1 NB        Trout       5 Marine National Park -19.2
2 GB        Snapper     2 Conservation Park    -19.2
3 MI        Trout       1 Habitat Protection   -19.1
4 NB        Cod         3 Marine National Park -19.2
5 HB        Trout       8 <NA>                  NA

The inner_join()

This only keeps rows that have complete matches in both tables. If an observation occurs at an unlisted site from the metadata, the observation is dropped from the dataset.

Only use this when analyzing data with complete spiral or environmental context.

matched_data <- observations |>
  inner_join(site_metadata, by = join_by(site_code))
glimpse(matched_data)

Rows: 4
Columns: 5
$ site_code <chr> "NB", "GB", "MI", "NB"
$ species   <chr> "Trout", "Snapper", "Trout", "Cod"
$ count     <dbl> 5, 2, 1, 3
$ zone      <chr> "Marine National Park", "Conservation Park", "Habitat Protec…
$ lat       <dbl> -19.16, -19.15, -19.14, -19.16

The anti_join()

This does not add new columns, instead it filters the left table to only show rows that don’t have a match to the right. This can be used after ending with NA values from a left_join() to find out exactly which codes are misspelled or missing.

# Find missing observations from the metadata dictionary
missing_context <- observations |>
  anti_join(site_metadata, by = join_by(site_code))

missing_context

# A tibble: 1 × 3
  site_code species count
  <chr>     <chr>   <dbl>
1 HB        Trout       8

Handling missing values

Missing data is a common occurrence, which can happen due to several things, such as weather preventing sampling, equipment failing, cells left blank to indicate zero values, and others.

Converting legacy errors with na_if()

Equipment sometimes can’t properly generate NAs, instead it splits out an impossible placeholder like -99 or -999. Which if left untouched can ruin the dataset.

These type of errors should be fixed immediately after importing data.

# Raw data from an old temperature logger 
logger_data <- tibble(
  depth_m = c(10, 20, 30, 40),
  temp_c = c(24.5, 24.1, -999, 23.5)) # -999 is a known sensor error code

# Convert error codes to true NA values
fixed_logger <- logger_data |>
  mutate(temp_c = na_if(temp_c, -999))

print(fixed_logger)

# A tibble: 4 × 2
  depth_m temp_c
    <dbl>  <dbl>
1      10   24.5
2      20   24.1
3      30   NA  
4      40   23.5

Replacing missing values with coalesce()

Sometimes unknown zero values become NAs. the function dplyr::coalesce() insite a mutate() to replace every NA with a 0 (or any other fixed value)

# Count data
shark_counts <- tibble(
  site = c("Reef_A", "Reef_B", "Reef_C"),
  shark_count = c(3, NA, 5)) # The NA here actually means 0 sharks were seen

# Replace NA with 0 
shark_fixed <- shark_counts |>
  mutate(shark_count = coalesce(shark_count, 0))

print(shark_fixed)

# A tibble: 3 × 2
  site   shark_count
  <chr>        <dbl>
1 Reef_A           3
2 Reef_B           0
3 Reef_C           5

The special case of NaN (Not a Number)

NaN refers to values that are mathematically impossible, after a mathematical operation is performed and results in an indeterminate result, such as zero by zero. This can happen in marine science when calculating CPUE (catch per unit effort) for a site with 0 as both catch and effort.

Generally, R treats NaN the same as NAs. Though if the data needs to be specifically flagged for mathematical errors without flagging missing data, it can be isolated with the function is.nan()

cpue_data <- tibble(
  site = c("Bay_1", "Bay_2"),
  catch = c(10, 0),
  effort_hours = c(2, 0))

# Calculating CPUE (catch / effort)
cpue_calc <- cpue_data |>
  mutate(
    cpue = catch / effort_hours)

print(cpue_calc)

# A tibble: 2 × 4
  site  catch effort_hours  cpue
  <chr> <dbl>        <dbl> <dbl>
1 Bay_1    10            2     5
2 Bay_2     0            0   NaN

Implicit missing values and the zero-data framework: complete()

In community ecology knowing where a species was not found is important too. However, raw data exclusively records the presence only, though missing some data.

# Raw catch log of coral trout species - only records the presence
raw_catch <- tibble(
  site = c("Reef_1", "Reef_1", "Reef_2"),
  species = c("Pmaculatus", "Pleopardus", "Pmaculatus"),
  count = c(5, 2, 8))

raw_catch

# A tibble: 3 × 3
  site   species    count
  <chr>  <chr>      <dbl>
1 Reef_1 Pmaculatus     5
2 Reef_1 Pleopardus     2
3 Reef_2 Pmaculatus     8

While it is implied that Reef_2 was surveyed, no Plectopompus was found there. If we calculate the mean catch of this species across all sites, R will say it is 2 as it only knows about those in Reef_1. However, the mean should be 1, as there were 2 found in Reef_1 and 0 at Reef_2.

The function tidyr::complete() can be used to build a zero-data framework. This function finds all unique combinations of supplied variables. This creates rows for the missing combinations, which fills the resulting NAs with 0.

The following will list Pleopardus with a count of 0 in reef 2. Building these frameworks is a mandatory prerequisite before feeding the data into linear models and other analyses.

# Force inclusion of hidden zero-catch data
full_catch_matrix <- raw_catch |>
  complete(site, species, fill = list(count = 0))

print(full_catch_matrix)

# A tibble: 4 × 3
  site   species    count
  <chr>  <chr>      <dbl>
1 Reef_1 Pleopardus     2
2 Reef_1 Pmaculatus     5
3 Reef_2 Pleopardus     0
4 Reef_2 Pmaculatus     8

The nuclear option: drop_na()

If NAs represent a failed instrument reading, and the reading is the core response variable of the analysis, the entire row needs to be removed by using drop_na()

sensor_log <- tibble(
  day = 1:4,
  salinity = c(35.2, 35.1, NA, 35.3)
)

sensor_log

# A tibble: 4 × 2
    day salinity
  <int>    <dbl>
1     1     35.2
2     2     35.1
3     3     NA  
4     4     35.3

# Remove any row where salinity is missing
clean_log <- sensor_log |>
  drop_na(salinity)

clean_log

# A tibble: 3 × 2
    day salinity
  <int>    <dbl>
1     1     35.2
2     2     35.1
3     4     35.3

Practical exercises: Penguins, pivots, and relational data.

Using the clean penguins dataset with a supplementary metadata file containing the GPS coordinates and the date the primary weather stations were installed on each island.

Cleaning the messy metadata

# The messy metadata provided by your colleague
island_metadata <- tibble(
  island_name = c(" biscoe", "Dream ", "Torgersen"),
  station_install = c("15/01/2003", "22-03-2004", "05/11/2001"),
  latitude = c(-64.81, -64.73, -64.76)
)

print(island_metadata)

# A tibble: 3 × 3
  island_name station_install latitude
  <chr>       <chr>              <dbl>
1 " biscoe"   15/01/2003         -64.8
2 "Dream "    22-03-2004         -64.7
3 "Torgersen" 05/11/2001         -64.8

# Create a new object called clean_metadata
# Using mutate(), stringr, and lubridate fix the island_name column so it matches the island column in the penguins dataset
# Then convert the station_install column into a Date object.

clean_metadata <- island_metadata |>
  mutate(
    island_name = island_name |>
      stringr::str_trim() |>
      stringr::str_to_title(),

    station_install = lubridate::dmy(station_install)
  )

clean_metadata

# A tibble: 3 × 3
  island_name station_install latitude
  <chr>       <date>             <dbl>
1 Biscoe      2003-01-15         -64.8
2 Dream       2004-03-22         -64.7
3 Torgersen   2001-11-05         -64.8

The relational join

# Create a new object called penguins_spatial
# Use left_join() to attach clean_metadata to penguins dataset
# specify how island and island_name match using by=join_by(island==island_name)

penguins_spatial <- penguins |>
  left_join(
    clean_metadata,
    by = join_by(island == island_name)
  )

head(penguins_spatial)

# A tibble: 6 × 10
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <chr>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 4 more variables: sex <fct>, year <int>, station_install <date>,
#   latitude <dbl>

The wide summary matrix

A collaborator asked for a clean, publication-ready table showing the maximum body mass recorded for each species, separated by island

# using penguins_spatial dataset
# drop any missing values in the body_mass column
# group by species and island.
# summarise the data to find the maximum body mass
#pivot the dataset wider so that the species form the rows, and the island form the columns.

max_mass_table <- penguins_spatial |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarise(
    max_body_mass_g = max(body_mass_g),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = island,
    values_from = max_body_mass_g
  )

max_mass_table

# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <int> <int>     <int>
1 Adelie      4775  4650      4700
2 Chinstrap     NA  4800        NA
3 Gentoo      6300    NA        NA