Sample Data Merge – Relevant Abundance Project

tldr;

Combined some testing sized data info one (hopefully) useful dataset. Proof of concept, now we tweak.

Setup

# install.packages("tidyverse")
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# install.packages("readxl")
library(readxl)

Basic Merge Process

Created data frames for each of the xlsx files to be processed.

collection_data <- read_excel("/cloud/project/Sample_Collection_Data.xlsx")
height_data <- read_excel("/cloud/project/Sample_Height_Data.xlsx")

Used libray(dplyr) shown in video: https://www.youtube.com/watch?v=rVFcUfipz2M

Created new data frame ‘merged’ using inner_join function; finds similar columns and appends them together.

# install.packages("dplyr")
# library(dplyr)

merged_data <- inner_join(collection_data, height_data)

## Joining with `by = join_by(collection, pjid)`

wrote the dataframe to a new xlsx file for export and examination

# install.packages("writexl")
writexl::write_xlsx(merged_data,"/cloud/project/merged_data.xlsx")

Overview

Using original TRAY_tblCollection_Data.xlsx and TRAY_tblHeight_Data.xlsx sheets, the data was reworked into simplified ‘sample’ versions, chopping out some columns of incomplete / unclear data, plus slimming the data rows to a more manageable testing size (~28,000 entries). Some of the column headers were re-titled for consistency across files. These files were read into R as dataframes.

Using R, I references the collection# column as the common value to match the datasets:

In the Collection_Data, it references collection #’s like 20020001 with descriptions, lat / long, tray type, etc.

collection_data

## # A tibble: 19 × 14
##    collection pjid  c_dep_date          c_ret_date          c_tag c_site_code
##    <chr>      <chr> <dttm>              <dttm>              <chr> <chr>      
##  1 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
##  2 20020002   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 479   INL        
##  3 20020003   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 78    INL        
##  4 20020004   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 28    TOL        
##  5 20020005   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 90    TOL        
##  6 20020006   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 487   TOL        
##  7 20020007   T23   2002-04-11 00:00:00 2003-03-04 00:00:00 18    HAM        
##  8 20020008   T23   2002-04-11 00:00:00 2003-03-04 00:00:00 1     HAM        
##  9 20020009   T23   2002-04-11 00:00:00 2003-03-04 00:00:00 26    HAM        
## 10 20020010   T23   2002-04-25 00:00:00 2003-03-05 00:00:00 107   KIA        
## 11 20020011   T23   2002-04-25 00:00:00 2003-03-05 00:00:00 482   KIA        
## 12 20020012   T23   2002-04-25 00:00:00 2003-03-05 00:00:00 125   KIA        
## 13 20020013   T23   2002-04-24 00:00:00 2003-03-03 00:00:00 474   LEE        
## 14 20020014   T23   2002-04-24 00:00:00 2003-03-03 00:00:00 47    LEE        
## 15 20020015   T23   2002-04-24 00:00:00 2003-03-03 00:00:00 49    LEE        
## 16 20020016   T23   2002-04-24 00:00:00 2003-03-03 00:00:00 31    LEW        
## 17 20020017   T23   2002-04-24 00:00:00 2003-03-03 00:00:00 40    LEW        
## 18 20020018   T23   2002-04-24 00:00:00 2003-03-03 00:00:00 480   LEW        
## 19 20020019   T23   2002-04-23 00:00:00 2003-03-17 00:00:00 188   WAL        
## # ℹ 8 more variables: c_site_name <chr>, C_PSG_Reef <chr>, c_tray_rep <chr>,
## #   c_lat_dec <dbl>, c_lon_dec <dbl>, `Tray style` <chr>, ID <dbl>,
## #   c_description <chr>

In Height_Data, all of the measurements from the collection \#'s (like 20020001) show the measured height data points. For each collection #, there are dozens to hundreds of data points.

height_data

## # A tibble: 28,847 × 4
##    collection pjid  h_sheight   h_ID
##    <chr>      <chr>     <dbl>  <dbl>
##  1 20020001   R11       25.2  294800
##  2 20020001   R11       22.9  294801
##  3 20020001   R11       31.3  294802
##  4 20020001   R11       17.1  294803
##  5 20020001   R11       16.6  294804
##  6 20020001   R11       12.0  294805
##  7 20020001   R11       45.2  294806
##  8 20020001   R11        8.49 294807
##  9 20020001   R11       14.9  294808
## 10 20020001   R11       14.7  294809
## # ℹ 28,837 more rows

By matching the collection #’s for each Height_Data point to the more general Collection_Data, I combined those into a single row of information.
The result: each Height Data point now has both the height data and the more descriptive location, tray type, description, etc from Collection_Data

merged_data

## # A tibble: 28,847 × 16
##    collection pjid  c_dep_date          c_ret_date          c_tag c_site_code
##    <chr>      <chr> <dttm>              <dttm>              <chr> <chr>      
##  1 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
##  2 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
##  3 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
##  4 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
##  5 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
##  6 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
##  7 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
##  8 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
##  9 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
## 10 20020001   R11   2002-04-26 00:00:00 2003-03-04 00:00:00 14    INL        
## # ℹ 28,837 more rows
## # ℹ 10 more variables: c_site_name <chr>, C_PSG_Reef <chr>, c_tray_rep <chr>,
## #   c_lat_dec <dbl>, c_lon_dec <dbl>, `Tray style` <chr>, ID <dbl>,
## #   c_description <chr>, h_sheight <dbl>, h_ID <dbl>

The output is an xlsx file for review ## Next Steps

Review the original datasets to determine which columns are useful for analysis, or if there is enough rich data to create a few different variations of data output.

Notes

uploaded xlsx files into File panel via upload button
path is “/cloud/project/…”