Combined some testing sized data info one (hopefully) useful dataset. Proof of concept, now we tweak.
# install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# install.packages("readxl")
library(readxl)
Created data frames for each of the xlsx files to be processed.
collection_data <- read_excel("/cloud/project/Sample_Collection_Data.xlsx")
height_data <- read_excel("/cloud/project/Sample_Height_Data.xlsx")
Used libray(dplyr) shown in video: https://www.youtube.com/watch?v=rVFcUfipz2M
Created new data frame ‘merged’ using inner_join function; finds similar columns and appends them together.
# install.packages("dplyr")
# library(dplyr)
merged_data <- inner_join(collection_data, height_data)
## Joining with `by = join_by(collection, pjid)`
wrote the dataframe to a new xlsx file for export and examination
# install.packages("writexl")
writexl::write_xlsx(merged_data,"/cloud/project/merged_data.xlsx")
Using original TRAY_tblCollection_Data.xlsx and TRAY_tblHeight_Data.xlsx sheets, the data was reworked into simplified ‘sample’ versions, chopping out some columns of incomplete / unclear data, plus slimming the data rows to a more manageable testing size (~28,000 entries). Some of the column headers were re-titled for consistency across files. These files were read into R as dataframes.
Using R, I references the collection# column as the common value to match the datasets:
collection_data
## # A tibble: 19 × 14
## collection pjid c_dep_date c_ret_date c_tag c_site_code
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## 2 20020002 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 479 INL
## 3 20020003 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 78 INL
## 4 20020004 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 28 TOL
## 5 20020005 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 90 TOL
## 6 20020006 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 487 TOL
## 7 20020007 T23 2002-04-11 00:00:00 2003-03-04 00:00:00 18 HAM
## 8 20020008 T23 2002-04-11 00:00:00 2003-03-04 00:00:00 1 HAM
## 9 20020009 T23 2002-04-11 00:00:00 2003-03-04 00:00:00 26 HAM
## 10 20020010 T23 2002-04-25 00:00:00 2003-03-05 00:00:00 107 KIA
## 11 20020011 T23 2002-04-25 00:00:00 2003-03-05 00:00:00 482 KIA
## 12 20020012 T23 2002-04-25 00:00:00 2003-03-05 00:00:00 125 KIA
## 13 20020013 T23 2002-04-24 00:00:00 2003-03-03 00:00:00 474 LEE
## 14 20020014 T23 2002-04-24 00:00:00 2003-03-03 00:00:00 47 LEE
## 15 20020015 T23 2002-04-24 00:00:00 2003-03-03 00:00:00 49 LEE
## 16 20020016 T23 2002-04-24 00:00:00 2003-03-03 00:00:00 31 LEW
## 17 20020017 T23 2002-04-24 00:00:00 2003-03-03 00:00:00 40 LEW
## 18 20020018 T23 2002-04-24 00:00:00 2003-03-03 00:00:00 480 LEW
## 19 20020019 T23 2002-04-23 00:00:00 2003-03-17 00:00:00 188 WAL
## # ℹ 8 more variables: c_site_name <chr>, C_PSG_Reef <chr>, c_tray_rep <chr>,
## # c_lat_dec <dbl>, c_lon_dec <dbl>, `Tray style` <chr>, ID <dbl>,
## # c_description <chr>
In Height_Data, all of the measurements from the collection \#'s (like 20020001) show the measured height data points. For each collection #, there are dozens to hundreds of data points.
height_data
## # A tibble: 28,847 × 4
## collection pjid h_sheight h_ID
## <chr> <chr> <dbl> <dbl>
## 1 20020001 R11 25.2 294800
## 2 20020001 R11 22.9 294801
## 3 20020001 R11 31.3 294802
## 4 20020001 R11 17.1 294803
## 5 20020001 R11 16.6 294804
## 6 20020001 R11 12.0 294805
## 7 20020001 R11 45.2 294806
## 8 20020001 R11 8.49 294807
## 9 20020001 R11 14.9 294808
## 10 20020001 R11 14.7 294809
## # ℹ 28,837 more rows
By matching the collection #’s for each Height_Data point to the more general Collection_Data, I combined those into a single row of information.
The result: each Height Data point now has both the height data and the more descriptive location, tray type, description, etc from Collection_Data
merged_data
## # A tibble: 28,847 × 16
## collection pjid c_dep_date c_ret_date c_tag c_site_code
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## 2 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## 3 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## 4 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## 5 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## 6 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## 7 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## 8 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## 9 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## 10 20020001 R11 2002-04-26 00:00:00 2003-03-04 00:00:00 14 INL
## # ℹ 28,837 more rows
## # ℹ 10 more variables: c_site_name <chr>, C_PSG_Reef <chr>, c_tray_rep <chr>,
## # c_lat_dec <dbl>, c_lon_dec <dbl>, `Tray style` <chr>, ID <dbl>,
## # c_description <chr>, h_sheight <dbl>, h_ID <dbl>
Review the original datasets to determine which columns are useful for analysis, or if there is enough rich data to create a few different variations of data output.