Month written as full word instead of as an integer in B29, B37, B48, and B70
Name of month misspelled in B70 (will not be found with find and replace search)
Column C: Year and Month have been separated into their own columns, but date has been left in MM/DD/YYYY format.
Date format is different in C11 and C14
Station column contains both location and site number (could be separated depending on the actual context of the stations; might be alright as is)
Depth values, D1, DO, Temp all listed at inconsistent decimal places
D1 label unclear
Inconsistent notation for unmeasured values: nd in Time Towed column; NA and blank spaces in Salinity column, both nd and blank spaces in species count columns; one -9999 count in Blackfin.Poacher column
-9999 error value in PDO (K103)
Blank spaces leave ambiguity on if a measurement was taken or not (Salinity and Species Caught columns)
Notes in random cell at the end of the spreadsheet: not only unclear, but lend some distrust to the entire datasheet with “Species IDs may not be correct.” References another spreadsheet with data not present to be compared with.
Unlabeled averages included at the bottom of D1, DO, Temp, Salinity, and PDO columns; will be included if run through any analysis; should be separated from the datasheet and included in analysis, not in the raw data.
Year value in A115 is not correct based on Date for that observation.
Two columns are present for big skate counts (Big.Skate and Big.skate).
I replaced any blank spaces, “nd”, or “-9999” with NA to be consistent throughout the spreadsheet.
I copied notes to the Metadata sheet for cleanliness in the data while maintaining any important observations from the field.
I aligned data to have equal significant digits depending on the variable and measurements taken.
I created a column for day and replaced any spelled-out months with integers, and I removed the combined date column. I copied any formula cells I used to determine year, month, and day from the dates, then pasted as values so I could remove the date column without receiving errors.
I combined the two big skate columns into a single tally.
I did not exclude any extreme outliers (i.e., counts greater than 900) from the data. It was hard to say if the high counts came from input error or if there really were a couple of days where the field researchers counted a whole lot of fish. I might change my mind if I were doing any analysis, but for data tidying I thought it important to include them for now.
Make a summary table of mean temperature and dissolved oxygen from May through September across years.
## # A tibble: 5 × 3
## Month mean_Temp mean_DO
## <int> <dbl> <dbl>
## 1 5 8.07 2.71
## 2 6 8.58 2.94
## 3 7 7.52 1.85
## 4 8 7.74 2.12
## 5 9 7.81 1.38
Plot the counts of Slender Sole against temperature (Slender sole ~ Temperature).
I used ggplot to compare slender sole counts to temperature. I used the Sailboat palette from PNWColors with 114 shades and reordered a subsetted, slender sole only dataframe by temperature to include a color gradient from low to high temps.
According to the data, slender sole are found in waters ranging in temperature from 7 to 9 degrees C. The majority of slender sole are found in water around 8 degrees C. Most sole counts ranged between 0 and 259, while a few outliers ranged from 500 to over 1000.
Make a new data frame that includes all the environmental and site data (columns A:L, 1:12) and the fish species with 20% frequency of occurrence or more.
present_func <- function(x) sum(x != 0, na.rm = TRUE)
n_func <- function(x) sum(x != "", na.rm = TRUE)
freq_func <- function(x) present_func(x)/n_func(x)
## Year Month Day
## 1.000000000 1.000000000 1.000000000
## Station Depth Time.Towed
## 1.000000000 1.000000000 1.000000000
## D1 DO Temp
## 1.000000000 1.000000000 1.000000000
## Salinity PDO Avg.Upwelling.Index
## 1.000000000 1.000000000 1.000000000
## Alligatorfish.spp. Arrowtooth.Flounder Big.Skate
## 0.017857143 0.017543860 0.044247788
## Bigeye.Poacher Black.Rockfish Blackbelly.Eelpout
## 0.008849558 0.017699115 0.008771930
## Blackfin.Poacher Slender.Sole Unknown
## 0.008849558 0.666666667 0.008771930
## Warty.Poacher Wattled.Eelpout Whitebait.Smelt
## 0.947368421 0.017543860 0.008771930
## Whitebarred.Prickleback Yellowtail.Rockfish
## 0.008771930 0.008771930
## Year Month Day Station Depth Time.Towed D1 DO Temp Salinity PDO
## 1 2008 7 26 MB30 30 10 496.29 1.99 7.37 NA -1.67
## 2 2008 8 29 MB30 33 10 502.68 3.44 9.04 NA -1.70
## 3 2008 8 11 MB30 30 10 496.29 NA 11.88 NA -1.70
## 4 2008 7 26 MB40 40 10 517.59 1.73 7.26 NA -1.67
## 5 2008 8 29 MB40 44 13 674.91 2.53 8.19 NA -1.70
## 6 2008 8 11 MB40 40 10 517.59 4.09 8.28 NA -1.70
## Avg.Upwelling.Index Slender.Sole Warty.Poacher
## 1 62 67 4
## 2 35 0 17
## 3 35 0 1
## 4 62 0 24
## 5 35 0 29
## 6 35 100 20
## [1] 114 14
The dimensions of my resulting table matches the expected dimensions.
Make your data long-form tidy data.
I used the gather function to create two columns (species and count) that include the slender sole and warty poacher data.
## Year Month Day Station Depth Time.Towed D1 DO Temp Salinity PDO
## 1 2008 7 26 MB30 30 10 496.29 1.99 7.37 NA -1.67
## 2 2008 8 29 MB30 33 10 502.68 3.44 9.04 NA -1.70
## 3 2008 8 11 MB30 30 10 496.29 NA 11.88 NA -1.70
## 4 2008 7 26 MB40 40 10 517.59 1.73 7.26 NA -1.67
## 5 2008 8 29 MB40 44 13 674.91 2.53 8.19 NA -1.70
## 6 2008 8 11 MB40 40 10 517.59 4.09 8.28 NA -1.70
## Avg.Upwelling.Index Species Count
## 1 62 Slender.Sole 67
## 2 35 Slender.Sole 0
## 3 35 Slender.Sole 0
## 4 62 Slender.Sole 0
## 5 35 Slender.Sole 0
## 6 35 Slender.Sole 100
## [1] 228 14
The dimensions of my long-form table match the expected dimensions.
Plot fish counts by depth for the fish species.
I used ggplot to compare fish counts to depth, then colored each species using the Lake palette from PNWColors.
Log-transforming the data reduces the scale and makes points easier to distinguish.
To remove the effect of log-transforming zero values, I also performed a log(x + c) transformation, where c = 0.5 (one half of lowest non-zero count, 1). I’m not entirely sure that this is the right approach for these data, but in my searches it was a common way to account for lots of zeros.
Both slender sole and warty poacher fish are found less frequently in deeper waters than in shallower waters. Warty poachers are found in increasing numbers until approximately 30 fathoms, after which they are found in fairly consistent numbers. Slender sole are found in a wide variety of numbers at various depths, though with greater range in and higher frequencies at shallower depths than in deeper waters.
Time Taken: This assignment took me about 5 hours in total, maybe a little more. While moving the data from wide form to long form took a good bit of searching, I spent most of the extra time learning R markdown and messing around with palettes.