Part 1.

Dataset errors.

  1. Month written as full word instead of as an integer in B29, B37, B48, and B70

  2. Name of month misspelled in B70 (will not be found with find and replace search)

  3. Column C: Year and Month have been separated into their own columns, but date has been left in MM/DD/YYYY format.

  4. Date format is different in C11 and C14

  5. Station column contains both location and site number (could be separated depending on the actual context of the stations; might be alright as is)

  6. Depth values, D1, DO, Temp all listed at inconsistent decimal places

  7. D1 label unclear

  8. Inconsistent notation for unmeasured values: nd in Time Towed column; NA and blank spaces in Salinity column, both nd and blank spaces in species count columns; one -9999 count in Blackfin.Poacher column

  9. -9999 error value in PDO (K103)

  10. Blank spaces leave ambiguity on if a measurement was taken or not (Salinity and Species Caught columns)

  11. Notes in random cell at the end of the spreadsheet: not only unclear, but lend some distrust to the entire datasheet with “Species IDs may not be correct.” References another spreadsheet with data not present to be compared with.

  12. Unlabeled averages included at the bottom of D1, DO, Temp, Salinity, and PDO columns; will be included if run through any analysis; should be separated from the datasheet and included in analysis, not in the raw data.

  13. Year value in A115 is not correct based on Date for that observation.

  14. Two columns are present for big skate counts (Big.Skate and Big.skate).


Error Correction.

  1. I replaced any blank spaces, “nd”, or “-9999” with NA to be consistent throughout the spreadsheet.

  2. I copied notes to the Metadata sheet for cleanliness in the data while maintaining any important observations from the field.

  3. I aligned data to have equal significant digits depending on the variable and measurements taken.

  4. I created a column for day and replaced any spelled-out months with integers, and I removed the combined date column. I copied any formula cells I used to determine year, month, and day from the dates, then pasted as values so I could remove the date column without receiving errors.

  5. I combined the two big skate columns into a single tally.

  6. I did not exclude any extreme outliers (i.e., counts greater than 900) from the data. It was hard to say if the high counts came from input error or if there really were a couple of days where the field researchers counted a whole lot of fish. I might change my mind if I were doing any analysis, but for data tidying I thought it important to include them for now.


Part 2.

1. Summary: Temperature and DO

Make a summary table of mean temperature and dissolved oxygen from May through September across years.

## # A tibble: 5 × 3
##   Month mean_Temp mean_DO
##   <int>     <dbl>   <dbl>
## 1     5      8.07    2.71
## 2     6      8.58    2.94
## 3     7      7.52    1.85
## 4     8      7.74    2.12
## 5     9      7.81    1.38

2. Plot Slender Sole ~ Temperature.

Plot the counts of Slender Sole against temperature (Slender sole ~ Temperature).


I used ggplot to compare slender sole counts to temperature. I used the Sailboat palette from PNWColors with 114 shades and reordered a subsetted, slender sole only dataframe by temperature to include a color gradient from low to high temps.


According to the data, slender sole are found in waters ranging in temperature from 7 to 9 degrees C. The majority of slender sole are found in water around 8 degrees C. Most sole counts ranged between 0 and 259, while a few outliers ranged from 500 to over 1000.


3. Tidy the Data.

Make a new data frame that includes all the environmental and site data (columns A:L, 1:12) and the fish species with 20% frequency of occurrence or more.


  1. Custom function to count the number of days in which a fish was counted.
present_func <- function(x) sum(x != 0, na.rm = TRUE)
  1. Custom function to count the total number of days on which data was recorded (excludes days where no data on a particular fish was collected, zero or otherwise; i.e., NA values removed).
n_func <- function(x) sum(x != "", na.rm = TRUE)
  1. Custom function to determine the proportion of collections in which a fish was counted.
freq_func <- function(x) present_func(x)/n_func(x)
  1. Using “sapply” function to determine proportion of days in which each variable was counted.
##                    Year                   Month                     Day 
##             1.000000000             1.000000000             1.000000000 
##                 Station                   Depth              Time.Towed 
##             1.000000000             1.000000000             1.000000000 
##                      D1                      DO                    Temp 
##             1.000000000             1.000000000             1.000000000 
##                Salinity                     PDO     Avg.Upwelling.Index 
##             1.000000000             1.000000000             1.000000000 
##      Alligatorfish.spp.     Arrowtooth.Flounder               Big.Skate 
##             0.017857143             0.017543860             0.044247788 
##          Bigeye.Poacher          Black.Rockfish      Blackbelly.Eelpout 
##             0.008849558             0.017699115             0.008771930 
##        Blackfin.Poacher            Slender.Sole                 Unknown 
##             0.008849558             0.666666667             0.008771930 
##           Warty.Poacher         Wattled.Eelpout         Whitebait.Smelt 
##             0.947368421             0.017543860             0.008771930 
## Whitebarred.Prickleback     Yellowtail.Rockfish 
##             0.008771930             0.008771930
  1. From my new proportion table, I saw that only slender sole and water poachers were collected on 20% or more days of the study. I used select to choose all variable columns and only the two fish columns I needed.
##   Year Month Day Station Depth Time.Towed     D1   DO  Temp Salinity   PDO
## 1 2008     7  26    MB30    30         10 496.29 1.99  7.37       NA -1.67
## 2 2008     8  29    MB30    33         10 502.68 3.44  9.04       NA -1.70
## 3 2008     8  11    MB30    30         10 496.29   NA 11.88       NA -1.70
## 4 2008     7  26    MB40    40         10 517.59 1.73  7.26       NA -1.67
## 5 2008     8  29    MB40    44         13 674.91 2.53  8.19       NA -1.70
## 6 2008     8  11    MB40    40         10 517.59 4.09  8.28       NA -1.70
##   Avg.Upwelling.Index Slender.Sole Warty.Poacher
## 1                  62           67             4
## 2                  35            0            17
## 3                  35            0             1
## 4                  62            0            24
## 5                  35            0            29
## 6                  35          100            20
## [1] 114  14

The dimensions of my resulting table matches the expected dimensions.


4. Long-Form Tidy Data.

Make your data long-form tidy data.


I used the gather function to create two columns (species and count) that include the slender sole and warty poacher data.

##   Year Month Day Station Depth Time.Towed     D1   DO  Temp Salinity   PDO
## 1 2008     7  26    MB30    30         10 496.29 1.99  7.37       NA -1.67
## 2 2008     8  29    MB30    33         10 502.68 3.44  9.04       NA -1.70
## 3 2008     8  11    MB30    30         10 496.29   NA 11.88       NA -1.70
## 4 2008     7  26    MB40    40         10 517.59 1.73  7.26       NA -1.67
## 5 2008     8  29    MB40    44         13 674.91 2.53  8.19       NA -1.70
## 6 2008     8  11    MB40    40         10 517.59 4.09  8.28       NA -1.70
##   Avg.Upwelling.Index      Species Count
## 1                  62 Slender.Sole    67
## 2                  35 Slender.Sole     0
## 3                  35 Slender.Sole     0
## 4                  62 Slender.Sole     0
## 5                  35 Slender.Sole     0
## 6                  35 Slender.Sole   100
## [1] 228  14

The dimensions of my long-form table match the expected dimensions.


5. Fish Counts by Depth.

Plot fish counts by depth for the fish species.


I used ggplot to compare fish counts to depth, then colored each species using the Lake palette from PNWColors.



Log-transforming the data reduces the scale and makes points easier to distinguish.


To remove the effect of log-transforming zero values, I also performed a log(x + c) transformation, where c = 0.5 (one half of lowest non-zero count, 1). I’m not entirely sure that this is the right approach for these data, but in my searches it was a common way to account for lots of zeros.

Both slender sole and warty poacher fish are found less frequently in deeper waters than in shallower waters. Warty poachers are found in increasing numbers until approximately 30 fathoms, after which they are found in fairly consistent numbers. Slender sole are found in a wide variety of numbers at various depths, though with greater range in and higher frequencies at shallower depths than in deeper waters.


Time Taken: This assignment took me about 5 hours in total, maybe a little more. While moving the data from wide form to long form took a good bit of searching, I spent most of the extra time learning R markdown and messing around with palettes.