Problem set 3: Question 8

This will help you figure how to address a question like Q8!

8. [2.5] Please generate a summary table that provides the average (mean) number of caterpillars observed in “Beat sheet” and “Visual” surveys (observation_method).

Zeros are important!

To properly address this question you will have to do something about surveys in which no caterpillars were observed. Excluding zero values will lead to very inflated count data!

Below is how Tara and I would figure out how to address a question like this …

Read the question carefully!

The question often gives a hint of the functions to use how to address the question. Here, we see that:

  • We are making a summary table – that almost always requires us to use summarize() with either group_by() or .by == .... It does not make sense to use summarize() without a grouping variable.
  • The question asks for the average, and specifies that the average is a “mean”. This suggests that mean() should be used to calculate the summary statistic for each group.
  • We see that caterpillars are our target taxa. Given that there are a bunch of different arthropod taxa in the data, this suggests that a filter() is necessary!
  • The question describes what the resultant object should be. Here, we see that we want a tibble with variable that describes the observation method and a variable that describes the mean number of caterpillars observed.

Before planning how to address this question, we should take a moment to visualize our resultant object. Given that it is a summary table that describes the mean number of caterpillars for two different methods, we can visualize a table with two columns (method and average number of caterpillars) and two rows (one for each survey type).

Identify which objects to use

The next step is to figure out which dataset(s) to use. We currently have five objects stored in our computer’s memory:

observations
# A tibble: 251,048 × 4
   observation_id survey_id arthropod  arthropod_quantity
   <chr>          <chr>     <chr>                   <dbl>
 1 2Ejl0W0        LslF1uF   leafhopper                  1
 2 daAH0hD        acYQwsU   ant                         1
 3 5jXfgg0        sADxf3v   spider                      1
 4 amdUtIo        wlVEaKe   none                        0
 5 i09LhUC        SMiwqKo   beetle                      2
 6 WXtc3pP        SMiwqKo   ant                         1
 7 uw3H9A1        SMiwqKo   leafhopper                  1
 8 AdiFmpt        klgId4y   beetle                      1
 9 oc9soja        klgId4y   ant                         1
10 JuvJ6SC        LtTlMU2   ant                         2
# ℹ 251,038 more rows
sites
# A tibble: 261 × 6
   site_id site_name                  site_description latitude longitude region
   <chr>   <chr>                      <chr>               <dbl>     <dbl> <chr> 
 1 XaCkrZW 16 Old Paris Mtn           Nikki Day's bac…     34.9     -82.4 SC    
 2 V2LDBQd ABQ BioPark Botanical Gar… The ABQ Biopark…     35.1    -107.  NM    
 3 x2AEX70 ASRI Caratunk              Caratunk Wildli…     41.9     -71.3 MA    
 4 bO5MLKw ASRI Claire McIntosh       Claire D McInto…     41.7     -71.3 RI    
 5 LZqkqK5 ASRI Davis Memorial        Davis Memorial …     41.6     -71.5 RI    
 6 CmZfVQZ ASRI Emilie Ruecker        Emilie Ruecker …     41.6     -71.2 RI    
 7 YLUSVRc ASRI Fisherville Brook     Fisherville Bro…     41.6     -71.6 RI    
 8 VM5fmih ASRI Fort                  Fort Wildlife R…     42.0     -71.6 RI    
 9 7Gwds4l ASRI Long Pond             Long Pond Woods…     41.5     -71.8 RI    
10 Gh0r1yJ ASRI Maxwell Mays          Maxwell Mays Wi…     41.7     -71.7 RI    
# ℹ 251 more rows
sites_dmv
# A tibble: 74 × 6
   site_id site_name     site_description              latitude longitude region
   <chr>   <chr>         <chr>                            <dbl>     <dbl> <chr> 
 1 j2auEBA BARBJERMD1    Suburban backyard NN              39.1     -76.5 MD    
 2 BNFZIez BBS 88-900-09 Breeding Bird Survey in Virg…     37.7     -79.3 VA    
 3 1CytMDf BBS 88-900-29 Breeding Bird Survey in Virg…     37.7     -79.2 VA    
 4 fQyKVYF BBS 88-902-23 Breeding Bird Survey in Virg…     37.5     -79.6 VA    
 5 tb2K6hK BBS 88-902-36 Breeding Bird Survey in Virg…     37.5     -79.6 VA    
 6 pzciBi6 BBS 88-905-17 Breeding Bird Survey in Virg…     37.5     -80.2 VA    
 7 O8VV7r8 BBS 88-905-38 Breeding Bird Survey in Virg…     37.6     -80.2 VA    
 8 9SUjmd4 BBS 88-907-21 Breeding Bird Survey in Virg…     38.4     -78.5 VA    
 9 PgBl1Lj BBS 88-907-45 Breeding Bird Survey in Virg…     38.3     -78.6 VA    
10 crXEjUe BBS 88-920-25 Breeding Bird Survey in Virg…     36.7     -81.6 VA    
# ℹ 64 more rows
survey_locations
# A tibble: 10,255 × 3
   branch_id site_id plant         
   <chr>     <chr>   <chr>         
 1 FUG       XaCkrZW Privet spp.   
 2 FUF       XaCkrZW Peach         
 3 FUE       XaCkrZW Peach         
 4 FUD       XaCkrZW Persimmon spp.
 5 FUC       XaCkrZW Persimmon spp.
 6 FUB       XaCkrZW American plum 
 7 FUA       XaCkrZW Black locust  
 8 FTZ       XaCkrZW Ash spp.      
 9 FTY       XaCkrZW Eastern redbud
10 FTX       XaCkrZW Chokecherry   
# ℹ 10,245 more rows
surveys
# A tibble: 181,313 × 4
   survey_id branch_id date       observation_method
   <chr>     <chr>     <date>     <chr>             
 1 cQpOyd5   CFDGUVJ   2002-05-29 Visual            
 2 bMesgNV   CFDGUVK   2002-05-29 Visual            
 3 im4Nhbv   CFDGUVL   2002-05-29 Visual            
 4 10s2A3O   CFDGUVM   2002-05-29 Visual            
 5 SBbk49o   CFDGUVP   2002-05-29 Visual            
 6 qFSKksv   CFDGUVQ   2002-05-29 Visual            
 7 a5hGsU3   CFDGUVZ   2002-05-29 Visual            
 8 HfSJC9L   CFDGUWA   2002-05-29 Visual            
 9 FBban39   CFDGUWD   2002-05-29 Visual            
10 L4AfdSp   CFDGUWE   2002-05-29 Visual            
# ℹ 181,303 more rows

Based on the question and the output above, we can see that the variables of interest are arthropod and arthropod_quantity, which are referenced in the observations table, and observation_method, which is referenced in surveys.

Identify whether a join is necessary

We now know that we need information from two different tables and which tables to use. Because we need information across tables, we know that we need to use a join.

Because our resultant object will include variables that are currently contained in multiple tables, we know that we need to use a mutating join (left_join, inner_join, or full_join) as opposed to a filtering join (semi_join or anti_join).

At this point, we should have a sense that our code block will contain:

  • summarize()
  • group_by() or .by = ...
  • mean()
  • filter()
  • Either left_join, inner_join, or full_join

We are narrowing things down!

Identify key variables

Joins are based on a common variable (typically a key variable). So we have to look for what variables the tables have in common:

observations
# A tibble: 251,048 × 4
   observation_id survey_id arthropod  arthropod_quantity
   <chr>          <chr>     <chr>                   <dbl>
 1 2Ejl0W0        LslF1uF   leafhopper                  1
 2 daAH0hD        acYQwsU   ant                         1
 3 5jXfgg0        sADxf3v   spider                      1
 4 amdUtIo        wlVEaKe   none                        0
 5 i09LhUC        SMiwqKo   beetle                      2
 6 WXtc3pP        SMiwqKo   ant                         1
 7 uw3H9A1        SMiwqKo   leafhopper                  1
 8 AdiFmpt        klgId4y   beetle                      1
 9 oc9soja        klgId4y   ant                         1
10 JuvJ6SC        LtTlMU2   ant                         2
# ℹ 251,038 more rows
surveys
# A tibble: 181,313 × 4
   survey_id branch_id date       observation_method
   <chr>     <chr>     <date>     <chr>             
 1 cQpOyd5   CFDGUVJ   2002-05-29 Visual            
 2 bMesgNV   CFDGUVK   2002-05-29 Visual            
 3 im4Nhbv   CFDGUVL   2002-05-29 Visual            
 4 10s2A3O   CFDGUVM   2002-05-29 Visual            
 5 SBbk49o   CFDGUVP   2002-05-29 Visual            
 6 qFSKksv   CFDGUVQ   2002-05-29 Visual            
 7 a5hGsU3   CFDGUVZ   2002-05-29 Visual            
 8 HfSJC9L   CFDGUWA   2002-05-29 Visual            
 9 FBban39   CFDGUWD   2002-05-29 Visual            
10 L4AfdSp   CFDGUWE   2002-05-29 Visual            
# ℹ 181,303 more rows

We can see that the variable that is shared between the tables is survey_id, so that is our key column.

Identify the target and source tables

Our next step is to determine which table is the target table (the table that will receive a new variable) and which is the source table (the table that will donate one or more variables).

A good rule of thumb for safely joining tables is to ensure that the key variable in the …

  • Source table is a primary key, which is a unique identifier for each observation/row;
  • Target table is a foreign key, which refers to the primary key of another table.

Let’s use a base R method to figure this out. To do so, we will look at the number of rows in each table and compare it with the number of unique key values:

nrow(observations)
[1] 251048
observations$survey_id %>% 
  unique() %>% 
  length()
[1] 181313

We can see that, because the number of unique survey_id values is less than the number of rows, the variable cannot be a primary key!

Let’s see about surveys:

nrow(surveys)
[1] 181313
surveys$survey_id %>% 
  unique() %>% 
  length()
[1] 181313

Because the number of rows of surveys is equivalent to the number of unique survey_id values, this can be used as the primary key of the table.

Tibbles make life easier!

The way tibbles are printed make this exploratory step pretty easy. Rather than extracting a unique vector and calculated its length, we could have determined the above with:

observations
# A tibble: 251,048 × 4
   observation_id survey_id arthropod  arthropod_quantity
   <chr>          <chr>     <chr>                   <dbl>
 1 2Ejl0W0        LslF1uF   leafhopper                  1
 2 daAH0hD        acYQwsU   ant                         1
 3 5jXfgg0        sADxf3v   spider                      1
 4 amdUtIo        wlVEaKe   none                        0
 5 i09LhUC        SMiwqKo   beetle                      2
 6 WXtc3pP        SMiwqKo   ant                         1
 7 uw3H9A1        SMiwqKo   leafhopper                  1
 8 AdiFmpt        klgId4y   beetle                      1
 9 oc9soja        klgId4y   ant                         1
10 JuvJ6SC        LtTlMU2   ant                         2
# ℹ 251,038 more rows
observations %>% 
  distinct(survey_id)
# A tibble: 181,313 × 1
   survey_id
   <chr>    
 1 LslF1uF  
 2 acYQwsU  
 3 sADxf3v  
 4 wlVEaKe  
 5 SMiwqKo  
 6 klgId4y  
 7 LtTlMU2  
 8 duoZH6I  
 9 RCrH6P6  
10 fjmT4Ei  
# ℹ 181,303 more rows
surveys
# A tibble: 181,313 × 4
   survey_id branch_id date       observation_method
   <chr>     <chr>     <date>     <chr>             
 1 cQpOyd5   CFDGUVJ   2002-05-29 Visual            
 2 bMesgNV   CFDGUVK   2002-05-29 Visual            
 3 im4Nhbv   CFDGUVL   2002-05-29 Visual            
 4 10s2A3O   CFDGUVM   2002-05-29 Visual            
 5 SBbk49o   CFDGUVP   2002-05-29 Visual            
 6 qFSKksv   CFDGUVQ   2002-05-29 Visual            
 7 a5hGsU3   CFDGUVZ   2002-05-29 Visual            
 8 HfSJC9L   CFDGUWA   2002-05-29 Visual            
 9 FBban39   CFDGUWD   2002-05-29 Visual            
10 L4AfdSp   CFDGUWE   2002-05-29 Visual            
# ℹ 181,303 more rows
surveys %>% 
  distinct(survey_id)
# A tibble: 181,313 × 1
   survey_id
   <chr>    
 1 cQpOyd5  
 2 bMesgNV  
 3 im4Nhbv  
 4 10s2A3O  
 5 SBbk49o  
 6 qFSKksv  
 7 a5hGsU3  
 8 HfSJC9L  
 9 FBban39  
10 L4AfdSp  
# ℹ 181,303 more rows

At this point, we should have a sense that we need to use:

  • summarize()
  • group_by() or .by = ...
  • mean()
  • filter()
  • Either left_join, inner_join, or full_join to join surveys (source table) to observations (target table). We know this because survey_id is the primary key of surveys and the foreign key of observations.

Subset the data to what you are interested in

To make things easier, this is a good time to subset the data to our goal.

At first, we might perceive that we want the variables survey_id (foreign key), arthropod, and arthropod_quantity from the observations table:

observations %>% 
  select(
    survey_id,
    arthropod,
    arthropod_quantity
  )
# A tibble: 251,048 × 3
   survey_id arthropod  arthropod_quantity
   <chr>     <chr>                   <dbl>
 1 LslF1uF   leafhopper                  1
 2 acYQwsU   ant                         1
 3 sADxf3v   spider                      1
 4 wlVEaKe   none                        0
 5 SMiwqKo   beetle                      2
 6 SMiwqKo   ant                         1
 7 SMiwqKo   leafhopper                  1
 8 klgId4y   beetle                      1
 9 klgId4y   ant                         1
10 LtTlMU2   ant                         2
# ℹ 251,038 more rows

… on closer look, however, the question is only asking for counts of caterpillars. This suggests that we need to filter the data:

observations %>% 
  filter(arthropod == "caterpillar") %>% 
  select(
    survey_id,
    arthropod,
    arthropod_quantity
  )
# A tibble: 15,725 × 3
   survey_id arthropod   arthropod_quantity
   <chr>     <chr>                    <dbl>
 1 7ouv2sn   caterpillar                  1
 2 32RWZRe   caterpillar                  1
 3 sekYbz6   caterpillar                  1
 4 mVryFP7   caterpillar                  1
 5 mBe7g8c   caterpillar                  1
 6 XnYVQlx   caterpillar                  1
 7 jLD9YL0   caterpillar                  1
 8 NnBf2gC   caterpillar                  1
 9 rxO2tzk   caterpillar                  1
10 SrhPmmt   caterpillar                  1
# ℹ 15,715 more rows

Because every value in the column arthropod is now "caterpillar", this could be simplified to:

observations %>% 
  filter(arthropod == "caterpillar") %>% 
  select(survey_id, arthropod_quantity)
# A tibble: 15,725 × 2
   survey_id arthropod_quantity
   <chr>                  <dbl>
 1 7ouv2sn                    1
 2 32RWZRe                    1
 3 sekYbz6                    1
 4 mVryFP7                    1
 5 mBe7g8c                    1
 6 XnYVQlx                    1
 7 jLD9YL0                    1
 8 NnBf2gC                    1
 9 rxO2tzk                    1
10 SrhPmmt                    1
# ℹ 15,715 more rows

For our surveys data, we know that we need to maintain the key column (survey_id) and type of survey observation_method:

surveys %>% 
  select(survey_id, observation_method)
# A tibble: 181,313 × 2
   survey_id observation_method
   <chr>     <chr>             
 1 cQpOyd5   Visual            
 2 bMesgNV   Visual            
 3 im4Nhbv   Visual            
 4 10s2A3O   Visual            
 5 SBbk49o   Visual            
 6 qFSKksv   Visual            
 7 a5hGsU3   Visual            
 8 HfSJC9L   Visual            
 9 FBban39   Visual            
10 L4AfdSp   Visual            
# ℹ 181,303 more rows

Now we know that our code will contain:

  • summarize()
  • group_by() or .by = ...
  • mean()
  • filter() – done!
  • Either left_join, inner_join, or full_join to join surveys (source table) to observations (target table)

Which join should we use?

We are now ready to address the question of which mutating join will generate the desired result.

When we were identifying our key variable, we may have noticed that the length of unique survey_id values in observations is the same as the number of rows in surveys. This is a clue that all observations have a corresponding survey_id in surveys and that all surveys have a corresponding survey_id in observations.

We can verify this by using anti_join() to subset observations based on non-matching key values. This will filter observations to only those that do not have a corresponding survey_id in surveys:

observations %>% 
  anti_join(surveys, by = "survey_id")
# A tibble: 0 × 4
# ℹ 4 variables: observation_id <chr>, survey_id <chr>, arthropod <chr>,
#   arthropod_quantity <dbl>

… or, we can subset surveys for which the survey_id does not have a corresponding value in observations:

surveys %>% 
  anti_join(observations, by = "survey_id")
# A tibble: 0 × 4
# ℹ 4 variables: survey_id <chr>, branch_id <chr>, date <date>,
#   observation_method <chr>

So far so good!

Is the same true, however, when we subset observations to "caterpillars"?

observations %>% 
  filter(arthropod == "caterpillar") %>% 
  select(survey_id, arthropod_quantity) %>% 
  anti_join(
    surveys %>% 
      select(survey_id, observation_method), 
    by = "survey_id"
  )
# A tibble: 0 × 2
# ℹ 2 variables: survey_id <chr>, arthropod_quantity <dbl>

That one checks out … how about joining observations to surveys?

surveys %>% 
  select(survey_id, observation_method) %>% 
  anti_join(
    observations %>% 
      filter(arthropod == "caterpillar") %>% 
      select(survey_id, arthropod_quantity),
    by = "survey_id"
  )
# A tibble: 165,588 × 2
   survey_id observation_method
   <chr>     <chr>             
 1 cQpOyd5   Visual            
 2 im4Nhbv   Visual            
 3 10s2A3O   Visual            
 4 SBbk49o   Visual            
 5 qFSKksv   Visual            
 6 a5hGsU3   Visual            
 7 HfSJC9L   Visual            
 8 FBban39   Visual            
 9 MafeUzO   Visual            
10 rhNsHqd   Visual            
# ℹ 165,578 more rows

Certainly not! This means that caterpillars were not recorded for every survey_id (aka every survey).

Remember that our goal is to get the average number of caterpillars per survey for each observation method. In order to do this, we are still missing a few pieces of information:

  • The type of survey
  • Surveys in which no caterpillars were observed … Note that this clue was provided in the hint for this question!

Here is where we need to figure out which join to use that will add the missing survey_id values from the surveys table to the observations table while also adding the variable observation_method. Our choices are:

  • left_join(): All rows in the target table are maintained and values are only donated from the source table for matching key values.
  • inner_join(): The source table donates values to the target table for matching key values and the target table is subset to only rows with matching key values.
  • full_join(): All key values in the source and target table are maintained. When the keys do not match, the values in the target table are given the value NA.

We know that our resultant object at this stage will have more rows than the observations currently contains. The resultant object of left_join() contains the same number of rows as the target table. With inner_join(), we would end up with the same or less rows than the target table currently contains. Only a full_join() will add rows … we now know the join to use!

observations %>% 
  filter(arthropod == "caterpillar") %>% 
  select(survey_id, arthropod_quantity) %>% 
  full_join(
    surveys %>% 
      select(survey_id, observation_method), 
    by = "survey_id"
  )
# A tibble: 181,313 × 3
   survey_id arthropod_quantity observation_method
   <chr>                  <dbl> <chr>             
 1 7ouv2sn                    1 Beat sheet        
 2 32RWZRe                    1 Visual            
 3 sekYbz6                    1 Visual            
 4 mVryFP7                    1 Visual            
 5 mBe7g8c                    1 Visual            
 6 XnYVQlx                    1 Visual            
 7 jLD9YL0                    1 Visual            
 8 NnBf2gC                    1 Visual            
 9 rxO2tzk                    1 Visual            
10 SrhPmmt                    1 Visual            
# ℹ 181,303 more rows

With the above choices, only a full_join() will add NA values to arthropod_quantity for the non-matching survey_id key values. We can verify this by using summary() for each of the different types of mutating joins:

observations %>% 
  filter(arthropod == "caterpillar") %>% 
  select(survey_id, arthropod_quantity) %>% 
  left_join(
    surveys %>% 
      select(survey_id, observation_method), 
    by = "survey_id"
  ) %>% 
  summary()
  survey_id         arthropod_quantity observation_method
 Length:15725       Min.   :  0.000    Length:15725      
 Class :character   1st Qu.:  1.000    Class :character  
 Mode  :character   Median :  1.000    Mode  :character  
                    Mean   :  1.635                      
                    3rd Qu.:  1.000                      
                    Max.   :320.000                      
observations %>% 
  filter(arthropod == "caterpillar") %>% 
  select(survey_id, arthropod_quantity) %>% 
  inner_join(
    surveys %>% 
      select(survey_id, observation_method), 
    by = "survey_id"
  ) %>% 
  summary()
  survey_id         arthropod_quantity observation_method
 Length:15725       Min.   :  0.000    Length:15725      
 Class :character   1st Qu.:  1.000    Class :character  
 Mode  :character   Median :  1.000    Mode  :character  
                    Mean   :  1.635                      
                    3rd Qu.:  1.000                      
                    Max.   :320.000                      
observations %>% 
  filter(arthropod == "caterpillar") %>% 
  select(survey_id, arthropod_quantity) %>% 
  full_join(
    surveys %>% 
      select(survey_id, observation_method), 
    by = "survey_id"
  ) %>% 
  summary()
  survey_id         arthropod_quantity observation_method
 Length:181313      Min.   :  0.00     Length:181313     
 Class :character   1st Qu.:  1.00     Class :character  
 Mode  :character   Median :  1.00     Mode  :character  
                    Mean   :  1.63                       
                    3rd Qu.:  1.00                       
                    Max.   :320.00                       
                    NA's   :165588                       

Notice that with our full_join() the length is much greater than that of other two joins, the length now matches the number of rows in our surveys table, and now we have a load of NA values in our arthropod_quantity variable.

So, at this point we know that our code will contain:

  • summarize()
  • group_by() or .by = ...
  • mean()
  • filter() – done!
  • full_join() to join surveys (source table) to observations (target table) – done!

Are we ready to summarize?

Now that we have those missing values, how can we create a summary table that describes the mean() number of caterpillars counted for each observation method?

The summarize() function calculates a summary statistic for each group within the grouping variable that you define with group_by() or .by = ....

Here, we know that our summary statistic is “mean”, so we can calculate that with mean(). We want to conduct this operation for each type of survey, so observation_method is our grouping variable.

Surely we can summarize the mean() for each survey type, yeah? Let’s have a look:

observations %>% 
  filter(arthropod == "caterpillar") %>% 
  select(survey_id, arthropod_quantity) %>% 
  full_join(
    surveys %>% 
      select(survey_id, observation_method), 
    by = "survey_id"
  ) %>% 
  summarize(
    mean_caterpillars = mean(arthropod_quantity),
    .by = observation_method
  )
# A tibble: 2 × 2
  observation_method mean_caterpillars
  <chr>                          <dbl>
1 Beat sheet                        NA
2 Visual                            NA

Ack! Both values were reported as NA! What do we do?

We need to change those NA values to zeros! We can modify the values in the arthropod_quantity column using mutate(). Within our mutate() we use replace_na() to specify that any arthropod_quantity value that is currently NA should be replaced with 0. We can verify that this worked with summary():

observations %>% 
  filter(arthropod == "caterpillar") %>% 
  select(survey_id, arthropod_quantity) %>% 
  full_join(
    surveys %>% 
      select(survey_id, observation_method), 
    by = "survey_id"
  ) %>% 
  mutate(
    arthropod_quantity = replace_na(arthropod_quantity, 0)
  ) %>% 
  summary()
  survey_id         arthropod_quantity observation_method
 Length:181313      Min.   :  0.0000   Length:181313     
 Class :character   1st Qu.:  0.0000   Class :character  
 Mode  :character   Median :  0.0000   Mode  :character  
                    Mean   :  0.1418                     
                    3rd Qu.:  0.0000                     
                    Max.   :320.0000                     

… and then apply this to our summarize() operation:

observations %>% 
  filter(arthropod == "caterpillar") %>% 
  select(survey_id, arthropod_quantity) %>% 
  full_join(
    surveys %>% 
      select(survey_id, observation_method), 
    by = "survey_id"
  ) %>% 
  mutate(
    arthropod_quantity = replace_na(arthropod_quantity, 0)
  ) %>% 
  summarize(
    mean_caterpillars = mean(arthropod_quantity),
    .by = observation_method
  )
# A tibble: 2 × 2
  observation_method mean_caterpillars
  <chr>                          <dbl>
1 Beat sheet                    0.0889
2 Visual                        0.163 

You did it! You created a summary table of the average number of caterpillars observed for each survey type. You did this by exploring the data along the way to find the right steps to take. The steps were:

  1. Subset observations to where the arthropod value was "caterpillar"
  2. Subset observations and surveys to your variables of interest
  3. Conduct a full_join() to add rows that represent surveys in which no caterpillars were observed and add NA values to arthropod_quantity when this was the case
  4. Modify the arthropod_quantity variable, using mutate() and replace_na() to replace NA values with 0
  5. Summarize the resultant object with summarize()
  6. Calculate the mean with mean()
  7. Define the grouping variable with .by = observation_method to calculate the mean for each group