This will help you figure how to address a question like Q8!
8. [2.5] Please generate a summary table that provides the average (mean) number of caterpillars observed in “Beat sheet” and “Visual” surveys (observation_method).
Zeros are important!
To properly address this question you will have to do something about surveys in which no caterpillars were observed. Excluding zero values will lead to very inflated count data!
Below is how Tara and I would figure out how to address a question like this …
Read the question carefully!
The question often gives a hint of the functions to use how to address the question. Here, we see that:
We are making a summary table – that almost always requires us to use summarize() with either group_by() or .by == .... It does not make sense to use summarize() without a grouping variable.
The question asks for the average, and specifies that the average is a “mean”. This suggests that mean() should be used to calculate the summary statistic for each group.
We see that caterpillars are our target taxa. Given that there are a bunch of different arthropod taxa in the data, this suggests that a filter() is necessary!
The question describes what the resultant object should be. Here, we see that we want a tibble with variable that describes the observation method and a variable that describes the mean number of caterpillars observed.
Before planning how to address this question, we should take a moment to visualize our resultant object. Given that it is a summary table that describes the mean number of caterpillars for two different methods, we can visualize a table with two columns (method and average number of caterpillars) and two rows (one for each survey type).
Identify which objects to use
The next step is to figure out which dataset(s) to use. We currently have five objects stored in our computer’s memory:
Based on the question and the output above, we can see that the variables of interest are arthropod and arthropod_quantity, which are referenced in the observations table, and observation_method, which is referenced in surveys.
Identify whether a join is necessary
We now know that we need information from two different tables and which tables to use. Because we need information across tables, we know that we need to use a join.
Because our resultant object will include variables that are currently contained in multiple tables, we know that we need to use a mutating join (left_join, inner_join, or full_join) as opposed to a filtering join (semi_join or anti_join).
At this point, we should have a sense that our code block will contain:
summarize()
group_by() or .by = ...
mean()
filter()
Either left_join, inner_join, or full_join
We are narrowing things down!
Identify key variables
Joins are based on a common variable (typically a key variable). So we have to look for what variables the tables have in common:
We can see that the variable that is shared between the tables is survey_id, so that is our key column.
Identify the target and source tables
Our next step is to determine which table is the target table (the table that will receive a new variable) and which is the source table (the table that will donate one or more variables).
A good rule of thumb for safely joining tables is to ensure that the key variable in the …
Source table is a primary key, which is a unique identifier for each observation/row;
Target table is a foreign key, which refers to the primary key of another table.
Let’s use a base R method to figure this out. To do so, we will look at the number of rows in each table and compare it with the number of unique key values:
nrow(observations)
[1] 251048
observations$survey_id %>%unique() %>%length()
[1] 181313
We can see that, because the number of unique survey_id values is less than the number of rows, the variable cannot be a primary key!
Let’s see about surveys:
nrow(surveys)
[1] 181313
surveys$survey_id %>%unique() %>%length()
[1] 181313
Because the number of rows of surveys is equivalent to the number of unique survey_id values, this can be used as the primary key of the table.
Tibbles make life easier!
The way tibbles are printed make this exploratory step pretty easy. Rather than extracting a unique vector and calculated its length, we could have determined the above with:
At this point, we should have a sense that we need to use:
summarize()
group_by() or .by = ...
mean()
filter()
Either left_join, inner_join, or full_join to join surveys (source table) to observations (target table). We know this because survey_id is the primary key of surveys and the foreign key of observations.
Subset the data to what you are interested in
To make things easier, this is a good time to subset the data to our goal.
At first, we might perceive that we want the variables survey_id (foreign key), arthropod, and arthropod_quantity from the observations table:
Either left_join, inner_join, or full_join to join surveys (source table) to observations (target table)
Which join should we use?
We are now ready to address the question of which mutating join will generate the desired result.
When we were identifying our key variable, we may have noticed that the length of unique survey_id values in observations is the same as the number of rows in surveys. This is a clue that all observations have a corresponding survey_id in surveysand that all surveys have a corresponding survey_id in observations.
We can verify this by using anti_join() to subset observations based on non-matching key values. This will filter observations to only those that do not have a corresponding survey_id in surveys:
observations %>%anti_join(surveys, by ="survey_id")
Certainly not! This means that caterpillars were not recorded for every survey_id (aka every survey).
Remember that our goal is to get the average number of caterpillars per survey for each observation method. In order to do this, we are still missing a few pieces of information:
The type of survey
Surveys in which no caterpillars were observed … Note that this clue was provided in the hint for this question!
Here is where we need to figure out which join to use that will add the missing survey_id values from the surveys table to the observations table while also adding the variable observation_method. Our choices are:
left_join(): All rows in the target table are maintained and values are only donated from the source table for matching key values.
inner_join(): The source table donates values to the target table for matching key values and the target table is subset to only rows with matching key values.
full_join(): All key values in the source and target table are maintained. When the keys do not match, the values in the target table are given the value NA.
We know that our resultant object at this stage will have more rows than the observations currently contains. The resultant object of left_join() contains the same number of rows as the target table. With inner_join(), we would end up with the same or less rows than the target table currently contains. Only a full_join() will add rows … we now know the join to use!
With the above choices, only a full_join() will add NA values to arthropod_quantity for the non-matching survey_id key values. We can verify this by using summary() for each of the different types of mutating joins:
survey_id arthropod_quantity observation_method
Length:181313 Min. : 0.00 Length:181313
Class :character 1st Qu.: 1.00 Class :character
Mode :character Median : 1.00 Mode :character
Mean : 1.63
3rd Qu.: 1.00
Max. :320.00
NA's :165588
Notice that with our full_join() the length is much greater than that of other two joins, the length now matches the number of rows in our surveys table, and now we have a load of NA values in our arthropod_quantity variable.
So, at this point we know that our code will contain:
summarize()
group_by() or .by = ...
mean()
filter() – done!
full_join() to join surveys (source table) to observations (target table) – done!
Are we ready to summarize?
Now that we have those missing values, how can we create a summary table that describes the mean() number of caterpillars counted for each observation method?
The summarize() function calculates a summary statistic for each group within the grouping variable that you define with group_by() or .by = ....
Here, we know that our summary statistic is “mean”, so we can calculate that with mean(). We want to conduct this operation for each type of survey, so observation_method is our grouping variable.
Surely we can summarize the mean() for each survey type, yeah? Let’s have a look:
# A tibble: 2 × 2
observation_method mean_caterpillars
<chr> <dbl>
1 Beat sheet NA
2 Visual NA
Ack! Both values were reported as NA! What do we do?
We need to change those NA values to zeros! We can modify the values in the arthropod_quantity column using mutate(). Within our mutate() we use replace_na() to specify that any arthropod_quantity value that is currently NA should be replaced with 0. We can verify that this worked with summary():
You did it! You created a summary table of the average number of caterpillars observed for each survey type. You did this by exploring the data along the way to find the right steps to take. The steps were:
Subset observations to where the arthropod value was "caterpillar"
Subset observations and surveys to your variables of interest
Conduct a full_join() to add rows that represent surveys in which no caterpillars were observed and add NA values to arthropod_quantity when this was the case
Modify the arthropod_quantity variable, using mutate() and replace_na() to replace NA values with 0
Summarize the resultant object with summarize()
Calculate the mean with mean()
Define the grouping variable with .by = observation_method to calculate the mean for each group