Be sure to change the author in the YAML to your name. Remember to keep it inside the quotes.
Questions that require the use of R will have an R code chunk below them.
Download the datasets about birds (titled OrdwayBirds.csv and OrdwaySpeciesNames.csv) from the Canvas page for this assignment and save these files to the folder where the RMD file is located.
There is less hand-holding in this assignment.
For this Challenge Problem assignment, you are going to be using a dataset, titled OrdwayBirds, on Minnesota bird species. This data was collected as part of a historical record of birds captured and released at a nature preserve in Inver Grove Heights, Minnesota (i.e., the Katharine Ordway Natural History Study Area, which is owned and managed by Macalester College in St. Paul, Minnesota). Originally written by hand in a field notebook, the entries have been transcribed into electronic format under the supervision of Jerald Dosch, Dept. of Biology, Macalester College.1
Due to mistakes in data entry, the SpeciesName variable in the OrdwayBirds dataset needs some fixing. SpeciesName is intended to identify the species of each of the birds, but the spelling often varies among birds of the same biological species. This leads to misclassification of birds.
Fortunately, this error is easy to fix. A different dataset, titled OrdwaySpeciesNames, was created to take into account all of the original variations in the spelling of the species names (SpeciesName) and translate them into a unified spelling (SpeciesNameCleaned). That is, this other dataset provides a cross-reference between the original spelling and a common or more appropriate one.
The information from the two datasets, OrdwayBirds and
OrdwaySpeciesNames, can be merged using a
join_function() to correct the original spellings and then
carry out further explorations of the Minnesota birds dataset.
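As a minimal sketch of this idea, the example below uses two made-up toy tables rather than the real Ordway data (the objects `birds_toy` and `lookup_toy` and all of their values are illustrative assumptions). It shows how joining on the original spelling attaches a unified spelling to each record.

```r
library(dplyr)

# Toy capture records with inconsistent spellings (hypothetical values)
birds_toy <- tibble(
  SpeciesName = c("Robin", "robin", "Blue Jay", "Bluejay"),
  Weight = c(77, 80, 85, 90)
)

# Toy cross-reference: one row per original spelling, mapped to a unified spelling
lookup_toy <- tibble(
  SpeciesName = c("Robin", "robin", "Blue Jay", "Bluejay"),
  SpeciesNameCleaned = c("American Robin", "American Robin", "Blue Jay", "Blue Jay")
)

# left_join() keeps every row of birds_toy and adds the SpeciesNameCleaned column
birds_toy_clean <- birds_toy %>%
  left_join(lookup_toy, by = "SpeciesName")

# Counting by the cleaned name now groups the spelling variants together
birds_toy_clean %>%
  count(SpeciesNameCleaned)
```

Because each original spelling appears exactly once in `lookup_toy`, the joined table has the same number of rows as `birds_toy`.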
Practice Problems
Now let’s practice joining tables and describing the result of the
specific join_function().
Perform a left_join() between the OrdwayBirds
dataset and the OrdwaySpeciesNames dataset, saving it to a new
object (so all the rows aren’t knitted to your HTML), and examine the
dimensions of the new merged dataset.
OrdwayBirds_merged <- OrdwayBirds %>%
  left_join(OrdwaySpeciesNames, by = "SpeciesName")
## Warning in left_join(., OrdwaySpeciesNames, by = "SpeciesName"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 4 of `x` matches multiple rows in `y`.
## ℹ Row 211 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
dim(OrdwayBirds)
## [1] 15829 26
dim(OrdwaySpeciesNames)
## [1] 265 2
dim(OrdwayBirds_merged)
## [1] 17120 27
Did the left_join() do what you
thought it would do? Explain. [Hint: Is the number of rows in the OrdwayBirds dataset the same as in your new merged dataset?]
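Remove the duplicate rows from the OrdwaySpeciesNames dataset and save the result to a new object called sn.
sn <- OrdwaySpeciesNames %>%
  distinct()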
Perform a left_join() again, but this time between
the OrdwayBirds dataset and the sn dataset, saving
it to a new object, and examine the dimensions of the new merged
dataset. Is this new merged dataset what you would expect from doing a
left_join()? Explain.
sn_OrdwayBirds <- OrdwayBirds %>%
  left_join(sn, by = "SpeciesName")
dim(sn)
## [1] 251 2
view(sn)
dim(OrdwayBirds)
## [1] 15829 26
dim(sn_OrdwayBirds)
## [1] 15829 27
Perform an inner_join() between the
OrdwayBirds dataset and the sn dataset, saving it
to a new object. Examine the dimensions of the new merged dataset.
Describe what modifications were made when the two datasets were joined
together into one (e.g., row changes? column changes?).
OrdwayBirds_inner <- OrdwayBirds %>%
  inner_join(sn, by = "SpeciesName")
dim(OrdwayBirds)
## [1] 15829 26
dim(OrdwayBirds_inner)
## [1] 15724 27
Perform a full_join() between the OrdwayBirds
dataset and the sn dataset, saving it to a new object.
Examine the dimensions of the new merged dataset. Describe what
modifications were made when the two datasets were joined together into
one (e.g., row changes? column changes?).
OrdwayBirds_full <- OrdwayBirds %>%
  full_join(sn, by = "SpeciesName")
dim(OrdwayBirds)
## [1] 15829 26
dim(OrdwayBirds_full)
## [1] 15830 27
Perform a semi_join() between the OrdwayBirds
dataset and the sn dataset, saving it to a new object.
Examine the dimensions of the new merged dataset. Describe what
modifications were made when the two datasets were joined together into
one (e.g., row changes? column changes?).
OrdwayBirds_semi <- OrdwayBirds %>%
  semi_join(sn, by = "SpeciesName")
dim(OrdwayBirds)
## [1] 15829 26
dim(OrdwayBirds_semi)
## [1] 15724 26
Perform an anti_join() between the OrdwayBirds
dataset and the sn dataset, saving it to a new object.
Examine the dimensions of the new merged dataset. Describe what
modifications were made when the two datasets were joined together into
one (e.g., row changes? column changes?).
OrdwayBirds_anti <- OrdwayBirds %>%
  anti_join(sn, by = "SpeciesName")
dim(OrdwayBirds)
## [1] 15829 26
dim(OrdwayBirds_anti)
## [1] 105 26
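To compare how each join changes rows and columns, here is a minimal sketch on two made-up tables (the objects `x`, `y`, and `y_distinct` and their contents are illustrative assumptions, not the Ordway data). Table `x` has one key with no match in `y`; table `y` has one duplicated row and one key with no match in `x`.

```r
library(dplyr)

# Hypothetical toy tables
x <- tibble(SpeciesName = c("Robin", "Jay", "Wren"), Band = 1:3)
y <- tibble(
  SpeciesName = c("Robin", "Robin", "Jay", "Owl"),
  SpeciesNameCleaned = c("American Robin", "American Robin", "Blue Jay", "Barred Owl")
)

# A duplicated lookup row repeats the matching row of x: 4 rows instead of 3
dup_left <- left_join(x, y, by = "SpeciesName")

# Dropping exact duplicate rows makes each key in the lookup unique
y_distinct <- distinct(y)

# Mutating joins add y's column; they differ in which rows are kept
left  <- left_join(x, y_distinct, by = "SpeciesName")   # all rows of x (3 rows, 3 columns)
inner <- inner_join(x, y_distinct, by = "SpeciesName")  # matched rows only (2 rows, 3 columns)
full  <- full_join(x, y_distinct, by = "SpeciesName")   # all rows of x and y (4 rows, 3 columns)

# Filtering joins keep only x's columns
semi <- semi_join(x, y_distinct, by = "SpeciesName")    # rows of x with a match (2 rows, 2 columns)
anti <- anti_join(x, y_distinct, by = "SpeciesName")    # rows of x without a match (1 row, 2 columns)
```

In this sketch the rows kept by `semi_join()` and the rows kept by `anti_join()` together account for every row of `x`, which is a handy sanity check on real data as well.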
Putting It All Together
Now let’s put the joins to use (in addition to other data verbs and data visualization) to answer a question.
What is the month-to-month presence of the most common bird species in the Ordway nature preserve area?
Think of this assignment as creating a resource for birders on the ideal time of year to visit Ordway to see a particular species.
In addition to the errors in SpeciesName, there are also problems with the Month and Day variables. They are supposed to be numerical, but mistakes prevent them from being correctly identified as such. The following code will take care of this issue with Month and Day: [Note: Take out eval=FALSE in the options of the code chunk so that the code executes in your assignment.]
birds <- OrdwayBirds %>%
  mutate(Month = as.numeric(as.character(Month)),
         Day = as.numeric(as.character(Day)))
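One common reason for this two-step conversion is that calling as.numeric() directly on a factor returns the factor's internal level codes rather than the values you see printed. Whether Month and Day are actually stored as factors in your session is an assumption here; if they are character vectors, the as.character() step is simply harmless. A small sketch with made-up values (`month_raw` is hypothetical):

```r
# Hypothetical Month values stored as a factor, one of them mistyped ("1o")
month_raw <- factor(c("10", "4", "4", "1o"))

# Direct conversion returns the internal level codes, not the months
codes <- as.numeric(month_raw)

# Converting to character first recovers the printed values;
# entries that are not valid numbers become NA (R also issues a warning,
# silenced here only to keep the sketch quiet)
months <- suppressWarnings(as.numeric(as.character(month_raw)))

codes
months
```

The mistyped entry ends up as NA in `months`, which is why missing values need to be handled before counting captures by month.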
The next set of questions will walk you through how to explore the data to answer this question.
How many distinct SpeciesName values are there in the birds dataset?
birds %>%
  distinct(SpeciesName) %>%
  count()
How many distinct SpeciesNameCleaned values are there in the sn dataset?
sn %>%
  distinct(SpeciesNameCleaned) %>%
  count()
Join the birds dataset with the sn dataset, such that
only the matching rows from sn and birds are
included and all columns from birds and sn are
included. Also use the na.omit() function to remove missing
data from your dataset (Note: This will remove all rows with ANY NAs).
Save this merged dataset to a new object called
birds_sn.
birds_sn <- birds %>%
  inner_join(sn, by = "SpeciesName") %>%
  na.omit()
birds_sn %>%
  count(SpeciesNameCleaned) %>%
  arrange(desc(n))
Create a new dataset called top_species that contains only the
6 most common bird species.
top_species <- birds_sn %>%
  count(SpeciesNameCleaned, sort = TRUE) %>%
  head(6)
top_species
Create a new dataset called top_species_month
that contains the top species and a month-by-month count of each of the
most common species. (Hint: use a specific type of join to limit the
birds_sn entries to only the birds included in the top_species dataset,
then count the number of sightings by species and month.)
top_species_month <- birds_sn %>%
  semi_join(top_species, by = "SpeciesNameCleaned") %>%
  group_by(SpeciesNameCleaned, Month) %>%
  count() %>%
  arrange(SpeciesNameCleaned, Month)
top_species_month %>%
  ggplot(aes(x = factor(Month), y = n, group = SpeciesNameCleaned, color = SpeciesNameCleaned)) +
  geom_line() +
  facet_wrap(~SpeciesNameCleaned) +
  theme_bw() +
  theme(legend.position = "none",
        strip.text.x = element_text(size = 6),
        axis.text = element_text(size = 6),
        axis.title = element_text(size = 12),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Month", y = "# of Captures", title = "Most Common Bird Species: Month by Month Count")
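The filter-then-count pattern behind the monthly counts can be tried on a small scale first. The sketch below uses a made-up table (`captures`, `top2`, and `monthly` are hypothetical names, and the values are not from the Ordway data) to show that semi_join() restricts the records to the most common species without adding any columns from the table it filters against.

```r
library(dplyr)

# Hypothetical capture records
captures <- tibble(
  SpeciesNameCleaned = c("Robin", "Robin", "Robin", "Jay", "Jay", "Wren"),
  Month = c(4, 4, 5, 4, 6, 5)
)

# The 2 most frequently captured species in the toy data
top2 <- captures %>%
  count(SpeciesNameCleaned, sort = TRUE) %>%
  head(2)

# Keep only captures of those species, then count by species and month
monthly <- captures %>%
  semi_join(top2, by = "SpeciesNameCleaned") %>%
  count(SpeciesNameCleaned, Month)

monthly
```

Using inner_join() here instead would also keep the right rows, but it would carry along the `n` column from `top2`, which then gets in the way of the second count.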
Use the data visualization to answer these questions for the birders: