The objective of this first homework assignment is to (re)familiarize you with some common R functions that we will be using throughout the course.
For homework assignments, I’ll enter instructions and hints in bold using double asterisks. When you’re all done and ready to knit your final output, you should delete these sections from the text blocks.
I encourage you to begin by having a consistent file directory structure for this course. Path names will be easier to deal with if you keep everything short and sweet. Avoid spaces and capital letters whenever possible. I recommend using “fw5051” as a main directory with a subdirectory called “homework”. Within the homework subdirectory, create a 2nd-level subdirectory called “hw01”. Notice that I “padded” the 1 with an additional 0, e.g., hw01 instead of hw1. If I expected to have 100 or more subfolders (probably not a good idea!), I would pad with 2 zeroes. This padding matters because, sorted alphabetically, hw2 comes after hw12, whereas hw02 sorts before hw12 as intended.
If you copy the initial R script(s) and data file(s) into the subdirectory you have created, you will have a consistent structure for path names and you’ll only need to worry about it once. And if you open the R script from inside this subdirectory, you’ll be all set.
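If it helps, you can even create this structure from within R itself; the sketch below simply uses the example folder names from above (adjust the base path to wherever you keep your coursework).
# create the nested course folders in one call (a sketch using the example names above)
dir.create("fw5051/homework/hw01", recursive = TRUE, showWarnings = FALSE)
# then point R at the new subdirectory, e.g.:
# setwd("fw5051/homework/hw01")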
R Projects are also a great way to set things up, but require that I can share entire folders with you, and alas, Canvas doesn’t allow this.
[Use text to describe the directory structure you have created for this course. I will use square brackets within text block sections where I want you to provide your own descriptions.]
# use getwd() to identify the current working directory for R
getwd()
## [1] "/Users/BradAngstadt/Downloads/FW 5051"
^^ Notice that the slashes (/) face the opposite direction of what you would see if you were copying a path name from Windows Explorer (\). In R we need to use / or a doubled backslash (\\) to denote subdirectories.
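For example, either of the forms below works in R (the Windows path shown is hypothetical), whereas a single backslash triggers an error about an unrecognized escape.
# setwd("C:/Users/me/fw5051/homework/hw01")      # forward slashes
# setwd("C:\\Users\\me\\fw5051\\homework\\hw01") # doubled backslashes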
There are many ways to load data into R. Here I showcase methods that we might use in class, with read.csv() being the most common way we will load data.
The code below demonstrates loading a canned R data set (Fisher’s Iris data), a csv file (walleye data from “Mystery Lake” in Minnesota), and a text file (a small amount of American Robin data from VertNet, an online repository of collection data from natural history museums around the world).
# load the "canned" iris data
data(iris) # data() only works for example data sets included in base R
# load a csv file from your computer using read.csv(). Remember to use quotation marks
# and include .csv at the end of your file name. A full path name also works, e.g.
# read.csv("/Users/BradAngstadt/Downloads/FW 5051/hw01_WAE.csv")
walleye <- read.csv("hw01_WAE.csv")
# we can omit the path name if the file is located in our current working directory;
# header = TRUE tells read.csv() that the first row contains column names (the default)
walleye <- read.csv("hw01_WAE.csv",
                    header = TRUE)
# load a text file (.txt) using fread() from the data.table package
library(data.table) # fread() requires the data.table package to be loaded
AMRO <- fread("hw01_AMRO.txt",
              quote = "",
              header = TRUE)
In the code chunk below, we will use a variety of R functions to visualize and explore the walleye data.
Try using some basic exploratory functions such as dim(), head(), tail(), and str(). Once you feel familiar with the dataset, create a new variable called “n_obs” for how many observations there are in the dataset. Remember that when specifying rows and columns, R uses the format dataframe[rows, columns]; a short sketch of this indexing follows the code chunk below.
#dim() can be used to get the number of rows and columns for a data frame
#head() can be used to display the first six rows in a dataset
#tail() can be used to display the last six rows in a dataset
#str() can be used to get the internal structure of a dataframe including a list of all variables and their object class
dim(walleye)
## [1] 8282 10
head(walleye)
## lake_n month day year station spp length_mm weight_g age sex
## 1 1 9 7 1983 1 WAE 154 45 0 <NA>
## 2 1 9 7 1983 1 WAE 180 45 1 <NA>
## 3 1 9 7 1983 3 WAE 198 45 1 <NA>
## 4 1 9 7 1983 2 WAE 208 45 1 <NA>
## 5 1 9 7 1983 3 WAE 218 45 1 <NA>
## 6 1 9 7 1983 4 WAE 223 45 1 <NA>
tail(walleye)
## lake_n month day year station spp length_mm weight_g age sex
## 8277 1 9 29 2016 14 WAE 382 482 3 M
## 8278 1 9 29 2016 14 WAE 456 864 6 M
## 8279 1 9 29 2016 13 WAE 515 1252 6 F
## 8280 1 9 29 2016 14 WAE 546 1286 7 F
## 8281 1 9 29 2016 14 WAE 557 1378 8 F
## 8282 1 9 29 2016 14 WAE 630 2360 12 F
str(walleye)
## 'data.frame': 8282 obs. of 10 variables:
## $ lake_n : int 1 1 1 1 1 1 1 1 1 1 ...
## $ month : int 9 9 9 9 9 9 9 9 9 9 ...
## $ day : int 7 7 7 7 7 7 7 7 7 7 ...
## $ year : int 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
## $ station : int 1 1 3 2 3 4 4 3 3 4 ...
## $ spp : chr "WAE" "WAE" "WAE" "WAE" ...
## $ length_mm: int 154 180 198 208 218 223 223 228 243 269 ...
## $ weight_g : int 45 45 45 45 45 45 45 91 91 136 ...
## $ age : int 0 1 1 1 1 1 1 1 1 2 ...
## $ sex : chr NA NA NA NA ...
# create a variable called n_obs for how many observations are in the dataset
n_obs <- nrow(walleye)
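Here is a quick sketch of the dataframe[rows, columns] indexing format described above, using column names taken from the walleye data.
# a few examples of dataframe[rows, columns] indexing
walleye[1, ]                         # first row, all columns
walleye[, "length_mm"]               # all rows, a single column
walleye[1:5, c("length_mm", "age")]  # first five rows of two columns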
R is a vector-based language that will perform the same operation on all values stored within a vector (e.g., a single series of values, or one column within a dataframe).
Use this property to create new variables that are functions of existing variables in the walleye data. Show log transformations and recombinations of 2 or more variables, using some common mathematical operators, below. There are some examples below to get you started.
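Before the worked examples, here is a minimal illustration of that vectorized behavior; the mm-to-inch conversion is only for demonstration.
# dividing a whole column by a constant operates on every value at once
length_in <- walleye$length_mm / 25.4   # each length converted from mm to inches
# the result is a vector the same length as the original column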
# Growth per year: if walleyes grew at a constant rate, we could estimate growth per year of age by dividing the length of a walleye by their age
str(walleye)
## 'data.frame': 8282 obs. of 10 variables:
## $ lake_n : int 1 1 1 1 1 1 1 1 1 1 ...
## $ month : int 9 9 9 9 9 9 9 9 9 9 ...
## $ day : int 7 7 7 7 7 7 7 7 7 7 ...
## $ year : int 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
## $ station : int 1 1 3 2 3 4 4 3 3 4 ...
## $ spp : chr "WAE" "WAE" "WAE" "WAE" ...
## $ length_mm: int 154 180 198 208 218 223 223 228 243 269 ...
## $ weight_g : int 45 45 45 45 45 45 45 91 91 136 ...
## $ age : int 0 1 1 1 1 1 1 1 1 2 ...
## $ sex : chr NA NA NA NA ...
# length_mm and age are stored as integers; convert them to numeric (length_mm is copied
# into a new column named length so it can be reused in the calculations below)
walleye$length <- as.numeric(walleye$length_mm)
walleye$age <- as.numeric(walleye$age)
# transform() adds a calculated column; note that age-0 fish produce Inf (division by zero)
growth_per_year <- transform(walleye, growth_per_year = length / age)
head(growth_per_year)
## lake_n month day year station spp length_mm weight_g age sex length
## 1 1 9 7 1983 1 WAE 154 45 0 <NA> 154
## 2 1 9 7 1983 1 WAE 180 45 1 <NA> 180
## 3 1 9 7 1983 3 WAE 198 45 1 <NA> 198
## 4 1 9 7 1983 2 WAE 208 45 1 <NA> 208
## 5 1 9 7 1983 3 WAE 218 45 1 <NA> 218
## 6 1 9 7 1983 4 WAE 223 45 1 <NA> 223
## growth_per_year
## 1 Inf
## 2 180
## 3 198
## 4 208
## 5 218
## 6 223
# Condition: to better understand the "condition" of walleye in our dataset, we can compare body mass (weight) to body size (length); the ratio of weight (g) to length (mm) cubed, scaled by 100,000, is Fulton's condition factor
walleye_condition <- transform(walleye, condition_factor = (weight_g / (length^3)) * 100000)
head(walleye_condition)
## lake_n month day year station spp length_mm weight_g age sex length
## 1 1 9 7 1983 1 WAE 154 45 0 <NA> 154
## 2 1 9 7 1983 1 WAE 180 45 1 <NA> 180
## 3 1 9 7 1983 3 WAE 198 45 1 <NA> 198
## 4 1 9 7 1983 2 WAE 208 45 1 <NA> 208
## 5 1 9 7 1983 3 WAE 218 45 1 <NA> 218
## 6 1 9 7 1983 4 WAE 223 45 1 <NA> 223
## condition_factor
## 1 1.2321125
## 2 0.7716049
## 3 0.5797182
## 4 0.5000605
## 5 0.4343532
## 6 0.4057868
# Log length: if we wanted to transform our data, we could try taking the natural log of one of our variables such as length, weight, etc
walleye_log_length <- transform(walleye, log_length = log(length))
head(walleye_log_length)
## lake_n month day year station spp length_mm weight_g age sex length
## 1 1 9 7 1983 1 WAE 154 45 0 <NA> 154
## 2 1 9 7 1983 1 WAE 180 45 1 <NA> 180
## 3 1 9 7 1983 3 WAE 198 45 1 <NA> 198
## 4 1 9 7 1983 2 WAE 208 45 1 <NA> 208
## 5 1 9 7 1983 3 WAE 218 45 1 <NA> 218
## 6 1 9 7 1983 4 WAE 223 45 1 <NA> 223
## log_length
## 1 5.036953
## 2 5.192957
## 3 5.288267
## 4 5.337538
## 5 5.384495
## 6 5.407172
# Feel free to try out any additional functions!
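As one more optional idea, the sketch below uses tapply() to summarize one variable by the values of another (here, mean length at each observed age).
# tapply() applies a function (mean) to length_mm within each age group
mean_length_by_age <- tapply(walleye$length_mm, walleye$age, mean, na.rm = TRUE)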
Use plotting functions in base R to visually explore patterns in the data. For “quick and dirty” plots that are for your eyes only, base R plotting functions are fully adequate and much simpler than ggplot. But for complex graphing tasks, or for high quality products that will be shared with others, try to use ggplot.
Make a plot using either base R plots or ggplot to evaluate the relationships between any of the newly created variables above.
# install.packages("ggplot2")
# library(ggplot2)
# ggplot(walleye, aes(x = log(length_mm), y = log(weight_g))) +
#   geom_point()
# commented out: having trouble updating rlang, most likely because of an older version of RStudio
# plot() OR ggplot()
plot(growth_per_year$age, growth_per_year$growth_per_year,
     main = "Walleye Growth Rate vs. Age",
     xlab = "Age (Years)",
     ylab = "Growth per Year (mm/Year)",
     pch = 16,
     col = "blue")
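For reference, a commented-out ggplot2 sketch of the same plot is below; it assumes ggplot2 loads cleanly, which it did not here because of the rlang issue noted above.
# library(ggplot2)
# ggplot(growth_per_year, aes(x = age, y = growth_per_year)) +
#   geom_point(colour = "blue") +
#   labs(title = "Walleye Growth Rate vs. Age",
#        x = "Age (Years)", y = "Growth per Year (mm/Year)")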
If not pursuing B-level homework, you can delete everything from here on down.
Duplicate the C-level homework sections (exploring, manipulating, and plotting data) using the AMRO dataset. This dataset is a collection of observations of the American Robin (Turdus migratorius) from various natural history collections around the world.
Select 3 numeric variables and 3 character variables to explore using summary and table functions. Provide your best interpretation of what these variables represent based on the data you can observe (note that any good dataset should have a help file that clearly explains what each variable represents, but that isn’t always the case). Create a new variable from an existing variable (try to create something sensible). Create a plot of two or more variables in the data using either ggplot or base R plotting functions. Be sure to label axes and include any necessary legend descriptions.
Use text blocks to describe what you are doing (using complete sentences), and use short comments in the code (# like this) to describe what is happening line by line.
#names() lists all of the variable (column) names in the dataset
names(AMRO)
## [1] "type"
## [2] "modified"
## [3] "license"
## [4] "rightsholder"
## [5] "accessrights"
## [6] "bibliographiccitation"
## [7] "references"
## [8] "institutionid"
## [9] "collectionid"
## [10] "datasetid"
## [11] "institutioncode"
## [12] "collectioncode"
## [13] "datasetname"
## [14] "basisofrecord"
## [15] "informationwithheld"
## [16] "datageneralizations"
## [17] "dynamicproperties"
## [18] "occurrenceid"
## [19] "catalognumber"
## [20] "recordnumber"
## [21] "recordedby"
## [22] "individualcount"
## [23] "sex"
## [24] "lifestage"
## [25] "reproductivecondition"
## [26] "behavior"
## [27] "establishmentmeans"
## [28] "occurrencestatus"
## [29] "preparations"
## [30] "disposition"
## [31] "associatedmedia"
## [32] "associatedreferences"
## [33] "associatedsequences"
## [34] "associatedtaxa"
## [35] "othercatalognumbers"
## [36] "occurrenceremarks"
## [37] "organismid"
## [38] "organismname"
## [39] "organismscope"
## [40] "associatedoccurrences"
## [41] "associatedorganisms"
## [42] "previousidentifications"
## [43] "organismremarks"
## [44] "materialsampleid"
## [45] "eventid"
## [46] "fieldnumber"
## [47] "eventdate"
## [48] "eventtime"
## [49] "startdayofyear"
## [50] "enddayofyear"
## [51] "year"
## [52] "month"
## [53] "day"
## [54] "verbatimeventdate"
## [55] "habitat"
## [56] "samplingprotocol"
## [57] "samplingeffort"
## [58] "fieldnotes"
## [59] "eventremarks"
## [60] "locationid"
## [61] "highergeographyid"
## [62] "highergeography"
## [63] "continent"
## [64] "waterbody"
## [65] "islandgroup"
## [66] "island"
## [67] "country"
## [68] "countrycode"
## [69] "stateprovince"
## [70] "county"
## [71] "municipality"
## [72] "locality"
## [73] "verbatimlocality"
## [74] "minimumelevationinmeters"
## [75] "maximumelevationinmeters"
## [76] "verbatimelevation"
## [77] "minimumdepthinmeters"
## [78] "maximumdepthinmeters"
## [79] "verbatimdepth"
## [80] "minimumdistanceabovesurfaceinmeters"
## [81] "maximumdistanceabovesurfaceinmeters"
## [82] "locationaccordingto"
## [83] "locationremarks"
## [84] "decimallatitude"
## [85] "decimallongitude"
## [86] "geodeticdatum"
## [87] "coordinateuncertaintyinmeters"
## [88] "coordinateprecision"
## [89] "verbatimcoordinates"
## [90] "verbatimlatitude"
## [91] "verbatimlongitude"
## [92] "verbatimcoordinatesystem"
## [93] "verbatimsrs"
## [94] "footprintwkt"
## [95] "footprintsrs"
## [96] "georeferencedby"
## [97] "georeferenceddate"
## [98] "georeferenceprotocol"
## [99] "georeferencesources"
## [100] "georeferenceverificationstatus"
## [101] "georeferenceremarks"
## [102] "geologicalcontextid"
## [103] "earliesteonorlowesteonothem"
## [104] "latesteonorhighesteonothem"
## [105] "earliesteraorlowesterathem"
## [106] "latesteraorhighesterathem"
## [107] "earliestperiodorlowestsystem"
## [108] "latestperiodorhighestsystem"
## [109] "earliestepochorlowestseries"
## [110] "latestepochorhighestseries"
## [111] "earliestageorloweststage"
## [112] "latestageorhigheststage"
## [113] "lowestbiostratigraphiczone"
## [114] "highestbiostratigraphiczone"
## [115] "lithostratigraphicterms"
## [116] "group"
## [117] "formation"
## [118] "member"
## [119] "bed"
## [120] "identificationid"
## [121] "identificationqualifier"
## [122] "typestatus"
## [123] "identifiedby"
## [124] "dateidentified"
## [125] "identificationreferences"
## [126] "identificationverificationstatus"
## [127] "identificationremarks"
## [128] "scientificnameid"
## [129] "namepublishedinid"
## [130] "scientificname"
## [131] "acceptednameusage"
## [132] "originalnameusage"
## [133] "namepublishedin"
## [134] "namepublishedinyear"
## [135] "higherclassification"
## [136] "kingdom"
## [137] "phylum"
## [138] "class"
## [139] "order"
## [140] "family"
## [141] "genus"
## [142] "subgenus"
## [143] "specificepithet"
## [144] "infraspecificepithet"
## [145] "taxonrank"
## [146] "verbatimtaxonrank"
## [147] "scientificnameauthorship"
## [148] "vernacularname"
## [149] "nomenclaturalcode"
## [150] "taxonomicstatus"
## [151] "taxonremarks"
## [152] "lengthinmm"
## [153] "lengthtype"
## [154] "lengthunitsinferred"
## [155] "massing"
## [156] "massunitsinferred"
## [157] "underivedlifestage"
## [158] "underivedsex"
## [159] "dataset_url"
## [160] "dataset_citation"
## [161] "gbifdatasetid"
## [162] "gbifpublisherid"
## [163] "dataset_contact_email"
## [164] "dataset_contact"
## [165] "dataset_pubdate"
## [166] "lastindexed"
## [167] "migrator_version"
## [168] "hasmedia"
## [169] "hastissue"
## [170] "wascaptive"
## [171] "isfossil"
## [172] "isarch"
## [173] "vntype"
## [174] "haslength"
#checking the ranges of numeric dates
summary(AMRO[, c("year", "month", "day")])
## year month day
## Min. :1878 Min. : 1.000 Min. : 1.00
## 1st Qu.:1902 1st Qu.: 2.000 1st Qu.: 8.00
## Median :1940 Median : 4.000 Median :15.00
## Mean :1948 Mean : 4.903 Mean :15.33
## 3rd Qu.:2001 3rd Qu.: 7.000 3rd Qu.:23.00
## Max. :2011 Max. :12.000 Max. :31.00
#checking geographic areas of data
table(AMRO$country)
##
## United States USA
## 22 214
table(AMRO$institutioncode)
##
## AMNH CAS
## 22 214
table(AMRO$basisofrecord)
##
## PreservedSpecimen
## 236
table(AMRO$countrycode)
##
##        US 
## 214    22
table(AMRO$locationid) # locationid contains only missing values, so the table is empty
## < table of extent 0 >
#creating a new variable to represent seasons by month
AMRO$season <- ifelse(AMRO$month %in% c(3, 4, 5), "Spring",
                 ifelse(AMRO$month %in% c(6, 7, 8), "Summer",
                   ifelse(AMRO$month %in% c(9, 10, 11), "Fall", "Winter")))
#and make sure it worked
table(AMRO$season)
##
## Fall Spring Summer Winter
## 17 70 58 91
#main = title, xlab = x-axis label, ylab = y-axis label, col = bar color, border = border color, breaks = approximate number of bins
hist(AMRO$year,
     main = "Distribution of American Robin Records Over Time",
     xlab = "Year of Collection",
     ylab = "Number of Records",
     col = "blue",
     border = "white",
     breaks = 50)
#average year of collection line
abline(v = mean(AMRO$year, na.rm = TRUE), col = "red", lwd = 2, lty = 2)
#This shows the volume of American Robin data stored in natural history collections by year. A concentration of bars in more recent years would suggest a shift toward digital observation data, while the mostly 20th-century records here point to physical specimens (the basisofrecord table above shows every record is a PreservedSpecimen).
If not pursuing A-level homework, you can delete everything from here on down.
Create a “spiffy looking” graph of length (y axis) vs. age (x axis) for the walleye data with each sex graphed separately. Try to include both males and females in the same figure (rather than 2 separate figures) and try to use color to differentiate them. You might want to explore “jitter” to prevent all of the data points from piling on top of each other along the integer values of age. Try also to include trend lines for each sex (geom_smooth in ggplot2 would be great for this - look back on the HW00 assignment for clues).
#added color transparency so the overlapping points are visible
col_m <- adjustcolor("steelblue", alpha.f = 0.5)
col_f <- adjustcolor("firebrick", alpha.f = 0.5)
#jitter also helps with the overlapping and piling up along the integer ages
plot(jitter(walleye$age, factor = 0.8), walleye$length,
     col = ifelse(walleye$sex == "M", col_m, col_f),
     pch = 16,
     cex = 0.8,
     main = "Walleye Length vs. Age by Sex (1983-2016)",
     xlab = "Age (Years)",
     ylab = "Length (mm)")
#create a subset for the males so they get their own trend line
#(these are straight lm() fits; can't use geom_smooth and still working to remember/learn manual LOESS)
abline(lm(length ~ age, data = subset(walleye, sex == "M")),
       col = "steelblue", lwd = 3)
#and a subset for the female best-fit line
abline(lm(length ~ age, data = subset(walleye, sex == "F")),
       col = "firebrick", lwd = 3)
#added the legend and used bty = "n" to remove its border
legend("topleft",
       legend = c("Male", "Female"),
       col = c("steelblue", "firebrick"),
       pch = 16,
       lwd = 2,
       title = "Sex",
       bty = "n")
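For reference, here is a commented-out ggplot2 sketch of the same figure using geom_jitter() and geom_smooth() trend lines; it assumes the ggplot2/rlang installation issue noted earlier gets resolved.
# library(ggplot2)
# ggplot(subset(walleye, sex %in% c("M", "F")),
#        aes(x = age, y = length_mm, colour = sex)) +
#   geom_jitter(width = 0.2, alpha = 0.5) +
#   geom_smooth(method = "lm", se = FALSE) +
#   scale_colour_manual(values = c(F = "firebrick", M = "steelblue"),
#                       labels = c("Female", "Male"), name = "Sex") +
#   labs(title = "Walleye Length vs. Age by Sex (1983-2016)",
#        x = "Age (Years)", y = "Length (mm)")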