The objective of this first homework assignment is to (re)familiarize you with some common R functions that we will be using throughout the course.
For homework assignments, I’ll enter instructions and hints in bold using double asterisks. When you’re all done and ready to knit your final output, you should delete these sections from the text blocks.
I encourage you to begin by having a consistent file directory structure for this course. Path names will be easier to deal with if you keep everything short and sweet. Avoid spaces and capital letters whenever possible. I recommend using “fw5051” as a main directory with a subdirectory called “homework”. Within the homework subdirectory, create a 2nd-level subdirectory called “hw01”. Notice that I “padded” the 1 with an additional 0, e.g., hw01 instead of hw1. If I expected to have 100 or more subfolders (probably not a good idea!), I would pad with 2 zeroes. This padding matters because, sorted alphabetically, hw2 comes after hw12, whereas hw02 sorts before hw12 as intended.
If you copy the initial R script(s) and data file(s) into the subdirectory you have created, you will have a consistent structure for path names and you’ll only need to worry about it once. And if you open the R script from inside this subdirectory, you’ll be all set.
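If it helps, you can even create this structure from within R itself; the sketch below simply uses the example folder names from above (adjust the base path to wherever you keep your coursework).
# create the nested course folders in one call (a sketch using the example names above)
dir.create("fw5051/homework/hw01", recursive = TRUE, showWarnings = FALSE)
# then point R at the new subdirectory, e.g.:
# setwd("fw5051/homework/hw01")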
R Projects are also a great way to set things up, but require that I can share entire folders with you, and alas, Canvas doesn’t allow this.
[Use text to describe the directory structure you have created for this course. I will use square brackets within text block sections where I want you to provide your own descriptions.]
# use getwd() to identify the current working directory for R
getwd()
## [1] "/Users/BradAngstadt/Downloads/FW 5051"
^^ Notice that the slashes (/) face the opposite direction of what you would see if you were copying a path name from Windows Explorer (\). In R we need to use / or a doubled backslash (\\) to denote subdirectories.
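For example, either of the forms below works in R (the Windows path shown is hypothetical), whereas a single backslash triggers an error about an unrecognized escape.
# setwd("C:/Users/me/fw5051/homework/hw01")      # forward slashes
# setwd("C:\\Users\\me\\fw5051\\homework\\hw01") # doubled backslashes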
There are many ways to load data into R. Here I showcase methods that we might use in class, with read.csv() being the most common way we will load data.
The code below demonstrates loading a canned R data set (Fisher’s Iris data), a csv file (walleye data from “Mystery Lake” in Minnesota), and a text file (a small amount of American Robin data from VertNet, an online repository of collection data from natural history museums around the world).
# load the "canned" iris data
data(iris) # data() only works for example data sets included in base R
# load a csv file from your computer using read.csv(). Remember to use quotation marks
# and include .csv at the end of your file name. A full path name also works, e.g.
# read.csv("/Users/BradAngstadt/Downloads/FW 5051/hw01_WAE.csv")
walleye <- read.csv("hw01_WAE.csv")
# we can omit the path name if the file is located in our current working directory;
# header = TRUE tells read.csv() that the first row contains column names (the default)
walleye <- read.csv("hw01_WAE.csv",
                    header = TRUE)
# load a text file (.txt) using fread() from the data.table package
library(data.table) # fread() requires the data.table package to be loaded
AMRO <- fread("hw01_AMRO.txt",
              quote = "",
              header = TRUE)
In the code chunk below, we will use a variety of R functions to visualize and explore the walleye data.
Try using some basic exploratory functions such as dim(), head(), tail(), and str(). Once you feel familiar with the dataset, create a new variable called “n_obs” for how many observations there are in the dataset. Remember that when specifying rows and columns, R uses the format dataframe[rows, columns]; a short sketch of this indexing follows the code chunk below.
#dim() can be used to get the number of rows and columns for a data frame
#head() can be used to display the first six rows in a dataset
#tail() can be used to display the last six rows in a dataset
#str() can be used to get the internal structure of a dataframe including a list of all variables and their object class
dim(walleye)
## [1] 8282 10
head(walleye)
## lake_n month day year station spp length_mm weight_g age sex
## 1 1 9 7 1983 1 WAE 154 45 0 <NA>
## 2 1 9 7 1983 1 WAE 180 45 1 <NA>
## 3 1 9 7 1983 3 WAE 198 45 1 <NA>
## 4 1 9 7 1983 2 WAE 208 45 1 <NA>
## 5 1 9 7 1983 3 WAE 218 45 1 <NA>
## 6 1 9 7 1983 4 WAE 223 45 1 <NA>
tail(walleye)
## lake_n month day year station spp length_mm weight_g age sex
## 8277 1 9 29 2016 14 WAE 382 482 3 M
## 8278 1 9 29 2016 14 WAE 456 864 6 M
## 8279 1 9 29 2016 13 WAE 515 1252 6 F
## 8280 1 9 29 2016 14 WAE 546 1286 7 F
## 8281 1 9 29 2016 14 WAE 557 1378 8 F
## 8282 1 9 29 2016 14 WAE 630 2360 12 F
str(walleye)
## 'data.frame': 8282 obs. of 10 variables:
## $ lake_n : int 1 1 1 1 1 1 1 1 1 1 ...
## $ month : int 9 9 9 9 9 9 9 9 9 9 ...
## $ day : int 7 7 7 7 7 7 7 7 7 7 ...
## $ year : int 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
## $ station : int 1 1 3 2 3 4 4 3 3 4 ...
## $ spp : chr "WAE" "WAE" "WAE" "WAE" ...
## $ length_mm: int 154 180 198 208 218 223 223 228 243 269 ...
## $ weight_g : int 45 45 45 45 45 45 45 91 91 136 ...
## $ age : int 0 1 1 1 1 1 1 1 1 2 ...
## $ sex : chr NA NA NA NA ...
# create a variable called n_obs for how many observations are in the dataset
n_obs <- nrow(walleye)
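Here is a quick sketch of the dataframe[rows, columns] indexing format described above, using column names taken from the walleye data.
# a few examples of dataframe[rows, columns] indexing
walleye[1, ]                         # first row, all columns
walleye[, "length_mm"]               # all rows, a single column
walleye[1:5, c("length_mm", "age")]  # first five rows of two columns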
R is a vector-based language that will perform the same operation on all values stored within a vector (e.g., a single series of values, or one column within a dataframe).
Use this property to create new variables that are functions of existing variables in the walleye data. Show log transformations and recombinations of 2 or more variables, using some common mathematical operators, below. There are some examples below to get you started.
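Before the worked examples, here is a minimal illustration of that vectorized behavior; the mm-to-inch conversion is only for demonstration.
# dividing a whole column by a constant operates on every value at once
length_in <- walleye$length_mm / 25.4   # each length converted from mm to inches
# the result is a vector the same length as the original column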
# Growth per year: if walleyes grew at a constant rate, we could estimate growth per year of age by dividing the length of a walleye by their age
str(walleye)
## 'data.frame': 8282 obs. of 10 variables:
## $ lake_n : int 1 1 1 1 1 1 1 1 1 1 ...
## $ month : int 9 9 9 9 9 9 9 9 9 9 ...
## $ day : int 7 7 7 7 7 7 7 7 7 7 ...
## $ year : int 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
## $ station : int 1 1 3 2 3 4 4 3 3 4 ...
## $ spp : chr "WAE" "WAE" "WAE" "WAE" ...
## $ length_mm: int 154 180 198 208 218 223 223 228 243 269 ...
## $ weight_g : int 45 45 45 45 45 45 45 91 91 136 ...
## $ age : int 0 1 1 1 1 1 1 1 1 2 ...
## $ sex : chr NA NA NA NA ...
# length_mm and age are stored as integers; convert them to numeric (length_mm is copied
# into a new column named length so it can be reused in the calculations below)
walleye$length <- as.numeric(walleye$length_mm)
walleye$age <- as.numeric(walleye$age)
# transform() adds a calculated column; note that age-0 fish produce Inf (division by zero)
growth_per_year <- transform(walleye, growth_per_year = length / age)
head(growth_per_year)
## lake_n month day year station spp length_mm weight_g age sex length
## 1 1 9 7 1983 1 WAE 154 45 0 <NA> 154
## 2 1 9 7 1983 1 WAE 180 45 1 <NA> 180
## 3 1 9 7 1983 3 WAE 198 45 1 <NA> 198
## 4 1 9 7 1983 2 WAE 208 45 1 <NA> 208
## 5 1 9 7 1983 3 WAE 218 45 1 <NA> 218
## 6 1 9 7 1983 4 WAE 223 45 1 <NA> 223
## growth_per_year
## 1 Inf
## 2 180
## 3 198
## 4 208
## 5 218
## 6 223
# Condition: to better understand the "condition" of walleye in our dataset, we can compare body mass (weight) to body size (length); the ratio of weight (g) to length (mm) cubed, scaled by 100,000, is Fulton's condition factor
walleye_condition <- transform(walleye, condition_factor = (weight_g / (length^3)) * 100000)
head(walleye_condition)
## lake_n month day year station spp length_mm weight_g age sex length
## 1 1 9 7 1983 1 WAE 154 45 0 <NA> 154
## 2 1 9 7 1983 1 WAE 180 45 1 <NA> 180
## 3 1 9 7 1983 3 WAE 198 45 1 <NA> 198
## 4 1 9 7 1983 2 WAE 208 45 1 <NA> 208
## 5 1 9 7 1983 3 WAE 218 45 1 <NA> 218
## 6 1 9 7 1983 4 WAE 223 45 1 <NA> 223
## condition_factor
## 1 1.2321125
## 2 0.7716049
## 3 0.5797182
## 4 0.5000605
## 5 0.4343532
## 6 0.4057868
# Log length: if we wanted to transform our data, we could try taking the natural log of one of our variables such as length, weight, etc
walleye_log_length <- transform(walleye, log_length = log(length))
head(walleye_log_length)
## lake_n month day year station spp length_mm weight_g age sex length
## 1 1 9 7 1983 1 WAE 154 45 0 <NA> 154
## 2 1 9 7 1983 1 WAE 180 45 1 <NA> 180
## 3 1 9 7 1983 3 WAE 198 45 1 <NA> 198
## 4 1 9 7 1983 2 WAE 208 45 1 <NA> 208
## 5 1 9 7 1983 3 WAE 218 45 1 <NA> 218
## 6 1 9 7 1983 4 WAE 223 45 1 <NA> 223
## log_length
## 1 5.036953
## 2 5.192957
## 3 5.288267
## 4 5.337538
## 5 5.384495
## 6 5.407172
# Feel free to try out any additional functions!
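As one more optional idea, the sketch below uses tapply() to summarize one variable by the values of another (here, mean length at each observed age).
# tapply() applies a function (mean) to length_mm within each age group
mean_length_by_age <- tapply(walleye$length_mm, walleye$age, mean, na.rm = TRUE)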
Use plotting functions in base R to visually explore patterns in the data. For “quick and dirty” plots that are for your eyes only, base R plotting functions are fully adequate and much simpler than ggplot. But for complex graphing tasks, or for high quality products that will be shared with others, try to use ggplot.
Make a plot using either base R plots or ggplot to evaluate the relationships between any of the newly created variables above.
# install.packages("ggplot2")
# library(ggplot2)
# ggplot(walleye, aes(x = log(length_mm), y = log(weight_g))) +
#   geom_point()
# commented out: having trouble updating rlang, most likely because of an older version of RStudio
# plot() OR ggplot()
plot(growth_per_year$age, growth_per_year$growth_per_year,
     main = "Walleye Growth Rate vs. Age",
     xlab = "Age (Years)",
     ylab = "Growth per Year (mm/Year)",
     pch = 16,
     col = "blue")
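For reference, a commented-out ggplot2 sketch of the same plot is below; it assumes ggplot2 loads cleanly, which it did not here because of the rlang issue noted above.
# library(ggplot2)
# ggplot(growth_per_year, aes(x = age, y = growth_per_year)) +
#   geom_point(colour = "blue") +
#   labs(title = "Walleye Growth Rate vs. Age",
#        x = "Age (Years)", y = "Growth per Year (mm/Year)")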
If not pursuing B-level homework, you can delete everything from here on down.
Duplicate the C-level homework sections (exploring, manipulating, and plotting data) using the AMRO dataset. This dataset is a collection of observations of the American Robin (Turdus migratorius) from various natural history collections around the world.
Select 3 numeric variables and 3 character variables to explore using summary and table functions. Provide your best interpretation of what these variables represent based on the data you can observe (note that any good dataset should have a help file that clearly explains what each variable represents, but that isn’t always the case). Create a new variable from an existing variable (try to create something sensible). Create a plot of two or more variables in the data using either ggplot or base R plotting functions. Be sure to label axes and include any necessary legend descriptions.
Use text blocks to describe what you are doing (using complete sentences), and use short comments in the code (# like this) to describe what is happening line by line.
#names() lists all of the variable (column) names in the dataset
names(AMRO)
## [1] "type"
## [2] "modified"
## [3] "license"
## [4] "rightsholder"
## [5] "accessrights"
## [6] "bibliographiccitation"
## [7] "references"
## [8] "institutionid"
## [9] "collectionid"
## [10] "datasetid"
## [11] "institutioncode"
## [12] "collectioncode"
## [13] "datasetname"
## [14] "basisofrecord"
## [15] "informationwithheld"
## [16] "datageneralizations"
## [17] "dynamicproperties"
## [18] "occurrenceid"
## [19] "catalognumber"
## [20] "recordnumber"
## [21] "recordedby"
## [22] "individualcount"
## [23] "sex"
## [24] "lifestage"
## [25] "reproductivecondition"
## [26] "behavior"
## [27] "establishmentmeans"
## [28] "occurrencestatus"
## [29] "preparations"
## [30] "disposition"
## [31] "associatedmedia"
## [32] "associatedreferences"
## [33] "associatedsequences"
## [34] "associatedtaxa"
## [35] "othercatalognumbers"
## [36] "occurrenceremarks"
## [37] "organismid"
## [38] "organismname"
## [39] "organismscope"
## [40] "associatedoccurrences"
## [41] "associatedorganisms"
## [42] "previousidentifications"
## [43] "organismremarks"
## [44] "materialsampleid"
## [45] "eventid"
## [46] "fieldnumber"
## [47] "eventdate"
## [48] "eventtime"
## [49] "startdayofyear"
## [50] "enddayofyear"
## [51] "year"
## [52] "month"
## [53] "day"
## [54] "verbatimeventdate"
## [55] "habitat"
## [56] "samplingprotocol"
## [57] "samplingeffort"
## [58] "fieldnotes"
## [59] "eventremarks"
## [60] "locationid"
## [61] "highergeographyid"
## [62] "highergeography"
## [63] "continent"
## [64] "waterbody"
## [65] "islandgroup"
## [66] "island"
## [67] "country"
## [68] "countrycode"
## [69] "stateprovince"
## [70] "county"
## [71] "municipality"
## [72] "locality"
## [73] "verbatimlocality"
## [74] "minimumelevationinmeters"
## [75] "maximumelevationinmeters"
## [76] "verbatimelevation"
## [77] "minimumdepthinmeters"
## [78] "maximumdepthinmeters"
## [79] "verbatimdepth"
## [80] "minimumdistanceabovesurfaceinmeters"
## [81] "maximumdistanceabovesurfaceinmeters"
## [82] "locationaccordingto"
## [83] "locationremarks"
## [84] "decimallatitude"
## [85] "decimallongitude"
## [86] "geodeticdatum"
## [87] "coordinateuncertaintyinmeters"
## [88] "coordinateprecision"
## [89] "verbatimcoordinates"
## [90] "verbatimlatitude"
## [91] "verbatimlongitude"
## [92] "verbatimcoordinatesystem"
## [93] "verbatimsrs"
## [94] "footprintwkt"
## [95] "footprintsrs"
## [96] "georeferencedby"
## [97] "georeferenceddate"
## [98] "georeferenceprotocol"
## [99] "georeferencesources"
## [100] "georeferenceverificationstatus"
## [101] "georeferenceremarks"
## [102] "geologicalcontextid"
## [103] "earliesteonorlowesteonothem"
## [104] "latesteonorhighesteonothem"
## [105] "earliesteraorlowesterathem"
## [106] "latesteraorhighesterathem"
## [107] "earliestperiodorlowestsystem"
## [108] "latestperiodorhighestsystem"
## [109] "earliestepochorlowestseries"
## [110] "latestepochorhighestseries"
## [111] "earliestageorloweststage"
## [112] "latestageorhigheststage"
## [113] "lowestbiostratigraphiczone"
## [114] "highestbiostratigraphiczone"
## [115] "lithostratigraphicterms"
## [116] "group"
## [117] "formation"
## [118] "member"
## [119] "bed"
## [120] "identificationid"
## [121] "identificationqualifier"
## [122] "typestatus"
## [123] "identifiedby"
## [124] "dateidentified"
## [125] "identificationreferences"
## [126] "identificationverificationstatus"
## [127] "identificationremarks"
## [128] "scientificnameid"
## [129] "namepublishedinid"
## [130] "scientificname"
## [131] "acceptednameusage"
## [132] "originalnameusage"
## [133] "namepublishedin"
## [134] "namepublishedinyear"
## [135] "higherclassification"
## [136] "kingdom"
## [137] "phylum"
## [138] "class"
## [139] "order"
## [140] "family"
## [141] "genus"
## [142] "subgenus"
## [143] "specificepithet"
## [144] "infraspecificepithet"
## [145] "taxonrank"
## [146] "verbatimtaxonrank"
## [147] "scientificnameauthorship"
## [148] "vernacularname"
## [149] "nomenclaturalcode"
## [150] "taxonomicstatus"
## [151] "taxonremarks"
## [152] "lengthinmm"
## [153] "lengthtype"
## [154] "lengthunitsinferred"
## [155] "massing"
## [156] "massunitsinferred"
## [157] "underivedlifestage"
## [158] "underivedsex"
## [159] "dataset_url"
## [160] "dataset_citation"
## [161] "gbifdatasetid"
## [162] "gbifpublisherid"
## [163] "dataset_contact_email"
## [164] "dataset_contact"
## [165] "dataset_pubdate"
## [166] "lastindexed"
## [167] "migrator_version"
## [168] "hasmedia"
## [169] "hastissue"
## [170] "wascaptive"
## [171] "isfossil"
## [172] "isarch"
## [173] "vntype"
## [174] "haslength"
#checking the ranges of numeric dates
summary(AMRO[, c("year", "month", "day")])
## year month day
## Min. :1878 Min. : 1.000 Min. : 1.00
## 1st Qu.:1902 1st Qu.: 2.000 1st Qu.: 8.00
## Median :1940 Median : 4.000 Median :15.00
## Mean :1948 Mean : 4.903 Mean :15.33
## 3rd Qu.:2001 3rd Qu.: 7.000 3rd Qu.:23.00
## Max. :2011 Max. :12.000 Max. :31.00
#checking geographic areas of data
table(AMRO$country)
##
## United States USA
## 22 214
table(AMRO$institutioncode)
##
## AMNH CAS
## 22 214
table(AMRO$basisofrecord)
##
## PreservedSpecimen
## 236
table(AMRO$countrycode)
##
##        US 
## 214    22
table(AMRO$locationid) # locationid contains only missing values, so the table is empty
## < table of extent 0 >
#creating a new variable to represent seasons by month
AMRO$season <- ifelse(AMRO$month %in% c(3, 4, 5), "Spring",
                 ifelse(AMRO$month %in% c(6, 7, 8), "Summer",
                   ifelse(AMRO$month %in% c(9, 10, 11), "Fall", "Winter")))
#and make sure it worked
table(AMRO$season)
##
## Fall Spring Summer Winter
## 17 70 58 91
#main = title, xlab = x-axis label, ylab = y-axis label, col = bar color, border = border color, breaks = approximate number of bins
hist(AMRO$year,
     main = "Distribution of American Robin Records Over Time",
     xlab = "Year of Collection",
     ylab = "Number of Records",
     col = "blue",
     border = "white",
     breaks = 50)
#average year of collection line
abline(v = mean(AMRO$year, na.rm = TRUE), col = "red", lwd = 2, lty = 2)
#This shows the volume of American Robin data stored in natural history collections by year. A concentration of bars in more recent years would suggest a shift toward digital observation data, while the mostly 20th-century records here point to physical specimens (the basisofrecord table above shows every record is a PreservedSpecimen).
If not pursuing A-level homework, you can delete everything from here on down.
Create a “spiffy looking” graph of length (y axis) vs. age (x axis) for the walleye data with each sex graphed separately. Try to include both males and females in the same figure (rather than 2 separate figures) and try to use color to differentiate them. You might want to explore “jitter” to prevent all of the data points from piling on top of each other along the integer values of age. Try also to include trend lines for each sex (geom_smooth in ggplot2 would be great for this - look back on the HW00 assignment for clues).
#added color transparency so the overlapping points are visible
col_m <- adjustcolor("steelblue", alpha.f = 0.5)
col_f <- adjustcolor("firebrick", alpha.f = 0.5)
#jitter also helps with the overlapping and piling up along the integer ages
plot(jitter(walleye$age, factor = 0.8), walleye$length,
     col = ifelse(walleye$sex == "M", col_m, col_f),
     pch = 16,
     cex = 0.8,
     main = "Walleye Length vs. Age by Sex (1983-2016)",
     xlab = "Age (Years)",
     ylab = "Length (mm)")
#create a subset for the males so they get their own trend line
#(these are straight lm() fits; can't use geom_smooth and still working to remember/learn manual LOESS)
abline(lm(length ~ age, data = subset(walleye, sex == "M")),
       col = "steelblue", lwd = 3)
#and a subset for the female best-fit line
abline(lm(length ~ age, data = subset(walleye, sex == "F")),
       col = "firebrick", lwd = 3)
#added the legend and used bty = "n" to remove its border
legend("topleft",
       legend = c("Male", "Female"),
       col = c("steelblue", "firebrick"),
       pch = 16,
       lwd = 2,
       title = "Sex",
       bty = "n")
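For reference, here is a commented-out ggplot2 sketch of the same figure using geom_jitter() and geom_smooth() trend lines; it assumes the ggplot2/rlang installation issue noted earlier gets resolved.
# library(ggplot2)
# ggplot(subset(walleye, sex %in% c("M", "F")),
#        aes(x = age, y = length_mm, colour = sex)) +
#   geom_jitter(width = 0.2, alpha = 0.5) +
#   geom_smooth(method = "lm", se = FALSE) +
#   scale_colour_manual(values = c(F = "firebrick", M = "steelblue"),
#                       labels = c("Female", "Male"), name = "Sex") +
#   labs(title = "Walleye Length vs. Age by Sex (1983-2016)",
#        x = "Age (Years)", y = "Length (mm)")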