#use the following code to install tinytex. Paste it into your console and run it before trying to trying to knit a PDF.
install.packages(‘tinytex’) tinytex::install_tinytex()
Before we proceed, make sure that you have watched this introductory video to R and R studio https://www.youtube.com/watch?v=zqUQL8OOtMQ&ab_channel=weecology
When you are new to R, it can be helpful to enable rainbow parentheses so that you can see if each open parenthesis matches a closed parenthesis in the appropriate place. If you want to do so, open Global Options from the Tools menu. Select Code -> Display. Enable the Rainbow Parentheses option at the bottom.
In this course we will be using a feature of R called “Rmarkdown”. This is an R Markdown document. Markdown is a formatting syntax for authoring HTML, PDF, and MS Word documents within R. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button in the toolbar above a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. For our class, you can turn in a html or PDF. R Studio will automatically make an html file if you change “output: pdf_document” to “output: html_document”, which you should do if you are having trouble, but PDF files look much better. If you click Knit now, a PDF or html will render that will show you roughly what the output will look like.
Once you complete the lab, the output will include your code and plots. Note how headings are controlled with hash marks “#”. The number of hashes tells markdown how large to make the heading with fewer hashes indicating larger font size. Note also that the word “Knit” above is bolded because of the “**” on either side of it. As you become more familiar with R and Rmarkdown, you will learn many additional tricks for formatting document as well as for coding.
Note that small coding errors will cause your render to fail and Rmarkdown is not as easy to debug as a typical R script that can be run one line at a time. We recommend (a) rendering your lab before you start to make sure that this is working, and then either (b) rendering after each answer to make sure you have not made mistakes, and/or (c) using a separate script to debug your code by running one line at a time and then transferring the working code into the Rmarkdown code chunk before rendering.
This lab has multiple parts. First you will read through the Introduction to R documentation that we have provided and answer the exercises therein. We will then practice estimating the abundance of animals using several estimators. We will estimate density using a belt transect assuming perfect detection, then we will use the Lincoln-Petersen estimator, the Chapman modification of the LP estimator, and the Schnabel estimator. We will do some of the coding for you in the first lab, but the bandaid will come off quickly. We expect you to closely read the code and understand how it works so that you can reproduce it. Use Rs help features and copious Googling to learn about the uses and options associated with R functions.
Read the FW 320 Introduction to R PDF that we have provided. As you read, there will be 15 bold sections saying “EXERCISE”. These are assigned. I recommend that you first do this work in an R script (Select “New File” then “R Script”) and then transfer that working script into the code chunk below to hand in the exercise as part of the document produced by R Markdown. I created this code chunk using the green “Insert” button on the R Studio toolbar above and selecting “R”. All text outside of the code chunk will be treated like text when producing the markdown report. Text inside the code chunk will be treated like R code to run and the output will also be presented. But you can also use add text using comments indicated by “#”. Answer the questions in the exercises using comments within the code chunk below.
#1.
x <- 3
x*3 +6
## [1] 15
y <- 5
x*4 +y
## [1] 17
#2.
#3.
#4.
#5.
#6.
#7.
#8.
#9.
#10.
#11.
#12.
#13.
#14.
#15.
Carl George Johannes Petersen, a Danish fisheries biologist, was the first person to publish a method for estimating population sizes from a mark-recapture method. In 1896, he described a simple method for marking fish, releasing them back into the population, and then using the ratio of marked and unmarked fish in a sample recaptured from that same population to estimate the number of individuals in that population. This was called the Petersen Estimate and gained wide use for a variety of applications in fisheries management. Decades later, wildlife biologist Frederick C. Lincoln developed the same equation for estimating the size of waterfowl populations from banding/recovery studies conducted by his agency, the Biological Survey (now the U.S. Fish and Wildlife Service). However, most wildlife managers at the time did not read literature related to fisheries management (but, to be fair, most fisheries managers did not read literature pertaining to wildlife management – and still don’t), so they were unaware that this same estimator had been developed and employed by fisheries professionals for some time. For this reason, many wildlife managers and terrestrial ecologists refer to this as the Lincoln Index - or Lincoln-Petersen Index. Acknowledging the contributions made by Lincoln as well as the first report by Peterson, fisheries biologists now refer to this as the Petersen-Lincoln Index as well as the Petersen Estimate.
The Basic Idea This is a very simple method that relies on the assumption that a sample of animals from a population that includes a known number of marked individuals will have the same ratio of marked to unmarked animals as the population as a whole:
Proportion marked in population = Proportion marked in sample M/n = R/C
Where: n = an estimate of the true unknown number (N) of individuals in the population at the time of initial marking. M = the number of individuals in the population that have been marked at the time of the current sampling. C = the number of individuals captured at the current sampling. R = the number of individuals recaptured at the current sampling – that is, those captured individuals that are marked.
Assumptions Although many variations exist on the basic methodology employed and the calculations performed, all variation of the Lincoln-Petersen Index are designed to be used on closed populations – that is, populations that experience no births, deaths, immigration, or emigration during the period of study. Several assumptions are made whenever using this method of population estimation. Reliable estimates can only be made if we do not violate these most important assumptions:
The methods of sampling and marking individuals do not alter the rates of mortality or survival over uncollected and / or unmarked individuals.
The population under study is a closed system, with no immigration, emigration, or births during the study period. Deaths may occur, but must affect marked and unmarked animals equally.
The method of marking is permanent and readily detected throughout the duration of the study.
Previously-captured individuals are just as susceptible to recapture as all other individuals in the population.
The period of time between sampling is adequate to allow mixing and distribution of marked animals throughout the entire population.
General Methodology After an initial capture, marking, and release, one returns to the same site, capturing additional individuals from the population. After noting the number of marked individuals and the total number of individuals captured, one can estimate the size of the population by using the following formula:
n = M (C / R)
which is just a re-arrangement of the ratio above, solving for n.
When dealing with small capture samples, or capture samples which are small relative to the population size, it is possible to make rounds of capture effort in which no marked animals are recaptured. This will put ‘0’ in the denominator of the equation, resulting in the value of infinity for a population estimate. To avoid this problem, one uses the following equation often called the “Chapman modification”:
(M+1)/(n+1) = (R+1)/(C+1)
re-arrange and solve for n:
n = [(M+1)* [(C+1) / (R+1)]] - 1
By adding one to all components of the equation, then subtracting it back off, you do not change the overall estimate drastically, but you also don’t estimate an infinite population size if recaptures = 0.
The Schnabel estimator makes another assumption in addition to the ones made for the Petersen Index; that is that there is absolutely no mortality of either marked or unmarked animals, and therefore this technique works best when applied to closed populations over short periods of time. The advantage of using the Schnabel estimate is that the population is subjected to multiple rounds of sampling and marking, leading to an increase the number of marked animals in the population and more precise estimates of the population size.
General methodology After an initial capture, marking, and release of the animals, you return to the same site and capture additional individuals from the population. The number of marked individuals and the total number of individuals captured is recorded, and any unmarked individuals are given marks and released back into the population. Later, the entire process is repeated, keeping track of the running total number of marks in the population. After several repeated rounds of recapture and marking, the size of the population is estimated using the following formula where “sum()” indicates summing over the sample sessions indicated by “t”:
n = sum(Mt*Ct) / sum(Rt)
Where: n = an estimate of the true number (N) of individuals in the population at the time of initial marking. Mt = the number of individuals in the population that have been marked at the time the t sample was taken. Ct = the number of individuals captured at the sampling that occurred at time t. Rt = the number of recaptured, marked individuals taken at time t.
Below is an example that should help you understand the process. Note that in the first capture, no animals are yet marked and so this session is just for marking animals. All 32 animals caught in the first session are marked at the time of the second, and some of those are also captured (i.e. recaptures) during the second session. Any unmarked animals caught during the second session are marked and released for possible capture in the third session and beyond.
Imagine you are trying to estimate salamander abundance and density and abundance (abundance per unit area) in a watershed in the Oregon Cascades. You survey the stream banks using a 1 m by 1 m quadrat. You have two sites of suitable habitat, a low elevation site, and a high elevation site. At each site, you throw 40 quadrats (each 1 m^2) and search for salamanders under logs, rocks, and other debris. At the end of your survey, you transfer your data to a spreadsheet, which is saved as Comma-Separated-Values (.csv). Import your data and find the mean, standard deviation (the square root of the variance, both of which are measures of the spread or “central tendency” of the data), and estimated number of salamanders at each site. I will do this for the low elevation site as an example and then you will have to do something similar for the high elevation site. Do not gloss over any lines. Try to understand every bit of syntax so that you can reproduce it. That is how you will learn to code. You will have to start coding yourself where it says “Dataset 1 Assignment”
#this line imports your data into a dataframe that we get to name. Make sure the csv file is in the same folder as this Rmarkdown file. Note that "<-" and "=" both work for assigning something to an object.
count_data <- read.csv("salamander_counts.csv")
count_data #Let's output the data so you can see it
## Quadrat_number lower_elev_site high_elev_site
## 1 1 1 5
## 2 2 0 2
## 3 3 7 8
## 4 4 6 5
## 5 5 6 2
## 6 6 1 3
## 7 7 4 4
## 8 8 6 3
## 9 9 7 2
## 10 10 4 3
## 11 11 7 2
## 12 12 7 5
## 13 13 4 4
## 14 14 10 3
## 15 15 9 2
## 16 16 7 6
## 17 17 6 3
## 18 18 8 4
## 19 19 7 2
## 20 20 7 5
## 21 21 9 3
## 22 22 6 6
## 23 23 4 7
## 24 24 8 3
## 25 25 1 5
## 26 26 5 2
## 27 27 10 3
## 28 28 1 7
## 29 29 3 7
## 30 30 9 0
## 31 31 10 6
## 32 32 3 2
## 33 33 5 2
## 34 34 7 5
## 35 35 9 4
## 36 36 0 2
## 37 37 4 6
## 38 38 7 8
## 39 39 5 4
## 40 40 1 3
#note column names have been added already in the csv. Note that column names can't contain spaces.
#use sum() to find the total number of salamanders counted and and length() to find the number of
#observations, which we can compare with the mean() function. Remember to use the $ to specify which column in the dataframe to use
Mean_density_low <- sum(count_data$lower_elev_site)/length(count_data$lower_elev_site)
Mean_density_low2 <- mean(count_data$lower_elev_site)
#Compare the two summary statistics - type the object name to ask R to return its value
Mean_density_low
## [1] 5.525
Mean_density_low2
## [1] 5.525
#For the manual expression, we can take advantage of the way R handles vectors
#The object count_data$lower_elev_site is a vector (a list of numbers):
count_data$lower_elev_site
## [1] 1 0 7 6 6 1 4 6 7 4 7 7 4 10 9 7 6 8 7 7 9 6 4 8 1
## [26] 5 10 1 3 9 10 3 5 7 9 0 4 7 5 1
#we use the object storing the mean (Mean_density_low) to find the raw
#difference from the mean of each observation:
count_data$lower_elev_site - Mean_density_low
## [1] -4.525 -5.525 1.475 0.475 0.475 -4.525 -1.525 0.475 1.475 -1.525
## [11] 1.475 1.475 -1.525 4.475 3.475 1.475 0.475 2.475 1.475 1.475
## [21] 3.475 0.475 -1.525 2.475 -4.525 -0.525 4.475 -4.525 -2.525 3.475
## [31] 4.475 -2.525 -0.525 1.475 3.475 -5.525 -1.525 1.475 -0.525 -4.525
#We can then square every one of these values:
(count_data$lower_elev_site - Mean_density_low)^2
## [1] 20.475625 30.525625 2.175625 0.225625 0.225625 20.475625 2.325625
## [8] 0.225625 2.175625 2.325625 2.175625 2.175625 2.325625 20.025625
## [15] 12.075625 2.175625 0.225625 6.125625 2.175625 2.175625 12.075625
## [22] 0.225625 2.325625 6.125625 20.475625 0.275625 20.025625 20.475625
## [29] 6.375625 12.075625 20.025625 6.375625 0.275625 2.175625 12.075625
## [36] 30.525625 2.325625 2.175625 0.275625 20.475625
#Now we van do this in one expression: we create an object called Variance_low, find
#the difference from the mean of every data point, square them, add them up, and divide by the total number of counts (the length of our dataset) - 1. This is the formula for the variance.
Variance_low <- sum((count_data$lower_elev_site -
Mean_density_low)^2)/(length(count_data$lower_elev_site) - 1)
#note the multiple sets of parentheses! These are necessary to close functions and to define the order of operations (PEMDAS)
#OR we can do this the easy way, using the internal function var()
Variance_low2 <- var(count_data$lower_elev_site)
#Compare the outputs:
Variance_low
## [1] 8.460897
Variance_low2
## [1] 8.460897
It is best practice to make exploratory plots of your data and compare them to your summary statistics (i.e., mean and SD, which is the square root of the variance). Histograms are especially useful to determine if your data are normal (aka Gaussian, or bell curve), and are useful to summarize the values. A histogram is an approximate representation of the distribution of numerical data. Histograms are made by dividing the entire range of values into a series of intervals and then counting how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
#plot histograms (frequency plots) of our data for the low evelation site
#HINT: remember to use the $ to specify which column in the dataframe to use in hist()
#We can also set the number of breaks (bins) to be equal in both plots
#We use some optional arguments of hist() to label the x-axis, and title each plot.
#now we use the function hist() with three arguments - xlab, main, and breaks to label the x axis, title, and define how many bins we want in our histogram.:
hist(count_data$lower_elev_site, xlab = "Counts", main = "Low Elevation Site", breaks = 10)
#we will use the function mtext() to add the mean and standard deviation (SD) to the histogram for the low elevation site. The SD is the square root of the variance, so we can wrap that into our line of code. We also use the function round() because R will return an overly precise version of the SD. Do you think maybe R has a function to find the standard deviation? Of course it does. Instead of using sqrt(Variance_low), you can see what happens when using the sd() function.
mtext(paste("Mean is", Mean_density_low, "salamanders per square meter"),side = 3, line = 0.5, cex=0.75)
mtext(paste("SD is", round(sqrt(Variance_low), 2), "salamanders per square meter"),side = 3, line = -0.5, cex=0.75)
#Look at the help documentation for mtext()! You can use ?mtext or help(mtext). What does the "side" argument do? Look at the help documentation for paste(). What does this do? We are intentionally not telling you exactly how to use mtext because the most important thing to learn when coding is how to find answers to questions. That is how your skillset will grow.
#use sum() to find the total number of salamanders counted and and length() to find the number of #observations, which we can compare with the mean() function. Mean_density_high <- sum(count_data\(high_elev_site)/length(count_data\)high_elev_site)
Mean_density_high2 <- mean(count_data$high_elev_site)
#Compare the two summary statistics - type the object name to ask R to return its value Mean_density_high Mean_density_high2
#For the manual expression, we can take advantage of the way R handles vectors #The object count_data\(lower_elev_site is a vector (a list of numbers): count_data\)high_elev_site
#we use the object storing the mean (Mean_density_low) to find the raw #difference from the mean of each observation: count_data$high_elev_site - Mean_density_high
#We can then square every one of these values: (count_data$high_elev_site - Mean_density_high)^2
#Now we van do this in one expression: we create an object called Variance_low, find #the difference from the mean of every data point, square them, add them up, and divide by the total number of counts (the length of our dataset) - 1. This is the formula for the variance.
Variance_high <- sum((count_data\(high_elev_site - Mean_density_high)^2)/(length(count_data\)high_elev_site) - 1)
#or we can do this the easy way, using the internal function
var()
Variance_high2 <- var(count_data$high_elev_site)
#Compare the outputs: Variance_high Variance_high2
#now we use the function hist() with three arguments - xlab, main, and breaks to label the x axis, title, and define how many bins we want in our histogram.: hist(count_data$high_elev_site, xlab = “Counts”, main = “High Elevation Site”, breaks = 10)
#we will use the function mtext() to add the mean and standard deviation (SD) to the histogram for the low elevation site. The SD is the square root of the variance, so we can wrap that into our line of code. We also use the function round() because R will return an overly precise version of the SD. Do you think maybe R has a function to find the standard deviation? Of course it does. Instead of using sqrt(Variance_low), you can see what happens when using the sd() function.
mtext(paste(“Mean is”, Mean_density_high, “salamanders per square meter”),side = 3, line = 0.5, cex=0.75)
mtext(paste(“SD is”, round(sqrt(Variance_high), 2), “salamanders per square meter”),side = 3, line = -0.5, cex=0.75)
hist(count_data$lower_elev_site,xlab =“Counts”,main =“Low Elevation Site”,breaks =10)
mtext(paste(“Mean is”, Mean_density_low,“salamanders per square meter”),side =3,line =0.5,cex =0.75)
mtext(paste(“SD is”, round(sqrt(Variance_low),2),“salamanders per square meter”),side =3,line =-0.5,cex =0.75)
#NOTE: We can plot two histograms (for each site) with the plot-partitioning #function par(). The par function has tricky syntax, but the mfrow option takes a vector with the number of rows and columns to plot figure panels in. We can create 1 row and 2 column panels with: par(mfrow=c(1, 2))
hist (count_data$high_elev_site, xlab =“Counts”, main =“High Elevation Site”, breaks =10) mtext(paste(“Mean is”, Mean_density_high2,“salamanders per square meter”),side =3,line = 0.5,cex =0.75) mtext(paste(“SD is”, round(sqrt(Variance_high2),2),“salamanders per square meter”),side = 3,line = -0.5,cex = 0.75)
hist (count_data$lower_elev_site,xlab =“Counts”,main =“Low Elevation Site”,breaks =10) mtext(paste(“Mean is”, Mean_density_low,“salamanders per square meter”),side =3,line =0.5,cex =0.75) mtext(paste(“SD is”, round(sqrt(Variance_low),2),“salamanders per square meter”),side =3,line =-0.5,cex =0.75)
## 2. (4 pts) The total area of the low-elevation habitat is two hectares. There are 10,000 square meters in a hectare, so the estimated population size at the low-elevation site is (Mean density in number/m^2) * (10,000 m^2/hectare) * (2 hectares). In R, this looks like:
```r
Mean_density_low*10000*2
## [1] 110500
#units are now number of salamanders
Mean_density_high 100003.5
Sd_Dens_Low <- sd(count_data$lower_elev_site) Sd_Dens_Low
#extrapolate to entire area Sd_Dens_low_entire <- Sd_Dens_Low10002 Sd_Dens_low_entire
Sd_Dens_high <- sd(count_data$high_elev_site) Sd_Dens_high
#extrapolate to entire area Sd_Dens_High_entire <- Sd_Dens_high10003.5 Sd_Dens_High_entire
Imagine you are part of a project estimating sloth abundance in a remote river valley in Costa Rica. You know there are four 500-m permanent transect lines that cross the valley. You are able to walk all four transects on two days in the last month of the dry season, before the road becomes impassable. You are careful to look in trees that are within 10 meters of the transect on both sides of the line, so each transect is 500 x 20 meters.
Fortunately, the sloths stay in their trees. The first day, you find 13 sloths, and you are able to mark the trees they are in. On the second day surveying, you find 10 of those sloths in their trees, but you find an additional 6 sloths, for a total of 16 sloth sightings on Day 2.
A <- 4(50020) A D1 <- 13/A #first day 13 sloths D1 D2 <- 16/A #second day 16 sloths D2
You know it is probably safe to assume that the 6 sloths in new tress on the second survey day were not any of the animals you found on the first day, and that it is a closed population at this time of year (no births, immigration, emigration, or deaths) so you decide to use the Lincoln-Peterson estimation method to assess abundance.
#First create objects M, R, and C using the information given above on the two days
#of surveying.
M <- 13 #13 sloths marked first sample day
R <- 10 #total recaptures second sample day
C <- 16 #total captures second sample day
#Calculate and store the estimate as N_hat and print the value N_hat
#Note that when a line just has an object on it like N_hat, R interprets it as print(N_hat)
N_hat <- (M*C/R)
N_hat
## [1] 20.8
You report to the project PI that you have estimated sloth numbers, but you only saw 13 sloths and 16 sloths on each of your survey days. You are bummed because the road closed and you cannot go back to survey for more. But your PI suggests you use something called the Chapman modification to the L-P method. It adjusts for low sample sizes (less than 30).
N_hat_Chap <- ((M+1)*((C+1)/(R+1)))-1
N_hat_Chap
## [1] 20.63636
Imagine you are leading a Capture-Mark-Recapture study to assess the population size of meadow voles in a region of the Rocky Mountains. The vole habitat is bounded by the steep terrain and by a large area that burned the year before, so you are fairly certain the population is closed. You have enough time to trap and mark individuals over 5 consecutive days. You enter the total number caught, the number recaptured, and the number of new marks for each day.
vole_data <-read.csv("voleCMR.csv")
vole_data
## Day Catch Recaptures Newmarks
## 1 1 30 0 30
## 2 2 35 15 20
## 3 3 28 10 18
## 4 4 20 8 12
## 5 5 25 9 16
#use the salamander example above as a model for the syntax
#remember you can easily view the data and its column headers by typing the
#name of the dataframe in the console.
The dataset includes the number of individuals caught each day (Catch), the number of those that are recaptures (Recaptures), and number of individuals newly marked (Newmarks). You need to create new objects to store the numbers caught each day (Ct), recaptures (Rt), and the new individuals marked each day (for example, called new_i) as vectors. As an example, an example for Ct is given below. You can use this expression by uncommenting it, and substituting the name of your dataframe if different. Note that the Schnabel estimate does not use the NEW marked each day, but the total number of previously marked individuals. We will have to create that from these data in the next step.
Asa reminder:
n = sum(Mt*Ct) / sum(Rt)
Where: n = an estimate of the true number (N) of individuals in the population at the time of initial marking. Mt = the number of individuals in the population that have been marked at the time the t sample was taken. Ct = the number of individuals captured at the sampling that occurred at time t. Rt = the number of recaptured, marked individuals taken at time t.
Ct <- vole_data$Catch
Rt <- vole_data$Recaptures
new_i <- vole_data$Newmarks
The only remaining variable to calculate is Mt, which is the cumulative sum of prior marks on each day. We need to calculate the cumulative sum for sampling days 1-4 for this reason. We use the function cumsum() with a little bit of R magic called indexing to do this. We set M_i to be equal to 0 when i = 1, because there are no previously marked individuals on day 1.
You can create a vector of length 5 by concatenating (combining) 0 and the cumulative sum of new marks on days 1-4. This expression uses the index [1:4] after the object new_i to tell the function cumsum() to only evaluate the first four elements in the vector.
This is relatively advanced, so I have provided an expression that you can uncomment, and change the name of the object if it is not new_i. You could also do this by hand with multiple lines of code.
Mt <- c(0, cumsum(new_i)[1:4])
Mt
## [1] 0 30 50 68 80
Use the Schnabel estimator formula to find the population size N_hat, which we also call n above. The “hat” is often used to denote an estimator, which is something that is estimated with uncertainty, and looks like a little caret above the variable when written in an equation. In the text, please report the estimate and the units.
N_hat <- sum(Mt*Ct)/sum(Rt)
N_hat
## [1] 138.3333