GESC 258 - Geographical Research Methods
Our lab this week will examine trees in a coastal forest in British
Columbia. In particular we have a small dataset of tree circumferences
which we want to analyze. While tree circumference is easy to measure
and tells us something about tree size and age, it is more customary to
work with tree diameters. A typical measure of tree size is
diameter-at-breast-height (DBH), which is measured 4.5 feet from the ground using a DBH tape. We can get
started by converting our measures of circumference to diameter using
the relationship between these two quantities, which is
\[diameter = circumference/\pi\]
recalling that π is the ratio of a circle’s circumference to its diameter, for which we can use \(3.14159\) as a value. So assuming we have a circumference of 214 cm, the diameter would be
\[214/3.14159 = 68.11837\]
which, like the values in our dataset, is expressed in units of centimeters.
### Tree Circumference Data
The data we collected in the video were as follows:
272
272
236
154
256
156
143
269
205
175
In order to bring the data into R, we need to assign
these numbers to a data frame, which is like a table, or to a vector,
which is just a sequence of numbers stored in order. A data frame can be
made up of multiple vectors (as individual columns in a table). We will
make a vector, since we do not have anything other than tree
circumference to store. If we had collected information on tree species,
condition, site characteristics, etc., we would want to store the data as
a table, so we would create a data frame.
x <- c(272,272,236,154,256,156,143,269,205,175)
What is happening in the command above? First of all, we are
making an assignment using the <- operator. This means
we are assigning whatever is on the right side of the <-
to what is on the left side. In this case we are creating a vector of
numbers representing our tree dataset and assigning it to a variable called
x. The function c simply takes some values and combines them into a vector.
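To make the difference between a vector and a data frame concrete, here is a minimal sketch. The species and condition values are made up purely for illustration (we did not record them in the field); only the circumference values come from our dataset.

```r
# Circumference values (cm) from the lab dataset, stored as a vector
circ <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)

# A vector is a single ordered set of values
length(circ)

# A data frame stores several equal-length vectors as columns.
# NOTE: species and condition are hypothetical placeholders, not field data.
trees <- data.frame(
  circumference = circ,
  species = rep(c("species_A", "species_B"), times = 5),
  condition = rep("unknown", times = 10)
)

# Inspect the structure, then pull one column back out as a vector
str(trees)
trees$circumference
```

Each column of the data frame is itself a vector, which is why trees$circumference returns the same values as circ.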
Before going further, as we noted above, we want to convert these
measures of circumference to diameters. R has a built-in
constant for π, which we can use by just typing pi in the
console. Try it:
pi
## [1] 3.141593
So if we wanted to divide all of our values by π we could just type the following in the console:
x/pi
## [1] 86.58029 86.58029 75.12113 49.01972 81.48733 49.65634 45.51831 85.62536
## [9] 65.25353 55.70423
which is good, but this just displays them to the screen. In order to do something with these we want to store them in a new variable. Note that we can use any name we want for variables. We will create a new one called dbh:
dbh <- x/pi
Now we have a new vector that holds our data as DBH values (in cm), derived from the circumferences we measured in the field. We can use this vector in subsequent work below.
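As a quick check on the conversion, the sketch below wraps it in a small helper function and summarizes the resulting diameters. The function name circ_to_diam is our own choice for illustration; it is not a built-in R function.

```r
# Hypothetical helper: convert circumference to diameter (same units in and out)
circ_to_diam <- function(circumference) {
  circumference / pi
}

# Circumference values (cm) from the lab dataset
x <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- circ_to_diam(x)

# Summarize the diameters
round(dbh, 2)
mean(dbh)
sd(dbh)

# Multiplying by pi should recover the original circumferences
all.equal(dbh * pi, x)
```

Because dividing by π is applied to every element of the vector at once, no loop is needed.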
The Normal Distribution
Let’s pretend that we had a sample size of 1500! Luckily, we can create
a dataset using simulation pretty easily, and we will base it on our
actual sample mean and standard deviation. We will use a function called
rnorm, which generates random samples from a
given Normal distribution. We just have to supply the mean,
standard deviation, and sample size.
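sim_dbh <- rnorm(n=1500, mean=mean(x), sd=sd(x))
Note that this call uses the mean and standard deviation of x, so the simulated values are on the circumference scale; use mean(dbh) and sd(dbh) instead to simulate diameters. We don’t want to print all of the values to the screen because there are too many. Instead, we can look at the first few, the last few, and then count the number of observations as follows: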
head(sim_dbh)
## [1] 242.6497 194.6676 258.7242 261.6019 310.3390 174.4379
tail(sim_dbh)
## [1] 180.8939 244.0322 285.1821 194.0609 201.4368 258.3805
length(sim_dbh)
## [1] 1500
Because these are randomly simulated values, yours will differ from what you see here, but there should be 1500 values.
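If you want to get the same simulated values every time you run the code, you can set the random seed first. The sketch below is one way to do this. The seed value 258 is arbitrary, the names sim_a and sim_b are our own, and here we simulate from the mean and standard deviation of dbh.

```r
# Circumference values (cm) from the lab dataset, converted to diameters
x <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- x / pi

# Setting the same seed before each call gives identical random draws
set.seed(258)  # arbitrary seed value
sim_a <- rnorm(n = 1500, mean = mean(dbh), sd = sd(dbh))

set.seed(258)
sim_b <- rnorm(n = 1500, mean = mean(dbh), sd = sd(dbh))

identical(sim_a, sim_b)

# The simulated mean and sd should be close to, but not exactly equal to,
# the values we supplied
c(simulated = mean(sim_a), supplied = mean(dbh))
c(simulated = sd(sim_a), supplied = sd(dbh))
```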
We can look at the distribution of values by plotting the histogram:
hist(sim_dbh, xlab = "Simulated DBH(cm) Values", main="")
We can also draw a curve showing the theoretical Normal distribution implied by the sample mean and standard deviation of dbh:
x = seq(0,200,.1)
y = dnorm(x, mean=mean(dbh), sd=sd(dbh))
plot(x,y, type="l", main="Normal Distribution")
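The histogram and the theoretical curve can also be drawn on the same plot, which makes it easier to see how closely a simulated sample follows the Normal distribution. This is a minimal sketch under a few assumptions: the seed is arbitrary, the variable names are our own, and the sample is simulated from the mean and standard deviation of dbh so that both layers share the same scale.

```r
# Circumference values (cm) from the lab dataset, converted to diameters
circ <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- circ / pi
m <- mean(dbh)
s <- sd(dbh)

set.seed(258)  # arbitrary seed so the figure is reproducible
sim <- rnorm(n = 1500, mean = m, sd = s)

# freq = FALSE puts the histogram on the density scale,
# so the bar areas sum to 1 like the area under the curve
h <- hist(sim, freq = FALSE, xlab = "Simulated DBH (cm)", main = "")

# Overlay the theoretical Normal density on the existing plot
curve(dnorm(x, mean = m, sd = s), add = TRUE, lwd = 2)
```

Using freq = FALSE matters here: with the default count scale, the bars would be far taller than the density curve and the two could not be compared.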
The key point here is that from this smoothly varying distribution
we can calculate probabilities. Say we sampled a tree with a dbh of 100 cm.
We can ask the question: what is the probability of finding a tree with
a dbh of 100 or greater from this distribution? This is akin to finding
the area under the curve to the right of the value 100 on
the x-axis. We will use a probability function in R
called pnorm to get this:
pnorm(q=100, mean = mean(dbh), sd=sd(dbh), lower.tail = FALSE)
## [1] 0.02995489
which tells us that the probability of sampling a tree with a dbh
value of 100 or greater is about 0.03, so pretty unlikely.
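To see that this probability really is an area under the curve, we can compare pnorm with a numerical integration of the density function. This is only an illustrative sketch; the variable names are our own, and integrate returns an approximation rather than an exact value.

```r
# Circumference values (cm) from the lab dataset, converted to diameters
circ <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- circ / pi
m <- mean(dbh)
s <- sd(dbh)

# Upper-tail probability straight from pnorm
p_upper <- pnorm(q = 100, mean = m, sd = s, lower.tail = FALSE)

# Same thing as one minus the lower-tail probability
p_complement <- 1 - pnorm(q = 100, mean = m, sd = s)

# Numerically integrate the density from 100 to infinity
area <- integrate(dnorm, lower = 100, upper = Inf, mean = m, sd = s)

c(p_upper, p_complement, area$value)
```

All three values should agree to several decimal places, which confirms that lower.tail = FALSE returns the area to the right of the value supplied in q.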
Using z-scores instead of raw data to more easily find probabilities from a normal distribution
A z-score is simply a score we can calculate for every observation, which rescales the data to have a mean of 0 and a standard deviation of 1 so that values can be compared against the standard normal distribution. This means it eliminates the units of the data (i.e., we’re no longer in the centimeters we measured), but it makes scores comparable across datasets and easier to interpret quickly.
The z-score formula is
\[z_i = \frac{x_i - \bar{x}}{\sigma}\]
which is just the observation, minus the mean, divided by the standard
deviation. We’ll go back to our smaller dataset to illustrate:
(dbh - mean(dbh)) / sd(dbh)
## [1] 1.0910881 1.0910881 0.4161882 -1.1210836 0.7911326 -1.0835892
## [7] -1.3273030 1.0348464 -0.1649755 -0.7273920
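The sketch below checks the manual calculation against R's built-in scale function and illustrates that z-scores are unit-free: the circumferences and the diameters give the same z-scores, because dividing by π rescales the mean and the standard deviation by the same factor. The variable names are our own.

```r
# Circumference values (cm) from the lab dataset, converted to diameters
circ <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- circ / pi

# Manual z-scores, as in the formula above
z_manual <- (dbh - mean(dbh)) / sd(dbh)

# scale() centers and rescales; as.vector() drops the matrix attributes
z_scale <- as.vector(scale(dbh))
all.equal(z_manual, z_scale)

# Standardized values have mean 0 and standard deviation 1
mean(z_manual)
sd(z_manual)

# z-scores computed from circumferences match those from diameters
z_circ <- (circ - mean(circ)) / sd(circ)
all.equal(z_manual, z_circ)
```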
z-scores have the following properties: values less than zero correspond to observations below the mean, values greater than zero correspond to observations above the mean, and a value of exactly zero corresponds to an observation equal to the mean (which is rare). Moving back to our hypothetical tree with a dbh of 100, let’s calculate its z-score:
(100 - mean(dbh)) / sd(dbh)
## [1] 1.881457
which gives us a z-score of 1.881457. If we check the
probability associated with this z-score, we see
that it matches the one we got above. We no longer have to specify
the mean and sd, because the z-score is referenced to a standard normal
distribution (mean 0, standard deviation 1):
pnorm(1.881457, lower.tail=FALSE)
## [1] 0.02995489
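The equivalence between the two approaches can also be checked in code rather than by eye. In this sketch the helper name z_score and the example dbh values of 60, 80, and 100 cm are our own illustrative choices.

```r
# Circumference values (cm) from the lab dataset, converted to diameters
circ <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- circ / pi

# Hypothetical helper: z-score of a value relative to a reference sample
z_score <- function(value, sample) {
  (value - mean(sample)) / sd(sample)
}

# Illustrative dbh values (cm) to compare
values <- c(60, 80, 100)

# Upper-tail probabilities from the raw values...
p_raw <- pnorm(values, mean = mean(dbh), sd = sd(dbh), lower.tail = FALSE)

# ...and from their z-scores on the standard normal distribution
p_z <- pnorm(z_score(values, dbh), lower.tail = FALSE)

all.equal(p_raw, p_z)
```

Both pnorm and the helper are vectorized, so several values can be checked in a single call.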
Assignment
Calculate the probability of finding a tree with a dbh of 90 or greater based on the sample mean and sample standard deviation above. What is the z-score associated with a dbh of 90 cm? Include commands used to generate the answer. (out of 3)
Create a new dataset with a mean and standard deviation of your choosing and a sample size of 100. Plot the histogram being sure to label axes appropriately. (out of 3)
Do you think the sample of trees we collected was representative of the wider forest? What would be some potential issues with inferring characteristics of the forest from this dataset? (out of 4)
Hand in
Please submit your answers on MLS under Assignment 2. Your final
report should be in PDF format. Please also make sure to
include clean code and its results. The document
formatting of your assignment is worth 5 marks.
Credit
This lab material is adapted from GESC 258 - Labs, originally developed by Dr. Colin Robertson.