GESC 258- Normal Distributions

GESC 258- Geographical Research Methods

Our lab this week will examine trees in a coastal forest in British Columbia. In particular, we have a small dataset of tree circumferences that we want to analyze. While tree circumference is easy to measure and tells us something about tree size and age, it is more customary to work with tree diameters. A typical measure of tree size is diameter at breast height (DBH), which is measured 4.5 feet from the ground using a DBH tape. We can get started by converting our measures of circumference to diameter using the relationship between these two quantities, which is

\[\text{diameter} = \text{circumference}/\pi\] recalling that π is the ratio of a circle’s circumference to its diameter, for which we can use \(3.14159\) as a value. So assuming we have a circumference of 214 cm, the diameter would be

\[214 / 3.14159 = 68.11837\] which, like the values in our dataset, is expressed in units of centimeters.

Tree Circumference Data

The data we collected in the video were as follows (circumference in centimeters): 272, 272, 236, 154, 256, 156, 143, 269, 205, 175.

In order to bring the data into R, we need to assign these numbers to a data frame which is like a table, or to a vector which is just a bunch of numbers stored in order. A data frame can be made up of multiple vectors (as individual columns in a table). We will make a vector, since we do not have anything other than tree circumference to store. If we had collected information on tree species, condition, site characteristics, etc. we would want to store the data as a table, so we would create a data frame.

x <- c(272,272,236,154,256,156,143,269,205,175)

What is happening in the command line above? First of all, we are making an assignment using the <- operator. This means we are assigning whatever is on the right side of the <- to what is on the left side. In this case we are assigning a vector of numbers representing our tree dataset to a variable called x. The function c simply takes some values and combines them into a vector.
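For comparison, the sketch below shows how the same measurements could sit in a data frame rather than a bare vector. The names trees, tree_id and circumference_cm are assumptions made for illustration; they are not part of the lab dataset.

```r
# Tree circumferences (cm), the same values assigned to x in the lab
x <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)

# A data frame is a table whose columns are vectors of equal length.
# trees, tree_id and circumference_cm are hypothetical names.
trees <- data.frame(
  tree_id = seq_along(x),    # 1, 2, ..., 10
  circumference_cm = x
)

str(trees)                   # structure: 10 observations of 2 variables
trees$circumference_cm       # a single column comes back as a vector
```

With only one measured variable the vector is enough, which is why the lab continues with x.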

Before going further, as we noted above we want to convert these measures of circumference to diameters. R has a special number for π which we can use by just typing pi in the console. Try it:

pi

## [1] 3.141593

So if we wanted to divide all of our values by π we could just type the following in the console:

x/pi

##  [1] 86.58029 86.58029 75.12113 49.01972 81.48733 49.65634 45.51831 85.62536
##  [9] 65.25353 55.70423

which is good, but this just displays them to the screen. In order to do something with these we want to store them in a new variable. Note that we can use any name we want for variables. We will create a new one called dbh:

dbh <- x/pi

Now we have a new vector, dbh, which holds the diameters (in centimeters) derived from the circumferences we measured in the field. We can use this vector in subsequent work below.
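The sample mean and standard deviation of dbh are used throughout the rest of the lab, so it can help to look at them first. This is a small sketch that recreates x and dbh as above; the names dbh_mean and dbh_sd are introduced here only for illustration.

```r
# Recreate the lab data: circumferences (cm) and the derived diameters (cm)
x <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- x / pi

dbh_mean <- mean(dbh)   # sample mean, roughly 68 cm
dbh_sd <- sd(dbh)       # sample standard deviation (n - 1 denominator), roughly 17 cm

# Dividing every value by a constant rescales both statistics by that constant
all.equal(dbh_mean, mean(x) / pi)
all.equal(dbh_sd, sd(x) / pi)
```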

The Normal Distribution

Let's pretend that we had a sample size of 1500! Luckily, we can create a dataset using simulation pretty easily, and we will base it on the actual sample mean and standard deviation of our dbh values. We will use a function called rnorm, which generates random samples from a given Normal distribution; we just have to supply the mean, standard deviation, and sample size.

sim_dbh <- rnorm(n=1500, mean=mean(dbh), sd=sd(dbh))

We don’t want to look at all the values on the screen because there are too many. We can look at the first few, the last few, and then count the number of observations as follows:

head(sim_dbh)

## [1] 77.2378 61.9646 82.3545 83.2705 98.7840 55.5253

tail(sim_dbh)

## [1] 57.5803 77.6779 90.7763 61.7715 64.1193 82.2451

length(sim_dbh)

## [1] 1500

Because these are randomly simulated values, yours will differ from what you see here, but there should be 1500 values.
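If a repeatable result is wanted, one common approach is to fix the random seed with set.seed() before simulating. The sketch below does this and then compares the mean and standard deviation of the simulated diameters with the sample values they were based on. The seed 258 and the name sim_check are arbitrary choices for illustration.

```r
# Recreate the lab data: circumferences (cm) and the derived diameters (cm)
x <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- x / pi

set.seed(258)   # arbitrary seed; the same seed reproduces the same draws
sim_check <- rnorm(n = 1500, mean = mean(dbh), sd = sd(dbh))

length(sim_check)                                    # 1500 simulated diameters
c(sample = mean(dbh), simulated = mean(sim_check))   # means should be close
c(sample = sd(dbh), simulated = sd(sim_check))       # so should the spreads
```

The simulated statistics will not match the sample values exactly, but with 1500 draws they should be close.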

We can look at the distribution of values by plotting the histogram:

hist(sim_dbh, xlab = "Simulated DBH(cm) Values", main="")

We can also draw a line showing the theoretical Normal distribution, that is, the distribution the population would follow if its mean and standard deviation matched those of our sample:

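# Grid of dbh values (cm); named dbh_grid so the tree data stored in x is not overwritten
dbh_grid <- seq(0, 200, 0.1)
# Height of the Normal curve at each grid value
dens <- dnorm(dbh_grid, mean=mean(dbh), sd=sd(dbh))
plot(dbh_grid, dens, type="l", xlab="DBH (cm)", ylab="Density", main="Normal Distribution")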
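One way to see how a simulated sample relates to the theoretical curve is to draw both on the same axes. The sketch below is one possible approach, not the only one: it plots the histogram on a density scale (freq = FALSE) and adds the Normal curve with lines(). The seed and the names sim_plot, h and grid_vals are assumptions made so the example is self-contained.

```r
# Recreate the lab data: circumferences (cm) and the derived diameters (cm)
x <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- x / pi

set.seed(258)   # arbitrary seed so the figure is repeatable
sim_plot <- rnorm(n = 1500, mean = mean(dbh), sd = sd(dbh))

# freq = FALSE puts the bars on a density scale, so the bar areas sum to 1
h <- hist(sim_plot, freq = FALSE, xlab = "Simulated DBH (cm)", main = "")

# Evaluate the theoretical curve across the range of the simulated values
grid_vals <- seq(min(sim_plot), max(sim_plot), length.out = 200)
lines(grid_vals, dnorm(grid_vals, mean = mean(dbh), sd = sd(dbh)), lwd = 2)
```

Putting the histogram on a density scale matters here: with the default count scale the curve would sit almost flat along the bottom of the plot.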

The key point here is that from this smoothly varying distribution, we can calculate probabilities. Say we sampled a tree with a dbh of 100 cm. We can then ask: what is the probability of finding a tree with a dbh of 100 cm or greater from this distribution? This is akin to finding the area under the curve to the right of 100 on the x-axis. We will use a special probability function in R to get this, called pnorm. Setting lower.tail = FALSE asks for the area to the right of q rather than the default area to the left:

pnorm(q=100, mean = mean(dbh), sd=sd(dbh), lower.tail = FALSE)

## [1] 0.02995489

which tells us that the probability of sampling a tree with a dbh value of 100 or greater is about 0.03, so pretty unlikely.
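This upper-tail probability can be cross-checked in two ways: against the lower-tail area, since the two must add to 1, and against the share of simulated trees at or above 100 cm. The sketch below assumes the same x and dbh as earlier; the seed, the simulation size of 100000, and the names p_upper, p_lower and sim_tail are illustrative choices.

```r
# Recreate the lab data: circumferences (cm) and the derived diameters (cm)
x <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- x / pi

# Area to the right of 100 (upper tail) and to the left of 100 (lower tail)
p_upper <- pnorm(q = 100, mean = mean(dbh), sd = sd(dbh), lower.tail = FALSE)
p_lower <- pnorm(q = 100, mean = mean(dbh), sd = sd(dbh))
p_upper + p_lower   # the two areas together cover the whole curve

# Empirical check: proportion of many simulated diameters at or above 100 cm
set.seed(258)       # arbitrary seed
sim_tail <- rnorm(n = 100000, mean = mean(dbh), sd = sd(dbh))
mean(sim_tail >= 100)   # should be close to p_upper
```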

Using z-scores instead of raw data to more easily find probabilities from a normal distribution

A z-score is simply a score we can calculate for every observation, which standardizes the data to have a mean of 0 and a standard deviation of 1, as in a standard normal distribution. This means it eliminates the units of the data (i.e., we’re no longer in the centimeters we measured), but it makes scores comparable across datasets and easier to quickly interpret.

The z-score formula is
\[z_i = \frac{x_i - \bar{x}}{s}\] which is just the observation, minus the sample mean, divided by the sample standard deviation \(s\) (the value sd returns in R). We’ll go back to our smaller dataset to illustrate:

(dbh - mean(dbh)) / sd(dbh)

##  [1]  1.0910881  1.0910881  0.4161882 -1.1210836  0.7911326 -1.0835892
##  [7] -1.3273030  1.0348464 -0.1649755 -0.7273920
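Some properties of z-scores can be checked directly: they have a mean of 0 and a standard deviation of 1, and they do not depend on the units of the original data. R's built-in scale() function performs the same standardization. The sketch below compares it with the manual calculation; z_manual, z_scale and z_circ are illustrative names.

```r
# Recreate the lab data: circumferences (cm) and the derived diameters (cm)
x <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- x / pi

z_manual <- (dbh - mean(dbh)) / sd(dbh)
z_scale <- as.numeric(scale(dbh))   # scale() returns a one-column matrix

all.equal(z_manual, z_scale)        # same values either way
mean(z_manual)                      # effectively 0, up to floating-point rounding
sd(z_manual)                        # 1

# Circumference and diameter differ only by the factor pi, so their z-scores agree
z_circ <- (x - mean(x)) / sd(x)
all.equal(z_manual, z_circ)
```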

z-scores have the property that values less than zero are for observations below the mean, values greater than zero are for observations above the mean, and a value of exactly zero corresponds to an observation equal to the mean (which is rare). Moving back to our hypothetical tree with a dbh of 100, let's calculate its z-score:

(100 - mean(dbh)) / sd(dbh)

## [1] 1.881457

which gives us a z-score of 1.881457. If we check the probability associated with this z-score, we see that it matches the one we got above; we just no longer have to specify the mean and sd, because pnorm defaults to the standard normal distribution (mean = 0, sd = 1):

pnorm(1.881457, lower.tail=FALSE)

## [1] 0.02995489
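The conversion also runs in reverse. As a sketch, qnorm() is the inverse of pnorm(): given a tail probability it returns the z-score that cuts off that area, and multiplying by the standard deviation and adding the mean takes the z-score back to centimeters. The names z, p, z_back and dbh_back are introduced here for illustration.

```r
# Recreate the lab data: circumferences (cm) and the derived diameters (cm)
x <- c(272, 272, 236, 154, 256, 156, 143, 269, 205, 175)
dbh <- x / pi

z <- (100 - mean(dbh)) / sd(dbh)    # z-score of a 100 cm tree
p <- pnorm(z, lower.tail = FALSE)   # upper-tail probability for that z-score

# qnorm() inverts pnorm(): recover the z-score from the tail probability
z_back <- qnorm(p, lower.tail = FALSE)

# Undo the standardization to get back to a diameter in centimeters
dbh_back <- z_back * sd(dbh) + mean(dbh)

c(z = z, z_back = z_back, dbh_back = dbh_back)
```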

Assignment

Calculate the probability of finding a tree with a dbh of 90 or greater based on the sample mean and sample standard deviation above. What is the z-score associated with a dbh of 90 cm? Include commands used to generate the answer. (out of 3)
Create a new dataset with a mean and standard deviation of your choosing and a sample size of 100. Plot the histogram being sure to label axes appropriately. (out of 3)
Do you think the sample of trees we collected was representative of the wider forest? What would be some potential issues with inferring characteristics of the forest from this dataset? (out of 4)

Hand in

Please submit your answers on MLS under Assignment 2. Your final report should be in PDF format. Also, please make sure to include clean code and its results. Document formatting is worth 5 marks.

Credit

This lab material is adapted from the GESC 258 labs originally developed by Dr. Colin Robertson.