GESC 258- Geographical Research Methods
Access online version of this document at (https://rpubs.com/majidhojati/gesc258lab3)
This lab has two parts. Your Lab demonstrator will teach you the first part on (Feb 4th) and the second part on (Feb 11th and Feb 18th). This week we review a discrete distribution in the first part. In the second part we will look at the continues distribution.
Part I: n-year floods as binomial distribution
Binomial distribution is a distribution that applies to the experiments that have two outcomes (success
or failure
). Tossing a coin more than once is an example of binomial distribution. There are many real-life examples of binomial distribution. Such as number of side effects from medications, or number of spam emails/calls that you might receive per day. A good example of binomial distribution in geography is 100-year flood. A 100-year flood
is a flood event that has a 1 in 100 chance (0.01 probability) of being equaled or exceeded in any given year. Similarly we have 50-year flood
which is a flood event that has a 1 in 50 chance (0.2 probability) of occurring. Read more about this concept here :http://bcn.boulder.co.us/basin/watershed/flood.html. The Binomial distribution can be used to calculate the probability of experiencing a certain number of 50-year
floods in a specified number of years. In this lab we are going to calculate the probability of having x
number 50-year flood in 50 years period.
So to get started we will first define our binomial model. We know the probability of a 50-year flood happening each year is 0.2
so this is probability of success
or \(p=0.2\). What is probability of failure? We can calculate it using this equation:
\[q = 1- p\] As you recall, a binomial distribution has two parameters, n
and p
. As we mentioned in the first paragraph we are going to calculate probability of x
number of 50-year floods
in a 50 years period, so \(n=50\). We can now easily define our binomial model as \(X~B(50,0.2)\). Lets start calculations in the R. First we need to define our variables. We can define them as follows:
please pay attention on the comments in each line of the codes
If you want to find out what is the probability of having 3
50-year flood in 50 years you need to find \(P(x=3)\). In R you can use the following function:
px_3 <- dbinom(3,n,p) # n and p are already defined in the pervious part. We are storing output of dbinom as px_3
px_3
## [1] 0.004370946
Now we want to calculate all the probability values for all the possible values of X
. Recall that the domain of binomial distribution is from 0
to n
. So we need to run dbinom(x,n,p)
50 times and change x parameter every time. Another faster way to do it is to define a vector of numbers from 0 to 50 and pass it as x
to dbinom(x,n,p)
. lets do it in R:
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
so if you pass variable x
to the dbinom
function, it will calculate binomial probability for all the possible values of X
.
## [1] 1.784060e-04 1.092737e-03 4.370946e-03 1.283965e-02 2.953120e-02
## [6] 5.537101e-02 8.701158e-02 1.169218e-01 1.364088e-01 1.398190e-01
## [11] 1.271082e-01 1.032754e-01 7.547049e-02 4.986443e-02 2.991866e-02
## [16] 1.636177e-02 8.180883e-03 3.749571e-03 1.578767e-03 6.117722e-04
## [21] 2.184901e-04 7.200240e-05 2.191378e-05 6.163249e-06 1.602445e-06
## [26] 3.852031e-07 8.560068e-08 1.757871e-08 3.333894e-09 5.834314e-10
## [31] 9.410184e-11 1.396824e-11 1.904760e-12 2.380950e-13 2.721086e-14
## [36] 2.834465e-15 2.681250e-16 2.293175e-17 1.763980e-18 1.212737e-19
## [41] 7.394735e-21 3.961465e-22 1.842542e-23 7.328292e-25 2.442764e-26
## [46] 6.637946e-28 1.412329e-29 2.206764e-31 2.251800e-33 1.125900e-35
In this output you see an
e-
at the end of each number (e.g. 1.784060e-04). It is called scientific notation, You can read more about it here: https://en.wikipedia.org/wiki/Scientific_notation
In the above output, the first value (Px[1]
) is probability of having x=1
,one 50-year flood in n=50
years. we can say \(P(x=1)=0.00017\). To find max value of probabilities we can use max(Px)
which is
## [1] 0.139819
lets start plotting. Here we first set a few parameters for our plot and make a new blank plot using plot.new()
function. using plot.window()
function we set our plot’s x
and y
ranges. As you see we set x
axis’ limit using xlim
parameter and set y
axis’ limit using ylim
parameter. both xlim
and ylim
parameters should be in form of a range. it means they should have min
and max
of our values. In our x axis we want to plot x
value (vector of numbers between 1 and n=50) and as a result xlim=c(1,50)
and for the y axis we want to plot probability value range from 0 to max(Px)
where Px
is out calculated values so ylim=c(0, max(Px)+0.2)
. We add 0.2
to max(Px)
to add an extra gap on top of our chart. In the next line of code we use barplot
to plot values of Px
on the chart. First parameter in barplot function is height
of bars. It should be a vector of numbers. We already have the heights of the bars in Px
. So we just need to pass it to our function. Pay attention to col
parameter which is a color and you can change it to whatever color you want. space
parameter is the space between bars in barplot.
You can type
?c
in R console to find out what doesc(1,50)
mean. Try to change parameters in the codes to learn how they impact your plot
Read comments of following section of code:
par(mar=c(5,5,5,5), las=1) # we set a margin of 5 pixels around our charts to have enough room for labels
plot.new() # create a new plot
plot.window(xlim=c(1,50), ylim=c(0, max(Px)+0.2)) + # set our plot's ylim and xlim
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange") # draw bar plot. col can be name of your colors like orange, blue, red
## numeric(0)
Now we want to add axis
to our plot. we use axis
function to do it. side
parameter in this function determines the side of the axis. 1 is bottom, 2 is left, 3 is top and 4 is right side. at
parameter tells to the R to add a tick
to the presented coordinates. at=seq(0.5,n, by=1)
means that add a tick every 1 unit, starting from 0.5 and end at n=50. labels
parameter tells R to add a label on defined ticks.
par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
### New Lines of code
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1)) # set ticks at defined positions for x axis
now we want to draw a box around our plot. we use
box()
function.
par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1))
### New Lines of code
box()
Now Lets imagine we want to find probability of having less than or equal 9 event P(x<=9)
. If we want to show P(x<=9)
on the chart using a line and arrow we use two functions called lines
and arrows
. lines
functions gets a vector of x
values and a vector of y
values to draw a line based on the (x,y)
values. lwd
is thickness of the line and lty
is type of line. arrows
function is similar but it will get x0
and y0
as start point of the arrow and x1
and y1
as end point of arrow. use ?lines
or ?arrows
command to find out more about this function.
Also we can use text
function to add a text on (x,y)
location on the chart.
par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1))
box()
### New Lines of code
lines(x=c(9,9),y=c(0,0.3), col="red",lwd=2,lty="dashed")
arrows(x0=9,y0=0.3,x1=0,y1=0.3, col = "red", lwd=2, lty="dashed")
text(x=4.2,y=0.32,"P(x<=9)")
To calculate
P(x<=9)
we have Several methods. One method is to calculate P(x=1)+ P(x=2) + ... + P(x=9)
. Lets try this method.
## [1] 0.4437261
you can also use this function to get the same result.
sum(dbinom(1:9,n,p) )
Another more advanced method is to calculate cumulative probability values
using pbinom
function. Type ?pbinom
in R console to learn what are its parameters. Lets look at its output when we run it with x from 1 to 50
and n=50
and p=0.2
## [1] 0.0001926784 0.0012854150 0.0056563610 0.0184960151 0.0480272194
## [6] 0.1033982275 0.1904098116 0.3073316278 0.4437404133 0.5835594185
## [11] 0.7106676050 0.8139430065 0.8894134923 0.9392779204 0.9691965772
## [16] 0.9855583427 0.9937392254 0.9974887967 0.9990675635 0.9996793357
## [21] 0.9998978257 0.9999698281 0.9999917419 0.9999979051 0.9999995076
## [26] 0.9999998928 0.9999999784 0.9999999960 0.9999999993 0.9999999999
## [31] 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000
## [36] 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000
## [41] 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000
## [46] 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000
As you see this function first calculates P(x=1)
then calculates P(x=1) + P(x=2)
and continues until reaches to P(x=1) + P(x=2) + ,,, P(x=n)
. Lets plot it on the chart. Notice that here we want to add another y axis on the right side of the plot. So we use another plot.window
function and from now all of the future parameters will use this axis for their y
parameter.
par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1))
box()
lines(x=c(9,9),y=c(0,0.3), col="red",lwd=2,lty="dashed")
arrows(x0=9,y0=0.3,x1=0,y1=0.3, col = "red", lwd=2, lty="dashed")
text(x=4.2,y=0.32,"P(x<=9)")
### New Lines of code
cumulativePx <- pbinom(x,n,p)
plot.window(xlim=c(0,n), ylim=c(0,1.5)) # we define a new plot window and set xlim and ylim, xlim is same as our previous plot but ylim is from 0 to 1.5
lines(x, cumulativePx, col="red", lwd=2) #draw a line, with color= red, width=2 and x coordinates of 0 to 50 and y coordinates of cumPX
axis(side=4,at=seq(0,1.5, by=0.2), labels=seq(0,1.5, by=0.2)) # same as before we add an axis on the right side (side=4) and add ticks from 0 to 1.5 with 0.2 intervals, and label them
Lets add title and labels to our plot
par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1))
box()
lines(x=c(9,9),y=c(0,0.3), col="red",lwd=2,lty="dashed")
arrows(x0=9,y0=0.3,x1=0,y1=0.3, col = "red", lwd=2, lty="dashed")
text(x=4.2,y=0.32,"P(x<=9)")
cumulativePx <- pbinom(x,n,p)
plot.window(xlim=c(0,n), ylim=c(0,1.5))
lines(x, cumulativePx, col="red", lwd=2)
axis(side=4,at=seq(0,1.5, by=0.2), labels=seq(0,1.5, by=0.2))
### New Lines of code
title("Binomial distribution for n=50 and p=0.2")
mtext("Cumulative Probability", side = 4, las=3, line=3 ) # you can change values of las and line to see their effect on your axis's title
mtext("Probability", side = 2, las=3, line=3)
Now we want to draw a polygon
on the chart to highlight areas under chart that show P(x<=9)
and P(x>9)
par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1))
box()
lines(x=c(9,9),y=c(0,0.3), col="red",lwd=2,lty="dashed")
arrows(x0=9,y0=0.3,x1=0,y1=0.3, col = "red", lwd=2, lty="dashed")
text(x=4.2,y=0.32,"P(x<=9)")
cumulativePx <- pbinom(x,n,p)
plot.window(xlim=c(0,n), ylim=c(0,1.5))
lines(x, cumulativePx, col="red", lwd=2)
axis(side=4,at=seq(0,1.5, by=0.2), labels=seq(0,1.5, by=0.2))
title("Binomial distribution for n=50 and p=0.2")
mtext("Cumulative Probability", side = 4, las=3, line=3 ) # you can change values of las and line to see their effect on your axis's title
mtext("Probability", side = 2, las=3, line=3)
### New Lines of code
polygon(c(1:9, 9,1), c(cumulativePx[1:9], cumulativePx[1],cumulativePx[1]), col=adjustcolor( "green", alpha.f = 0.2))
text(7, 0.20, "P(x<=9)")
polygon(c(9:n, n,9), c(cumulativePx[9:n], Px[n],cumulativePx[1]), col=adjustcolor( "brown", alpha.f = 0.2))
text(18, 0.20, "P(x>9)")
Lets calculate mean and standard deviation of our binomial model.
mu <- n*p # mean of a binomial distribution is equal to n * p
sigma <- sqrt(n*p*(1-p)) # stdev of a binomial distribution is equal to square root of n*p*(1-p)
sd1.pos <- mu + sigma # lets calculate 1 stdev distance from mean in the positive side
sd1.neg <- mu - sigma # lets calculate 1 stdev distance from mean in the negative side
And finally lets plot mu
and stdev values on our plot. We use a function named abline
which draws a horizontal or vertical line on the chart. v
parameter is value of x
that it draws a vertical line on it.
par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1))
box()
lines(x=c(9,9),y=c(0,0.3), col="red",lwd=2,lty="dashed")
arrows(x0=9,y0=0.3,x1=0,y1=0.3, col = "red", lwd=2, lty="dashed")
text(x=4.2,y=0.32,"P(x<=9)")
cumulativePx <- pbinom(x,n,p)
plot.window(xlim=c(0,n), ylim=c(0,1.5))
lines(x, cumulativePx, col="red", lwd=2)
axis(side=4,at=seq(0,1.5, by=0.2), labels=seq(0,1.5, by=0.2))
title("Binomial distribution for n=50 and p=0.2")
mtext("Cumulative Probability", side = 4, las=3, line=3 )
mtext("Probability", side = 2, las=3, line=3)
polygon(c(1:9, 9,1), c(cumulativePx[1:9], Px[1],Px[1]), col=adjustcolor( "green", alpha.f = 0.2))
text(7, 0.20, "P(x<=9)")
polygon(c(9:n, n,9), c(cumulativePx[9:n], Px[n],Px[1]), col=adjustcolor( "brown", alpha.f = 0.2))
text(18, 0.20, "P(x>9)")
### New Lines of code
abline(v=mu,lty = "dashed")
text(mu+2, 1.1, "mean")
abline(v=sd1.pos, lty= "dashed", col="gray")
abline(v=sd1.neg, lty= "dashed", col="gray")
text(sd1.pos+2, 1.2, "+1 sd")
text(sd1.neg-2, 1.2, "-1 sd")
Part I Assignment
1- Add your name and date on the above chart using text
function and save it as an image on your one-drive. Submit it with part II of lab before March 3rd in MLS (5 Marks).
Part II
In this part of the lab, we will cover sampling distributions and confidence intervals. To learn about these we are going to explore a new dataset which was painstakingly collected on a beach after a large storm. Our objective here is to try to estimate the average size of driftwood on the beach. Driftwood on beaches in coastal British Columbia is very common, and most are escaped logs from logging booms - while the odd one naturally uprooted shows up as well. Tracking how driftwood accumulates in different sections of beach is a fascinating combination of where logging takes place, how logs are transported, ocean currents and storm activity, etc. There is in fact a whole job category devoted to capturing escaped logs and selling them back to the logging companies - an occupation made famous by the epic Canadian 1980s television show - The Beachcombers - set in Gibson’s B.C. See this clip (https://www.youtube.com/watch?v=erd4vN40jYw)
We set out to record the length of a sample of beach on the morning after a large storm. Our intended goal was to record the length of every notable log in the section of beach being sampled, but data collection had to be cut early due to an unruly field assistant. In all we collected the following 8 measurements:
from which we can obtain a sample mean and sample standard deviation; as well as the summary command to see more descriptive statistics;
## [1] 622
## [1] 411.5538
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 153.0 271.8 649.5 622.0 788.8 1200.0
which shows that there is a lot of variability in the lengths. We also have a tiny sample size which is not great. Lets look at the frequency distribution of our sample:
which shows the data is not normally distributed. However - we know we can use the sampling distribution of the mean to calculate the standard error of the mean even for non-normal data. Generally we want a larger sample size (n >= 30
), but we will use what we have to illustrate. Recall that for the sampling distribution of \(\bar{x}\) the mean of sample means \(μ_{\overline{x}}\) is equal to \(μ\) and that \(σ_{\overline{x}}\) is equal to \(\frac{\sigma}{\sqrt{n}}\) Now, because we do not know what \(σ\) is - we have to use our sample estimate of it, which is \(s\) which we calculate using the sd function -
## [1] 411.5538
so we could calculate \(\frac{\sigma}{\sqrt{n}}\) using
## [1] 145.5063
Sampling Distribution of the Mean
The sampling distribution for our mean is estimated as coming from a normal distribution with \(μ = μ_{\overline{x}}\) and \(σ_{\overline{x}}=\frac{\sigma}{\sqrt{n}}\). Let’s pretend we knew the actual values of \(μ\) and \(σ\) and use the following two values for the population, a \(μ\) of 860 and a \(σ\) of 270.
The z-score
of the sample mean is:
\[ z = \frac{\bar{x}-\mu}{\sigma_{\bar{x}}} \]
so the question is how unusual is the mean we observed in our small sample? We can use the normal distribution as we have previously using the pnorm
function. Let’s calculate the z-score
:
## [1] -2.493206
right away we should notice a couple of things with this z-score;
The z-score is below
0
and therefore \(\bar{x} < \mu\)The z-score is below -2 so this is a pretty unlikely outcome
The probability of such a z-score or lower is
## [1] 0.006329769
Now what would have happened if everything else was the same, but our sample size was only n = 2? Then we would be looking at:
## [1] 0.1062715
now this seems much more likely even though the sample mean stayed the same - this is because a sample statistic (or estimate of \(μ\) using \(\bar{x}\)) is highly uncertain if we only measured 2 logs.
Confidence Intervals
A better way to capture the uncertainty in our sample statistic is with a confidence interval. Instead of calculating a z-score and finding a probability we invert that process - so we set the probability - which is our confidence level, then we find the associated value of z-score for this probability, then we find the interval of values around \(\bar{x}\).
A generic formula for a confidence interval is:
point estimate +/- critical value * standard error
Now let’s go through the steps above:
Set the probability. There are common levels which we use such as 95% or 99% or 90%. We will use a 95% confidence level.
Now we need to find a z-score associated with 95% probability. We do this by taking the inverse so \(1-0.95\). We could then use
qnorm
to find the z-score for 0.05. But this would be wrong becauseqnorm
gives us one-tailed probabilities and we really need two-tailed (i.e., 0.05 split into the upper and lower tails), so we would useqnorm(0.05/2)
which would give us a lower z-score associated with .025 probability. This is what we use as the critical value of the confidence interval.Now to find the interval in
R
we just need the standard error - which we already know is \(\frac{\sigma}{\sqrt{n}}\)
So it would look something like:
z_crit = qnorm(0.05/2, lower.tail = FALSE)
lower = mean(x) - (z_crit * (270/sqrt(8)))
upper = mean(x) + (z_crit * (270/sqrt(8)))
plot(x=c(lower, upper), y=c(1, 1), type="l", xlim = c(min(x),max(x)), xlab="Driftwood Length (cm)", ylab="", col = "Black")
points(x = mean(x), y = 1, pch=20, cex=3)
#we could add another to it if we wanted
z_crit = qnorm(0.10/2, lower.tail = FALSE)
lower = mean(x) - (z_crit * (270/sqrt(8)))
upper = mean(x) + (z_crit * (270/sqrt(8)))
lines(x = c(lower, upper), y=c(1.2, 1.2), col="red")
points(x = mean(x), y = 1.2, pch=20, cex=3)
So there is a lot of extra commands here you can ignore. The key thing being that what do we notice about the second (red) interval we created? It is narrower. This is because we set the probability value to be lower (i.e., we have lower confidence) - which basically means - we can be less sure (only 90% instead of 95%) about a narrower interval - for the same data.
The above procedure for a confidence interval works for when we have a normally distributed test statistic, that is why we use a z-score as the critical value. In cases where we have a small sample size we can actually use another distribution - the t distribution
- as the critical value. In this way we can capture the greater uncertainty associated with very small sample sizes. We will learn about the t distribution
in future. But if you want to learn about it now, you can use the qt
function to get critical values from the t-distribution
. Take a look at the help to see how to set the degrees of freedom.
Part II Assignment
Because we have three weeks to complete, we are doing things a little differently. It is a challenge to create datasets that are of interest to all students so instead for this lab we are going to get you to find your own. This can be as simple or as complex as you want to make it. The key point is that you collect/create the dataset yourself (i.e., this is not taken from an example elsewhere).
Try to get a sample size of at least 30. If collecting this dataset requires you going outside and measuring something - even better. Or this can be something related to sports, music, whatever interests you. Use Piazza to ask questions about suitable datasets if you are not sure. You could for example check the temperature at environment canada on this day for the past 30 years. Or you could check with goals against average for Toronto Maple Leafs goalies for the past 30 years. It is up to you - but the dataset needs to be unique to you and you will have to explain where it comes from in your write up.
Write a paragraph describing your dataset. Include how you acquired it, why you selected it, what the variable of interest is, whether there is any measurement error, sources of sampling error, bias, etc. Also include a histogram and a table with basic summmary statistics describing your dataset. The histogram and table should have descriptive captions. Include your dataset with your R commands. (out of 10)
Create 90%, 95% and 99% confidence intervals for your data (using a sample mean). Plot these on graph. Include a sentence describing what these mean and commands used to generate the answer. (out of 10)
Conduct your own analysis of the dataset. This could include calculating z-scores to find unusual values or groups of unusual values and any plots you want to explore. Write a paragrph commenting on the results and interpreting your findings. (out of 10)
Hand in
Please submit your answers on MLS under Assignment 2. Your final report should be in pdf
format. Also Please make sure to include clean codes and their results. The document formatting of your assignment has 5 marks.
Credit
This lab material for part II is adopted from GESC 258- Labs originally developed by Dr. Colin Robertson.