GESC 258- Geographical Research Methods

Access online version of this document at (https://rpubs.com/majidhojati/gesc258lab3)

This lab has two parts. Your Lab demonstrator will teach you the first part on (Feb 4th) and the second part on (Feb 10th). This week we review a discrete distribution in the first part.

Part I: n-year floods as binomial distribution

Binomial distribution is a distribution that applies to the experiments that have two outcomes (success or failure). Tossing a coin more than once is an example of binomial distribution. There are many real-life examples of binomial distribution. Such as number of side effects from medications, or number of spam emails/calls that you might receive per day. A good example of binomial distribution in geography is 100-year flood. A 100-year flood is a flood event that has a 1 in 100 chance (0.01 probability) of being equaled or exceeded in any given year. Similarly we have 50-year flood which is a flood event that has a 1 in 50 chance (0.2 probability) of occurring. Read more about this concept here :http://bcn.boulder.co.us/basin/watershed/flood.html. The Binomial distribution can be used to calculate the probability of experiencing a certain number of 50-year floods in a specified number of years. In this lab we are going to calculate the probability of having x number 50-year flood in 50 years period.

So to get started we will first define our binomial model. We know the probability of a 50-year flood happening each year is 0.2 so this is probability of success or \(p=0.2\). What is probability of failure? We can calculate it using this equation:

\[q = 1- p\] As you recall, a binomial distribution has two parameters, n and p. As we mentioned in the first paragraph we are going to calculate probability of x number of 50-year floods in a 50 years period, so \(n=50\). We can now easily define our binomial model as \(X~B(50,0.2)\). Lets start calculations in the R. First we need to define our variables. We can define them as follows:

please pay attention on the comments in each line of the codes

n <- 50 #number of binomial tests 
p <- 0.2 # probability of success

If you want to find out what is the probability of having 3 50-year flood in 50 years you need to find \(P(x=3)\). In R you can use the following function:

px_3 <- dbinom(3,n,p) # n  and p are already defined in the pervious part. We are storing output of dbinom as px_3
px_3
## [1] 0.004370946

Now we want to calculate all the probability values for all the possible values of X. Recall that the domain of binomial distribution is from 0 to n. So we need to run dbinom(x,n,p) 50 times and change x parameter every time. Another faster way to do it is to define a vector of numbers from 0 to 50 and pass it as x to dbinom(x,n,p). lets do it in R:

x <- 1:n # We define a vector of numbers starting from 1 and ending with n=50
x
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

so if you pass variable x to the dbinom function, it will calculate binomial probability for all the possible values of X.

Px <- dbinom(x,n,p)
Px
##  [1] 1.784060e-04 1.092737e-03 4.370946e-03 1.283965e-02 2.953120e-02
##  [6] 5.537101e-02 8.701158e-02 1.169218e-01 1.364088e-01 1.398190e-01
## [11] 1.271082e-01 1.032754e-01 7.547049e-02 4.986443e-02 2.991866e-02
## [16] 1.636177e-02 8.180883e-03 3.749571e-03 1.578767e-03 6.117722e-04
## [21] 2.184901e-04 7.200240e-05 2.191378e-05 6.163249e-06 1.602445e-06
## [26] 3.852031e-07 8.560068e-08 1.757871e-08 3.333894e-09 5.834314e-10
## [31] 9.410184e-11 1.396824e-11 1.904760e-12 2.380950e-13 2.721086e-14
## [36] 2.834465e-15 2.681250e-16 2.293175e-17 1.763980e-18 1.212737e-19
## [41] 7.394735e-21 3.961465e-22 1.842542e-23 7.328292e-25 2.442764e-26
## [46] 6.637946e-28 1.412329e-29 2.206764e-31 2.251800e-33 1.125900e-35

In this output you see an e- at the end of each number (e.g. 1.784060e-04). It is called scientific notation, You can read more about it here: https://en.wikipedia.org/wiki/Scientific_notation

In the above output, the first value (Px[1]) is probability of having x=1,one 50-year flood in n=50 years. we can say \(P(x=1)=0.00017\). To find max value of probabilities we can use max(Px) which is

max(Px)
## [1] 0.139819

lets start plotting. Here we first set a few parameters for our plot and make a new blank plot using plot.new() function. using plot.window() function we set our plot’s x and y ranges. As you see we set x axis’ limit using xlim parameter and set y axis’ limit using ylim parameter. both xlim and ylim parameters should be in form of a range. it means they should have min and max of our values. In our x axis we want to plot x value (vector of numbers between 1 and n=50) and as a result xlim=c(1,50) and for the y axis we want to plot probability value range from 0 to max(Px) where Px is out calculated values so ylim=c(0, max(Px)+0.2). We add 0.2 to max(Px) to add an extra gap on top of our chart. In the next line of code we use barplot to plot values of Px on the chart. First parameter in barplot function is height of bars. It should be a vector of numbers. We already have the heights of the bars in Px. So we just need to pass it to our function. Pay attention to col parameter which is a color and you can change it to whatever color you want. space parameter is the space between bars in barplot.

You can type ?c in R console to find out what does c(1,50) mean. Try to change parameters in the codes to learn how they impact your plot

Read comments of following section of code:

par(mar=c(5,5,5,5), las=1) # we set a margin of 5 pixels around our charts to have enough room for labels 
plot.new() # create a new plot
plot.window(xlim=c(1,50), ylim=c(0, max(Px)+0.2)) + # set our plot's ylim and xlim
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange") # draw bar plot. col can be name of your colors like orange, blue, red

## numeric(0)

Now we want to add axis to our plot. we use axis function to do it. side parameter in this function determines the side of the axis. 1 is bottom, 2 is left, 3 is top and 4 is right side. at parameter tells to the R to add a tick to the presented coordinates. at=seq(0.5,n, by=1) means that add a tick every 1 unit, starting from 0.5 and end at n=50. labels parameter tells R to add a label on defined ticks.

par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")

### New Lines of code
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1)) # set ticks at defined positions for x axis

now we want to draw a box around our plot. we use box() function.

par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1))

### New Lines of code
box()

Now Lets imagine we want to find probability of having less than or equal 9 event P(x<=9). If we want to show P(x<=9) on the chart using a line and arrow we use two functions called lines and arrows. lines functions gets a vector of x values and a vector of y values to draw a line based on the (x,y) values. lwd is thickness of the line and lty is type of line. arrows function is similar but it will get x0 and y0 as start point of the arrow and x1 and y1 as end point of arrow. use ?lines or ?arrows command to find out more about this function.

Also we can use text function to add a text on (x,y) location on the chart.

par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1))
box()

### New Lines of code
lines(x=c(9,9),y=c(0,0.3), col="red",lwd=2,lty="dashed")
arrows(x0=9,y0=0.3,x1=0,y1=0.3, col = "red", lwd=2, lty="dashed")
text(x=4.2,y=0.32,"P(x<=9)")

To calculate P(x<=9) we have Several methods. One method is to calculate P(x=1)+ P(x=2) + ... + P(x=9). Lets try this method.

dbinom(1,n,p) + dbinom(2,n,p) + dbinom(3,n,p) + dbinom(4,n,p) + dbinom(5,n,p) + dbinom(6,n,p) + dbinom(7,n,p)+ dbinom(8,n,p) +dbinom(9,n,p)
## [1] 0.4437261

you can also use this function to get the same result. sum(dbinom(1:9,n,p) )

Another more advanced method is to calculate cumulative probability values using pbinom function. Type ?pbinom in R console to learn what are its parameters. Lets look at its output when we run it with x from 1 to 50 and n=50 and p=0.2

 cumulativePx <- pbinom(x,n,p) 
 cumulativePx
##  [1] 0.0001926784 0.0012854150 0.0056563610 0.0184960151 0.0480272194
##  [6] 0.1033982275 0.1904098116 0.3073316278 0.4437404133 0.5835594185
## [11] 0.7106676050 0.8139430065 0.8894134923 0.9392779204 0.9691965772
## [16] 0.9855583427 0.9937392254 0.9974887967 0.9990675635 0.9996793357
## [21] 0.9998978257 0.9999698281 0.9999917419 0.9999979051 0.9999995076
## [26] 0.9999998928 0.9999999784 0.9999999960 0.9999999993 0.9999999999
## [31] 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000
## [36] 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000
## [41] 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000
## [46] 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000

As you see this function first calculates P(x=1) then calculates P(x=1) + P(x=2) and continues until reaches to P(x=1) + P(x=2) + ,,, P(x=n). Lets plot it on the chart. Notice that here we want to add another y axis on the right side of the plot. So we use another plot.window function and from now all of the future parameters will use this axis for their y parameter.

par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1))
box()
lines(x=c(9,9),y=c(0,0.3), col="red",lwd=2,lty="dashed")
arrows(x0=9,y0=0.3,x1=0,y1=0.3, col = "red", lwd=2, lty="dashed")
text(x=4.2,y=0.32,"P(x<=9)")

### New Lines of code
cumulativePx <- pbinom(x,n,p) 
plot.window(xlim=c(0,n), ylim=c(0,1.5)) # we define a new plot window and set xlim and ylim, xlim is same as our previous plot but ylim is from 0 to 1.5
lines(x, cumulativePx, col="red", lwd=2) #draw a line, with color=  red, width=2 and x coordinates of 0 to 50 and y coordinates of cumPX
axis(side=4,at=seq(0,1.5, by=0.2), labels=seq(0,1.5, by=0.2))  # same as before we add an axis on the right side (side=4) and add ticks from 0 to 1.5 with 0.2 intervals, and label them

Lets add title and labels to our plot

par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1)) 
box()
lines(x=c(9,9),y=c(0,0.3), col="red",lwd=2,lty="dashed")
arrows(x0=9,y0=0.3,x1=0,y1=0.3, col = "red", lwd=2, lty="dashed")
text(x=4.2,y=0.32,"P(x<=9)")
cumulativePx <- pbinom(x,n,p) 
plot.window(xlim=c(0,n), ylim=c(0,1.5))
lines(x, cumulativePx, col="red", lwd=2) 
axis(side=4,at=seq(0,1.5, by=0.2), labels=seq(0,1.5, by=0.2)) 

### New Lines of code
title("Binomial distribution for n=50 and p=0.2")
mtext("Cumulative Probability", side = 4, las=3, line=3 ) # you can change values of las and line to see their effect on your axis's title
mtext("Probability", side = 2, las=3, line=3)

Now we want to draw a polygon on the chart to highlight areas under chart that show P(x<=9) and P(x>9)

par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1)) 
box()
lines(x=c(9,9),y=c(0,0.3), col="red",lwd=2,lty="dashed")
arrows(x0=9,y0=0.3,x1=0,y1=0.3, col = "red", lwd=2, lty="dashed")
text(x=4.2,y=0.32,"P(x<=9)")
cumulativePx <- pbinom(x,n,p) 
plot.window(xlim=c(0,n), ylim=c(0,1.5))
lines(x, cumulativePx, col="red", lwd=2) 
axis(side=4,at=seq(0,1.5, by=0.2), labels=seq(0,1.5, by=0.2)) 
title("Binomial distribution for n=50 and p=0.2")
mtext("Cumulative Probability", side = 4, las=3, line=3 ) # you can change values of las and line to see their effect on your axis's title
mtext("Probability", side = 2, las=3, line=3)
### New Lines of code

polygon(c(1:9, 9,1), c(cumulativePx[1:9], cumulativePx[1],cumulativePx[1]), col=adjustcolor( "green", alpha.f = 0.2))
text(7, 0.20, "P(x<=9)")

polygon(c(9:n, n,9), c(cumulativePx[9:n], Px[n],cumulativePx[1]), col=adjustcolor( "brown", alpha.f = 0.2))
text(18, 0.20, "P(x>9)")

Lets calculate mean and standard deviation of our binomial model.

mu <- n*p # mean of a binomial distribution is equal to n * p 
sigma <- sqrt(n*p*(1-p)) # stdev of a binomial distribution is equal to square root of n*p*(1-p)
sd1.pos <- mu + sigma # lets calculate 1 stdev distance from mean in the positive side
sd1.neg <- mu - sigma # lets calculate 1 stdev distance from mean in the negative side

And finally lets plot mu and stdev values on our plot. We use a function named abline which draws a horizontal or vertical line on the chart. v parameter is value of x that it draws a vertical line on it.

par(mar=c(5,5,5,5), las=1)
plot.new()
plot.window(xlim=range(x), ylim=c(0, max(Px)+0.2))
barplot(Px,space=0, ylim=c(0, max(Px)+0.2),col="orange")
axis(side=1, at=seq(0.5,n, by=1), labels=seq(1,n, by=1)) 
box()
lines(x=c(9,9),y=c(0,0.3), col="red",lwd=2,lty="dashed")
arrows(x0=9,y0=0.3,x1=0,y1=0.3, col = "red", lwd=2, lty="dashed")
text(x=4.2,y=0.32,"P(x<=9)")
cumulativePx <- pbinom(x,n,p) 
plot.window(xlim=c(0,n), ylim=c(0,1.5))
lines(x, cumulativePx, col="red", lwd=2) 
axis(side=4,at=seq(0,1.5, by=0.2), labels=seq(0,1.5, by=0.2)) 
title("Binomial distribution for n=50 and p=0.2")
mtext("Cumulative Probability", side = 4, las=3, line=3 )
mtext("Probability", side = 2, las=3, line=3)
polygon(c(1:9, 9,1), c(cumulativePx[1:9], Px[1],Px[1]), col=adjustcolor( "green", alpha.f = 0.2))
text(7, 0.20, "P(x<=9)")
polygon(c(9:n, n,9), c(cumulativePx[9:n], Px[n],Px[1]), col=adjustcolor( "brown", alpha.f = 0.2))
text(18, 0.20, "P(x>9)")

### New Lines of code
abline(v=mu,lty = "dashed")
text(mu+2, 1.1, "mean")
abline(v=sd1.pos, lty= "dashed", col="gray")
abline(v=sd1.neg, lty= "dashed", col="gray")
text(sd1.pos+2, 1.2, "+1 sd")
text(sd1.neg-2, 1.2, "-1 sd")

Part I Assignment

1- Add your name and date on the above chart using text function and save it as an image on your one-drive. Save this image and submit it with your part II report (5 Marks).

Part II - Poisson Distribution

Our lab this week will examine a specific process of rock weathering in sandstone found in coastal rocky areas in coastal British Columbia. This weathering process produces characteristic ‘honeycomb’ patterns in sandstone which can easily be found in coastal sandstone around the world. These features are called by several names, such as honeycomb weathering, cavernous weathering, and tafoni(https://en.wikipedia.org/wiki/Tafoni). In coastal salt-rich environments such as where we will be exploring, we will investigate how this unique pattern of rock weathering forms and how we can characterize their patterns using probability distributions. But most importantly let’s take a look at what we’re talking about:

Figure 2.1: Honeycomb weathering on the coast of Gabriola Island, British Columbia. Image credit: Colin Robertson

These rock features are pretty cool to see and very distinctive. Here are a few more examples taken from the same site: all photos were taken at approximately the same vertical distance so the scale is approximately equal across these images.

Figure 2.2: More examples of honeycomb weathering on the coast of Gabriola Island, British Columbia. Image credit: Colin Robertson

How does honeycomb patterns in sandstone form?

There are a variety of mechanisms that produce these regular weathering patterns in sandstone, but in salt-rich coastal environments such as that here, the primary process is a result of what is called salt crystallization. In short, rockfaces exposed to sunlight have salt-rich water drawn up to the surface where the salt crystallizes as the water evaporates. The salt crystallization exerts pressure on the surround rock to create tiny factures which become the pits we recognize in the pattern above. As a result of this process, we therefore expect to see these interesting features where sandstone has direct sunlight exposure, such as on south-facing sections of coastline.

We can investigate the density of the weathering process. We will take one image as our study dataset, then count the number of pits in different sections of the image. Here we are taking one small section of coastline and enumerating the density of weathering pits for the entire small section. This is a sample - and our population is the sandstone surfaces undergoing this process in general. Think about what potential issues there may be in making inferences from the data to the population in this case.

Study Image

The image below we will use as our dataset.We have placed a regular 6 x 6 grid over the image which will serve as our sampling units. Then we will count the number of pits in each cell and generate a dataset of counts. Since this will be a dataset of counts, we will consider this a discrete random variable.

Figure 2.3: Study image dataset featuring honeycomb weathering on the coast of Gabriola Island, British Columbia. Image credit: Colin Robertson

In order to count the number of pits per cell, we have to make some decisions; namely

  1. what counts as a pit and
  2. how do we associate pits with cells since in many cases the hole crosses cell boundaries.

This is the sort of research decision we have to make carefully, and then later consider how it impacts our analysis. We will use the following criteria. A pit will be defined based on having visible walls (i.e., we can discern the edges/sides of the pit) and further, we will link it with a cell based on the centroid of the pit. Since this step requires some interpretation which may vary - I have done this for you digitizing our dataset of pit centroids, as below:

Figure 2.4: Study image with pits digitized. Image credit: Colin Robertson

Your first job is to create a dataset by counting the number of pits in each cell. I have done the first row for you to give us a working dataset as an example, but when you do the assignment you will have to get the counts for all cells - you should end up with a datset with 36 observations, each of which is a count.

Figure 2.5: First row of cells with number of pits counted. Image credit: Colin Robertson

So to get started our dataset would be created in R as follows:

pits <- c(4,7,9,4,5,0)

recall why we are treating this as a discrete random variable. Based on our criteria and the nature of the dataset, we have counts - which cannot take on fractional values. This was a decision which impacts how we treat the data. An alternative way to analyze the data would be to measure the fraction of each cell covered by a pit, which we could then treat as a continuous random variable.

Analyzing Count Data

The first thing we can do if we want to estimate the parameters of the Poisson distribution with our dataset, is to calculate the sample statistic estimate of the distributions parameters. The Poisson distribution has a probability mass function as follows: \[P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}\]

and the only important thing for you to know is that there is one unknown quantity required to define it (i.e., one parameter) - which is the mean number of events per unit of observation - which in our case is the cells which we counted pits in. The sample estimate of \(λ\) is therefore simply:

mean(pits)
## [1] 4.833333

so we know that is the average number of pits, but we are not sure what the overall distribution would look like. Again we can simulate some new data using our estimate of \(λ\) that we calculated from the data to see

simpits = rpois(n=1000, lambda = mean(pits))
hist(simpits, xlab = "Simulated Pit Count Values", main="")

which again might vary a bit for you, but should be a distribution centered on around 5 and tailing off around 12 or 14. Lets look at this and notice a few details;

  • the lowest value is zero
  • the higher values quickly trail off
  • the distribution shape is almost normal

the Poisson is useful for what we call rare events because of these properties of being bounded by zero on the lower end. If you have count data that has very high counts, the difference between the Poisson and the Normal is negligible and you can just use a Normal distribution (this is called a Normal approximation). We will have a look at normal distribution in Part II of this lab. Lets look at the Poisson functions.

The function dpois gives the probability of x number of events observed given the expected number specified by λ . Whereas the function ppois gives a cumulative probability, we have to specify the upper or lower tail (i.e., equal or less than vs. equal or greater than). Say we wanted to calculate the probability of observing 7 pits in a cell given our mean number per cell of 4.8333333 - we could of course work through the formula on a calculator or using R math functions - and it would look like this:

#compare this to the probability function noted above
exp(-mean(pits))*mean(pits)^7/factorial(7)
## [1] 0.09732103

but it is much easier to use the built in function dpois to get the probability:

dpois(x = 7, lambda = mean(pits))
## [1] 0.09732103

we get the same answer but using dpois is a lot cleaner and easier. We can also plot the probability for each of the observed counts in our dataset. We just have to supply the whole vector of counts to the same dpois function

plot(pits, dpois(x = pits, lambda = mean(pits)), pch=20, cex=2, xlab = "Pit Count", ylab = "Probability")

So now we have the ability to calculate probabilities from the Poisson distribution in R. How do we answer meaningful questions with these tools? While the process of salt crystallization is the dominant one creating the weathering pits, we may want to know how pits form. For example we could ask how many can form before adjacent pit edges dissolve into one larger pit. What do you think the maximum number of pits per unit area is? Where we see large pits form, are these the gradual growth of single pits or the amalgamation of several into one larger one? The other thing to think about is where we have large pits, the count in the cell is likely going to be lower.

Part II Assignment

  1. Calculate the probability of observing a cell with 15 pits based on the sample intensity (λ computed from all cells in the image). Write a sentence interpreting what this means. Include commands used to generate the answer. (out of 2)

  2. What is the probability of observing a cell with between 3 and 5 pits (λ computed from all cells in the image)? Write a sentence interpreting what this means. Include commands used to generate the answer. (out of 3)

  3. There are two important assumptions to using the Poisson distribution we should consider.

  • The probability that an event will occur within a given unit must be the same for all units (i.e. the underlying process governing the phenomenon must be invariant)

  • The number of events occurring per unit must be independent of the number of events occurring in other units (no interactions/dependencies).

Write short paragraph (200-300 words) explaining why or why not these assumptions are met in this analysis of the weathering pits dataset. (out of 5)

  1. Explore the assumption of independence by calculating the probability of each observed count and noting where on the image any counts with a probability less than 0.10 occur. Comment on whether these are distributed randomly over the image or clumped in specific parts of the image and what this means for the independence assumption. Include an image showing which cells have an unusual (i.e., p < 0.10) number of pits. (out of 5)

Hint: to answer 4 you can take a screenshot from the lab and use MSPaint (Windows), Photos (Mac) or another graphics program to identify which cells have unusual counts.

Hand in

Please submit your answers on MLS under Assignment 1. Your final report should be in pdf format. Also Please make sure to include clean codes and their results. The document formatting of your assignment has 5 marks.

Extra information about formatting your reports

Credit

This lab material for part II is adopted from GESC 258- Labs originally developed by Dr. Colin Robertson.