GESC 258- Geographical Research Methods
Sampling Methods
In this week’s lab we are going to learn how different sampling methods work when we have spatial data. Sampling is the process of selecting a part of a population to estimate the characteristics of the whole population. The characteristics could be the total or mean value of an attribute regarding the population.
In spatial sampling, we collect samples in a two-dimensional framework. The sampling scheme is designed to maximize the probability of including the spatial variation of the study population.
There are different methods to collect a sample, in this lab we will focus and compare results between three sampling methods:
Simple random sampling - randomly selecting sample points in a study region, where each location has an equal opportunity to be sampled.
Systematic sampling - Systematic sampling is a type of probability sampling method in which sample members from a larger population are selected according to a random starting point but with a fixed, periodic interval.
Stratified sampling - the study region is divided into groups a.k.a. strata by a collective characteristic of the study region (elevation, land use, etc.). Then for each strata, a spatial random sample is collected and combined to one sample.
We will be using Above ground biomass data this week.
Above ground biomass (AGB)
is defined as “the aboveground
standing dry mass of live or dead matter from tree or shrub (woody) life
forms, expressed as a mass per unit area”. We will apply different
sampling methods to
estimate above ground biomass in vancouver island (study area)
and compare the computed sample mean with the actual mean of the
population. At the end of this lab we will be able to see how sample
size and sampling design can impact the results.
Important! to complete the analysis it is essential to have the population mean and standard deviation. We will discuss that in the
Population
section
Installing necessary packages
When working with R projects we usually need to use some extra
packages. Packages in R
are a set of tools that add extra
functions to R. To use packages we first need to install them. We can
install a package using install.packages
function. So to
get started open your R studio, and run the following codes:
install.packages('raster')
install.packages("rgdal")
install.packages("sf")
install.packages("sp")
install.packages("leaflet")
NOTE: Installing packages is a one-time process. So once you have installed the necessary packages you don’t need to install them again.
Once you install the necessary packages, you need to import them into
your current R environment to be able to use them. To import them we use
library
function. Use following lines of the code to import
installed packages into your environment.
library(raster)
library(leaflet)
library(sp)
library(sf)
NOTE: Everytime you want to use R you need to import necessary package into the environment using
library
function.
Population
So in statistics, we usually don’t have access to the population. This is why we use sampling methods. However, in this lab we have a map layer that includes AGB data. We assume this map layer is our population. So if we calculate the mean value of this layer’s data we will have our population’s mean or \(\mu\). So if we select a set of data from this AGB map layer we can assume it as a sample collected from our population. You might ask, how we get a hold of this data. These data are usually generated from satellite imagery data or by other modeling approached. This particular dataset is downloaded from (https://open.canada.ca/data/en/dataset/ec9e2659-1c29-4ddb-87a2-6aced147a990) and you can read more about it in the provided link. These data are from 2001 to 2011 and are available from entire Canada. I have already clipped these data for the study area for you.
We will use this data to measure population’s mean \(\mu\) and population’s standard deviations \(\sigma\). We also select our samples from this layer and compare our calculated mean from sample with our real \(\mu\) value. So lets have a look at the population data:
NOTE: These data are in Raster format. You have already learned about
Vector data
in previous courses (e.g GG251). Here we learn about a new data model which is very common in GIS. Raster data are simple images that have geographic coordinates. so each pixel/cell in that image can store a value (our multiple values) and represents a geographical area. Rasters are one of the main data models in GIS. You can learn more about rasters in the following links, please watch the video and have a look at the website to learn about them before continuing to the next section https://www.youtube.com/watch?v=VeI4n7duavI https://vector.geospatial.science/textbook/section-three-raster-data
We will use a R package called
leaflet
to generate interactive maps in R. Learning leaflet is not part of this lab and you just need to copy and paste the codes to be able to visualize maps. However, it gives you some idea how interactive maps work in R. You can googleR leaflet
to learn more about interactive maps in R.
Now that you have some idea about raster data type, run the following codes in your R script. Read the comments to learn about each line of the code.
<- raster("https://www.dropbox.com/s/zvzqb4a0qvhulye/bc_NFI_MODIS250m_2011_kNN_Structure_Biomass_TotalLiveAboveGround_v1_victoria_4326.tif?dl=1") # we read our raster as a new R object
biomass
## Lets show it on the map
<- colorNumeric(c("#0C2C84", "#41B6C4", "#FFFFCC"), values(biomass),na.color = "transparent") # in this like we tell R how to color each pixel on the map.
biomass_pallet # In the next section we generate an interactive map using leaflet() function. then we use addRasterImage to add our raster image to the map. We use addLegend to add a legend to the map.
<- leaflet() %>% addTiles() %>%
m addRasterImage(biomass, colors = biomass_pallet, opacity = 0.8) %>%
addLegend(pal = biomass_pallet, values = values(biomass),title = "Biomass Total\n Live Above Ground")
m
In the above code, I use raster
function to read data
from a url
and save it as an R object called
biomass
. Notice .tif
section at the end of the
URL. .tif
is a common file extension for the raster files.
As we said earlier, in the raster images each pixel has a value, in our
case the value of these each pixel is AGB data and its unit is tone per
hectare. So if a pixel in this map layer has a value of 100, it means
the area that is covered by that pixel has 100 tone per each hectare.
Zoom in the interactive map to see the actual pixel data.
Zoom in to the map to see each individual pixel and its spatial coverage.
You can access AGB values using the following functions:
<- values(biomass)[!is.na(values(biomass))] # we remove NA data. Recall what are NA data from the first lab
p_data head(p_data) # Surprise, it is just a set of numbers, guess what, We have a distribution here
x |
---|
12.3500 |
207.4633 |
205.9867 |
207.4633 |
209.9700 |
209.9700 |
In the above lines, I have removed missing data from the population
using !is.na
function. Now I am able to plot a histogram of
my data.
hist(p_data,xlab="Above Ground Biomass t/ha",main="Histogram of AGB")
We can calculate mean and stdev of our population’s data:
<- mean(p_data)
p_mean <- sqrt(sum((p_data - p_mean)^2) / (length(p_data)))
p_sd p_mean
## [1] 229.6976
p_sd
## [1] 120.597
Now that we have our population mean \(\mu\) and stdev \(\sigma\) we can start sampling from population.
Simple Random Sampling
Simple random sampling is one of the easiest methods of sampling, We just need to determine the sample size and select population members randomly. In this example unit of analysis is each pixel in the raster file. We define a sample size of 10. If you want to chose other sample sizes you need to change this value.
<- 10
sample_size <- sampleRandom(biomass, size=sample_size, cells=TRUE, sp=TRUE)
sample_1
::kable(head(sample_1@data), caption = 'Selected Cells') knitr
cell | bc_NFI_MODIS250m_2011_kNN_Structure_Biomass_TotalLiveAboveGround_v1_victoria_4326.tif.dl.1 |
---|---|
773918 | 173.9333 |
1081830 | 244.5233 |
514502 | 328.2700 |
429283 | 339.6467 |
980421 | 0.0000 |
422431 | 137.8400 |
The output of the sampleRandom
function has several
members. If you run sample_1@data
it will give you the
selected pixel number and its associated value. We later use this to
calculate mean and sd of our sample. If you call
sample_1@coords
it gives you x
and
y
coordinate of the selected pixels (cells). We use these
data to generate a map. So
<- m %>% leaflet::addCircles(sample_1@coords[,1],sample_1@coords[,2],popup = paste("Pixel Value:",sample_1$bc_NFI_MODIS250m_2011_kNN_Structure_Biomass_TotalLiveAboveGround_v1_victoria_4326.tif.dl.1),color="red")
map_sample_1 map_sample_1
You can click on each red point and see the value of the pixel under that point. Now that we have our sample, we are able to calculate sample mean \(\bar{x}\) and \(s\).
Recall that sample standard deviation is computed using: \(\sqrt{\frac{\sum\limits_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\)
<- mean(sample_1$bc_NFI_MODIS250m_2011_kNN_Structure_Biomass_TotalLiveAboveGround_v1_victoria_4326.tif.dl.1)
sample_1_mean <- sqrt(sum((sample_1$bc_NFI_MODIS250m_2011_kNN_Structure_Biomass_TotalLiveAboveGround_v1_victoria_4326.tif.dl.1 - sample_1_mean)^2) / (sample_size - 1))
sample_1_sd
sample_1_mean
## [1] 174.808
NOTE: You might get different values as sample mean, which is fine, since we are selecting a random sample from the population and every sample has different mean
Now that we have our \(\bar{x}\) and \(s\) we can find a confidence interval for 95% of confidence level. Recall our confidence level equation from previous labs:
point estimate +/- critical value * standard error
Follow these steps to find confidence intervals
Set the probability. We will use a 95% confidence level.
Now we need to find a z-score associated with 95% probability. We do this by taking the inverse so \(1-0.95\). We could then use
qnorm
to find the z-score for 0.05. But this would be wrong becauseqnorm
gives us one-tailed probabilities and we really need two-tailed (i.e., 0.05 split into the upper and lower tails), so we would useqnorm(0.05/2)
which would give us a lower z-score associated with .025 probability. This is what we use as the critical value of the confidence interval.Now to find the interval in
R
we just need the standard error - which we already know is \(\frac{\sigma}{\sqrt{n}}\)
= qnorm(0.05/2, lower.tail = FALSE)
z_crit = sample_1_mean - (z_crit * (p_sd/sqrt(sample_size)))
sample_1_lower = sample_1_mean + (z_crit * (p_sd/sqrt(sample_size)))
sample_1_upper
sample_1_lower
## [1] 100.0626
sample_1_upper
## [1] 249.5534
Now that we have the confidence interval. lets plot them.
plot.new() # create a new plot
plot.window(xlim=c(0, 6), ylim=c(min(p_data), max(p_data))) # set our plot's ylim and xlim
box() # draw a box around our plot
axis(side=1, at=seq(2,4, by=1), labels=c("Simple","Systematic","Stratified")) # define bottom axis values
axis(side=2, at=seq(min(p_data),max(p_data), by=50), labels=seq(min(p_data),max(p_data), by=50)) # define left axis value
mtext("Biomass", side = 2, las=3, line=3 ) # you can change values of las and line to see their effect on your axis's title
mtext("Sampling method", side = 1, las=1, line=3) #set title of x bar
abline(h=p_mean,col="orange",lwd=2, lty="dashed") # draw a horizontal line on y=p_mean
text(x = 5.5, y = p_mean, "Exact Population's \n mean") # add a text on the horizontal line
points(x = 2, y = sample_1_mean, pch=20, cex=1) #add a point for our sample mean
arrows(2, sample_1_lower, 2, sample_1_upper, length=0.05, angle=90, code=3) # hack: we draw arrows but with very special "arrowheads" to draw an error bars
Systematic Sampling
The second sampling method we will use is called systematic sampling. To do so we first need to create a regular grid on top of the study area and then select center point of each grid cell. Once we have all of the center points, we call it our sampling frame. We then can select from sampling frame based on a systematic selection method (like every other point and so on). so lets get started.
To generate our regular grid we use a function called
st_make_grid
. The first input of this function is the
raster file that you want to draw a grid on top of it. The second
parameter is the type of output grid. It can be polygons
which generates the boundary of the grid cells as the output. You can
also set it to centers
and it will return center point of
each grid cell (we will use this functionality later). the third
parameter of this function is number of grid rows and columns. In this
case we are going to have a 20 by 20
grid.
<- st_make_grid(biomass,what = "polygons",n = c(20, 20)) #making a 20 by 20 grid on top of our study area.
grid
<- m %>% addPolygons(data = grid)
map_grid map_grid
Now lets extract the center points of the grid cells.
<- st_make_grid(biomass,what = "centers",n = c(20, 20)) # we use the same function as before, but this time we set the second parameter to centers to give us the center points
grid
<- m %>% addCircles(data = grid)
map_grid map_grid
As you see our sampling grid has some issue. For example many of our
grid cells are outside of the study area. So we need to clean those
cells. Also since this time we did not use sampleRandom
function, we need to assign raster value under each point to its
representative point.
## In the next following lines we extract raster values under each point and remove the NA values
<- st_as_sf(grid)
grid_sf <- extract(biomass,grid_sf) #Extract raster pixel values under each point
grid_points_with_values <- st_sf(value=grid_points_with_values,grid)
grid_sf <- grid_sf[!is.na(grid_sf$value),] #remove the NA values
grid_sf
# show them on the map
<- m %>% addCircles(data = grid_sf,popup = paste("Biomas Value:", grid_sf$value),color="yellow")
map_sample_2 map_sample_2
Now that we have our regular grid, without any error grid data, we
will need to select 10 items from our points. But we cannot do it
randomly. it should follow a systematic form. To do so we need to have a
systematic sample interval
.We can divide total number of
items in our sampling frame by the sample size and find our systematic
sample interval.
<- round(length(grid_sf$value)/sample_size) #total grid points divided by sample size
sample_2_interval
# select sample items based on the sample interval we just calculated
<- grid_sf$grid[1:(length(grid_sf$grid)/sample_2_interval)*sample_2_interval]
grid_sf_grid <- grid_sf$value[1:(length(grid_sf$value)/sample_2_interval)*sample_2_interval]
grid_sf_value
#show the selected samples on the map
%>% addCircles(data=grid_sf_grid,color="red") map_sample_2
now we have our sample. we will continue calculating our CI, sample mean and stdev as usual.
<- mean(grid_sf_value)
sample_2_mean <- sqrt(sum((grid_sf_value - sample_2_mean)^2) / (sample_size - 1))
sample_2_sd sample_2_mean
## [1] 242.2126
And for the CI:
= qnorm(0.05/2, lower.tail = FALSE)
z_crit = sample_2_mean - (z_crit * (p_sd/sqrt(sample_size)))
sample_2_lower = sample_2_mean + (z_crit * (p_sd/sqrt(sample_size)))
sample_2_upper sample_2_lower
## [1] 167.4672
sample_2_upper
## [1] 316.958
And we need to update our previous plot
plot.new()
plot.window(xlim=c(0, 6), ylim=c(min(p_data), max(p_data)))
box()
axis(side=1, at=seq(2,4, by=1), labels=c("Simple","Systematic","Stratified"))
axis(side=2, at=seq(min(p_data),max(p_data), by=50), labels=seq(min(p_data),max(p_data), by=50))
mtext("Biomass", side = 2, las=3, line=3 )
mtext("Sampling method", side = 1, las=1, line=3)
abline(h=p_mean,col="orange",lwd=2, lty="dashed")
text(x = 5.5, y = p_mean, "Exact Population's \n mean")
points(x = 2, y = sample_1_mean, pch=20, cex=1)
arrows(2, sample_1_lower, 2, sample_1_upper, length=0.05, angle=90, code=3)
## new part of the code
points(x = 3, y = sample_2_mean, pch=20, cex=1 , col="red")
arrows(3, sample_2_lower, 3, sample_2_upper, length=0.05, angle=90, code=3, col="red")
Stratified Random Sampling
In the stratified random sampling, sample frame split into mutually exclusive homogeneous sub-groups (i.e., strata) and then a simple random sampling will be done within these groups. To get started we first need to define our strata. In this case we will use elevation to define our strata. The elevation data are usually stored as data models called Digital Elevation Models (DEM)`. DEM files are simple raster data that value of each pixel/cell is the elevation above sea level. Lets take a look at a DEM file.
<- raster("https://www.dropbox.com/s/nrxzy6bemzg4tks/dem_lowresolution.tif?dl=1")
dem values(dem)[values(dem) <= 0] = NA # we replace the values less than and equal to zero meters with NA.
## Plot it on the map
<- colorNumeric(c("#008435", "#33cc00", "#f4f071","#f4bd45","#99642b","#ffffff"), values(dem),na.color = "transparent")
dem_pallet
<- leaflet() %>% addTiles() %>%
map_dem addRasterImage(dem, colors = dem_pallet, opacity = 0.8) %>%
addLegend(pal = dem_pallet, values = values(dem),title = "DEM (elevation m)")
## Warning in colors(.): Some values were outside the color scale and will be
## treated as NA
map_dem
Now that we have our DEM file. We will define our strata as follows:
- Strata 1: \(elevation <=900\)
- Strata 2: \(900 < elevation\)
To do so we will do following:
<- dem # duplicate our dem file as strata_1
strata_1 values(strata_1)[values(strata_1) <= 900] = 1 # replace values <= 900 with 1
values(strata_1)[values(strata_1) > 900] = NA # replace other values with NA
<- dem # duplicate our dem file as strata_2
strata_2 values(strata_2)[values(strata_2) <= 900] = NA #replace other values with NA
values(strata_2)[values(strata_2) > 900] = 2 # replace values > 900 with 2
## lets show them on the map
<- colorBin(c("#008435", "#99642b"),bins = 2, c(1,2),na.color = "transparent")
strata_bins leaflet() %>% addTiles() %>%
addRasterImage(strata_1, colors = strata_bins, opacity = 0.8)%>%
addRasterImage(strata_2, colors = strata_bins, opacity = 0.8)%>%
addLegend(pal = strata_bins, values = c(1,2),title = "Strata 1 & 2")
## Warning in colors(.): Some values were outside the color scale and will be
## treated as NA
## Warning in colors(.): Some values were outside the color scale and will be
## treated as NA
Now that we have two sub-groups we just need to perform a simple random sample on each subgroup.
# we clean some of the missing data
values(strata_1)[is.na(values(biomass))] = NA
values(strata_2)[is.na(values(biomass))] = NA
# perform sampleRandom Function on each strata
<- sampleRandom(strata_1, size=round(sample_size/2), cells=TRUE, sp=TRUE)
sample_3_st1 <- sampleRandom(strata_2, size=round(sample_size/2), cells=TRUE, sp=TRUE)
sample_3_st2
# extract values of each strata
<- values(biomass)[sample_3_st1$cell]
values_st1 <- values(biomass)[sample_3_st2$cell]
values_st2
#show them on the map
<- m %>% leaflet::addCircles(sample_3_st1@coords[,1],sample_3_st1@coords[,2],popup = paste("Pixel Value (Strata 1):",values_st1),color = "red") %>% leaflet::addCircles(sample_3_st2@coords[,1],sample_3_st2@coords[,2],popup = paste("Pixel Value (Strata 2):",values_st2),color = "orange")
map_sample_3 map_sample_3
Now we just need to calculate CI and sample mean and sample stdev:
# combine selected values from each strata into one array of numbers
<- c(values_st2,values_st1)
sample_3_values # calculate mean and stdev
<- mean(sample_3_values)
sample_3_mean <- sqrt(sum((sample_3_values - sample_3_mean)^2) / (sample_size - 1))
sample_3_sd sample_3_mean
## [1] 228.9363
= qnorm(0.05/2, lower.tail = FALSE)
z_crit = sample_3_mean - (z_crit * (p_sd/sqrt(sample_size)))
sample_3_lower = sample_3_mean + (z_crit * (p_sd/sqrt(sample_size)))
sample_3_upper sample_3_lower
## [1] 154.1909
sample_3_upper
## [1] 303.6817
And now we plot them
plot.new()
plot.window(xlim=c(0, 6), ylim=c(min(p_data), max(p_data)))
box()
axis(side=1, at=seq(2,4, by=1), labels=c("Simple","Systematic","Stratified"))
axis(side=2, at=seq(min(p_data),max(p_data), by=50), labels=seq(min(p_data),max(p_data), by=50))
mtext("Biomass", side = 2, las=3, line=3 )
mtext("Sampling method", side = 1, las=1, line=3)
abline(h=p_mean,col="orange",lwd=2, lty="dashed")
text(x = 5.5, y = p_mean, "Exact Population's \n mean")
points(x = 2, y = sample_1_mean, pch=20, cex=1)
arrows(2, sample_1_lower, 2, sample_1_upper, length=0.05, angle=90, code=3)
points(x = 3, y = sample_2_mean, pch=20, cex=1 , col="red")
arrows(3, sample_2_lower, 3, sample_2_upper, length=0.05, angle=90, code=3, col="red")
## new part of the code
points(x = 4, y = sample_3_mean, pch=20, cex=1 , col="green")
arrows(4, sample_3_lower, 4, sample_3_upper, length=0.05, angle=90, code=3, col="green")
Now you can see the impact of each sampling method on accuracy of our results. Try to explain why each sampling method is resulting different sample mean and what sampling method is more accurate. Keep in mind that our sample size in this example was 10 for each method.
Assignment
Write down the sample mean and confidence interval you calculated for a sample size of 10 per each sampling method. Repeat the above steps with two more sample sizes (e.g 15, 30) and calculate sample mean and confidence intervals for 95% confidence level. Write the results in a table and discuss them (out of 6).
Calculate
sampling error
\((Sample Mean) - (Populalation Mean)\) for each pair of sampling method (simple random, systematic and stratified) and sample size (10,15,30) (total of 9 pairs) and discuss what sampling method do you think works better in this case (out of 5).Using bellow code snippet, generate three plots (one per each sampling method). Use
sample size
forx
axis andsample mean
ony
axis. Addexact population mean
on the plot too. Explain each plot and discuss the relation between sample size and accuracy of the results (out of 10).
## Replace the following numbers with your own numbers for each sampling method
#data for sampling method 1 with sample size =10
<- 191.09
sample_1_mean <- 116.35
sample_1_lower <- 265.84
sample_1_upper #data for sampling method 1 with sample size =15
<- 148.45
sample_2_mean <- 87.42
sample_2_lower <- 209.48
sample_2_upper #data for sampling method 1 with sample size =30
<- 223.17
sample_3_mean <- 180.02
sample_3_lower <- 266.33
sample_3_upper
### Only change labels
#############################
<- c(10,15,30)
x <- c(sample_1_mean,sample_2_mean,sample_3_mean)
y <- c(sample_1_upper,sample_2_upper,sample_3_upper)
y1 <- c(sample_1_lower,sample_2_lower,sample_3_lower)
y2
<- loess(y~x)
lo
plot.new()
plot.window(xlim=c(5, 40), ylim=c(min(p_data), max(p_data))) # set our plot's ylim and xlim
box()
axis(side=1, at=seq(5,40, by=1), labels=seq(5,40, by=1))
axis(side=2, at=seq(min(p_data),max(p_data), by=50), labels=seq(min(p_data),max(p_data), by=50))
mtext("Biomass", side = 2, las=3, line=3 ) # Y axis label
mtext("Sampling Size", side = 1, las=1, line=3) # X axis label
title("Sampling Size effec for Random Sample method") # Plot Title
# CI polygon
polygon(c(x, rev(x)), c(y1, rev(y2)),col = adjustcolor(col = "black", alpha = 0.25), border = NA)
lines(x,predict(lo), col='red', lwd=2)
# exact population mean
abline(h=p_mean,col="orange",lwd=2, lty="dashed")
text(x = 37, y = p_mean, "Exact Population's \n mean")
points(x = 10, y = sample_1_mean, pch=20, cex=1)
arrows(10, sample_1_lower, 10, sample_1_upper, length=0.05, angle=90, code=3)
points(x = 15, y = sample_2_mean, pch=20, cex=1 , col="red")
arrows(15, sample_2_lower, 15, sample_2_upper, length=0.05, angle=90, code=3, col="red")
points(x = 30, y = sample_3_mean, pch=20, cex=1 , col="green")
arrows(30, sample_3_lower, 30, sample_3_upper, length=0.05, angle=90, code=3, col="green")
- What other factors do you think impact the accuracy of our results apart the sampling method and sample size. Think about how the input data are collected, or the resolution of raster files and their impacts on the final results (out of 4).
Hand in
Please submit your answers on MLS under Assignment 3 folder. Your
final report should be in pdf
format. Also Please make sure
to include clean codes and their results when it is
applicable. The document formatting of your assignment has
5 marks.
Citations
Tracking forest attributes across Canada between 2001 and 2011 using a k nearest neighbors mapping approach applied to MODIS imagery. 2018. Beaudoin, A.; Bernier, P.Y.; Villemaire, P.; Guindon, L.; Jing Guo, X. Can. J. For. Res. 48: 85-93. https://open.canada.ca/data/en/dataset/698dc612-5059-43ee-84f3-49756e6d5ad6