This project is part of my internship at the Univeristy of Eastern Finland, aiming at using hemispherical canopy photographs taken from 20 plots in Joensuu, Finland in June 2023, in order to estimate Leaf area index (LAI) in the region. The acquisition of satellite images is used for LAI modelling and mapping. This notebook mainly features the data analysis process, while other activities such as field measurements, image processing and LAI computation at plots will be briefly introduced.

Leaf Area Index

Leaf Area Index (LAI) is defined as one half the total green leaf are per unit horizontal ground surface area (GCOS | WMO, 2011). LAI is an important parameter in many land surface vegetation and climate models, which determines light, thermal and moisture conditions within the canopy, and then influences carbon, water cycles and energy balances (FASSNACHT et al., 1994). Because leaf surfaces are the primary sites of energy and mass exchange of a tree, critical processes including canopy interception, evapotranspiration, and gross photosynthesis are directly proportional to the LAI.

Objectives

Data collection

Hemispherical canopy photography was used to obtain photographs through a hemispherical (fisheye) lens oriented toward the zenith from beneath the canopy. 20 plots were established for field measurements, each plot with a radius of 12.5m, 12 photographs were collected in each plot. Totally, 240 digital hemispherical photographs were obtained from 20 plots

Satellite data from Copernicus Sentinel-2 was used to download multi-band layers for LAI prediction and visualization. There are 13 Sentinel-2 bands in total, but in this exercise we included 10 bands to stack for a multispectral image. Sentinel2 images

Data pre-processing

Hemispherical images were pre-processed on HSP software for binarized images, which classified pixels into either sky or canopy. Thus, LAI of each plot was computed on MATLAB using binarized images. Canopy gap fraction which measures fractional light penetration were analysed by using Beer-Lambert law to estimate LAI.

Workflow

There are quite several steps to prepare the dataset for model construction. The figure below briefly shows the workflow of LAI estimation and prediction.

Fig.1 Workflow diagram for LAI estimation

Define variables

As stated, Sentinel-2 images contain 13 bands.We exclude Band 1 (Ultra Blue) since it’s more useful for water assessment and coastal monitoring. We also select bands with resolution of 10m and 20m for more spatial details and accuracy. Thus, Band 9 and 10 are not included.

Table 1: Sentinel2 band descriptions

Table 2: Variable names for the analysis

Data analysis

Import necessary packages in R

See this part

First, we should recall necessary packages for this analysis

## Loading required package: raster
## Warning: package 'raster' was built under R version 4.2.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.2.3
## Loading required package: rgeos
## Warning: package 'rgeos' was built under R version 4.2.3
## rgeos version: 0.6-2, (SVN revision 693)
##  GEOS runtime version: 3.9.3-CAPI-1.14.3 
##  Please note that rgeos will be retired during 2023,
## plan transition to sf functions using GEOS at your earliest convenience.
##  GEOS using OverlayNG
##  Linking to sp version: 1.6-0 
##  Polygon checking: TRUE
## Loading required package: rgdal
## Warning: package 'rgdal' was built under R version 4.2.3
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-5, (SVN revision 1199)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.5.2, released 2022/09/02
## Path to GDAL shared files: C:/Users/ACER/AppData/Local/R/win-library/4.2/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: C:/Users/ACER/AppData/Local/R/win-library/4.2/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.6-0
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
## Loading required package: sf
## Warning: package 'sf' was built under R version 4.2.3
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
## Loading required package: car
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData

Data exploration

See this part

One of very important steps to begin before any further analysis is checking our dataset. In this exercise, the dataset is relatively small (20 plots), we can easily check the data by summary table or simple visualization (histogram or boxplot).

It is also important to check the data features, for example: class of dataframes, class of variables, in case variables are supposed to be ‘numeric’, but are being defined as ‘character’ in the dataframe.

d<- read.csv2 ('Tasks/LAI raw data.csv', header = T, dec = '.')
head(d)
##   plot        x       y  xystdev domspec sub site ba_pine dbh_pine h_pine
## 1   50 638574.6 6954055 1.772277       1   1    2      15     24.5   20.9
## 2   51 638528.9 6953853 1.600662       1   1    3      21     23.9   17.2
## 3   52 637431.4 6952722 0.987803       2   1    3       0      0.0    0.0
## 4   53 649467.5 6944577 0.049953       1   3    4       1      5.6    4.4
## 5   54 649220.8 6945558 0.681737       1   3    3      22     18.3   17.9
## 6   55 648702.9 6944785 0.096317       3   3    3       2     38.7   23.9
##   cbh_pine ba_spruce dbh_spruce h_spruce cbh_spruce ba_birch dbh_birch h_birch
## 1     12.8         0          0      0.0        0.0        6      19.3    28.3
## 2      8.4         0          0      0.0        0.0        0       0.0     0.0
## 3      0.0        16         21     18.5        6.5        3      20.8    24.2
## 4      1.5         0          0      0.0        0.0        0       0.0     0.0
## 5     10.8         0          0      0.0        0.0        0       0.0     0.0
## 6     12.2         0          0      0.0        0.0        3      10.5    10.4
##   cbh_birch ba_other dbh_other h_other cbh_other      clos       cov      LAI
## 1      13.3        1      22.5      19       5.3 0.7108465 0.8934462 2.865535
## 2       0.0        0       0.0       0       0.0 0.4759077 0.6376521 1.636080
## 3      12.0        0       0.0       0       0.0 0.4822671 0.5715514 2.433383
## 4       0.0        0       0.0       0       0.0 0.0051500 0.0073300 0.030945
## 5       0.0        0       0.0       0       0.0 0.4755771 0.6413992 2.300299
## 6       3.3        0       0.0       0       0.0 0.1675724 0.2328902 0.872287
##   Closure75    Gaps1    Gaps2    Gaps3    Gaps4    Gaps5 WithinGaps1
## 1  0.886935 0.292846 0.237462 0.163374 0.084674 0.023227    0.167457
## 2  0.723103 0.559199 0.461225 0.391230 0.252599 0.098147    0.188357
## 3  0.842276 0.545723 0.319892 0.209371 0.110439 0.037361    0.093453
## 4  0.026871 0.999998 0.993989 0.988135 0.976962 0.946376    0.000000
## 5  0.811202 0.520846 0.396425 0.279667 0.132300 0.037525    0.189871
## 6  0.500739 0.842840 0.723333 0.613804 0.500977 0.271285    0.069326
##   WithinGaps2 WithinGaps3 WithinGaps4 WithinGaps5 BetweenGaps1 BetweenGaps2
## 1    0.163589    0.139053    0.079504    0.022645     0.125389     0.073874
## 2    0.222825    0.232592    0.188948    0.088153     0.370842     0.238400
## 3    0.121302    0.128163    0.090008    0.034383     0.452270     0.198590
## 4    0.002683    0.005402    0.013133    0.034097     0.999998     0.991306
## 5    0.213418    0.198373    0.115071    0.036625     0.330974     0.183006
## 6    0.109509    0.172868    0.208078    0.172968     0.773513     0.613824
##   BetweenGaps3 BetweenGaps4 BetweenGaps5
## 1     0.024320     0.005170     0.000582
## 2     0.158638     0.063651     0.009994
## 3     0.081208     0.020431     0.002978
## 4     0.982733     0.963829     0.912279
## 5     0.081294     0.017229     0.000901
## 6     0.440936     0.292899     0.098317
class(d)
## [1] "data.frame"
class(d$y)
## [1] "numeric"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03095 1.08753 2.29870 2.01953 2.81003 4.44793

The histogram and boxplot can tell us the pattern of our dataset, this is very useful if we have considerably large dataset. At this case, we barely see the pattern of our data with only 20 plots.

Coordinate system

See this part

Now notice that x and y which are the coordinates of each plot are defined as numeric, we need to set its coordinates and transform them into points. Here, we use ESPG:3067, which is commonly used in Finland to set each plot’s coordinates

coordinates(d) <- ~x+y
proj4string(d) <- CRS("+proj=utm +zone=35 +ellps=GRS80 +units=m +no_defs") 

Next is importing raster layers from Sentinel2 and Satellite land class

s2 <- brick('Tasks/Joensuu sentinel2.tif')

The coordinate system of Sentinel 2 will be different to the filed plot, we can check the coordinate system and convert it into the same system to the field plots.

crs(s2) <- 3067

As the Sentinel-2 images are covering a greatly larger area, while the field plots were taken mainly in Joensuu center. We will clip the map by extent narrow the studied area, and it also reduces the bias when we run the prediction on a excessive area without former field measurements.

crop <- extent(630217.4171,652851.8192,6941861.0863,6958448.1290)
s2_crop <- crop(s2, crop)

#Map the extracted map
plotRGB(s2_crop, r = 1, g = 2, b = 3, stretch = 'lin')
plot(d, col = 'red', pch = 16, cex = 0.75, add = T, width = 6, height = 4)

Value extraction and weight normalization

See this part

The extraction of pixel values requires a radius which is set for each plot. During field measurements, each plot is set up with 12.5m in radius. The pixel values for the plots are obtained as a weighted average of pixel values and compiled into a dataframe.

Here we expect that each buffer should be created independently for each geometry of field plots, thus, the argument ‘byid = TRUE’ defines that each geometry will be buffered separately according to its corresponding radius.

d$rad <- 12.5
buffer <- gBuffer(d, width = d$rad, byid = T)
## Warning: GEOS support is provided by the sf and terra packages among others

Note that the weights should be normalized so that they sum to 1 for each plot. Normalizing weights is often necessary when performing weighted extraction to ensure that the weighted average or sum makes sense.

image_df <- extract(s2_crop, buffer, df = T, weights = T, normalizeWeights = T)

names(image_df) <- c('ID', 'blue', 'green', 'red', 'vnir1', 'vnir2', 'vnir3', 'vnir4', 'swir1', 'swir2', 'vnir5', 'weight')
head(image_df)
##   ID blue green  red vnir1 vnir2 vnir3 vnir4 swir1 swir2 vnir5     weight
## 1  1 1251  1385 1182  1612  3316  3876  3944  2209  1520  4010 0.05572755
## 2  1 1252  1388 1202  1612  3316  3876  4192  2209  1520  4010 0.14344685
## 3  1 1240  1386 1192  1644  3380  3986  4022  2259  1539  4138 0.04385965
## 4  1 1271  1388 1215  1621  3158  3591  3654  2190  1533  3835 0.15479876
## 5  1 1268  1397 1213  1621  3158  3591  4028  2190  1533  3835 0.20639835
## 6  1 1247  1375 1202  1654  3303  3860  4060  2255  1547  4009 0.13725490
# Function for computing the weighted mean by band
wmean <- function(band){ 
  wsum <- image_df[,band] * image_df$weight
  tapply(wsum, image_df$ID, sum)
}
sentinel2 <- as.data.frame(cbind(d$ID, wmean("blue"), wmean("green"), wmean("red"), wmean("vnir1"), wmean("vnir2"),
                                 wmean("vnir3"), wmean("vnir4"), wmean("swir1"), wmean("swir2"), wmean("vnir5")))

names(sentinel2) <- c('blue', 'green', 'red', 'vnir1', 'vnir2', 'vnir3', 'vnir4', 'swir1', 'swir2', 'vnir5')
head(sentinel2)
##       blue    green      red    vnir1    vnir2    vnir3    vnir4    swir1
## 1 1262.906 1395.184 1208.993 1626.363 3226.216 3715.197 3940.390 2208.918
## 2 1293.035 1431.031 1286.758 1674.927 2720.336 3034.002 3217.723 2248.780
## 3 1262.450 1370.941 1229.412 1580.962 2512.410 2828.136 3026.689 1937.073
## 4 1568.257 1742.635 1846.696 2403.428 3155.025 3493.305 3806.445 3497.650
## 5 1296.608 1406.883 1269.517 1663.406 2701.899 2985.959 3030.128 2245.358
## 6 1323.314 1501.052 1314.265 1842.373 3254.668 3594.694 3786.193 2595.488
##      swir2    vnir5
## 1 1533.282 3915.557
## 2 1603.665 3243.149
## 3 1470.175 2986.213
## 4 2442.027 3877.257
## 5 1604.979 3176.320
## 6 1736.158 3808.276
#Compile data into a dataframe
merge <- cbind(as.data.frame(d), sentinel2)

Linear regression model

See this part

The pixel values for each plot have been extracted, now it’s the time to apply linear regression model.Let’s see how the model behaves with 10 different Sentinel-2 bands

#Apply linear regression model
summary(model <- lm (LAI ~ blue + green + red + vnir1 + vnir2 + vnir3 + vnir4 + swir1 + swir2 + vnir5, data =merge))
## 
## Call:
## lm(formula = LAI ~ blue + green + red + vnir1 + vnir2 + vnir3 + 
##     vnir4 + swir1 + swir2 + vnir5, data = merge)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51644 -0.19382  0.02655  0.14342  0.49447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -24.122912   8.941428  -2.698   0.0245 *
## blue          0.037458   0.014836   2.525   0.0325 *
## green        -0.004122   0.010430  -0.395   0.7019  
## red          -0.009547   0.012761  -0.748   0.4735  
## vnir1        -0.006479   0.007655  -0.846   0.4193  
## vnir2         0.004425   0.004536   0.976   0.3548  
## vnir3        -0.006386   0.005583  -1.144   0.2822  
## vnir4         0.000141   0.001989   0.071   0.9450  
## swir1        -0.010306   0.003299  -3.124   0.0122 *
## swir2         0.011801   0.005949   1.984   0.0786 .
## vnir5         0.005348   0.005277   1.013   0.3373  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4141 on 9 degrees of freedom
## Multiple R-squared:  0.9458, Adjusted R-squared:  0.8855 
## F-statistic:  15.7 on 10 and 9 DF,  p-value: 0.0001595

p-value of the model is significant, however, when looking at individual predictor, some variables are non-significant to the model. It means that some variables need to be removed. Let’s check out how residual plot looks like to this model

The variance is quite constant in the residual plot. So it’s not necessary to transform the model. Let’s try to remove non-significant variables and see if the model is improved

summary(model1 <- lm(LAI ~ blue + swir1, data = merge))
## 
## Call:
## lm(formula = LAI ~ blue + swir1, data = merge)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1725 -0.6404 -0.1450  0.6484  1.4331 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  6.303681   5.473673   1.152    0.265  
## blue         0.000835   0.005857   0.143    0.888  
## swir1       -0.002228   0.001061  -2.099    0.051 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7878 on 17 degrees of freedom
## Multiple R-squared:  0.6292, Adjusted R-squared:  0.5856 
## F-statistic: 14.43 on 2 and 17 DF,  p-value: 0.0002174

This time the model is still significant, but variable ‘blue’ is not significant to the model. Also, the variance is not constant, which means the performance of the model is bad for use. We need to consider other method to choose variables for the model.

Variable selection

See this part

This exh_var_search function is to search a combination of variables with the smallest Root mean square error (RMSE) in a linear regression. Here we try to select a combination of 3 variables and see top 5 suggestions among 120 models.

exh_var_search(merge, 3, 'LAI', 44, 53)
## Loading required package: microbenchmark
## Warning: package 'microbenchmark' was built under R version 4.2.3
## Evaluating 120 models
## Estimated time: 0 hours, 0 minutes, 0 seconds
##        V1    V2    V3  RMSE
## 108 vnir2 swir1 swir2 0.199
## 117 vnir4 swir1 swir2 0.203
## 114 vnir3 swir1 swir2 0.204
## 120 swir1 swir2 vnir5 0.214
## 31   blue vnir4 swir1 0.273

Now we select ‘vnir2’, ‘swir1’ and ‘swir2’ as our new model’s variables, in which RMSE is the smallest (0.199)

summary(model3 <- lm(LAI ~ swir1 + swir2 + vnir2, data = merge))
## 
## Call:
## lm(formula = LAI ~ swir1 + swir2 + vnir2, data = merge)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5126 -0.2885 -0.1279  0.1579  1.0988 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.465681   1.679544  -1.468 0.161470    
## swir1       -0.010941   0.001897  -5.768 2.88e-05 ***
## swir2        0.012808   0.002709   4.728 0.000227 ***
## vnir2        0.003058   0.000509   6.008 1.83e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4503 on 16 degrees of freedom
## Multiple R-squared:  0.886,  Adjusted R-squared:  0.8646 
## F-statistic: 41.44 on 3 and 16 DF,  p-value: 9.038e-08

The model looks very good now, p-value of the model and predictors are significant. The model explains 88.6 percent of the variability of LAI.

Looking at the residual plot, it does not look like a very good one but it’s acceptable. Also, because we have only 20 plots so the variability distribution will not always be constant as in theory. But the predicted values are very fitted to the dataset, also because we choose the model with the smallest RMSE. Thus, this is a good model that can be used for spatial prediction.

LAI prediction and LAI map

Next, we import non-forest mask and stack with Sentinel-2 image to predict LAI in desired region in Joensuu. The non-forest mask has been done in QGIS, using Supervised classification to classify forest, non-forest and water body. Raster calculation is applied after the classification, which assigns forest as 1, non-forest and water body as 0. We have a binary raster image as below

mask <- raster('Tasks/binary raster class.tif')
mask[mask==0] <- NA
plot(mask)

Because after classification, the extent of two images are different, first we need to resample the non-forest mask image to have the same resolution as Sentinel-2 image, and then stack them together

resample <- resample (mask, s2_crop, method = 'bilinear')
stack <- stack(s2_crop, resample)

Now transform the stacked image into a dataframe for LAI prediction

s2_df <- as.data.frame(stack)
names(s2_df) <- c('blue', 'green', 'red', 'vnir1', 'vnir2', 'vnir3', 'vnir4', 'swir1', 'swir2', 'vnir5', 'class')

pred_s2 <- predict(model3, s2_df)

Let’s take a look at our prediction

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -8.180   2.111   2.805   3.147   3.468  66.478
## Percentage of values range from 0 to 6 is : 91.17224 %

There are values which below 0 and particular high, this might be due to some false prediction on non-forest area resulted from model extrapolation. But 91.17 percent of prediction values range from 0 to 6, and only 1.26 percent of the values are below 0, so the false prediction is minor. Here, we simply remove the negative values and set the LAI limit to 6 for best visualization on the map.

But first, let’s replace non-forest area to NA values in our prediction

pred_s2 <- replace(pred_s2, is.na(s2_df$class), NA)

Now our prediction only contains values of forested area, we set the LAI prediction limit from 0 to 6 and check again the histogram and summary

#Remove predicted values which are negative and larger than 6.0
clipped_lai <- pmax(pmin(pred_s2, 6.0), 0)
hist(clipped_lai, xlab = 'Predicted LAI')

summary(clipped_lai)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     1.8     2.6     2.6     3.2     6.0 1618182

It looks quite good now. Next is to create the map. We extract a layer from Sentinel-2 image, and assigns values to the map layer

#Create a new map of LAI
laimap <- s2_crop[[1]]
values(laimap) <- clipped_lai
par(bg='light gray')
plot(laimap, main = 'Satellite image-based leaf area index map')
plot(d, col = 'red', pch = 16, cex = 0.75, add = T, width = 6, height = 4)

This is the final map that we have built.Now we can check the value prediction of any location to manually assess the accuracy of the prediction.In general, the prediction is highly correct in forested area, but there are some errors during land-use classification, in which grassland, bareland can be also classified as forests. To address with this issue, we should classify lands into more categories. The more categories we have, the higher accuracy of the pixels being trained to classify forest and non-forest area.

THE END