knitr::opts_chunk$set(echo = FALSE, fig.height = 5, fig.width = 9, fig.align = "center")
setwd("G:/My Drive/CanopyHeights_Analysis")

### Libraries
library(dplyr)
library(ggplot2)
library(xgboost)
library(knitr)
library(plotly)
library(caret)
library(kableExtra)
library(hexbin)
library(modelr)
library(readr)
library(broom)


### Import data
setwd("G:/My Drive/CanopyHeights_Analysis")

d <- read.csv("SamplePts.csv")
#head(d)


Lidar<-d$Lidar_ht
GEDI<-d$GEDI_ht
NFIS<-d$NFIS_hT
HARM<-d$Harmonized.Landcover
harmName<-d$Harmorized.Landcover.Name
wetcls<-d$ABMI_Wetland_Class

Introduction

Purpose: To assess two open source canopy height models against Alberta GOA lidar and ABMI Photoplots

Datasets to be evaluated:

GEDI spaceborne lidar extrapolated with Landsat through random forest regression models, represents year 2019
- Potapov et al (2021)

Prepublication dataset for Alberta provided to ABMI by the author, for internal use and evaluation
X:\ALPHA\CanopyHeight\GEDI_TreeHt_ABclip.tif
In Potapov et al (2021), the dataset has a MAE of 4.5m and a r² of 0.6, compared to validation data.
Forest (>3m trees) /non forest breakdown was 83-88% accurate. The dataset provided to ABMI is a slightly newer version than the one described in Potapov et al (2021).
Coverage: All of Alberta
GEDI spaceborne lidar actually only reaches 51.6 degrees of latitude N but the canopy height models are extrapolated further north
Identified as “GEDI” in this report

Cross-Canada airborne lidar transects extrapolated with Landsat through NN imputation model, represents year 2015, elev_95 product

Matasci et al (2018)
Downloaded from https://opendata.nfis.org/mapserver/nfis-change_eng.html
X:\ALPHA\CanopyHeight\NFIS_Htp95_ABclip.tif
Coverage: Boreal, Foothills, and Rocky Mountain regions of Alberta
Identified as “NFIS” in this report (National Forest Information System )
Note that the NFIS layer does not have any values between 0 and ~2.8

Comparison datasets:

Government of Alberta lidar ht = ff-be,

Acquisition years mainly 2006-2008, some later.
Original 1 m resolution tiles on ABMI L: drive resampled by 90th percentile to 30 m resolution mosaic (90th percentile was used to match validation data in Potapov et al 2021)
X:\ALPHA\CanopyHeight\LidarHt_30m.tif
Coverage: Most but not all of green zone, very limited white zones coverage (see L:\Lidar_coverage.pdf)
Not open source; Cannot be used for production of open source products

ABMI Photoplots, Height attribute, Interpreted @2015

Downloaded from https://www.abmi.ca/home/data-analytics/da-top/da-product-overview/Advanced-Landcover-Prediction-and-Habitat-Assessment--ALPHA--Products/Photoplot-Land-Cover-Data-Training-and-Validation.html?scroll=true
Coverage: 3 by 7 km tiles distributed across Alberta

Analysis Part 1 Comparison between three raster canopy height datasets: “GEDI”, “NFIS”, and “Lidar”"

Data preparation

A sample of points was taken 5 km apart over all of the province where there is coverage of all three datasets, excluding human disturbances (ABMI HFI dataset) and burns (Fire Perimeter layer) since 2006.
Attributes from the following datasets were added to the points: GEDI canopy height, NFIS canopy height, Lidar canopy height, Natural Regions of Alberta, ABMI Wetlands, and three additional layers from the NFIS site: Landcover 2015 (forested land only), Canada Harmonized Agriculture Forest Landcover 2015 (Alberta wide), Forest Percent Above Mean 2015 (simulates canopy density).

Scatter / Density plots

In the following scatterplots, darker shades of blue indicate greater density of data. The dashed line is a linear regression trendline. The yellow line is a Loess regression line (smoothed non parametric regression) [-1]http://r-statistics.co/Loess-Regression-With-R.html

Both open source satellite based canopy height layers (GEDI and NFIS) tend to exceed corresponding lidar values up to about 12m, after which they level off and eventually lidar values >15m exceed the corresponding values from the other two layers.

densPlot <-function(xVar, yVar,titleVar ){
ggplot(d, aes_string(x=xVar, y=yVar)) +  geom_jitter(alpha=0.3, colour = "lightblue") +
    xlim(0, 25) +
    ylim (0, 25) + 
  stat_density2d(aes(fill = stat(level)),geom = "polygon", alpha = 0.7) +
  scale_fill_gradient(low = "lightblue", high = "darkblue")+
  theme(legend.position = "none") +
  geom_smooth(span = 10, colour = "lightyellow", weight = 0.5) +
    geom_smooth(method = lm, colour = "black", weight = 0.01, linetype = "dashed")+
    xlab (deparse(substitute(xVar)))+
    ylab (deparse(substitute(yVar))) +
     labs(
       title = paste(titleVar)
       
     )
}

densPlot(Lidar, GEDI,"GEDI Based Canopy Heights Compared to GOA Airborne Lidar" )

gedi.lm= lm(Lidar ~ GEDI, data = d)
gedi_r2<-summary(gedi.lm)$r.squared

print ( paste ("Lidar ~ GEDI Linear Regression Rsquared: ", formatC(gedi_r2, digits = 3, format = "f")))

## [1] "Lidar ~ GEDI Linear Regression Rsquared:  0.336"

densPlot(Lidar, NFIS,"NFIS Extrapolated Canopy Heights Compared to GOA Airborne Lidar " )

nfis.lm=lm(Lidar ~NFIS, data = d)
nfis_r2<-summary(nfis.lm)$r.squared

print ( paste ("Lidar ~NFIS Linear Regression Rsquared: ", formatC(nfis_r2, digits = 3, format = "f")))

## [1] "Lidar ~NFIS Linear Regression Rsquared:  0.272"

Surprisingly, the GEDI and NFIS canopy height layers are not closely correlated with each other either. While the densest data range, values <2m, shows most values from the two datasets are within the same range, both also have values which are very low in one datset and high in the other. Most of the rest of the data is bunched up in the 10-15m (GEDI) or 12-17m (NFIS) range. The descrepancy is surprising because both these datasets use Landsat data for extrapolation.

densPlot(GEDI, NFIS,"NFIS Extrapolated Canopy Heights  Compared to GEDI Based Canopy Heights " )

both.lm=lm(GEDI ~NFIS, data = d)
both_r2<-summary(both.lm)$r.squared

print ( paste ("GEDI ~NFIS Linear Regression Rsquared: ", formatC(both_r2, digits = 3, format = "f")))

## [1] "GEDI ~NFIS Linear Regression Rsquared:  0.362"

Comments: The descrepancy between the GOA Lidar CHM and the two open source canopy height layers can probably be attributed at least partially to the temporal descrepancy. Tree growth will proceed at different rates in different ages and types of forest, so we should not expect a linear relationahip to be preserved over time. The methodological difference of resampling 1m data down to 30m by taking a percentile vs. modelled data can also explain some of the descrepancy.

The difference between the two modelled datasets is harder to explain without a detailed analysis of the input data and modelling approaches.

Hot Spot Spatial Analysis

To examine the spatial distribution of the differences between the canopy height datasets, difference images were created for 1) GEDI minus Lidar heights 2) NFIS minus Lidar Heights, and 3) GEDI minus NFIS. An Optimized Hot Spot Analysis was run in ArcPro. Red areas show significant cluster of positive values and blue areas show significant clusters of negative values. Grey areas have no significant clustering, although there may be individual high and low values mixed together.

The GEDI-minus-Lidar layer showed the most pronounced spatial clustering, with the Foothills Region, Hawk Hills, and Birch Mountains being hot spot areas where GEDI > Lidar, while surrounding areas in the northwest of the province were cold spots with Lidar > GEDI. These hilly areas have a lot of shrub, according to the NFIS Landcover layer. The NFIS-Minus-Lidar and GEDI-minus-NFIS showed less pronounced spatial patterns.

GEDI-Minus-Lidar

GEDI-Minus-NFIS

NFIS-Minus-Lidar

Analysis by Factor

The next phase of analysis focused on drilling down into the landcover, height, and density characteristics that might explain the spatial patterns. Density based on the Percent Above Mean variable turned out to be strongly positively correlated with height, so height class was the focus.

1. Pivot Table analysis in Excel showed that the average heights when broken down by landcover and wetland type usually made sense. For example, here the data is arranged by wetland class, and then the fen is futher broken down by Lidar height class. It’s observed that the GEDI and NFIS heights average higher than Lidar for the shorter height categories. Fen can be graminoid, shrub, or treed. When the 0-2m lidar height category is broken down further by NFIS landcover type, it suggests some of the fen in the group are actually treed, and these account for the higher GEDI and NFIS heights. Possibly, these areas have transitioned from graminoid to shrub or treed fen in the time between the lidar aquisition and the modelled layers. (Note that this graph does not show how many points are in each category; some, like Snow/Ice in Fen, are negligable.)

Pivot Chart with Wetland Types, Lidar Height Classes, and Landcover

2. The following density plots show the breakdown by 6 class Harmonized Landcover, ABMI Wetlands/Upland, and 13 class NFIS Landcover.

#Density Plots by Class
densPlot2 <-function(xVar, yVar,wrapVar ){
ggplot(d, aes_string(x=xVar, y=yVar)) +  geom_point(alpha=1/5, colour = "lightblue") +
    xlim(0, 25) +
    ylim (0, 25) + 
  stat_density2d(aes(fill = stat(level)),geom = "polygon", alpha = 0.7) +
  scale_fill_gradient(low = "lightblue", high = "darkblue")+
  theme(legend.position = "none") +
  geom_smooth(span = 10, colour = "lightyellow", weight = 0.5) +
   xlab (deparse(substitute(xVar)))+
    ylab (deparse(substitute(yVar)))  +
  facet_wrap(wrapVar)+
    labs(title = paste((deparse(substitute(yVar))), "vs", (deparse(substitute(xVar))), "by",(deparse(substitute(wrapVar)))))
}

densPlot2(Lidar, GEDI,"~Harmorized.Landcover.Name")

densPlot2(Lidar, GEDI,"~ABMI_Wetland_Class")

densPlot2(Lidar, GEDI,"~NFIS_Landcover_Name")

densPlot2(Lidar, NFIS,"~Harmorized.Landcover.Name")

densPlot2(Lidar, NFIS,"~ABMI_Wetland_Class")

densPlot2(Lidar, NFIS,"~NFIS_Landcover_Name")

3. These boxplots show the difference between GEDI and Lidar by Harmonized Landcover class. As expected, the range is greater for landcovers with a greater range of values, while the landcover types without woody vegetation’s differences are mainly outliers. The box plots for NFIS minus Lidar (not shown) are similar.

ggplot(d,aes(harmName,d$G_Minus_L))+ 
  geom_hline(yintercept = 0, color = "white", size = 2) +
  geom_boxplot() +
  scale_y_continuous(breaks = seq(-20, 20, 5))+
  labs(title = "GEDI Minus Lidar by Harmonized Landcover Class")

#ggplot(d,aes(d$NFIS_Landcover_Name,d$G_Minus_L))+geom_boxplot() +
#  scale_y_continuous(breaks = seq(-20, 20, 5))

4. These boxplots show the differences between GEDI and Lidar, and NFIS and Lidar, by Lidar height class. They follow a similar pattern, though the NFIS ranges tend to be larger. The positive differences in the shorter height ranges may be related to vegetation growth, while the negative differences in the taller ranges indicate that the modelled canopy heights tend to saturate out after about 15m. The sum of differences average out close to zero.

ggplot(d,aes(LidarHtCls,G_Minus_L))+ 
  geom_hline(yintercept = 0, color = "white", size = 2) +geom_boxplot() +
  labs(title = "GEDI Minus Lidar by Lidar Height Class")

ggplot(d,aes(LidarHtCls,N_Minus_L))+ 
  geom_hline(yintercept = 0, color = "white", size = 2) +geom_boxplot() +
  labs(title = "NFIS Minus Lidar by Lidar Height Class")

5.When the data is viewed organized by GEDI height class, there seems to be some omission; data in the 0-2 height range that averages 6-8m in Lidar heights and taller in NFIS heights, and is classifed as treed in the NFIS Landcover.

GEDI Height Classes by Landcover

These bar charts of GEDI heights broken down the Harmonized Landcover class also suggest at these missing heights. The coloured bars are GEDI heights and the black outline bars show the coresponding lidar heights for the same data.

ggplot(d) + 
  geom_bar(aes(wetcls, GEDI, fill = harmName), 
           position = "dodge", alpha = 0.9, width=0.75,
           stat = "summary", fun.y = "mean") +
  geom_bar(aes(wetcls, Lidar, group = interaction(wetcls,harmName)), 
           position = "dodge", alpha = 0.15, width=0.75, colour = "black", fill = NA,
           stat = "summary", fun.y = "mean") +
             facet_wrap(~GEDI_htClass,ncol = 2)+
  labs(title = "GEDI mean heights  by Wetland Class and Harmonized Landcover Class")

Here’s an example of a copse of trees in the Parkland Region visible on the 30m Lidar CHM but not on the modelled GEDI based layer (There were no NFIS canopy height data in this area.)

The same issue exists in the NFIS layer, but not always at the same locations.

Calculations Based on Pivot Tables, assuming the NFIS Landcover and Harmonized Landcover Shrub, Conifer, Broadlead, and Mixed Forest and Treed Wetland classes are correct, suggest that

13% of sample points in the 0-3m height range of the GEDI Canopy Height Model are underestimated (mostly just zeroes, should have a vegetation height)
25% of sample points in the 0-3m height range of the NFIS Canopy Height Model are underestimated (zeroes, should have a vegetation height)
This amounts to 7% (GEDI) /14% (NFIS) of total shrub and treed samples

If the two modelled canopy height layers can be used together to identify forest/nonforest, the omission can be reduced to 7% (producer’s accuracy) or about 2% consumer’s accuracy. Commission of heights above 3m is negligable.

However, the NFIS canopy height layer does not cover the Parkland and Grassland Natural Regions.

Analysis Part 2: Photoplot Data

1. The photoplot height attribute is meant to represent the average height of the tallest overstory canopy in each polygon and is expressed as integers in 1m increments. ArcPro’s Zonal Statistics as Table tool was used to calculate the mean for each raster canopy height layer corresponding to all photoplot polygons of a certain height.

The photoplot heights may be the closest we have to a comprehensive “ground truth” dataset since they are closer in time to the modelled canopy height layers. In addition, the averaging process across polygons may be more forgiving than matching single point locations as in the previous analysis, making for a better match.

At any rate, the correlation between the layers is much closer in this format. The black line represents what would be a 1:1 relationship between photoplot heights and mean raster canopy heights. Note that there are relatively few stands over 25m, so the data points over 25m represent relatively few polygons compared to the others.

#first the three summarized by photplot site height
g <- read_csv("PhotoplotHts.csv")


ggplot(g, aes(x= SITE_HT)) +  
  scale_colour_manual(name="",  values =c("GEDI"="red", "NFIS"="forestgreen", "Lidar"= 'blue')) +
  geom_hline(yintercept = 0, color = "white", size = 2) +
  geom_point(aes( y=GEDI, colour = "GEDI")) +
    geom_line(aes( y=GEDI),colour = "pink",  alpha = 0.7) +
  geom_point(aes( y=NFIS, colour = "NFIS")) +
    geom_line(aes( y=NFIS),colour = "lightgreen",  alpha = 0.7) +
  geom_point(aes( y=Lidar, colour = "Lidar")) +
    geom_line(aes( y=Lidar),colour = "lightblue",  alpha = 0.7) +
  geom_line(aes(y = SITE_HT))+
  scale_x_continuous(breaks = seq(0, 36, 2)) +
  scale_y_continuous(breaks = seq(0, 36, 2)) +
                     xlab("Photoplot Polygon Height")+
                     ylab ("Mean Raster Canopy Height")

2. Regression Models. Although the y intercepts and slopes are offset from the 1:1 line, the mean raster canopy heights are all quite closely correlated with the photoplot heights. The GOA Lidar data shows a very strong relationship to the photoplot heights, which increases confidence in its relevance for evaluation of the modelled canopy heights.

get_formula <- function(model) {
  
  broom::tidy(model)[, 1:2] %>%
    mutate(sign = ifelse(sign(estimate) == 1, ' + ', ' - ')) %>% #coeff signs
    mutate_if(is.numeric, ~ abs(round(., 2))) %>% #for improving formatting
    mutate(a = ifelse(term == '(Intercept)', paste0('y ~ ', estimate), paste0(sign, estimate, 'x'))) %>%
    summarise(formula = paste(a, collapse = '')) %>%
    as.character
  
}

 
ppgedi.lm= lm(g$GEDI~ g$SITE_HT, data = d)
gedi_r2<-summary(ppgedi.lm)$r.squared
print ( paste ("GEDI ~ Photoplot Linear Regression : ", get_formula(ppgedi.lm)))

## [1] "GEDI ~ Photoplot Linear Regression :  y ~ 4.85 + 0.23x"

print ( paste (" R squared: ", formatC(gedi_r2, digits = 3, format = "f")))

## [1] " R squared:  0.773"

ppnfis.lm=lm(g$NFIS ~g$SITE_HT, data = d)
nfis_r2<-summary(ppnfis.lm)$r.squared
print ( paste ("NFIS ~Photoplot Linear Regression : ", get_formula(ppnfis.lm)))

## [1] "NFIS ~Photoplot Linear Regression :  y ~ 4.87 + 0.41x"

print ( paste (" R squared: ", formatC(nfis_r2, digits = 3, format = "f")))

## [1] " R squared:  0.894"

pplidar.lm=lm(g$Lidar ~g$SITE_HT, data = d)
lidar_r2<-summary(pplidar.lm)$r.squared
print ( paste ("Lidar ~Photoplot Linear Regression : ", get_formula(pplidar.lm)))

## [1] "Lidar ~Photoplot Linear Regression :  y ~ 0.37 + 0.7x"

print ( paste (" R squared: ", formatC(lidar_r2, digits = 3, format = "f")))

## [1] " R squared:  0.978"

3. The relationship with the photoplot heights can perhaps be seen most clearly by examining the differences. All three mean raster canopy heights initially exceed the polygon heights by a few metres, then drop into a negative difference, confirming the pattern observed with the sample points. When we examined the sample points, the NFIS values often seemed too high, but now averaged across thousands of polygons, they fall closer to the photoplot heights than the GEDI based layer.

g$diffGEDI<-g$GEDI-g$SITE_HT
g$diffNFIS<-g$NFIS-g$SITE_HT
g$diffLidar<-g$Lidar-g$SITE_HT

#print(g)

ggplot(g, aes(x= SITE_HT)) +  
  scale_colour_manual(name="",  values =c("Diff_GEDI"="red", "Diff_NFIS"="forestgreen", "Diff_Lidar"= 'blue')) +
  geom_hline(yintercept = 0, color = "white", size = 2) +
  geom_line(aes( y=diffGEDI, colour = "Diff_GEDI")) +
  geom_line(aes( y=diffNFIS, colour = "Diff_NFIS")) +
  geom_line(aes( y=diffLidar, colour = "Diff_Lidar")) +
  xlim(0, 25) +
    ylim (-25, 5) + 
                     xlab("Photoplot Polygon Height")+
                     ylab ("Differences from Photoplot Height")

It should be noted that the NFIS product is based on a 95th percentile of airborne lidar. The GEDI based product is based on “RH95”, a 95th percentile of a 25m wide spaceborne lidar footprint, which Potapov et al (2021) found corresponded best to the 90th percentile of airborne lidar.

Conclusions

The Government of Alberta lidar sampled down to 30m is an imperfect evaluation layer for the two modelled canopy height layers, principally due to the different in age, and lack of coverage in certain natural regions (Canadian Shield, Parkland and Grassland). It is getting too dated to be used as “truth” for canopy height models, but it can serve as a rough guide after known disturbances in the intervening years are removed.
There is spatial variation in the areas where the three canopy height layers differ, with no one layer consistently higher or lower than the others across the province. These variations seem to be related to variables such as height, density, terrain, and vegetation type, which makes it hard to separate out effects of vegetation growth over time, methodology, and consistency of results.
An examination of the canopy height values against three landcover layers, two from NFIS and one produced by ABMI (5 wetland classes and Uplands), suggest that on average the height values match the classified cover type i.e treed upland and wetland cover types have heights above 3m.
The two modelled height layers are often higher than the GOA lidar for shorter, younger forest, suggesting a growth effect, but level out for forest above 15m in height. The three layers are closest to each other in the 10 to 15m range, which is where a large proportion of the trees are. Both layers completely omit a small percentage of shorter tree or shrub heights,
The NFIS canopy height layer has a few shortcomings for our uses

Only covers part of Alberta
No values 0 < 3, for shrub heights
more errors of omission for shorter trees
more erratic values when analyzed as individual sample points
Some values seem anomalously high compared to the other datasets

However, when averaged over polygons, the NFIS layer appeared to match photoplot heights more closely than the GEDI layer.

Recommendations

If one is primarily interested in identifying taller stands, eg old growth, the NFIS layer may be more useful, although both layers have shortfalls in this respect.
For shorter sparsely treed landcover, such as treed wetland and shrub areas, the GEDI layer is probably more reliable.
For distinguishing shrub and forest from graminoid and other non woody or non vegetated areas, both layers in combination would be ideal, but in the south of the province only the GEDI based layer is available.
For general purposes, if one layer is to be chosen, I would recommend the GEDI based layer for its comprehensive coverage, more recent data, and fewer omissions and extreme values.

Both open source canopy height modelled layers could be useful and meaningful for many applications, but must be used with caution and a few caveats until more precise open source canopy height data is available.

References

Matasci, G., T. Hermosilla, M.A. Wulder, J.C. White, N.C. Coops, G.W. Hobart, H.S.J. Zald Large-area mapping of Canadian boreal forest cover, height, biomass and other structural attributes using Landsat composites and lidar plots Remote Sens. Environ, 209 (2018) https://doi.org/10.1016/j.rse.2017.12.020

Potapov, P., X. Li, A. Hernandez-Serna, A. Tyukavina, M.C. Hansen, A. Kommareddy, A. Pickens, S. Turubanova, H. Tang, C.E. Silva, J. Armston, R. Dubayah, J.B. Blair, M. Hofton Mapping global forest canopy height through integration of GEDI and Landsat data Remote Sens. Environ., 253(2021) https://doi.org/10.1016/j.rse.2020.112165

Canopy Height Comparison

Fiona Gregory, ABMI

July 31 2021