Introduction

While fully recognizing (and appreciating!) the fact that Erin is the FIA expert, and will be doing most of the meaningful work with FIA data, I thought I’d do some quick exploratory work to, at the very least, start to nail down the extent within which we are interested in grabbing FIA plot data. Plus, it’s an excuse for me to get more familiar with FIA. Massive disclaimer: I am an extreme FIA novice, so I think I’m doing things correctly here, but I very well might not be. Thoughts/questions/feedback genuinely appreciated!

Read in the data

First things first, I’m going to load in FIA tree and plot data from the states from which we are likely to pull data, due to their proximity to the UWC and Ashley NFs (CO, ID, NV, UT, WY). These data were acquired from the FIA DataMart in CSV format on 3/7/2022 (https://apps.fs.usda.gov/fia/datamart/CSV/datamart_csv.html).

library(knitr)
## Warning: package 'knitr' was built under R version 4.1.2
library(rmarkdown)
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.1.2
library(sf)
## Warning: package 'sf' was built under R version 4.1.2
## Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1
library(viridis)
## Warning: package 'viridis' was built under R version 4.1.2
## Loading required package: viridisLite
## Warning: package 'viridisLite' was built under R version 4.1.2
# read in plot-level data
df.plot.co <- read.csv("S:/ursa/campbell/bwa/data/fia/CO_PLOT.csv")
df.plot.id <- read.csv("S:/ursa/campbell/bwa/data/fia/ID_PLOT.csv")
df.plot.nv <- read.csv("S:/ursa/campbell/bwa/data/fia/NV_PLOT.csv")
df.plot.ut <- read.csv("S:/ursa/campbell/bwa/data/fia/UT_PLOT.csv")
df.plot.wy <- read.csv("S:/ursa/campbell/bwa/data/fia/WY_PLOT.csv")

# read in tree-level data
df.tree.co <- read.csv("S:/ursa/campbell/bwa/data/fia/CO_TREE.csv")
df.tree.id <- read.csv("S:/ursa/campbell/bwa/data/fia/ID_TREE.csv")
df.tree.nv <- read.csv("S:/ursa/campbell/bwa/data/fia/NV_TREE.csv")
df.tree.ut <- read.csv("S:/ursa/campbell/bwa/data/fia/UT_TREE.csv")
df.tree.wy <- read.csv("S:/ursa/campbell/bwa/data/fia/WY_TREE.csv")

Just for s’s and g’s, here’s a quick look at an example of each data frame:

# print out plot-level data frame
paged_table(df.plot.co)
# print out tree-level data frame
paged_table(df.tree.co)

…So, you know, A LOT of data. And that’s just for one state. Next, I’ll merge the state-level datasets together.

# merge state-level data together
df.plot <- rbind(df.plot.co, df.plot.id, df.plot.nv, df.plot.ut, df.plot.wy)
df.tree <- rbind(df.tree.co, df.tree.id, df.tree.nv, df.tree.ut, df.tree.wy)

In all, we’re looking at 137766 records in the plot-level data, and 1021162 records in the tree-level data. Yikes!
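
For reference, those counts are simply the row counts of the merged tables:

# row counts of the merged plot- and tree-level data
nrow(df.plot)
nrow(df.tree)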

Data filtering and pre-processing

Before moving any further, I’m going to do a few things to simplify the analysis. All of the following steps were gleaned from a combination of experimentation and a somewhat careful reading of the [FIA Database Description and User Guide for Phase 2 (version 9.0.1)](https://www.fia.fs.fed.us/library/database-documentation/current/ver90/FIADB%20User%20Guide%20P2_9-0-1_final.pdf). Here are the steps you’ll see in the following code:

Plot-level data filtering and pre-processing

  1. Filter to the “national design” plots only. This is recorded in the plot-level data under the field DESIGNCD. As I understand it, the national design makes up the majority of FIA plots, but there are a host of other plot protocols that are implemented for various reasons. To facilitate the most direct comparison between plots, I’m going to focus only on the national design plots, though Erin may have other thoughts about this when it comes to actual data analysis.
  2. Filter to only plots that were measured. This is recorded in the plot-level data under the field PLOT_STATUS_CD. Values of 1 or 2 mean the plot was visited.
  3. Creation of a unique plot ID. This one took me a while to figure out. As I understand it (again, which may be severely incorrect), the CN field is a unique sequence number used to identify a plot record. But(!) a plot record is unique to each visit. So, given that there are, in many cases, multiple records for one plot, I need a plot-level ID that is only unique to the plot itself, not to the measurement event of that plot, so that I can filter to only the most recent measurement event… If that makes sense. According to the User Guide, a unique plot ID can be created based on the state code (STATECD), survey unit code (UNITCD), county code (COUNTYCD) and plot number (PLOT). So, by concatenating these together, I should be able to filter to only the most recent year’s data.
  4. Filter to only the most recent year’s data. This can be done using the new plot-level ID along with the inventory year field (INVYR).
  5. Reduce table complexity. With these steps done, I can now dramatically simplify the plot-level data to only those variables I will need moving forward. These include:
  • CN – now that we’re focused only on the most recent data, the sequence number can be used to join plot-level data to tree-level data.
  • LAT – fuzzed latitude of the plot. For the sake of this simple analysis, fuzz isn’t really too much of an issue.
  • LON – fuzzed longitude of the plot.
# filter to only national design plots
df.plot <- df.plot[df.plot$DESIGNCD == 1,] # national plot design

# filter to only those plots that were measured
df.plot <- df.plot[df.plot$PLOT_STATUS_CD %in% c(1,2),] # plots were measured

# create new plot id
df.plot$plot.id <- paste(df.plot$STATECD, df.plot$UNITCD, df.plot$COUNTYCD, df.plot$PLOT, sep = ".")

# select only the most recent year's data
df.plot.years <- aggregate(df.plot$INVYR, list(df.plot$plot.id), max)
colnames(df.plot.years) <- c("plot.id", "most.recent.year")
df.plot <- merge(df.plot, df.plot.years)
df.plot <- df.plot[df.plot$INVYR == df.plot$most.recent.year,]

# simplify table
df.plot <- df.plot[,c("CN", "LAT", "LON")]

Tree-level data filtering and pre-processing

The only filtering I’m going to do on the tree-level data for now is focusing only on live trees. That can be done using the STATUSCD field, where a value of 1 means the tree is alive. As with the plot data, I’ll also pare down the table’s complexity to focus only on those fields that will be useful moving forward, including:

  • PLT_CN – plot sequence number to join to plot-level data
  • SPCD – species code
  • DIA – stem diameter

# filter to only live trees
df.tree <- df.tree[df.tree$STATUSCD == 1,]

# simplify table
df.tree <- df.tree[,c("PLT_CN", "SPCD", "DIA")]

Map the data out

At this point, I think it’s useful to visualize the spatial nature of these data to gain a sense for how dense the plot network is within the study area.

# read in usfs boundaries
sf.usfs <- st_read("S:/ursa/campbell/bwa/data/usfs/boundaries/a02_extracted/S_USA.AdministrativeForest.gdb")
## Reading layer `AdministrativeForest' from data source 
##   `S:\ursa\campbell\bwa\data\usfs\boundaries\a02_extracted\S_USA.AdministrativeForest.gdb' 
##   using driver `OpenFileGDB'
## Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
## GDAL Message 1: organizePolygons() received a polygon with more than 100 parts.
## The processing may be really slow. You can skip the processing by setting
## METHOD=SKIP, or only make it analyze counter-clock wise parts by setting
## METHOD=ONLY_CCW if you can assume that the outline of holes is counter-clock
## wise defined
## Simple feature collection with 112 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -150.0079 ymin: 18.03373 xmax: -65.69967 ymax: 61.51899
## Geodetic CRS:  NAD83
# create subset of just the study area
sf.usfs.sa <- sf.usfs[sf.usfs$FORESTNAME %in% c("Uinta-Wasatch-Cache National Forest",
                                                "Ashley National Forest"),]

# read in state boundaries
sf.state <- st_read("S:/ursa/campbell/bwa/data/census/a02_extracted", "tl_2021_us_state")
## Reading layer `tl_2021_us_state' from data source 
##   `S:\ursa\campbell\bwa\data\census\a02_extracted' using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  NAD83
# create sf object from plots
sf.plot <- st_as_sf(df.plot, coords = c("LON", "LAT"), crs = 4326)

# plot it all out
plot(st_geometry(sf.usfs.sa))
plot(st_geometry(sf.plot), pch = 16, cex = 0.2, col = "lightgray", add = T)
plot(st_geometry(sf.state), lwd = 3, add = T)

My eyes might be deceiving me, but it seems like something strange is going on in Wyoming – seemingly more plots than in the other states. Paired plots right next to each other maybe? Not sure… Need to investigate further.

…Upon further investigation (i.e., opening up the plot points in ArcGIS), it seems like there is, in fact, a higher density of plots that are measured using the national design in WY as compared to the other states. Erin, any idea what’s going on here?
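
Out of curiosity, here’s one way that density difference could be quantified in R rather than eyeballed (a rough sketch, not run above; it leans on the Census shapefile’s standard NAME and ALAND fields and simply counts the retained plots falling within each state, normalized by land area):

# count retained (national design, most recent) plots per state and
# normalize by land area (ALAND is in m^2; 1e9 m^2 = 1,000 km^2)
sf.state.sub <- sf.state[sf.state$NAME %in% c("Colorado", "Idaho", "Nevada",
                                              "Utah", "Wyoming"),]
sf.state.sub <- st_transform(sf.state.sub, st_crs(sf.plot))
n.plots <- lengths(st_intersects(sf.state.sub, sf.plot))
data.frame(state = sf.state.sub$NAME,
           n.plots = n.plots,
           plots.per.1000.km2 = round(n.plots / (sf.state.sub$ALAND / 1e9), 1))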

Focus on ABLA

Wyoming weirdness aside, I’ve now got a set of (fuzzed) FIA plots with the most recent measurements that follow the national design data collection protocol and all of the associated live trees. Let’s figure out which of these plots have ABLA. The species code for ABLA is 19 according to the [FIA Master Species List](https://www.fia.fs.fed.us/library/field-guides-methods-proc/docs/2021/2021%20Master%20Species%20FGver9-1_9_2021_final.xlsx). To do this, I’ll just focus on presence/absence for now.

# get species presence (1) vs. absence (0) per plot
df.tree$ABLA <- 0
df.tree$ABLA[df.tree$SPCD == 19] <- 1
df.tree.abla <- aggregate(df.tree$ABLA, list(CN = df.tree$PLT_CN), max)
colnames(df.tree.abla) <- c("PLT_CN", "ABLA")

# merge with plot sf
sf.plot <- merge(sf.plot, df.tree.abla, by.x = "CN", by.y = "PLT_CN")

# subset to only plots with abla
sf.abla <- sf.plot[sf.plot$ABLA == 1,]

There are 4301 plots in CO, ID, NV, UT, and WY that have ABLA present. Let’s see their spatial distribution.

# plot it all out
plot(st_geometry(sf.abla), pch = 16, cex = 0.2)
plot(st_geometry(sf.state), lwd = 3, add = T)

Explore distance from study area

OK, so we know that there are 4301 plots total, but we probably shouldn’t build a model based on all of these plots, for a few reasons:

  1. Since we’re focused on a relatively small study area (two NFs), is it really defensible to build a model that includes data from hundreds of miles away? The habitat conditions that are suitable for ABLA in northern ID are likely quite a bit different than those that are suitable for ABLA in the UWCNF.
  2. It would require some MASSIVE predictor datasets. If we wanted Landsat or Sentinel data, DEM data, climate data, etc. at the full extent of all of these ABLA plots at 10 or 30 m spatial resolution, we’d be looking at possibly several TB of data. While that’s not unreasonable, given the capacity to use tools like Earth Engine, it would definitely complicate matters a bit.

So, the question then becomes how to define the extent within which we want to grab FIA plot data to build our models. To do this, I’ll first determine the distance of each plot to our study area.

# reproject into albers projection
sf.abla <- st_transform(sf.abla, crs = 5070)
sf.usfs.sa <- st_transform(sf.usfs.sa, crs = 5070)
sf.state <- st_transform(sf.state, crs = 5070)

# get distance from plots to study area
dists <- st_distance(sf.abla, sf.usfs.sa) %>%
  apply(1, min)

# add as field to abla plots
sf.abla$DIST <- dists

# add a color field to color points by distance from study area
sf.abla$COL <- magma(100)[as.numeric(cut(sf.abla$DIST, breaks = 100))]

# plot it out
plot(st_geometry(sf.abla), pch = 16, cex = 0.2, col = sf.abla$COL)
plot(st_geometry(sf.state), lwd = 3, add = T)

Bearing in mind that the plot locations are fuzzed (so these numbers are slightly off), there are 182 plots with ABLA within the UWC and Ashley NFs. Let’s take a slightly more informative look…
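
First, though, that in-forest count can presumably be reproduced as the number of plots whose distance to the study-area polygons is zero (i.e., inside or on the boundary):

# ABLA plots falling within the study area boundaries
sum(sf.abla$DIST == 0)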

# bin abla plot counts by 500 m distance from study area intervals
for (i in seq(0, max(dists), 500)){
  n <- length(dists[dists <= i])
  if (i == 0){
    results.df <- data.frame(dist = i, count = n)
  } else {
    results.df <- rbind(results.df, data.frame(dist = i, count = n))
  }
}

# convert to km
results.df$dist <- results.df$dist / 1000

# plot out the results
plot(count ~ dist,
     data = results.df,
     type = "l",
     xlab = "Distance (km)",
     ylab = "Number of ABLA plots",
     lwd = 2,
     col = "red",
     las = 1)
grid()

Thinking about how this looks from the peer review perspective, it would be nice to be able to report a clean, round number in a sentence like “We buffered our study area boundary by x km and gathered FIA plot data from within this buffer area for use in the prediction of bla bla bla…”. So, here’s a table of some possible x values:

# create summary table
summary.df <- results.df[results.df$dist %% 50 == 0,]
paged_table(summary.df)

And here’s one more consideration… I’ll now buffer the study area by each of these distances, and estimate how large the predictor raster files would be, making some admittedly over-simplified assumptions.

# add two new fields to the summary df, representing number of raster pixels
# needed to cover the extent of the study area size, with either 10 or 30 m pix
summary.df$n.10 <- NA
summary.df$n.30 <- NA

# add two more for file size (in GB), assuming 32-bit float pixel type
summary.df$size.10 <- NA
summary.df$size.30 <- NA

# begin buffer loop
for (i in seq(1,nrow(summary.df))){
  buff.dist <- summary.df$dist[i] * 1000
  if (buff.dist == 0){
    bb <- st_bbox(sf.usfs.sa)
  } else {
    buff <- st_buffer(sf.usfs.sa, buff.dist)
    bb <- st_bbox(buff)
  }
  area <- (bb[3] - bb[1]) * (bb[4] - bb[2])
  summary.df$n.30[i] <- area / 900
  summary.df$n.10[i] <- area / 100
  # convert pixel counts to GB: 32 bits per pixel / (8 * 1024^3 bits per GB)
  summary.df$size.30[i] <- 32 * summary.df$n.30[i] / 8589934592
  summary.df$size.10[i] <- 32 * summary.df$n.10[i] / 8589934592
}

# take a look at the table
paged_table(summary.df)

From this, we can derive some useful equations. If we fit models to the 10 and 30 m resolution file sizes, we can use these models, along with an estimate of how many predictor layers we’re going to have, to predict total predictor stack file size.

# model the relationships
mod.10 <- nls(size.10 ~ a * dist ^ b + c,
              data = summary.df,
              start = list(a = 0.01, b = 2, c = 10))
mod.30 <- nls(size.30 ~ a * dist ^ b + c,
              data = summary.df,
              start = list(a = 0.01, b = 2, c = 10))

# plot out the data
par(mfrow = c(1,2))
plot(size.10 ~ dist,
     data = summary.df,
     xlab = "Buffer Distance (km)",
     ylab = "File Size of Single Raster (GB)",
     pch = 16,
     las = 1,
     main = "10 m Resolution")
pred <- predict(mod.10, summary.df)
lines(pred ~ summary.df$dist, col = "red", lwd = 2)
grid()
plot(size.30 ~ dist,
     data = summary.df,
     xlab = "Buffer Distance (km)",
     ylab = "File Size of Single Raster (GB)",
     pch = 16,
     las = 1,
     main = "30 m Resolution")
pred <- predict(mod.30, summary.df)
lines(pred ~ summary.df$dist, col = "red", lwd = 2)
grid()

The resulting equations are:

  • 10 m resolution
    • \(s = n \times (0.001348 \times d^{1.7125} + 4.4641)\)
  • 30 m resolution
    • \(s = n \times (1.5 \times 10^{-4} \times d^{1.7125} + 0.496)\)

where \(s\) is the uncompressed file size in GB, \(n\) is the number of predictor layers, and \(d\) is the buffer distance in kilometers. Note the emphasis on uncompressed. In reality, some compression will be applied, which for a typical TIFF might cut file sizes roughly in half. So, let’s imagine a few scenarios:

  • 25 predictors
    • 10 m resolution
      • 125 km buffer
        • Number of ABLA plots = 506
        • File size = 242.5 GB
      • 250 km buffer
        • Number of ABLA plots = 1654
        • File size = 542.5 GB
      • 500 km buffer
        • Number of ABLA plots = 3449
        • File size = 1522.5 GB
    • 30 m resolution
      • 125 km buffer
        • Number of ABLA plots = 506
        • File size = 27.5 GB
      • 250 km buffer
        • Number of ABLA plots = 1654
        • File size = 60 GB
      • 500 km buffer
        • Number of ABLA plots = 3449
        • File size = 170 GB
  • 50 predictors
    • 10 m resolution
      • 125 km buffer
        • Number of ABLA plots = 506
        • File size = 485 GB
      • 250 km buffer
        • Number of ABLA plots = 1654
        • File size = 1085 GB
      • 500 km buffer
        • Number of ABLA plots = 3449
        • File size = 3045 GB
    • 30 m resolution
      • 125 km buffer
        • Number of ABLA plots = 506
        • File size = 55 GB
      • 250 km buffer
        • Number of ABLA plots = 1654
        • File size = 120 GB
      • 500 km buffer
        • Number of ABLA plots = 3449
        • File size = 340 GB
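
As a sanity check, those scenario file sizes can be reproduced (approximately) from the fitted equations. Here’s a minimal sketch with the coefficients hard-coded from the reported fits; in practice you’d pull them from coef(mod.10) and coef(mod.30):

# approximate uncompressed predictor stack size (GB) from the fitted equations
stack.size <- function(n.layers, dist.km, res = 30){
  if (res == 10){
    n.layers * (0.001348 * dist.km ^ 1.7125 + 4.4641)
  } else {
    n.layers * (1.5e-4 * dist.km ^ 1.7125 + 0.496)
  }
}

# e.g., 25 predictors at 10 m resolution with a 250 km buffer (~540 GB)
stack.size(25, 250, res = 10)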

Summary and conclusions

The goal here was to get a very general sense of how many FIA plots we might be working with, depending on how far from the study area we are willing to go to incorporate ABLA-presence plots. There are a few key caveats:

  1. This only includes national design FIA plots. Perhaps that’s too limiting… Erin?
  2. Something weird seems to be going on in WY… Erin?
  3. File sizes are uncompressed and represent a full rectangular extent, rather than a clipped-out buffer area. So, they should be treated as overestimates, though in the general ballpark of reality (see the sketch after this list for one way to tighten the estimate).
  4. I could very much envision a situation in which we have more than 50 predictors.
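
On caveat 3, here’s a minimal sketch of how the estimate could be tightened by using the area of the actual buffered polygons rather than their bounding box (assuming sf.usfs.sa is still in EPSG:5070, as it is after the reprojection above):

# tighter per-layer size estimate based on the buffered (dissolved) polygons
buff <- st_buffer(sf.usfs.sa, 250 * 1000)            # e.g., a 250 km buffer
area.m2 <- as.numeric(sum(st_area(st_union(buff))))  # buffered area in m^2
n.30 <- area.m2 / 900                                # number of 30 m pixels
32 * n.30 / 8589934592                               # uncompressed GB per layer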

There’s obviously no “right” answer here. There’s also the possibility that we could do a more nuanced, ecologically-relevant plot selection process, whereby instead of just doing a simple buffer, we figure out which plots are most similar to those in our study area and use those. But, even then, there would still have to be some threshold for what counts as “similar enough”. Distance is simple; ecological similarity is muddy. So, I’m inclined to keep it simple. Plus, thanks to our old friend spatial autocorrelation, distance and ecological similarity should align somewhat well.

I’m inclined to say a buffer in the 200-250 km range might work, giving us well over 1000 ABLA plots (and a whole boatload of non-ABLA plots), while avoiding a huge amount of data. But, you know, I’m obviously open to further discussion, and that opinion may well change with feedback from Erin (e.g., “you did this all wrong, none of this is valid, sorry”).