Safety Zone Data Analysis

Introduction

I’ve been trying to think of ways to address Matt’s comment below:

makes me wonder if this could be an interesting use case to demonstrate – how SSD can vary under very different conditions in one of the study landscapes. How much of the SZ polygons are fairly stable and how many change significantly, and what are the implications.

There’s a bunch of different ways to possibly go about this, so rather than make a bunch of edits to the manuscript or write a series of lengthy emails, I thought I’d put together an R Markdown document to demonstrate some data analyses and visualizations that might address the comment.

All of the following results are based on the LANDFIRE bias-corrected data, and the 75 randomly-generated SZ polygons in the three different test areas (NM, WY, and CA), each of which were evaluated for every combination of wind class and burn condition.

The effects of wind speed and burn condition

To my mind the first and simplest test of the stability of SZ suitability metrics would be to test the extent to which wind speed and burn condition classes affect the maximum within-SZ proportional SSD (pSSD_max). This can be done using dummy (categorical) regression. I’ll generate two models, one for wind speed only and one for burn condition only, and summarize the results.

# set working directory and read in the data
setwd("S:\\ursa\\campbell\\safety_zones\\data\\lf_vs_lidar\\")
df <- read.csv("comparison_results.csv")

# convert square meters to hectares
df$sz.area <- df$sz.area / 10000
df$safe.area.lc <- df$safe.area.lc / 10000

# create regression models
mod.wind <- lm(ssd.prop.max.lc ~ factor(wc), data = df)
mod.burn <- lm(ssd.prop.max.lc ~ factor(bc), data = df)

# print their summaries
summary(mod.wind)

## 
## Call:
## lm(formula = ssd.prop.max.lc ~ factor(wc), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3075 -0.3622 -0.1470  0.2154  4.2632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.57607    0.04866   32.39   <2e-16 ***
## factor(wc)2 -0.91399    0.06882  -13.28   <2e-16 ***
## factor(wc)3 -1.09007    0.06882  -15.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.73 on 672 degrees of freedom
## Multiple R-squared:  0.3009, Adjusted R-squared:  0.2988 
## F-statistic: 144.6 on 2 and 672 DF,  p-value: < 2.2e-16

summary(mod.burn)

## 
## Call:
## lm(formula = ssd.prop.max.lc ~ factor(bc), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9728 -0.5308 -0.2641  0.2516  4.7553 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08393    0.05734  18.902  < 2e-16 ***
## factor(bc)2 -0.16366    0.08110  -2.018    0.044 *  
## factor(bc)3 -0.36396    0.08110  -4.488 8.46e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8602 on 672 degrees of freedom
## Multiple R-squared:  0.0292, Adjusted R-squared:  0.02631 
## F-statistic:  10.1 on 2 and 672 DF,  p-value: 4.745e-05

No major surprises here. The major takeaways are:

Both models are significant, meaning that pSSD_max is affected by both wind class and burn condition.
There are significant differences between the different wind speed and burn condition classes, further suggesting that there is a noteworthy effect on pSSD_max.
Differences in wind class explain much more variance in pSSD_max (R² = 0.299), than burn class (R² = 0.026). This makes sense when you look at the slope-wind factor table, as SWF always increases with increases in wind, while this is not true for burn condition.

Here’s a couple of boxplots to help visualize these relationships:

# wind class vs pssdmax boxplot
par(mar = c(5,5,1,1))
boxplot(ssd.prop.max.lc ~ factor(wc),
        data = df,
        las = 1,
        xlab = "Wind Speed Class",
        ylab = "Maximum Within-SZ pSSD")

# burn class vs pssdmax boxplot
par(mar = c(5,5,1,1))
boxplot(ssd.prop.max.lc ~ factor(bc),
        data = df,
        las = 1,
        xlab = "Burn Condition Class",
        ylab = "Maximum Within-SZ pSSD")

The stronger effect of wind class can similarly be seen in these plots. Let’s see how much variance we can explain if we look at both wind speed and burn condition simultaneously:

# model wind speed and burn condition vs pssdmax
mod.wind.burn <- lm(ssd.prop.max.lc ~ factor(wc) + factor(bc), data = df)
summary(mod.wind.burn)

## 
## Call:
## lm(formula = ssd.prop.max.lc ~ factor(wc) + factor(bc), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3290 -0.3684 -0.1128  0.2075  4.0873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.75195    0.06159  28.445  < 2e-16 ***
## factor(wc)2 -0.91399    0.06747 -13.547  < 2e-16 ***
## factor(wc)3 -1.09007    0.06747 -16.156  < 2e-16 ***
## factor(bc)2 -0.16366    0.06747  -2.426   0.0155 *  
## factor(bc)3 -0.36396    0.06747  -5.394 9.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7156 on 670 degrees of freedom
## Multiple R-squared:  0.3301, Adjusted R-squared:  0.3261 
## F-statistic: 82.53 on 4 and 670 DF,  p-value: < 2.2e-16

…Only slightly better than wind speed alone.

Testing the effect of SZ area

Obviously wind speed and burn condition aren’t enough to explain SZ suitability. If that were the case, the tool would be useless. So, good! The next variable that will presumably have a major effect is the size of the safety zone. First, I’ll just look at SZ area vs pSSD_max in isolation:

# model sz area vs pssdmax
mod.area <- lm(ssd.prop.max.lc ~ sz.area, data = df)
summary(mod.area)

## 
## Call:
## lm(formula = ssd.prop.max.lc ~ sz.area, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5307 -0.4088 -0.1998  0.2098  4.2782 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.4161479  0.0467396   8.904   <2e-16 ***
## sz.area     0.0038801  0.0002845  13.637   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7722 on 673 degrees of freedom
## Multiple R-squared:  0.2165, Adjusted R-squared:  0.2153 
## F-statistic:   186 on 1 and 673 DF,  p-value: < 2.2e-16

Again, not surprisingly, the model is statistically significant (there is a relationship between SZ area/size and pSSD_max). However, once again, the amount of variance explained is quite low (R² = 0.215). Here’s the accompanying visualization:

# plot sz area vs pssdmax
par(mar = c(5,5,1,1))
plot(ssd.prop.max.lc ~ sz.area,
     data = df,
     pch = 16,
     col = rgb(0,0,0,0.25),
     las = 1,
     xlab = "Safety Zone Polygon Area (ha)",
     ylab = "Maximum Within-SZ pSSD")
grid()
abline(mod.area, lwd = 2, col = "red")

OK, so let’s see what happens when we combine all three variables into a single model:

# model sz area, wind speed, and burn condition vs pssdmax
mod.area.wind.burn <- lm(ssd.prop.max.lc ~ sz.area + factor(wc) + factor(bc), data = df)
summary(mod.area.wind.burn)

## 
## Call:
## lm(formula = ssd.prop.max.lc ~ sz.area + factor(wc) + factor(bc), 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1531 -0.3172 -0.0425  0.1994  3.4344 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.2600417  0.0576974  21.839  < 2e-16 ***
## sz.area      0.0038801  0.0002171  17.872  < 2e-16 ***
## factor(wc)2 -0.9139895  0.0555491 -16.454  < 2e-16 ***
## factor(wc)3 -1.0900666  0.0555491 -19.623  < 2e-16 ***
## factor(bc)2 -0.1636633  0.0555491  -2.946  0.00333 ** 
## factor(bc)3 -0.3639623  0.0555491  -6.552 1.13e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5892 on 669 degrees of freedom
## Multiple R-squared:  0.5466, Adjusted R-squared:  0.5432 
## F-statistic: 161.3 on 5 and 669 DF,  p-value: < 2.2e-16

Over half of the variance explained now (R² = 0.543). We’re getting somewhere! Tough to visualize this relationship very well, but I’ll give it a shot…

# set symbols for wind speed
df$wc.pch <- 0
df$wc.pch[df$wc == 2] <- 1
df$wc.pch[df$wc == 3] <- 2

# set colors for burn condition
df$bc.col <- "green"
df$bc.col[df$bc == 2] <- "yellow"
df$bc.col[df$bc == 3] <- "red"

# plot it out
par(mar = c(5,5,1,1))
plot(ssd.prop.max.lc ~ sz.area,
     data = df,
     pch = df$wc.pch,
     col = df$bc.col,
     las = 1,
     xlab = "Safety Zone Polygon Area (ha)",
     ylab = "Maximum Within-SZ pSSD")
grid()
legend("topleft",
       legend = c("Burn Low", "Burn Moderate", "Burn Extreme",
                  "Wind Light", "Wind Moderate", "Wind High"),
       pch = c(15,15,15,0,1,2),
       col = c("green","yellow","red","black","black","black"))

Once again, this reinforces the idea that wind is the major factor here.

We’ve now looked at three major factors, but we’re still only accounting for just over half of the variance in pSSD_max. Let’s briefly pause and acknowledge the importance of that finding – once again, if we could explain SSD by SZ size, wind speed, and burn condition alone, then we wouldn’t need a tool that takes spatially-explicit landscape conditions into consideration. Once again, good news for the justification of the tool. So maybe that’s it? Does that address your comment, Matt?

Adding in vegetation height and slope

If we wanted to dig a little deeper, obviously two other major variables that we’ve left out are vegetation height and slope. Of course, this begs the question – “why test the effects of veg height, slope, SZ size, wind speed, and burn condition in this way? Isn’t this what the tool does?” And the answer is, of course, yes. But, a few things:

I want to test the relative importance of these variables in driving pSSD_max, which the tool does not do.
Whereas the tool does complex spatial analysis, based on pixel segments, in this case I’m just taking a median vegetation height and median slope throughout the entire buffer area surrounding the SZ polygon. So, it’s still spatially-explicit, but it just lacks the spatial nuance of the actual tool, thus providing a more general sense for regional variation in slope and vegetation height on pSSD_max.

In my view, we’re getting beyond the comfort zone of basic old regression, with the number of variables and mixture of continuous and categorical predictors we’re getting into. So, I’m going to shift over to my old friend the random forest. I’ll build a RF model that takes all of these variables into consideration and assesses the relative importance of them.

# load rf library
library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

# build random forest model
df$wc <- as.factor(df$wc)
df$bc <- as.factor(df$bc)
mod.rf.1 <- randomForest(ssd.prop.max.lc ~ sz.area + wc + bc +
                           vh.median.lc + slope.median.lc,
                         data = df,
                         importance = T)
mod.rf.1

## 
## Call:
##  randomForest(formula = ssd.prop.max.lc ~ sz.area + wc + bc +      vh.median.lc + slope.median.lc, data = df, importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.13036
##                     % Var explained: 82.82

# plot it out
ax.max <- max(c(mod.rf.1$predicted, df$ssd.prop.max.lc))
par(mar = c(5,5,1,1))
plot(mod.rf.1$predicted ~ df$ssd.prop.max.lc,
     pch = 16,
     col = rgb(0,0,0,0.25),
     xlim = c(0,ax.max),
     ylim = c(0,ax.max),
     xlab = "Actual pSSDmax",
     ylab = "Modeled pSSDmax",
     las = 1)
grid()
lines(c(-10,10),
      c(-10,10))
mod.rf.pred.act <- lm(mod.rf.1$predicted ~ df$ssd.prop.max.lc)
abline(mod.rf.pred.act, lwd = 2, col = "red")

Now we’re cooking! The model is now explaining 83% of variance. Note that random forests (and other ensemble models) tends to overestimate low values and underestimate high values, hence the predicted vs. actual relationship that does not follow the 1:1 line. But, nevertheless, we don’t really care about that, because we’re not trying to make actual predictions – just understand how much these variables help to explain pSSD_max. And the answer to that question is quite a bit, but not entirely. Again, further justification for the tool – if 100% of variance could be explained by just knowing regional slope and vegetation height, SZ area, wind speed, and burn condition, you could just make a regional map that could tell you the size of SZ you need. And maybe that’s actually quite useful? But, if you’re missing 17% of SSD, then that’s potentially problematic.

Let’s now evaluate the variable importance – what is the magnitude of the effect of these individual variables on pSSD_max?

# plot variable importance
varImpPlot(mod.rf.1)

I think the %IncMSE variable importance tends to be easier to interpret – the increase in model error that would result from removing each predictor. So, wind is the biggest driver of reduction in model error, followed by the size of the SZ, the slope of the terrain surrounding the SZ, the vegetation height, and the burn condition. Very interesting! To me, at least.

Adding in some SZ shape metrics

OK, last thing, I promise. SZ area is obviously very important, but so is SZ geometry. Let’s say, using an extreme example, that you used an interstate highway as a SZ. It’s very large, in area, but given its narrowness, it may not provide SSD. So, I wanted to further test the extent to which SZ shape affects pSSD_max, given the fact that these SZ polygons are somewhat randomly-drawn. To do this, I calculated four shape metrics for each SZ polygon:

Perimeter-to-area ratio (par) – self-explanatory
Elongation ratio (er) – the ratio between the longest and shortest axis of the polygon
Related circumscribing circle (rcc) – the ratio between the area of the polygon and the area of the smallest circle that encompasses the polygon
Shape complexity index – the ratio betwen the area of the polygon and the area of a convex hull around the polygon

Let’s add these into the mix and see what pops out…

# build rf model with additional shape metrics
mod.rf.2 <- randomForest(ssd.prop.max.lc ~ sz.area + wc + bc +
                           vh.median.lc + slope.median.lc +
                           sz.par + sz.er + sz.rcc + sz.sci,
                         data = df,
                         importance = T)
mod.rf.2

## 
## Call:
##  randomForest(formula = ssd.prop.max.lc ~ sz.area + wc + bc +      vh.median.lc + slope.median.lc + sz.par + sz.er + sz.rcc +      sz.sci, data = df, importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.04470512
##                     % Var explained: 94.11

# plot it out
ax.max <- max(c(mod.rf.2$predicted, df$ssd.prop.max.lc))
par(mar = c(5,5,1,1))
plot(mod.rf.2$predicted ~ df$ssd.prop.max.lc,
     pch = 16,
     col = rgb(0,0,0,0.25),
     xlim = c(0,ax.max),
     ylim = c(0,ax.max),
     xlab = "Actual pSSDmax",
     ylab = "Modeled pSSDmax",
     las = 1)
grid()
lines(c(-10,10),
      c(-10,10))
mod.rf.pred.act <- lm(mod.rf.2$predicted ~ df$ssd.prop.max.lc)
abline(mod.rf.pred.act, lwd = 2, col = "red")

# plot variable importance
varImpPlot(mod.rf.2)

Impressive! With these added metrics, the model is now able to explain 94% of variance in pSSD_max. Plus, the slope of the predicted vs. actual regression line is much closer to 1:1 (again, not that that super matters for our purposes). Wind speed is still the dominant factor, but interestingly the importance of SZ area has decreased. This suggests that, while SZ area is still a driving force, the shape of the SZ matters quite a bit (which makes intuitive sense).

Concluding remarks

So, this all leads to the question of… What do I do with this stuff? Anything? Nothing? There are definitely some interesting things in here, but worthy of inclusion in the paper? I don’t know. I would appreciate any/all feedback you might have. If you’ve made it this far in the document, thanks for donating the last hour of your life to this :).