Introduction

Within resource management we often face questions that need a quick response, and in this domain a decision has potential impacts on the environment, whether positive or negative. These questions present conundrums: we don't want to make a mistake, so we either shelve the question for later or get hurried into a decision, and mostly the latter. Sometimes taking it slow is the best option, but with foreseeable increases in pace and scale and environmental neoliberalism (Lave 2012), quick decisions will inevitably become the norm. The traditional way to correct for hurried decisions is to default to some hydrology, a priori or a posteriori. This default mode is not obsolete and shouldn't be construed that way; however, with modern advances in technology, another way to correct for hurried decisions is through data science. Geospatial and remote sensing data that were not user-friendly in the recent past are now readily available on many platforms (e.g., Google Earth Engine, statistical learning software, rugged field devices). By using these platforms appropriately (i.e., data science), we as practitioners can make more informed decisions and correct for 'hand waving' and confirmation bias.

Methods

This document builds on a body of research on the relationships between remotely sensed climate data, topographic indices, and hydrologic processes (Beven and Kirkby 1979; Dobrowski et al. 2013; Hird et al. 2017; Holden and others, in prep.; Hoylman et al. 2018; Pelletier et al. 2018; Raduła, Szymura, and Szymura 2018; Robinson et al. 2018; Sörensen, Zinko, and Seibert 2006; Stephenson 1998). These data sets, along with machine learning, were used to create a spatial prediction map for the Ksanka Ranger District in northwest Montana. The methodology, concepts, and workflow for the analysis were adapted from Hird et al. (2017), Hoylman et al. (2018), and Meyer et al. (2018). The predictor data therefore follow the three categories (digital elevation models (DEM), radar images, and optical images) that Hird et al. (2017) used for predicting “wet” and “dry” areas in northern Alberta, Canada. I think these categories cast an adequate “net” across the spatio-temporal realm of predictor variables in the upper to lower montane environments of northwest Montana.

Geospatial Data

Digital elevation models (DEM) were used to generate the following: topographic wetness index (TWI), topographic position index (TPI), and upslope accumulated area (UAA). The indices were derived following Tarboton (2013), using the TauDEM command-line functions called from R (R Core Team 2014). A 10-m DEM from the USGS National Elevation Dataset (Gesch et al. 2009) was used to calculate UAA with the d-infinity algorithm (Tarboton 2013). TWI was then calculated at 10-m resolution from UAA and slope. Both TWI and UAA were aggregated to 30-m resolution (nearest neighbor) to retain the intricacies of the flow-path directions while still stacking with the 30-m optical and radar images. TPI was also derived from the 10-m DEM and aggregated to 30-m resolution using nearest neighbor.
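For reference, TWI is typically computed as ln(a / tan β), where a is the upslope contributing area per unit contour length and β is the local slope. Below is a minimal sketch (not the exact script used here) of how TWI could be derived from hypothetical 10-m UAA and slope rasters:

library(raster)

uaa10   <- raster("accum10.tif")  #hypothetical 10-m d-infinity upslope accumulated area
slope10 <- raster("slope10.tif")  #hypothetical 10-m slope grid (radians)

#TWI = ln(a / tan(slope)); the small offset avoids division by zero on flat cells
twi10 <- log(uaa10 / tan(slope10 + 1e-6))
#twi10 is then aggregated/resampled to 30 m the same way as TPI (shown in the Modeling section)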

Radar image collections included 10-m Sentinel-1 polarimetric Synthetic Aperture Radar (SAR) images from ‘05-15’ to ‘08-31’ for each year (2015-2019), processed into Normalized Polarization (Pol), Vertical Polarization (VV), and VV Standard Deviation (VVsd) following the methods described in Hird et al. (2017). These images were collected in Google Earth Engine (GEE) (Gorelick et al. 2017) using code from the Supplementary Material (Code S1) of Hird et al. (2017), then aggregated to 30-m resolution using nearest neighbor.

Optical image collections included 10-m Sentinel-2 optical satellite images from August-September (‘08-01’ to ‘09-30’) for each year (2015-2019), used to produce the Normalized Difference Vegetation Index (NDVI) and Normalized Difference Water Index (NDWI). Median Net Primary Productivity (NPP) from 1986-2018 (Robinson et al. 2018) and annual 30-m Climatic Water Deficit (CWD) from 1986-2016 (Holden and others, in prep.) were also used as predictors. NDVI, NDWI, and median NPP were collected using Google Earth Engine (GEE); the annual CWD came from Holden and others (in prep.).
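As a reference point, NDVI and NDWI are simple band ratios. Here is a minimal sketch, assuming the Sentinel-2 band composites (B3 = green, B4 = red, B8 = NIR) had been exported from GEE as rasters and using the green/NIR (McFeeters) form of NDWI; the actual indices for this analysis were computed in GEE:

library(raster)

green <- raster("S2_B03_medianAS.tif") #hypothetical Aug-Sept median composites
red   <- raster("S2_B04_medianAS.tif")
nir   <- raster("S2_B08_medianAS.tif")

ndvi <- (nir - red) / (nir + red)      #NDVI = (NIR - Red) / (NIR + Red)
ndwi <- (green - nir) / (green + nir)  #NDWI (McFeeters) = (Green - NIR) / (Green + NIR)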

Update

I added some data courtesy of Zach Holden. Per email:

"Josh, I’m attaching a CSV file with 4 additional covariates you should try in your model.

Here’s a brief description of what they are:

HDI - hillslope deficit index; combines deficit and TWI described by Hoylman et al
CAD - Cold air drainage potential; spatial map of cold air pools
SRAD - solar radiation (downward shortwave)
Decid - an index derived from landsat/GEE that shows change in NDVI between summer and fall. Shows deciduous understory vegetation in valley bottoms."

Machine Learning

Model Selection

I chose to use two different machine learning (ML) models in this analysis, both to compare similarities in the response and to select the better model for prediction. The two ML models used were random forest and gradient boosting. These models were chosen for their flexibility and performance with spatial classification data: both are highly flexible but also have tuning parameters that can help lower the variance usually associated with highly flexible models.

Model Assessment

The ML models were evaluated using the caret (Kuhn et al. 2017) and CAST (https://github.com/environmentalinformatics-marburg/CAST) packages in R. There are three reasons for using caret and CAST: 1) caret integrates with CAST, which can assess model performance with a target-oriented cross-validation (i.e., by location and time); 2) CAST provides forward feature selection (FFS) in conjunction with that target-oriented cross-validation; and 3) Meyer et al. (2018) clearly explain the advantages of this “target-oriented FFS” approach in a spatio-temporal context compared to other model assessments (e.g., traditional k-fold CV, recursive feature elimination (RFE)), which can yield overoptimistic results due to over-fitting on misleading variables.

Results

Below is a table of results from the two ML models compared to the National Hydrography Dataset (NHD) and an Upslope Accumulated Area (UAA) threshold. Climate zones were based on a threshold of 325 mm annual CWD, used as a proxy for “energy-limited” and “water-limited” geographical zones. The results are ordered by descending specificity. This sorting was chosen because we as practitioners are mostly fine with false positives but not false negatives, that is, predicting there is no stream when there actually is one. We want to account for this type II error, and I think this is a useful way to compare and contrast models. A short sketch of how the table can be assembled from caret output follows it, and the results are discussed in more detail below.
Results of ML models (random forest and gradient boosting) compared to controls (NHD, UAA)
Model              Accuracy  Sensitivity  Specificity  Kappa   95% CI
Wet RF             0.8878    0.8932       0.9259       0.7397  (0.808, 0.9426)
RF                 0.8774    0.8723       0.8873       0.7340  (0.8255, 0.9183)
Dry GBM            0.8772    0.8841       0.8667       0.7450  (0.8025, 0.9312)
Dry RF             0.8684    0.8714       0.8636       0.7259  (0.7923, 0.9244)
GBM                0.8632    0.8643       0.8611       0.7040  (0.8095, 0.9064)
Wet GBM            0.8469    0.8451       0.8519       0.6451  (0.7601, 0.9117)
Dry 30m NHD        0.8500    0.8629       0.8326       0.6938  (0.8164, 0.8796)
Dry UAA > 10,783   0.8596    0.9130       0.7873       0.7092  (0.8268, 0.8883)
30m NHD            0.8470    0.8948       0.7716       0.6742  (0.8241, 0.8681)
UAA > 10,783       0.8610    0.9543       0.7139       0.6954  (0.8388, 0.8812)
Wet 30m NHD        0.8442    0.9216       0.7026       0.6467  (0.8112, 0.8735)
Wet UAA > 10,783   0.8623    0.9888       0.6308       0.6728  (0.8307, 0.89)
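For reproducibility, a table like this can be assembled from the caret confusionMatrix objects computed later in this document and then sorted by specificity. A minimal sketch (cm_rf, cm_gbm, etc. are hypothetical names for those objects):

#pull the relevant statistics out of a caret confusionMatrix object
cm_row <- function(cm, label) {
  data.frame(Model       = label,
             Accuracy    = cm$overall["Accuracy"],
             Sensitivity = cm$byClass["Sensitivity"],
             Specificity = cm$byClass["Specificity"],
             Kappa       = cm$overall["Kappa"],
             CI95        = sprintf("(%.4f, %.4f)", cm$overall["AccuracyLower"],
                                   cm$overall["AccuracyUpper"]))
}

results <- rbind(cm_row(cm_rf, "RF"), cm_row(cm_gbm, "GBM")) #etc. for the other models
results[order(-results$Specificity), ] #order by descending specificity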

Discussion

I thought I’d give a quick summary before all the code and workflow. It was clear early on that one variable, accum30 (the upslope accumulated area, UAA), was really important for predicting stream occurrence. This variable has long been known to be important for predicting subsurface soil moisture (Beven 1978; Beven and Kirkby 1979; Jencso et al. 2009), so I’m not too surprised. But it doesn’t explain the whole story in a climate context; in short, there are tradeoffs with models like the UAA threshold and the NHD layer (see the sensitivity and specificity table), which suggests the importance of climatic variables in the ML models (i.e., both ML models are stable across climates). This has been shown in a recent study where shallow subsurface flow and CWD interact differently depending on water and energy indices (Hoylman, Hu, and Jencso, in prep.), and in a meta-analysis (Pelletier et al. 2018) where water and energy feedbacks significantly control critical zone hillslope processes over space and time.

Both the NHD and UAA models have a hard time predicting these abrupt or gradual hydroclimatic changes, which are important to distinguish within a changing climate with regard to climatic niches and resource management. This can also be seen in the partial dependency plots, where CWD, cold air drainage (CAD), hillslope deficit index (HDI), median NPP, and decid were all helpful throughout the range of the data. In addition, all the models had similar accuracy (around 84-89%), even when splitting UAA by a threshold. This goes back to the tradeoffs: we want stable false positive and false negative rates across spatio-temporal water and energy feedbacks, which the ML models achieve and the controls do not. Topographic thresholds only account for terrain, and as I’ve seen in the field and in the data (i.e., terrain-only models mean low performance and misapplication), more hydroclimatic covariates are necessary to strengthen the predictive power for stream occurrence. We want to use terrain for all its salient properties but also use the important feedback patterns of water and energy to make better spatial predictions. You might be thinking that hydrologic response, climate, and topography are so intuitive and self-evident that it begs the question, “why is this important?” Here is an adage from Stephenson (1998) that I think sums up what practitioners may be thinking.

“Both intuition and measured soil moisture tell us that sites with high precipitation, on shaded slopes, on soils with high water-holding capacity, or close to groundwater are wetter than their opposites.” Stephenson (1998)

This is self-evident, but practitioners are having a hard time predicting and/or interpreting these large-scale processes without some marriage between climate, topography, machine learning, and remote sensing. This overlap of climate and hydrology is seriously lacking in the applied sciences and needs more investigation so that we as practitioners can make more informed decisions. Knowing where streams are likely to be located and observing patterns of water and energy will only make us better practitioners in the long run.

Some feedback I’ve received about the study suggests that adding trend data from remotely sensed images, as well as synthetic data (points I know are not streams and vice versa), could help model performance and assessment. I agree; this model needs more work, and I think those suggestions are a good start. However, I will continue with what I’ve done so far, so the rest is R code. It shouldn’t be hard to add those changes, or others, to this workflow, but I wanted to provide the framework first. Thanks, and please critique.

Modeling

I will start off by looking at the proportion of classes (“stream”, “no stream”). Zero (0) = “no stream” and one (1) = “stream”.

library(arcgisbinding)
arc.check_product()
## product: ArcGIS Pro (12.2.0.12813)
## license: Advanced
## version: 1.0.1.237
pts.model <- arc.open("prespointsLLO.shp") #points to the folder/feature
pts.model <- arc.select(pts.model, c("copy_TWI_1", "FID_ksank1")) #selects the data in the folder
pts.model <- arc.data2sp(pts.model)
barplot(prop.table(table(pts.model$copy_TWI_1)),
        main = "Proportion of stream occurence", 
        names.arg = c("No Stream", "Stream")) 

(prop.table(table(pts.model$copy_TWI_1)))
## 
##         0         1 
## 0.6119403 0.3880597

Looks like a decent split between the 0’s and 1’s. This proportion may change if more data are added.

We will now bring in the data. These rasters have been aggregated and resampled by nearest neighbor to retain their inherent values. Below is the process to get each TIFF to the correct CRS, resolution, extent, and dimensions. I did this for all the variables but will just show TPI30.

tpi30 <- aggregate(tpi, fact = 3, fun = mean) #aggregate from 10-m TPI to 30 m
tpi30 <- resample(tpi30, feet, method = "ngb") #snap to the grid of the 30-m template raster (feet) with nearest neighbor
writeRaster(tpi30, filename = "tpi30agg.tif", overwrite = TRUE)
tpi30 <- raster("tpi30agg.tif")

Now bring in all the TIFFs used in the analysis and stack/visualize them.

library(raster)
feet <- raster("feet.tif") #PHEAT

twi30 <- raster("twi30agg.tif") #TWI

vvsd30 <- raster("vvsd30agg.tif") #vertical vertical sd

vv30 <- raster("vv30agg.tif") #vertical vertical mean

npol30 <- raster("npol30agg.tif") #normalized polarization 

ndviAS30 <- raster("ndvias30agg.tif") #NDVI aug-sept

ndwiAS30 <- raster("ndwias30agg.tif") #NDWI aug-sept

accum30 <- raster("accum30.tif") #UAA d-infinity aggregated from 10-m

nppMid30 <- raster("nppmmid30agg.tif") #NPP median '86-18'

ndvi30yr <- raster("ndvi30yrRS.tif") #NDVI 30 yr

deficit <- raster("deficitRS.tif") # annual CWD 30 yr

wtrbdy30 <- raster("wtrbdy30agg.tif") #waterbodies

tpi30 <- raster("tpi30agg.tif") #TPI

topo_opt_rad30 <- stack(twi30, tpi30, accum30, vv30, vvsd30, npol30, ndvi30yr, ndviAS30, ndwiAS30, nppMid30, deficit, feet)
plot(topo_opt_rad30)

Extract the raster values from the topo_opt_rad30 stack at the sample points and then combine them with pts.model.

thrty.3 <- mask(topo_opt_rad30, wtrbdy30, inverse = TRUE) #mask out waterbodies
thrty.3 <- extract(thrty.3, pts.model) #extract raster values at each sample point

top30.opt.rad <- cbind(pts.model, thrty.3)
top30.opt.rad <- data.frame(top30.opt.rad)
top30.opt.rad <- na.omit(top30.opt.rad)
top30.opt.rad <- top30.opt.rad[-17] #drop column 17 (not needed for modeling)

write.csv(top30.opt.rad, file = "top30.opt.rad.csv")

Random Forest

Start the forward feature selection (FFS) workflow described in Meyer et al. (2018), using the random forest method. The goal is to reduce spatial auto-correlation and to screen out unhelpful or misleading variables. This takes about 5-10 minutes.

library(caret)
library(CAST)
#tidy
top30.opt.rad <- read.csv("top30.opt.rad.csv", header = TRUE)[-c(1,16,17)]
covs <- read.csv("holdenCovariates.csv")
top30.opt.rad <- cbind(top30.opt.rad, covs[,c(7:10)])
trainDat <- top30.opt.rad

names(trainDat)[1] <- "stream" #rename the response and HUC columns first...
names(trainDat)[2] <- "HUC"
trainDat$stream <- factor(trainDat$stream) #...then convert the response to a factor

trainDat <- na.omit(trainDat)
head(trainDat)
write.csv(trainDat, file = "trainDat.csv")

set.seed(1234)
#split data into training and test data. 80% for training. 
trainids <- createDataPartition(trainDat$stream, list = FALSE, p=0.80)
train <- trainDat[trainids,]
test <- trainDat[-trainids,]

set.seed(1234)
#this is used as the "leave location out (LLO)" parameter.
indices <- CreateSpacetimeFolds(train,spacevar = "HUC") #separate by 14th HUC
                                                

set.seed(1234)
ffsmodel_LLO <- ffs(train[,-c(1,2)],train$stream,
                    method="rf",verbose=FALSE,
                    trControl=trainControl(method="cv",
                                           index = indices$index))


#####"Note: No increase in performance found using more than 7 variables"

Now look at the results from the model, then fit a model using only the selected variables.

library(caret)
library(CAST)

ffsmodel_LLO
## Random Forest 
## 
## 849 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 776, 772, 811, 761, 766, 776, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8766864  0.7259439
##   3     0.8804269  0.7341424
##   5     0.8772367  0.7292442
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
ffsmodel_LLO$selectedvars
## [1] "accum30"   "HDI"       "tpi30agg"  "CAD"       "deficitRS"
plot_ffs(ffsmodel_LLO)

plot_ffs(ffs_model = ffsmodel_LLO, plotType = "selected")

set.seed(1234)

model_LLO <- train(train[,c(4,5,13,15,18)], #using the selected variables from the ffs model
                   train$stream,       #response variable
                   method="rf",importance=TRUE,
                   trControl=trainControl(method="cv", 
                                          index = indices$index)) # spacetimeFold
model_LLO
## Random Forest 
## 
## 849 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 776, 772, 811, 761, 766, 776, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8791881  0.7328775
##   3     0.8794966  0.7331337
##   5     0.8756085  0.7250208
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
confusionMatrix.train(model_LLO, norm = "none", dnn = c("Predicted", "Actual"))
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are un-normalized aggregated counts)
##  
##          Actual
## Predicted   0   1
##         0 487  63
##         1  37 262
##                             
##  Accuracy (average) : 0.8822
output <- predict(model_LLO, test)

confusionMatrix(test$stream, output) #now compare with test data
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 123   8
##          1  18  63
##                                           
##                Accuracy : 0.8774          
##                  95% CI : (0.8255, 0.9183)
##     No Information Rate : 0.6651          
##     P-Value [Acc > NIR] : 1.064e-12       
##                                           
##                   Kappa : 0.734           
##                                           
##  Mcnemar's Test P-Value : 0.07756         
##                                           
##             Sensitivity : 0.8723          
##             Specificity : 0.8873          
##          Pos Pred Value : 0.9389          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.6651          
##          Detection Rate : 0.5802          
##    Detection Prevalence : 0.6179          
##       Balanced Accuracy : 0.8798          
##                                           
##        'Positive' Class : 0               
## 

We can also look at partial plots of the selected variables.

library(pdp)
library(gridExtra) #needed for grid.arrange below

r1 <- partial(model_LLO, pred.var = "tpi30agg", plot = TRUE)
r2 <- partial(model_LLO, pred.var = "accum30", plot = TRUE)
r3 <- partial(model_LLO, pred.var = "deficitRS", plot = TRUE)
r4 <- partial(model_LLO, pred.var = "CAD", plot = TRUE)
r5 <- partial(model_LLO, pred.var = "HDI", plot = TRUE)


grid.arrange(r1, r2, r3, r4, r5, ncol = 4)  
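The Methods section mentions a spatial prediction map; below is a minimal sketch of how the fitted RF model could be pushed back out across the study area, assuming 30-m rasters exist for every selected covariate (hdi30 and cad30 are hypothetical rasters here, since in this workflow HDI and CAD were only provided as point values) and that the stack's layer names match the training column names.

hdi30 <- raster("hdi30.tif") #hypothetical 30-m HDI raster
cad30 <- raster("cad30.tif") #hypothetical 30-m CAD raster

pred_stack <- stack(accum30, hdi30, tpi30, cad30, deficit)
names(pred_stack) <- c("accum30", "HDI", "tpi30agg", "CAD", "deficitRS") #match model_LLO's predictors

stream_prob <- raster::predict(pred_stack, model_LLO, type = "prob", index = 2) #P(stream)
stream_map  <- raster::predict(pred_stack, model_LLO) #hard "stream"/"no stream" classification
writeRaster(stream_prob, filename = "streamProb30.tif", overwrite = TRUE)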

Gradient Boosting Machine

This section runs the same forward feature selection technique with the caret and CAST packages, but with gbm as the method. Gradient boosting can be highly accurate because it builds an ensemble of many weak learners sequentially, each new tree correcting the errors of the previous ones; its shrinkage and tree-depth tuning parameters help keep that flexibility from turning into overfitting.

set.seed(1234)
model_LLO2 <- ffs(train[,-c(1,2)],train$stream,
                   method="gbm", distribution = "bernoulli",
                   trControl= trainControl(method="cv",
                                           index = indices$index))
####[1] "maximum number of models that still need to be trained: 55"
####[1] "vars selected: accum30,npol30agg,CAD,decid,nppmmid30agg with Accuracy 0.891"
####Note: No increase in performance found using more than 5 variables

Now look at the model results and use the selected variables.

model_LLO2
## Stochastic Gradient Boosting 
## 
## 849 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 776, 772, 811, 761, 766, 776, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.8740816  0.7191617
##   1                  100      0.8722069  0.7150510
##   1                  150      0.8693590  0.7098342
##   2                   50      0.8716193  0.7144970
##   2                  100      0.8794925  0.7326078
##   2                  150      0.8831007  0.7416291
##   3                   50      0.8811710  0.7355537
##   3                  100      0.8830024  0.7393152
##   3                  150      0.8911650  0.7595401
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
model_LLO2$selectedvars
## [1] "accum30"      "npol30agg"    "CAD"          "decid"       
## [5] "nppmmid30agg"
plot_ffs(model_LLO2)

plot_ffs(ffs_model = model_LLO2, plotType = "selected")

set.seed(1234)
tgrid <- expand.grid(interaction.depth = 3,
                     n.trees = 150,
                     n.minobsinnode = 10,
                     shrinkage = c(0.01, 0.1))
GBM_LLO2 <- train(train[,c(5,8,12,15,17)],train$stream,
                   method="gbm", verbose = FALSE,distribution = "bernoulli",
                   trControl= trainControl(method="cv",
                                           index = indices$index),
                  tuneGrid = tgrid)
GBM_LLO2
## Stochastic Gradient Boosting 
## 
## 849 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 776, 772, 811, 761, 766, 776, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  Accuracy   Kappa    
##   0.01       0.8728278  0.7152301
##   0.10       0.8822080  0.7383677
## 
## Tuning parameter 'n.trees' was held constant at a value of 150
## 
## Tuning parameter 'interaction.depth' was held constant at a value of
##  3
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
confusionMatrix.train(GBM_LLO2, norm = "none")
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are un-normalized aggregated counts)
##  
##           Reference
## Prediction   0   1
##          0 483  63
##          1  41 262
##                             
##  Accuracy (average) : 0.8775
output2 <- predict(GBM_LLO2, test)
confusionMatrix(test$stream, output2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 121  10
##          1  19  62
##                                           
##                Accuracy : 0.8632          
##                  95% CI : (0.8095, 0.9064)
##     No Information Rate : 0.6604          
##     P-Value [Acc > NIR] : 1.668e-11       
##                                           
##                   Kappa : 0.704           
##                                           
##  Mcnemar's Test P-Value : 0.1374          
##                                           
##             Sensitivity : 0.8643          
##             Specificity : 0.8611          
##          Pos Pred Value : 0.9237          
##          Neg Pred Value : 0.7654          
##              Prevalence : 0.6604          
##          Detection Rate : 0.5708          
##    Detection Prevalence : 0.6179          
##       Balanced Accuracy : 0.8627          
##                                           
##        'Positive' Class : 0               
## 

We can also look at partial dependency plots.

library(pdp)
library(gridExtra) #needed for grid.arrange below
g1 <- partial(GBM_LLO2, pred.var = "CAD", plot = TRUE)
g2 <- partial(GBM_LLO2, pred.var = "accum30", plot = TRUE)
g3 <- partial(GBM_LLO2, pred.var = "nppmmid30agg", plot = TRUE)
g4 <- partial(GBM_LLO2, pred.var = "decid", plot = TRUE)
g5 <- partial(GBM_LLO2, pred.var = "npol30agg", plot = TRUE)

grid.arrange(g1, g2, g3, g4, g5, ncol = 3)  

Comparing

One way to check the models is to compare them with something already in place or a common threshold. Here we will look at the National Hydrography Dataset (NHD) at 30-m resolution and a UAA threshold of 10,783, which was the optimal split for separating the two classes.
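One standard way such a cutpoint can be found is by maximizing Youden's J along a ROC curve of UAA against the observed 0/1 stream labels. A minimal sketch with the pROC package (not necessarily how the value above was derived):

library(pROC)

roc_uaa <- roc(response = trainDat$stream, predictor = trainDat$accum30) #UAA vs. observed stream presence
coords(roc_uaa, x = "best", best.method = "youden", ret = "threshold")   #cutpoint maximizing sensitivity + specificity - 1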

nhd30 <- raster("nhdsixty.tif") #not 60m, just named wrong. Glad your paying attention!

pts.model <- arc.open("prespointsLLO.shp") #points to the folder/feature
pts.model <- arc.select(pts.model, c("copy_TWI_1", "FID_ksank1")) #selects the data in the folder
pts.model <- arc.data2sp(pts.model)

nhd30 <- raster::extract(nhd30, pts.model) #extract to points collected
## Warning in .local(x, y, ...): Transforming SpatialPoints to the CRS of the
## Raster
nhd30 <- data.frame(nhd30) 
names(nhd30)[1] <- "stream"
pts.model <- arc.sp2data(pts.model)
## Warning: arc.sp2data() will be removed. Use sp object directly in arc.*()
nhd30 <- cbind(pts.model, nhd30) #now combined points with data.frame

nhd30$stream <- ifelse(is.na(nhd30$stream), 0, 1) #NA means no NHD stream; convert to 0's and 1's

#now gather accum30 > 10,783

accumTest <- raster("accum30.tif") 

pts.model <- arc.open("prespointsLLO.shp") #points to the folder/feature
pts.model <- arc.select(pts.model, c("copy_TWI_1", "FID_ksank1")) #selects the data in the folder
pts.model <- arc.data2sp(pts.model)

accumTest <- raster::extract(accumTest, pts.model) #extract to points collected
## Warning in .local(x, y, ...): Transforming SpatialPoints to the CRS of the
## Raster
accumTest <- data.frame(accumTest) 
names(accumTest)[1] <- "stream"

pts.model <- arc.sp2data(pts.model)
## Warning: arc.sp2data() will be removed. Use sp object directly in arc.*()
accumTest <- cbind(pts.model, accumTest) #now combined points with data.frame

accumTest$stream <- ifelse(accumTest$stream >= 10783, 1,0) #change to 0's and 1's

Now we can check the results by looking at a confusion matrix.

#nhd 30 m Resolution

confusionMatrix(as.factor(nhd30$stream), as.factor(nhd30$copy_TWI_1))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 587  95
##          1  69 321
##                                           
##                Accuracy : 0.847           
##                  95% CI : (0.8241, 0.8681)
##     No Information Rate : 0.6119          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6742          
##                                           
##  Mcnemar's Test P-Value : 0.05092         
##                                           
##             Sensitivity : 0.8948          
##             Specificity : 0.7716          
##          Pos Pred Value : 0.8607          
##          Neg Pred Value : 0.8231          
##              Prevalence : 0.6119          
##          Detection Rate : 0.5476          
##    Detection Prevalence : 0.6362          
##       Balanced Accuracy : 0.8332          
##                                           
##        'Positive' Class : 0               
## 
#Upslope Accumulated Area Greater than 10,783

confusionMatrix(as.factor(accumTest$stream), as.factor(accumTest$copy_TWI_1))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 626 119
##          1  30 297
##                                           
##                Accuracy : 0.861           
##                  95% CI : (0.8388, 0.8812)
##     No Information Rate : 0.6119          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6954          
##                                           
##  Mcnemar's Test P-Value : 5.626e-13       
##                                           
##             Sensitivity : 0.9543          
##             Specificity : 0.7139          
##          Pos Pred Value : 0.8403          
##          Neg Pred Value : 0.9083          
##              Prevalence : 0.6119          
##          Detection Rate : 0.5840          
##    Detection Prevalence : 0.6950          
##       Balanced Accuracy : 0.8341          
##                                           
##        'Positive' Class : 0               
## 

Now we can look at all the results from a climatic water deficit (CWD) perspective, where CWD is split at 325 mm for all the models. First we look at the random forest model in the two different climates and see how well it predicts.

testingGRCWD <- dplyr::filter(test, test$deficitRS >= 325)
testingLSCWD <- dplyr::filter(test, test$deficitRS < 325)

outputGRCWD <- predict(model_LLO, testingGRCWD)
confusionMatrix(testingGRCWD$stream, outputGRCWD)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 61  6
##          1  9 38
##                                           
##                Accuracy : 0.8684          
##                  95% CI : (0.7923, 0.9244)
##     No Information Rate : 0.614           
##     P-Value [Acc > NIR] : 1.835e-09       
##                                           
##                   Kappa : 0.7259          
##                                           
##  Mcnemar's Test P-Value : 0.6056          
##                                           
##             Sensitivity : 0.8714          
##             Specificity : 0.8636          
##          Pos Pred Value : 0.9104          
##          Neg Pred Value : 0.8085          
##              Prevalence : 0.6140          
##          Detection Rate : 0.5351          
##    Detection Prevalence : 0.5877          
##       Balanced Accuracy : 0.8675          
##                                           
##        'Positive' Class : 0               
## 
outputLSCWD <- predict(model_LLO, testingLSCWD)
confusionMatrix(testingLSCWD$stream, outputLSCWD)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 62  2
##          1  9 25
##                                          
##                Accuracy : 0.8878         
##                  95% CI : (0.808, 0.9426)
##     No Information Rate : 0.7245         
##     P-Value [Acc > NIR] : 7.545e-05      
##                                          
##                   Kappa : 0.7397         
##                                          
##  Mcnemar's Test P-Value : 0.07044        
##                                          
##             Sensitivity : 0.8732         
##             Specificity : 0.9259         
##          Pos Pred Value : 0.9688         
##          Neg Pred Value : 0.7353         
##              Prevalence : 0.7245         
##          Detection Rate : 0.6327         
##    Detection Prevalence : 0.6531         
##       Balanced Accuracy : 0.8996         
##                                          
##        'Positive' Class : 0              
## 

Now look at the GBM model for two different climates and see how well it predicts.

outputGRCWD <- predict(GBM_LLO2, testingGRCWD)
confusionMatrix(testingGRCWD$stream, outputGRCWD)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 61  6
##          1  8 39
##                                           
##                Accuracy : 0.8772          
##                  95% CI : (0.8025, 0.9312)
##     No Information Rate : 0.6053          
##     P-Value [Acc > NIR] : 1.378e-10       
##                                           
##                   Kappa : 0.745           
##                                           
##  Mcnemar's Test P-Value : 0.7893          
##                                           
##             Sensitivity : 0.8841          
##             Specificity : 0.8667          
##          Pos Pred Value : 0.9104          
##          Neg Pred Value : 0.8298          
##              Prevalence : 0.6053          
##          Detection Rate : 0.5351          
##    Detection Prevalence : 0.5877          
##       Balanced Accuracy : 0.8754          
##                                           
##        'Positive' Class : 0               
## 
outputLSCWD <- predict(GBM_LLO2, testingLSCWD)
confusionMatrix(testingLSCWD$stream, outputLSCWD)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 60  4
##          1 11 23
##                                           
##                Accuracy : 0.8469          
##                  95% CI : (0.7601, 0.9117)
##     No Information Rate : 0.7245          
##     P-Value [Acc > NIR] : 0.003169        
##                                           
##                   Kappa : 0.6451          
##                                           
##  Mcnemar's Test P-Value : 0.121335        
##                                           
##             Sensitivity : 0.8451          
##             Specificity : 0.8519          
##          Pos Pred Value : 0.9375          
##          Neg Pred Value : 0.6765          
##              Prevalence : 0.7245          
##          Detection Rate : 0.6122          
##    Detection Prevalence : 0.6531          
##       Balanced Accuracy : 0.8485          
##                                           
##        'Positive' Class : 0               
## 

Now test the control predictors “NHD” and “UAA” for the two different climates and see how well they predict.

deficit <- raster("deficitRS.tif")

pts.model <- arc.open("prespointsLLO.shp") #points to the folder/feature
pts.model <- arc.select(pts.model, c("copy_TWI_1", "FID_ksank1")) #selects the data in the folder
pts.model <- arc.data2sp(pts.model)

def <- raster::extract(deficit, pts.model)
## Warning in .local(x, y, ...): Transforming SpatialPoints to the CRS of the
## Raster
accTest <- cbind(def, accumTest) #extract deficit values and combine with UAA threshold df
nhdTest <- cbind(def, nhd30)      #extract deficit values and combine with NHD30 df

#now run for UAA

AccGRCWD <- dplyr::filter(accTest, accTest$def >= 325)
AccLSCWD <- dplyr::filter(accTest, accTest$def < 325)

#greater than 325 for UAA
confusionMatrix(as.factor(AccGRCWD$stream), as.factor(AccGRCWD$copy_TWI_1))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 273  47
##          1  26 174
##                                           
##                Accuracy : 0.8596          
##                  95% CI : (0.8268, 0.8883)
##     No Information Rate : 0.575           
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7092          
##                                           
##  Mcnemar's Test P-Value : 0.01924         
##                                           
##             Sensitivity : 0.9130          
##             Specificity : 0.7873          
##          Pos Pred Value : 0.8531          
##          Neg Pred Value : 0.8700          
##              Prevalence : 0.5750          
##          Detection Rate : 0.5250          
##    Detection Prevalence : 0.6154          
##       Balanced Accuracy : 0.8502          
##                                           
##        'Positive' Class : 0               
## 
#less than 325 for UAA
confusionMatrix(as.factor(AccLSCWD$stream), as.factor(AccLSCWD$copy_TWI_1))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 353  72
##          1   4 123
##                                         
##                Accuracy : 0.8623        
##                  95% CI : (0.8307, 0.89)
##     No Information Rate : 0.6467        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.6728        
##                                         
##  Mcnemar's Test P-Value : 1.525e-14     
##                                         
##             Sensitivity : 0.9888        
##             Specificity : 0.6308        
##          Pos Pred Value : 0.8306        
##          Neg Pred Value : 0.9685        
##              Prevalence : 0.6467        
##          Detection Rate : 0.6395        
##    Detection Prevalence : 0.7699        
##       Balanced Accuracy : 0.8098        
##                                         
##        'Positive' Class : 0             
## 
#now run for NHD

nhdGRCWD <- dplyr::filter(nhdTest, nhdTest$def >= 325)
nhdLSCWD <- dplyr::filter(nhdTest, nhdTest$def < 325)

#greater than 325 for NHD
confusionMatrix(as.factor(nhdGRCWD$stream), as.factor(nhdGRCWD$copy_TWI_1))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 258  37
##          1  41 184
##                                           
##                Accuracy : 0.85            
##                  95% CI : (0.8164, 0.8796)
##     No Information Rate : 0.575           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6938          
##                                           
##  Mcnemar's Test P-Value : 0.7341          
##                                           
##             Sensitivity : 0.8629          
##             Specificity : 0.8326          
##          Pos Pred Value : 0.8746          
##          Neg Pred Value : 0.8178          
##              Prevalence : 0.5750          
##          Detection Rate : 0.4962          
##    Detection Prevalence : 0.5673          
##       Balanced Accuracy : 0.8477          
##                                           
##        'Positive' Class : 0               
## 
#less than 325 for NHD
confusionMatrix(as.factor(nhdLSCWD$stream), as.factor(nhdLSCWD$copy_TWI_1))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 329  58
##          1  28 137
##                                           
##                Accuracy : 0.8442          
##                  95% CI : (0.8112, 0.8735)
##     No Information Rate : 0.6467          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6467          
##                                           
##  Mcnemar's Test P-Value : 0.001765        
##                                           
##             Sensitivity : 0.9216          
##             Specificity : 0.7026          
##          Pos Pred Value : 0.8501          
##          Neg Pred Value : 0.8303          
##              Prevalence : 0.6467          
##          Detection Rate : 0.5960          
##    Detection Prevalence : 0.7011          
##       Balanced Accuracy : 0.8121          
##                                           
##        'Positive' Class : 0               
## 

References

Beven, Keith. 1978. “The Hydrological Response of Headwater and Sideslope Areas/La Réponse Hydrologique Des Zones de Cours Supérieurs et Des Zones de Pente Latérale.” Hydrological Sciences Journal 23 (4): 419–37.

Beven, Keith J., and Michael J. Kirkby. 1979. “A Physically Based, Variable Contributing Area Model of Basin Hydrology/Un Modèle à Base Physique de Zone d’appel Variable de L’hydrologie Du Bassin Versant.” Hydrological Sciences Journal 24 (1): 43–69.

Dobrowski, Solomon Z, John Abatzoglou, Alan K Swanson, Jonathan A Greenberg, Alison R Mynsberge, Zachary A Holden, and Michael K Schwartz. 2013. “The Climate Velocity of the Contiguous United States During the 20th Century.” Global Change Biology 19 (1): 241–51.

Gesch, Dean, Gayla Evans, James Mauck, John Hutchinson, William J Carswell Jr, and others. 2009. “The National Map—Elevation.” US Geological Survey Fact Sheet 3053 (4).

Gorelick, Noel, Matt Hancher, Mike Dixon, Simon Ilyushchenko, David Thau, and Rebecca Moore. 2017. “Google Earth Engine: Planetary-Scale Geospatial Analysis for Everyone.” Remote Sensing of Environment. https://doi.org/10.1016/j.rse.2017.06.031.

Hird, Jennifer, Evan DeLancey, Gregory McDermid, and Jahan Kariyeva. 2017. “Google Earth Engine, Open-Access Satellite Data, and Machine Learning in Support of Large-Area Probabilistic Wetland Mapping.” Remote Sensing 9 (12): 1315.

Holden, Z. A., A. Swanson, and others. In prep. “Using Climatic and Biophysical Variables to Model the Presence and Severity of Root Disease Across the U.S. Northern Rocky Mountains.”

Hoylman, Zachary H, Kelsey G Jencso, Jia Hu, Justin T Martin, Zachary A Holden, Carl A Seielstad, and Eric M Rowell. 2018. “Hillslope Topography Mediates Spatial Patterns of Ecosystem Sensitivity to Climate.” Journal of Geophysical Research: Biogeosciences 123 (2): 353–71.

Jencso, Kelsey G, Brian L McGlynn, Michael N Gooseff, Steven M Wondzell, Kenneth E Bencala, and Lucy A Marshall. 2009. “Hydrologic Connectivity Between Landscapes and Streams: Transferring Reach-and Plot-Scale Understanding to the Catchment Scale.” Water Resources Research 45 (4).

Kuhn, Max, Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, and others. 2017. “Caret: Classification and Regression Training. 2016.” R Package Version 4.

Lave, Rebecca. 2012. Fields and Streams: Stream Restoration, Neoliberalism, and the Future of Environmental Science. Vol. 12. University of Georgia Press.

Meyer, Hanna, Christoph Reudenbach, Tomislav Hengl, Marwan Katurji, and Thomas Nauss. 2018. “Improving Performance of Spatio-Temporal Machine Learning Models Using Forward Feature Selection and Target-Oriented Validation.” Environmental Modelling & Software 101: 1–9.

Pelletier, Jon D, Greg A Barron-Gafford, Hugo Gutiérrez-Jurado, Eve-Lyn S Hinckley, Erkan Istanbulluoglu, Luke A McGuire, Guo-Yue Niu, et al. 2018. “Which Way Do You Lean? Using Slope Aspect Variations to Understand Critical Zone Processes and Feedbacks.” Earth Surface Processes and Landforms 43 (5): 1133–54.

Raduła, Małgorzata W, Tomasz H Szymura, and Magdalena Szymura. 2018. “Topographic Wetness Index Explains Soil Moisture Better Than Bioindication with Ellenberg’s Indicator Values.” Ecological Indicators 85: 172–79.

Robinson, Nathaniel P, Brady W Allred, William K Smith, Matthew O Jones, Alvaro Moreno, Tyler A Erickson, David E Naugle, and Steven W Running. 2018. “Terrestrial Primary Production for the Conterminous United States Derived from Landsat 30 m and MODIS 250 m.” Remote Sensing in Ecology and Conservation 4 (3): 264–80.

Sörensen, Rasmus, Ursula Zinko, and Jan Seibert. 2006. “On the Calculation of the Topographic Wetness Index: Evaluation of Different Methods Based on Field Observations.” Hydrology and Earth System Sciences Discussions 10 (1): 101–12.

Stephenson, Nathan. 1998. “Actual Evapotranspiration and Deficit: Biologically Meaningful Correlates of Vegetation Distribution Across Spatial Scales.” Journal of Biogeography 25 (5): 855–70.

Supplementary Material (Code S1) to: Hird, Jennifer, Evan DeLancey, Gregory McDermid, and Jahan Kariyeva. 2017. “Google Earth Engine, Open-Access Satellite Data, and Machine Learning in Support of Large-Area Probabilistic Wetland Mapping.” Remote Sensing 9 (12): 1315.

Tarboton, David G. 2013. “TauDEM 5.2 Guide to Using the TauDEM Command Line Functions for TauDEM Multi-File.”

R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0. http://www.R-project.org.

Hoylman, Zachary H., Jia Hu, and Kelsey G. Jencso. In prep. “Hydrology Across the Critical Zone: Topography and Climate Control the Spatial and Temporal Organization of Vapor Pressure Deficit, Soil Moisture and Shallow Subsurface Flow Dynamics.”