Introduction
This is the workflow used to create the probabilistic spatial prediction map of stream occurrence in northwest Montana. Hopefully by sharing this workflow we can improve the analysis. The reason for such a dense, superfluous, and tedious workflow is that I like doing this type of analysis and it also helps me keep everything in perspective. However, it should be noted that this is not the only way to do it, nor is it intended to be a ‘cookbook’, so please take that into consideration. There are many, many methods for this type of analysis and a lot depends on the project details and objectives. Thus I want to start by highlighting some goals of this analysis and workflow, which may help summarise my rationale.
Balancing prediction performance with model interpretability while emphasising a spatial modeling framework.
Using feature selection techniques that will lower model processing time while reducing variance (i.e. accounting for spatial autocorrelation).
Optimizing climate, topographic, and remote sensing data to better predict stream occurrence and understand hydrologic processes at multiple scales.
Testing new spatial prediction techniques (forward feature selection, blockCV) against others (recursive feature elimination) so that future endeavors can be efficient while considering their limitations.
Please use the tabs to navigate through this workflow process. Here is a brief summary of each tab.
Accessing Data - If you want to get your hands on the data go here.
Methods - how we got to modeling:
Variable(s) - summary of predictor and response variable(s).
Pre-processing - examining predictor variables.
Data Splitting Guidance - determining how to split the data based on:
Evaluation Techniques - feature selection and tuning/evaluation techniques
Modeling - the actual modeling:
Discussion - final thoughts on results.
References - bibliography
Thank you for your time and critiques, as they will only make this better. Please add any comments to the document or email [joshua.l.erickson@usda.gov].
Accessing Data
Real quick, the csv data can be downloaded at csv, the points at points, and the tifs at TIF. The .tif files have been pre-processed so that projections and datums match, while the csv data is raw so that users can explore other models or techniques (e.g., CV methods, sampling, models) if they want to improve on this one. The points are provided to show where the data were collected and to support a ‘ground up’ analysis if desired. The point samples were collected from past and current surveys using the Montana “SMZ law” definitions, which you can find here: SMZ, page 3, and which are also described in Variables.
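A minimal sketch of pulling the downloads into R, assuming the files keep the names used later in this workflow (top30.opt.rad.csv, points_spaced30m.shp, and the *agg.tif grids); your local paths and names may differ.
library(raster)
library(sf)
csv_raw <- read.csv("top30.opt.rad.csv", header = TRUE)   #raw extracted csv
pts_raw <- st_read("points_spaced30m.shp")                #sample points
twi30   <- raster("twi30agg.tif")                         #one of the pre-processed tifs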
Summary
This analysis builds on a large body of research on the relationships between climate data, topographic indices, and hydrologic processes (BEVEN and Kirkby 1979; Dobrowski et al. 2013; Hird et al. 2017; Holden and others, in prep.; Hoylman et al. 2018, 2019; Jaeger et al. 2019; Lidberg, Nilsson, and Ågren 2020; Pelletier et al. 2018; Raduła, Szymura, and Szymura 2018; Robinson et al. 2018; Sando and Blasch 2015; Sörensen, Zinko, and Seibert 2006; Stephenson 1998). The methodology, concepts, and workflow for the analysis were adapted from (Friedman, Hastie, and Tibshirani 2001; James et al. 2013; Kuhn and Johnson 2013; Hird et al. 2017; Jaeger et al. 2019; Meyer et al. 2018, 2019), where the authors used machine learning techniques to predict different types of response variables. Variable categories will follow the three types (i.e., digital elevation models (DEM), radar images, and optical images) that Hird et al. (2017) used for predicting “wet” and “dry” areas in northern Alberta, Canada. This is also a common variable selection approach within a hydrologic context (Jaeger et al. 2019; Lidberg, Nilsson, and Ågren 2020; Sando and Blasch 2015).
Within the methods section, the goal is to emphasize the importance of accounting for spatial autocorrelation with regard to statistical learning validation techniques. It should be noted that autocorrelation will invalidate some of the assumptions that regression and parametric-type models rely on (e.g. independence); however, with ensemble modeling these assumptions are not necessarily as important. But ensemble modeling doesn’t get a free lunch either: spatial autocorrelation affects the validation techniques as well as feature selection (Meyer et al. 2018, 2019; Roberts et al. 2017; Valavi et al. 2018). Spatial autocorrelation affects all models in this way (e.g. validation becomes over-optimistic); however, it may be even more severe with ensemble modeling due to the propensity to overfit in the first place, i.e. flexible models typically have low bias and high variance and can overfit more than a less flexible model. I will also bring up the concern of using artificially balanced data sets.
This has been echoed by numerous authors (Kuhn and Johnson 2013; Matloff 2017; Meyer et al. 2018, 2019; Roberts et al. 2017; Valavi et al. 2018), who have provided different approaches for dealing with these concerns (e.g. validation techniques, prevalence rates). However, a feature selection process is still imperative within a spatial context, as Meyer et al. (2018) and Meyer et al. (2019) have shown, and should be investigated as well so the model can lessen the influence of any sneaky surrogate or geo-located variables that consistently overfit and plague spatial prediction covariates. This coupling of overfitting sources (e.g. validation and prevalence methods) leads to over-optimistic results and can significantly affect extrapolation and interpolation (Matloff 2017; Meyer et al. 2018, 2019; Roberts et al. 2017). I will take these approaches and apply them to the data collected in northwest Montana along with some traditional methods from machine learning workflows, e.g. data splitting, leakage, etc. This may seem stringent at times, but the goal is to have realistic predictions, within constrained timeframes, while being spatially responsible. Please, if any of these approaches or methods are confusing or contradictory, let me know and I would be glad to rethink my approach and respond. Thanks.
Predictor Variables
Digital elevation models (DEM) were used to generate the following: topographic wetness index (TWI), topographic position index (TPI), deficit continuous parameter grid (CPG), and upslope accumulated area (UAA). The methods used to derive these indices followed Tarboton (2013) by using the command-line functions from R (Team 2014). A 10-m USGS National Elevation Dataset DEM (Gesch et al. 2009) was used to calculate upslope accumulated area (UAA) with the d-infinity algorithm (Tarboton 2013). Topographic wetness index was then calculated at 10-m resolution using UAA and slope. Both TWI and UAA were then aggregated to 30-m resolution with a 3x3 mean window function and resampled with nearest neighbor to retain the intricacies of the flow path direction and to stack with the 30-m optical and radar images. Topographic position index (TPI) was derived from the 10-m USGS National Elevation Dataset DEM and was also aggregated to 30-m resolution using a 3x3 mean window function and resampled with nearest neighbor. The PRISM CPG was taken from R. Sando (2018), where the variable is created by accumulating a weighted grid (average annual PRISM data (1981-2010)) and normalizing it by the UAA, which gives a ratio of precipitation and accumulated area. For the deficit CPG, the 10-m flow direction grid (derived with the same process as above; Tarboton 2013) was used to accumulate a CWD-weighted grid, which was normalized by the UAA and then aggregated to 30 m to retain the finer-scale flow patterns. The reason to add climatic water deficit (CWD) was the observed interplay between these processes in the field.
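A hedged sketch of the TWI step described above, TWI = ln(a / tan(slope)), with a = d-infinity specific catchment area (UAA). The grid names (dem10slp.tif, DEM10sca.tif) follow the TauDEM outputs shown later in this workflow and are assumptions here; the author's exact inputs may differ.
library(raster)
uaa10 <- raster("DEM10sca.tif")      #10-m d-infinity contributing area
slp10 <- raster("dem10slp.tif")      #10-m slope from DinfFlowdir (already tan(beta))
slp10[slp10 <= 0] <- NA              #avoid dividing by zero on flat cells
twi10 <- log(uaa10 / slp10)
#aggregate to 30 m with a 3x3 mean window, then nearest-neighbor resample to the stack grid
ndvi30yr <- raster("ndvi30yrRS.tif")
twi30    <- aggregate(twi10, fact = 3, fun = mean)
twi30    <- resample(twi30, ndvi30yr, method = "ngb")
writeRaster(twi30, "twi30agg.tif", overwrite = TRUE)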
Radar image collections included 10-m Sentinel-1 polarimetric Synthetic Aperture Radar (SAR) images from ‘05-15’ to ‘08-31’ for each year (2015-2019), following the methods described in Hird et al. (2017): Normalized Polarization (Pol), Vertical Polarization (VV), and VV Standard Deviation (VVsd). These images were collected using Google Earth Engine (GEE) (Gorelick et al. 2017) and code from the Supplementary Material (Code S1) of Hird et al. (2017). They were then aggregated to 30-m resolution using a 3x3 mean window function and resampled with nearest neighbor.
Optical image collections included 10-m Sentinel-2 optical satellite images from August-September (‘08-01’ to ‘09-30’) for each year (2015-2019) to produce the Normalized Difference Vegetation Index (NDVI) and Normalized Difference Water Index (NDWI). In addition to NDVI/NDWI, the 10-m Sentinel-2 Blue (B2), Green (B3), Red (B4), and Near Infrared (B8) bands from August-September (‘08-01’ to ‘09-30’) for each year (2015-2019) were also used. These were again aggregated to 30-m resolution using a 3x3 mean window function and resampled with nearest neighbor. Median Net Primary Productivity (NPP) (Robinson et al. 2018) from 1986-2018 and annual 30-m Climatic Water Deficit (CWD) (Holden and others, in prep.) from 1986-2016 were also used as predictors. NDVI, NDWI, and median NPP were collected using GEE. The 30-year annual (July-Sept; 1986-2016) NDVI, annual Climatic Water Deficit (CWD), Cold Air Drainage (CAD), Deciduous (Decid), and Hydrometeorological Dryness Index (HDI) grids were taken from Holden and others (in prep.).
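A hedged illustration of the band math behind the two indices, using the Sentinel-2 grids already in the stack: NDVI = (B8 - B4)/(B8 + B4) and the McFeeters NDWI = (B3 - B8)/(B3 + B8). In this workflow the indices were produced in GEE; the raster-based version below is only for clarity, and Hird et al. (2017) may use a different NDWI band pair.
library(raster)
B3 <- raster("B3_30agg.tif")   #green
B4 <- raster("B4_30agg.tif")   #red
B8 <- raster("B8_30agg.tif")   #near infrared
ndvi <- (B8 - B4) / (B8 + B4)
ndwi <- (B3 - B8) / (B3 + B8)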
Now bring in all the TIFs to be used in the analysis and stack/visualise them.
library(raster)
twi30 <- raster("twi30agg.tif") #TWI
vvsd30 <- raster("vvsd30agg.tif") #vertical vertical sd
vv30 <- raster("vv30agg.tif") #vertical vertical mean
npol30 <- raster("npol30agg.tif") #normalized polarization
ndviAS30 <- raster("ndvias30agg.tif") #NDVI aug-sept
ndwiAS30 <- raster("ndwias30agg.tif") #NDWI aug-sept
accum30 <- raster("accum30.tif") #UAA d-infinity aggregated from 10-m
nppMid30 <- raster("nppmmid30agg.tif") #NPP median '86-18'
ndvi30yr <- raster("ndvi30yrRS.tif") #NDVI 30 yr
deficit <- raster("deficitRS.tif") # annual CWD 30 yr
wtrbdy30 <- raster("wtrbdy30agg.tif") #waterbodies
tpi30 <- raster("tpi30agg.tif") #TPI
HDI <- raster("hdi30RS.tif") #hydrologic deficit index
CAD <- raster("cad30RS.tif") #cold air drainage
decid <- raster("decid30RS.tif") #deciduous
B2 <- raster("B2_30agg.tif") #blue (B2)
B3 <- raster("B3_30agg.tif") #green (B3)
B4 <- raster("B4_30agg.tif") #red (B4)
B8 <- raster("B8_30agg.tif") #Near Infrared (B8)
cpgPrecip <- raster("cpg30precip.tif") #Continuous parameter grid (precipitation/PRISM).
cpgDeficit <- raster("cpg30Deficit.tif") #continuous parameter grid (deficit).
topo_opt_rad30 <- stack(twi30, tpi30, accum30, vv30, vvsd30, npol30, ndvi30yr, ndviAS30, ndwiAS30, nppMid30, deficit, HDI, CAD, decid, B2, B3, B4, B8, cpgPrecip, cpgDeficit)
Response Variable
The response variable was collected in a few different ways: historic and current data observations. The historic observations were collected from 1992-2004 by US Forest Service personnel. They would walk drainages and identify streams using the stream classification protocols of the time (recommend copying the link address) (SMZ, page 3, Kootenai National Forest stream classifications).
These were then evaluated against current data observations in places where the methods overlapped. Points were then placed along the surveyed data at a 30-meter spacing. In addition, areas with past timber sales were used as places to interpret “no stream.” Now, there are a lot of assumptions with this type of data collection (observer bias, overlooked areas, streams change, etc.); however, by observing these areas with the current data observations it was validated in an unsupervised manner (using current data, old maps, professional judgement, remote sensing, etc.) with high agreement.
The data that overlapped showed strong similarities in the results, which made me more confident in using these data collection methods. The current data collection was performed from June-Oct in 2019. These collection methods followed the same protocol as the SMZ method. Data was collected with a Samsung Galaxy Tab A tablet with +/- 15-30 foot accuracy. This was another reason to only use 30-meter resolution grids in the analysis; the accuracy at which the data was collected would have been suspect at 10 meters. The current data collection only accounted for approximately 1,500 points but was very helpful in validating the historic data as well. Below shows how the raster stack of all the predictor variables was extracted at the point samples.
Extract values from the stack topo_opt_rad30 at the points and then combine with pts.
## *** Please call arc.check_product() to define a desktop license.
## product: ArcGIS Pro (12.2.0.12813)
## license: Advanced
## version: 1.0.1.237
#combine point objects
pts <- arc.open("points_spaced30m.shp") #points to the folder/feature
pts <- arc.select(pts, c("copy_TWI_1", "FID_ksank1", "FID_ksan_1")) #selects the data in the folder
#copy_TWI_1 = stream or no stream, FID_ksank1 & FID_ksan_1 are just the
#HUC 12 and HUC 14 polygon attributes
#we will use later for leave location out (LLO).
pts <- arc.data2sp(pts) #now we have a spatial points data frame
library(sp)
library(rgdal)
library(rgeos)
deficit <- raster("deficitRS.tif") # annual CWD 30 yr
thrty.3 <- mask(topo_opt_rad30, wtrbdy30, inverse = TRUE) #mask out waterbodies if you want
#Get forest service boundary
FSland <- readOGR("KNF_ownership.shp")
FSland <- FSland[FSland$OWNER == 'FS',]
#Get District Boundary
DistrictBoundary <- readOGR("fortine_rexford_RD_Bdy.shp")
#now intersect and project
landClip <- gIntersection(FSland, DistrictBoundary, byid = TRUE)
tpiCRS <- crs(tpi30)
landClip <- spTransform(landClip, tpiCRS)
thrty.3 <- mask(thrty.3, landClip) #mask out non-Forest Service land
thrty.3 <- raster::extract(thrty.3, pts) #extract the values from raster by point data
top30.opt.rad <- cbind(pts, thrty.3)
top30.opt.rad <- data.frame(top30.opt.rad)
plot(thrty.3)
write.csv(top30.opt.rad, file = "top30.opt.rad.csv")
Pre-processing
Pre-processing data is an important step in any data analysis. With pre-processing we will look at how the data is distributed (shape) and correlated. Typically with environmental science data, transformations are needed since the data normally follows a skewed pattern (i.e. lots of small events with periodic high events, or vice versa) (Van Bell 2011). However, we will be using tree-based classification methods (e.g., RandomForest, Gradient Boost) for the predictive model; thus, transformations will not be as important because tree-based classification models are not sensitive to scale or distributions (i.e., the metrics used for partitioning nodes don’t involve regression-type statistics (RSE); instead the feature space is split by purity measures, e.g. Gini index or cross-entropy). However, just like any statistical learning practice, context matters, so you may still want to transform your data, even with a tree-based classification method. That being said, I still think it is prudent to look at the data descriptively before we jump in, regardless of model type.
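A small illustration of the purity measures mentioned above for a node with class proportions p: Gini = 1 - sum(p^2) and cross-entropy = -sum(p * log(p)). Neither depends on the scale or distribution of the predictors, which is why transformations matter less for tree-based splits. The 0.45/0.55 split mirrors our response proportions shown later.
p <- c(no_stream = 0.45, stream = 0.55)   #example node proportions
gini    <- 1 - sum(p^2)                   #~0.495
entropy <- -sum(p * log(p))               #~0.688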
To start we can look at the shape of the data and how it plots against other variables. This involves looking at histograms and scatter plots of the data visually. Since we have 19 predictors it would be pretty difficult to look at scatter plots, so we will just look at histograms (sidebar - I’ve looked at the scatter plots individually and there are some interesting trends, but they’re mostly uncorrelated). Let’s bring in the data we just created in the Variables section and start exploring.
top30.opt.rad <- read.csv("top30.opt.rad.csv", header = TRUE)[-1]
data <- top30.opt.rad #change for ease
data <- data[,-26] #take out irrelevant index
#change names
names(data)[2] <- "HUC12"
names(data)[3] <- "HUC14"
names(data)[1] <- "stream"
data$stream <- factor(data$stream)
data <- na.omit(data)
# write csv
write.csv(data, file = "trainDat.csv")
Now let's look at the histograms of the predictor variables.
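A minimal sketch of the histogram panel referenced below, assuming the cleaned data frame `data` from the chunk above (response and HUC ids excluded; the coordinate columns are left in).
num_preds <- setdiff(names(data), c("stream", "HUC12", "HUC14"))
op <- par(mfrow = c(5, 5), mar = c(2, 2, 2, 1))
for (v in num_preds) hist(data[[v]], main = v, xlab = "")   #one histogram per predictor
par(op)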
From the graphs above it looks like there are some outliers in the data set. However, the tree-based models we will use are not that sensitive to outliers; again, it should be noted that you don’t necessarily want to remove outliers hastily, and understanding the data is key. No free lunch! On the other hand, correlation statistics can be sensitive to outliers and tree-based classification models can have a hard time with correlated variables. So we will want to explore any noticeable outliers before doing any correlation estimates. I’ll just use a simple outlier check by looking at the histograms. From the histograms, three variables looked suspect: npol30agg, hdi30RS, and accum30, so we will look at those.
Both npol30agg and accum30 look like they have a couple of significant outliers. Leaving accum30 alone will be fine (since a UAA that large is plausible), but npol30agg needs to be between -1 and 1, theoretically. Removing those values is totally appropriate because it doesn’t affect the ‘true’ distribution of the data. In contrast, if we took out outliers that were theoretically possible, then the test data would get a peek at the ‘new’ distribution, i.e. data leakage. We’ll talk about data leakage later, but remember to not hastily take out data without taking leakage into consideration.
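A hedged sketch of the fix described above: values of npol30agg outside the theoretical [-1, 1] range are treated as errors and dropped, leaving the rest of the distribution untouched.
data$npol30agg[data$npol30agg < -1 | data$npol30agg > 1] <- NA   #flag impossible values
data <- na.omit(data)                                            #drop the flagged rows
hist(data$npol30agg, main = "npol30agg bounded to [-1, 1]")
boxplot(data$npol30agg, horizontal = TRUE)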
Now look at the new boxplot and histogram for npol30agg, which looks much better.
The next step is to look for correlated data. Correlations in the data will give us an idea of redundant information (e.g. collinearity). We can investigate correlation visually through the function corrplot and look for any collinearity among the predictors, or see if there is any clustering with hclust. These types of plots are very useful when the number of predictors is above 20 (pairs.plot gets overwhelmed) and give a quick view of what’s going on. After visualising, we can take a deeper look and search the data frame for correlation using a certain threshold (e.g., 90%). This is possible with caret’s findCorrelation function. That is the cutoff threshold that Hird et al. (2017) and Lidberg, Nilsson, and Ågren (2020) used, but it might be worth exploring some lower cutoffs. I tried 60% to start off and got eight variables.
## corrplot 0.84 loaded
library(caret)
correlations <- cor(data[,-c(1,2,3,24,25)]) #remove response,coords and HUCs
corrplot(correlations, order = "hclust")
## [1] 15 14 10 11 17 4
There are eight variables recommended to be taken out. Let’s see what they are.
## [1] "B4_30agg" "B3_30agg" "coords.x2" "hdi30RS" "cad30RS" "cpg30precip" "ndwias30agg" "vv30agg"
There are a few ways we can approach this: take them out, do some sort of dimension reduction, or raise the threshold. We can check this by moving the cutoff up to, let’s say, 75%, which is a more conservative value.
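A hedged example of re-running caret's findCorrelation at the more conservative cutoff; the index and name listings shown below come from this style of call.
highCorr75 <- caret::findCorrelation(correlations, cutoff = 0.75)   #flag pairs above 0.75
highCorr75
names(data[ , -c(1, 2, 3, 24, 25)])[highCorr75]                     #translate indices to names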
## [1] 15 14 10 17
## [1] "B4_30agg" "B3_30agg" "coords.x2" "hdi30RS" "cpg30precip" "ndwias30agg"
With this threshold of 75% we ended up only dropping cad30RS and vv30agg from the list. It makes sense why the remaining variables would be correlated: cpg30Deficit and deficitRS both involve CWD, the B’s are all reflectances, and ndvias30agg is highly correlated with its counterpart ndwias30agg and with B8. Keeping these variables in the analysis is likely to be counterproductive unless some sort of dimension reduction is applied. The disadvantage of dimension reduction is that you lose interpretability for these variables and you also might miss the mark on causation. Advantages include using decorrelated variables that normally would have been taken out of the model, and reducing dimensions in statistical learning is always a worthwhile consideration (e.g. the ‘curse of dimensionality’). It’s a tricky decision.
There is no clearcut way of doing this, so suggestions are welcome. Again, no free lunch! I decided to drop ndvias30agg, and here’s why: I don’t mind dropping ndvias30agg because we have ndwias30agg and we also have a long-term NDVI, ndvi30yrRS, as well. By dropping this variable we remove some redundancy and keep some interpretability, which I think is a good trade-off in this analysis. However, I decided to do a principal component analysis (PCA) with the reflectance band variables and the two CWD variables. One could probably argue to just take away one of the deficit variables, but for some reason my confirmation bias wants to keep them both. So bear with me on the rationale below, as it would be just as easy to take one out.
Interpreting the reflectances seems like a stretch in the first place, so the trade-off comes down to possible model performance over interpretability. Therefore, I would rather keep them in the analysis, losing some interpretability in exchange for possibly better model performance. It’s a little trickier with the CWD variables: we would want some interpretability from them, but we are going to have to lose that with a reduction; again, trade-offs. I think they are both below a 90% threshold, so leaving them wouldn’t be the worst thing, but we also have hdi30RS as a predictor, which is a recently studied, validated, and published model for water and energy feedbacks. Therefore, reducing the dimension of deficitRS and cpg30Deficit at the expense of interpretability isn’t as big of a loss. However, if we didn’t have hdi30RS I would probably consider dropping one.
Now we want to make sure that we don’t leak any data in the process (i.e. data leakage), so we will need to fit the components during cross-validation. What this means is that you don’t want to fit the PCA on both the test and training data; thus, running a PCA prior to cross-validation is a concern and should be avoided. The PCA will be done within the cross-validation procedure in the caret function using a recipe, which allows us to be specific about the variables that we want to preProcess. So let’s take out the predictor variable ndvias30agg, which we determined to be redundant and collinear, and wait on the PCA until later during the modeling.
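A hedged sketch of keeping the PCA inside cross-validation with a recipe, so the held-out folds never see components fit on the full data. The selected variables follow the discussion above (reflectance bands plus the two CWD grids); the actual recipe used in the Modeling section may differ.
library(recipes)
library(dplyr)   #for the pipe
pca_rec <- recipe(stream ~ ., data = data) %>%
  update_role(HUC12, HUC14, coords.x1, coords.x2, new_role = "id") %>%                 #keep ids/coords out of the model
  step_normalize(B2_30agg, B3_30agg, B4_30agg, B8_30agg, deficitRS, cpg30Deficit) %>%  #center/scale before PCA
  step_pca(B2_30agg, B3_30agg, B4_30agg, B8_30agg, num_comp = 2, prefix = "refl_PC") %>%
  step_pca(deficitRS, cpg30Deficit, num_comp = 1, prefix = "cwd_PC")
#hand pca_rec to caret::train(pca_rec, data = ..., trControl = ...) so the PCA is
#re-estimated within every resample rather than once on the full data set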
*Update: I’m also dropping cpg30Deficit. In short, not worth it. Will be exploring it in the future though.
Now we can move on to the spatial pre-processing section.
Spatial Pre-processing
Pre-processing the spatial data is a very important step (e.g., projections, resolution, etc.) and can be troublesome if not done correctly or if overlooked. The predictor variables mentioned in the Variables section used the pre-processing techniques below. These data started at 10-m resolution (e.g., tpi, twi, uaa, ndvi/ndwi Aug-Sept, vv, vv sd, npol) and were then aggregated by a mean function (3x3 window) and resampled with nearest neighbor to retain the inherent values from the aggregation. The aggregated and resampled data were then saved as TIFs. Below is the process to get each TIF from 10-m to 30-m resolution. Resampling was done to make sure any changes from the aggregation were rectified, if any. This pre-processing was also applied to the 30-m spatial data that didn’t start at 10 m, to ensure the correct CRS.
tpi30 <- aggregate(tpi, fact = 3, fun = mean) #aggregate from 10-m TPI
tpi30 <- resample(tpi30, ndvi30yr, method = "ngb")
writeRaster(tpi30, filename = "tpi30agg.tif", overwrite = TRUE)
tpi30 <- raster("tpi30agg.tif")
To check the correct CRS, resolution, extent, and dimensions for all the variables I used the process below. However, I will only show the first couple of grids. The data is already pre-processed, so I'll bring in a different grid to show the process.
## class : RasterLayer
## dimensions : 2236, 2411, 5390996 (nrow, ncol, ncell)
## resolution : 30, 30 (x, y)
## extent : 150040.6, 222370.6, 507396.9, 574476.9 (xmin, xmax, ymin, ymax)
## crs : +proj=aea +lat_1=46 +lat_2=48 +lat_0=44 +lon_0=-109.5 +x_0=600000 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0
## source : D:/Rcodes/FDir/dem30test.tif
## names : dem30test
## values : 221, 2517 (min, max)
## class : RasterLayer
## dimensions : 2237, 2411, 5393407 (nrow, ncol, ncell)
## resolution : 30, 30 (x, y)
## extent : 150035.8, 222365.8, 507390, 574500 (xmin, xmax, ymin, ymax)
## crs : +proj=aea +lat_1=46 +lat_2=48 +lat_0=44 +lon_0=-109.5 +x_0=600000 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## source : D:/Rcodes/Water_Prediction/Hird_Water_Prediction/waterD/waterPred/Final_workflow/tpi30agg.tif
## names : tpi30agg
## values : -24.60378, 35.87562 (min, max)
Now project dem30test to tpi30 using the function projectRaster.
tpiCRS <- crs(tpi30)
dem30proj <- projectRaster(dem30test, tpi30, res = 30, crs = tpiCRS, method = "ngb")
dem30proj <- raster::resample(dem30proj, tpi30, method = "ngb")
writeRaster(dem30proj, "dem30proj.tif", overwrite = TRUE)
Now the data is in the same projection, ellipsoid, datum, extent, and dimension. Do this for all the grids in the data set. The reason for Albers conic equal area (+proj=aea) is that the DEM was in this projection and the flow accumulation, twi, and tpi were built off of it. In the end I don't think it matters too much since the sample space is relatively local, but any comments would be appreciated.
## class : RasterLayer
## dimensions : 2237, 2411, 5393407 (nrow, ncol, ncell)
## resolution : 30, 30 (x, y)
## extent : 150035.8, 222365.8, 507390, 574500 (xmin, xmax, ymin, ymax)
## crs : +proj=aea +lat_1=46 +lat_2=48 +lat_0=44 +lon_0=-109.5 +x_0=600000 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## source : memory
## names : dem30test
## values : 221, 2517 (min, max)
## class : RasterLayer
## dimensions : 2237, 2411, 5393407 (nrow, ncol, ncell)
## resolution : 30, 30 (x, y)
## extent : 150035.8, 222365.8, 507390, 574500 (xmin, xmax, ymin, ymax)
## crs : +proj=aea +lat_1=46 +lat_2=48 +lat_0=44 +lon_0=-109.5 +x_0=600000 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## source : D:/Rcodes/Water_Prediction/Hird_Water_Prediction/waterD/waterPred/Final_workflow/tpi30agg.tif
## names : tpi30agg
## values : -24.60378, 35.87562 (min, max)
Here’s how I derived the CPG grid for deficit.
tpi <- raster("tpi.tif") #bring in tpi for cropping
z=raster("D:/Rcodes/FDir/dem10m.tif") #bring in 10-m DEM
c <- crop(z,tpi) #crop to tpi
writeRaster(c, "croppedDEM10.tif")
c <- resample(z, tpi, method = "ngb") #resample with tpi
c <- mask(c,tpi) #mask with tpi
writeRaster(c, "dem10m.tif", overwrite = TRUE)
#####This is the command-line Tarboton TauDEM process####
# Pitremove
system("mpiexec -n 8 pitremove -z dem10m.tif -fel dem10fel.tif")
fel=raster("dem10fel.tif")
plot(fel)
zoom(fel)
# DInf flow directions
system("mpiexec -n 8 DinfFlowdir -ang dem10ang.tif -slp dem10slp.tif -fel dem10fel.tif -nc",show.output.on.console=F,invisible=F)
dem10ang <- raster("dem10ang.tif")
defdis <- disaggregate(deficit, fact=3) #downscale deficit 30-m raster
s <- resample(defdis, dem10ang, method = "ngb") #then resample
s <- mask(s, dem10ang) #mask to get it aligned with the angle raster
s[is.na(s[])] <- 0 #now set NA's equal to zero so TauDEM doesn't fuss
writeRaster(s, "defdis.tif", overwrite = TRUE)
# Dinf contributing area
#weighted with deficit
system("mpiexec -n 8 AreaDinf -ang dem10ang.tif -sca DEF10sca.tif -wg defdis.tif -nc")
#normal UAA
system("mpiexec -n 8 AreaDinf -ang dem10ang.tif -sca DEM10sca.tif -nc")
#now do the math
cpg30Deficit <- overlay(DEF10sca, DEM10sca, fun = function(r1,r2){return(r1/r2)})
#then aggregate to 30-m to retain finer scale flow patterns
cpg30Deficit <- aggregate(cpg30Deficit, fact=3, fun=mean)
#then get it lined up with the other rasters
cpg30Deficit <- projectRaster(cpg30Deficit, tpi30, res = 30, crs = tpiCRS, method = "ngb")
cpg30Deficit <- raster::resample(cpg30Deficit, tpi30, method = "ngb")
writeRaster(cpg30Deficit, "cpg30Deficit.tif", overwrite = TRUE)
Data Splitting Guidance
Data splitting is a crucial part of the modeling workflow because it determines how the data gets trained/tuned and tested, which ultimately affects performance (i.e. interpolation and extrapolation) (Kuhn and Johnson 2013). This section will have a lot of details that at times will seem tedious, redundant, stringent, and nuanced. My apologies. However, I think the cost-benefit of following these approaches will only help correct for overfitting and insufficient model interpretations. That is to say, the cost of exploring these details is low while the benefit is high when compared to the complement, e.g. plug and chug.
Different data splitting approaches can influence the risk of overfitting (Friedman, Hastie, and Tibshirani 2001; Matloff 2017; James et al. 2013; Kuhn and Johnson 2013) and data leakage (Kaufman et al. 2012), especially within a spatial context (Bahn and McGill 2013; Elith, Leathwick, and Hastie 2008; Ives and Zhu 2006; Meyer et al. 2018, 2019; Roberts et al. 2017). One way to pursue these splitting approaches while avoiding pitfalls is through how you split your data and/or interpret and choose your data (predictors), i.e. Data Splitting Guidance. This is one of the sections in the workflow where I got into some really deep rabbit holes and went back and forth on different strategies. That being said, perspective or comments in this section would be much appreciated. The actual data splitting will be done in the ‘Modeling’ section; think of this as the rationale for how we got there. Thus I will take the steps below, which are a blend of Kuhn and Johnson (2013), Matloff (2017), Meyer et al. (2019), Roberts et al. (2017), and Valavi et al. (2018), as ways to approach structured and/or imbalanced/rare data in a statistical learning context. Below is an outline of this section.
Proportion
I will start off by looking at the proportion of classes (“stream”, “no stream”) in the response variable. Zero (0) = “no stream” and one (1) = “stream”. I used the arcgisbinding package, but of course other methods can be used. I like interfacing between ArcGIS and R because of the visuals; however, packages like mapview are just as good for exploring visually and are free! As always, you can bring in the points_spaced30m.shp with the sf package.
barplot(prop.table(table(pts$copy_TWI_1)),
main = "Proportion of stream occurrence",
names.arg = c("No Stream", "Stream"), ylab = "Proportion %")
| Stream | Freq |
|---|---|
| 0 | 5661 |
| 1 | 6910 |
| Sum | 12571 |
| Stream | Freq |
|---|---|
| 0 | 0.45 |
| 1 | 0.55 |
The proportion and count of the response bring up some important questions. Should we downsample or upsample, or leave it how it is? How much data should we use for evaluating performance, e.g. training/tuning? What is the prevalence of stream occurrence in the first place (the prior)? These are all important questions and I hope to address them below, but I am also looking for additional insight into this process.
Balanced?
Within environmental classification models there can be a natural imbalance when collecting the data randomly, i.e. the target class is typically rare. Downsampling or upsampling can correct for this class imbalance when training, which means the model doesn’t have to deal with the imbalance while training. However, it’s not that imbalanced data sets are inherently a problem that must be balanced; it ultimately depends on the goals and objectives of the project, i.e. sometimes the minority group is so rare (<1%) that balancing might help spot trends in prediction, which can help with interpretation and selection. We will explore the “to balance or not to balance?” question in the paragraphs below.
Typically you want to sample randomly so that you end up with an unbiased sampling distribution in the first place. If a random sampling method were pursued in our case, the results would most likely be a 1-10% proportion for the minority class (stream occurrence) and the rest going to no occurrence (>90%), depending on location. This means having really, really few samples of our response target class, i.e. stream occurrence! As you saw in the proportion section, we have a higher proportion of streams occurring compared to streams not occurring, which as you can imagine is not a realistic distribution. This artificial distribution between classes highlights the observation bias in our sampling scheme and points to some flaws we might run into when we model.
Consequently, collecting data this way (i.e. with observation bias) lends itself to a balanced data set. This is not a bad thing, but it should be noted and always kept in the back of the modeler’s mind, i.e. we have a lot of “stream occurring” samples. Therefore, leaving the sample proportion the way it is could potentially be adequate for our analysis. However, there are conflicting interpretations and suggestions for dealing with these imbalances (Kuhn and Johnson 2013; Matloff 2017). So just remember that the training data is biased towards what is naturally the minority class, and it might be advantageous to train on natural prevalence distributions between classes.
This does bring up a good point though, e.g. prevalence: how should the test data be constructed or used? This is where spatial feature selection and validation techniques become necessary, because the data are collections of non-independent realizations from the sampling process and also have a serious imbalance (if collected randomly). Thus spatial autocorrelation is present in our sampling scheme along with a ‘rare-ish’ prevalence rate. I want you to keep this topic in mind. That is to say, the minority group is stream, the majority group is no stream, and what’s the actual rate of prevalence (it’s definitely not 50/50!) nested within a spatial context? This might give some insight into how to split and evaluate the data so that we can achieve optimal results while accounting for these conditions!
Ultimately we want a model that can perform well but is also aware of what the natural conditions are, so that we can make more informed decisions under uncertainty. Testing these two approaches (balanced vs. unbalanced/natural) will be an interesting topic and study, as others have alluded to this concern (Kuhn and Johnson 2013; Matloff 2017; Roberts et al. 2017).
Cause and effect
One way to start anticipating the data splitting process is to go over the predictor variables and examine whether or not they would be appropriate predictors in the first place. This is really important if the goal is extrapolation, because irrelevant variables can fail big time when going beyond your ‘small world’ model. Surrogate variables are common traps to avoid with any model and should be met with skepticism (i.e. spurious correlation). Typical surrogate variables in environmental sciences include but are not limited to: elevation, coordinates, slope, ids, similar ratios, dummy variables, etc. (Roberts et al. 2017; Meyer et al. 2018, 2019; Van Bell 2011). The goal is to avoid these types of variables in the first place and/or transform them appropriately; therefore, a thorough evaluation of your variables is a good idea. However, there are many times when the number of predictor variables is so large that any sane person would not dare approach a one-by-one strategy. Hence, spatial feature selection and validation techniques (Roberts et al. 2017; Meyer et al. 2018, 2019; Valavi et al. 2018) have been provided by these authors to lessen the strain and time on the modeler within these structured dependence domains. These feature and validation techniques are an option for searching your variables if the size of p is too demanding, but it’s always good to give them an adequate examination if not (this is especially true if you’re out of your domain, e.g. a hydrologist determining wildlife predictors).
In our example there are 16 predictors that were specifically chosen from hydrologically relevant papers (Hird et al. 2017; Hoylman et al. 2019; Jaeger et al. 2019; Lidberg, Nilsson, and Ågren 2020; Sando and Blasch 2015) where the response (water) and predictors (remote sensing data) were used to optimize prediction. Three other variables, decid, cad, and deficitCPG, are new to modeling hydrologic processes, from what I could gather. Of these three, the one I could see being spurious is deficitCPG. This variable will be one I keep my eye on through the modeling process. On the other hand, all the other variables we’ll be using are considered relevant within a hydrological context and appropriate for cause-and-effect relationships, especially when considering critical zone processes (Pelletier et al. 2018). One concern that was brought up, though, was whether or not Upslope Accumulated Area (UAA) is a surrogate variable. This was the case for the Jaeger et al. (2019) paper, where the authors decided not to include UAA for exactly this reason. Jaeger et al.’s (2019) objectives were different than ours in the sense that they were concerned with stream permanence, not stream occurrence. There’s a subtle difference here, and to us it makes sense to include UAA. However, this is a good example of where data that’s not relevant or causal can sneak in.
Hence we’ll leave the predictors we already selected. It would be challenging to test the UAA surrogate hypothesis. The only real option would be to see whether the forward feature selection picks UAA or not. The hypothesis is that if UAA is a ‘geo-located/surrogate’ variable, then predicting outside its dependence structure (i.e. range) on the response would, in theory, result in the forward feature selection from Meyer et al. (2018) not picking it, or being highly unlikely to pick it. This is due to these types of variables’ propensity to under- and over-fit, i.e. very high error in the model performance metrics; see Meyer et al. (2018). Therefore, this will be a topic to address in the results/discussion section. If there are other ways to test for this particular issue, please let me know.
Prevalence
Please read this with the caveat that it’s an interesting topic but is very nuanced and heavy on theory. I can sum it up pretty quickly: do we use natural prevalence rates or not? To balance these two methods I propose using minority prevalence rates \((0.05, 0.10, \dots, 0.50)\) to test the final models on, and using the minority prevalence to compare against a balanced data set. For example, after tuning the models we will have a good idea of what the model performance will be over multiple tuning parameters (this is for the balanced data set). Then we can use the final test data to compare our tuned models using the different minority thresholds; e.g., let’s say we have 2000 points in the final test set, then for each model, \(2000 \times (0.05, 0.10, \dots, 0.50) = (100, 200, \dots, 1000)\) gives the minority amounts we can test the model on. This seems like a nice idea, but to actually test it we would need to model a separate set of data with natural prevalence rates.
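A hedged sketch of the proposal above, assuming `test_set` is the vaulted hold-out split with the factor response `stream` ("1" = stream, "0" = no stream): for each minority rate p we sample roughly n*p "stream" rows and n*(1-p) "no stream" rows.
rates <- seq(0.05, 0.50, by = 0.05)
prev_sets <- lapply(rates, function(p) {
  n        <- nrow(test_set)
  idx_strm <- which(test_set$stream == "1")
  idx_none <- which(test_set$stream == "0")
  n_strm   <- min(round(n * p), length(idx_strm))         #e.g. 2000 * 0.05 = 100 "stream" rows
  n_none   <- min(round(n * (1 - p)), length(idx_none))
  test_set[c(sample(idx_strm, n_strm), sample(idx_none, n_none)), ]
})
names(prev_sets) <- paste0("prevalence_", rates)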
The reason this is so important is that if you train your model with an artificial proportion, then your model will assume a 50/50 split in the real world as its unconditional probability. In our case, this could lead to overfitting because when the model starts to predict at unknown locations it assumes an unconditional probability of 50%, which affects the feature splitting (e.g. mtry or the root node, the Gini index, and the loss function in our case) and the logistic function in GBM. That is, GBM uses pseudo-residuals to slowly grow learners, but it starts with an initial probability, and that’s where we run into a problem: it assumes that the classes are 50/50 (or whatever the class proportion is, 55% in our case), which is not correct and ultimately affects the residuals. Again, this is very nuanced, and the degree to which it will affect the model in our case is unknown, but that is also the reason we want to test for it. Below shows an rfe model with 55% “stream” and 10% “stream”. Notice the big difference in the spec/sens relationship.
So, like I said above, one way is to use all the data we have (balanced data) and then test on natural prevalence rates with the ‘vaulted’ test set, and the other way is to implement the prevalence rate within the training data and the final ‘vaulted’ test set. The goal is to see how much of a difference there is between these two techniques and how we can move forward with modeling this type of spatial data, i.e. data that is spatial and whose target class is rare, so that we can have sufficient models. Please let me know what your thoughts are, as this is very subtle. Now, if you want to read below, it basically goes into some very nuanced stuff. Read on if you like. Below could be a way to visualize the compromise, i.e. train on a 50/50 proportion and assess on different prevalence rates.
When splitting the data, a critical part becomes how you will test the data, or how much you will leave out for testing. (Sidebar - there are a lot of nuances in model selection and validation techniques, all have their place and reasons, and there is not a perfect ‘one-size-fits-all’ approach, but leaving a validation set out for a sanity check is common practice regardless of training techniques.) The test set should be sampled to be more representative of the response frequencies (Kuhn and Johnson 2013) if an inherent imbalance occurs naturally (these can change and are dynamic in some cases). So, if you’re going to validate on a hold-out test set, be sure the test set matches the inherent prevalence rate (most of the time it’s unknown and cannot be inferred…). This becomes tricky again since we are dealing with spatial data (e.g., climate, geology, landtype). This seems very nuanced and pedantic. I am aware.
However, this still doesn’t get at the issue that Kuhn and Johnson (2013) and Roberts et al. (2017) have brought to the user’s attention, i.e. testing the model with a representative prevalence rate. This may seem too stringent and unrealistic for most circumstances, but in our case we have a sufficient amount of data and we can calculate the prevalence rate in different spatial structures. Thus we will use this methodology for testing the final data alongside the traditional approach of using the stratified random sample that was produced when splitting, i.e. your test set is partitioned to have a \(\approx\) 50/50 split between the binary response proportions. Therefore, we will move on with calculating the prevalence rates below.
I decided to use the climatic water deficit (CWD) tertile method similar to Hoylman et al. (2019) as a way to break up the sample space into water-limiting and energy-limiting geographic zones. The advantage of using CWD for geographic partitioning is that it takes into account numerous biophysical processes, albeit modeled ones. Numerous papers have shown the interplay between these processes and the advantage it has over traditional geographic partitioning, e.g. elevation, precipitation, vegetation type, etc. (Hoylman et al. 2018, 2019; Lutz, Van Wagtendonk, and Franklin 2010; Stephenson 1998). From these tertiles we can calculate prior distributions of stream prevalence from watersheds that we have copious amounts of data on, which will help us when we test the data on the final model. These distributions then get transferred to a global (whole sample space) and three local (tertile) geographical zones, from which the inherent prevalence rate can be used to test the final selected model, allowing us to compare and contrast statistics (e.g., Kappa, ROC, etc.). This will hopefully give us a more realistic model assessment result to compare with controls (e.g., NHD, PROSPER). However, you can only imagine how low these rates would be!
That being said, just doing a few distributions one can see that even a low deficit zone (energy limiting) has only about a 10% prevalence rate. That means that if we left out 3,000 points to test the final model, we would only have 300 “stream” points to actually test the model performance metrics on. So, to balance this problem comes an easy solution: just test both rates (e.g., 50/50 and ‘natural’). This is easy because the test set is locked in a ‘vault’ and not touched until the final model assessment, at which point we can use either method (training proportion vs. ‘natural’ prevalence). No harm, no foul. Just keeping your sanity through all this mess of nuances and tedious minutiae is a struggle. Hence I will derive a very basic prior using some approximation methods (e.g. division :).
First we will split the data into tertiles of CWD. We will clip the data to US Forest Service ground because that’s where we want to predict and where our data is from.
Calculate the tertiles of deficit and use these values to reclassify.
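A hedged sketch of this step, assuming the `deficit` and `landClip` objects created earlier: clip CWD to Forest Service land, pull the 33rd/66th percentile breaks, and reclassify into three zones (1 = low deficit ... 3 = high deficit).
deficit_fs     <- mask(deficit, landClip)                         #clip CWD to Forest Service land
deficit_breaks <- quantile(deficit_fs, probs = c(0.33, 0.66))     #tertile break values
deficit_tertile <- reclassify(deficit_fs,
                              c(-Inf, deficit_breaks[1], 1,
                                deficit_breaks[1], deficit_breaks[2], 2,
                                deficit_breaks[2], Inf, 3))
deficit_breaks   #the 33% / 66% break values printed below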
## 33% 66%
## 243 313
Now Plot
Now we have a raster broken into tertiles of CWD. However, we want to be able to sort these characteristics by watershed because that’s the spatial structure our response is most relevant to, i.e. streams are cumulative to basin architecture and HUCs are a perfect fit. So, we will use the 12th-level HUC watershed as a mediator for CWD and then reclassify each HUC12 by taking the mode of deficit. The reason to use HUC12 is that it limits the amount of spatial autocorrelation when compiling zonal summary statistics, i.e. it’s more representative of the overarching CWD and not the discrete CWD value.
library(raster)
library(sf)
library(exactextractr)
HUC12 <- readOGR("ksank12thclip.shp")
HUC12sf <- st_as_sf(HUC12)
HUC12sf$mean_Def <- exactextractr::exact_extract(deficit, HUC12sf, 'mean')
#or gather other info
HUC12sf[,c("modeDEF", "countDef")] <- exact_extract(deficit, HUC12sf, c('mode', 'count'))
Now plot
I still want to use the mode after looking at the plots as it represents the processes of CWD more clearly than the mean. I think this can be seen in the histograms below. The mean tends to smooth out the HUC12 watersheds when in fact there are more “drier” type landscapes in the valley, i.e. valley may have a few larger streams/rivers carving through but adjacent hillslope processes are most likely drier and this rationale gets smoothed with the mean.
Now I’ll select HUC12’s that we have copious amounts of data on where we can then extract a distribution from. Visualize in ArcGIS to get the names.
From here we can sample from these different HUC12s and come up with a basic, and I mean basic, idea of stream prevalence. Remember, this is just to help us test the final model during assessment. Very tedious, possibly stringent, and most likely a little off, but it will be more representative of a natural stream prevalence rate. For these types of analyses it might not be necessary (i.e. minimal difference), but others (Kuhn and Johnson 2013; Roberts et al. 2017) have alluded to its importance, and thus we will see how it does with our data. It’s important to note the differences between detecting certain responses and the wiggle room that’s acceptable. For example, cancer and stream detection are extremely different; hence it’s probably more worth the trouble of finding a prevalence rate for cancer detection than for streams, but this is up to the goals and objectives of the project in the first place. All we are doing is exploring this option to get a glimpse. My personal bias is that it matters and will give a more realistic final model assessment; however, that’s my bias, so I account for the confirmation bias in the results/discussion section, i.e. take a hard look. So, please take these concepts into consideration when reviewing this section, in that the main goal is hypothesis testing, not hypothesis generation. Hopefully this will give a glimpse into some of the nuanced debates about testing spatial data on prior distributions. Onward!
Now we need to rasterize the polygons and points so we can sample.
#polygon to raster
deezy <- as_Spatial(densityHUCs)
deezy <- spTransform(deezy, tpiCRS)
densHUC12rast <- rasterize(deezy, deficit, name="values")
#now points to raster
ptsRast <- rasterize(pts, deficit , name="values")
#now select only raster values equal to 1
filter_pts <- ptsRast$copy_TWI_1 %in% 1
After a quick view it was apparent that the point method was not going to work. Thus I pivoted and used an accumulation threshold to be sampled as the global density. The reason for doing this is that it's a balance between the results being 0 for a dry area and 5% for wetter areas. I felt this was an appropriate compromise. Therefore, we will only use a global prevalence rate for the final model assessment against the 50/50.
#then extract from the polygons
accumThresh <- reclassify(accum30, c(-Inf, 10000, 1, 10000, Inf, 2))
rHucs <- mask(accumThresh, densHUC12rast)
rHucs <- mask(rHucs, landClip)
writeRaster(rHucs, "rHucs.tif", overwrite = TRUE)
From here we can sample from these watersheds to get an idea of the distribution.
global.dist <- data.frame(sampleRandom(rHucs, size = 100000))
kableExtra::kable(table(global.dist), caption = "Global Distribution from Random Sample")
| global.dist | Freq |
|---|---|
| 1 | 96217 |
| 2 | 3783 |
So, as I said earlier; tedious, stringent, and most likely trivial. We ended up with 4%. So you can see how low a conservative stream prevalance rate would be. I’m sure you’re thinking well duh. But, I had to get it out there.
Minimize Data Leakage
Data leakage is a term used in data mining for when information has been leaked, giving the training data a peek at information it shouldn’t have (Kaufman et al. 2012). It’s usually found when the results are “too good to be true” or unusual variables climb to the top of the variable importance (e.g. coordinates, elevation, ids, time-stamps, etc.). Hopefully the investigation we did during predictor selection helped reduce that chance, but mistakes can and will be made. Thus using some sort of feature selection technique like Meyer et al. (2018) or spatial cross-validation can help mitigate variable leakage; however, there are some less obvious ways that data leakage can happen, i.e. data splitting.
When splitting your data for training/tuning and testing, there is a perfect opportunity for data leakage to happen. One common mistake is to downsample/upsample your raw data and then split it into training and test sets after doing this resampling procedure. You just committed the ultimate leakage sin. You might be thinking “this is so harmless,” but this allows information from the training data to be passed to the test data via the resampling procedure, which can allow the model to overfit by trimming the actual population (very subtle). Another is if you impute or transform any data pre-split; then the test data has seen the training data, which will lead to over-fitting in the model, again very subtle. Finally, when deriving cutoff thresholds it might be tempting to use the same data you used for training or some sort of post-hoc derivation; again, this is leakage (Kuhn and Johnson 2013). Therefore, to avoid these pitfalls we looked at data that could be surrogate variables (cause and effect section), we will transform data after data splitting and during CV (i.e. the test sets in CV can’t see transformed data, very subtle) or downsample during CV if needed, and finally we will derive thresholds off an independent evaluation set or during the validation process using recipes. Since we have a lot of data, we have the advantage of splitting it into multiple sets that help avoid these issues. However, sometimes this isn’t the case, so exploring these pitfalls is always good practice to help us avoid the wrath of data leakage in the future.
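A hedged illustration of the "resample inside CV, never before splitting" point: with caret, down-sampling is requested through trainControl so it is applied to each training resample only, and recipe steps (like the PCA above) are likewise re-estimated per resample.
library(caret)
ctrl <- trainControl(method = "cv",
                     number = 10,
                     sampling = "down")   #down-sample within each fold, not on the raw data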
Assess Dependence Structures
Assessing dependence structures is a step recommended by Roberts et al. (2017) and should be considered common practice with any spatial prediction analysis. The objective is to analyze your covariates (e.g. rasters) so that you can get an idea of the associated spatial autocorrelation levels (e.g. ranges from variograms). The method used, in concurrence with Roberts et al. (2017) via Valavi et al. (2018), is a variogram-type approach (isotropic). The main reason to use this procedure is that if you don’t structure your spatial data systematically (in this case by variograms), then you are likely to underestimate or overestimate (depending on the metrics) the error in your evaluation, again leading to over-optimistic results and over-fitting in the models (Roberts et al. 2017; Meyer et al. 2018, 2019).
Roberts et al. (2017) and Valavi et al. (2018) suggest using a blocking system with different techniques (e.g. systematic, checkerboard, or by rows or columns) and then partitioning the sample space based on the set of variogram ranges (median distance selected) (Valavi et al. 2018). This is a balanced approach to demarcating the dependence structures; however, there are covariates remaining with high spatial autocorrelation (i.e. the median is the threshold). Again, no free lunch. However, there are ways to find a balance; we will search for it by testing four different structures, each with its inherent spatial autocorrelation range, e.g. 12th HUC, blockCV median, 14th HUC, and K-means. The goal is to go big and small, i.e. from minimal autocorrelation to a lot.
Below is how we got the blockCV size and how we are adding HUCs to the structured CV using Valavi et al. (2018). This is a really sweet package with great tools and interface options for spatial modelling! Here is a link to the vignette: blockCV.
remotes::install_github("rvalavi/blockCV", dependencies = TRUE)
library(blockCV)
library(sf)
library(raster)
First we want to bring in our point data and our covariates.
library(sf)
ptsSF <- arc.open("points_spaced30m.shp") #points to the folder/feature
ptsSF <- arc.select(ptsSF, c("copy_TWI_1")) #selects the data in the folder
ptsSF <- arc.data2sf(ptsSF)
ptsSF <- st_as_sf(data, coords = c("coords.x1","coords.x2"))
pts
st_crs(ptsSF) <- "+proj=aea +lat_1=46 +lat_2=48 +lat_0=44 +lon_0=-109.5 +x_0=600000 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
nrow(data)
topo_new <- dropLayer(topo_opt_rad30, c("ndvias30agg", "cpg30Deficit")) #covariate stack
topo_new
Plot to make sure everything is lined up. Looks good.
## Warning in plot.sf(ptsSF[which(ptsSF$stream == 0), ], pch = 16, col = "red", : ignoring all but the first attribute
## Warning in plot.sf(ptsSF[which(ptsSF$stream == 1), ], pch = 1, col = "blue", : ignoring all but the first attribute
Now we can figure out our ‘effective range of spatial autocorrelation’ by using the function spatialAutoRange in blockCV. This is a cool function because it takes the covariates, calculates the range of autocorrelation (from \(n\) sampled points), and then produces two plots showing the ranges of the covariates and the recommended block size for validation.
sac <- blockCV::spatialAutoRange(rasterLayer = topo_new,
sampleNumber = 5000,
doParallel = TRUE,
showPlots = TRUE)
Plots of variograms: lowest (TPI), median (UAA), highest (CWD).
So as you can see, some variables are very spatially autocorrelated. The blockCV package recommends the median range as the block dimension and then dissects the sample space into blocks of that size. From there you add the points and blockCV will provide a sampling scheme for cross-validation. Pretty sweet! Real quick, here’s the summary from the variograms giving the range for each covariate.
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
| | layers | range |
|---|---|---|
| 2 | tpi30agg | 350.5236 |
| 6 | npol30agg | 1096.7657 |
| 13 | decid30RS | 2148.1797 |
| 1 | twi30agg | 2486.7891 |
| 8 | ndwias30agg | 2520.0715 |
| 5 | vvsd30agg | 2567.1247 |
| 17 | B8_30agg | 4329.5648 |
| 7 | ndvi30yrRS | 5378.3710 |
| 16 | B4_30agg | 5672.9572 |
| 3 | accum30 | 5715.8568 |
| 15 | B3_30agg | 5903.6377 |
| 4 | vv30agg | 6286.4209 |
| 9 | nppmmid30agg | 10557.6102 |
| 11 | hdi30RS | 10940.0857 |
| 14 | B2_30agg | 11893.0245 |
| 12 | cad30RS | 12188.1081 |
| 18 | cpg30precip | 14436.5249 |
| 10 | deficitRS | 18410.0072 |
Now we can systematically separate the blocks out with k = 10 for a 10-fold CV and a range from the derived median.
sbMed <- spatialBlock(speciesData = ptsSF, # presence-background data
species = "stream",
rasterLayer = topo_new,
k = 10,
theRange = 5694,
selection = "systematic",
biomod2Format = TRUE)
Now we can partition the data by Hydrological Unit Codes (12th and 14th HUCs) into ‘blocks’ based on a systematic approach. Same thing: 10 folds, selection = “systematic”.
HUC14SP <- arc.open("ksank14thclip.shp") #points to the folder/feature
HUC14SP <- arc.select(HUC14SP, c("FID")) #selects the data in the folder
HUC14SP <- arc.data2sp(HUC14SP)
HUC14SP <- spTransform(HUC14SP,CRSobj = crs(topo_opt_rad30))
sb14 <- spatialBlock(speciesData = ptsSF,
species = "stream",
rasterLayer = topo_new,
blocks = HUC14SP,
k = 10,
selection = "systematic")HUC12SP <- arc.open("ksank12thclip.shp") #points to the folder/feature
HUC12SP <- arc.select(HUC12SP, c("FID")) #selects the data in the folder
HUC12SP <- arc.data2sp(HUC12SP)
HUC12SP <- spTransform(HUC12SP,CRSobj = crs(topo_opt_rad30))
sb12 <- spatialBlock(speciesData = ptsSF,
species = "stream",
rasterLayer = topo_new,
blocks = HUC12SP,
k = 10,
selection = "systematic")
Finally, we can do a k-means clustering with roughly 20 points per cluster. This will be our small-structure attempt.
#this approximates a leave-one-cluster-out CV (clusters of roughly 20 points)
#need coordinates.
#remember to use the coordinates, that's why we've kept them!
Mycluster <- kmeans(data[,c(23,24)], (nrow(data)/20))
# add the new variable back to your dataframe here
data$spatial_cluster <- Mycluster$cluster
Now you can see the clusters.
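One quick, hedged way to visualize the clusters, assuming coords.x1/coords.x2 are the coordinate columns used in the kmeans call above:
library(ggplot2)
ggplot(data, aes(coords.x1, coords.x2, color = factor(spatial_cluster))) +
  geom_point(size = 0.5, show.legend = FALSE) + #one point per sample, colored by cluster
  coord_equal() +
  labs(title = "k-means spatial clusters", x = "easting (m)", y = "northing (m)")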
Now bring in the blocks we created (sb12, sb14, sbMed) and bind them to our data frame data so that we can index them during CV.
data <- cbind(data, sbMed = sbMed$foldID, sb14 = sb14$foldID, sb12 = sb12$foldID)
data[1:20,c(23:26)]
data <- data[,-c(28,29,30)]
Evaluation Techniques
Building a model can seem pretty straightforward and easy, almost like plug and chug; but this will most likely lead to unreliable results and disappointed end-users (Kuhn and Johnson 2013). We want to correct for these potential oversights as discussed in earlier sections while not taking too much time doing so (i.e., model efficiency). Therefore, balancing validation/selection techniques with time will be the goal of this analysis so that future workflows can be performed in a modest amount of time while also achieving optimal model performance within a spatial context. Traditional ways of balancing these issues are different validation techniques (e.g., k-fold cross-validation, bootstrapping (out-of-bag error), the validation set approach, leave-one-out cross-validation (LOOCV), etc.) and feature selection techniques (e.g., forward, recursive, or embedded feature extraction such as the lasso) (Friedman, Hastie, and Tibshirani 2001; James et al. 2013; Kuhn and Johnson 2013); however, within a spatial context these techniques can run into problems, as found in Meyer et al. (2018, 2019) and Roberts et al. (2017). Therefore we must account for these problems and use other techniques to lessen the effect of overfitting within a spatial context.
As I’ve already alluded to in previous sections, we will use a feature selection approach from Meyer et al. (2018), a forward feature selection (ffs) technique within the CAST package, and compare it with a recursive feature elimination (rfe) technique from Kuhn and Johnson (2013). This is very similar to the approach that Meyer et al. (2018, 2019) have taken. In addition, we will also use four different target-oriented cross-validation approaches to find a dependence-structure balance between autocorrelation and evaluation (e.g., HUC12, blockCV, HUC14, and k-means). The importance of feature selection and evaluation (e.g., cross-validation, out-of-bag error, etc.) within a spatial context has been expressed in numerous papers (Bahn and McGill 2013; Ives and Zhu 2006; Meyer et al. 2018, 2019; Roberts et al. 2017), and the reason we’re putting so much emphasis on this topic is that we want to be confident in the performance statistics before the model goes into production, i.e., so that the end-user will not be misled. How these frameworks play into management is a matter of objectives and goals; therefore, in the spatial domain, where highly flexible models (i.e., ensemble-type models) are becoming the norm, caution should be exercised.
Feature Selection
I highlighted some advantages of using feature selection in previous sections, but there can be some disadvantages as well. One of the most recognized problems with forward-type selection (e.g., stepwise, forward feature selection, etc.) is that the algorithm puts a threshold on how far the selection will go before it stops, which leaves the possibility of important variables not being selected (see Friedman, Hastie, and Tibshirani 2001; James et al. 2013; Kuhn and Johnson 2013; Meyer et al. 2018). To possibly account for this effect, there are options within the CAST package to model all possible best subsets (bss), but that will take ages! Discussing this with one of the authors of Meyer et al. (2018, 2019), she said, “I compared bss with ffs for some examples and found that the ffs is doing a good job and rarely misses important variables.”
Therefore, ffs seems to get it right a majority of the time. Is this enough? I think exploring the theory behind ffs within a spatial context might help answer this question. While ffs can occasionally fail to pick an ‘important’ predictor, it emphasizes combinations of variables that reduce variance and, in turn, reduce overfitting. Again, there are trade-offs, and it comes down to how much time and energy you want to put into capturing the covariate(s) that might increase model performance (e.g., bss will do this). For example, as discussed in earlier sections, we want to avoid surrogate or geolocated-type variables because of the overfitting that occurs when these variables are in a model (Meyer et al. 2018, 2019); therefore, reducing the overfitting up front (ffs), at the cost of possibly missing an important variable during selection (interpretable vs. optimal; goals and objectives matter…), outweighs the cost of handing the end-user an over-optimistic model.
The complement of ffs is rfe, and Meyer et al. (2019) have shown that within a spatial context the rfe approach can lead to overfitting. This overfitting can happen because the algorithm holds on to important variables that may carry significant variance that goes unnoticed (Kuhn and Johnson 2013; Meyer et al. 2019). We want to see how these two feature selection approaches deal with the stream occurrence data. They will most likely be similar due to the stringent pre-processing we did, and Meyer et al. (2019) note that rfe can achieve results similar to ffs; it just depends on the data being used. That is to say, if you pre-process your data with ‘cause and effect’ in mind, then the differences may be negligible. We will test this hypothesis.
Tuning/Evaluation
Tuning and evaluation go hand in hand. We will set aside a tuning/evaluation set that hasn’t seen the feature selection sets (e.g., ffs, rfe). The reason is data leakage and the overfitting caused by overlap, i.e., training with testing data. During this process we will tune the parameters of the models so that we can achieve optimal performance. There are different techniques for this, like a ‘search’ function that randomly samples parameter combinations rather than stepping through a full grid. I will use this for the gradient boosting machine (GBM) using methods from Kuhn and Johnson (2013), but not for random forest, since random forest has minimal parameters. This will hopefully dismiss irrelevant parameter values and focus on useful ones in a time-saving manner, i.e., fewer models will be fitted compared to an expansive grid approach.
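To make the random-search idea concrete, here is a minimal, hedged sketch using caret’s built-in random search. This is illustrative only; the actual GBM tuning later uses an adaptive resampling control, and traintune (with valid factor levels for stream) is assumed from the splitting code further down.
library(caret)
randCtrl <- trainControl(method = "repeatedcv",
                         repeats = 5,
                         search = "random",          #sample tuning parameter combinations at random
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)
#set.seed(1234)
#gbmRandom <- train(stream ~ ., data = traintune,
#                   method = "gbm",
#                   metric = "ROC",
#                   tuneLength = 10,                 #number of random parameter combinations to try
#                   trControl = randCtrl,
#                   verbose = FALSE)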
Along with the tuning process, we will also derive cutoff thresholds so we can descriptively look into the interplay between specificity/sensitivity, which some refer to as recall/precision. This is a confusing topic, with conflicting confusion-matrix conventions floating around online. In short, caret uses the Altman and Bland (1994) layout, which is normally what you’ll find in the literature; however, others ‘online’ will use an alternative that flips the table, e.g., actual and predicted get switched but the terms don’t! Be aware of this, and when reading this workflow refer to this link here or here for more on confusion matrix terminology and the specific convention we’ll be using.
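To make the convention concrete, here is a tiny, hedged example using caret’s confusionMatrix with made-up factors; in the printed table the rows are the predicted classes and the columns are the observed (reference) classes, i.e., the Altman and Bland (1994) layout.
library(caret)
obs  <- factor(c("X1","X0","X1","X1","X0","X0"), levels = c("X0","X1")) #observed classes (toy data)
pred <- factor(c("X1","X0","X0","X1","X0","X1"), levels = c("X0","X1")) #predicted classes (toy data)
confusionMatrix(data = pred, reference = obs, positive = "X1")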
Within classification models with imbalanced or rare data, the goal is to optimize the splitting threshold (the standard cutoff is 50%) so that the reported performance statistics appropriately reflect the ‘on the ground’ transitions from ‘no stream’ to ‘stream.’ Thus, one approach is to select the optimal model by the receiver operating characteristic (ROC). Compared to selecting solely for accuracy at a 50% cutoff, this can lead to a more balanced trade-off between sensitivity and specificity; however, it is not always aligned with project goals such as reducing false positives. To find an optimal operating point on the ROC curve there are methods like Youden’s J or the distance between sensitivity and specificity (Dist), which balance the two. This is a fairly balanced approach because it weighs false positives and false negatives together.
By deriving a cutoff threshold that balances the two error types (type I/II), we can produce a more useful model. There are trade-offs, of course; again, no free lunch. The trade-off comes from giving or taking on either the false-positive or the false-negative side. In our case, this will most likely reduce the false positives but increase the false negatives, which is what we want. So, deriving cutoff thresholds can help us see these trade-offs visually and find an optimal threshold that balances the two errors; however, it can also go against your objectives, so be careful. We will just look at the possibilities but will ultimately go with the initial ROC/AUC value from the model, i.e., similar to Hird et al. (2017). We could, however, pick a model that has the best specificity (e.g., fewer false positives) instead of using the ROC/AUC. That approach can help us achieve our objectives, but the trade-off is possibly having a less balanced model, which may be an acceptable alternative. To avoid any data leakage, you would typically do your tuning and evaluating (ROC/AUC) on separate hold-out sets, e.g., tuning on one and evaluating on another. However, there are ways to combine these steps so that you don’t have to split the data up further, which is what we will do.
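For illustration, a minimal, hedged sketch of deriving a Youden’s J cutoff with the pROC package on toy data (we will ultimately rely on caret’s thresholder output and the model ROC/AUC, as stated above):
library(pROC)
obs   <- factor(c(0, 0, 1, 1, 0, 1, 1, 0), levels = c(0, 1)) #toy observations
probs <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.55, 0.45)   #toy predicted P(stream)
roc_obj <- roc(response = obs, predictor = probs, quiet = TRUE)
coords(roc_obj, x = "best", best.method = "youden") #threshold plus specificity/sensitivity at Youden's J
#best.method = "closest.topleft" gives the distance-based (Dist) alternative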
The reasons to do it within the tuning/evaluation set are: (1) we don’t have to create another partition of the data; (2) we can use more data in the final test set as well as in the tuning/evaluation set; (3) it keeps everything tidy. That being said, I also think that by doing it all together and keeping the results tidy it will be easier to explore and visualize performance. Within this process you can do any prep work you would want without violating data leakage, e.g., transforms, PCA, down/up-sampling, etc. We will do this with the PCA variables in our data set (e.g., B2, B3, B4, B8) and will bring some variables along for the ride to use later, e.g., for post hoc visualization.
Model Splitting
Above in the data splitting section I proposed splitting the data into three partitions, with the test data serving as the final model performance measure (sanity check). To do this we will first hold out the test set and then split the remaining training data into a feature selection set and a tuning/evaluation set. This will hopefully give us an honest measure of final model performance.
library(caret)
library(CAST)
set.seed(1234)
#split data into training and test data. 75% for training is the goal.
datindex <- createDataPartition(data$stream, list = FALSE, p=0.75)
train <- data[datindex,]
test <- data[-datindex,]
#then split the training data into two so we have 'overall' 25% for feature selection and 'overall' 50% for training.
set.seed(1234)
trainindex <- createDataPartition(train$stream, list = FALSE, p = .66)
traintune <- train[trainindex,]
trainsel <- train[-trainindex,]
trainprevtune <- train[trainindex,]
trainprevsel <- train[-trainindex,]
#now for the tuning training set using prevalence
trainprevtuneSample1 <- trainprevtune %>% filter(stream == "1")
set.seed(1234)
trainprevtuneSample1 <- trainprevtuneSample1[sample(nrow(trainprevtuneSample1),274),]
trainprevtuneSample0 <- trainprevtune %>% filter(stream == "0")
trainprevtune <- data.frame(rbind(trainprevtuneSample0,trainprevtuneSample1))
summary(trainprevtune$stream)
#now for the selection training set using prevalence
trainprevselSample1 <- trainprevsel %>% filter(stream == "1")
set.seed(1234)
trainprevselSample1 <- trainprevselSample1[sample(nrow(trainprevselSample1),141),]
trainprevselSample0 <- trainprevsel %>% filter(stream == "0")
trainprevsel <- data.frame(rbind(trainprevselSample0,trainprevselSample1))
summary(trainprevsel$stream)
As we’ve discussed earlier, another important step in evaluating spatial data is accounting for spatial autocorrelation. We can do this through spatial cross-validation: folds are created at random by default, but when a grouping is specified (i.e., HUCs, blocks, clusters) the folds are systematically separated into k folds (Roberts et al. 2017; Valavi et al. 2018). This helps correct for spatial autocorrelation in the validation process, which will lead to more realistic results (Meyer et al. 2018, 2019). The goal is to see how model performance reacts to different dependence structures, hence finding a balance between ‘too large’ and ‘too small.’ Below we will use the CAST package to index these dependence structures.
library(CAST)
#this is the index for the HUC12 method.
set.seed(1234)
indices12sel <- CreateSpacetimeFolds(trainsel, spacevar = "sb12", k = 10)
set.seed(1234)
indices12tune <- CreateSpacetimeFolds(traintune,spacevar = "sb12", k = 10)
set.seed(1234)
indices12selprev <- CreateSpacetimeFolds(trainprevsel, spacevar = "sb12", k = 10)
set.seed(1234)
indices12tuneprev <- CreateSpacetimeFolds(trainprevtune,spacevar = "sb12", k = 10)
#now for the blockCV method
set.seed(1234)
indicesMedsel <- CreateSpacetimeFolds(trainsel, spacevar = "sbMed", k = 10)
set.seed(1234)
indicesMedtune <- CreateSpacetimeFolds(traintune,spacevar = "sbMed", k = 10)
set.seed(1234)
indicesMedselprev <- CreateSpacetimeFolds(trainprevsel, spacevar = "sbMed", k = 10)
set.seed(1234)
indicesMedtuneprev <- CreateSpacetimeFolds(trainprevtune,spacevar = "sbMed", k = 10)
#this is the index for the HUC14 method
set.seed(1234)
indices14sel <- CreateSpacetimeFolds(trainsel, spacevar = "sb14", k = 10)
set.seed(1234)
indices14tune <- CreateSpacetimeFolds(traintune,spacevar = "sb14", k = 10)
set.seed(1234)
indices14selprev <- CreateSpacetimeFolds(trainprevsel, spacevar = "sb14", k = 10)
set.seed(1234)
indices14tuneprev <- CreateSpacetimeFolds(trainprevtune, spacevar = "sb14", k = 10)
#now for the cluster method
set.seed(1234)
indicesKsel <- CreateSpacetimeFolds(trainsel, spacevar = "spatial_cluster", k = 10)
set.seed(1234)
indicesKtune <- CreateSpacetimeFolds(traintune,spacevar = "spatial_cluster", k = 10)
set.seed(1234)
indicesKselprev <- CreateSpacetimeFolds(trainprevsel, spacevar = "spatial_cluster", k = 10)
set.seed(1234)
indicesKtuneprev <- CreateSpacetimeFolds(trainprevtune, spacevar = "spatial_cluster", k = 10)
Now that we have our training data, let’s start building the models. We are going to run two feature selection methods (rfe and ffs) per spatial CV for two different model types. So in this section there will be 8 different model results. First we need to create a recipe to handle the PCA extraction during the feature selection.
library(recipes)
library(tidyselect)
#for balanced data
rec_sel <-
recipe(stream ~ ., data = trainsel) %>%
update_role(sb12,sb14,sbMed,spatial_cluster, HUC12, HUC14, coords.x1,coords.x2, new_role = "bring along") %>%
step_center(contains("_30agg")) %>%
step_scale(contains("_30agg")) %>%
step_pca(contains("_30agg"), prefix = "pca_B", threshold = 0.9)
rec_tune <- recipe(stream ~ ., data = traintune) %>%
update_role(sb12,sb14,sbMed,spatial_cluster, HUC12, HUC14, coords.x1,coords.x2, new_role = "bring along") %>%
step_center(contains("_30agg")) %>%
step_scale(contains("_30agg")) %>%
step_pca(contains("_30agg"), prefix = "pca_B", threshold = 0.9)
#for prevalence data
rec_selprev <-
recipe(stream ~ ., data = trainprevsel) %>%
update_role(sb12,sb14,sbMed,spatial_cluster, HUC12, HUC14, coords.x1,coords.x2, new_role = "bring along") %>%
step_center(contains("_30agg")) %>%
step_scale(contains("_30agg")) %>%
step_pca(contains("_30agg"), prefix = "pca_B", threshold = 0.9)
rec_tuneprev <- recipe(stream ~ ., data = trainprevtune) %>%
update_role(sb12,sb14,sbMed,spatial_cluster, HUC12, HUC14, coords.x1,coords.x2, new_role = "bring along") %>%
step_center(contains("_30agg")) %>%
step_scale(contains("_30agg")) %>%
step_pca(contains("_30agg"), prefix = "pca_B", threshold = 0.9)
Now we can start with the rfe feature selection. This feature selection technique (Kuhn and Johnson 2013) uses a recursive algorithm that fits models on subsets of different sizes and then selects the subset that gives the best value of the chosen metric (e.g., Kappa, accuracy, etc.). We will use the Kappa statistic as the feature selection criterion. Later, in the tuning and evaluation process, we will use ROC/AUC as the metric to evaluate overall performance. This workflow is very similar to Hird et al. (2017), except we use a feature selection step before tuning/evaluation.
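Before kicking off rfe, a quick hedged sanity check on the recipes: prep and bake one of them to confirm that the Sentinel-2 band columns get replaced by PCA components (the pca_B names come from the step_pca prefix above).
library(recipes)
baked_sel <- bake(prep(rec_sel, training = trainsel), new_data = NULL) #processed training data
grep("^pca_B", names(baked_sel), value = TRUE) #should list the pca_B* component columns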
We will use the default tuning grids for feature selection below for both random forest and GBM (ffs and rfe). The reason has to do with time, and tuning here most likely won’t gain you much; however, if someone wants to try, go right ahead. I’m going to use the tuning/evaluation set to beef up the models! This step is purely variable selection; as I said, you may want to add tuning here and tailor it how you want.
RFE 12th HUC Random Forest
library(plyr)
library(doParallel) #attaches the parallel/foreach backend used below (makeCluster, registerDoParallel, registerDoSEQ)
#bring in cores
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
#add to controls
rfeCtrl12 <- rfeControl(
functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indices12sel$index
)
rfeCtrl12prev <- rfeControl(
functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indices12selprev$index
)
#balanced model
set.seed(1234)
rfe12 <- rfe(rec_sel,
data = trainsel, metric = "Kappa",
rfeControl = rfeCtrl12,
sizes = c(2:18))
#now for prevalence model
set.seed(1234)
rfe12prev <- rfe(rec_selprev,
data = trainprevsel, metric = "Kappa",
rfeControl = rfeCtrl12prev,
sizes = c(2:18))
#now remove cluster. consistency...
stopCluster(cluster)
registerDoSEQ()
| Variable | Overall |
|---|---|
| accum30 | 83.51688 |
| nppmmid30agg | 33.25004 |
| tpi30agg | 28.87214 |
| cpg30precip | 22.50602 |
| cad30RS | 19.26081 |
| deficitRS | 15.32609 |
| decid30RS | 15.16742 |
| pca_B1 | 14.30337 |
| hdi30RS | 13.12019 |
| npol30agg | 12.35123 |
| pca_B2 | 12.34902 |
| twi30agg | 12.20689 |
| ndwias30agg | 12.13892 |
| vv30agg | 12.08997 |
| ndvi30yrRS | 10.92846 |
| Variable | Overall |
|---|---|
| accum30 | 42.974124 |
| nppmmid30agg | 16.954357 |
| tpi30agg | 14.890134 |
| cpg30precip | 10.886297 |
| cad30RS | 9.562900 |
| twi30agg | 9.193509 |
| hdi30RS | 8.729399 |
| decid30RS | 8.011614 |
| vv30agg | 7.409818 |
RFE blockCV Random Forest
#bring in cores
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
#add to controls
rfeCtrlMed <- rfeControl(
functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indicesMedsel$index
)
rfeCtrlMedprev <- rfeControl(
functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indicesMedselprev$index
)
#balanced data
set.seed(1234)
rfeMed <- rfe(rec_sel,
data = trainsel, metric = "Kappa",
rfeControl = rfeCtrlMed,
sizes = c(2:18))
#prevalence data
set.seed(1234)
rfeMedprev <- rfe(rec_selprev,
data = trainprevsel, metric = "Kappa",
rfeControl = rfeCtrlMedprev,
sizes = c(2:18))
#now remove cluster. consistency...
stopCluster(cluster)
registerDoSEQ()
| Variable | Overall |
|---|---|
| accum30 | 82.58006 |
| nppmmid30agg | 33.00096 |
| tpi30agg | 29.01813 |
| cpg30precip | 22.92511 |
| cad30RS | 19.06989 |
| deficitRS | 15.86478 |
| decid30RS | 15.22377 |
| pca_B1 | 14.47780 |
| hdi30RS | 13.69397 |
| vv30agg | 12.63457 |
| Variable | Overall |
|---|---|
| accum30 | 42.937748 |
| nppmmid30agg | 17.394339 |
| tpi30agg | 15.040608 |
| cpg30precip | 10.116063 |
| cad30RS | 9.655658 |
| twi30agg | 7.748022 |
| hdi30RS | 7.730162 |
| decid30RS | 7.693185 |
| npol30agg | 7.261287 |
| vv30agg | 7.257008 |
| pca_B2 | 7.137440 |
| deficitRS | 6.890446 |
| pca_B1 | 6.463274 |
RFE 14th HUC Random Forest
#bring in cores
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
#add to controls
rfeCtrl14 <- rfeControl(
functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indices14sel$index
)
rfeCtrl14prev <- rfeControl(
functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indices14selprev$index
)
#balanced data
set.seed(1234)
rfe14 <- rfe(rec_sel,
data = trainsel, metric = "Kappa",
rfeControl = rfeCtrl14,
sizes = c(2:18))
#prevalence data
set.seed(1234)
rfe14prev <- rfe(rec_selprev,
data = trainprevsel, metric = "Kappa",
rfeControl = rfeCtrl14prev,
sizes = c(2:18))
#now remove cluster. consistency...
stopCluster(cluster)
registerDoSEQ()
| Variable | Overall |
|---|---|
| accum30 | 83.26300 |
| nppmmid30agg | 32.89858 |
| tpi30agg | 29.05962 |
| cpg30precip | 23.07001 |
| cad30RS | 19.21140 |
| deficitRS | 16.34066 |
| decid30RS | 15.36604 |
| pca_B1 | 14.60449 |
| vv30agg | 13.67239 |
| hdi30RS | 13.59942 |
| pca_B2 | 13.53022 |
| npol30agg | 11.86189 |
| Variable | Overall |
|---|---|
| accum30 | 42.305642 |
| nppmmid30agg | 16.951341 |
| tpi30agg | 14.996234 |
| cpg30precip | 10.234166 |
| cad30RS | 9.664406 |
| hdi30RS | 7.896817 |
| decid30RS | 7.551560 |
| twi30agg | 7.411923 |
| vv30agg | 6.618988 |
| deficitRS | 6.461227 |
| npol30agg | 6.363360 |
| vvsd30agg | 5.873927 |
| pca_B1 | 5.635370 |
| pca_B2 | 5.582240 |
| ndwias30agg | 5.105362 |
| ndvi30yrRS | 3.913382 |
RFE Cluster Random Forest
#bring in cores
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
#add to controls
rfeCtrlK <- rfeControl(
functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indicesKsel$index
)
rfeCtrlKprev <- rfeControl(
functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indicesKselprev$index
)
#balanced data
set.seed(1234)
rfeK <- rfe(rec_sel,
data = trainsel, metric = "Kappa",
rfeControl = rfeCtrlK,
sizes = c(2:18))
#prevalence data
set.seed(1234)
rfeKprev <- rfe(rec_selprev,
data = trainprevsel, metric = "Kappa",
rfeControl = rfeCtrlKprev,
sizes = c(2:18))
#now remove cluster. consistency...
stopCluster(cluster)
registerDoSEQ()
| Variable | Overall |
|---|---|
| accum30 | 81.91500 |
| nppmmid30agg | 32.87138 |
| tpi30agg | 28.58194 |
| cpg30precip | 21.90547 |
| cad30RS | 18.39622 |
| deficitRS | 15.65414 |
| decid30RS | 15.01028 |
| pca_B1 | 14.21273 |
| hdi30RS | 12.94191 |
| twi30agg | 11.79597 |
| pca_B2 | 11.76070 |
| vv30agg | 11.66915 |
| npol30agg | 11.66672 |
| ndwias30agg | 11.57453 |
| ndvi30yrRS | 11.26538 |
| Variable | Overall |
|---|---|
| accum30 | 42.937748 |
| nppmmid30agg | 17.394339 |
| tpi30agg | 15.040608 |
| cpg30precip | 10.116063 |
| cad30RS | 9.655658 |
| twi30agg | 7.748022 |
| hdi30RS | 7.730162 |
| decid30RS | 7.693185 |
| npol30agg | 7.261287 |
| vv30agg | 7.257008 |
| pca_B2 | 7.137440 |
| deficitRS | 6.890446 |
| pca_B1 | 6.463274 |
As we expected, as you go from large hold-outs to small ones you see a clear trend in performance: the resampled results get more optimistic as the spatial hold-outs shrink.
Now we can move on to the GBM rfe. First we need to modify and bring in GBM properties for rfe so that it can recognize the gbm operations, similar to rfFuncs. This is because rfe does not have a predefined helper function set for gbm.
## this just brings in a basic list from which we can then modify
gbmRFE <- list(summary = defaultSummary,
fit = function(x, y, first, last, ...){
library(randomForest)
randomForest(x, y, importance = first, ...)
},
pred = function(object, x) predict(object, x),
rank = function(object, x, y) {
vimp <- varImp(object)
vimp <- vimp[order(vimp$Overall,decreasing = TRUE),,drop = FALSE]
vimp$var <- rownames(vimp)
vimp
},
selectSize = pickSizeBest,
selectVar = pickVars)
#summary function is how are we storing the stats per resample
gbmRFE$summary <- function (data, lev = NULL, model = NULL)
{
if (is.character(data$obs))
data$obs <- factor(data$obs, levels = lev)
postResample(data[, "pred"], data[, "obs"])
}
#now bring in the gbm function and use method = "gbm"
gbmRFE$fit <- function (x, y, first, last, ...) {
library(gbm)
caret::train(x, y, method = "gbm")}
#this is the predict function from the "gbm" package
gbmRFE$pred <- function (object, x)
{
tmp <- predict(object, x)
if (object$modelType == "Classification" & object$control$classProbs) {
out <- cbind(data.frame(pred = tmp), as.data.frame(predict(object,
x, type = "prob")))
}
else out <- tmp
out
}
#this is the rank function from the "gbm" package
gbmRFE$rank <- function(object, x, y){
vimp <- varImp(object, scale = TRUE)$importance
if (!is.data.frame(vimp))
vimp <- as.data.frame(vimp)
if (object$modelType == "Regression") {
vimp <- vimp[order(vimp[, 1], decreasing = TRUE), , drop = FALSE]
}
else {
if (all(levels(y) %in% colnames(vimp)) & !("Overall" %in%
colnames(vimp))) {
avImp <- apply(vimp[, levels(y), drop = TRUE], 1,
mean)
vimp$Overall <- avImp
}
}
vimp <- vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
vimp$var <- rownames(vimp)
vimp
}
#selectSize and selectVar will stay the same
Now we can start modeling. It should be noted that gbm does an internal feature selection, so this step might not be strictly necessary; however, I’m investigating further and still think it is useful for gauging subset sizes and trends.
RFE 12th HUC GBM
#bring in cores
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
#add to controls
rfeCtrl12g <- rfeControl(
functions = gbmRFE,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indices12sel$index
)
rfeCtrl12gprev <- rfeControl(
functions = gbmRFE,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indices12selprev$index
)
#Balanced
set.seed(1234)
rfe12g <- rfe(rec_sel,
data = trainsel, metric = "Kappa",
rfeControl = rfeCtrl12g,
sizes = c(2:18))
#Prevalence
set.seed(1234)
rfe12gprev <- rfe(rec_selprev,
data = trainprevsel, metric = "Kappa",
rfeControl = rfeCtrl12gprev,
sizes = c(2:18))
#now remove cluster. consistency...
stopCluster(cluster)
registerDoSEQ()
| Variable | Overall |
|---|---|
| accum30 | 100.0000000 |
| nppmmid30agg | 11.6228997 |
| cpg30precip | 9.1523383 |
| tpi30agg | 6.4603433 |
| decid30RS | 3.7626892 |
| cad30RS | 3.5671839 |
| deficitRS | 3.5347121 |
| pca_B1 | 2.7427601 |
| pca_B2 | 1.7476607 |
| ndvi30yrRS | 1.4295748 |
| vvsd30agg | 1.2350326 |
| hdi30RS | 1.1218140 |
| npol30agg | 1.1052137 |
| vv30agg | 0.7664776 |
| twi30agg | 0.6972290 |
| ndwias30agg | 0.5769807 |
| Variable | Overall |
|---|---|
| accum30 | 100.000000 |
| nppmmid30agg | 23.677770 |
| tpi30agg | 21.221456 |
| cpg30precip | 13.001646 |
| decid30RS | 9.168694 |
| pca_B1 | 8.606456 |
| pca_B2 | 8.169847 |
| deficitRS | 8.153113 |
| cad30RS | 8.044979 |
| hdi30RS | 7.360255 |
| vvsd30agg | 6.387338 |
| vv30agg | 6.041464 |
| twi30agg | 5.812632 |
| ndwias30agg | 5.513158 |
| npol30agg | 5.366556 |
| ndvi30yrRS | 4.466310 |
RFE blockCV GBM
#bring in cores
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
#add to controls
rfeCtrlMedg <- rfeControl(
functions = gbmRFE,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indicesMedsel$index
)
rfeCtrlMedgprev <- rfeControl(
functions = gbmRFE,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indicesMedselprev$index
)
#balanced
set.seed(1234)
rfeMedg <- rfe(rec_sel,
data = trainsel, metric = "Kappa",
rfeControl = rfeCtrlMedg,
sizes = c(2:18))
#prevalence
set.seed(1234)
rfeMedgprev <- rfe(rec_selprev,
data = trainprevsel, metric = "Kappa",
rfeControl = rfeCtrlMedgprev,
sizes = c(2:18))
#now remove cluster. consistency...
stopCluster(cluster)
registerDoSEQ()
| Variable | Overall |
|---|---|
| accum30 | 100.0000000 |
| nppmmid30agg | 11.3067246 |
| cpg30precip | 8.4058929 |
| tpi30agg | 5.7877504 |
| decid30RS | 3.8646041 |
| pca_B1 | 3.0308583 |
| deficitRS | 2.9064551 |
| cad30RS | 2.8100509 |
| pca_B2 | 1.4868424 |
| hdi30RS | 1.2173641 |
| ndvi30yrRS | 1.1002770 |
| vvsd30agg | 1.0421211 |
| npol30agg | 0.8201669 |
| vv30agg | 0.4961169 |
| twi30agg | 0.2180094 |
| ndwias30agg | 0.1624402 |
| Variable | Overall |
|---|---|
| accum30 | 100.000000 |
| nppmmid30agg | 25.411118 |
| tpi30agg | 23.708048 |
| cpg30precip | 12.607525 |
| pca_B1 | 9.653712 |
| hdi30RS | 8.526657 |
| decid30RS | 8.196263 |
| pca_B2 | 8.160256 |
| deficitRS | 7.502601 |
| cad30RS | 5.513605 |
| vvsd30agg | 5.473749 |
| twi30agg | 5.337670 |
| vv30agg | 4.688262 |
| npol30agg | 4.169500 |
| ndwias30agg | 4.003539 |
| ndvi30yrRS | 3.496428 |
RFE 14th HUC GBM
#bring in cores
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
#add to controls
rfeCtrl14g <- rfeControl(
functions = gbmRFE,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indices14sel$index
)
rfeCtrl14gprev <- rfeControl(
functions = gbmRFE,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indices14selprev$index
)
#balanced
set.seed(1234)
rfe14g <- rfe(rec_sel,
data = trainsel, metric = "Kappa",
rfeControl = rfeCtrl14g,
sizes = c(2:18))
#prevalence
set.seed(1234)
rfe14gprev <- rfe(rec_selprev,
data = trainprevsel, metric = "Kappa",
rfeControl = rfeCtrl14gprev,
sizes = c(2:18))
#now remove cluster. consistency...
stopCluster(cluster)
registerDoSEQ()
| Variable | Overall |
|---|---|
| accum30 | 100.000000 |
| nppmmid30agg | 11.646221 |
| cpg30precip | 9.185082 |
| tpi30agg | 6.084723 |
| decid30RS | 3.872173 |
| deficitRS | 3.612897 |
| cad30RS | 3.267076 |
| pca_B1 | 3.184627 |
| Variable | Overall |
|---|---|
| accum30 | 100.000000 |
| tpi30agg | 23.978427 |
| nppmmid30agg | 23.277920 |
| cpg30precip | 12.187836 |
| pca_B1 | 9.195571 |
| deficitRS | 8.000454 |
| hdi30RS | 6.821554 |
| cad30RS | 6.545682 |
| pca_B2 | 6.428239 |
| decid30RS | 6.304483 |
| vv30agg | 5.866341 |
| twi30agg | 4.644523 |
| ndwias30agg | 3.972278 |
| vvsd30agg | 3.484467 |
| npol30agg | 2.878771 |
| ndvi30yrRS | 1.585071 |
RFE Kmeans GBM
#bring in cores
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
#add to controls
rfeCtrlKg <- rfeControl(
functions = gbmRFE,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indicesKsel$index
)
rfeCtrlKgprev <- rfeControl(
functions = gbmRFE,
method = "repeatedcv",
repeats = 5,
returnResamp = "all",
verbose = FALSE,
allowParallel = TRUE,
index = indicesKselprev$index
)
#balanced
set.seed(1234)
rfeKg <- rfe(rec_sel,
data = trainsel, metric = "Kappa",
rfeControl = rfeCtrlKg,
sizes = c(2:18))
#prevalence
set.seed(1234)
rfeKgprev <- rfe(rec_selprev,
data = trainprevsel, metric = "Kappa",
rfeControl = rfeCtrlKgprev,
sizes = c(2:18))
#now remove cluster. consistency...
stopCluster(cluster)
registerDoSEQ()
| Variable | Overall |
|---|---|
| accum30 | 100.0000000 |
| nppmmid30agg | 11.0036543 |
| cpg30precip | 8.5173055 |
| tpi30agg | 6.5050789 |
| decid30RS | 3.5434939 |
| cad30RS | 3.2325680 |
| deficitRS | 2.9980272 |
| pca_B1 | 2.9335504 |
| pca_B2 | 1.5771027 |
| vvsd30agg | 1.2652670 |
| hdi30RS | 1.1346387 |
| ndvi30yrRS | 0.9530277 |
| npol30agg | 0.7642998 |
| vv30agg | 0.6926950 |
| ndwias30agg | 0.6084585 |
| twi30agg | 0.3628058 |
| Variable | Overall |
|---|---|
| accum30 | 100.0000000 |
| tpi30agg | 22.8016631 |
| nppmmid30agg | 22.7733825 |
| cpg30precip | 11.8369937 |
| deficitRS | 8.1688138 |
| decid30RS | 7.8369517 |
| pca_B1 | 7.5783419 |
| hdi30RS | 6.1432806 |
| cad30RS | 5.8776606 |
| pca_B2 | 5.8593121 |
| twi30agg | 4.8789386 |
| ndwias30agg | 4.5529335 |
| vv30agg | 3.9139695 |
| vvsd30agg | 3.2845994 |
| npol30agg | 2.5528054 |
| ndvi30yrRS | 0.3803842 |
As we expected, going from large hold-outs to small ones again shows the same trend, with performance improving as the spatial hold-outs shrink; the quick check below pulls the best resampled Kappa from each rfe object to make this easy to see.
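A hedged sketch of that check (object names taken from the balanced-data chunks above; the prevalence models could be compared the same way):
rfeList <- list(HUC12_rf = rfe12, blockCV_rf = rfeMed, HUC14_rf = rfe14, kmeans_rf = rfeK,
                HUC12_gbm = rfe12g, blockCV_gbm = rfeMedg, HUC14_gbm = rfe14g, kmeans_gbm = rfeKg)
sort(sapply(rfeList, function(x) max(x$results$Kappa, na.rm = TRUE))) #best resampled Kappa per dependence structure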
So with the ffs we will just use the default parameters for both random forest and GBM. This shouldn’t be too big of a concern, as I mentioned earlier. Kappa will still be the performance metric for selection.
FFS 12th HUC Random Forest
#bring in cores
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
#set up control
ctrl12rf <- trainControl(method="repeatedcv",
repeats = 5,
allowParallel = TRUE,
returnResamp = "all",
verbose = FALSE,
index = indices12sel$index)
set.seed(1234)
ffs12rf <- ffs(rec_sel,
data = trainsel,
metric = "Kappa",
method = "rf",
trControl = ctrl12rf)
ctrltest <- trainControl(method="repeatedcv",
repeats = 5,
allowParallel = TRUE,
returnResamp = "all",
verbose = FALSE)
set.seed(1234)
ffstest <- ffs(rec_test,
data = testing,
metric = "Kappa",
method = "rf",
trControl = ctrltest)
#now remove cluster. consistency...
stopCluster(cluster)
registerDoSEQ()
Model Performance
In this section we will start tuning the variables selected during the feature selection process above. This is where we will really try to optimize performance using different techniques and approaches. Different model types have different tuning considerations (advantages and disadvantages). Random forest is pretty straightforward since it has only two main parameters (mtry and ntree), but GBM has several and can be sensitive to certain tuning schemes (Kuhn and Johnson 2013), so it should be handled with care when exploring tuning options. Also, the objectives of your project or study will ultimately direct your tuning and evaluation techniques. For more on the ‘why are we tuning?’ question, go to the Evaluation Techniques section for the rationale. This is just the nuts and bolts, so lots of graphs and code!
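For reference, these are the four main GBM tuning parameters that caret exposes for method = "gbm"; even a modest full grid multiplies quickly, which is why we lean on random/adaptive search later (the values below are purely illustrative):
gbmGrid <- expand.grid(n.trees = c(100, 500, 1000),
                       interaction.depth = c(1, 3, 5),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = c(10, 20))
nrow(gbmGrid) #number of candidate models a full grid would fit per resample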
With rfe we end up with a lot of subsets (i.e., candidate sets of important variables), which we then need to decide how to filter or choose among. Suppose we took the subset with the highest Kappa statistic, which contains 14 variables, but there were 5 other subsets with 4, 5, 6, 7, and 10 variables within 2% of that Kappa value. Then it can be advantageous to take the one with the fewest variables (curse of dimensionality) within a certain range (e.g., 1-5%). We could also use the ‘one standard error’ approach of Breiman et al. (1984) and take the least complex model (e.g., fewest variables or parameters) within one standard error of the best-performing metric. These all have their trade-offs, and some approaches work better for different types of models or feature selection; however, it’s prudent to explore these options and know the costs associated with them.
The goal is to not fall into the trap of applying this to every selection step indiscriminately, i.e., keep feature selection and feature engineering distinct. In our case we are using it to trim down the rfe feature selection results, not the final GBM or random forest models. It also wouldn’t be appropriate to use with ffs, since ffs is essentially doing this for us. If we used it for tuning (i.e., feature engineering), we would need to be careful because of the different tuning parameters involved. This is subtle but worth highlighting. For example, how is a random forest model with mtry = 2 and 1000 trees less complex than a model with mtry = 6 and 250 trees? And that is just random forest, which only has two free parameters! It only gets more complicated with GBM, which has 4-5 free parameters and can lead to a very subjective selection process under the ‘least complex’ approach (Kuhn and Johnson 2013).
Therefore, with rfe we will look at both the one-standard-error and tolerance approaches. Hence the best-performing feature subset might not be the one we pick and tune, even though picking the best performer is rfe’s default behavior. If this is too confusing, I apologize; it’s my inability to express the concept clearly (guilty as charged). What I’m trying to say is that selecting the model purely on performance (the default setting) should be handled carefully. Hopefully the descriptions below from the function manual will explain this process more clearly, or for more information go to topepo.
“oneSE is a rule in the spirit of the ‘one standard error’ rule of Breiman et al. (1984), who suggest that the tuning parameter associated with the best performance may over fit. They suggest that the simplest model within one standard error of the empirically optimal model is the better choice. This assumes that the models can be easily ordered from simplest to most complex (see the Details section below).
tolerance takes the simplest model that is within a percent tolerance of the empirically optimal model. For example, if the largest Kappa value is 0.5 and a simpler model within 3 percent is acceptable, we score the other models using (x - 0.5)/0.5 * 100. The simplest model whose score is not less than -3 is chosen (in this case, a model with a Kappa value of 0.35 is not acceptable).”
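As a tiny, hedged illustration of the two rules, here they are applied to a made-up results table (both helpers are exported by caret, expect the rows to be ordered from simplest to most complex, and return the row index of the selected model):
library(caret)
res <- data.frame(Variables = c(2, 5, 10, 14),       #toy subsets ordered simplest to most complex
                  Kappa     = c(0.55, 0.58, 0.59, 0.60),
                  KappaSD   = c(0.04, 0.04, 0.04, 0.04))
res[oneSE(res, metric = "Kappa", num = 10, maximize = TRUE), ]    #simplest model within one SE of the best
res[tolerance(res, metric = "Kappa", tol = 2, maximize = TRUE), ] #simplest model within 2% of the best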
Here’s a function below that takes different tolerance inputs and graphs them so we can see how the rfe subsets react to certain tolerance values. This will hopefully give us an idea of the gains we can get from reducing dimensions while keeping performance relatively close to the maximum.
toltune <- function(object, len, metric = "", title = "", plot = TRUE){
object <- object$results
res = integer()
for (i in 1:len){
res[i] <- tolerance(object, metric = metric, tol = i,maximize = TRUE)
}
tolit <- object[res,1:6]
tolit$tolerance <- seq(1,len, 1)
if (plot == "TRUE"){
pp <- tolit %>% ggplot(aes_string("tolerance",metric)) +
geom_point(aes(color = factor(Variables)), size = 3) + geom_line() +
ggtitle(paste(title),"statistic against tolerance values") +
scale_x_continuous(breaks = 1:len) +
labs(color = "# Variables")
} else {
}
return(list(tolit, pp))
}Now we can look at the different rfe models compared to the tolerance ranges. Ideally we want to pick the least complex model and in our case this would be related to variable size. The basic function above helps us do this.
As you can see, it’s nice to look at how these different variable subsets plot against a range of tolerances. This will hopefully give us a better idea of which subset to choose and how the variables/models react with rfe. It seems like the random forest models flatten out as tolerance decreases while GBM steadily decreases, as you can see above. Not sure what to make of this yet, so comments or thoughts would be much appreciated. Now we can look at the one standard error approach.
Let’s make a function so we can handle all the models. Create a list with all the models and then we can iterate through them and save the oneSD result.
oneReturn <- function(objlist, metric = "", num, maximize = TRUE){
y <- NULL
for (i in 1:length(objlist)){
object <- objlist[[i]]
object <- object$results
a <- oneSE(object, metric = metric, num = num, maximize = maximize)
a <- object[a,]
y <- rbind(y, a)
}
return(y)
}
oneR <- oneReturn(objlist, "Kappa", num = 10)
oneR1 <- oneReturn(objlist2, "Kappa", num = 10)
We can now plot them to see how they add up. Not in order, sorry.
oneR %>% ggplot(aes(model, Kappa, color = factor(model))) + geom_point(size = 5) + ggtitle("Balanced oneSD Kappa values per Model")
oneR1 %>% ggplot(aes(model, Kappa, color = factor(model))) + geom_point(size = 5) + ggtitle("Prevalence oneSD Kappa values per Model")
Now we can add the tolerance selection and see how the two methods compare. To choose the tolerance group I went with the subsets closest to the highest performance, e.g., flat sections that still had high performance. That made it easy with random forest, but GBM only shed a few variables, so I was a little more conservative there. As you can see below, both methods land pretty close to each other, so I feel comfortable using either the tolerance approach or the oneSE approach. The oneSE approach seems a lot easier and less involved, so I’ll most likely use it to get the final subset from rfe.
Now that we’ve figured out which rfe subsets we are going to use, we can start tuning the parameters (e.g., trees, depth, etc.). This tuning process will also include the thresholder function from caret to explore the cutoff threshold on the response probabilities. However, this will mostly be a visual exercise, as precision-recall AUC (prAUC) will be the ultimate measure of performance. To start, we will add some custom pieces to the random forest and GBM model setups.
| | Variables | Kappa | model |
|---|---|---|---|
| 4 | 5 | 0.5940055 | RFE RF12 |
| 1 | 2 | 0.5630988 | RFE GBM12 |
| 41 | 5 | 0.6205813 | RFE RFMed |
| 5 | 6 | 0.6144926 | RFE GBM Med |
| 42 | 5 | 0.6429501 | RFE RF14 |
| 51 | 6 | 0.6336356 | RFE GBM 14 |
| 43 | 5 | 0.6967188 | RFE RFK |
| 52 | 6 | 0.6721744 | RFE GBM K |
RFE 12th HUC Random Forest Tune
#get variable subset from oneSD results
rfe12$optVariables[1:5]
levels(traintune$stream) <- c("X0","X1")
rec_tune12 <- recipe(stream ~ ., data = traintune) %>%
update_role(sb12,sb14,sbMed,spatial_cluster, HUC12, HUC14, coords.x1,coords.x2, new_role = "bring along") %>%
update_role(-accum30,-tpi30agg,-nppmmid30agg,-cpg30precip,-cad30RS,-stream, new_role = "bring along")
customRF <- getModelInfo(model = "rf", regex = FALSE)[[1]]
#make custom parameters that we'll use on all random forest models
names(customRF)
prm <- data.frame(parameter = c("mtry", "ntree"),
class = rep("numeric",2),
label = c("#Random Predictors", "#of Trees"))
customRF$parameters <- prm
customRF$grid <- function(x, y, len = NULL, search = "grid") {
if(search == "grid") {
out <- data.frame(mtry = caret::var_seq(p = ncol(x),
classification = is.factor(y), len = len))
} else {
out <- data.frame(mtry = unique(sample(1:ncol(x), size = len, replace = TRUE)))
}
}
customRF$fit <- function(x, y, wts, param, lev, last, classProbs, ...){
randomForest::randomForest(x, y, mtry = param$mtry, ntree = param$ntree, ...)}
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
set.seed(1234)
rfe12tune <- train(rec_tune12, data = traintune,
method = customRF,
metric = "prAUC",
tuneGrid = expand.grid(mtry = 1:4,ntree=seq(100,1500,200)),
trControl = trainControl(method = "repeatedcv",
repeats = 5,
classProbs = TRUE,
savePredictions = 'all',
index = indices12tune$index,
summaryFunction = customRFsum))
stopCluster(cluster)
registerDoSEQ()
rfe12tuneThresh <- thresholder(rfe12tune, threshold = seq(0,1,.02)) #we'll just select the final model
|  | X0 | X1 |
|---|---|---|
| X0 | 2136 | 468 |
| X1 | 597 | 2793 |
RFE Med BlockCV Random Forest Tune
#get variable subset from oneSD results
rfeMed$optVariables[1:5]
rec_tuneMed <- recipe(stream ~ ., data = traintune) %>%
update_role(sb12,sb14,sbMed,spatial_cluster, HUC12, HUC14, coords.x1,coords.x2, new_role = "bring along") %>%
update_role(-accum30,-tpi30agg,-nppmmid30agg,-cpg30precip,-cad30RS,-stream, new_role = "bring along")
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
set.seed(1234)
rfeMedtune <- train(rec_tuneMed, data = traintune,
method = customRF,
metric = "prAUC",
tuneGrid = expand.grid(mtry = 1:4,ntree=seq(100,1500,200)),
trControl = trainControl(method = "repeatedcv",
repeats = 5,
classProbs = TRUE,
savePredictions = 'all',
index = indicesMedtune$index,
summaryFunction = multiClassSummary))
stopCluster(cluster)
registerDoSEQ()
rfeMedtuneThresh <- thresholder(rfeMedtune, threshold = seq(0,1,.02))
|  | X0 | X1 |
|---|---|---|
| X0 | 2135 | 475 |
| X1 | 598 | 2786 |
RFE 14th HUC Random Forest Tune
#get variable subset from oneSD results
rfe14$optVariables[1:5]
rec_tune14 <- recipe(stream ~ ., data = traintune) %>%
update_role(sb12,sb14,sbMed,spatial_cluster, HUC12, HUC14, coords.x1,coords.x2, new_role = "bring along") %>%
update_role(-accum30,-tpi30agg,-nppmmid30agg,-cpg30precip,-cad30RS,-stream, new_role = "bring along")
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
set.seed(1234)
rfe14tune <- train(rec_tune14, data = traintune,
method = customRF,
metric = "prAUC",
tuneGrid = expand.grid(mtry = 1:4,ntree=seq(100,1500,200)),
trControl = trainControl(method = "repeatedcv",
repeats = 5,
classProbs = TRUE,
savePredictions = 'all',
index = indices14tune$index,
summaryFunction = multiClassSummary))
stopCluster(cluster)
registerDoSEQ()
rfe14tuneThresh <- thresholder(rfe14tune, threshold = seq(0,1,.02))
|  | X0 | X1 |
|---|---|---|
| X0 | 2228 | 509 |
| X1 | 505 | 2752 |
RFE Kmeans Random Forest Tune
#get variable subset from oneSD results
rfeK$optVariables
rec_tuneK <- recipe(stream ~ ., data = traintune) %>%
update_role(sb12,sb14,sbMed,spatial_cluster, HUC12, HUC14, coords.x1,coords.x2, new_role = "bring along") %>%
update_role(-accum30,-tpi30agg,-nppmmid30agg,-cpg30precip,-cad30RS,-stream, new_role = "bring along")
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
set.seed(1234)
rfeKtune <- train(rec_tuneK, data = traintune,
method = customRF,
metric = "prAUC",
tuneGrid = expand.grid(mtry = 1:4,ntree=seq(100,1500,200)),
trControl = trainControl(method = "repeatedcv",
repeats = 5,
classProbs = TRUE,
savePredictions = 'all',
index = indicesKtune$index,
summaryFunction = multiClassSummary))
stopCluster(cluster)
registerDoSEQ()
rfeKtuneThresh <- thresholder(rfeKtune, threshold = seq(0,1,.02))
|  | X0 | X1 |
|---|---|---|
| X0 | 2335 | 433 |
| X1 | 398 | 2828 |
RFE 12th HUC GBM Tune
#get variable subset from oneSD results
rfe12g$optVariables
library(recipes)
library(CAST)
library(caret)
library(BradleyTerry2)
rec_tune12g <- recipe(stream ~ ., data = traintune) %>%
update_role(-accum30,-nppmmid30agg,-stream, new_role = "bring along")
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
set.seed(1234)
rfe12gtune <- train(rec_tune12g, data = traintune,
method = "gbm", tuneLength = 10,
metric = "prAUC", distribution = "bernoulli",
trControl = trainControl(method = "adaptive_cv",
repeats = 5,
adaptive = list(min = 5, alpha = 0.05,
method = "BT", complete = TRUE),
classProbs = TRUE,
savePredictions = 'all',
search = "random",
index = indices12tune$index,
summaryFunction = multiClassSummary))
stopCluster(cluster)
registerDoSEQ()
rfe12gtuneThresh <- thresholder(rfe12gtune, threshold = seq(0,1,.02))
Model Assessment
Bibliography
Altman, Douglas G, and J Martin Bland. 1994. “Diagnostic Tests. 1: Sensitivity and Specificity.” BMJ: British Medical Journal 308 (6943): 1552.
Bahn, Volker, and Brian J McGill. 2013. “Testing the Predictive Performance of Distribution Models.” Oikos 122 (3): 321–31.
Beven, Keith J, and Michael J Kirkby. 1979. “A Physically Based, Variable Contributing Area Model of Basin Hydrology / Un Modèle à Base Physique de Zone d’appel Variable de L’hydrologie Du Bassin Versant.” Hydrological Sciences Journal 24 (1): 43–69.
Breiman, Leo, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and Regression Trees. CRC press.
Dobrowski, Solomon Z, John Abatzoglou, Alan K Swanson, Jonathan A Greenberg, Alison R Mynsberge, Zachary A Holden, and Michael K Schwartz. 2013. “The Climate Velocity of the Contiguous United States During the 20th Century.” Global Change Biology 19 (1): 241–51.
Elith, Jane, John R Leathwick, and Trevor Hastie. 2008. “A Working Guide to Boosted Regression Trees.” Journal of Animal Ecology 77 (4): 802–13.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. 10. Springer series in statistics New York.
Gesch, Dean, Gayla Evans, James Mauck, John Hutchinson, William J Carswell Jr, and others. 2009. “The National Map—Elevation.” US Geological Survey Fact Sheet 3053 (4).
Gorelick, Noel, Matt Hancher, Mike Dixon, Simon Ilyushchenko, David Thau, and Rebecca Moore. 2017. “Google Earth Engine: Planetary-Scale Geospatial Analysis for Everyone.” Remote Sensing of Environment. https://doi.org/10.1016/j.rse.2017.06.031.
Hird, Jennifer, Evan DeLancey, Gregory McDermid, and Jahan Kariyeva. 2017. “Google Earth Engine, Open-Access Satellite Data, and Machine Learning in Support of Large-Area Probabilistic Wetland Mapping.” Remote Sensing 9 (12): 1315.
Holden, Z. A., A. Swanson, and others. In prep. “Using Climatic and Biophysical Variables to Model the Presence and Severity of Root Disease Across the U.S. Northern Rocky Mountains.”
Hoylman, Zachary H, Kelsey G Jencso, Jia Hu, Zachary A Holden, Justin T Martin, and W Payton Gardner. 2019. “The Climatic Water Balance and Topography Control Spatial Patterns of Atmospheric Demand, Soil Moisture, and Shallow Subsurface Flow.” Water Resources Research 55 (3): 2370–89.
Hoylman, Zachary H, Kelsey G Jencso, Jia Hu, Justin T Martin, Zachary A Holden, Carl A Seielstad, and Eric M Rowell. 2018. “Hillslope Topography Mediates Spatial Patterns of Ecosystem Sensitivity to Climate.” Journal of Geophysical Research: Biogeosciences 123 (2): 353–71.
Ives, Anthony R, and Jun Zhu. 2006. “Statistics for Correlated Data: Phylogenies, Space, and Time.” Ecological Applications 16 (1): 20–32.
Jaeger, KL, R Sando, Ryan R McShane, Jason B Dunham, DP Hockman-Wert, Kendra E Kaiser, K Hafen, JC Risley, and KW Blasch. 2019. “Probability of Streamflow Permanence Model (Prosper): A Spatially Continuous Model of Annual Streamflow Permanence Throughout the Pacific Northwest.” Journal of Hydrology X 2: 100005.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Kaufman, Shachar, Saharon Rosset, Claudia Perlich, and Ori Stitelman. 2012. “Leakage in Data Mining: Formulation, Detection, and Avoidance.” ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (4): 1–21.
Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.
Lidberg, William, Mats Nilsson, and Anneli Ågren. 2020. “Using Machine Learning to Generate High-Resolution Wet Area Maps for Planning Forest Management: A Study in a Boreal Forest Landscape.” Ambio 49 (2): 475–86.
Lutz, James A, Jan W Van Wagtendonk, and Jerry F Franklin. 2010. “Climatic Water Deficit, Tree Species Ranges, and Climate Change in Yosemite National Park.” Journal of Biogeography 37 (5): 936–50.
Matloff, Norman. 2017. Statistical Regression and Classification: From Linear Models to Machine Learning. CRC Press.
Meyer, Hanna, Christoph Reudenbach, Tomislav Hengl, Marwan Katurji, and Thomas Nauss. 2018. “Improving Performance of Spatio-Temporal Machine Learning Models Using Forward Feature Selection and Target-Oriented Validation.” Environmental Modelling & Software 101: 1–9.
Meyer, Hanna, Christoph Reudenbach, Stephan Wöllauer, and Thomas Nauss. 2019. “Importance of Spatial Predictor Variable Selection in Machine Learning Applications–Moving from Data Reproduction to Spatial Prediction.” Ecological Modelling 411: 108815.
Pelletier, Jon D, Greg A Barron-Gafford, Hugo Gutiérrez-Jurado, Eve-Lyn S Hinckley, Erkan Istanbulluoglu, Luke A McGuire, Guo-Yue Niu, et al. 2018. “Which Way Do You Lean? Using Slope Aspect Variations to Understand Critical Zone Processes and Feedbacks.” Earth Surface Processes and Landforms 43 (5): 1133–54.
Raduła, Małgorzata W, Tomasz H Szymura, and Magdalena Szymura. 2018. “Topographic Wetness Index Explains Soil Moisture Better Than Bioindication with Ellenberg’s Indicator Values.” Ecological Indicators 85: 172–79.
Roberts, David R, Volker Bahn, Simone Ciuti, Mark S Boyce, Jane Elith, Gurutzeta Guillera-Arroita, Severin Hauenstein, et al. 2017. “Cross-Validation Strategies for Data with Temporal, Spatial, Hierarchical, or Phylogenetic Structure.” Ecography 40 (8): 913–29.
Robinson, Nathaniel P, Brady W Allred, William K Smith, Matthew O Jones, Alvaro Moreno, Tyler A Erickson, David E Naugle, and Steven W Running. 2018. “Terrestrial Primary Production for the Conterminous United States Derived from Landsat 30 M and Modis 250 M.” Remote Sensing in Ecology and Conservation 4 (3): 264–80.
Sando, R., K. E. Kaiser, and T. D. Olsen. 2018. “Probability of Streamflow Permanence (Prosper) Continuous Parameter Grids (Cpgs).”
Sando, Roy, and Kyle W Blasch. 2015. “Predicting Alpine Headwater Stream Intermittency: A Case Study in the Northern Rocky Mountains.” Ecohydrology & Hydrobiology 15 (2): 68–80.
Sörensen, Rasmus, Ursula Zinko, and Jan Seibert. 2006. “On the Calculation of the Topographic Wetness Index: Evaluation of Different Methods Based on Field Observations.” Hydrology and Earth System Sciences Discussions 10 (1): 101–12.
Stephenson, Nathan. 1998. “Actual Evapotranspiration and Deficit: Biologically Meaningful Correlates of Vegetation Distribution Across Spatial Scales.” Journal of Biogeography 25 (5): 855–70.
Supplementary Material (Code S1) to: Hird, Jennifer, Evan DeLancey, Gregory McDermid, and Jahan Kariyeva. 2017. “Google Earth Engine, Open-Access Satellite Data, and Machine Learning in Support of Large-Area Probabilistic Wetland Mapping.” Remote Sensing 9 (12): 1315.
Tarboton, David G. 2013. “TauDEM 5.2 Guide to Using the Taudem Command Line Functions for Taudem Multi-File.”
R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0. http://www.R-project.org.
Valavi, Roozbeh, Jane Elith, Jose J Lahoz-Monfort, and Gurutzeta Guillera-Arroita. 2018. “BlockCV: An R Package for Generating Spatially or Environmentally Separated Folds for K-Fold Cross-Validation of Species Distribution Models.” bioRxiv, 357798.
Van Belle, Gerald. 2011. Statistical Rules of Thumb. John Wiley & Sons.