Open Source Spatial Analytics

Preparation

The table of samples and raster stack of predictor variables are read in and the “steve”, “dpsm”, and “drain” columns are defined as factors. An ID column is created based on current row location in order to not have overlapping samples between the training and validation datasets. Using the class column (“slopeD”, “not”), 250 slope failure and 250 pseudo absence samples, 500 total, are selected for validating and the remaining 1500 slope failure and 1500 pseudo, 3000 total, are used for training samples. The two data sets are saved together to create the training set with all 3000 samples that do not overlap with the validation samples.

lands$steve <- factor(lands$steve)
lands$dspm <- factor(lands$dspm)
lands$drain <- factor(lands$drain)

lands$ID <- seq.int(nrow(lands))

set.seed(34)
val <- lands %>%
  group_by(class) %>%
  sample_n(250, replace = FALSE)

train<- lands[!(lands$ID %in% val$ID), ] 
train$class <- factor(train$class)

Training

The randomForest function is applied to the training data set to generate five different models with the dependent variable as the class column and the predictor variables as the seperate models:

All Variables
Terrain Only
Lithology and Soil
Distance Only
All Non-Terrain

501 trees are used and set importance, confusion, and error rate to T to calculate more measures like variable importance and commission/omission error

set.seed(34)
rf_all <- randomForest(y= train[ ,1], train[ ,2:44], ntree=501, importance=T, confusion=T, err.rate=T)
rf_terrain <- randomForest(y = train[, 1], train[, 2:33], ntree=501, importance=T, confusion=T, err.rate=T)
rf_soil <- randomForest(y = train[, 1], train[, 42:44], ntree=501, importance=T, confusion=T, err.rate=T)
rf_dist <- randomForest(y = train[, 1], train[, 34:41], ntree=501, importance=T, confusion=T, err.rate=T)
rf_allnon <- randomForest(y = train[, 1], train[, 34:44], ntree=501, importance=T, confusion=T, err.rate=T)

Validation

All 5 models are used in prediction against the validation data. I set the type to “prob” to obtain the resulting class likelihood or probability, index to 2 as I want the probability for slope, norm.votes to True in order to have fractions instead of counts, predict.all to false so all predictions will not be written out - only the final results, and proximity and nodes to False as the final probablitiy value is what this model is after. To assess the model, ROC Curves are generated using the roc() function and mapped using the plot() function. The AUC metric is calculated using the auc() function.

set.seed(34)
all_test <- predict(rf_all, val, index = 2, type = "prob",  norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
all_test <- data.frame(all_test)
all_test_roc <- roc(val$class, all_test$slopeD)
auc(all_test_roc)
## Area under the curve: 0.9434
all <- plot(all_test_roc)

terrain_test <- predict(rf_terrain, val, index = 2, type = "prob",  norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
terrain_test <- data.frame(terrain_test)
ter_test_roc <- roc(val$class, terrain_test$slopeD)
auc(ter_test_roc)
## Area under the curve: 0.9434
terrain <- plot(ter_test_roc)

soil_test <- predict(rf_soil, val, index = 2, type = "prob",  norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
soil_test <- data.frame(soil_test)
soil_test_roc <- roc(val$class, soil_test$slopeD)
auc(soil_test_roc)
## Area under the curve: 0.6173
plot(soil_test_roc)

dist_test <- predict(rf_dist, val, index = 2, type = "prob",  norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
dist_test <- data.frame(dist_test)
dist_test_roc <- roc(val$class, dist_test$slopeD)
auc(dist_test_roc)
## Area under the curve: 0.8349
plot(dist_test_roc)

allnon_test <- predict(rf_allnon, val, index = 2, type = "prob",  norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
allnon_test <- data.frame(allnon_test)
allnon_test_roc <- roc(val$class, allnon_test$slopeD)
auc(allnon_test_roc)
## Area under the curve: 0.8472
plot(allnon_test_roc)


ggroc(list(All = all_test_roc, AllNon = allnon_test_roc, Distance = dist_test_roc, SOil = soil_test_roc, Terrain = ter_test_roc), lwd = 1.15)  + 
theme_minimal() + 
scale_color_manual(labels=c("All Variables", "All-Non Terrain Only", "Distance Only", "Soil/Lithogoly Only", "Terrain Only"), values = c("yellow", "orange", "purple", "royalblue", "forestgreen"))+ ggtitle("ROC Comparison of Slope Failure Model Predictions") +
theme(legend.title = element_blank(), legend.background = element_rect(color = "black", fill = NA), legend.position = "bottom")

The results returned All Variables and Terrain Only predictor variables as having the largest AUC value. The two variables both returned an AUC of .9434. This indicates that All-Variables and Terrain Only perform the best out of the 5 models and predicts the most correct classes. The Lithology and Soil Only model has the smallest AUC value which suggests that it is a weaker model than the other models.

Spatial Prediction

The raster grid band names are changed to match the names in table from the second column to the last column in the original table. The model that was trained with all predictor variables is used to predict to the raster stack. The raster is then mapped used tmap functions to display the model’s predicted slope failure based on all predictor varibles.

table_names <- names(lands[2:44])
names(stack) <- table_names

predict(stack, rf_all, type="prob", index=2, na.rm=TRUE, progress="window", overwrite=TRUE, filename="C:/SW/Grad_WVU/693c/A13/predict_image.img")
## class      : RasterLayer 
## dimensions : 954, 1170, 1116180  (nrow, ncol, ncell)
## resolution : 2, 2  (x, y)
## extent     : 641564.5, 643904.5, 4298528, 4300436  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=17 +datum=NAD83 +units=m +no_defs 
## source     : C:/SW/Grad_WVU/693c/A13/predict_image.img 
## names      : predict_image 
## values     : 0.1796407, 0.6566866  (min, max)
raster_result <- raster("C:/SW/Grad_WVU/693c/A13/predict_image.img")

tm_shape(raster_result) + 
tm_raster(palette=get_brewer_pal(palette = "YlOrBr", plot=FALSE), title = "Predicted Slope Failure") + 
tm_layout(legend.outside = TRUE, main.title = "All Predictor Variables Model of Slope Failure", main.title.size = 1.15,  main.title.fontface = "bold") +
tm_compass(position = c("right", "bottom"), bg.color = "white", bg.alpha = .75) + 
tm_scale_bar(position = c("right", "bottom"), bg.color = "white", bg.alpha = .75)

Open Source Spatial Analytics - A13

Sarah Woolard

4/22/2021

Preparation

Training

The randomForest function is applied to the training data set to generate five different models with the dependent variable as the class column and the predictor variables as the seperate models:

501 trees are used and set importance, confusion, and error rate to T to calculate more measures like variable importance and commission/omission error

Validation

Spatial Prediction