Random Forest Prediction soil map

Upload library(and several code still hidden)

Data preparation

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Deutsches_Hauptdreiecksnetz in CRS definition

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Deutsches_Hauptdreiecksnetz in CRS definition

The covariates are then converted to principal components to reduce covariance and dimensionality:

eberg_spc <- spc(eberg_grid, ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6)

## Converting PRMGEO6 to indicators...

## Warning in proj4string(obj): CRS object has comment, which is lost in output

## Converting covariates to principal components...

eberg_grid@data <- cbind(eberg_grid@data, eberg_spc@predicted@data)

All further analysis is run using the so-called regression matrix (matrix produced using the overlay of points and grids), which contains values of the target variable and all covariates for all training points:

ov <- over(eberg, eberg_grid)
m <- cbind(ov, eberg@data)
dim(m)

## [1] 3670   44

In this case the regression matrix consists of 3670 observations and has 44 columns.

Spatial prediction of soil classes using MLA's

In the first example, we focus on mapping soil types using the auger point data. First, we need to filter out some classes that do not occur frequently enough to support statistical modelling. As a rule of thumb, a class to be modelled should have at least 5 observations:

xg <- summary(m$TAXGRSC, maxsum=(1+length(levels(m$TAXGRSC))))
str(xg)

##  Named int [1:14] 71 790 86 1 186 1 704 215 252 487 ...
##  - attr(*, "names")= chr [1:14] "Auenboden" "Braunerde" "Gley" "HMoor" ...

selg.levs <- attr(xg, "names")[xg > 5]
attr(xg, "names")[xg <= 5]

## [1] "HMoor" "Moor"

this shows that two classes probably have too few observations and should be excluded from further modeling:

m$soiltype <- m$TAXGRSC
m$soiltype[which(!m$TAXGRSC %in% selg.levs)] <- NA
m$soiltype <- droplevels(m$soiltype)
str(summary(m$soiltype, maxsum=length(levels(m$soiltype))))

##  Named int [1:11] 790 704 487 376 252 215 186 86 71 43 ...
##  - attr(*, "names")= chr [1:11] "Braunerde" "Parabraunerde" "Pseudogley" "Regosol" ...

We can also remove all points that contain missing values for any combination of covariates and target variable:

m <- m[complete.cases(m[,1:(ncol(eberg_grid)+2)]),]
m$soiltype <- as.factor(m$soiltype)
summary(m$soiltype)

##     Auenboden     Braunerde          Gley    Kolluvisol Parabraunerde 
##            48           669            68           138           513 
##  Pararendzina       Pelosol    Pseudogley        Ranker       Regosol 
##           176           177           411            17           313 
##      Rendzina 
##            22

We can now test fitting a MLA i.e. a random forest model using four covariate layers (parent material map, elevation, TWI and ASTER thermal band):

## subset to speed-up:
s <- sample.int(nrow(m), 500)
TAXGRSC.rf <- randomForest(x=m[-s,paste0("PC",1:10)], y=m$soiltype[-s],
                           xtest=m[s,paste0("PC",1:10)], ytest=m$soiltype[s])
## accuracy:
TAXGRSC.rf$test$confusion[,"class.error"]

##     Auenboden     Braunerde          Gley    Kolluvisol Parabraunerde 
##     0.8333333     0.5076923     0.7500000     0.7142857     0.5045872 
##  Pararendzina       Pelosol    Pseudogley        Ranker       Regosol 
##     0.6000000     0.6363636     0.6388889     1.0000000     0.7741935 
##      Rendzina 
##     0.5000000

Note that, by specifying xtest and ytest, we run both model fitting and cross-validation with 500 excluded points. The results show relatively high prediction error of about 60% i.e. relative classification accuracy of about 40%.

TAXGRSC.rf <- randomForest(x=m[,paste0("PC",1:10)], y=m$soiltype)
#fm <- as.formula(paste("soiltype~", paste(paste0("PC",1:10), collapse="+")))
#TAXGRSC.mn <- nnet::multinom(fm, m)
#TAXGRSC.svm <- e1071::svm(fm, m, probability=TRUE, cross=5)
#TAXGRSC.svm$tot.accuracy

Result Prediction

#probs1 <- predict(TAXGRSC.mn, eberg_grid@data, type="probs", na.action = na.pass) 
probs2 <- predict(TAXGRSC.rf, eberg_grid@data, type="prob", na.action = na.pass)
#probs3 <- attr(predict(TAXGRSC.svm, eberg_grid@data, 
#                       probability=TRUE, na.action = na.pass), "probabilities")
head(probs2,10)

##    Auenboden Braunerde  Gley Kolluvisol Parabraunerde Pararendzina Pelosol
## 1      0.078     0.174 0.080      0.116         0.252        0.036   0.022
## 2      0.080     0.174 0.084      0.112         0.256        0.040   0.020
## 3      0.064     0.172 0.084      0.096         0.246        0.064   0.040
## 4      0.000     0.214 0.000      0.000         0.000        0.406   0.242
## 5      0.000     0.694 0.000      0.000         0.000        0.222   0.034
## 6      0.000     0.644 0.000      0.000         0.000        0.272   0.018
## 7      0.000     0.246 0.000      0.000         0.000        0.292   0.208
## 8      0.000     0.054 0.000      0.000         0.000        0.196   0.422
## 9      0.000     0.026 0.000      0.000         0.028        0.336   0.232
## 10     0.000     0.072 0.000      0.000         0.042        0.082   0.736
##    Pseudogley Ranker Regosol Rendzina
## 1       0.140  0.016   0.086    0.000
## 2       0.144  0.008   0.082    0.000
## 3       0.132  0.010   0.082    0.010
## 4       0.074  0.000   0.048    0.016
## 5       0.034  0.000   0.012    0.004
## 6       0.038  0.000   0.006    0.022
## 7       0.044  0.000   0.100    0.110
## 8       0.098  0.002   0.160    0.068
## 9       0.338  0.012   0.018    0.010
## 10      0.034  0.012   0.020    0.002

derive average prediction:

leg <- levels(m$soiltype)
#lt <- list(probs1[,leg], probs2[,leg], probs3[,leg])
lt <- list(probs2[,leg])
probs <- Reduce("+", lt) / length(lt)
## copy and make new raster object:
eberg_soiltype <- eberg_grid
eberg_soiltype@data <- data.frame(probs)

Check that all predictions sum up to 100%:

ch <- rowSums(eberg_soiltype@data)
summary(ch)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1

Plot the result