This is a practical case of using machine learning to classify land cover and find polluted waterbodies. For data preparation I use QGIS and SAGA GIS. I download Sentinel-2A images and perform atmospheric correction with QGIS. For clipping, spatial referencing and building the training dataset I use SAGA GIS. For clipping the satellite images I use shapefiles of the areas from the OpenStreetMap service.

But I must admit that I do not perform any ground-based validation of my models; my conclusions are based on satellite data alone. I am also not a specialist in waterbody management.

After some data preparation I load the rasters (rastername.sdat) and the training dataset (datasetname.shp) into the R working environment.

The training dataset contains points with class labels in the field “Name”. It has 189 rows and 2 fields.

Both types of objects, rasters and shapefiles, have the same spatial reference, EPSG:32636 (WGS 84 / UTM zone 36N). Without this spatial alignment you could not train a model or make predictions.
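
A quick way to verify this in R, once both objects are loaded later (using the object names from the rest of this post):

#check that the raster stack and the training points share the same CRS
crs(T36UWU20170331)
crs(trainData)
compareCRS(T36UWU20170331, trainData) #should return TRUE if they match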

For large raster objects I set a temporary data directory in my home directory. So, let’s go!

First of all, we need to load the required packages and set the temporary data directory.

library(raster)
library(RStoolbox)
library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
#set a temporary data directory for large objects
rasterOptions(tmpdir = "/home/geka/TEMP_R/")
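
If needed, the current settings can be checked; this is an optional step, not part of the original workflow:

#print the current raster options, including the temporary directory
rasterOptions()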

Next we load the clipped raster files. Each file is one band of the Sentinel-2A image. For this work I use bands 2, 3, 4, 8 and 12 with DOS-1 atmospheric correction.

#Download the clipped Sentinel-2A images (March 31, 2017)
T36UWU20170331_B2 <- raster("/home/geka/Documents/DirtyWater/RT_T36UWU_A009256_20170331T085011_B02_clip.sdat")
T36UWU20170331_B3 <- raster("/home/geka/Documents/DirtyWater/RT_T36UWU_A009256_20170331T085011_B03_clip.sdat")
T36UWU20170331_B4 <- raster("/home/geka/Documents/DirtyWater/RT_T36UWU_A009256_20170331T085011_B04_clip.sdat")
T36UWU20170331_B8 <- raster("/home/geka/Documents/DirtyWater/RT_T36UWU_A009256_20170331T085011_B08_clip.sdat")
T36UWU20170331_B12 <- raster("/home/geka/Documents/DirtyWater/RT_T36UWU_A009256_20170331T085011_B12_clip.sdat")

Now we make an image stack, combining all the bands into one object:

T36UWU20170331 <- stack(T36UWU20170331_B2, T36UWU20170331_B3, T36UWU20170331_B4, T36UWU20170331_B8, T36UWU20170331_B12)
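
As a quick optional sanity check, we can confirm that the stack really contains all five bands:

nlayers(T36UWU20170331) #should return 5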

To verify that everything is correct, we can visualise our object:

#layers 3, 2, 1 of the stack are bands B4, B3, B2: a natural-colour composite
plotRGB(T36UWU20170331, r = 3, g = 2, b = 1, stretch = "lin")

#assign readable band names to the stack layers
names(T36UWU20170331) <- c("B2", "B3", "B4", "B8", "B12")

Now we load the shapefile with the training data:

trainData <- shapefile("/home/geka/Documents/DirtyWater/LearningPoins_32636.shp")
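
To confirm the shapefile matches the description above (189 points with class labels in the field “Name”), we can inspect it; this check is optional:

nrow(trainData) #should return 189
table(trainData$Name) #class labels and their counts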

For training we must attach the Sentinel-2A band values to the training points:

#extract band values at the training point locations
train_var <- as.data.frame(extract(T36UWU20170331, trainData))
#append the extracted values to the points' attribute table, matching rows by name
trainData@data <- data.frame(trainData@data, train_var[match(rownames(trainData@data), rownames(train_var)),])
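
To verify the merge, it may help to look at the first rows; each point should now carry its original attributes plus the five band values:

head(trainData@data)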

We can find the best number of variables tried at each split (“mtry”) for the Random Forest. The tuning routine reports that the optimal mtry value is 2, but for processing I chose mtry = 3.

#tune mtry: columns 3 onwards hold the band values, "Name" holds the class labels
tuneRF(x = trainData@data[,3:ncol(trainData@data)], y = as.factor(trainData@data[,"Name"]))
mtry = 2  OOB error = 9.52% 
Searching left ...
mtry = 1    OOB error = 11.64% 
-0.2222222 0.05 
Searching right ...
mtry = 4    OOB error = 12.17% 
-0.2777778 0.05 
      mtry  OOBError
1.OOB    1 0.1164021
2.OOB    2 0.0952381
4.OOB    4 0.1216931
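
For reference, a minimal sketch of how the chosen mtry = 3 feeds into training a Random Forest; the name rf_model and the other parameters (ntree, importance) are placeholders and may differ from the final model:

#train a Random Forest classifier with the manually chosen mtry
rf_model <- randomForest(x = trainData@data[,3:ncol(trainData@data)],
                         y = as.factor(trainData@data[,"Name"]),
                         mtry = 3, ntree = 500, importance = TRUE)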