This is a practical case of using machine learning for classification landscape cover and finding polluted waterbodies. For data preparation I use QGIS and SAGA GIS. Downloading and atmospheric correction Sentinel-2A images I do with QGIS. For clipping, spatial referencing and making learning dataset I use SAGA GIS. For clipping cosmic images I use OpenStreetMap service shape files of areas.
But I must admit that I do not perform any ground-based validation of my models. My conclusions are based only on cosmic data. And also I am not specialist in waterbodies management.
After some data preparation I download rasters (rastername.sdat) and learning dataset (datasetname.shp) into R working environment.
The learning dataset contain points with predict values in field “Name”. It has 189 rows and 2 fields.
Both types of objects, rasters and shape files, has the same spatial reference EPSG 32636 (WGS 84/UTM zone 36N). Without spatial collocation you could not do making learn model and predicting.
For large raster objects I set temporary data directory in home directory. So. Let’s go!
First at all we need to download required packages and preset temporary data directory.
library(raster)
library(RStoolbox)
library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
#set a temporary data directory for large objects
rasterOptions(tmpdir = "/home/geka/TEMP_R/")
Secondly we download clipped raster files. One files is one band of Sentinel-2A image. For this work I use 2, 3, 4, 8 and 12 bands with DOS-1 atmospheric correction.
#Download the clipped Sentinel-2A images (March 31, 2017)
T36UWU20170331_B2 <- raster("/home/geka/Documents/DirtyWater/RT_T36UWU_A009256_20170331T085011_B02_clip.sdat")
T36UWU20170331_B3 <- raster("/home/geka/Documents/DirtyWater/RT_T36UWU_A009256_20170331T085011_B03_clip.sdat")
T36UWU20170331_B4 <- raster("/home/geka/Documents/DirtyWater/RT_T36UWU_A009256_20170331T085011_B04_clip.sdat")
T36UWU20170331_B8 <- raster("/home/geka/Documents/DirtyWater/RT_T36UWU_A009256_20170331T085011_B08_clip.sdat")
T36UWU20170331_B12 <- raster("/home/geka/Documents/DirtyWater/RT_T36UWU_A009256_20170331T085011_B12_clip.sdat")
Now we make a “images stack”. We compose all band into one object:
T36UWU20170331 <- stack(T36UWU20170331_B2, T36UWU20170331_B3, T36UWU20170331_B4, T36UWU20170331_B8, T36UWU20170331_B12)
And for verify the correctnes we can visualise our object:
plotRGB(T36UWU20170331, r = 3, g = 2, b = 1, stretch = "lin")
names(T36UWU20170331) <- c("B2", "B3", "B4", "B8", "B12")
Now we download the shape file with training data:
trainData <- shapefile("/home/geka/Documents/DirtyWater/LearningPoins_32636.shp")
For training we must concatenate Sentinel-2A images with training points:
train_var <- as.data.frame(extract(T36UWU20170331, trainData))
trainData@data <- data.frame(trainData@data, train_var[match(rownames(trainData@data), rownames(train_var)),])
We can find the best variant of variables number for Random Forest processing. Autodetection algorythm show us that optimal “mtry” value is 2. But for processing I choiced mtry = 3.
tuneRF(x = trainData@data[,3:ncol(trainData@data)], y = as.factor(trainData@data[,"Name"]))
mtry = 2 OOB error = 9.52%
Searching left ...
mtry = 1 OOB error = 11.64%
-0.2222222 0.05
Searching right ...
mtry = 4 OOB error = 12.17%
-0.2777778 0.05
mtry OOBError
1.OOB 1 0.1164021
2.OOB 2 0.0952381
4.OOB 4 0.1216931
Now we make a Random Forest prediction model:
rf.dirty <- randomForest(x = trainData@data[,3:ncol(trainData@data)], y = as.factor(trainData@data[,"Name"]), ntree = 500, mtry = 3, importance = TRUE, proximity = TRUE, keep.forest = TRUE)
After model making we can visualise some parameters of our model. We can see that bands 8 and 12 are most important for our classification.
plot(rf.dirty)
varImpPlot(rf.dirty)
And also we can seen Out-of-Bag errors for all classification classes:
rf.dirty
Call:
randomForest(x = trainData@data[, 3:ncol(trainData@data)], y = as.factor(trainData@data[, "Name"]), ntree = 500, mtry = 3, importance = TRUE, proximity = TRUE, keep.forest = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 11.11%
Confusion matrix:
Bare Soil Buildings and Rocks Coniferous Forest Decidious Forest
Bare Soil 19 2 0 1
Buildings and Rocks 0 17 0 1
Coniferous Forest 0 0 19 0
Decidious Forest 0 0 0 18
Dirty Water 0 0 0 0
Grass and Crop 2 1 0 0
Shadow 0 0 0 0
Sludge Storage 0 0 0 0
Water 0 0 0 0
Dirty Water Grass and Crop Shadow Sludge Storage Water class.error
Bare Soil 0 0 1 0 0 0.1739130
Buildings and Rocks 0 2 0 0 0 0.1500000
Coniferous Forest 0 1 0 0 0 0.0500000
Decidious Forest 0 2 0 0 0 0.1000000
Dirty Water 18 0 0 1 1 0.1000000
Grass and Crop 0 23 0 0 0 0.1153846
Shadow 0 0 18 0 2 0.1000000
Sludge Storage 2 0 0 18 0 0.1000000
Water 1 0 1 0 18 0.1000000
Now we use our model for predicting classes on all researching area:
rf_Dirty_T36UWU20170331<- predict(T36UWU20170331, rf.dirty)
NAs introduced by coercion
We can visualise our result directly in R:
plot(rf_Dirty_T36UWU20170331, axes = TRUE, legend = TRUE)
For further work, we store the result of the classification in an external file. I use format of SAGA GIS for exporting georeferenced images:
writeRaster(rf_Dirty_T36UWU20170331, "/home/geka/Documents/DirtyWater/rf_Dirty_T36UWU20170331.sdat", format = "SAGA", overwrite = TRUE)
And we can visualise our result like a ggplot2 image. For this we use the ggr-function from RStoolbox package:
library(ggplot2)
Find out what's changed in ggplot2 at
http://github.com/tidyverse/ggplot2/releases.
Attaching package: ‘ggplot2’
The following object is masked from ‘package:randomForest’:
margin
# png("rf_Dirty_T36UWU20170331.png", width = 420,
# height = 380, units = "mm", res = 150)
ggR(rf_Dirty_T36UWU20170331, geom_raster = TRUE) +
scale_fill_brewer(palette = "Paired",
name = "Умовні класи\nландшафтного покриву:") +
labs(y = "північна широта\n",
x = "\nсхідна довгота",
title = "Класифікація типів ландшафтного покриву території Криворіжжя\n",
subtitle = "знімок космічного апарату Sentinel-2A від 31 березня 2017 року\nкласифікація: алгоритм «Random Forest» в середовищі мови програмування R\nсистема координат WGS 84/UTM, зона 36N (EPSG 32636)",
caption = "") +
theme(plot.title = element_text(hjust = 0.5),
text = element_text(family = "FreeSans",
face = "italic", size = 16, colour = "firebrick4", lineheight = 0.9),
legend.position = "bottom",
legend.text = element_text(family = "FreeSans",
face = "italic", size = 16, colour = "firebrick4", lineheight = 0.8, angle = 0),
axis.text.x = element_text(colour = "gray15", face="bold",
size= 14, angle = 0, vjust = 1, hjust = 0.5),
axis.text.y = element_text(colour = "gray15", face="bold",
size= 14, angle = 90, vjust = 1, hjust = 0.5),
axis.ticks = element_line(colour = "orange", size = 0.2),
plot.background = element_rect(fill = "gray95"),
panel.grid.major = element_line(colour = "orange", size = 0.2),
panel.grid.minor = element_line(colour = "gray90"),
panel.background = element_rect(fill="gray95"))
# dev.off()
And now we repeat all previous steps but for another satellite image:
#Download the clipped Sentinel-2A images (April 30, 2017)
T36UWU20170430_B2 <- raster("/home/geka/Documents/DirtyWater/T36UWU20170430_B2.sdat")
T36UWU20170430_B3 <- raster("/home/geka/Documents/DirtyWater/T36UWU20170430_B3.sdat")
T36UWU20170430_B4 <- raster("/home/geka/Documents/DirtyWater/T36UWU20170430_B4.sdat")
T36UWU20170430_B8 <- raster("/home/geka/Documents/DirtyWater/T36UWU20170430_B8.sdat")
T36UWU20170430_B12 <- raster("/home/geka/Documents/DirtyWater/T36UWU20170430_B12.sdat")
T36UWU20170430 <- stack(T36UWU20170430_B2, T36UWU20170430_B3,
T36UWU20170430_B4, T36UWU20170430_B8, T36UWU20170430_B12)
plotRGB(T36UWU20170430, r = 3,g = 2,b = 1,stretch = "lin")
names(T36UWU20170430) <- c("B2", "B3", "B4", "B8", "B12")
trainData2 <- shapefile("/home/geka/Documents/DirtyWater/LearningPoins_32636.shp")
train_var2 <- as.data.frame(extract(T36UWU20170430, trainData2))
trainData2@data <- data.frame(trainData2@data, train_var2[match(rownames(trainData2@data), rownames(train_var2)),])
tuneRF(x = trainData2@data[,3:ncol(trainData2@data)], y = as.factor(trainData2@data[,"Name"]))
mtry = 2 OOB error = 22.75%
Searching left ...
mtry = 1 OOB error = 23.28%
-0.02325581 0.05
Searching right ...
mtry = 4 OOB error = 22.22%
0.02325581 0.05
mtry OOBError
1.OOB 1 0.2328042
2.OOB 2 0.2275132
4.OOB 4 0.2222222
rf.dirty2 <- randomForest(x = trainData2@data[,3:ncol(trainData2@data)], y = as.factor(trainData2@data[,"Name"]), ntree = 500, mtry = 3, importance = TRUE, proximity = TRUE, keep.forest = TRUE)
rf.dirty2
Call:
randomForest(x = trainData2@data[, 3:ncol(trainData2@data)], y = as.factor(trainData2@data[, "Name"]), ntree = 500, mtry = 3, importance = TRUE, proximity = TRUE, keep.forest = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 20.63%
Confusion matrix:
Bare Soil Buildings and Rocks Coniferous Forest Decidious Forest
Bare Soil 20 1 0 1
Buildings and Rocks 2 16 0 0
Coniferous Forest 0 0 16 3
Decidious Forest 0 1 3 14
Dirty Water 0 0 0 0
Grass and Crop 0 0 0 1
Shadow 1 2 0 0
Sludge Storage 0 0 0 0
Water 0 0 0 0
Dirty Water Grass and Crop Shadow Sludge Storage Water class.error
Bare Soil 0 0 1 0 0 0.13043478
Buildings and Rocks 0 1 1 0 0 0.20000000
Coniferous Forest 0 0 0 0 1 0.20000000
Decidious Forest 0 2 0 0 0 0.30000000
Dirty Water 13 0 0 2 5 0.35000000
Grass and Crop 0 25 0 0 0 0.03846154
Shadow 0 0 15 0 2 0.25000000
Sludge Storage 3 0 0 17 0 0.15000000
Water 5 0 1 0 14 0.30000000
plot(rf.dirty2)
varImpPlot(rf.dirty2)
rf_Dirty_T36UWU20170430 <- predict(T36UWU20170430, rf.dirty2)
plot(rf_Dirty_T36UWU20170430, axes = TRUE, legend = TRUE)
writeRaster(rf_Dirty_T36UWU20170430, "/home/geka/Documents/DirtyWater/rf_Dirty_T36UWU20170430.sdat", format = "SAGA", overwrite = TRUE)
library(ggplot2)
# png("rf_Dirty_T36UWU20170430.png", width = 420,
# height = 380, units = "mm", res = 150)
ggR(rf_Dirty_T36UWU20170430, geom_raster = TRUE) +
scale_fill_brewer(palette = "Paired",
name = "Умовні класи\nландшафтного покриву:") +
labs(y = "північна широта\n",
x = "\nсхідна довгота",
title = "Класифікація типів ландшафтного покриву території Криворіжжя\n",
subtitle = "знімок космічного апарату Sentinel-2A від 30 квітня 2017 року\nкласифікація: алгоритм «Random Forest» в середовищі мови програмування R\nсистема координат WGS 84/UTM, зона 36N (EPSG 32636)",
caption = "") +
theme(plot.title = element_text(hjust = 0.5),
text = element_text(family = "FreeSans",
face = "italic", size = 16, colour = "firebrick4", lineheight = 0.9),
legend.position = "bottom",
legend.text = element_text(family = "FreeSans",
face = "italic", size = 16, colour = "firebrick4", lineheight = 0.8, angle = 0),
axis.text.x = element_text(colour = "gray15", face="bold",
size= 14, angle = 0, vjust = 1, hjust = 0.5),
axis.text.y = element_text(colour = "gray15", face="bold",
size= 14, angle = 90, vjust = 1, hjust = 0.5),
axis.ticks = element_line(colour = "orange", size = 0.2),
plot.background = element_rect(fill = "gray95"),
panel.grid.major = element_line(colour = "orange", size = 0.2),
panel.grid.minor = element_line(colour = "gray90"),
panel.background = element_rect(fill="gray95"))
# dev.off()
So. We get classified satellite images for two dates: March 31 and April 30. But if you look to second date images you can see more classification errors. Because we use one learning dataset for different natural conditions. The time interval between two satellite images is one month. For the climatic conditions of Ukraine, this is sufficient to change the reflectivity of the landscape cover. And also we can select another way. We can use our first model (rf.dirty) directly to predicting second satellite images (T36UWU20170430). But in this case we can get more yet classification errors. Because we use trained model for images with different reflection parameters.
rf_Dirty_T36UWU20170430_2 <- predict(T36UWU20170430, rf.dirty)
NAs introduced by coercion
plot(rf_Dirty_T36UWU20170430_2, axes = TRUE, legend = TRUE)
I am learning English. If you see grammatic or orphographic errors please tell me.