Overview

Objectives

In this practice, one of the images from last practice will be classified.

Data

For this practice, use the following files:

  • Raster: s2a_post.tif

Procedure

1. Install and/or load necessary packages

As for every new session of R, you should set your working directory and load the necessary libraries:

# Install and load the necessary packages 
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE, repos = "http://cran.us.r-project.org")
    sapply(pkg, require, character.only = TRUE)
}

packages <- c("sp", "rgdal", "raster", "randomForest")
ipak(packages)

2. Load rasters calculated in the previous section: “Flood hazard mapping. Part I”

To start working in R, you need first to indicate the program where to store the information. This is the “Working Directory”. To set up the working directory, add the path of the folder where your files are located in the following function:

# Set the working directory ("wd") and assign it to an object (here, "w")
w <- setwd(".\\your\\path")

3. Create training data

In QGIS, create at least 10-15 polygons per land class. These polygons correspond to the training data that will be used for the supervised classfication.

Use the Sentinel-2 RasterStack named s2a_post_cr. Make a false color band composition to highlight different land class features. For example (not the only one): https://www.sentinel-hub.com/eoproducts/false-color

Figure 1. Creation of Shapefile in QGIS data.

Figure 2. Atribute table of training data.

4. Classification of rasters

An important use of satellite images is the interpretation and classification of the land features. This way, different analyses (e.g. land use change) or political decisions (e.g. in the frame of REDD+) can be made.

A supervised classification uses classes defined by the user, calculates statistics per class and classifies the image. A validation of the amount of error is afterwards also necessary.

Read raster data

s2_post <- stack("raster\\s2a_post_cr.tif")

Read vector data with training samples

The training vector is a file (multipolygon, point or line - shapefile) that has all the samples of the diverse classes taken by the scientist in a remote way (for example, on the computer). In this example, we created a multipolygon shapefile with 4 classes that corresponds to the land covers identifiable on the raster image (Figure 2):

  • 1 : urban
  • 2 : sediment
  • 3 : agriculture
  • 4 : water

The validation samples are going to be used later on to validate the supervised classification. In the best scenario, the validation samples are taken on the field (ground truth data) and these are used to test the accuracy of the training samples. If there is no possibility to take independent validation data, one could split the training data into 70% (training) and 30% (validation). This has of course, some statistical consequences.

# Import training samples
training <- readOGR(paste(w, "\\vector", sep=""), "training_data")

Plot the training and validation data on top of the raster subset.

# Plot the raster with both vector files on top
# First, add the raster
plotRGB(s2_post, r=3, g=2, b=1, axes=TRUE, stretch="hist")

# Second, add the vectors
plot(training, col="yellow", add=TRUE)
Figure 3. Traning samples on Sentinel

Figure 3. Traning samples on Sentinel

Random Forest Algorithm

The RandomForest is a classifier that fits many decision trees to changing subsets of the training data.

#Install necessary packages
install.packages(c("maptools", "randomForest", "mgcv"))
# Load libraries
library(maptools)
library(randomForest)
library(mgcv)

Prepare input data for classification.

# Number of samples per land cover class
numsamps <- 100

# Name of the attribute that holds the integer land cover type identifyer
attName <- "id"

# Name and path of the output GeoTiff image
outImage <- "raster\\classif_result.tif"

# Create a RasterBrick out of the RasterStack to speed up processing
satImage <- brick("raster\\s2a_post_cr.tif")

# Using the vector "training":
# Assign the name "allAtt" to the data of the object "training" (polygon)
allAtt <- slot(training, "data") 
# Table with the land classes and the number of samples in each
tabAtt <- table(allAtt[[attName]]) 
# Returns the numbers of the names of the data values (i.e. 1, 2, 3, 4)
uniqueAtt <- as.numeric(names(tabAtt)) 

Extract values out of the S2 rasters according the random samples.

# Set 100 random points inside each polygon landcover class of the vector "training"
for (x in 1:length(uniqueAtt)) {
  class_data <- training[training[[attName]]==uniqueAtt[x],] # take the unique values of the data
  classpts <- spsample(class_data, type="random", n=numsamps) # sample the 100 sample points on the polygons
  if (x == 1) {
    xy <- classpts
  } else {
    xy <- rbind(xy, classpts) # xy is a SpatialPoints object that has now all the 100 points randomly sampled on each of the 4 clases of polygons according to the data
  }
}

# Extract and combine classes and band values
# Overlay the SpatialPoints on the Polygons
temp <- over(x=xy, y=training) 
# Transform into a factor of four levels
response <- factor(temp[[attName]]) 
# Trainvals is a matrix of the response points and the correspondent value extracted from each band
trainvals <- cbind(response, extract(satImage, xy)) 

The resulting xy dataset is a SpatialPointsDataFrame of 400 points with the land classes.

Run the classification algorithm. The model “randfor” indicates the relationship between the landcover classes and their typical reflectance and temperature characteristics.

# Print a text on your R console to have a visual estimate of the progress of the process
print("Starting to calculate random forest object")

# Run the model
randfor <- randomForest(as.factor(response) ~. ,
                        data=trainvals, 
                        na.action=na.omit,
                        confusion=T)

The RandomForest algorithm performs an internal validation, the “out-of-the-bag” (OOB) validation, which is comparable to cross-validation in that it validates the predictions of each tree only against samples which were not part of the training data.

The result of this validation is a confusion matrix and some error estimates. The predictions are in the columns and the observations in the rows. Remember the legend of the data:

  • 1 : urban
  • 2 : sediment
  • 3 : agriculture
  • 4 : water
# Results of the RandomForest classifier with the confusion matrix
randfor
## 
## Call:
##  randomForest(formula = as.factor(response) ~ ., data = trainvals,      confusion = T, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 1.75%
## Confusion matrix:
##    1  2  3   4 class.error
## 1 99  0  1   0        0.01
## 2  0 99  1   0        0.01
## 3  4  1 95   0        0.05
## 4  0  0  0 100        0.00

The class error is very low. Although the internal OOB validation is statistically correct, it is only trustworthy if the samples are independent. Taking samples from the same polygon are probably generating not independent samples.

The function predict uses the model to predict the classified landcover classes for the whole Sentinel image.

# Print a text on your R console to have a visual estimate of the progress of the process
print("Starting predictions")

# Use all band values of all pixels to generate a classification
predict_model <- predict(satImage, randfor, filename=outImage, progress="text", format="GTiff",
                         datatype="INT1U", type="response", overwrite=TRUE)
# Load supervised classification file
classi <- raster(paste(w,"\\raster", "\\classif_result.tif", sep=""))

# Create matrix for plotting rasters next to another
nf3 <- layout(matrix(c(1,2), 1, 2, byrow = TRUE))

# Plot rasters
# First, plot True Color Composite of Sentinel
plotRGB(s2_post, r=3, g=2, b=1, axes=TRUE, stretch="hist", main = "Training values of 'xy' dataset on Sentinel")
# Add vector "xy"
points(xy, col="red", add=TRUE)

# Second, plot classification result
plot(classi, breaks = c(0, 1, 2, 3, 4), col = c("chocolate3", "gold2", "darkolivegreen", "blue2") , main = 'Supervised Classification')
Figure 4. True color composite of Sentinel RasterStack with xy samples on top. Results of the Supervised Classification

Figure 4. True color composite of Sentinel RasterStack with xy samples on top. Results of the Supervised Classification

5. Statistics on Classified Rasters

It is possible to make statistics with the outcoming data.

Count

Use the raster “classi” to count the number of pixel in each class and transform this information to Km^2.

# Count of pixels per class
classi_fr <- freq(classi, useNA="no")

classi_fr
##      value    count
## [1,]     1  7640680
## [2,]     2   437395
## [3,]     3 17260521
## [4,]     4  1021846
Composition of classes after classification

After the classification it is possible to quantify every class to have an idea of the landscape. Given the amount of pixels per class, ‘class.freq’, query the area covered by forest, agriculture, urban and water, and relate it to the spatial resolution (10m2).

# Get resolution of classified raster
resclassi <- res(classi)

# Multiply the amount of pixels per class with the area
area_km2 <- classi_fr[, "count"] * prod(resclassi) * 1e-06
# Create a data frame with the resultant data
df_classi <- data.frame(landcover = (c("Urban", "Sediment", "Agriculture", "Water")), area_km2 = area_km2)

# View results
df_classi
##     landcover  area_km2
## 1       Urban  764.0680
## 2    Sediment   43.7395
## 3 Agriculture 1726.0521
## 4       Water  102.1846