In this practice, one of the images from the previous practice will be classified.

1. Set the working environment

To start working in R, you first need to tell the program where to read and store files. This is the “working directory”. To set it up, pass the path of the folder where your files are located to the following function:

# Set the working directory ("wd") and store its path in an object (here, "w")
# Note: setwd() returns the previous directory, so retrieve the new path with getwd()
setwd(".\\your\\path")
w <- getwd()

# Check the location of the working directory
getwd()

2. Install packages

Functions are stored in packages, which first have to be installed and loaded. Install them using the following R commands or via the “Packages” tab in the fourth pane (normally located at the bottom right of the RStudio interface). Packages must be installed only once, whereas their libraries have to be loaded in every new R session.

# Install a package; no folder has to be specified when the library path is saved in the Global Options of R
install.packages("rgdal")

# Install multiple packages at the same time
install.packages(c("sp", "raster"))
# Load the packages
library(sp)
library(rgdal)
library(raster)

3. Create training data

In QGIS, create at least 10-15 polygons per land class. These polygons correspond to the training data that will be used for the supervised classification.

Figure 1. Creation of training data.

Figure 2. Attribute table of training data.

Afterwards, create validation data, using higher-resolution images as a guide.

Figure 3. Attribute table of validation data.

4. Classification of rasters

An important use of satellite images is the interpretation and classification of land features. On this basis, different analyses (e.g. of land use change) can be carried out and political decisions (e.g. in the frame of REDD+) can be made.

A supervised classification uses classes defined by the user, calculates statistics per class and classifies the image. Afterwards, the amount of error must also be validated.

Read raster data

s2_pre_cr <- stack("raster\\s2a_pre_cr.tif")

Read vector data with training samples

The training vector is a file (a multipolygon, point or line shapefile) that contains samples of the different classes, taken remotely by the scientist (for example, on the computer). In this example, we created a multipolygon shapefile with 4 classes that correspond to the land covers identifiable on the raster image (Figure 2):

  • 1 : urban
  • 2 : forest
  • 3 : agriculture
  • 4 : water

The validation samples will be used later on to validate the supervised classification. In the best scenario, the validation samples are taken in the field (ground truth data) and used to test the accuracy of the model trained on the training samples. If there is no possibility to take independent validation data, one could split the training data into 70% (training) and 30% (validation), as sketched after the import below. This has, of course, some statistical consequences.

# Import training samples
training <- readOGR(paste(w, "\\vector", sep=""), "training_data")

# Import validation samples
validation <- readOGR(paste(w, "\\vector", sep=""), "validation_data")
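
The 70/30 split mentioned above could be done directly in R when no independent validation data exist. A minimal sketch, using the "training" object just imported (the seed value is arbitrary, chosen here only for reproducibility):

# Sketch: split the training polygons into 70% training / 30% validation
n_poly <- nrow(training@data)              # number of training polygons
set.seed(42)                               # arbitrary seed for reproducibility
idx <- sample(n_poly, round(0.7 * n_poly)) # indices of the training subset
train_split <- training[idx, ]             # 70% of the polygons
valid_split <- training[-idx, ]            # remaining 30%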

Plot the training and validation data on top of the raster subset.

# Plot the raster with both vector files on top
# First, add the raster
plotRGB(s2_pre_cr, r=3, g=2, b=1, axes=TRUE, stretch="hist")

# Second, add the vectors
plot(training, col="yellow", add=TRUE)
plot(validation, col="red", add=TRUE)
Figure 5. Training and validation samples on Sentinel

Random Forest Algorithm

Random forest is an ensemble classifier that fits many decision trees to random subsets of the training data and combines their predictions.

# Install necessary packages
install.packages(c("maptools", "randomForest", "mgcv"))
# Load libraries
library(maptools)
library(randomForest)
library(mgcv)
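
As a minimal, self-contained illustration of the algorithm (using R's built-in iris dataset rather than the Sentinel data), a forest of 100 trees could be fitted like this:

# Hypothetical mini example: each of 100 trees is fitted to a bootstrap
# sample of the iris rows; printing shows the OOB error and confusion matrix
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 100)
print(rf_demo)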

Prepare input data for classification.

# Number of samples per land cover class
numsamps <- 100

# Name of the attribute that holds the integer land cover type identifier
attName <- "id"

# Name and path of the output GeoTiff image
outImage <- "raster\\classif_result.tif"

# Create a RasterBrick out of the RasterStack to speed up processing
satImage <- brick("raster\\s2a_pre_cr.tif")

# Using the vector "training":
# Extract the attribute table (the "data" slot) of the object "training"
allAtt <- slot(training, "data")
# Table with the land classes and the number of polygons in each
tabAtt <- table(allAtt[[attName]])
# Get the class identifiers as numbers (i.e. 1, 2, 3, 4)
uniqueAtt <- as.numeric(names(tabAtt))

Extract values out of the S2 rasters according to the random samples.

# Set 100 random points inside each polygon land cover class of the vector "training"
for (x in 1:length(uniqueAtt)) {
  class_data <- training[training[[attName]]==uniqueAtt[x],] # subset the polygons of the current class
  classpts <- spsample(class_data, type="random", n=numsamps) # draw numsamps random points inside these polygons
  if (x == 1) {
    xy <- classpts
  } else {
    xy <- rbind(xy, classpts) # xy is a SpatialPoints object that now holds the 100 points randomly sampled in each of the 4 classes
  }
}

# Extract and combine classes and band values
# Overlay the SpatialPoints on the Polygons
temp <- over(x=xy, y=training) 
# Transform into a factor of four levels
response <- factor(temp[[attName]]) 
# trainvals is a matrix of the response classes and the corresponding values extracted from each band
trainvals <- cbind(response, extract(satImage, xy)) 

The resulting xy dataset is a SpatialPoints object with 400 points (100 per class), and trainvals links the land class of each point to its band values.
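
A few quick sanity checks on these objects (a sketch, using only functions from the packages loaded above):

# Check the sampled points and the extracted training values
nrow(coordinates(xy))   # 400 points in total (100 per class)
table(response)         # number of samples per class (1, 2, 3, 4)
head(trainvals)         # class id plus one column per Sentinel band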

Run the classification algorithm. The model “randfor” captures the relationship between the land cover classes and their typical spectral reflectance characteristics.

# Print a text on your R console to have a visual estimate of the progress of the process
print("Starting to calculate random forest object")

# Run the model
randfor <- randomForest(as.factor(response) ~ .,
                        data=trainvals, 
                        na.action=na.omit,
                        confusion=T)

The random forest algorithm performs an internal validation, the “out-of-bag” (OOB) validation, which is comparable to cross-validation: the predictions of each tree are validated only against samples that were not part of that tree's bootstrap training sample.

The result of this validation is a confusion matrix and some error estimates. The predictions are in the columns and the observations in the rows. Remember the legend of the data:

  • 1 : urban
  • 2 : forest
  • 3 : agriculture
  • 4 : water
# Results of the RandomForest classifier with the confusion matrix
randfor
## 
## Call:
##  randomForest(formula = as.factor(response) ~ ., data = trainvals,      confusion = T, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 7.75%
## Confusion matrix:
##    1  2  3  4 class.error
## 1 78  0 22  0        0.22
## 2  0 99  0  1        0.01
## 3  7  0 93  0        0.07
## 4  0  1  0 99        0.01

The class errors are very low. Although the internal OOB validation is statistically correct, it is only trustworthy if the samples are independent. Samples taken from the same polygon are probably not independent.
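
Individual components of the fitted forest can also be accessed directly; a small sketch:

# Access the OOB confusion matrix and the band importance directly
randfor$confusion     # the confusion matrix printed above
importance(randfor)   # importance of each band for the classification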

The function predict uses the model to predict the land cover class of every pixel of the Sentinel image.

# Print a text on your R console to have a visual estimate of the progress of the process
print("Starting predictions")

# Use all band values of all pixels to generate a classification
predict_model <- predict(satImage, randfor, filename=outImage, progress="text", format="GTiff",
                         datatype="INT1U", type="response", overwrite=TRUE)

Plot the xy dataset on top of the Sentinel image stack. Compare it to the result raster of the supervised classification.

# Load supervised classification file
classi <- raster(paste(w,"\\raster", "\\classif_result.tif", sep=""))

# Create matrix for plotting rasters next to another
nf3 <- layout(matrix(c(1,2), 1, 2, byrow = TRUE))

# Plot rasters
# First, plot True Color Composite of Sentinel
plotRGB(s2_pre_cr, r=3, g=2, b=1, axes=TRUE, stretch="hist", main = "Training values of 'xy' dataset on Sentinel")
# Add vector "xy"
points(xy, col="red")

# Second, plot classification result
plot(classi, breaks = c(0, 1, 2, 3, 4), col = c("chocolate3", "darkolivegreen", "gold2", "blue2") , main = 'Supervised Classification')
Figure 6. True color composite of Sentinel RasterStack with xy samples on top. Results of the Supervised Classification

5. Statistics on Classified Rasters

It is possible to compute statistics from the resulting data.

Count

Use the raster “classi” to count the number of pixels in each class and transform this information into km^2.

# Count of pixels per class
classi_fr <- freq(classi, useNA="no")

classi_fr
##      value  count
## [1,]     1  81018
## [2,]     2 249308
## [3,]     3 151595
## [4,]     4   1712

Composition of classes after classification

After the classification it is possible to quantify every class to get an idea of the landscape composition. Given the number of pixels per class (classi_fr), query the area covered by forest, agriculture, urban and water, and relate it to the spatial resolution (10 m, i.e. 100 m^2 per pixel).

# Get resolution of classified raster
resclassi <- res(classi)

# Multiply the amount of pixels per class with the area
area_km2 <- classi_fr[, "count"] * prod(resclassi) * 1e-06
# Create a data frame with the resultant data
df_classi <- data.frame(landcover = (c("Urban", "Forest", "Agriculture", "Water")), area_km2 = area_km2)

# View results
df_classi
##     landcover area_km2
## 1       Urban   8.1018
## 2      Forest  24.9308
## 3 Agriculture  15.1595
## 4       Water   0.1712
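
As an optional sketch, the composition can be visualized as a bar plot (reusing the class colors from the classification plot above):

# Bar plot of the area per land cover class
barplot(df_classi$area_km2, names.arg = df_classi$landcover,
        ylab = "Area (km^2)",
        col = c("chocolate3", "darkolivegreen", "gold2", "blue2"))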

6. Accuracy Assessment

The previous algorithm validated itself internally: each tree was trained on a bootstrap sample of the training data and validated against the remaining out-of-bag samples. Since all these samples come from the same training polygons, this can explain the very low errors of the previous confusion matrix. Rerun the supervised classification with an independent validation dataset.

This time, the function superClass from the package RStoolbox will be used, which can also apply the random forest (“rf”) algorithm to the model.

install.packages("RStoolbox")
install.packages("e1071")
library(RStoolbox)
library(e1071)
# Run superClass (Supervised Classification) with the random forest (rf) algorithm, using 2 independent data sets
classi_ind <- superClass(satImage, model = "rf", trainData = training, responseCol = "class_name", 
                         valData = validation)
classi_ind
## superClass results
## ************ Validation **************
## $validation
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    agriculture forest urban water
##   agriculture         265      0     4     0
##   forest                0   1001     0     0
##   urban               235      1    94     0
##   water                 0      0     0    59
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8553          
##                  95% CI : (0.8375, 0.8719)
##     No Information Rate : 0.604           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7478          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: agriculture Class: forest Class: urban
## Sensitivity                      0.5300        0.9990      0.95918
## Specificity                      0.9965        1.0000      0.84881
## Pos Pred Value                   0.9851        1.0000      0.28485
## Neg Pred Value                   0.8309        0.9985      0.99699
## Prevalence                       0.3014        0.6040      0.05907
## Detection Rate                   0.1597        0.6034      0.05666
## Detection Prevalence             0.1621        0.6034      0.19892
## Balanced Accuracy                0.7633        0.9995      0.90400
##                      Class: water
## Sensitivity               1.00000
## Specificity               1.00000
## Pos Pred Value            1.00000
## Neg Pred Value            1.00000
## Prevalence                0.03556
## Detection Rate            0.03556
## Detection Prevalence      0.03556
## Balanced Accuracy         1.00000
## 
## *************** Map ******************
## $map
## class       : RasterLayer 
## dimensions  : 735, 1029, 756315  (nrow, ncol, ncell)
## resolution  : 10, 10  (x, y)
## extent      : 391960, 402250, 5387690, 5395040  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
## data source : in memory
## names       : class_name 
## values      : 1, 4  (min, max)
## attributes  :
##  ID       value
##   1 agriculture
##   2      forest
##   3       urban
##   4       water

The classification validated on independent data shows a different accuracy: ~0.86. In this confusion matrix, the predictions are in the rows and the observations (reference) in the columns. The main diagonal represents the number of samples that were classified correctly according to the reference data. Everything outside the main diagonal represents misclassified samples.

Forest and Water seem to be classified without problems, while Urban and Agriculture are not clearly separated. Many Agriculture pixels were classified as Urban, which means that both classes have a similar reflectance or that the polygons were wrongly drawn. Also, depending on the time of the year, some agricultural plots show a bare-soil reflectance, which can be similar to the one found in urban areas. With these results, the classification should be repeated in order to improve the accuracy.
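
The classified map returned by superClass can also be plotted for a visual check (a sketch; note that in this result class 1 is agriculture, as the attribute table in the output above shows):

# Plot the map returned by superClass
# (class order here: 1 agriculture, 2 forest, 3 urban, 4 water)
plot(classi_ind$map, col = c("gold2", "darkolivegreen", "chocolate3", "blue2"),
     main = "Supervised classification with independent validation")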

# Check the validation results of the previous classification
classi_ind$validation$performance
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    agriculture forest urban water
##   agriculture         265      0     4     0
##   forest                0   1001     0     0
##   urban               235      1    94     0
##   water                 0      0     0    59
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8553          
##                  95% CI : (0.8375, 0.8719)
##     No Information Rate : 0.604           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7478          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: agriculture Class: forest Class: urban
## Sensitivity                      0.5300        0.9990      0.95918
## Specificity                      0.9965        1.0000      0.84881
## Pos Pred Value                   0.9851        1.0000      0.28485
## Neg Pred Value                   0.8309        0.9985      0.99699
## Prevalence                       0.3014        0.6040      0.05907
## Detection Rate                   0.1597        0.6034      0.05666
## Detection Prevalence             0.1621        0.6034      0.19892
## Balanced Accuracy                0.7633        0.9995      0.90400
##                      Class: water
## Sensitivity               1.00000
## Specificity               1.00000
## Pos Pred Value            1.00000
## Neg Pred Value            1.00000
## Prevalence                0.03556
## Detection Rate            0.03556
## Detection Prevalence      0.03556
## Balanced Accuracy         1.00000

The accuracy assessment assesses the performance of a classification. It also helps to interpret the usefulness of someone else's classification.
There are different types of accuracy: the overall accuracy, the producer's accuracy and the user's accuracy.

The results of the previous model, classi_ind$validation$performance, show all these types of accuracy values (but with a different vocabulary). The correspondence is: “Accuracy” is the overall accuracy, “Sensitivity” is the producer's accuracy and “Pos Pred Value” is the user's accuracy.
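
As a worked check, these values can be recomputed by hand for the agriculture class of the confusion matrix above:

# Producer's accuracy (sensitivity): correctly predicted agriculture pixels
# divided by all agriculture reference pixels (column total)
265 / (265 + 235)   # = 0.53, matching "Sensitivity"
# User's accuracy (positive prediction value): correctly predicted agriculture
# pixels divided by all pixels predicted as agriculture (row total)
265 / (265 + 4)     # ~ 0.985, matching "Pos Pred Value"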

The Kappa statistic reflects the difference between the actual agreement and the agreement expected by chance. A Kappa of 0.75 means that the agreement is 75% better than would be expected by chance alone.
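
Kappa can likewise be recomputed from the confusion matrix above (a short sketch; all totals are taken from the validation output):

# kappa = (p_o - p_e) / (1 - p_e)
n    <- 1659                        # total number of validation pixels
p_o  <- (265 + 1001 + 94 + 59) / n  # observed agreement (diagonal), ~0.855
rows <- c(269, 1001, 330, 59)       # prediction (row) totals
cols <- c(500, 1002, 98, 59)        # reference (column) totals
p_e  <- sum(rows * cols) / n^2      # agreement expected by chance, ~0.426
(p_o - p_e) / (1 - p_e)             # ~0.748, matching the Kappa above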