chapter9-accuracy assessment

Part I

We have already seen how to conduct landcover classification in R. Now lets see how to conduct accuracy assessment in R.

#call the workspace
load('E:/Documents/R Studio is here/R Studio files projects/gis_coding/.RData')

#call the requisite packages
library(raster)

## Warning: package 'raster' was built under R version 4.1.3

## Loading required package: sp

library(rgdal)

## Please note that rgdal will be retired by the end of 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## 
## rgdal: version: 1.5-28, (SVN revision 1158)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.2.1, released 2020/12/29
## Path to GDAL shared files: C:/Users/User/Documents/R/win-library/4.1/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 7.2.1, January 1st, 2021, [PJ_VERSION: 721]
## Path to PROJ shared files: C:/Users/User/Documents/R/win-library/4.1/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.4-6
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
## Overwritten PROJ_LIB was C:/Users/User/Documents/R/win-library/4.1/rgdal/proj

library(rgeos)

## rgeos version: 0.5-9, (SVN revision 684)
##  GEOS runtime version: 3.9.1-CAPI-1.14.2 
##  Please note that rgeos will be retired by the end of 2023,
## plan transition to sf functions using GEOS at your earliest convenience.
##  GEOS using OverlayNG
##  Linking to sp version: 1.4-6 
##  Polygon checking: TRUE

library(sf)

## Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1; sf_use_s2() is TRUE

library(sp)
library(randomForest)

## Warning: package 'randomForest' was built under R version 4.1.3

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

#save classification 2016 image to directory
writeRaster(classification2016, filename = 'r.classification2016.tif', format = 'GTiff', overwrite = T)

#save classification 2019 image to directory
writeRaster(classification2019, filename = 'r.classification2019.tif', format = 'GTiff', overwrite = T)

#call the classified 2016 and 2019 images
r_class2016 <- raster('r.classification2016.tif')
r_class2019 <- raster('r.classification2019.tif')

#collect the sample points of 5 from each class for 2016 classification .tif file
smp.test2016 <- sampleStratified(x = classification2016, size = 5, na.rm = T, sp = T)

#check the class labels of our extracted points
smp.test2016$classmodel2016

##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6

There are six landcover classes just like in our classification2016 image, and each class has been sampled 5 times. Good to go.

After this, we will mix the order of the samples so that they are random. This is because we will be manually labelling the samples in Qgis later on.

#make the order of samples completely random
smp.test2016 <- smp.test2016[sample(nrow(smp.test2016)), ]

#check if the ordering is now random
smp.test2016$classmodel2016

##  [1] 3 2 3 1 1 5 4 6 2 2 6 4 4 3 4 2 5 1 5 1 2 6 5 6 5 1 3 3 4 6

We will delete all the variables in our smp.test2016 and append an ID variable in their place called ID, which will then help us in labelling the classes in Qgis.

#delete the only columns available - cell and classmodel2016
smp.test2016 <- smp.test2016[, -c(1, 2)]
smp.test2016$ID <- 1:nrow(smp.test2016)

#visualize the sample points on top of our classification 2016 image
#in the order of Agriculture    Built-up    Bushland      Forest        Road       Water 
plot(classification2016, main = 'Classification 2016', col = c('yellow', 'gray85', 'darkolivegreen1', 'darkgreen', 'gray1', 'darkblue'))
#plot the smp.test 2016 samples file
points(smp.test2016)

Can you see some black circles somewhere overlying your image? If they are present, you can breathe a sigh of relief this is past. For example, in our image above, there are 8 or so black circles at the bottom right.

#save the point sample file to your directory
writeOGR(smp.test2016, dsn = '.', layer = 'validate2016', driver =  'ESRI Shapefile', overwrite = T)

You will use the validate2016.shp to manually collect the landclasses on the 2016 Sentinel-2 true-color, or falsecolor image (whichever you will use) in Qgis. Don’t worry, this has already been done for you.

The file with the labelled landclasses is accessible from here: - https://www.dropbox.com/sh/5611ivwki2n5eqw/AAAWWQPguPg0v0fsh3bzhm85a?dl=0

The land class codes are in the valid2016 field.

Part II - Accuracy assessment

#call the validation shapefile
valid2016 <- readOGR(dsn = './valid2016/valid2016.shp')

## OGR data source with driver: ESRI Shapefile 
## Source: "E:\Documents\R Studio is here\R Studio files projects\gis_coding\valid2016\valid2016.shp", layer: "valid2016"
## with 30 features
## It has 2 fields
## Integer64 fields read as strings:  valid2016

#read the first six rows of valid2016 to see if the landclass codes are in the attributes.
head(valid2016@data)

##   ID valid2016
## 1  1         2
## 2  2         3
## 3  3         6
## 4  4         3
## 5  5         2
## 6  6         6

In the dataframe above, the land class codes are in the valid2016 field. Remember, the landclass codes are given in alphabetical order. To clear the confusion, remember in Chapter 8 while assigning a macroclass to the training file sample2016 we used as.factor() function which factorized the names in MC_name column, thereby assigning them numerical digits in alphabetical order.

#sample2016$MCname <- as.factor(train2016$MC_name[match(sample2016$ID, seq(nrow(train2016)))])
#summary(sample2016$MCname)
#1-Agriculture    2-Built-up    3-Bushland    4-Forest    5-Road       6-Water 
#        26404       16728       30567       25179        1284       42045

Therefore in ID1 in the valid2016 data above, the corresponding landclass code is 2, which stands for built-up, ID2 - bushland, ID3 - water and so on.

Two vectors will be needed for the confusion matrix. They are:

reference : class labels you assigned manually in the previous section

predicted : class labels that resulted in the automatic RF (or SVM) classification (r_class2016.tif)

#transform the valid2016 codes to factors
reference2016 <- as.factor(valid2016$valid2016)#the valid2016 after $ symbol is the field containing the landclass codes
reference2016

##  [1] 2 3 6 3 2 6 6 3 3 6 2 3 5 2 4 1 6 5 2 1 1 2 3 3 2 3 3 5 1 3
## Levels: 1 2 3 4 5 6

For the prediction vector, we will create one whereby the landclass values from the 2016 image - r_class2016 will be extracted. These extracted values will form the data for the prediction vector.

#create the prediction vector for the 2016 image
predicted2016 <- as.factor(extract(r_class2016, valid2016))
predicted2016

##  [1] 2 2 6 2 3 6 6 3 4 6 1 3 5 5 4 1 6 3 2 1 1 5 4 3 2 4 4 5 4 1
## Levels: 1 2 3 4 5 6

We will use both vectors to create an accuracy matrix for the 2016 image.

#create accuracy matrix
matrix_2016 <- table('pred' = predicted2016, 'ref' = reference2016)
matrix_2016

##     ref
## pred 1 2 3 4 5 6
##    1 3 1 1 0 0 0
##    2 0 3 2 0 0 0
##    3 0 1 3 0 1 0
##    4 1 0 4 1 0 0
##    5 0 2 0 0 2 0
##    6 0 0 0 0 0 5

In the accuracy matrix above, only classes labelled on the diagonal line are correct, the rest are misclassified.

#calculate the user's accuracy(UA), producer's accuracy(PA) and overall accuracy (OA)
UA2016 <- diag(matrix_2016) / rowSums(matrix_2016) * 100
UA2016

##         1         2         3         4         5         6 
##  60.00000  60.00000  60.00000  16.66667  50.00000 100.00000

PA2016 <- diag(matrix_2016) / colSums(matrix_2016) * 100
PA2016

##         1         2         3         4         5         6 
##  75.00000  42.85714  30.00000 100.00000  66.66667 100.00000

OA2016 <- sum(diag(matrix_2016)) / sum(matrix_2016) * 100
OA2016

## [1] 56.66667

#user accuracy
UA2016

##         1         2         3         4         5         6 
##  60.00000  60.00000  60.00000  16.66667  50.00000 100.00000

#producer accuracy
PA2016

##         1         2         3         4         5         6 
##  75.00000  42.85714  30.00000 100.00000  66.66667 100.00000

#overall accuracy
OA2016

## [1] 56.66667

The overall accuracy (OA) is a meagre 56 points. This is possibly due to the low number of sample points generated, and that some classes had an insufficient number of sample points. eg. if you check the valid2016 shapefile and matrix_2016 table, the forest class only had one point.

Next, we will create a confusion matrix, following the steps done by the blog here: - https://blogs.fu-berlin.de/reseda/accuracy-statistics-in-r/

#create the confusion matrix
matrix2016_conf <- addmargins(matrix_2016)
matrix2016_conf <- rbind(matrix2016_conf, 'Users' = c(PA2016, NA))
matrix2016_conf <- cbind(matrix2016_conf, 'Producers' = c(UA2016, NA, OA2016))
colnames(matrix2016_conf) <- c(levels(as.factor(train2016$MC_name)), 'Sum', 'PA')
rownames(matrix2016_conf) <- c(levels(as.factor(train2016$MC_name)), 'Sum', 'UA')
matrix2016_conf <- round(matrix2016_conf, digits = 1)
dimnames(matrix2016_conf) <- list('Prediciton' = colnames(matrix2016_conf), 
                                  'Reference' = rownames(matrix2016_conf))

class(matrix2016_conf) <- 'table'
matrix2016_conf

##              Reference
## Prediciton    Agriculture Built-up Bushland Forest  Road Water   Sum    UA
##   Agriculture         3.0      1.0      1.0    0.0   0.0   0.0   5.0  60.0
##   Built-up            0.0      3.0      2.0    0.0   0.0   0.0   5.0  60.0
##   Bushland            0.0      1.0      3.0    0.0   1.0   0.0   5.0  60.0
##   Forest              1.0      0.0      4.0    1.0   0.0   0.0   6.0  16.7
##   Road                0.0      2.0      0.0    0.0   2.0   0.0   4.0  50.0
##   Water               0.0      0.0      0.0    0.0   0.0   5.0   5.0 100.0
##   Sum                 4.0      7.0     10.0    1.0   3.0   5.0  30.0      
##   PA                 75.0     42.9     30.0  100.0  66.7 100.0        56.7

Congrats for creating a confusion matrix using code only! It was our first time too.

Already tired, don’t worry, one more statistic to go!

The Kappa coefficient is used to calculate the overall accuracy of a classification. Some points to note for this coefficient are:

The Kappa coefficient can range from -1 to 1. A value of 0 indicates that the classification is as good as random values. A value below 0 indicates the classification is significantly worse than random. A value greater than 0 indicates that the classification is significantly better than random.

Adapted from: - https://blogs.fu-berlin.de/reseda/accuracy-statistics-in-r/

#create a function to calculate the kappa coefficient

kappa <- function(m) {
  N <- sum(m)
  No <- sum(diag(m))
  Ne <- 1 / N * sum(colSums(m) * rowSums(m))
  return((No - Ne) / (N - Ne))
}

Use the kappa coefficient function created above to calculate the kappa for matrix2016_conf.

#calculate kappa for our 2016 confusion matrix
kappa2016 <- kappa(matrix2016_conf)
kappa2016

## [1] NA

Well, that’s embarassing. There seems to be a mistake somewhere, but that is the workflow.

Now imagine we have to repeat the process for the 2019 image.

Think about it. Which saves more time, R or Qgis? Which offers the possibility of automation, R or Qgis?

This tutorial has been made possible by standing on the shoulders of giants who did all the heavy lifting prior to this. Also remember, all code without play makes code a very boring game.

chapter9-accuracy assessment

samuel gachuhi

6/4/2022