chapter-8-

Part I

Let’s load our previous workspace to avoid messages of R not recognizing some files.

#load previous workspace
load('E:/Documents/R Studio is here/R Studio files projects/gis_coding/.RData')

We will proceed and do landcover classification for the two satellite images - stack2016 and stack2019. The digits in their names stand for their respective years.

The landcover classification images for the respective years were created using Qgis Semi-automatic Classification Plugins (SCP) and the inherent Maximum Likelihood Classification algorithm. The landcover classification images can be downloaded from here: Classification 2019 - https://www.dropbox.com/sh/qo4gfpnq92pfd52/AABB_DkW7ZerrzzrY2SSP-QGa?dl=0 Classification 2016 - https://www.dropbox.com/sh/xl4lyxiy6b8rc9n/AABJ8njsa-3vqtQIL2_y0KQKa?dl=0

Let’s load the requisite R packages

#load required packages
library(raster)

## Warning: package 'raster' was built under R version 4.1.3

## Loading required package: sp

library(rgeos)

## rgeos version: 0.5-9, (SVN revision 684)
##  GEOS runtime version: 3.9.1-CAPI-1.14.2 
##  Please note that rgeos will be retired by the end of 2023,
## plan transition to sf functions using GEOS at your earliest convenience.
##  GEOS using OverlayNG
##  Linking to sp version: 1.4-6 
##  Polygon checking: TRUE

library(rgdal)

## Please note that rgdal will be retired by the end of 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## 
## rgdal: version: 1.5-28, (SVN revision 1158)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.2.1, released 2020/12/29
## Path to GDAL shared files: C:/Users/User/Documents/R/win-library/4.1/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 7.2.1, January 1st, 2021, [PJ_VERSION: 721]
## Path to PROJ shared files: C:/Users/User/Documents/R/win-library/4.1/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.4-6
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
## Overwritten PROJ_LIB was C:/Users/User/Documents/R/win-library/4.1/rgdal/proj

library(sf)

## Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1; sf_use_s2() is TRUE

library(sp)

Let’s extract the values per each training sample polygon. The training sample is available from here: https://www.dropbox.com/sh/03t89ct1yy9p880/AAC2nt9jUGw8Sof-6QkrldcTa?dl=0

Use the file train2016.shp.

We will use the train2016.shp to extract the pixel values from our stack2016 and thereafter classify the image.

#call the train2016.shp into R
train2016 <- readOGR(dsn = './sentinel-2-data/training_files_2016/train2016.shp')

## OGR data source with driver: ESRI Shapefile 
## Source: "E:\Documents\R Studio is here\R Studio files projects\gis_coding\sentinel-2-data\training_files_2016\train2016.shp", layer: "train2016"
## with 19 features
## It has 6 fields
## Integer64 fields read as strings:  MC_ID C_ID

#read first six attributes
head(train2016@data)

##   fid MC_ID  MC_name C_ID           C_name                   SCP_UID
## 0  NA     1    Water    2      Lake Nakuru 20220528_200356827219_612
## 1  NA     1    Water    3     Lake Nakuru3 20220528_200531967835_502
## 2  NA     2 Built-up    1           Houses 20220528_200710952334_520
## 3  NA     2 Built-up    2          Houses2 20220528_200930948826_647
## 4  NA     2 Built-up    3 Houses and roads 20220528_201116490714_641
## 5  NA     3 Bushland    1         Bushland 20220528_201327473931_941

#extract values from stack2016 image
sample2016 <- extract(stack2016, train2016, df = T)

Check it out. Each row contains a unique value collected from each of the 13 bands. How will we work with such kind of data that seems like a cryptic log from a different civilization?

#assign macroclass name to the training file - sample2016
#note that as.factor() will transform the MCnames to factorial but in alphabetical order
sample2016$MCname <- as.factor(train2016$MC_name[match(sample2016$ID, seq(nrow(train2016)))])

#assign macroclass ID to the training file - sample 2016
sample2016$MCid <- train2016$MC_ID[match(sample2016$ID, seq(nrow(train2016)))]

#Find out pixels sampled per landcover class
summary(sample2016$MCname)

## Agriculture    Built-up    Bushland      Forest        Road       Water 
##       26404       16728       30567       25179        1284       42045

Because some classes have more of their pixels sampled compared to others, as in the case of water vis a vis road, ‘their tyranny of numbers’ will highly influence the result. As such we need to find a way in which there will be equality in the classification process. For this case, we will take the lowest number of pixels sampled across the 6 classes - that is 1284 and belonging to road class. This number of pixels is what will be used from all the other classes as well during classification, irregardless if they had more than this collected by the training file.

#install the randomForest package
library(randomForest)

## Warning: package 'randomForest' was built under R version 4.1.3

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

Next, we will form a vector where the length corresponds to the number of target classes. We will use this vector to tell the classifier how many samples it should randomly draw per class for each decision tree during training. This very much simplifed explanation of the method was extracted from here: https://blogs.fu-berlin.de/reseda/rf-classification/

#specify the number of classes that should be samples from each class
sample2016.size <- rep(min(summary(sample2016$MCname)), nlevels(sample2016$MCname))
sample2016.size

## [1] 1284 1284 1284 1284 1284 1284

The number of sampled pixels for each class is equal to that for Road. We are good to go.

Before we proceed, since the random forest function will calculate across all columns, it’s time we delete the unnecessary columns of our sample2016 dataframe.

#remove columns
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following objects are masked from 'package:rgeos':
## 
##     intersect, setdiff, union

## The following objects are masked from 'package:raster':
## 
##     intersect, select, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#remove columns
sample2016 <- select(sample2016, -1)

We will use tuneRF() to classify stack 2016. For more information on Random Forest, see: - https://blogs.fu-berlin.de/reseda/random-forest/

In a nutshell, randomForest uses classification trees to identify each pixel to the class it is most probable of falling into. You can say it uses several iterations to make a very well educated guess.

#training the classification model
rfmodel2016 <- tuneRF(x = sample2016[, 1:13], #specify columns for training samples
                      y = sample2016[, 14],   #factor for classification
                      sampsize = sample2016.size, #how many samples to draw per class
                      strata = sample2016[, 14], #column to be used for stratified sampling
                      ntree = 500, 
                      importance = T,  #subsequent assessment of the variable importance (if True)
                      doBest = T)

## mtry = 3  OOB error = 0.91% 
## Searching left ...
## mtry = 2     OOB error = 1.03% 
## -0.1406371 0.05 
## Searching right ...
## mtry = 6     OOB error = 0.78% 
## 0.1421911 0.05 
## mtry = 12    OOB error = 0.7% 
## 0.0923913 0.05 
## mtry = 13    OOB error = 0.73% 
## -0.03393214 0.05

#check out the model values
rfmodel2016

## 
## Call:
##  randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1],      strata = ..2, sampsize = ..1, importance = ..3) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 0.71%
## Confusion matrix:
##             Agriculture Built-up Bushland Forest Road Water  class.error
## Agriculture       25929        0      364    111    0     0 0.0179896985
## Built-up              0    16719        0      7    2     0 0.0005380201
## Bushland            203        0    30336      4   24     0 0.0075571695
## Forest              292        0        7  24880    0     0 0.0118749752
## Road                  0        0        0      0 1284     0 0.0000000000
## Water                 0        0        0      0    0 42045 0.0000000000

Let’s save this randomForest model to our directory and free up some mental space.

#save the rfmodel
save(rfmodel2016, file = 'rfmodel2016.RData')

Let’s use the rfmodel2016 to predict the landcover classes of stack2016.

#predict landcover classes for stack2016
classification2016 <- predict(stack2016, rfmodel2016, filename = 'classmodel2016.tif', 
                              overwrite = T)

Let’s view our classification image for 2016. Here comes something….

#plot 2016 classification image and assign colors
#in the order of Agriculture    Built-up    Bushland      Forest        Road       Water 
plot(classification2016, main = 'Classification 2016', col = c('yellow', 'gray85', 'darkolivegreen1', 'darkgreen', 'gray1', 'darkblue'))

Contrast the rfmodel with this from Qgis SCP plugin.

#call qgis classification for 2016
qgis2016 <- raster('./sentinel-2-data/classification_2016/classification_2016.tif')
#note ordering of classes in qgis was unclassified, water, built-up, bushland, forest, agriculture, road
plot(qgis2016, main = 'Qgis 2016', col = c('black', 'darkblue', 'gray85', 'darkolivegreen1', 
                                           'darkgreen', 'darkgoldenrod1', 'gray48'))

Landcover classification in R was definitely not easy. However, comparing this workflow with the somewhat straighforward one of Qgis SCP GUI, which would you honestly go for? Which produces better results?

Before you proceed, take a coffee break.

Part II

Using the same process as above, we will create a landcover classification image using stack2019. We will use the same workflow as above, but still using the train2016 data rather than creating a new training file for the 2019 image. Since inevitably some landcover change has occured in the four-year period, using the very same training file as for the start-year is always a good quality checker of the intelligentsia of our remote sensing tool, as well as saving time (and effort).

#extract stack 2019 pixels covered by train2016 polygons
sample2019 <- extract(stack2019, train2016, df = T)

#assign columns that will contain classnames and class ids
sample2019$MCname <- as.factor(train2016$MC_name[match(sample2019$ID, seq(nrow(train2016)))])
sample2019$MCid <- train2016$MC_ID[match(sample2019$ID, seq(nrow(train2016)))]

#find out total pixels per landcover class
summary(sample2019$MCname)

## Agriculture    Built-up    Bushland      Forest        Road       Water 
##       26404       16728       30567       25179        1284       42045

#specify the number of classes that should be samples from each class
sample2019.size <- rep(min(summary(sample2019$MCname)), nlevels(sample2019$MCname))
sample2019.size

## [1] 1284 1284 1284 1284 1284 1284

#remove the first column called ID as it is unneeded
sample2019 <- select(sample2019, -1)

Let’s train the random forest classification model. Fingers crossed.

#train the model using the 2019 image

rfmodel2019 <- tuneRF(x = sample2019[, 1:13], #specify columns for training samples
                      y = sample2019[, 14],   #factor for classification
                      sampsize = sample2019.size, #how many samples to draw per class
                      strata = sample2019[, 14], #column to be used for stratified sampling
                      ntree = 500, 
                      importance = T,  #subsequent assessment of the variable importance (if True)
                      doBest = T)

## mtry = 3  OOB error = 0.99% 
## Searching left ...
## mtry = 2     OOB error = 1.12% 
## -0.1304965 0.05 
## Searching right ...
## mtry = 6     OOB error = 0.88% 
## 0.1156028 0.05 
## mtry = 12    OOB error = 0.89% 
## -0.01122694 0.05

#check out the model values
rfmodel2019

## 
## Call:
##  randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1],      strata = ..2, sampsize = ..1, importance = ..3) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 0.87%
## Confusion matrix:
##             Agriculture Built-up Bushland Forest Road Water  class.error
## Agriculture       25655        0      442    158  149     0 0.0283669141
## Built-up              0    16721        0      7    0     0 0.0004184601
## Bushland            294        6    30218      0   49     0 0.0114175418
## Forest              135        0        1  25043    0     0 0.0054013265
## Road                  0        0        2      0 1282     0 0.0015576324
## Water                 0        0        0      0    0 42045 0.0000000000

If the model performs to your desired result, ie. such as Out-of-Box (OOB) and other parameters seem alright, save the model.

#save the model
save(rfmodel2019, file = 'rfmodel2019.RData')

#use the model to predict the landcover classes for stack 2019
classification2019 <- predict(stack2019, rfmodel2019, filename = 'classmodel2019.tif', 
                              overwrite = T)

#plot the classification 2019 image using the same color order as for 2016 image
#Agriculture    Built-up    Bushland      Forest        Road       Water 
plot(classification2019, main = 'Classification 2019', col = c('yellow', 'gray85', 'darkolivegreen1', 'darkgreen', 'gray1', 'darkblue'))

#call 2019 image classified using qgis for comparison
qgis2019 <- raster('./sentinel-2-data/classification_2019/classification_2019.tif')
#plot qgis 2019 image
#ordering of classes in qgis was unclassified, water, built-up, bushland, forest, agriculture, road
plot(qgis2019, main = 'Qgis 2019', col = c('black', 'darkblue', 'gray85', 'darkolivegreen1', 
                                           'darkgreen', 'darkgoldenrod1', 'gray48'))

Between R and Qgis, which should be the go to option?

Conducting landcover classification in Qgis is fairly easier, although it has its own share of challenges. When classifying the 2019 Sentinel-2 image in Qgis, the train2016.shp was reloaded, and classification signatures were calculated using the 2016 polygons. This was done through the ’import signaturestool of SCP plugin..scpor.shp` files can be reused. This saves time that would have otherwise been wasted creating a new training file for the year 2019.

Now to the verdict. Surprisingly, the accuracy of random forest in R is better than the Qgis SCP Maximum Likelihood algorithm. However, the latter wins when it comes to simplicity. The former, however, wins the day in almost creating a real world classification scenario. When it comes to analyses, and landcover classification is no exemption, accuracy triumphs over simplicity. Go with R, but if the going gets tough, Qgis is a handy and just-as-good option.

chapter-8-

samuel gachuhi

6/1/2022