1. Summary

This final project deals with the classification and prediction of CORINE land cover classes from multispectral Landsat 8 data. Sample sites were randomly generated for a study region that includes Bremen, Bremerhaven, Oldenburg in the northwest of Germany. With those multispectral sample sites, the land cover classes were predicted using CART and RandomForest in R. The results differed considerably between the two methods with Random Forest (RF) showing a more realistic and precise classification than CART. However, both methods showed a rather high error rate over the whole model. The false classification differed between the main land cover classes and the subclasses. Both models managed to correctly classify most water related classes and subclasses like ocean and rivers. Arable land and most urban environments could not be predicted by the CART model, while the RF model managed to better predict some of these. Forests, marshes and grasslands were reasonably well predicted by both models. The RF method seems to be generating better results for multispectral classification, but it also requires more time to calculate. The high error of both models can be explained by the heterogeneous character of some of the main classes. Most importantly urban and agricultural areas, which probably differ in their spectral profile based on factors such as age, building material, management systems, crop rotations and others. Furthermore some of the land cover subclasses are rather specific, yet still diverging in their multispectral profile (e.g. urban areas). For a better classification of the CORINE data, it might be usefull to look at a larger study area, include more possible parameters for the classification trees (e.g. hyperspectral/multispectral with a higher resolution or vegetation indices), the inclusion of temporal changes in the multispectral data or the use of other classification methods (e.g. neural networks).

2. Introduction

This project aims to predict land cover from multispectral satellite data for a study region around Bremen. Studies have shown that broad land cover types (e.g. trees/forests, arable land, urban areas, pastures, water bodies) can be relatively well predicted from multispectral data (Matikainen et al., 2017; Gislason et al., 2006; Rodriguez-Galiano et al., 2012). Plants differ in their multispectral fingerprint depending on their species, health, age and other factors (Gitelson & Merzlyak, 2010; Ferreira et al., 2016; Rocchini et al., 2005; Moshou et al., 2005). This leads to different mutlispectral responses at different areas depending on their vegetation. But not only vegetation can be the reason for similarities and differences in the multispectral response of land cover classes. For example streets are ususally in a similar color due to asphalt and therefore could maybe be identified by multispectral data classification. Areas with open soil (e.g. mining sites) might also share similarities in their light reflectance. In this project, I try to classify and predict the classes and subclasses of the CORINE land cover data, which are often rather specific.

The research question therefore is: Which main and subclasses of CORINE land cover data can be reliably classified from multispectral Landsat 8 data?

3. Data and methods

The data for this project originates from two sources. The land cover data is part of the CORINE Land Cover (CLC) project, which collects and processes land cover data for all of Europe. The data set has been downloaded from the European Union’s Earth Observation Programme (Copernicus) for the year 2018 and included the whole of Europe (https://land.copernicus.eu/pan-european/corine-land-cover/clc2018?tab=mapview). The multispectral data was received from a website (https://earthexplorer.usgs.gov/) of the United States Geological Survey (USGS) for the Northwest of Germany. The data has been collected by Landsat 8 in April of 2018. Both data sets were clipped by a mask for the chosen study region (an area that includes Bremerhaven, Oldenburg, Bremen as well as the surrounding areas) in QGIS 3.4. The first seven bands (ultra blue, blue, green, red, NIR, SWIR1, SWIR2) of the multispectral data were combined to a virtual raster stack. Further work with and the processing of data was done in R 3.5.1 and RStudio 1.1.463 with the following packages:

library(randomForest)
library(raster)
library(rasterVis)
library(rpart)
library(dismo)
library(randomForestExplainer)
library(rpart.plot)
library(RColorBrewer)

Importing the CLC 2018 data:

setwd("I:/Data")
cover<-raster("./Qgis/MASKclc.tif") # Imports a a tif as a raster

Importing the Landsat 8 data:

setwd("I:/Data")
landsat=stack("./Qgis/mergemulti.vrt")

Processing data:

Land cover data:

#Adding the land cover class names
coverclass=c("Continuous urban fabric","Discontinuous urban fabric","Industrial or commercial units","Road and rail networks","Port areas","Airports","Mineral extraction sites","Dump sites","Construction sites","Green urban areas","Sport and leisure facilities","Non-irrigated arable land","Fruit trees and berry plantations","Pastures","Agriculture/Natural Vegetation","Broad-leaved forest","Coniferous forest","Mixed forest","Natural grasslands","Moors and heathland","Transitional woodland-shrub","Inland marshes","Peat bogs","Salt marshes","Intertidal flats","Water courses","Water bodies","Estuaries","Sea and ocean")

#Ratifying data
cover=ratify(cover)
rat=levels(cover)[[1]]
rat$landcover=coverclass
levels(cover)=rat

Multispectral data:

#Setting the names of the bands
names(landsat)<-c("ultra blue","blue","green","red","NIR","SWIR1","SWIR2")
names(landsat)

## [1] "ultra.blue" "blue"       "green"      "red"        "NIR"       
## [6] "SWIR1"      "SWIR2"

Creating the color palette for land cover classes:

reds=brewer.pal(n = 9, name = "YlOrRd")[c(9,7)] # Creates a new palette from the 9th to the 7th color of the YlOrRd palette.
redspal=colorRampPalette(reds)(11) # Creates a palette of 11 colors between the two chosen colors
greens=brewer.pal(n=9,name="Greens")[c(3,5)]
greenpal=colorRampPalette(greens)(2)
darkgreens=brewer.pal(n=9,name="Greens")[c(8,9)]
darkgreenpal=colorRampPalette(darkgreens)(3)
lightblues=brewer.pal(n = 9, name = "Blues")[c(4,5)]
lightbluepal=colorRampPalette(lightblues)(4)
darkblues=brewer.pal(n = 9, name = "Blues")[c(7,9)]
darkbluepal=colorRampPalette(darkblues)(4)
mypal=c(redspal,"#cda314","#c61be8",greenpal,darkgreenpal,"#93f088","#2f312e","#2ed85f",lightbluepal,darkbluepal)
pie(rep(1, length(mypal)), col = mypal , main="")

This pie chart only shows the color palette that will be used for the CORINE data.

Data structure:

levels(cover)

## [[1]]
##     ID                         landcover
## 1  111           Continuous urban fabric
## 2  112        Discontinuous urban fabric
## 3  121    Industrial or commercial units
## 4  122            Road and rail networks
## 5  123                        Port areas
## 6  124                          Airports
## 7  131          Mineral extraction sites
## 8  132                        Dump sites
## 9  133                Construction sites
## 10 141                 Green urban areas
## 11 142      Sport and leisure facilities
## 12 211         Non-irrigated arable land
## 13 222 Fruit trees and berry plantations
## 14 231                          Pastures
## 15 243    Agriculture/Natural Vegetation
## 16 311               Broad-leaved forest
## 17 312                 Coniferous forest
## 18 313                      Mixed forest
## 19 321                Natural grasslands
## 20 322               Moors and heathland
## 21 324       Transitional woodland-shrub
## 22 411                    Inland marshes
## 23 412                         Peat bogs
## 24 421                      Salt marshes
## 25 423                  Intertidal flats
## 26 511                     Water courses
## 27 512                      Water bodies
## 28 522                         Estuaries
## 29 523                     Sea and ocean

The CORINE data includes multiple classes and subclasses of land cover types of which only a part can be found in Germany and only a subset of those in the study region. The main classes in the region are urban areas (ID 111-142), agricultural areas (ID 211-243), forests (311-324), marshes (411-423) and water bodies/courses (511-523). The urban class contains the highest number of subclasses in the region:

Full Legend of CORINE land cover subclasses:

111 : Continuous urban fabric

112 : Discontinuous urban fabric

121 : Industrial or commercial units

122 : Road and rail networks and associated land

123 : Port areas

124 : Airports

131 : Mineral extraction sites

132 : Dump sites

133 : Construction sites

141 : Green urban areas

142 : Sport and leisure facilities

211 : Non-irrigated arable land

222 : Fruit trees and berry plantations

231 : Pastures

243 : Land principally occupied by agriculture, with significant areas of natural vegetation

311 : Broad-leaved forest

312 : Coniferous forest

313 : Mixed forest

321 : Natural grasslands

322 : Moors and heathland

324 : Transitional woodland-shrub

411 : Inland marshes

412 : Peat bogs

421 : Salt marshes

423 : Intertidal flats

511 : Water courses

512 : Water bodies

522 : Estuaries

523 : Sea and ocean

Study region

Plotting the land cover data of the study region:

levelplot(cover,maxpixels=1e8, col.regions = mypal, main = 'CORINE land cover subclasses of the study region',colorkey=list(labels=list(cex=0.7)),pretty=TRUE,scales=list(draw=FALSE))

The map shows the different classes and subclasses of land cover of the region. Urban areas (111-142) are red, arable land brown, fruit tree plantations purple, grassland/pastures are lightgreen, forests are darkgreen, moors are black, marshes (411-423) are lightblue, water bodies/ocean (511-523) are darkblue. Furthermore the extent of the study region is shown. Bremen is the biggest city in the region and is located south of the center. In the west of Bremen is Oldenburg and in the north is Bremerhaven. The study region also contains a part of the North Sea (north west) as well as the major river Weser that flows into the sea at Bremerhaven. The spring of the Weser is south of the study region. The map shows heterogeneous as well as homogenous areas. There are vast sections of arable land in the south and east, while only few areas of this class exist in the west. Grasslands and pastures dominate the west, the centre and the north of the study region. Forest areas are relatively small but strongly distributed over the entire region with the biggest forests being in the southeast, southwest and between Bremen and Bremerhaven.

Sample sites

Creating 300 random sample sites for each land cover class:

set.seed(74) # Setting a random seed for consistent results.
samples=sampleStratified(cover,size=300,na.rm=TRUE,sp=TRUE)
samples

## class       : SpatialPointsDataFrame 
## features    : 8350 
## extent      : 4191150, 4295950, 3298650, 3401750  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## variables   : 2
## names       :    cell, MASKclc 
## min values  :     747,     111 
## max values  : 1089798,     523

table(samples$MASKclc)

## 
## 111 112 121 122 123 124 131 132 133 141 142 211 222 231 243 311 312 313 
## 300 300 300 300 300 300 300 300  56 300 300 300 300 300 300 300 300 300 
## 321 322 324 411 412 421 423 511 512 522 523 
## 300 194 300 300 300 300 300 300 300 300 300

Class 133 and 194 do not have enough area in the region to generate 300 sites, thus 56 and 194 sites have been created.

Plotting the sample sites:

plt <- levelplot(cover, col.regions = mypal, main = 'Sample site distribution',scales=list(draw=FALSE))
print(plt + layer(sp.points(samples, pch = 3, cex = 0.5, col = 1)))

Due to the amount of different urban land cover subclasses, there is an accumulation of sample sites in and around the cities. For the water bodies there are also many different subclasses, which explains the accumulation of sample sites in the North Sea and in the rivers. For the agricultural and forest subclasses the sample sites are more scattered.

Extracting the multispectral values for the sample sites:

sampvals=extract(landsat,samples,df=TRUE)

## Warning in .local(x, y, ...): Transforming SpatialPoints to the CRS of the
## Raster

sampvals=sampvals[,-1]
sampdata=data.frame(classvalue=samples@data$MASKclc,sampvals)

4. Results and Conclusions

CART

Creating and plotting the classification tree:

carttree=rpart(as.factor(classvalue)~.,data=sampdata,method="class",minsplit=5,cp=0.005)
rpart.plot(carttree,type=0,extra=0,fallen.leaves=TRUE,tweak=1.2) #Tweak changed font size

Legend of CORINE land cover subclasses included in the tree

111 : Continuous urban fabric

123 : Port areas

124 : Airports

141 : Green urban areas

222 : Fruit trees and berry plantations

231 : Pastures

311 : Broad-leaved forest

312 : Coniferous forest

313 : Mixed forest

321 : Natural grasslands

411 : Inland marshes

423 : Intertidal flats

511 : Water courses

512 : Water bodies

522 : Estuaries

523 : Sea and ocean

The classification tree generated with CART is only able to classify 16 out of the 29 total subclasses. Every band, except for the blue one, has been used in the classification tree. Ocean and water body subclasses as well as forest subclasses are all included, while arable land, marshes and moors, and most of the urban land cover subclasses are missing. The splits in the tree already group some of the subclasses of a main class together. For example three of the four subclasses (all ocean related) of the fifth group (512, 522, 523) are grouped under the first split which is based on NIR. The river/water courses subclass seems to show higher levels of NIR compared to the oceanic subclasses with deeper water levels. This is most likely based on the ability of water to absorb near infrared light. Also the green band seems to show higher values for ocean sites. There are differences between water bodies based on turbidity and other factors, which could explain this classification (Doron et al., 2011). The importance of the green band for classification of ocean sites has also been shown in studies (e.g. Mélin & Sclep, 2015). Inland marshes seem to be closer to grasslands in their multispectral profile, while intertidal flats are more similar to the ocean subclasses. Vegetation shows different reflectance of NIR and SWIR light, which is based in part on the chlorophyll in green plants (Gitelson & Merzlyak, 2010). Intertidal flats probably have less vegetation, while inland marshes are settled by many green, herbaceous plants (Adam et al., 2010). This vegetation probably has a similar reflectance to the ones of grasslands (pastures, natural grassland), which could explain the similarity in classification. Forest areas show different SWIR1/2 levels depending on their vegetation with broad-leaved forests being classified as higher and mixed as well as coniferous forests showing lower SWIR values. Studies showed that coniferous forests tend to reflect more in the NIR spectrum on a needle/leaf and wood level than broad-leaved forests but for SWIR coniferous forests show similar or smaller values than broad-leaved dominated forests (Silván-Cárdenas et al., 2015). The difference in the vegetation and their spectral profiles seems to be the reason for the classification of the forest subclasses. Urban areas show higher SWIR1 and ultra blue values than most other classes. We can also see a split between urban areas with more vegetation (124, 141) and those without (111, 123) based on the NIR, where those with more vegetation are characterised by a higher NIR value. The reason for this difference is also most likely the presence and amount of vegetation.

Accuracy, prediction and validation

Printing the relative error:

printcp(carttree)

## 
## Classification tree:
## rpart(formula = as.factor(classvalue) ~ ., data = sampdata, method = "class", 
##     minsplit = 5, cp = 0.005)
## 
## Variables actually used in tree construction:
## [1] green      NIR        red        SWIR1      SWIR2      ultra.blue
## 
## Root node error: 8047/8347 = 0.96406
## 
## n=8347 (3 observations deleted due to missingness)
## 
##           CP nsplit rel error  xerror      xstd
## 1  0.0354169      0   1.00000 1.01305 0.0017149
## 2  0.0318131      1   0.96458 0.97527 0.0026917
## 3  0.0293277      2   0.93277 0.94868 0.0031734
## 4  0.0231142      3   0.90344 0.90555 0.0037803
## 5  0.0222443      4   0.88033 0.89400 0.0039174
## 6  0.0173357      5   0.85808 0.86156 0.0042588
## 7  0.0134212      7   0.82341 0.82205 0.0046040
## 8  0.0108115      8   0.80999 0.81111 0.0046881
## 9  0.0098173      9   0.79918 0.80154 0.0047579
## 10 0.0091960     10   0.78936 0.79943 0.0047729
## 11 0.0086989     11   0.78017 0.78849 0.0048478
## 12 0.0074562     13   0.76277 0.77582 0.0049297
## 13 0.0052193     14   0.75531 0.76873 0.0049732
## 14 0.0050000     15   0.75009 0.76028 0.0050230

The CART tree shows quite high relative and xerror values as well as a rather high node error which already indicates problems with the correct classification of the land cover types. The relative error decreases for each split and thus the tree does not need to be pruned. However, in the later splits the decrease is rather small (<0.01).

Predicting the land cover for the study area:

predcart=predict(landsat,carttree,type="class")

Ratifying:

predcart=ratify(predcart)

Plotting the predicted classification based on CART:

predcolor=c("#800026","#A70A22","#B10D20","#D9171D","#c61be8","#C7E9C0","#006D2C","#005823","#00441B","#93f088","#9ECAE1","#8DC0DD","#2171B5","#185B9C","#104583","#08306B")
rat=levels(predcart)[[1]]
rat$landcover=c("Continuous urban fabric","Port areas","Airports","Green urban areas","Fruit trees and berry plantations","Pastures","Broad-leaved forest","Coniferous forest","Mixed forest","Natural grasslands","Inland marshes","Intertidal flats","Water courses","Water bodies","Estuaries","Sea and ocean")
levels(predcart)=rat
levelplot(predcart,maxpixels=1e8,col.regions=predcolor,scales=list(draw=FALSE),main="CART classification of Landsat 8 data")

The water bodies and courses as well as many grassland areas in the west of the study region have been correctly classsified. The biggest problem arises due to arable land not being in the classification tree. The arable land is therefor classified as forests, grassland and urban areas. The main cities have been also predicted correctly for the most part, but due to the misclassification as arable land show a way too big total area. The biggest forests have also more or less been classified correctly, but the total area here is also too big due to the arable land being misclassified as forests. Fruit trees and berry plantations have been predicted way too much. In reality only an area in the northest is heavily characterised by this subclass. This map also shows the big impact of the high error rates of the CART tree.

Validation and confusion matrix of the model with k-fold cross-validation (5 groups):

j <- kfold(sampdata, k = 5, by=sampdata$classvalue)
table(j)

## j
##    1    2    3    4    5 
## 1670 1670 1670 1670 1670

Train and test the model five times:

x <- list()
for (k in 1:5) {
    train <- sampdata[j!= k, ]
    test <- sampdata[j == k, ]
    cart <- rpart(as.factor(classvalue)~., data=train, method = 'class', minsplit = 5)
    pclass <- predict(carttree, test, type='class')
    # create a data.frame using the reference and prediction
    x[[k]] <- cbind(test$classvalue, as.integer(pclass))
}

Combining the five lists into one data frame and computing a confusion matrix:

y <- do.call(rbind, x)
y <- data.frame(y)
# Column titles
colnames(y) <- c('observed', 'predicted')
conmat <- table(y)
# Column names of predicted
colnames(conmat) <- c("111","123","124","141","222","231","311","312","313","321","411","423","511","512","522","523")
conmat

##         predicted
## observed 111 123 124 141 222 231 311 312 313 321 411 423 511 512 522 523
##      111 186  23   2  22   3   1  15   0  11   4  31   0   1   0   1   0
##      112  20   1   7  77  25  16  58   0  12  46  36   0   0   2   0   0
##      121  33  26  30  88   9  11  15   0   5  36  44   0   1   2   0   0
##      122 124  23   6  62   1   0  19   1   4  16  43   0   1   0   0   0
##      123  64 110  26  51   1   6   0   1   1   8  23   0   8   0   0   1
##      124   6   2 193  35   2  30   3   0   1  11  11   0   0   5   1   0
##      131  16   5 120  35   5   4  12   0   3  41  13   0   9  21   0  16
##      132  15  10  26  41   9  30  30  44   9  57  26   0   2   0   1   0
##      133   1   1  11  13   2   2   5   0   0  11   9   0   1   0   0   0
##      141  27   1  20 109  20  10  53   1  25  18  13   0   2   1   0   0
##      142   7   0  18 100  12  38  24   5  43  38   9   0   2   4   0   0
##      211   5   0  57  42  14  57  29   0   6  55  35   0   0   0   0   0
##      222   0   0   4   8  96  63  44   0   2  71   9   0   0   2   1   0
##      231   1   0  17  29  18 160  16   1  19  29   9   0   0   1   0   0
##      243   2   0  22  67  17  44  68   3  40  29   6   0   0   2   0   0
##      311   2   0   4  32  20   5 104  16  65  44   7   0   0   1   0   0
##      312   0   0   4  48   3   2  21  61 134  21   5   0   1   0   0   0
##      313   0   0   4  33   9   9  53  11 152  21   7   0   1   0   0   0
##      321   2   0  23   5   7  35  23   3  17 102  80   0   1   2   0   0
##      322   0   0  48  38   4  12  27   0   2  55   8   0   0   0   0   0
##      324   0   0  23  23  26  26  58   9  61  61  12   0   0   1   0   0
##      411  11   0  14   8   3   2  25   7  32  60 131   0   2   3   2   0
##      412   0   0  13   5   5  11  47  13  77  99  23   0   0   7   0   0
##      421  21   0  39  34   2  17   0   0  36  89  61   0   1   0   0   0
##      423  16   0   0   1   0   0   0   2   0   1   5 210  44   0   0  21
##      511  59  13   1  53   0   2   1   2   6   2  10   6 125  11   9   0
##      512  18   4   0   5   1   0   5  16  23   3   9   2  16 176   9  13
##      522   3   0   0   0   0   0   1   9   5   2   9  20  17  35 117  82
##      523   1   0   0   0   0   0   0   0   0   0   0  13   1   5   1 279

The confusion matrix also supports the conclusions that have been drawn from the prediction map and the classification tree. Misclassification of urban subclasses as other subclasses is quite high as the map already suggested. Interesting is the rather high misclassification of coniferous forest as mixed forest. It might be possible that the mixed forest in the study area is dominated by coniferous trees and thus closer to that forest type in vegetation and multispectral profile. Marshes and ocean related subclasses show very small misclassification numbers, which was already visible in the prediction map. Interesting is also the rather correct classification of airport and continuous uzrban fabric areas.

Random Forest

Creating the forest:

sampdata$classvalue=as.factor(sampdata$classvalue)
sampdata=na.exclude(sampdata)
rf=randomForest(classvalue~.,data=sampdata,method="class",ntree=200,localImp=TRUE)
rf

## 
## Call:
##  randomForest(formula = classvalue ~ ., data = sampdata, method = "class",      ntree = 200, localImp = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 54.26%
## Confusion matrix:
##     111 112 121 122 123 124 131 132 133 141 142 211 222 231 243 311 312
## 111 150  18  17  48  20   1   4   1   0   6   8   7   1   1   1   1   0
## 112  17  60  11  23   4   8   8   8   1  30  22  18  24   8  20   5   3
## 121  20  16  47  19  44   8  10  12   3  19   6  10   8  12   6   2   5
## 122  47   8  13 124  28   2   1   6   2  16   3   2   4   1   2   6   1
## 123  33   2  22  41 141   7   5   3   2   3   4   4   2   4   0   0   0
## 124   1   7   9   3  12 179  19   5   2   3   8  11   4   2   3   0   0
## 131   0   8   5   2  14  21 140   9   2   4   2  12  10   2   2   1   0
## 132   5  10  13  12   8   9  17  92   2   4   8   5  12   7  10  10   0
## 133   2   4   1   1   7   1   9   0   3   1   0  11   5   0   0   1   0
## 141   6  20   3  26   4   9   5   1   0  95  29   6  20  10   5  16   8
## 142   2  12   3  11   0  14   3   6   0  39  52  11  10  22  17  14  26
## 211   4  21   3   4   7  18  24   8   2   1  13  55  15  35  11   8   7
## 222   0   7   0   1   0   0   4   4   1  13   4  10 184  17   8   8   0
## 231   0  10   4   3   2  12   4   5   0   7  21  17  16 115  25  10   4
## 243   1  20   7   1   0  10   3   4   0  10  15  18   8  26  83  21  11
## 311   2   2   2   6   0   2   2   2   0  15  11   6  10   2  16  75  31
## 312   0   4   0   2   0   0   1   0   0  14  12   7   0   1   5  26 153
## 313   1   7   1   3   0   0   3   2   0  11  20   1   1   6  19  36  64
## 321   0   4   5   5   1   2   9   4   1   3   5  11  19  14   7  11   1
## 322   0   2   2   1   0   8   2   6   0   5  10   4   0   2   2  14   2
## 324   0   4   2   6   0   9   4   4   0   5   5   2  16   8  12  33  11
## 411   3   0   0  12   3   2   2   1   1   3   6   8   8   2   2   3   1
## 412   0   2   2   1   0   2   4   2   0   3   1   2   3   6   4  23   6
## 421   0   0   9   1   3   3   7  10   1   1   4   3   2   7   0   0   2
## 423   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   0
## 511   1   2   5   4   3   1   3   0   0   3   2   1   0   3   4   0   2
## 512   8   2   0   0   4   2   7   3   0   1   1   0   0   0   4   3   1
## 522   1   0   0   0   0   0   3   1   0   0   0   0   0   0   0   0   0
## 523   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##     313 321 322 324 411 412 421 423 511 512 522 523 class.error
## 111   0   2   0   1   7   1   0   0   2   3   0   0   0.5000000
## 112   4   3   2   4   6   4   3   0   2   2   0   0   0.8000000
## 121   0  10   8   5   9   2  12   0   3   3   1   0   0.8433333
## 122   0   6   7   3  12   0   4   0   2   0   0   0   0.5866667
## 123   0   1   0   1   6   0   4   0  11   3   0   1   0.5300000
## 124   0   4   1   2   6   1  11   0   0   5   2   0   0.4033333
## 131   1   7   7   7   2   7  16   3   4  11   1   0   0.5333333
## 132   2  10   9   9  10   9  22   0   2   1   2   0   0.6933333
## 133   0   3   0   1   2   0   2   0   1   1   0   0   0.9464286
## 141   8   5   7   5   5   0   0   0   3   4   0   0   0.6833333
## 142  19   4  10   8   0   2   6   0   4   5   0   0   0.8266667
## 211   4  19   3   7  18   5   8   0   0   0   0   0   0.8166667
## 222   2  14   0  12   3   3   4   0   0   1   0   0   0.3866667
## 231   7   7   4  10   2   7   6   0   0   2   0   0   0.6166667
## 243  18   7   7  15   3   3   4   0   1   4   0   0   0.7233333
## 311  47   7  15  27   3  16   0   0   0   1   0   0   0.7500000
## 312  47   1   5  10   1   5   2   0   4   0   0   0   0.4900000
## 313  72   6   4  20   4  17   0   0   0   1   0   0   0.7591973
## 321   2  97   9  11  37  18  16   1   0   7   0   0   0.6766667
## 322   6   2 108   9   1   6   2   0   0   0   0   0   0.4432990
## 324  17   3  21  96  14  17   4   0   1   5   1   0   0.6800000
## 411   3  31   5  10 150  18   8   0   8   5   4   0   0.4983278
## 412  15  10  10  28  25 131   9   0   0  11   0   0   0.5633333
## 421   0  13   1   1  13   5 203   2   7   0   2   0   0.3233333
## 423   0   0   0   0   0   0   2 265   2   0  17  12   0.1166667
## 511   0   0   0   0   1   1  10   0 242   9   2   0   0.1906355
## 512   2   4   0   2   3  17   1   1  25 200   9   0   0.3333333
## 522   1   2   0   1   1   1   3  23   6  10 236  11   0.2133333
## 523   0   0   0   0   0   0   0  19   0   1  10 270   0.1000000

The out-of-bag error is above 50% and therefore quite high. Thus the total ability of the modell to predict the classes is quite poor, however it seems to be better suited than the CART tree. The confusion matrix shows strong differences between classes and subclasses, which suggests that the model can predict some reasonably well while it fails to correctly classify others. The subclasses 112,121,132,133142,211,243,311,313 show rather high class errors, but at least the random forest actually classified all subclasses in contrast to the CART tree. The same subclasses that were reasonably well predicted by the CART tree are also the ones with the lowest class errors here (water related subclasses).

Error plot of subclasses:

layout(matrix(c(1,2),nrow=1),
       width=c(4,1)) 
par(mar=c(5,4,4,0)) #No margin on the right side
plot(rf, log="y")
par(mar=c(5,0,4,2)) #No margin on the left side
plot(c(0,1),type="n", axes=F, xlab="", ylab="")
legend("top", colnames(rf$err.rate),col=1:4,cex=0.5,fill=1:4)

As we can see the error is decreasing for most parameter classes the higher the amount of generated trees is. However, there are some classes that show a small increase in error in the later trees as well as strong fluctuations. It is also visible that only 10 subclasses achieve an error below 0.5, which points to a rather bad prediction accuracy.

Measuring parameter importance:

importancerf <- measure_importance(rf)

importancerf

This table shows some accuracy and importance parameters of the seven factors used for prediction. NIR, SWIR1 and SWIR2 are often the first or second split in the trees, they also have the highest numbers of nodes and are often the root, which all suggests their importance for the whole model. The correlation is better visible in the following mutli-way importance plot:

plot_multi_way_importance(importancerf, size_measure = "no_of_nodes")

There is a clear grouping of NIR, SWIR1 and SWIR2 at the top left, which shows their importance. These vegetation related spectrums are the most important ones. Blue is interestingly more often a root and has a lower mean min depth but this band also has an incredibly small amount of nodes that it appears in over all trees.

Mindepth:

mindepth <- min_depth_distribution(rf)

Distribution of minimal depth:

plot_min_depth_distribution(mindepth)

This plot shows a more detailed overview of the minimal depth of the parameters. NIR, SWIR1 and SWIR2 are often responsible for splits at zero and one depth, while ultra blue, green and blue are more often at minimal depth of two.

Predicting the land cover using RF:

rfpred=predict(landsat,rf,type="response",index=1,na.rm=TRUE, progress="window")

## Loading required namespace: tcltk

Ratifying and plotting the predicted land cover classes:

rfpred=ratify(rfpred)
rat=levels(rfpred)[[1]]
rat$landcover=coverclass
levels(rfpred)=rat
levelplot(rfpred,maxpixels=1e8, col.regions = mypal, main = 'RF classification of Landsat 8 data',scales=list(draw=FALSE))

This map shows the predicted land cover classes and it validates the information gathered before. We can see, that the water related classes are mostly correctly classified. Building and settlement related areas are overestimated while agricultural areas are predicted to be smaller than they are in reality. However, the map seems to be closer to the actual CORINE land cover than the map generated with the CART classification tree. There certainly are still major deviations from the observed land covers, but the overall classification seems to have imporved. A major factor for this is that RF includes all classes, and while it still underestimates the agricultural areas, it still manages to better portray reality. Due to the inclusion of all classes there are also way less misclassifications of other subclasses as urban ones. The fruit tree classifications in the northeast are very similar to the observed ones. There are still too many of these areas in the center and the west of the study area, but compared to the CART map, these are less distributed. The water related areas are even more close to reality compared to CART. In conclusion, the RF method seems to generate better classifications for the land cover classes of the study area than CART. Both methods still have considerably high error rates and there are certain subclasses that are for the most part wrongly predicted with either method. To achieve better results in the future, it would probably make sense to incorperate different multispectral data from different times of the year. This together with a modification of the RF method could maybe achieve better results for the agricultural classes, that show very different multispectral profiles based on a multitude of possible factors such as farming systems and management, crop rotation,….

5. Sources and References

Adam, E., Mutanga, O., & Rugege, D. (2010). Multispectral and hyperspectral remote sensing for identification and mapping of wetland vegetation: a review. Wetlands Ecology and Management, 18(3), 281-296. https://doi.org/10.1007/s11273-009-9169-z

Doron, M., Bélanger, S., Doxaran, D., & Babin, M. (2011). Spectral variations in the near-infrared ocean reflectance. Remote Sensing of Environment, 115(7), 1617-1631. https://doi.org/10.1016/j.rse.2011.01.015

Ferreira, M. P., Zortea, M., Zanotta, D. C., Shimabukuro, Y. E., & de Souza Filho, C. R. (2016). Mapping tree species in tropical seasonal semi-deciduous forests with hyperspectral and multispectral data. Remote Sensing of Environment, 179, 66-78. https://doi.org/10.1016/j.rse.2016.03.021

Gislason, P. O., Benediktsson, J. A., & Sveinsson, J. R. (2006). Random Forests for land cover classification. Pattern Recognition Letters, 27(4), 294-300. https://doi.org/10.1016/j.patrec.2005.08.011

Gitelson, A. A., & Merzlyak, M. N. (2010). Remote estimation of chlorophyll content in higher plant leaves. International Journal of Remote Sensing. https://doi.org/10.1080/014311697217558

Matikainen, L., Karila, K., Hyyppä, J., Litkey, P., Puttonen, E., & Ahokas, E. (2017). Object-based analysis of multispectral airborne laser scanner data for land cover classification and map updating. ISPRS Journal of Photogrammetry and Remote Sensing, 128, 298-313. https://doi.org/10.1016/j.isprsjprs.2017.04.005

Mélin, F., & Sclep, G. (2015). Band shifting for ocean color multi-spectral reflectance data. Optics Express, 23(3), 2262. https://doi.org/10.1364/OE.23.002262

Moshou, D., Bravo, C., Oberti, R., West, J., Bodria, L., McCartney, A., & Ramon, H. (2005). Plant disease detection based on data fusion of hyper-spectral and multi-spectral fluorescence imaging using Kohonen maps. Real-Time Imaging, 11(2), 75-83. https://doi.org/10.1016/j.rti.2005.03.003

Myneni, R. B., Hall, F. G., Sellers, P. J., & Marshak, A. L. (1995). The interpretation of spectral vegetation indexes. IEEE Transactions on Geoscience and Remote Sensing, 33(2), 481-486. https://doi.org/10.1109/36.377948

Rocchini, D., Andreini Butini, S., & Chiarucci, A. (2005). Maximizing plant species inventory efficiency by means of remotely sensed spectral distances: Maximizing plant species inventorying by remote sensing. Global Ecology and Biogeography, 14(5), 431-437. https://doi.org/10.1111/j.1466-822x.2005.00169.x

Rodriguez-Galiano, V. F., Ghimire, B., Rogan, J., Chica-Olmo, M., & Rigol-Sanchez, J. P. (2012). An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS Journal of Photogrammetry and Remote Sensing, 67, 93-104. https://doi.org/10.1016/j.isprsjprs.2011.11.002

Silván-Cárdenas, J., Corona, N., Mauricio Galeana Pizaña, J., Núñez, J. M., & Madrigal, J. (2015). Geospatial Technologies to Support Coniferous Forests Research and Conservation efforts in Mexico (pp. 67-123).

https://rspatial.org/rs/5-supclassification.html

http://bleutner.github.io/RStoolbox/rstbx-docu/superClass.html

Evaluation of land cover classification from multispectral images using CART and Random Forest - Schnepel