This final project deals with the classification and prediction of CORINE land cover classes from multispectral Landsat 8 data. Sample sites were randomly generated for a study region that includes Bremen, Bremerhaven, Oldenburg in the northwest of Germany. With those multispectral sample sites, the land cover classes were predicted using CART and RandomForest in R. The results differed considerably between the two methods with Random Forest (RF) showing a more realistic and precise classification than CART. However, both methods showed a rather high error rate over the whole model. The false classification differed between the main land cover classes and the subclasses. Both models managed to correctly classify most water related classes and subclasses like ocean and rivers. Arable land and most urban environments could not be predicted by the CART model, while the RF model managed to better predict some of these. Forests, marshes and grasslands were reasonably well predicted by both models. The RF method seems to be generating better results for multispectral classification, but it also requires more time to calculate. The high error of both models can be explained by the heterogeneous character of some of the main classes. Most importantly urban and agricultural areas, which probably differ in their spectral profile based on factors such as age, building material, management systems, crop rotations and others. Furthermore some of the land cover subclasses are rather specific, yet still diverging in their multispectral profile (e.g. urban areas). For a better classification of the CORINE data, it might be usefull to look at a larger study area, include more possible parameters for the classification trees (e.g. hyperspectral/multispectral with a higher resolution or vegetation indices), the inclusion of temporal changes in the multispectral data or the use of other classification methods (e.g. neural networks).
This project aims to predict land cover from multispectral satellite data for a study region around Bremen. Studies have shown that broad land cover types (e.g. trees/forests, arable land, urban areas, pastures, water bodies) can be relatively well predicted from multispectral data (Matikainen et al., 2017; Gislason et al., 2006; Rodriguez-Galiano et al., 2012). Plants differ in their multispectral fingerprint depending on their species, health, age and other factors (Gitelson & Merzlyak, 2010; Ferreira et al., 2016; Rocchini et al., 2005; Moshou et al., 2005). This leads to different mutlispectral responses at different areas depending on their vegetation. But not only vegetation can be the reason for similarities and differences in the multispectral response of land cover classes. For example streets are ususally in a similar color due to asphalt and therefore could maybe be identified by multispectral data classification. Areas with open soil (e.g. mining sites) might also share similarities in their light reflectance. In this project, I try to classify and predict the classes and subclasses of the CORINE land cover data, which are often rather specific.
The research question therefore is: Which main and subclasses of CORINE land cover data can be reliably classified from multispectral Landsat 8 data?
The data for this project originates from two sources. The land cover data is part of the CORINE Land Cover (CLC) project, which collects and processes land cover data for all of Europe. The data set has been downloaded from the European Union’s Earth Observation Programme (Copernicus) for the year 2018 and included the whole of Europe (https://land.copernicus.eu/pan-european/corine-land-cover/clc2018?tab=mapview). The multispectral data was received from a website (https://earthexplorer.usgs.gov/) of the United States Geological Survey (USGS) for the Northwest of Germany. The data has been collected by Landsat 8 in April of 2018. Both data sets were clipped by a mask for the chosen study region (an area that includes Bremerhaven, Oldenburg, Bremen as well as the surrounding areas) in QGIS 3.4. The first seven bands (ultra blue, blue, green, red, NIR, SWIR1, SWIR2) of the multispectral data were combined to a virtual raster stack. Further work with and the processing of data was done in R 3.5.1 and RStudio 1.1.463 with the following packages:
library(randomForest)
library(raster)
library(rasterVis)
library(rpart)
library(dismo)
library(randomForestExplainer)
library(rpart.plot)
library(RColorBrewer)
Importing the CLC 2018 data:
setwd("I:/Data")
cover<-raster("./Qgis/MASKclc.tif") # Imports a a tif as a raster
Importing the Landsat 8 data:
setwd("I:/Data")
landsat=stack("./Qgis/mergemulti.vrt")
Land cover data:
#Adding the land cover class names
coverclass=c("Continuous urban fabric","Discontinuous urban fabric","Industrial or commercial units","Road and rail networks","Port areas","Airports","Mineral extraction sites","Dump sites","Construction sites","Green urban areas","Sport and leisure facilities","Non-irrigated arable land","Fruit trees and berry plantations","Pastures","Agriculture/Natural Vegetation","Broad-leaved forest","Coniferous forest","Mixed forest","Natural grasslands","Moors and heathland","Transitional woodland-shrub","Inland marshes","Peat bogs","Salt marshes","Intertidal flats","Water courses","Water bodies","Estuaries","Sea and ocean")
#Ratifying data
cover=ratify(cover)
rat=levels(cover)[[1]]
rat$landcover=coverclass
levels(cover)=rat
Multispectral data:
#Setting the names of the bands
names(landsat)<-c("ultra blue","blue","green","red","NIR","SWIR1","SWIR2")
names(landsat)
## [1] "ultra.blue" "blue" "green" "red" "NIR"
## [6] "SWIR1" "SWIR2"
Creating the color palette for land cover classes:
reds=brewer.pal(n = 9, name = "YlOrRd")[c(9,7)] # Creates a new palette from the 9th to the 7th color of the YlOrRd palette.
redspal=colorRampPalette(reds)(11) # Creates a palette of 11 colors between the two chosen colors
greens=brewer.pal(n=9,name="Greens")[c(3,5)]
greenpal=colorRampPalette(greens)(2)
darkgreens=brewer.pal(n=9,name="Greens")[c(8,9)]
darkgreenpal=colorRampPalette(darkgreens)(3)
lightblues=brewer.pal(n = 9, name = "Blues")[c(4,5)]
lightbluepal=colorRampPalette(lightblues)(4)
darkblues=brewer.pal(n = 9, name = "Blues")[c(7,9)]
darkbluepal=colorRampPalette(darkblues)(4)
mypal=c(redspal,"#cda314","#c61be8",greenpal,darkgreenpal,"#93f088","#2f312e","#2ed85f",lightbluepal,darkbluepal)
pie(rep(1, length(mypal)), col = mypal , main="")
This pie chart only shows the color palette that will be used for the CORINE data.
Data structure:
levels(cover)
## [[1]]
## ID landcover
## 1 111 Continuous urban fabric
## 2 112 Discontinuous urban fabric
## 3 121 Industrial or commercial units
## 4 122 Road and rail networks
## 5 123 Port areas
## 6 124 Airports
## 7 131 Mineral extraction sites
## 8 132 Dump sites
## 9 133 Construction sites
## 10 141 Green urban areas
## 11 142 Sport and leisure facilities
## 12 211 Non-irrigated arable land
## 13 222 Fruit trees and berry plantations
## 14 231 Pastures
## 15 243 Agriculture/Natural Vegetation
## 16 311 Broad-leaved forest
## 17 312 Coniferous forest
## 18 313 Mixed forest
## 19 321 Natural grasslands
## 20 322 Moors and heathland
## 21 324 Transitional woodland-shrub
## 22 411 Inland marshes
## 23 412 Peat bogs
## 24 421 Salt marshes
## 25 423 Intertidal flats
## 26 511 Water courses
## 27 512 Water bodies
## 28 522 Estuaries
## 29 523 Sea and ocean
The CORINE data includes multiple classes and subclasses of land cover types of which only a part can be found in Germany and only a subset of those in the study region. The main classes in the region are urban areas (ID 111-142), agricultural areas (ID 211-243), forests (311-324), marshes (411-423) and water bodies/courses (511-523). The urban class contains the highest number of subclasses in the region:
111 : Continuous urban fabric
112 : Discontinuous urban fabric
121 : Industrial or commercial units
122 : Road and rail networks and associated land
123 : Port areas
124 : Airports
131 : Mineral extraction sites
132 : Dump sites
133 : Construction sites
141 : Green urban areas
142 : Sport and leisure facilities
211 : Non-irrigated arable land
222 : Fruit trees and berry plantations
231 : Pastures
243 : Land principally occupied by agriculture, with significant areas of natural vegetation
311 : Broad-leaved forest
312 : Coniferous forest
313 : Mixed forest
321 : Natural grasslands
322 : Moors and heathland
324 : Transitional woodland-shrub
411 : Inland marshes
412 : Peat bogs
421 : Salt marshes
423 : Intertidal flats
511 : Water courses
512 : Water bodies
522 : Estuaries
523 : Sea and ocean
Plotting the land cover data of the study region:
levelplot(cover,maxpixels=1e8, col.regions = mypal, main = 'CORINE land cover subclasses of the study region',colorkey=list(labels=list(cex=0.7)),pretty=TRUE,scales=list(draw=FALSE))
The map shows the different classes and subclasses of land cover of the region. Urban areas (111-142) are red, arable land brown, fruit tree plantations purple, grassland/pastures are lightgreen, forests are darkgreen, moors are black, marshes (411-423) are lightblue, water bodies/ocean (511-523) are darkblue. Furthermore the extent of the study region is shown. Bremen is the biggest city in the region and is located south of the center. In the west of Bremen is Oldenburg and in the north is Bremerhaven. The study region also contains a part of the North Sea (north west) as well as the major river Weser that flows into the sea at Bremerhaven. The spring of the Weser is south of the study region. The map shows heterogeneous as well as homogenous areas. There are vast sections of arable land in the south and east, while only few areas of this class exist in the west. Grasslands and pastures dominate the west, the centre and the north of the study region. Forest areas are relatively small but strongly distributed over the entire region with the biggest forests being in the southeast, southwest and between Bremen and Bremerhaven.
Creating 300 random sample sites for each land cover class:
set.seed(74) # Setting a random seed for consistent results.
samples=sampleStratified(cover,size=300,na.rm=TRUE,sp=TRUE)
samples
## class : SpatialPointsDataFrame
## features : 8350
## extent : 4191150, 4295950, 3298650, 3401750 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## variables : 2
## names : cell, MASKclc
## min values : 747, 111
## max values : 1089798, 523
table(samples$MASKclc)
##
## 111 112 121 122 123 124 131 132 133 141 142 211 222 231 243 311 312 313
## 300 300 300 300 300 300 300 300 56 300 300 300 300 300 300 300 300 300
## 321 322 324 411 412 421 423 511 512 522 523
## 300 194 300 300 300 300 300 300 300 300 300
Class 133 and 194 do not have enough area in the region to generate 300 sites, thus 56 and 194 sites have been created.
Plotting the sample sites:
plt <- levelplot(cover, col.regions = mypal, main = 'Sample site distribution',scales=list(draw=FALSE))
print(plt + layer(sp.points(samples, pch = 3, cex = 0.5, col = 1)))
Due to the amount of different urban land cover subclasses, there is an accumulation of sample sites in and around the cities. For the water bodies there are also many different subclasses, which explains the accumulation of sample sites in the North Sea and in the rivers. For the agricultural and forest subclasses the sample sites are more scattered.
Extracting the multispectral values for the sample sites:
sampvals=extract(landsat,samples,df=TRUE)
## Warning in .local(x, y, ...): Transforming SpatialPoints to the CRS of the
## Raster
sampvals=sampvals[,-1]
sampdata=data.frame(classvalue=samples@data$MASKclc,sampvals)
Creating and plotting the classification tree:
carttree=rpart(as.factor(classvalue)~.,data=sampdata,method="class",minsplit=5,cp=0.005)
rpart.plot(carttree,type=0,extra=0,fallen.leaves=TRUE,tweak=1.2) #Tweak changed font size
111 : Continuous urban fabric
123 : Port areas
124 : Airports
141 : Green urban areas
222 : Fruit trees and berry plantations
231 : Pastures
311 : Broad-leaved forest
312 : Coniferous forest
313 : Mixed forest
321 : Natural grasslands
411 : Inland marshes
423 : Intertidal flats
511 : Water courses
512 : Water bodies
522 : Estuaries
523 : Sea and ocean
The classification tree generated with CART is only able to classify 16 out of the 29 total subclasses. Every band, except for the blue one, has been used in the classification tree. Ocean and water body subclasses as well as forest subclasses are all included, while arable land, marshes and moors, and most of the urban land cover subclasses are missing. The splits in the tree already group some of the subclasses of a main class together. For example three of the four subclasses (all ocean related) of the fifth group (512, 522, 523) are grouped under the first split which is based on NIR. The river/water courses subclass seems to show higher levels of NIR compared to the oceanic subclasses with deeper water levels. This is most likely based on the ability of water to absorb near infrared light. Also the green band seems to show higher values for ocean sites. There are differences between water bodies based on turbidity and other factors, which could explain this classification (Doron et al., 2011). The importance of the green band for classification of ocean sites has also been shown in studies (e.g. Mélin & Sclep, 2015). Inland marshes seem to be closer to grasslands in their multispectral profile, while intertidal flats are more similar to the ocean subclasses. Vegetation shows different reflectance of NIR and SWIR light, which is based in part on the chlorophyll in green plants (Gitelson & Merzlyak, 2010). Intertidal flats probably have less vegetation, while inland marshes are settled by many green, herbaceous plants (Adam et al., 2010). This vegetation probably has a similar reflectance to the ones of grasslands (pastures, natural grassland), which could explain the similarity in classification. Forest areas show different SWIR1/2 levels depending on their vegetation with broad-leaved forests being classified as higher and mixed as well as coniferous forests showing lower SWIR values. Studies showed that coniferous forests tend to reflect more in the NIR spectrum on a needle/leaf and wood level than broad-leaved forests but for SWIR coniferous forests show similar or smaller values than broad-leaved dominated forests (Silván-Cárdenas et al., 2015). The difference in the vegetation and their spectral profiles seems to be the reason for the classification of the forest subclasses. Urban areas show higher SWIR1 and ultra blue values than most other classes. We can also see a split between urban areas with more vegetation (124, 141) and those without (111, 123) based on the NIR, where those with more vegetation are characterised by a higher NIR value. The reason for this difference is also most likely the presence and amount of vegetation.
Printing the relative error:
printcp(carttree)
##
## Classification tree:
## rpart(formula = as.factor(classvalue) ~ ., data = sampdata, method = "class",
## minsplit = 5, cp = 0.005)
##
## Variables actually used in tree construction:
## [1] green NIR red SWIR1 SWIR2 ultra.blue
##
## Root node error: 8047/8347 = 0.96406
##
## n=8347 (3 observations deleted due to missingness)
##
## CP nsplit rel error xerror xstd
## 1 0.0354169 0 1.00000 1.01305 0.0017149
## 2 0.0318131 1 0.96458 0.97527 0.0026917
## 3 0.0293277 2 0.93277 0.94868 0.0031734
## 4 0.0231142 3 0.90344 0.90555 0.0037803
## 5 0.0222443 4 0.88033 0.89400 0.0039174
## 6 0.0173357 5 0.85808 0.86156 0.0042588
## 7 0.0134212 7 0.82341 0.82205 0.0046040
## 8 0.0108115 8 0.80999 0.81111 0.0046881
## 9 0.0098173 9 0.79918 0.80154 0.0047579
## 10 0.0091960 10 0.78936 0.79943 0.0047729
## 11 0.0086989 11 0.78017 0.78849 0.0048478
## 12 0.0074562 13 0.76277 0.77582 0.0049297
## 13 0.0052193 14 0.75531 0.76873 0.0049732
## 14 0.0050000 15 0.75009 0.76028 0.0050230
The CART tree shows quite high relative and xerror values as well as a rather high node error which already indicates problems with the correct classification of the land cover types. The relative error decreases for each split and thus the tree does not need to be pruned. However, in the later splits the decrease is rather small (<0.01).
Predicting the land cover for the study area:
predcart=predict(landsat,carttree,type="class")
Ratifying:
predcart=ratify(predcart)
Plotting the predicted classification based on CART:
predcolor=c("#800026","#A70A22","#B10D20","#D9171D","#c61be8","#C7E9C0","#006D2C","#005823","#00441B","#93f088","#9ECAE1","#8DC0DD","#2171B5","#185B9C","#104583","#08306B")
rat=levels(predcart)[[1]]
rat$landcover=c("Continuous urban fabric","Port areas","Airports","Green urban areas","Fruit trees and berry plantations","Pastures","Broad-leaved forest","Coniferous forest","Mixed forest","Natural grasslands","Inland marshes","Intertidal flats","Water courses","Water bodies","Estuaries","Sea and ocean")
levels(predcart)=rat
levelplot(predcart,maxpixels=1e8,col.regions=predcolor,scales=list(draw=FALSE),main="CART classification of Landsat 8 data")
The water bodies and courses as well as many grassland areas in the west of the study region have been correctly classsified. The biggest problem arises due to arable land not being in the classification tree. The arable land is therefor classified as forests, grassland and urban areas. The main cities have been also predicted correctly for the most part, but due to the misclassification as arable land show a way too big total area. The biggest forests have also more or less been classified correctly, but the total area here is also too big due to the arable land being misclassified as forests. Fruit trees and berry plantations have been predicted way too much. In reality only an area in the northest is heavily characterised by this subclass. This map also shows the big impact of the high error rates of the CART tree.
Validation and confusion matrix of the model with k-fold cross-validation (5 groups):
j <- kfold(sampdata, k = 5, by=sampdata$classvalue)
table(j)
## j
## 1 2 3 4 5
## 1670 1670 1670 1670 1670
Train and test the model five times:
x <- list()
for (k in 1:5) {
train <- sampdata[j!= k, ]
test <- sampdata[j == k, ]
cart <- rpart(as.factor(classvalue)~., data=train, method = 'class', minsplit = 5)
pclass <- predict(carttree, test, type='class')
# create a data.frame using the reference and prediction
x[[k]] <- cbind(test$classvalue, as.integer(pclass))
}
Combining the five lists into one data frame and computing a confusion matrix:
y <- do.call(rbind, x)
y <- data.frame(y)
# Column titles
colnames(y) <- c('observed', 'predicted')
conmat <- table(y)
# Column names of predicted
colnames(conmat) <- c("111","123","124","141","222","231","311","312","313","321","411","423","511","512","522","523")
conmat
## predicted
## observed 111 123 124 141 222 231 311 312 313 321 411 423 511 512 522 523
## 111 186 23 2 22 3 1 15 0 11 4 31 0 1 0 1 0
## 112 20 1 7 77 25 16 58 0 12 46 36 0 0 2 0 0
## 121 33 26 30 88 9 11 15 0 5 36 44 0 1 2 0 0
## 122 124 23 6 62 1 0 19 1 4 16 43 0 1 0 0 0
## 123 64 110 26 51 1 6 0 1 1 8 23 0 8 0 0 1
## 124 6 2 193 35 2 30 3 0 1 11 11 0 0 5 1 0
## 131 16 5 120 35 5 4 12 0 3 41 13 0 9 21 0 16
## 132 15 10 26 41 9 30 30 44 9 57 26 0 2 0 1 0
## 133 1 1 11 13 2 2 5 0 0 11 9 0 1 0 0 0
## 141 27 1 20 109 20 10 53 1 25 18 13 0 2 1 0 0
## 142 7 0 18 100 12 38 24 5 43 38 9 0 2 4 0 0
## 211 5 0 57 42 14 57 29 0 6 55 35 0 0 0 0 0
## 222 0 0 4 8 96 63 44 0 2 71 9 0 0 2 1 0
## 231 1 0 17 29 18 160 16 1 19 29 9 0 0 1 0 0
## 243 2 0 22 67 17 44 68 3 40 29 6 0 0 2 0 0
## 311 2 0 4 32 20 5 104 16 65 44 7 0 0 1 0 0
## 312 0 0 4 48 3 2 21 61 134 21 5 0 1 0 0 0
## 313 0 0 4 33 9 9 53 11 152 21 7 0 1 0 0 0
## 321 2 0 23 5 7 35 23 3 17 102 80 0 1 2 0 0
## 322 0 0 48 38 4 12 27 0 2 55 8 0 0 0 0 0
## 324 0 0 23 23 26 26 58 9 61 61 12 0 0 1 0 0
## 411 11 0 14 8 3 2 25 7 32 60 131 0 2 3 2 0
## 412 0 0 13 5 5 11 47 13 77 99 23 0 0 7 0 0
## 421 21 0 39 34 2 17 0 0 36 89 61 0 1 0 0 0
## 423 16 0 0 1 0 0 0 2 0 1 5 210 44 0 0 21
## 511 59 13 1 53 0 2 1 2 6 2 10 6 125 11 9 0
## 512 18 4 0 5 1 0 5 16 23 3 9 2 16 176 9 13
## 522 3 0 0 0 0 0 1 9 5 2 9 20 17 35 117 82
## 523 1 0 0 0 0 0 0 0 0 0 0 13 1 5 1 279
The confusion matrix also supports the conclusions that have been drawn from the prediction map and the classification tree. Misclassification of urban subclasses as other subclasses is quite high as the map already suggested. Interesting is the rather high misclassification of coniferous forest as mixed forest. It might be possible that the mixed forest in the study area is dominated by coniferous trees and thus closer to that forest type in vegetation and multispectral profile. Marshes and ocean related subclasses show very small misclassification numbers, which was already visible in the prediction map. Interesting is also the rather correct classification of airport and continuous uzrban fabric areas.
Creating the forest:
sampdata$classvalue=as.factor(sampdata$classvalue)
sampdata=na.exclude(sampdata)
rf=randomForest(classvalue~.,data=sampdata,method="class",ntree=200,localImp=TRUE)
rf
##
## Call:
## randomForest(formula = classvalue ~ ., data = sampdata, method = "class", ntree = 200, localImp = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 54.26%
## Confusion matrix:
## 111 112 121 122 123 124 131 132 133 141 142 211 222 231 243 311 312
## 111 150 18 17 48 20 1 4 1 0 6 8 7 1 1 1 1 0
## 112 17 60 11 23 4 8 8 8 1 30 22 18 24 8 20 5 3
## 121 20 16 47 19 44 8 10 12 3 19 6 10 8 12 6 2 5
## 122 47 8 13 124 28 2 1 6 2 16 3 2 4 1 2 6 1
## 123 33 2 22 41 141 7 5 3 2 3 4 4 2 4 0 0 0
## 124 1 7 9 3 12 179 19 5 2 3 8 11 4 2 3 0 0
## 131 0 8 5 2 14 21 140 9 2 4 2 12 10 2 2 1 0
## 132 5 10 13 12 8 9 17 92 2 4 8 5 12 7 10 10 0
## 133 2 4 1 1 7 1 9 0 3 1 0 11 5 0 0 1 0
## 141 6 20 3 26 4 9 5 1 0 95 29 6 20 10 5 16 8
## 142 2 12 3 11 0 14 3 6 0 39 52 11 10 22 17 14 26
## 211 4 21 3 4 7 18 24 8 2 1 13 55 15 35 11 8 7
## 222 0 7 0 1 0 0 4 4 1 13 4 10 184 17 8 8 0
## 231 0 10 4 3 2 12 4 5 0 7 21 17 16 115 25 10 4
## 243 1 20 7 1 0 10 3 4 0 10 15 18 8 26 83 21 11
## 311 2 2 2 6 0 2 2 2 0 15 11 6 10 2 16 75 31
## 312 0 4 0 2 0 0 1 0 0 14 12 7 0 1 5 26 153
## 313 1 7 1 3 0 0 3 2 0 11 20 1 1 6 19 36 64
## 321 0 4 5 5 1 2 9 4 1 3 5 11 19 14 7 11 1
## 322 0 2 2 1 0 8 2 6 0 5 10 4 0 2 2 14 2
## 324 0 4 2 6 0 9 4 4 0 5 5 2 16 8 12 33 11
## 411 3 0 0 12 3 2 2 1 1 3 6 8 8 2 2 3 1
## 412 0 2 2 1 0 2 4 2 0 3 1 2 3 6 4 23 6
## 421 0 0 9 1 3 3 7 10 1 1 4 3 2 7 0 0 2
## 423 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
## 511 1 2 5 4 3 1 3 0 0 3 2 1 0 3 4 0 2
## 512 8 2 0 0 4 2 7 3 0 1 1 0 0 0 4 3 1
## 522 1 0 0 0 0 0 3 1 0 0 0 0 0 0 0 0 0
## 523 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 313 321 322 324 411 412 421 423 511 512 522 523 class.error
## 111 0 2 0 1 7 1 0 0 2 3 0 0 0.5000000
## 112 4 3 2 4 6 4 3 0 2 2 0 0 0.8000000
## 121 0 10 8 5 9 2 12 0 3 3 1 0 0.8433333
## 122 0 6 7 3 12 0 4 0 2 0 0 0 0.5866667
## 123 0 1 0 1 6 0 4 0 11 3 0 1 0.5300000
## 124 0 4 1 2 6 1 11 0 0 5 2 0 0.4033333
## 131 1 7 7 7 2 7 16 3 4 11 1 0 0.5333333
## 132 2 10 9 9 10 9 22 0 2 1 2 0 0.6933333
## 133 0 3 0 1 2 0 2 0 1 1 0 0 0.9464286
## 141 8 5 7 5 5 0 0 0 3 4 0 0 0.6833333
## 142 19 4 10 8 0 2 6 0 4 5 0 0 0.8266667
## 211 4 19 3 7 18 5 8 0 0 0 0 0 0.8166667
## 222 2 14 0 12 3 3 4 0 0 1 0 0 0.3866667
## 231 7 7 4 10 2 7 6 0 0 2 0 0 0.6166667
## 243 18 7 7 15 3 3 4 0 1 4 0 0 0.7233333
## 311 47 7 15 27 3 16 0 0 0 1 0 0 0.7500000
## 312 47 1 5 10 1 5 2 0 4 0 0 0 0.4900000
## 313 72 6 4 20 4 17 0 0 0 1 0 0 0.7591973
## 321 2 97 9 11 37 18 16 1 0 7 0 0 0.6766667
## 322 6 2 108 9 1 6 2 0 0 0 0 0 0.4432990
## 324 17 3 21 96 14 17 4 0 1 5 1 0 0.6800000
## 411 3 31 5 10 150 18 8 0 8 5 4 0 0.4983278
## 412 15 10 10 28 25 131 9 0 0 11 0 0 0.5633333
## 421 0 13 1 1 13 5 203 2 7 0 2 0 0.3233333
## 423 0 0 0 0 0 0 2 265 2 0 17 12 0.1166667
## 511 0 0 0 0 1 1 10 0 242 9 2 0 0.1906355
## 512 2 4 0 2 3 17 1 1 25 200 9 0 0.3333333
## 522 1 2 0 1 1 1 3 23 6 10 236 11 0.2133333
## 523 0 0 0 0 0 0 0 19 0 1 10 270 0.1000000
The out-of-bag error is above 50% and therefore quite high. Thus the total ability of the modell to predict the classes is quite poor, however it seems to be better suited than the CART tree. The confusion matrix shows strong differences between classes and subclasses, which suggests that the model can predict some reasonably well while it fails to correctly classify others. The subclasses 112,121,132,133142,211,243,311,313 show rather high class errors, but at least the random forest actually classified all subclasses in contrast to the CART tree. The same subclasses that were reasonably well predicted by the CART tree are also the ones with the lowest class errors here (water related subclasses).
Error plot of subclasses:
layout(matrix(c(1,2),nrow=1),
width=c(4,1))
par(mar=c(5,4,4,0)) #No margin on the right side
plot(rf, log="y")
par(mar=c(5,0,4,2)) #No margin on the left side
plot(c(0,1),type="n", axes=F, xlab="", ylab="")
legend("top", colnames(rf$err.rate),col=1:4,cex=0.5,fill=1:4)
As we can see the error is decreasing for most parameter classes the higher the amount of generated trees is. However, there are some classes that show a small increase in error in the later trees as well as strong fluctuations. It is also visible that only 10 subclasses achieve an error below 0.5, which points to a rather bad prediction accuracy.
Measuring parameter importance:
importancerf <- measure_importance(rf)
importancerf
This table shows some accuracy and importance parameters of the seven factors used for prediction. NIR, SWIR1 and SWIR2 are often the first or second split in the trees, they also have the highest numbers of nodes and are often the root, which all suggests their importance for the whole model. The correlation is better visible in the following mutli-way importance plot:
plot_multi_way_importance(importancerf, size_measure = "no_of_nodes")
There is a clear grouping of NIR, SWIR1 and SWIR2 at the top left, which shows their importance. These vegetation related spectrums are the most important ones. Blue is interestingly more often a root and has a lower mean min depth but this band also has an incredibly small amount of nodes that it appears in over all trees.
Mindepth:
mindepth <- min_depth_distribution(rf)
Distribution of minimal depth:
plot_min_depth_distribution(mindepth)
This plot shows a more detailed overview of the minimal depth of the parameters. NIR, SWIR1 and SWIR2 are often responsible for splits at zero and one depth, while ultra blue, green and blue are more often at minimal depth of two.
Predicting the land cover using RF:
rfpred=predict(landsat,rf,type="response",index=1,na.rm=TRUE, progress="window")
## Loading required namespace: tcltk
Ratifying and plotting the predicted land cover classes:
rfpred=ratify(rfpred)
rat=levels(rfpred)[[1]]
rat$landcover=coverclass
levels(rfpred)=rat
levelplot(rfpred,maxpixels=1e8, col.regions = mypal, main = 'RF classification of Landsat 8 data',scales=list(draw=FALSE))
This map shows the predicted land cover classes and it validates the information gathered before. We can see, that the water related classes are mostly correctly classified. Building and settlement related areas are overestimated while agricultural areas are predicted to be smaller than they are in reality. However, the map seems to be closer to the actual CORINE land cover than the map generated with the CART classification tree. There certainly are still major deviations from the observed land covers, but the overall classification seems to have imporved. A major factor for this is that RF includes all classes, and while it still underestimates the agricultural areas, it still manages to better portray reality. Due to the inclusion of all classes there are also way less misclassifications of other subclasses as urban ones. The fruit tree classifications in the northeast are very similar to the observed ones. There are still too many of these areas in the center and the west of the study area, but compared to the CART map, these are less distributed. The water related areas are even more close to reality compared to CART. In conclusion, the RF method seems to generate better classifications for the land cover classes of the study area than CART. Both methods still have considerably high error rates and there are certain subclasses that are for the most part wrongly predicted with either method. To achieve better results in the future, it would probably make sense to incorperate different multispectral data from different times of the year. This together with a modification of the RF method could maybe achieve better results for the agricultural classes, that show very different multispectral profiles based on a multitude of possible factors such as farming systems and management, crop rotation,….
Adam, E., Mutanga, O., & Rugege, D. (2010). Multispectral and hyperspectral remote sensing for identification and mapping of wetland vegetation: a review. Wetlands Ecology and Management, 18(3), 281-296. https://doi.org/10.1007/s11273-009-9169-z
Doron, M., Bélanger, S., Doxaran, D., & Babin, M. (2011). Spectral variations in the near-infrared ocean reflectance. Remote Sensing of Environment, 115(7), 1617-1631. https://doi.org/10.1016/j.rse.2011.01.015
Ferreira, M. P., Zortea, M., Zanotta, D. C., Shimabukuro, Y. E., & de Souza Filho, C. R. (2016). Mapping tree species in tropical seasonal semi-deciduous forests with hyperspectral and multispectral data. Remote Sensing of Environment, 179, 66-78. https://doi.org/10.1016/j.rse.2016.03.021
Gislason, P. O., Benediktsson, J. A., & Sveinsson, J. R. (2006). Random Forests for land cover classification. Pattern Recognition Letters, 27(4), 294-300. https://doi.org/10.1016/j.patrec.2005.08.011
Gitelson, A. A., & Merzlyak, M. N. (2010). Remote estimation of chlorophyll content in higher plant leaves. International Journal of Remote Sensing. https://doi.org/10.1080/014311697217558
Matikainen, L., Karila, K., Hyyppä, J., Litkey, P., Puttonen, E., & Ahokas, E. (2017). Object-based analysis of multispectral airborne laser scanner data for land cover classification and map updating. ISPRS Journal of Photogrammetry and Remote Sensing, 128, 298-313. https://doi.org/10.1016/j.isprsjprs.2017.04.005
Mélin, F., & Sclep, G. (2015). Band shifting for ocean color multi-spectral reflectance data. Optics Express, 23(3), 2262. https://doi.org/10.1364/OE.23.002262
Moshou, D., Bravo, C., Oberti, R., West, J., Bodria, L., McCartney, A., & Ramon, H. (2005). Plant disease detection based on data fusion of hyper-spectral and multi-spectral fluorescence imaging using Kohonen maps. Real-Time Imaging, 11(2), 75-83. https://doi.org/10.1016/j.rti.2005.03.003
Myneni, R. B., Hall, F. G., Sellers, P. J., & Marshak, A. L. (1995). The interpretation of spectral vegetation indexes. IEEE Transactions on Geoscience and Remote Sensing, 33(2), 481-486. https://doi.org/10.1109/36.377948
Rocchini, D., Andreini Butini, S., & Chiarucci, A. (2005). Maximizing plant species inventory efficiency by means of remotely sensed spectral distances: Maximizing plant species inventorying by remote sensing. Global Ecology and Biogeography, 14(5), 431-437. https://doi.org/10.1111/j.1466-822x.2005.00169.x
Rodriguez-Galiano, V. F., Ghimire, B., Rogan, J., Chica-Olmo, M., & Rigol-Sanchez, J. P. (2012). An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS Journal of Photogrammetry and Remote Sensing, 67, 93-104. https://doi.org/10.1016/j.isprsjprs.2011.11.002
Silván-Cárdenas, J., Corona, N., Mauricio Galeana Pizaña, J., Núñez, J. M., & Madrigal, J. (2015). Geospatial Technologies to Support Coniferous Forests Research and Conservation efforts in Mexico (pp. 67-123).
https://rspatial.org/rs/5-supclassification.html
http://bleutner.github.io/RStoolbox/rstbx-docu/superClass.html