Land-Cover Classification in R

Overview

We’ve spent a lot of time so far in class working with land-cover data derived from satellite imagery, but now we’re going to see how we can make those layers ourselves! While there are a bunch of global land-cover datasets at relatively high resolution, they don’t always perform well at local scales, so if you’re working on a specific area and accuracy is at a premium, it is (for the moment, at least) still often better to “roll your own” classified land-cover rasters.

With the rapid rollout of machine-learning techniques for prediction and classification, researchers have been applying them to land-cover classification, with impressive results. Machine-learning algorithms like random forests can often achieve accuracy rates above 90%, providing valuable data for landscape mapping and monitoring.

In this tutorial, we’ll have a look at functions in the RStoolbox package that make it relatively straightforward to fit and assess supervised and unsupervised land-cover classifications in R.

Packages

We’ll be using our familiar tidyverse, sf, and terra packages, as well as RStoolbox, which provides our functions for image classification. Let’s get those going:

require(tidyverse)
require(terra)
require(sf)
require(RStoolbox)

The Data

Land-cover classification requires, at a minimum, some satellite imagery as an input. If we are using supervised classification, where we are trying to group pixels into designated land-cover classes, we also need to have training data - locations of known land cover.

We have a few files we will be using (I have more detail on their sources and characteristics below):

  • Study_Area.kml contains the boundaries of the area whose land cover we are going to be classifying
  • Study_Area_Pasture.kml identifies areas within the study area identified as pasture
  • HI_Ag_Survey_2020.gpkg contains data from the state of Hawaii’s 2020 survey of agriculture, which identifies types of production at the field level
  • HI_raster_data.tif contains data from Sentinel-2 and the United States’ National Elevation Dataset, which some students and I compiled using Google Earth Engine

These data are available in this zip file.

The coolest data we have here come from the Hawaii Department of Agriculture, which conducted a complete geospatial survey of crops grown in the state in 2020. We couldn’t ask for better training data to play with. To keep data processing manageable, we’ll be focusing on a section of the western shore of the island of Hawaii proper (i.e. the Big Island). That’s where Kona coffee and a variety of other products are grown, so it gives us an interesting test case.

Here’s the citation: Perroy, R., & Collier, E. (2021). 2020 update to the Hawai’i statewide agricultural land use baseline. Honolulu, HI, US: Hawaii State Department of Agriculture. https://planning.hawaii.gov/gis/download-gis-data-expanded/

In addition, we will be using data that some students and I have compiled using Google Earth Engine. We have a single raster created by combining Sentinel-2 data from the main coffee growing and harvesting seasons with data from the United States’ National Elevation Dataset. Because our goal in this tutorial is to see if we can successfully identify different types of crops, I’ve masked the image to only include areas identified as croplands in the agricultural survey.

Because this raster has quite a few bands with different information, I’d like to bring it in before we move on, just so we can talk through what’s there a bit:

setwd("G:/My Drive/My Classes/GIS for Global Environment/Labs/Land Cover Classification")

hi_rast <- rast("HI_raster_data.tif")

hi_rast
class       : SpatRaster 
dimensions  : 1686, 1014, 13  (nrow, ncol, nlyr)
resolution  : 0.0002694946, 0.0002694946  (x, y)
extent      : -156.0042, -155.7309, 19.25188, 19.70625  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : HI_raster_data.tif 
names       : HI_ra~ata_1, HI_ra~ata_2, HI_ra~ata_3, HI_ra~ata_4, HI_ra~ata_5, HI_ra~ata_6, ... 
min values  :      0.0000,      0.0000,     0.00000,          ? ,          ? ,          ? , ... 
max values  :      0.2899,      0.2735,     0.50565,          ? ,          ? ,          ? , ... 
plot(hi_rast)

Okay, so there are 13 bands in this thing! That’s a lot to deal with, so I’m going to help by giving you meaningful names for each band, so you can keep them all straight:

names(hi_rast) <- c("green_grow", "red_grow", "nir_grow", "NDVI", "NDWI",
                    "green_harvest", "red_harvest", "nir_harvest",
                    "NDVI_harvest", "NDWI_harvest", "elevation",
                    "slope", "aspect")

plot(hi_rast)

Most of these bands should be self-explanatory, but to be clear, green_, red_, and nir_ are Sentinel-2’s green, red, and near-infrared bands during the growing and harvest seasons, as labelled; the NDVI and NDWI bands without a _harvest suffix are the growing-season versions.
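
For reference, the NDVI and NDWI layers were computed upstream in Google Earth Engine, but both are simple band ratios: NDVI = (NIR - Red)/(NIR + Red), and NDWI = (Green - NIR)/(Green + NIR) (that’s McFeeters’ green/NIR form, which I’m assuming is how these layers were built). If you ever need to recompute one, it’s a single line of raster algebra:

#Recompute growing-season NDVI from the reflectance bands
ndvi_check <- (hi_rast[["nir_grow"]] - hi_rast[["red_grow"]]) /
  (hi_rast[["nir_grow"]] + hi_rast[["red_grow"]])

#And NDWI, assuming the green/NIR formulation
ndwi_check <- (hi_rast[["green_grow"]] - hi_rast[["nir_grow"]]) /
  (hi_rast[["green_grow"]] + hi_rast[["nir_grow"]])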

Unsupervised classification in RStoolbox

Unsupervised classification can be helpful as a quick way to get a sense of the broad characteristics of heterogeneity in a landscape. It works by using one of several possible algorithms to group together pixels that have similar values across the raster bands.

RStoolbox uses the k-means clustering algorithm, which is commonly used for unsupervised image classification. The method starts by randomly assigning pixels to the number of classes you set, computes each class’s mean values across the raster bands, and then reassigns every pixel to the class whose mean is closest to its own values. It keeps iterating until the assignments stabilize, minimizing variation within classes while maximizing separation between them.
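
To make the mechanics concrete, here is a minimal sketch of the same idea using base R’s kmeans() directly on the pixel values. This is purely illustrative (and slow on the full image) - unsuperClass(), introduced just below, samples pixels and handles NA values far more gracefully:

#Pull out the pixel-by-band matrix (one row per pixel)
vals <- values(hi_rast)
ok <- complete.cases(vals) #Flag pixels with no missing bands

#Cluster the complete pixels into four groups
km <- kmeans(vals[ok, ], centers = 4)

#Push the cluster labels back into a one-layer raster
cls <- rep(NA_integer_, nrow(vals))
cls[ok] <- km$cluster
class_rast <- setValues(hi_rast[[1]], cls)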

unsuperClass() performs this analysis in RStoolbox. All it needs is a SpatRaster and the number of classes you want.

So how do we determine the number of classes? You might look at the landscape in question and count up distinct-looking areas to get an initial guess, but in any case you will probably want to try a few different numbers of classes to see how things change.

We have an advantage in selecting classes because we already have some training data. Let’s have a look at the section of Hawaii’s agricultural survey we’re working with and see how many different crop types we have:

hi_ag_suv <- vect("HI_Ag_Survey_2020.gpkg")
Warning: [vect] Reading layer: aglanduse_2020
Other layers: buffered
unique(hi_ag_suv$Crops_2020)
[1] "Coffee"                        "Macadamia Nuts"               
[3] "Pasture"                       "Tropical Fruits"              
[5] "Diversified Crop"              "Banana"                       
[7] "Flowers / Foliage / Landscape"

Okay, so we have seven different crop designations, though a couple (I’m looking at you, “Diversified Crop” and “Flowers / Foliage / Landscape”) seem a bit vague.

Also, I know that for at least some of these crops, there are actually very few examples in the area, so we might want to simplify. Let’s have a look at the number of plots of each crop. We can use the table() function to get a quick count of the number of each unique value in a column:

table(hi_ag_suv$Crops_2020)

                       Banana                        Coffee 
                            1                           583 
             Diversified Crop Flowers / Foliage / Landscape 
                           11                             7 
               Macadamia Nuts                       Pasture 
                          260                            34 
              Tropical Fruits 
                           27 

So the vast majority of the plots are Coffee or Macadamia Nuts. Let’s look in terms of area:

plot(hi_ag_suv, "Crops_2020") #This is how we can plot a specific
#variable from a SpatVector object
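
If you’d rather put numbers on that than eyeball the map, here’s a hedged sketch that totals the area under each crop, assuming terra’s expanse() works as documented for polygon SpatVectors (check ?expanse):

#Area of each survey polygon, in hectares
plot_areas <- expanse(hi_ag_suv, unit = "ha")

#Total area by crop type, largest first
sort(tapply(plot_areas, hi_ag_suv$Crops_2020, sum), decreasing = TRUE)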

So we can see that while there are only a few pasture areas, they are huge. Given this, maybe we should just try to distinguish four classes: coffee, macadamia nuts, pasture, and other:

hi_crop_unsuper_4 <- unsuperClass(hi_rast, nClasses = 4)

hi_crop_unsuper_4
unsuperClass results

*************** Model ******************
$model
K-means clustering with 4 clusters of sizes 7349, 866, 1007, 778

Cluster centroids:
   green_grow    red_grow    nir_grow      NDVI        NDWI green_harvest
1 0.001954586 0.001656198 0.008931168 0.0229524 -0.02136517   0.001904769
2 0.069934584 0.070951213 0.208189203 0.4772815 -0.48166124   0.069647901
3 0.078404816 0.071569514 0.243888580 0.5399731 -0.50351445   0.081706510
4 0.075838368 0.062012661 0.290768059 0.6326360 -0.56666419   0.063178342
  red_harvest nir_harvest NDVI_harvest NDWI_harvest   elevation     slope
1 0.001475099 0.009038896   0.02384333  -0.02167868    4.755772 0.2515509
2 0.065919873 0.216021127   0.50904761  -0.49228816 1764.973801 7.1813809
3 0.072424634 0.257192958   0.54492016  -0.49738878 1226.735459 6.7185285
4 0.045622622 0.296514075   0.70881000  -0.62429951  626.195283 8.8437120
      aspect
1   8.835192
2 261.132278
3 254.538662
4 258.211314

Within cluster sum of squares by cluster:
[1] 22185970 30653183 25894988 28776934

*************** Map ******************
$map
class       : SpatRaster 
dimensions  : 1686, 1014, 1  (nrow, ncol, nlyr)
resolution  : 0.0002694946, 0.0002694946  (x, y)
extent      : -156.0042, -155.7309, 19.25188, 19.70625  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        : class_unsupervised 
min value   :                  1 
max value   :                  4 

The output is a list with three parts: call has information about the function you originally ran, model has information about the k-means clustering results, and map has the resulting clusters in the form of a SpatRaster. Let’s have a look at the map and see how we did:

plot(hi_crop_unsuper_4$map, type = "classes")

Hmmm . . . well, clearly, that’s not so good, but it’s pretty clear what is dominating the model fit. Have a look at the elevation:

plot(hi_rast[["elevation"]])

So it looks to me like all we’ve really done is group our pixels by elevation. This is a good lesson for us: more data are not always better, especially since k-means weights all of these variables equally. Let’s see what happens if we drop the topographic variables:

hi_crop_unsuper_4 <- unsuperClass(hi_rast[[1:10]], nClasses = 4)

plot(hi_crop_unsuper_4$map, type = "classes")

Not quite what we were hoping for here, either. This is the problem with unsupervised classification - it’s great at finding mathematically meaningful differences between groups, but those differences may or may not be meaningful to a human viewer. We could save out this raster and open it up in QGIS on top of a Google Satellite image to see if we can get a sense of what the algorithm is picking up.
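
One more strategy worth trying before giving up on k-means: standardize each band so that no single variable (like elevation) dominates the distance calculations. Here’s a sketch assuming terra’s scale() method for SpatRasters; unsuperClass() also has a norm argument that is documented to do this internally, but verify both against ?scale and ?unsuperClass in your installed versions:

#Center and scale each band to mean 0, sd 1, then re-cluster
hi_rast_std <- scale(hi_rast)
hi_crop_unsuper_std <- unsuperClass(hi_rast_std, nClasses = 4)
plot(hi_crop_unsuper_std$map, type = "classes")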

If you want to have a look in QGIS, here’s how we would export this SpatRaster. I’ve also enclosed some KML files showing the study area and pasture lands in case you want to check this out in Google Earth:

writeRaster(hi_crop_unsuper_4$map, "hi_crop_k_means_4.tif", overwrite=TRUE)

Task 1: Compare the unsupervised classification results with what you see from a satellite image. Do you see any obvious features that the algorithm might be picking out? Based on what you see, can you come up with any possible strategies for getting results that are closer to what we’re trying to achieve?

Task 2: Run another k-means clustering on the raster using bands and a number of classes of your choosing. Then save out the image. Have a look at your results in QGIS against a Google Satellite tile. Comment on what you see. Is there anything you can identify the algorithm picking up?

Supervised classification in RStoolbox

Okay, now that that’s out of the way, we can move on to the main event. The fact is that most of the time when we conduct land-cover classification, we’re looking for something. That means we already have some idea of the classes we would expect to see, and that’s where supervised classification comes in.

While supervised classification algorithms are very technically sophisticated, RStoolbox handles pretty much all of the complicated stuff “under the hood” for you. Everything you need runs from a single function, superClass(), which requires the following input:

  • The SpatRaster to be classified
  • The control locations you will be using to train the data
  • The name of the column in the training data that has the classes
  • The number of points to sample per land-cover class, if the training data are polygons
  • The type of classification algorithm to use
  • The proportion of control points to use for training the model (the remainder is held out to build the confusion matrix)
  • Any additional inputs (tuning parameters) for the classification algorithm

That’s quite a few things to specify, but the great thing is that the function takes it from there and does everything else for you!

Now, we’ve already seen that we probably don’t want to try to classify all the crop classes in this section of the island, because several (looking at you, bananas!) are pretty rare.

So let’s start by recoding the classification from the agricultural survey into just coffee, macadamia, pasture, and other. Here’s a quick way to do that:

#First, you tell R which column you're using
crop_reclass <- hi_ag_suv$Crops_2020

#The %in% operator returns TRUE when an entry in the vector on the
#left is present in the vector on the right, so we can use it to
#subset crop_reclass to the rarer crops and reassign them all
#as "Other":
crop_reclass[crop_reclass %in% c("Tropical Fruits",
                                 "Diversified Crop", "Banana",
                                 "Flowers / Foliage / Landscape")] <-
  "Other"

unique(crop_reclass)
[1] "Coffee"         "Macadamia Nuts" "Pasture"        "Other"         
#Finally, we add this back as a new variable in the agricultural survey:
hi_ag_suv$Crops_2020_Rec <- crop_reclass
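
As an aside, the whole recode can be collapsed into one step with dplyr’s if_else() - a sketch that should produce the same column as the approach above:

#Keep the three crops we care about; lump everything else as "Other"
hi_ag_suv$Crops_2020_Rec <- if_else(
  hi_ag_suv$Crops_2020 %in% c("Coffee", "Macadamia Nuts", "Pasture"),
  hi_ag_suv$Crops_2020,
  "Other")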

Okay, now we’re ready to use superClass():

crop_sclass <- superClass(img = hi_rast,
                          #Training data must be in sf format:
                          trainData = st_as_sf(hi_ag_suv),
                          responseCol = "Crops_2020_Rec",
                          nSamples = 500, #Training points per class
                          nSamplesV = 500, #Validation points per class
                          trainPartition = 0.7, #Share of points used
                          #for training the model
                          model = "rf") #Using random forests
Error: traingData must POINTs or a POLYGONs
crop_sclass
Error in eval(expr, envir, enclos): object 'crop_sclass' not found

Well, shoot! It looks like we have some additional vector data maintenance to do. First, let’s have a look at what the issue is. After examining the code underlying this function, it appears that our problem is the following:

hi_ag_suv %>% 
  st_as_sf() %>%
  st_geometry_type() %>%
  table()
.
          GEOMETRY              POINT         LINESTRING            POLYGON 
                 0                  0                  0                916 
        MULTIPOINT    MULTILINESTRING       MULTIPOLYGON GEOMETRYCOLLECTION 
                 0                  0                  7                  0 
    CIRCULARSTRING      COMPOUNDCURVE       CURVEPOLYGON         MULTICURVE 
                 0                  0                  0                  0 
      MULTISURFACE              CURVE            SURFACE  POLYHEDRALSURFACE 
                 0                  0                  0                  0 
               TIN           TRIANGLE 
                 0                  0 

See how there are seven objects in there of the “MULTIPOLYGON” type? That’s our first problem. superClass() wants only individual polygons, so we can use the st_cast() function to force the geometry into a specific type:

hi_ag_suv_sf <- hi_ag_suv %>%
  st_as_sf() %>%
  st_cast("POLYGON")
Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

So the warning is telling us that we’re only getting the first polygon from those seven multipolygon objects, but since this is just a training and test dataset, we’re probably okay. Let’s see how that worked out:

crop_sclass <- superClass(img = hi_rast,
                          trainData = hi_ag_suv_sf,
                          responseCol = "Crops_2020_Rec",
                          nSamples = 500, #Training points per class
                          nSamplesV = 500, #Validation points per class
                          trainPartition = 0.7, #Share of points used
                          #for training the model
                          model = "rf") #Using random forests
Error in wk_handle.wk_wkb(wkb, s2_geography_writer(oriented = oriented, : Loop 0 is not valid: Edge 686 has duplicate vertex with edge 692
crop_sclass
Error in eval(expr, envir, enclos): object 'crop_sclass' not found

A brief excursion through polygon cleaning in sf

Shoot! It’s still not happy. We’ve come across a common problem in vector data, especially polygon data. What this error message is telling us is that we have at least one invalid polygon in the hi_ag_suv object.

There can be lots of reasons for a polygon to be invalid, but two of the most common are rings that intersect themselves and duplicated vertices - and the error message above points straight at a duplicate vertex. Whatever the cause, many geoprocessing operations will fail on invalid polygons.

You can identify valid and invalid polygons with the st_is_valid() function:

st_is_valid(hi_ag_suv_sf)
  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 ...
[289]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
 ...
[337]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 ...
[913]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
(output truncated; every entry not shown is TRUE)

Whew! Just a couple invalid polygons! Now, we would probably be safe to just get rid of them, but this is one of those teachable moments, so I want us to have a look at one of these to see what we’re dealing with:
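
Here’s a quick sketch for finding and inspecting the offenders; the indices come from the output above, st_is_valid() has a reason argument that reports what’s wrong with each geometry, and I’m assuming the first one, polygon 299, is the one pictured below:

which(!st_is_valid(hi_ag_suv_sf)) #Row indices of the invalid polygons
[1] 299 342

#Ask sf why they're invalid
st_is_valid(hi_ag_suv_sf[c(299, 342), ], reason = TRUE)

#Plot one of them to see the problem
plot(st_geometry(hi_ag_suv_sf[299, ]))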

See that little hole in the middle of that polygon? I bet that’s our problem.

sf has a built-in function, st_make_valid(), that tries to automatically repair invalid polygons. Let’s see if it can help us (the sum() divided by the row count lets us quickly see what proportion of the polygons in the dataset are valid). I’m also combining this with another trick - constructing a 0-width buffer, which forces the polygons to be redrawn:

hi_ag_suv_sf <- st_make_valid(hi_ag_suv_sf) %>%
  st_buffer(dist=0)

hi_ag_suv_sf %>%
  st_is_valid() %>%
  sum()/nrow(hi_ag_suv_sf)
[1] 1

Yay! st_make_valid() seems to have done the trick - a result of 1 means every polygon is now valid. Now back to your regularly scheduled programming:

Supervised classification in RStoolbox (again)

Okay, now let’s see if that did the trick (be patient - superClass() is doing a ton of hard work for us):

#For complicated reasons, using spherical distance approximations is
#causing errors in superClass(), so we're running the following line to
#turn those features off. It does not appear to affect our results.
sf_use_s2(FALSE)
Spherical geometry (s2) switched off
crop_sclass <- superClass(img = hi_rast,
                          trainData = hi_ag_suv_sf,
                          responseCol = "Crops_2020_Rec",
                          nSamples = 500, #Training points per class
                          nSamplesV = 500, #Validation points per class
                          trainPartition = 0.7, #Share of points used
                          #for training the model
                          model = "rf") #Using random forests
although coordinates are longitude/latitude, st_intersects assumes that they
are planar
Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs, endCapStyle =
endCapStyle, : st_buffer does not correctly buffer longitude/latitude data
dist is assumed to be in decimal degrees (arc_degrees).
although coordinates are longitude/latitude, st_union assumes that they are
planar
although coordinates are longitude/latitude, st_difference assumes that they
are planar
Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs, endCapStyle =
endCapStyle, : st_buffer does not correctly buffer longitude/latitude data
dist is assumed to be in decimal degrees (arc_degrees).
Warning: package 'caret' was built under R version 4.3.3
Loading required package: lattice

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs, endCapStyle =
endCapStyle, : st_buffer does not correctly buffer longitude/latitude data
dist is assumed to be in decimal degrees (arc_degrees).
crop_sclass
superClass results
************ Validation **************
$validation
Confusion Matrix and Statistics

                Reference
Prediction       Coffee Macadamia Nuts Other Pasture
  Coffee           1711            470    72     103
  Macadamia Nuts    291            547    16      92
  Other              13              2     0      48
  Pasture            59             29     9     345

Overall Statistics
                                          
               Accuracy : 0.6837          
                 95% CI : (0.6687, 0.6985)
    No Information Rate : 0.5448          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.451           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: Coffee Class: Macadamia Nuts Class: Other
Sensitivity                 0.8250                0.5219      0.00000
Specificity                 0.6278                0.8554      0.98302
Pos Pred Value              0.7262                0.5782      0.00000
Neg Pred Value              0.7498                0.8249      0.97409
Prevalence                  0.5448                0.2753      0.02548
Detection Rate              0.4494                0.1437      0.00000
Detection Prevalence        0.6189                0.2485      0.01655
Balanced Accuracy           0.7264                0.6887      0.49151
                     Class: Pasture
Sensitivity                 0.58673
Specificity                 0.96987
Pos Pred Value              0.78054
Neg Pred Value              0.92779
Prevalence                  0.15445
Detection Rate              0.09062
Detection Prevalence        0.11610
Balanced Accuracy           0.77830

*************** Map ******************
$map
class       : SpatRaster 
dimensions  : 1686, 1014, 1  (nrow, ncol, nlyr)
resolution  : 0.0002694946, 0.0002694946  (x, y)
extent      : -156.0042, -155.7309, 19.25188, 19.70625  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        : Crops_2020_Rec_supervised 
min value   :                         1 
max value   :                         4 

Okay, there’s a lot to digest here, because that one function basically did all the work for you. Let’s just review everything it did:

  • Extract training and validation sample points from the input polygons
  • Extract all the raster bands at each training and validation point
  • Train a random forest model on the sampled points
  • Predict class membership for every pixel in the image
  • Construct a confusion matrix comparing the predicted and actual class memberships for the validation points
  • Compute a large number of accuracy assessments for the classes and the overall fit.

Like the output for unsupervised classification, superClass() gives us two important components. The $validation part gives us our confusion matrix and a number of statistics assessing how well the algorithm performed, while the $map part holds the predicted land-cover classes as a SpatRaster.
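
To eyeball the predicted map itself, the same plotting and export pattern we used for the unsupervised results applies (the output file name here is just a suggestion):

plot(crop_sclass$map, type = "classes")

writeRaster(crop_sclass$map, "hi_crop_rf_supervised.tif", overwrite = TRUE)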

Task 3: Examine the validation section of the superClass() output. You may need to look up some of the different accuracy assessment measures used, as the package is using different language than what we’ve used in the class. How did we do? Which land-cover types were we best at identifying? Which ones did we identify poorly? What do you think caused our errors? Can you think of other spatial data that might help us improve our predictions?

Task 4: Fit two new classification models. First, use the same number of sample points but with Gradient Boosting Machine classification (model = "gbm") in the superClass() function. Second, use random forests again but with only 200 training points per class. Assess the changes in accuracy as you make these changes. Do you do better, worse, or about the same? How might you explain your results?

Congratulations! You’re done!