Supervised machine learning with Rasters

In yesterday’s linear regression tutorial, we used a standard statistical linear model to estimate the parameters that best fit the data and made some predictions. Today, we build another kind of predictive model, in which we first train a model on training data and then use the trained model for prediction. This approach is called supervised machine learning.

Supervised machine learning is a type of machine learning in which an algorithm uses labeled training data to learn a function that can be applied to new, unseen data. The training data include both the inputs and the correct outputs. The algorithm analyzes the training data and produces an inferred function, which can then be used to map new, unknown examples. This inferred function aims to predict the output values/labels for the new data based on what it learned from the training data.
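
As a minimal sketch of this train-then-predict pattern, here is a toy example using R’s built-in iris data and the rpart package (which we also use later in this tutorial); the row selections are arbitrary:

library(rpart)
# labeled training data: inputs (flower measurements) plus the correct output (Species)
train <- iris[c(1:40, 51:90, 101:140), ]
# "new" examples with the label column dropped, to mimic unknown data
unknown <- iris[c(41:50, 91:100, 141:150), -5]
# learn an inferred function from the training data
fit <- rpart(Species ~ ., data = train, method = "class")
# apply the function to predict labels for the new examples
predict(fit, unknown, type = "class")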

For today’s tutorial, we train a machine learning model on land use land cover data in a raster format. We will talk about the machine learning model we use later in the tutorial.

We will take the following steps:

  • prepare sample (training) data that combines land cover classes with Landsat reflectance values
  • train a classifier on the sample data
  • use the trained model to classify the entire Landsat image
  • evaluate the model with k-fold cross-validation

We use terra to handle raster data.

We make a list of raster data filenames using paste0. We can then read in and combine them using rast. We assign names to the raster layers.

Notice the names of the layers. These are the bands of the image collected by the Landsat satellite. Blue, Green, and Red make up the visible spectrum, while Near-InfraRed (NIR) and ShortWave-InfraRed (SWIR) lie beyond it. These bands are helpful in detecting surface reflectance, rocks, soil moisture, etc. Our ultimate goal is, given the values for these bands, to predict the most likely land use/land cover of a particular cell, e.g. water, cropland, built area, etc.

library(terra)
terra 1.7.29
# We read the 6 bands from the Landsat image we previously used
raslist <- paste0('./rs/LC08_044034_20170614_B', 2:7, ".tif")
landsat <- rast(raslist)
names(landsat) <- c('blue', 'green', 'red', 'NIR', 'SWIR1', 'SWIR2')

The polygons we use for land cover information are saved as an rds file. The rds format is specific to R. We can save any R object as rds on our hard disk; it saves space and loads faster than most other formats when imported back for use.
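
For example, any R object can be round-tripped through an rds file. A small sketch with a throwaway data frame (tempfile() is used here just to avoid cluttering the working directory):

x <- data.frame(id = 1:3, class = c("water", "built", "open"))
f <- tempfile(fileext = ".rds")
saveRDS(x, f)    # write the object to disk (compressed by default)
y <- readRDS(f)  # read it back
identical(x, y)  # TRUE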

# load polygons with land use land cover information
samp <- readRDS("./rs/lcsamples.rds")

# check the distribution of the polygons
plot(samp)
text(samp, samp$class)

We will randomly sample 200 points.

set.seed(1)
# generate point samples from the polygons
ptsamp <- spatSample(samp, 200, method="random")
plot(ptsamp, "class")

We can also create the training data from other rasters. For example, the National Land Cover Database 2011 (NLCD 2011) is a land cover product for the United States. NLCD is a 30-m Landsat-based land cover database spanning 4 epochs (1992, 2001, 2006 and 2011). NLCD 2011 is based primarily on a decision-tree classification of circa 2011 Landsat data.

We import the NLCD raster. Since the file contains data for both 2001 and 2011, we select only the 2011 layer. We assign the class names as categories (levels) and choose appropriate colors to represent those classes.

nlcd <- rast('./rs/nlcd-L1.tif')
names(nlcd) <- c("nlcd2001", "nlcd2011")
nlcd2011 <- nlcd[[2]]
# assign class names as categories (levels)
nlcdclass <- c("Water", "Developed", "Barren", "Forest", "Shrubland", "Herbaceous", "Cultivated", "Wetlands")
classdf <- data.frame(value = c(1,2,3,4,5,7,8,9), names = nlcdclass)
levels(nlcd2011) <- classdf
# colors (as hexadecimal codes)
classcolor <- c("#5475A8", "#B50000", "#D2CDC0", "#38814E", "#AF963C", "#D1D182", "#FBF65D", "#C8E6F8")
# plot the locations on top of the original nlcd raster
plot(nlcd2011, col=classcolor)
ptlonlat <- project(ptsamp, crs(nlcd2011))
points(ptlonlat)

We sample 200 cells from the NLCD raster (here with a regular sampling scheme). Note the distribution of the different classes.

# Sampling
samp2011 <- spatSample(nlcd2011, size = 200, method="regular")
# Number of samples in each class
table(samp2011[,1])

     Water  Developed     Barren     Forest  Shrubland Herbaceous Cultivated 
        19         29          1          6          3         35        107 
  Wetlands 
        16 

The reflectance values extracted from the Landsat image are the independent variables or, in machine learning parlance, the predictors.

# extract the reflectance values for the locations
df <- extract(landsat, ptsamp, ID=FALSE)
# Quick check for the extracted values
head(df)
        blue      green        red        NIR       SWIR1       SWIR2
1 0.09797934 0.08290726 0.05811966 0.01958285 0.006245691 0.003816809
2 0.11986095 0.09904197 0.07978442 0.05723051 0.040965684 0.033939276
3 0.12838373 0.13727517 0.18810819 0.31336907 0.315103978 0.180648044
4 0.15295446 0.18123358 0.25232175 0.40375817 0.401546150 0.237791821
5 0.13992092 0.14987499 0.19207680 0.28279120 0.356763631 0.232044920
6 0.12341753 0.10114556 0.08288557 0.06377982 0.046972826 0.038883783

We finally compile a dataset with the Landsat predictors and the land cover classes of the point samples to train the machine learning model.

sampdata <- data.frame(class = ptsamp$class, df)

Train the classifier

The machine learning model we are using is called CART.

CART (Classification and Regression Trees) is a popular machine learning algorithm that is used for both classification and regression tasks. The model works by splitting the data into subsets based on the values of the input features. Each split is determined using a measure of homogeneity, such as Gini impurity or information gain, to create groups that are as homogeneous as possible.
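
To make the homogeneity measure concrete, here is a minimal sketch of Gini impurity for a group of class labels (the gini helper below is our own illustration, not part of any package): 0 means the group is perfectly pure, and larger values mean more mixing.

# Gini impurity: 1 - sum(p_i^2), where p_i is the proportion of class i in the group
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
gini(c("water", "water", "water", "water"))  # 0: perfectly homogeneous
gini(c("water", "built", "crop", "open"))    # 0.75: maximally mixed (4 classes)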

The result of these splits is a tree structure, where each node represents a decision (i.e., a split on a particular feature) and each leaf node represents a prediction. In a classification tree, the prediction at each leaf node is the most common class (or label) of the instances that reach that leaf. In a regression tree, the prediction is the mean target value of the instances that reach that leaf.

CART models are simple to understand and interpret, which makes them a popular choice for machine learning tasks.

library(rpart)
# Train the model
cartmodel <- rpart(as.factor(class)~., data = sampdata, method = 'class', minsplit = 5)
# print trained model
print(cartmodel)
n= 200 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 200 122 water (0.055 0.12 0.19 0.24 0.39)  
   2) NIR>=0.08455543 122  73 open (0.09 0.2 0.31 0.4 0)  
     4) SWIR1< 0.2957055 70  35 fallow (0.14 0.34 0.5 0.014 0)  
       8) blue< 0.09913956 24   0 cropland (0 1 0 0 0) *
       9) blue>=0.09913956 46  11 fallow (0.22 0 0.76 0.022 0)  
        18) blue>=0.1312463 10   0 built (1 0 0 0 0) *
        19) blue< 0.1312463 36   1 fallow (0 0 0.97 0.028 0) *
     5) SWIR1>=0.2957055 52   4 open (0.019 0 0.058 0.92 0)  
      10) blue< 0.1299668 13   3 open (0 0 0.23 0.77 0)  
        20) SWIR2>=0.2063899 3   0 fallow (0 0 1 0 0) *
        21) SWIR2< 0.2063899 10   0 open (0 0 0 1 0) *
      11) blue>=0.1299668 39   1 open (0.026 0 0 0.97 0) *
   3) NIR< 0.08455543 78   0 water (0 0 0 0 1) *
# Plot the trained classification tree
plot(cartmodel, uniform=TRUE, main="Classification Tree")
text(cartmodel, cex = 1)

Look at the decision tree! That’s the pattern that the machine learning model has learned.

Exercise
  • How does the model make the decision that the area represented by a raster unit is built?

Now we can use the model to predict land cover classes for all raster units in the Landsat data.

classified <- predict(landsat, cartmodel, na.rm = TRUE)
classified
class       : SpatRaster 
dimensions  : 1245, 1497, 5  (nrow, ncol, nlyr)
resolution  : 30, 30  (x, y)
extent      : 594090, 639000, 4190190, 4227540  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 10N (EPSG:32610) 
source(s)   : memory
names       : built, cropland, fallow, open, water 
min values  :     0,        0,      0,    0,     0 
max values  :     1,        1,      1,    1,     1 

We plot the probability values (0 to 1) of each class for each cell.

plot(classified)

We can assign to each cell the class that has the maximum probability value in that cell.

lulc <- which.max(classified)
lulc
class       : SpatRaster 
dimensions  : 1245, 1497, 1  (nrow, ncol, nlyr)
resolution  : 30, 30  (x, y)
extent      : 594090, 639000, 4190190, 4227540  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 10N (EPSG:32610) 
source(s)   : memory
name        : which.max 
min value   :         1 
max value   :         5 

We now create a plot of the predicted values for the Landsat image.

cls <- c("built","cropland","fallow","open","water")
df <- data.frame(id = 1:5, class=cls)
levels(lulc) <- df
lulc
class       : SpatRaster 
dimensions  : 1245, 1497, 1  (nrow, ncol, nlyr)
resolution  : 30, 30  (x, y)
extent      : 594090, 639000, 4190190, 4227540  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 10N (EPSG:32610) 
source(s)   : memory
categories  : class 
name        : class 
min value   : built 
max value   : water 
mycolor <- c("darkred", "yellow", "burlywood", "cyan", "blue")
plot(lulc, col=mycolor)

Model evaluation

We evaluate a model mainly on the accuracy of its predictions.

Here, we conduct a k-fold cross-validation. In this technique, the data used to fit the model are split into k groups (typically 5). In turn, each group is used for model testing, while the rest of the data is used for model training (fitting).

set.seed(99)
# number of folds
k <- 5
j <- sample(rep(1:k, each = round(nrow(sampdata) / k)))
table(j)
j
 1  2  3  4  5 
40 40 40 40 40 

Now we train and test the model five times, each time computing the predictions and storing them together with the actual values in a list. Later we use the list to compute the final accuracy.

x <- list()
for (k in 1:5) {
    train <- sampdata[j!= k, ]
    test <- sampdata[j == k, ]
    cart <- rpart(as.factor(class)~., data=train, method = 'class',
                  minsplit = 5)
    pclass <- predict(cart, test)
    # assign the class with the maximum probability
    pc <- apply(pclass, 1, which.max)
    # use labels instead of numbers
    pc <- colnames(pclass)[pc]
    # create a data.frame using the reference and prediction
    x[[k]] <- cbind(test$class, pc)
}

Now we combine the five list elements into a single data.frame using do.call, and compute a confusion matrix.

y <- do.call(rbind, x)
y <- data.frame(y)
colnames(y) <- c('observed', 'predicted')
# confusion matrix
conmat <- table(y)
print(conmat)
          predicted
observed   built cropland fallow open water
  built        8        0      1    2     0
  cropland     0       24      0    0     0
  fallow       2        0     33    3     0
  open         0        0      2   47     0
  water        0        0      0    0    78

Overall accuracy

# number of total cases/samples
n <- sum(conmat)
n
[1] 200
# number of correctly classified cases per class
diag <- diag(conmat)
# Overall Accuracy
OA <- sum(diag) / n
OA
[1] 0.95

Kappa Coefficient:

The Kappa statistic (or Kappa coefficient) is a measure of agreement between two sets of categorical data. It takes into account the agreement occurring by chance, and a value of 1 indicates perfect agreement, while a value of 0 indicates agreement equivalent to chance. Negative values suggest agreement less than chance. In the context of classification, it is often used as a measure of how well a classifier’s predictions match the true values.
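
As a quick check with the numbers above: the row and column proportions of the confusion matrix give an expected (chance) accuracy of about 0.267, so kappa = (0.95 - 0.267) / (1 - 0.267) ≈ 0.93, matching the value computed below.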

# observed (true) cases per class
rowsums <- apply(conmat, 1, sum)
p <- rowsums / n
# predicted cases per class
colsums <- apply(conmat, 2, sum)
q <- colsums / n
expAccuracy <- sum(p*q)
kappa <- (OA - expAccuracy) / (1 - expAccuracy)
kappa
[1] 0.9317732

Producer and user accuracy

Producer’s Accuracy: The producer’s accuracy is a measure of a classifier’s ability to correctly classify instances of a particular class. It is calculated as the number of true positives (i.e., the number of instances correctly classified as belonging to the class) divided by the total number of instances that actually belong to that class (true positives and false negatives). It is essentially a measure of the classifier’s ability to correctly identify a class.

User’s Accuracy: The user’s accuracy, on the other hand, is a measure of how often instances that are classified as belonging to a particular class actually do belong to that class. It is calculated as the number of true positives divided by the total number of instances classified as that class (true positives and false positives). It is essentially a measure of the reliability of the classification of a class.
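
For example, in the confusion matrix above, 8 cells were correctly classified as built out of 11 cells that are actually built, so the producer’s accuracy for built is 8/11 ≈ 0.73; out of the 10 cells predicted as built, 8 are correct, so the user’s accuracy is 8/10 = 0.80.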

# Producer accuracy: correct cases / total observed cases per class (row sums)
PA <- diag / rowsums
# User accuracy: correct cases / total predicted cases per class (column sums)
UA <- diag / colsums
outAcc <- data.frame(producerAccuracy = PA, userAccuracy = UA)
outAcc
         producerAccuracy userAccuracy
built           0.7272727    0.8000000
cropland        1.0000000    1.0000000
fallow          0.8684211    0.9166667
open            0.9591837    0.9038462
water           1.0000000    1.0000000
