In yesterday’s linear regression tutorial, we used a standard statistical linear model to identify parameters that best fit a linear regression and did some predictions. Today, we build another predictive model in which we first train a model on training data, and use the trained model for prediction. This approach is called supervised machine learning.
Supervised machine learning is a type of machine learning algorithm that uses labeled training data to learn a function that can be applied to unknown data. The training data includes both the input and the correct output. The algorithm analyzes the training data and produces an inferred function, which can be used to map new, unknown examples. This inferred function aims to predict the output values/labels for the new data based on the learned information from the training data.
For today’s tutorial, we train a machine learning model on land use land cover data in a raster format. We will talk about the machine learning model we use later in the tutorial.
We will take the following steps:
Create sample sites used for classification
Extract cell values from Landsat data for the sample sites
Train the classifier using training samples
Classify the Landsat data using the trained model
Evaluate the accuracy of the model
We use terra to handle raster data.
We make a list of raster data filenames using paste0. We can then read in and combine them using rast. We assign names to the raster layers.
Notice the name of the layers. These are the bands of images collected by the Landsat satellite. Blue, Green, and Red make the visible spectrum, along with Near-InfraRed and ShortWaveInfraRed. These bands are helpful in detecting surface reflectance, rocks, soil moisture, etc. Our ultimate goal is, given the values for these bands, we want to predict what is the most likely land use/land cover of a particular cell is, e.g. water, cropland, built area, etc.
library(terra)
terra 1.7.29
# We read the 6 bands from the Landsat image we previously usedraslist <-paste0('./rs/LC08_044034_20170614_B', 2:7, ".tif")landsat <-rast(raslist)names(landsat) <-c('blue', 'green', 'red', 'NIR', 'SWIR1', 'SWIR2')
The polygons we use for land cover information is saved as rds files. rds file formats are specific to R. We can save all R objects as rds in our hard disks so that it saves space and loads faster that other formats while importing for use.
# load polygons with land use land cover informationsamp <-readRDS("./rs/lcsamples.rds")# check the distribution of the polygonsplot(samp)text(samp, samp$class)
We will randomly sample 200 points.
set.seed(1)# generate point samples from the polygonsptsamp <-spatSample(samp, 200, method="random")plot(ptsamp, "class")
We can also create the training data from other rasters. For example, the National Land Cover Database 2011 (NLCD 2011) is a land cover product for the United States. NLCD is a 30-m Landsat-based land cover database spanning 4 epochs (1992, 2001, 2006 and 2011). NLCD 2011 is based primarily on a decision-tree classification of circa 2011 Landsat data.
We import NLCS raster. Since NLCD has data from both 2001 and 2011, we only select 2011 data for use. We name the classes and later assign ordinal levels to the names and assign appropriate colors to represent those classes.
nlcd <-rast('./rs/nlcd-L1.tif')names(nlcd) <-c("nlcd2001", "nlcd2011")nlcd2011 <- nlcd[[2]]# assign class names as categories (levels)nlcdclass <-c("Water", "Developed", "Barren", "Forest", "Shrubland", "Herbaceous", "Cultivated", "Wetlands")classdf <-data.frame(value =c(1,2,3,4,5,7,8,9), names = nlcdclass)levels(nlcd2011) <- classdf# colors (as hexidecimal codes)classcolor <-c("#5475A8", "#B50000", "#D2CDC0", "#38814E", "#AF963C", "#D1D182", "#FBF65D", "#C8E6F8")# plot the locations on top of the original nlcd rasterplot(nlcd2011, col=classcolor)ptlonlat <-project(ptsamp, crs(nlcd2011))points(ptlonlat)
We sample 200 raster units randomly. Note the distribution of different classes.
# Samplingsamp2011 <-spatSample(nlcd2011, size =200, method="regular")# Number of samples in each classtable(samp2011[,1])
Water Developed Barren Forest Shrubland Herbaceous Cultivated
19 29 1 6 3 35 107
Wetlands
16
The reflectance values imported from the landsat images are the independent variables, or in machine learning parlance - predictors.
# extract the reflectance values for the locationsdf <-extract(landsat, ptsamp, ID=FALSE)# Quick check for the extracted valueshead(df)
We finally compile a dataset with predictors from landsat and classes from LULC 2011 to train the machine learning model.
sampdata <-data.frame(class = ptsamp$class, df)
Train the classifier
The machine learning model we are using is called CART.
CART (Classification and Regression Trees) is a popular machine learning algorithm that is used for both classification and regression tasks. The model works by splitting the data into subsets based on the values of the input features. Each split is determined using a measure of homogeneity, such as Gini impurity or information gain, to create groups that are as homogeneous as possible.
The result of these splits is a tree structure, where each node represents a decision (i.e., a split on a particular feature) and each leaf node represents a prediction. In a classification tree, the prediction at each leaf node is the most common class (or label) of the instances that reach that leaf. In a regression tree, the prediction is the mean target value of the instances that reach that leaf.
CART models are simple to understand and interpret, which makes them a popular choice for machine learning tasks.
library(rpart)# Train the modelcartmodel <-rpart(as.factor(class)~., data = sampdata, method ='class', minsplit =5)
class : SpatRaster
dimensions : 1245, 1497, 1 (nrow, ncol, nlyr)
resolution : 30, 30 (x, y)
extent : 594090, 639000, 4190190, 4227540 (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 10N (EPSG:32610)
source(s) : memory
categories : class
name : class
min value : built
max value : water
We evaluate a model mainly on the accuracy of predictions.
Here, we are conducting a k-fold cross validation. In this technique the data used to fit the model is split into k groups (typically 5 groups). In turn, one of the groups will be used for model testing, while the rest of the data is used for model training (fitting).
set.seed(99)# number of foldsk <-5j <-sample(rep(1:k, each =round(nrow(sampdata))/k))table(j)
j
1 2 3 4 5
40 40 40 40 40
Now we train and test the model five times, each time computing the predictions and storing that with the actual values in a list. Later we use the list to compute the final accuracy.
x <-list()for (k in1:5) { train <- sampdata[j!= k, ] test <- sampdata[j == k, ] cart <-rpart(as.factor(class)~., data=train, method ='class',minsplit =5) pclass <-predict(cart, test, na.rm =TRUE)# assign class to maximum probablity pc <-apply(pclass, 1, which.max)# use labels instead of numbers pc <-colnames(pclass)[pc]# create a data.frame using the reference and prediction x[[k]] <-cbind(test$class, pc)}
Now combine the five list elements into a single data.frame, using do.call and compute a confusion matrix.
y <-do.call(rbind, x)y <-data.frame(y)colnames(y) <-c('observed', 'predicted')# confusion matrixconmat <-table(y)print(conmat)
predicted
observed built cropland fallow open water
built 8 0 1 2 0
cropland 0 24 0 0 0
fallow 2 0 33 3 0
open 0 0 2 47 0
water 0 0 0 0 78
Overall accuracy
# number of total cases/samplesn <-sum(conmat)n
[1] 200
# number of correctly classified cases per classdiag <-diag(conmat)# Overall AccuracyOA <-sum(diag) / nOA
[1] 0.95
Kappa Coefficient:
The Kappa statistic (or Kappa coefficient) is a measure of agreement between two sets of categorical data. It takes into account the agreement occurring by chance, and a value of 1 indicates perfect agreement, while a value of 0 indicates agreement equivalent to chance. Negative values suggest agreement less than chance. In the context of classification, it is often used as a measure of how well a classifier’s predictions match the true values.
Producer’s Accuracy: The producer’s accuracy is a measure of a classifier’s ability to correctly classify instances of a particular class. It is calculated as the number of true positives (i.e., the number of instances correctly classified as belonging to the class) divided by the total number of instances that actually belong to that class (true positives and false negatives). It is essentially a measure of the classifier’s ability to correctly identify a class.
User’s Accuracy: The user’s accuracy, on the other hand, is a measure of how often instances that are classified as belonging to a particular class actually do belong to that class. It is calculated as the number of true positives divided by the total number of instances classified as that class (true positives and false positives). It is essentially a measure of the reliability of the classification of a class.