Mapping tea plantations at the foot of Mount Kenya

Supervised Machine Learning Tutorial
Land Economics Group, Institute for Food and Resource Economics, University of Bonn, Bonn, Germany

Author

David Wuepper, Lisa Biber-Freudenberger, Hadi, Wyclife Agumba Oluoch

Published

June 10, 2026

1 Background

Tea (Camellia sinensis (L.) Kuntze) is a major cash crop and key driver of livelihoods in many parts of central highlands of Kenya with favorable altitude, rainfall, and soil conditions. Accurate monitoring of tea plantations is important for supporting agricultural planning, optimizing resource allocation like fertilizer, and monitoring land use change such as encroachment into protected areas. Traditional field surveys, though accurate, are often time-consuming and resource intensive, making remote sensing approaches increasingly valuable.

In this tutorial, we demonstrate how to map tea plantations in a small region at the foot of Mount Kenya by combining Sentinel-2 satellite imagery, true presence and absence field data, and random forest machine learning techniques implemented in R (R Core Team 2025). We focus on workflow that spans from data preparation and model training to prediction and visualization of binary maps.

Note

While the case study is centered on tea mapping, the general approach can be adapted to map other crops or land cover types in different regions.

Inside the data folder we have the following files that are projected to epsg code 3857:

roi_p.gpkg: This is the region of interest (roi) geopackage for the tea plantations in Kenya.
s2_large.tif: The is the Sentinel-2 image for the region of interest.
vect.gpkg: This is the polygons of tea and non-tea areas for model training.
eval.gpkg: This is observed 200 points consisting of 100 true tea and 100 non tea for testing the model performance.

2 Training data: tea and no tea points

In QGIS, we identified a small area at the foot of Mount Kenya and manually digitized tea plantation and non-tea areas. These were to be used for extracting the Sentinel-2 data for training the model.

2.1 Loading Training Data in R

We begin by loading the boundary data using the vect function in terra package (Hijmans 2025) version 1.9.27, which offers a convenient way to read a lot of geospatial data into R. There are several other options or packages one can use to read geospatial data into R such as sf package (Pebesma and Bivand 2023). We also include tidyterra package (Hernangómez 2023) version 1.1.0 for plotting in conjunction with assorted tidyverse package 2.0.0 functionalities in the background. The extent of the study area is shown in Figure 1.

Code

library(terra)      # version 1.9.27
library(tidyterra)  # version 1.1.0
library(tidyverse)  # version 2.0.0

study_area <- vect(x = "data/roi_p.gpkg")

ggplot() +
  geom_spatvector(data = study_area, 
                  fill = NA, 
                  col = "blue")

The above code snippet loads and visualizes the study area boundary. However, we have only digitized a small portion of it for our training data. So, below, we add the roi to the map, see Figure 2.

Code

roi <- vect(x = "data/vect.gpkg")
ggplot() +
  geom_spatvector(data = study_area, 
                  fill = NA, 
                  col = "red") +
  geom_spatvector(data = roi, 
                  fill = NA, 
                  col = "blue")

Figure 2: Study area (red) border and small portion used for training (blue) border

Now, let us turn our attention to the roi. It has 29 polygons. Polygons 1:28 are tea fields and 29 is non-tea field. We can visualize them separately, Figure 3.

Code

ggplot() +
  geom_spatvector(data = roi,
                  aes(fill = as.factor(tea_no_tea)),  # map tea_no_tea to fill
                  color = "black") +                  # border color
  scale_fill_manual(
    values = c("0" = "orange", "1" = "green"),
    name = "Legend"
  ) +
  theme_minimal()

Figure 3: Region of interest where all tea plantations were digitized (green) and non tea areas (orange) used for model training

We can then obtain the training data from the satellite image. We will use the extract function from terra package to extract the values of the pixels in the polygons.

Code

s2 <- rast("data/s2_large.tif")
tea_extract <- terra::extract(x = s2, y = roi)
head(tea_extract)

  ID         B2         B3         B4        B5        B6        B7        B8
1  1 0.03013333 0.06610000 0.03245000 0.1058750 0.3268500 0.4021000 0.4461900
2  1 0.03083333 0.06640000 0.03309167 0.1123167 0.3263700 0.4008167 0.4367167
3  1 0.03050000 0.06720000 0.03300000 0.1125333 0.3273400 0.4008167 0.4395000
4  1 0.02990000 0.06695000 0.03306667 0.1150250 0.3363000 0.4116000 0.4460889
5  1 0.03016667 0.06870000 0.03360000 0.1152000 0.3375000 0.4116000 0.4501750
6  1 0.02980000 0.06816667 0.03485714 0.1242667 0.3263917 0.3895000 0.4527428
        B8A       B11        B12
1 0.4390500 0.1833600 0.08485000
2 0.4289333 0.1897375 0.08923333
3 0.4295333 0.1901750 0.08923333
4 0.4432875 0.1936333 0.09263334
5 0.4438750 0.1939500 0.09220000
6 0.4210000 0.2034333 0.10420000

2.2 Visualize RGB of the s2 for Study Area

The next step is to make a true color composite map from Sentinel-2 images within the study area showing the study extent and region of interest (where training points will be obtained from) Figure 4.

Code

plotRGB(x = s2, r = 3, g = 2, b = 1, stretch = "lin")
plot(study_area, add = T, lwd = 4, border = 'red')
plot(roi, add = T, fill = FALSE, border = 'blue', lwd = 2)

Figure 4: Study area (red) on background RGB image of Sentinel-2. Training area is shown in blue borders

Since we know that the polygons 1 to 28 are tea plantations and 29th is non-tea field, we can subset the data to have tea and non-tea extracted values.

Code

tea_df <- tea_extract[tea_extract$ID != 29, ]
no_tea_df <- tea_extract[tea_extract$ID == 29, ]

Great, now we have 2226 tea records and 3884 non-tea records. We can combine them into one data frame. But before that, we add column called tea which will contain a value of 1 for tea and 0 for non-tea. We then bind them and drop the ID column as follows.

Code

tea_df$tea <- 1
no_tea_df$tea <- 0

tea_no_tea_df <- rbind(tea_df, no_tea_df)

tea_no_tea_df$ID <- NULL

head(tea_no_tea_df)

          B2         B3         B4        B5        B6        B7        B8
1 0.03013333 0.06610000 0.03245000 0.1058750 0.3268500 0.4021000 0.4461900
2 0.03083333 0.06640000 0.03309167 0.1123167 0.3263700 0.4008167 0.4367167
3 0.03050000 0.06720000 0.03300000 0.1125333 0.3273400 0.4008167 0.4395000
4 0.02990000 0.06695000 0.03306667 0.1150250 0.3363000 0.4116000 0.4460889
5 0.03016667 0.06870000 0.03360000 0.1152000 0.3375000 0.4116000 0.4501750
6 0.02980000 0.06816667 0.03485714 0.1242667 0.3263917 0.3895000 0.4527428
        B8A       B11        B12 tea
1 0.4390500 0.1833600 0.08485000   1
2 0.4289333 0.1897375 0.08923333   1
3 0.4295333 0.1901750 0.08923333   1
4 0.4432875 0.1936333 0.09263334   1
5 0.4438750 0.1939500 0.09220000   1
6 0.4210000 0.2034333 0.10420000   1

3 Build the sdmData object and model

In this next step, we will build the machine learning model which will be able to take the current tea and no tea points and predictor variables to calibrate a model that can predict where else the crop could be within the study area. For that, we will use sdm package (Naimi and Araujo 2016) version 1.2.67. There are several other packages that can achieve this, such as caret (Kuhn 2008), mlr3 (Lang et al. 2019), tidymodels (Kuhn and Wickham 2020) among others. I prefer sdm because of it’s simplicity, extensibile, and ease of use.

Code

library(sdm)

sdm 1.2-67 (2025-09-24)

Code

sdm_data <- sdmData(formula = tea ~., 
                    train = tea_no_tea_df)

Have a glance at the sdm_data object:

Code

sdm_data

class                                 : sdmdata 
=========================================================== 
number of species                     :  1 
species names                         :  tea 
number of features                    :  10 
feature names                         :  B2, B3, B4, ... 
type                                  :  Presence-Absence 
has independent test data?             :  FALSE 
number of records                     :  6024 
has coordinates?                      :  FALSE

Then the model calibration stage comes with sdm function from sdm package itself:

Code

tea_model <- sdm(formula = tea ~.,
                 data = sdm_data, 
                 methods = "rf",
                 replications = "sub", # can use boot or cv too
                 test.p = 30,
                 n = 5,
                 parallelSettings = list(ncores = 10,
                                         method = "parallel"))
# write.sdm(tea_model, "model/tea_model.sdm")

View the model summary:

Code

tea_model <- read.sdm("model/tea_model.sdm")
tea_model

class                                 : sdmModels 
======================================================== 
number of species                     :  1 
number of modelling methods           :  1 
names of modelling methods            :  rf 
replicate.methods (data partitioning) :  bootstrap 
number of replicates (each method)    :  5 
total number of replicates per model  :  5 (per species) 
------------------------------------------
model run success percentage (per species)  :
------------------------------------------
method          tea              
---------------------- 
rf         :        100   %

###################################################################
model Mean performance (per species), using test dataset (generated using partitioning):
-------------------------------------------------------------------------------

 ## species   :  tea 
=========================

methods    :     AUC     |     COR     |     TSS     |  Deviance 
-------------------------------------------------------------------------
rf         :     0.99    |     0.91    |     0.9     |     0.3

Other checks on the model include looking at the response curves Figure 5.

Code

rcurve(tea_model)

The id argument is not specified; The modelIDs of 5 successfully fitted models are assigned to id...!

Figure 5: Response curve of all the predictor variables used to build the tea_model

We can also check for the area under the receiver operating characteristic curve (AUC), Figure 6.

Code

roc(tea_model)

Figure 6: Area under the receiver operating characteristic curve of the tea_model

Visualizing how each of the predictor variables used in the model was important for the crop under mapping is also useful Figure 7.

Code

plot(getVarImp(tea_model))


The variable importance for all the models are combined (averaged)...

Important

One can redo what we have done up to here several times, tuning the modeling process until a model with desirable characteristics is attained. Basically, this is a model that meets the specific needs of the project. For example, one can change the number of replications (20 etc), replication methods (sub, cv), model methods (brt, gam, glm, rf, svm). One can also add occurrence records (e.g. from tea factories), one can also add predictor variables (soil, elevation).

4 Assess Area of Applicability

It is important to assess the area of applicability of the model. This helps us to know how far we can use the model to make predictions. We can do this by using the aoa function from the sdm package.

Code

tea_aoa <- sdm::aoa(x = s2, d = tea_model)
plot(tea_aoa)

Figure 8: Area of Applicability of the tea_model

Values close to 1 in Figure 8 indicate areas that are similar to the training data. We notice that we can use our model to make predictions in most parts of the study area with minimal caution to the northeastern end,

5 Predict tea plantations throughout the study area

Finally, we get to the point where we use the trained model to predict the tea plantations over the study area. For this, we use predict function.

Code

pred <- predict(object = tea_model, s2)
pred

class       : SpatRaster
size        : 187, 335, 5  (nrow, ncol, nlyr)
resolution  : 10, 10  (x, y)
extent      : 4167740, 4171090, -40980, -39110  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / Pseudo-Mercator (EPSG:3857)
source(s)   : memory
names       : id_1__~e_boot, id_2__~e_boot, id_3__~e_boot, id_4__~e_boot, id_5__~e_boot
min values  :            -0,            -0,            -0,            -0,            -0
max values  :             1,             1,             1,             1,             1

This is giving us five layers of spatRaster corresponding to the five runs of the model.

We can visualize them, see

Code

ggplot() +
  geom_spatraster(data = pred) +
  facet_wrap(~lyr, ncol = 3) +
  scale_fill_whitebox_c(
    palette = "pi_y_g", 
    n.breaks = 10,
    guide = guide_legend(reverse = TRUE)
  ) +
  labs(
    fill = ""
  )

Figure 9: Predicted likelihood of pixels being tea

Normally, we take the median of the predictions since we only have one method ‘maxent’. When more than one method, we may be careful to only get median of each method separately Figure 10.

Code

pred_median <- median(pred)
ggplot() +
  geom_spatraster(data = pred_median) +
  scale_fill_whitebox_c(
    palette = "pi_y_g", 
    n.breaks = 10,
    guide = guide_legend(reverse = TRUE)
  ) +
  labs(
    fill = ""
  )

Figure 10: Predicted median likelihood of pixels being tea

It is also useful to quantify the uncertainty across the models run. This we can give using standard deviation metric or variance, among others Figure 11.

Code

pred_uncertainty <- stdev(pred)
ggplot() +
  geom_spatraster(data = pred_uncertainty) +
  scale_fill_whitebox_c(
    palette = "gn_yl", 
    n.breaks = 10,
    guide = guide_legend(reverse = TRUE)
  ) +
  labs(
    fill = ""
  )

Figure 11: Uncertainty around predicted likelihood of pixels being tea

5.1 Binarize the likelihood map for tea-no_tea map

We now need to make an important step which is to binarize the map into tea-no_tea map. This is a map which we can therefore use to tell how much land is under tea. However, since we actually have not been in the field to make the observation of the fields, we use threshold based evaluation metrics to create a cut-off point along the 0 to 1 likelihood scale. To get the binary map, we are going to use pa function as follows:

Code

tea_no_tea_map <- pa(x = pred_median,
                     y = tea_model, 
                     id = 'ensemble', 
                     opt = 2)

Now, with this map we can show the potential tea map of the study area Figure 12.

Code

ggplot() +
  geom_spatraster(data = as.factor(tea_no_tea_map)) +
  scale_fill_manual(
    values = c("0" = "pink", "1" = "darkgreen"), # Map levels to colors
    labels = c("No tea", "Tea"),
    guide = guide_legend(reverse = TRUE),
    na.translate = FALSE# Reverse legend order
  ) +
  labs(
    fill = ""
  )

Figure 12: Predicted tea plantations and background

6 Testing the model on independent data

We can now test the model on independent data. We have a geopackage file called eval.gpkg which contains 200 points of tea and non-tea. We can visualize these Figure 13, and test the model.

Code

library(mapview)

test_data <- vect("data/eval.gpkg")

e1 <- extract(pred_median, test_data)

eval1 <- evaluates(x = test_data$tea_no_tea,
                   p = e1$median)

eval1@threshold_based

       criteria threshold sensitivity specificity  TSS    MCC     F1 Kappa
1         sp=se   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
2    max(se+sp)   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
3     min(cost)   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
4    minROCdist   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
5    max(kappa)   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
6  max(ppv+npv)   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
7       ppv=npv   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
8      max(NMI)   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
9      max(ccr)   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
10   prevalence   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
11     max(MCC)   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
12          P10   0.61096        0.90           1 0.90 0.9045 0.9474  0.90
13           P5   0.56970        0.95           1 0.95 0.9512 0.9744  0.95
14           P1   0.29436        0.99           1 0.99 0.9900 0.9950  0.99
15           P0   0.17800        1.00           1 1.00 1.0000 1.0000  1.00
      NMI    phi ppv    npv   ccr   mcr ommission commission prevalence
1  1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
2  1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
3  1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
4  1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
5  1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
6  1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
7  1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
8  1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
9  1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
10 1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
11 1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500
12 0.7583 0.9045   1 0.9091 0.950 0.050      0.10          0      0.450
13 0.8550 0.9512   1 0.9524 0.975 0.025      0.05          0      0.475
14 0.9595 0.9900   1 0.9901 0.995 0.005      0.01          0      0.495
15 1.0000 1.0000   1 1.0000 1.000 0.000      0.00          0      0.500

Code

eval1@statistics

$Prevalence
[1] 0.5

$AUC
[1] 1

$cBoyce
[1] 1

$COR
          cor       p.value 
 9.600000e-01 1.120148e-111 

$Deviance
[1] 0.2133508

Code

mapview(tea_no_tea_map) +
  mapview(test_data)

Figure 13: Interactive map of tea map and test data points

7 Area under tea

To find the area under tea within the predicted area. We can use expanse function from terra package.

Code

expanse(x = tea_no_tea_map, unit = "ha", byValue = TRUE)

  layer value     area
1     1     0 445.6201
2     1     1 173.2857

We can go ahead and visualize the area under each class; tea and no tea Figure 14.

Code

tea_no_tea_map_f <- as.factor(tea_no_tea_map)
levels(tea_no_tea_map_f) <- data.frame(value = c(0, 1),label = c("No tea", "Tea"))

df_area <- expanse(tea_no_tea_map_f, 
                   unit = 'ha', 
                   byValue = TRUE)
df_area |> 
  ggplot(aes(x = reorder(value, -area), y = area, fill = value)) +
  geom_col() +
  geom_text(aes(label = round(area, 2)),
            vjust = -0.3,
            colour = 'black',
            size = 5,
            fontface = 'bold') +
  theme_minimal() +
  theme(
    legend.position = 'none',
    axis.text = element_text(
      color = 'black',
      face = 'bold',
      size = 12
    ),
    axis.title = element_text(
      color = 'black',
      face = 'bold',
      size = 14
    ),
    plot.title = element_text(
      color = 'black',
      face = 'bold',
      size = 16,
      hjust = 0.5
    )
  ) +
  labs(x = "Land cover",
       y = "Area (ha)") +
  scale_fill_manual(
    name = "Class",
    values = c(
    "No tea" = "orange",
    "Tea" = "darkgreen"
  ))

8 Application of the model and map

Manual monitoring of tea plantations is costly and limited in scale. Using machine learning models and high resolution satellite images, we can automate and enhance the monitoring process, with applications including:

Pest and disease detection: Early identification for timely detection.
Drought monitoring: Could help detect water stress for irrigation planning.
Yield estimation: Predicting yield based on spectral data and plant health.
Damage assessment: Identifying hail and other weather related impacts.
Encroachment detection: Monitoring illegal expansion of plantations into protected areas.
Carbon quantification: This can make it possible to efficiently quantify carbon in tea plantations.
Tacking tea shift: Monitoring shifts in tea cultivation due to climate change and other reasons over time.

9 Conclusion: Key Observations from the workflow

The map of tea Camellia sinensis achieved superb accuracy in central Kenya region. We are therefore quite confident in our model output. We could finally send the map to stakeholders in the tea industry hailing from the region to help validate its accuracy. In national scale mapping event, we would encounter a lot of challenges including agro-forestry, pruning of tea, variations in cultivars among others.

10 Exercise: Use the tea_model to predict tea elsewhere or at different time

Since we have built our model, we can use it to make predictions across space and time. For instance, we can use the model to make prediction in another country, say Uganda. We can also use the same model to make predictions into the past, say 2020.

Feel free to reach us out if you got any questions Land Economics Group.

11 References

Hernangómez, Diego. 2023. Using the Tidyverse with Terra Objects: The Tidyterra Package. 8: 5751. https://doi.org/10.21105/joss.05751.

Hijmans, Robert J. 2025. Terra: Spatial Data Analysis. https://doi.org/10.32614/CRAN.package.terra.

Kuhn, Max. 2008. “Building Predictive Models inRusing thecaretpackage.” Journal of Statistical Software 28 (5). https://doi.org/10.18637/jss.v028.i05.

Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.

Lang, Michel, Martin Binder, Jakob Richter, et al. 2019. Mlr3: A Modern Object-Oriented Machine Learning Framework in r. https://doi.org/10.21105/joss.01903.

Naimi, Babak, and Miguel B. Araujo. 2016. Sdm: A Reproducible and Extensible r Platform for Species Distribution Modelling. 39: 368–75. https://doi.org/10.1111/ecog.01881.

Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in r. https://doi.org/10.1201/9780429459016.

R Core Team. 2025. R: A Language and Environment for Statistical Computing. https://www.r-project.org/.