David Wuepper, Lisa Biber-Freudenberger, Hadi, Wyclife Agumba Oluoch
Published
June 10, 2026
1Background
Tea (Camellia sinensis (L.) Kuntze) is a major cash crop and key driver of livelihoods in many parts of central highlands of Kenya with favorable altitude, rainfall, and soil conditions. Accurate monitoring of tea plantations is important for supporting agricultural planning, optimizing resource allocation like fertilizer, and monitoring land use change such as encroachment into protected areas. Traditional field surveys, though accurate, are often time-consuming and resource intensive, making remote sensing approaches increasingly valuable.
In this tutorial, we demonstrate how to map tea plantations in a small region at the foot of Mount Kenya by combining Sentinel-2 satellite imagery, true presence and absence field data, and random forest machine learning techniques implemented in R(R Core Team 2025). We focus on workflow that spans from data preparation and model training to prediction and visualization of binary maps.
Note
While the case study is centered on tea mapping, the general approach can be adapted to map other crops or land cover types in different regions.
Inside the data folder we have the following files that are projected to epsg code 3857:
roi_p.gpkg: This is the region of interest (roi) geopackage for the tea plantations in Kenya.
s2_large.tif: The is the Sentinel-2 image for the region of interest.
vect.gpkg: This is the polygons of tea and non-tea areas for model training.
eval.gpkg: This is observed 200 points consisting of 100 true tea and 100 non tea for testing the model performance.
2Training data: tea and no tea points
In QGIS, we identified a small area at the foot of Mount Kenya and manually digitized tea plantation and non-tea areas. These were to be used for extracting the Sentinel-2 data for training the model.
2.1Loading Training Data in R
We begin by loading the boundary data using the vect function in terra package (Hijmans 2025) version 1.9.27, which offers a convenient way to read a lot of geospatial data into R. There are several other options or packages one can use to read geospatial data into R such as sf package (Pebesma and Bivand 2023). We also include tidyterra package (Hernangómez 2023) version 1.1.0 for plotting in conjunction with assorted tidyverse package 2.0.0 functionalities in the background. The extent of the study area is shown in Figure 1.
Code
library(terra) # version 1.9.27library(tidyterra) # version 1.1.0library(tidyverse) # version 2.0.0study_area <-vect(x ="data/roi_p.gpkg")ggplot() +geom_spatvector(data = study_area, fill =NA, col ="blue")
Figure 1: Study area extent
The above code snippet loads and visualizes the study area boundary. However, we have only digitized a small portion of it for our training data. So, below, we add the roi to the map, see Figure 2.
Code
roi <-vect(x ="data/vect.gpkg")ggplot() +geom_spatvector(data = study_area, fill =NA, col ="red") +geom_spatvector(data = roi, fill =NA, col ="blue")
Figure 2: Study area (red) border and small portion used for training (blue) border
Now, let us turn our attention to the roi. It has 29 polygons. Polygons 1:28 are tea fields and 29 is non-tea field. We can visualize them separately, Figure 3.
Figure 3: Region of interest where all tea plantations were digitized (green) and non tea areas (orange) used for model training
We can then obtain the training data from the satellite image. We will use the extract function from terra package to extract the values of the pixels in the polygons.
Code
s2 <-rast("data/s2_large.tif")tea_extract <- terra::extract(x = s2, y = roi)head(tea_extract)
The next step is to make a true color composite map from Sentinel-2 images within the study area showing the study extent and region of interest (where training points will be obtained from) Figure 4.
Code
plotRGB(x = s2, r =3, g =2, b =1, stretch ="lin")plot(study_area, add = T, lwd =4, border ='red')plot(roi, add = T, fill =FALSE, border ='blue', lwd =2)
Figure 4: Study area (red) on background RGB image of Sentinel-2. Training area is shown in blue borders
Since we know that the polygons 1 to 28 are tea plantations and 29th is non-tea field, we can subset the data to have tea and non-tea extracted values.
Great, now we have 2226 tea records and 3884 non-tea records. We can combine them into one data frame. But before that, we add column called tea which will contain a value of 1 for tea and 0 for non-tea. We then bind them and drop the ID column as follows.
In this next step, we will build the machine learning model which will be able to take the current tea and no tea points and predictor variables to calibrate a model that can predict where else the crop could be within the study area. For that, we will use sdm package (Naimi and Araujo 2016) version 1.2.67. There are several other packages that can achieve this, such as caret(Kuhn 2008), mlr3(Lang et al. 2019), tidymodels(Kuhn and Wickham 2020) among others. I prefer sdm because of it’s simplicity, extensibile, and ease of use.
class : sdmdata
===========================================================
number of species : 1
species names : tea
number of features : 10
feature names : B2, B3, B4, ...
type : Presence-Absence
has independent test data? : FALSE
number of records : 6024
has coordinates? : FALSE
Then the model calibration stage comes with sdm function from sdm package itself:
Code
tea_model <-sdm(formula = tea ~.,data = sdm_data, methods ="rf",replications ="sub", # can use boot or cv tootest.p =30,n =5,parallelSettings =list(ncores =10,method ="parallel"))# write.sdm(tea_model, "model/tea_model.sdm")
class : sdmModels
========================================================
number of species : 1
number of modelling methods : 1
names of modelling methods : rf
replicate.methods (data partitioning) : bootstrap
number of replicates (each method) : 5
total number of replicates per model : 5 (per species)
------------------------------------------
model run success percentage (per species) :
------------------------------------------
method tea
----------------------
rf : 100 %
###################################################################
model Mean performance (per species), using test dataset (generated using partitioning):
-------------------------------------------------------------------------------
## species : tea
=========================
methods : AUC | COR | TSS | Deviance
-------------------------------------------------------------------------
rf : 0.99 | 0.91 | 0.9 | 0.3
Other checks on the model include looking at the response curves Figure 5.
Code
rcurve(tea_model)
The id argument is not specified; The modelIDs of 5 successfully fitted models are assigned to id...!
Figure 5: Response curve of all the predictor variables used to build the tea_model
We can also check for the area under the receiver operating characteristic curve (AUC), Figure 6.
Code
roc(tea_model)
Figure 6: Area under the receiver operating characteristic curve of the tea_model
Visualizing how each of the predictor variables used in the model was important for the crop under mapping is also useful Figure 7.
Code
plot(getVarImp(tea_model))
The variable importance for all the models are combined (averaged)...
Figure 7: Plot of the variable importance indicating which variables were most important in predicting the tea plantations
Important
One can redo what we have done up to here several times, tuning the modeling process until a model with desirable characteristics is attained. Basically, this is a model that meets the specific needs of the project. For example, one can change the number of replications (20 etc), replication methods (sub, cv), model methods (brt, gam, glm, rf, svm). One can also add occurrence records (e.g. from tea factories), one can also add predictor variables (soil, elevation).
4Assess Area of Applicability
It is important to assess the area of applicability of the model. This helps us to know how far we can use the model to make predictions. We can do this by using the aoa function from the sdm package.
Code
tea_aoa <- sdm::aoa(x = s2, d = tea_model)plot(tea_aoa)
Figure 8: Area of Applicability of the tea_model
Values close to 1 in Figure 8 indicate areas that are similar to the training data. We notice that we can use our model to make predictions in most parts of the study area with minimal caution to the northeastern end,
5Predict tea plantations throughout the study area
Finally, we get to the point where we use the trained model to predict the tea plantations over the study area. For this, we use predict function.
Figure 9: Predicted likelihood of pixels being tea
Normally, we take the median of the predictions since we only have one method ‘maxent’. When more than one method, we may be careful to only get median of each method separately Figure 10.
Figure 10: Predicted median likelihood of pixels being tea
It is also useful to quantify the uncertainty across the models run. This we can give using standard deviation metric or variance, among others Figure 11.
Figure 11: Uncertainty around predicted likelihood of pixels being tea
5.1Binarize the likelihood map for tea-no_tea map
We now need to make an important step which is to binarize the map into tea-no_tea map. This is a map which we can therefore use to tell how much land is under tea. However, since we actually have not been in the field to make the observation of the fields, we use threshold based evaluation metrics to create a cut-off point along the 0 to 1 likelihood scale. To get the binary map, we are going to use pa function as follows:
Code
tea_no_tea_map <-pa(x = pred_median,y = tea_model, id ='ensemble', opt =2)
Now, with this map we can show the potential tea map of the study area Figure 12.
Figure 12: Predicted tea plantations and background
6Testing the model on independent data
We can now test the model on independent data. We have a geopackage file called eval.gpkg which contains 200 points of tea and non-tea. We can visualize these Figure 13, and test the model.
Manual monitoring of tea plantations is costly and limited in scale. Using machine learning models and high resolution satellite images, we can automate and enhance the monitoring process, with applications including:
Pest and disease detection: Early identification for timely detection.
Drought monitoring: Could help detect water stress for irrigation planning.
Yield estimation: Predicting yield based on spectral data and plant health.
Damage assessment: Identifying hail and other weather related impacts.
Encroachment detection: Monitoring illegal expansion of plantations into protected areas.
Carbon quantification: This can make it possible to efficiently quantify carbon in tea plantations.
Tacking tea shift: Monitoring shifts in tea cultivation due to climate change and other reasons over time.
9Conclusion: Key Observations from the workflow
The map of tea Camellia sinensis achieved superb accuracy in central Kenya region. We are therefore quite confident in our model output. We could finally send the map to stakeholders in the tea industry hailing from the region to help validate its accuracy. In national scale mapping event, we would encounter a lot of challenges including agro-forestry, pruning of tea, variations in cultivars among others.
10Exercise: Use the tea_model to predict tea elsewhere or at different time
Since we have built our model, we can use it to make predictions across space and time. For instance, we can use the model to make prediction in another country, say Uganda. We can also use the same model to make predictions into the past, say 2020.
Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles.https://www.tidymodels.org.
Lang, Michel, Martin Binder, Jakob Richter, et al. 2019. Mlr3: A Modern Object-Oriented Machine Learning Framework in r. https://doi.org/10.21105/joss.01903.
Naimi, Babak, and Miguel B. Araujo. 2016. Sdm: A Reproducible and Extensible r Platform for Species Distribution Modelling. 39: 368–75. https://doi.org/10.1111/ecog.01881.