SATELLITE DATA FOR AGRICULTURAL ECONOMISTS: Theory and Practice
Lecture date: 13-01-2026
1 Introduction
In this session, we will learn how to build a spatial machine learning model using the sdm package (Naimi and Araujo 2016). The main task is to segment tea fields in a given region of interest within the Kericho region of Kenya using the Google Embeddings dataset.
2 Google Earth Engine
You can run the code in the snippet below in the GEE Code Editor to obtain the Google Embeddings used in this script.
var embed = ee.ImageCollection("GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL");
var roi = ee.FeatureCollection("projects/ee-wyclifeaoluoch/assets/roi_win_2025"); // Replace with your own asset path
/*
SATELLITE DATA FOR AGRICULTURAL ECONOMISTS
THEORY AND PRACTICE
Mapping tea plantations in Kericho region of Kenya: Deep Learning Approach
David Wuepper, Hadi, Wyclife Agumba Oluoch
Affiliations:
- Land Economics Group (https://www.ilr1.uni-bonn.de/en/research/research-groups/land-economics), University of Bonn, Bonn, Germany
This script demonstrates how to create, visualize, and export to Drive Google Embeddings for use in machine and deep learning models.
*/
// --------------------------------------------
// 1. Define Time Range for Image Collection
// --------------------------------------------
var start_date = "2024-01-01"; // Start date of image collection
var end_date = "2025-01-01"; // End date of image collection
// --------------------------------------------
// 2. Obtain the embedding for the roi and time span
// --------------------------------------------
var embed = embed
.filterBounds(roi) // Keep images overlapping the region of interest
.filterDate(start_date, end_date) // Filter only for the dates needed
.mosaic(); // Composite the collection into a single image (a mosaic, not a median)
// Print image properties to console to verify band structure and completeness
print('Embeddings', embed); // Expecting a single image with 64 bands
// --------------------------------------------
// 3. Visualization of the Embedding
// --------------------------------------------
// Add three of the 64 embedding dimensions to the map for visual inspection
Map.centerObject(roi, 13);
Map.addLayer(
embed.clip(roi),
{
min: -0.031, // Obtained from the Layers panel, min value (sometimes based on 98% of values)
max: 0.177, // Obtained from the Layers panel, these are to enhance visualization
bands: ['A01', 'A02', 'A03'] // Three embedding dimensions mapped to RGB (a false-color composite; embeddings are not spectral bands)
},
'composite', // Name of the layers in Layers panel
false // Do not display layer by default (can be turned on in Layers panel), makes computations faster
);
// --------------------------------------------
// 4. Display ROI Outline
// --------------------------------------------
// Paint the ROI geometry outline in blue
var roi_outline = ee.Image().paint({ // Paints the edges of roi as new raster layer
featureCollection: roi, // The region of interest to be visualized
width: 4 // Outline thickness in pixels
});
// Add the blue border outline to the map
Map.addLayer(roi_outline, {palette: 'blue'}, 'roi');
// --------------------------------------------
// 5. Export Image Composite to Google Drive
// --------------------------------------------
// Save the processed image composite for offline use (e.g., in R or GIS software)
Export.image.toDrive({ // Function to export an image to Google drive
image: embed.clip(roi), // The image to be exported
description: 'embed_image', // Task name to appear in the Tasks tab while exporting
folder: 'winter2025', // Folder in your Google Drive to hold the image
fileNamePrefix: 'embed2024', // Prefix for output file name
region: roi, // Export region bounds
scale: 10, // Spatial resolution in meters (the embeddings are generated at 10 m)
maxPixels: 1e13, // Allow large exports
crs: "EPSG:3857" // Set coordinate reference system (Web Mercator)
});
// --------------------------------------------
// 6. Script Summary
// --------------------------------------------
/*
This script demonstrates the workflow for generating a Google Embeddings image composite
for a small administrative region in Kenya, tailored for subsequent agricultural analysis.
It includes:
- Loading, selecting, and mosaicking the embeddings
- Visualizing the composite and roi
- Exporting the composite image for further analysis in GIS or machine learning tools
Adapt and extend this script to fit your specific region, time period, or classification objectives.
*/
/////////////////E N D //////////////////
3 Datasets
We are going to make use of the Google Embeddings dataset we downloaded from Google Earth Engine. You need a GEE account to be able to view the script in GEE and rerun the export task. The raster data is a 64-dimension Google Embeddings image for the year 2024 (filtered for the dates from 2024-01-01 to 2025-01-01). Because filterDate treats the end date as exclusive, setting it to 2025-01-01 ensures that images from 2024-12-31 are also included; otherwise they would be ignored.
After exporting the image to Drive, we download it to our predefined data folder inside our R project. We will use the training data created in our previous modeling session involving Sentinel-2 images. The training data has a label attribute coded as 1 for tea and 0 for non-tea points. Such data can be obtained from many different sources, such as tea farmer organizations, research institutes, past publications, etc.
4 Loading libraries
We will make use of the following packages to achieve the classification/segmentation task:
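The rendered code chunk is not shown in this extract; a minimal sketch of the library calls, assuming the packages used later in this session (sdm for modeling, terra for raster/vector handling, dplyr and ggplot2 for the area summary):
library(sdm)     # Spatial ML / species distribution modeling (Naimi and Araujo 2016)
library(terra)   # Raster (SpatRaster) and vector (SpatVector) data handling
library(dplyr)   # Data wrangling (mutate, case_when)
library(ggplot2) # Plotting the area summary
# sdm::installAll() # Optional: run once to install the dependencies of all sdm methods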
5 Loading the data
The following are the datasets that we will use:
predictor_stack: Raster dataset of the Google Embeddings image used to predict tea plantations
train_points: Vector dataset of observed tea and non-tea points
We need to project the training points to the CRS of the raster. Remember, when exporting a table from GEE, the default CRS is WGS84, while we exported the raster in EPSG:3857. These are different and need to be harmonized. Even though we could project the raster to the CRS of the vector, it is generally preferable to convert the vector to the CRS of the raster, since reprojecting a raster resamples its values.
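A sketch of the loading and reprojection step, assuming both files live in the data folder of the R project (the point file name is hypothetical):
# Load the 64-band Google Embeddings raster exported from GEE (fileNamePrefix 'embed2024')
predictor_stack <- rast("data/embed2024.tif")
# Load the training points (label: 1 = tea, 0 = non-tea); file name is illustrative
train_points <- vect("data/train_points.shp")
# Reproject the points (WGS84) to the CRS of the raster (EPSG:3857)
train_points <- project(train_points, crs(predictor_stack))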
6 Visualize the data
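The plotting chunk is not rendered in this extract; a minimal sketch (the band choice and colors are illustrative):
# Display three embedding dimensions as an RGB composite and overlay the training points
plotRGB(predictor_stack, r = 2, g = 3, b = 4, stretch = "lin")
points(train_points, col = ifelse(train_points$label == 1, "darkgreen", "gray40"), cex = 0.6)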
7 Creating sdm data object
sdm primarily expects the presence and absence/pseudo-absence/background records to be provided. They can have the predictor variables attached to them, or the predictors can be supplied separately as a SpatRaster object.
In our case, we have 100 presence records (tea points) and 100 absence records (non-tea points), that is, a perfectly balanced sample.
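A sketch of building the sdmdata object, assuming a recent sdm version that accepts terra objects (sdmData extracts the predictor values at the point locations):
d <- sdmData(formula = label ~ ., train = train_points, predictors = predictor_stack)
d # Printing the object gives the summary below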
class : sdmdata
===========================================================
number of species : 1
species names : label
number of features : 64
feature names : A00, A01, A02, ...
type : Presence-Absence
has independent test data? : FALSE
number of records : 200
has coordinates? : TRUE
Next, we train a single-method model using the sdm function:
m <- sdm(
  formula = label ~ .,
  data = d,
  methods = "maxent",
  replication = "sub",
  test.p = 30,
  n = 3,
  parallelSetting = list(ncore = 10,
                         method = "parallel")
)
Loading required package: parallel
We can check the results of the model:
class : sdmModels
========================================================
number of species : 1
number of modelling methods : 1
names of modelling methods : maxent
replicate.methods (data partitioning) : subsampling
number of replicates (each method) : 3
total number of replicates per model : 3 (per species)
test percentage (in subsampling) : 30
------------------------------------------
model run success percentage (per species) :
------------------------------------------
method label
----------------------
maxent : 100 %
###################################################################
model Mean performance (per species), using test dataset (generated using partitioning):
-------------------------------------------------------------------------------
## species : label
=========================
methods : AUC | COR | TSS | Deviance
-------------------------------------------------------------------------
maxent : 1 | 0.94 | 0.96 | 0.53
We can also make predictions. Adjust ncore depending on the number of cores available on your machine.
p <- predict(m, predictor_stack, mean = TRUE, parallelSetting = list(ncore = 10, method = "parallel"))
plot(p)
We can also show the variable importance:
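The original chunk is not shown here; one way to do this in sdm is via getVarImp (a sketch):
vi <- getVarImp(m) # Variable importance, averaged over the replicates
plot(vi, 'auc')    # Importance measured with the AUC-based test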
7.1 Ensemble model by combining several methods (Recommended)
We built the previous simple model using only one method, MaxEnt. However, sdm offers up to 22 different methods that can be used and ensembled. To do this, we provide the model with the names of the methods we want to use. The model will then run all of them, and the user can either pick the best-performing one or ensemble several of the best-performing methods. Here we go for three methods: rf, svm, and brt, as in the sketch below.
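A sketch of the multi-method call, mirroring the settings of the single-method model above:
m2 <- sdm(
  formula = label ~ .,
  data = d,
  methods = c("rf", "svm", "brt"),
  replication = "sub",
  test.p = 30,
  n = 3,
  parallelSetting = list(ncore = 10, method = "parallel")
)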
7.2 Model output
class : sdmModels
========================================================
number of species : 1
number of modelling methods : 3
names of modelling methods : rf, svm, brt
replicate.methods (data partitioning) : subsampling
number of replicates (each method) : 3
total number of replicates per model : 3 (per species)
test percentage (in subsampling) : 30
------------------------------------------
model run success percentage (per species) :
------------------------------------------
method label
----------------------
rf : 100 %
svm : 100 %
brt : 100 %
###################################################################
model Mean performance (per species), using test dataset (generated using partitioning):
-------------------------------------------------------------------------------
## species : label
=========================
methods : AUC | COR | TSS | Deviance
-------------------------------------------------------------------------
rf : 1 | 0.95 | 0.97 | 0.25
svm : 1 | 0.97 | 0.99 | 0.12
brt : 0.99 | 0.94 | 0.94 | 0.62
We can then proceed with the predictions.
We can also plot the predictions of all methods. Remember that mean = TRUE averages over the replicates within each method, so we get one prediction layer per method.
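A sketch of the prediction and plotting step (ncore is machine-dependent):
p2 <- predict(m2, predictor_stack, mean = TRUE,
              parallelSetting = list(ncore = 10, method = "parallel"))
plot(p2) # One layer per method, replicates averaged within each method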
8 Ensemble the methods
Visualize the ensemble output
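The ensemble and visualization chunks are not rendered in this extract; a sketch with illustrative setting values (a weighted average using TSS-based weights):
en <- ensemble(m2, predictor_stack,
               setting = list(method = "weighted", stat = "tss", opt = 2))
plot(en, main = "Ensemble prediction")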
8.1 Binary output
We can use the pa function from the sdm package (Naimi and Araujo 2016) to calculate the threshold that maximizes TSS and use it to derive a binary map from the probability output.
Now we can create the binary map based on the threshold that maximizes the sum of sensitivity and specificity.
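The exact pa call is not shown in this extract; an equivalent route via getEvaluation, swapping in a manual thresholding step, could look like this:
# Threshold maximizing sensitivity + specificity (opt = 2) for each replicate
ev <- getEvaluation(m2, stat = "threshold", opt = 2)
th <- mean(ev$threshold) # Average the thresholds across replicates (a pragmatic choice)
binary_map <- en >= th   # 1 = tea, 0 = non-tea
plot(binary_map)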
9 Area under tea
We can then determine how much land is under tea in the region.
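A sketch of the area calculation with terra::expanse; unit = "km" is an assumption consistent with the magnitudes printed below:
# One row per raster value (0 = non-tea, 1 = tea), area in square kilometers
tea_area <- expanse(binary_map, unit = "km", byValue = TRUE)
tea_area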
layer value area
1 1 0 2421.286
2 1 1 1063.942
tea_area |>
mutate(Class = case_when(value == 1 ~ "Tea",
value == 0 ~ "Non-tea",
TRUE ~ as.character(value))) |>
ggplot(aes(Class, area, fill = Class)) +
geom_col() +
# Add text labels on top of the bars
geom_text(aes(label = round(area)), # Display the area value, rounded to the nearest whole number
vjust = -0.5, # Adjust vertical position (move up)
color = "black", # Set text color
size = 5) + # Set text size
scale_fill_manual(values = c("Non-tea" = "gray70", "Tea" = "darkgreen")) +
theme_classic() +
# Optional: Adjust y-axis limits to make room for the text
ylim(0, max(tea_area$area) * 1.1) # Extend the y-axis limit by 10% to prevent clipping
10 Session summary
In this session, we have demonstrated how to create a binary tea map using a machine learning procedure and the recently developed Google Embeddings dataset. We will check how this prediction compares with those from the other methods.
11 Assignments (Optional)
Practical exercise ==> Build a spatial ML model using Google Embeddings data to create a binary tea map and calculate the area under tea for your area of interest.
Practical exercise (optional) ==> Compare the prediction results from Google Embeddings data with previous Sentinel-2 models for the same region.
Conceptual question ==> Why might Google Embeddings improve predictions compared to traditional spectral bands?
- Encodes high-dimensional, pre-learned features capturing complex land cover patterns.
- May generalize better in heterogeneous or mixed-use landscapes.
- Helps model subtle differences between tea and non-tea areas not captured by raw bands.
Applied question ==> What are key considerations when using a new data source (like Google Embeddings) for spatial ML modeling?
- Ensure training points are accurately labeled (verification is only possible from optical bands, since embeddings are not visually interpretable).
- Harmonize the projection/CRS between raster and vector data.
- Evaluate model performance and variable importance.
- Check consistency and applicability across regions or time periods.