Background

Accurate and timely crop yield estimation is essential for ensuring food security, monitoring food inflation, and supporting agricultural decision-making. Traditional methods of crop yield estimation often rely on ground surveys and historical data, which can be time-consuming, labor-intensive, and prone to inaccuracies. With advancements in technology, remote sensing and machine learning have emerged as powerful tools that can significantly enhance the precision and efficiency of crop yield prediction. This research aims to develop a comprehensive framework that combines remote sensing data and seasonal agricultural survey data with machine learning techniques to predict crop yields accurately. Rwanda, with its diverse agroecological zones and reliance on agriculture as a primary source of livelihood, presents a unique and challenging environment for crop yield estimation. The country’s topography, varying climatic conditions, and smallholder farming systems necessitate a robust and adaptable approach to yield prediction. This study integrates Landsat 8 satellite imagery with data from the Rwanda Seasonal Agricultural Survey to harness high-resolution spatial and temporal data for reliable crop yield predictions.

Study Gap

This research addresses several gaps in the current literature. While previous studies have explored the use of remote sensing and machine learning for crop yield estimation, few have focused specifically on the unique agroecological conditions of Rwanda. By tailoring the models to the specific challenges and characteristics of Rwandan agriculture, this study aims to enhance the applicability and accuracy of crop yield predictions in this context. Additionally, the integration of high-resolution satellite imagery with ground-truth data provides a more comprehensive and detailed understanding of crop growth dynamics.

Methods

Data Source

The methodology involves several key steps. First, remotely sensed data sourced from the NASA POWER Data Access Viewer are pre-processed and aggregated by Rwanda's cropping regions to extract relevant climate features such as temperature, precipitation, and humidity. These features serve as indicators of crop health and growth conditions. Next, the extracted features are combined with ground-truth agricultural data from the Rwanda Seasonal Agricultural Survey, including crop types, planting seasons, harvested and cultivated areas, and historical yield records. This integrated dataset provides a rich source of information for training machine learning models.

Modelling

Machine learning algorithms will be employed to capture the complex relationships and patterns that influence crop productivity, including penalized regression models (ridge and lasso, whose penalties mitigate multicollinearity) and neural networks. These models will be trained and validated on the integrated dataset, ensuring that they generalize well across Rwanda's regions and cropping seasons. Model performance will be evaluated using metrics such as root mean square error (RMSE), mean absolute error (MAE), and R-squared to ensure accuracy and reliability.

Study Impacts

The expected outcomes of this research are multifaceted. Firstly, it will provide estimated yields for major crops one season ahead, so that policymakers, farmers, and agricultural stakeholders can plan accordingly. Accurate crop yield predictions can inform agricultural planning, resource allocation, and early warning systems for food security. Secondly, the research will contribute to the advancement of precision agriculture by demonstrating the potential of remote sensing and machine learning technologies to enhance agricultural productivity and sustainability. Lastly, the findings and methodologies developed in this study can be adapted and applied to other regions with similar agricultural challenges, promoting the global advancement of precision farming practices.

Data Processing

The term feature engineering refers to the data cleansing and preparation processes carried out to achieve a better-performing predictive model.

Automatic Exploratory Data Analysis (EDA)

There are several packages for automatic EDA in R. We intentionally used the SmartEDA R package, which allows us to generate a complete HTML report using the ExpReport function. For more automatic exploration packages, visit Automatic EDA.
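For illustration, a minimal sketch of generating the report, assuming the combined data frame is named df and writing to an illustrative output file:

library(SmartEDA)
# Generate a complete HTML EDA report (file and directory names are illustrative)
ExpReport(df, op_file = "eda_report.html", op_dir = getwd())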

In addition, since we will be using keras and tensorflow, a Python environment and an R interface to Python need to be set up. We used the reticulate library to interface R with Python. Make sure Python is installed on your machine so a virtual environment (venv) can be created; otherwise, first install it from PYTHON or use reticulate::install_python(version = '<version>') and specify a stable version of your choice (e.g., 3.8.10). The process is shown below:
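A minimal sketch of that setup, assuming the environment name r-keras and Python 3.8.10 (both illustrative choices):

library(reticulate)
# install_python(version = "3.8.10")         # only if Python is not yet installed
virtualenv_create("r-keras")                  # create a virtual environment
use_virtualenv("r-keras", required = TRUE)    # bind the R session to it
# keras::install_keras(envname = "r-keras")  # installs keras and tensorflow into the venv
library(keras)
library(tensorflow)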

Multicollinearity

High correlation among the variables is tested using the Pearson correlation coefficient.
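As a sketch, assuming df_train holds the training data and the corrplot package is available for visualization:

library(dplyr)
# Pearson correlation among the numeric predictors
cor_mat <- df_train |>
  select(where(is.numeric)) |>
  cor(method = "pearson", use = "pairwise.complete.obs")

# Visualize the correlation structure
corrplot::corrplot(cor_mat, method = "color", type = "upper")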

The results show that, even after linearly combining the highly correlated variables, we still faced multicollinearity among some features. Drawing on existing knowledge, we address this with regularization techniques that handle multicollinearity by adding a penalty to the regression coefficients (ridge or lasso regression). This has been achieved with ridge and lasso models by setting the engine to glmnet with penalty = 0.1, and mixture = 0 for ridge or mixture = 1 for lasso.
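In parsnip syntax, the settings just described translate to the following specifications (a sketch of the setup above):

library(tidymodels)
# Ridge: quadratic penalty (mixture = 0); Lasso: absolute-value penalty (mixture = 1)
ridge_model <- linear_reg(penalty = 0.1, mixture = 0) |>
  set_engine("glmnet") |>
  set_mode("regression")

lasso_model <- linear_reg(penalty = 0.1, mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("regression")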

One-hot Encoding

This serves to create dummy (indicator) variables for the categorical predictors so they can enter the regression models as numeric columns.
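One way to achieve this within the tidymodels ecosystem is with the recipes package; a minimal sketch, assuming df_train is the training data:

library(recipes)
# One-hot encode all nominal (categorical) predictors
dummy_rec <- recipe(yield ~ ., data = df_train) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

# Inspect the encoded training data
prep(dummy_rec) |> bake(new_data = NULL) |> head()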

Machine Learning Pipeline

The ML pipeline consists of data resampling, model fitting, model performance evaluation, and predictive or estimation analysis.

Data Importing

We will be working with the combined dataset, in which we merged the Rwanda Seasonal Agricultural Survey with seasonally aggregated agro-meteorological data. Each row represents a yield in Kg/Ha for a given crop and season, spanning agricultural Season A 2018 through Season A 2024, together with the associated agro-meteorological information. The data has been imported using the rio package.
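A sketch of the import step (the file name is a placeholder):

library(rio)
# Import the combined survey and agro-meteorological dataset
df <- import("rwanda_sas_agromet_combined.csv")   # illustrative file name
str(df)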

Take a moment to explore these data sets below.

Data Resampling

The first step in building ML models is to split the original data into a training and test set. We then perform all feature engineering and model fitting tasks on the training set and use the test set as an independent assessment of our model’s performance accuracy.

We will be using the initial_split() function from rsample to partition the yield data into training and test sets. Remember to always use set.seed() to ensure your results are reproducible.
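A sketch of that split, assuming a 75/25 proportion:

library(rsample)
# Reproducible train/test split (the proportion is an assumption)
set.seed(12345)
df_split <- initial_split(df, prop = 0.75)
df_train <- training(df_split)
df_test  <- testing(df_split)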

Model Specifications

There exists a myriad of models that can be trained, depending on the nature of the problem to be addressed, and for every model type there are numerous packages (or engines) in R that can be used. The parsnip package from the tidymodels ecosystem acts as an aggregator across the various modeling engines within R, which makes it easy to implement machine learning algorithms from different R packages with one unifying syntax. Given the nature of our problem, linear relationships will be modeled. To specify a model object with parsnip, we must:

  • Pick a model type
  • Set the engine
  • Set the mode (either regression or classification)

Linear regression is implemented with the linear_reg() function in parsnip. To set the engine and mode, we use set_engine() and set_mode() respectively. Each of these functions takes a parsnip object as an argument and updates its properties.

To explore all parsnip models, please see the documentation where you can search by keyword.

Let’s create different regression model objects, each with its respective engine from the parsnip package. With reference to the No Free Lunch principle, we shall select the best-performing model for yield estimation and prediction based on the most powerful predictive variables.
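A sketch of the remaining model objects referenced in the workflows below (ridge_model and lasso_model were specified earlier); the engine names follow parsnip conventions:

# Ordinary least squares and GLM baselines
lm_model  <- linear_reg() |> set_engine("lm") |> set_mode("regression")
glm_model <- linear_reg() |> set_engine("glm") |> set_mode("regression")

# Network-backed linear model via keras (the penalty value is illustrative)
keras_model <- linear_reg(penalty = 0.1) |> set_engine("keras") |> set_mode("regression")

# Bayesian linear regression via rstanarm
stan_model <- linear_reg() |> set_engine("stan") |> set_mode("regression")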

Model Fitting

We intentionally created six different linear model objects, to be fitted on the training split and then evaluated on a standalone test set, so that the optimal model can be selected for final estimation and prediction.

Now we are ready to train our model object on the training data. We can do this using the fit() function from the parsnip package. The fit() function takes the following arguments:

  • a parsnip model object specification
  • a model formula
  • a data frame with the training data

The code below trains our specified model objects on the training data. In our formula, we have specified that yield is the response or target variable and the remaining variables (refer to the data) are our predictor variables.

Model Training
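As a sketch, fitting one of the specifications (the same pattern applies to each model object):

# Train the baseline linear model; lm_fit is inspected in the next section
lm_fit <- lm_model |>
  fit(yield ~ ., data = df_train)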

Results Exploration

Model Summary Statistics

It is well known among R users, especially ML engineers, that most model objects in R are stored as specialized lists.

The lm_fit object is a list that contains all of the information about how our model was trained, as well as the detailed results. Let’s use the names() function to print the named objects stored within lm_fit.
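A sketch:

# Print the named components of the fitted parsnip object
names(lm_fit)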

The important objects are fit and preproc. These contain the trained model and pre-processing steps (if any are used), respectively.

To print a summary of our model, we can extract fit from lm_fit and pass it to the summary() function, or we can simply use tidy() from the broom package (part of the tidymodels ecosystem) to see the estimated coefficients. We can explore the estimated coefficients, F-statistic, p-values, residual standard error (closely related to the RMSE), and the \(R^2\) value.
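A sketch of both routes:

# Base-R summary of the underlying lm object
summary(lm_fit$fit)

# Tidy coefficient table and model-level statistics via broom
broom::tidy(lm_fit)
broom::glance(lm_fit)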

The tidy() approach is particularly convenient for exploring our results on the training data, since they are returned as a data frame.

Performance Resample

We can use the plot() function to obtain diagnostic plots for our trained regression model. Again, we must first extract the fit object from lm_fit and then pass it into plot(). These plots provide a check for the main assumptions of the linear regression model.
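A sketch:

# Base-R diagnostic plots (residuals vs fitted, Q-Q, scale-location, leverage)
par(mfrow = c(2, 2))
plot(lm_fit$fit)
par(mfrow = c(1, 1))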

# Prepare the training data: Banana crop only, numeric predictors
xytrain <- df_train |>
                select(-year) |> 
                filter(crop_types == "Banana") |> 
                select(-crop_types) |> 
                mutate_if(is.factor, as.numeric)

# Define the resampling strategy
set.seed(12345)
cv_folds <- vfold_cv(xytrain, v = 5)

# Define a workflow for each model
workflow_1 <- workflow() %>% add_model(lm_model) %>% add_formula(yield ~ .)
workflow_2 <- workflow() %>% add_model(glm_model) %>% add_formula(yield ~ .)
workflow_3 <- workflow() %>% add_model(ridge_model) %>% add_formula(yield ~ .)
workflow_4 <- workflow() %>% add_model(lasso_model) %>% add_formula(yield ~ .)
workflow_5 <- workflow() %>% add_model(keras_model) %>% add_formula(yield ~ .)
workflow_6 <- workflow() %>% add_model(stan_model) %>% add_formula(yield ~ .)
# Train and evaluate models using cross-validation
results_1 <- fit_resamples(workflow_1, resamples = cv_folds, metrics = metric_set(rmse, mae))
results_2 <- fit_resamples(workflow_2, resamples = cv_folds, metrics = metric_set(rmse, mae))
results_3 <- fit_resamples(workflow_3, resamples = cv_folds, metrics = metric_set(rmse, mae))
results_4 <- fit_resamples(workflow_4, resamples = cv_folds, metrics = metric_set(rmse, mae))
results_5 <- fit_resamples(workflow_5, resamples = cv_folds, metrics = metric_set(rmse, mae))
## 1/1 - 0s - 72ms/epoch - 72ms/step
## 1/1 - 0s - 69ms/epoch - 69ms/step
## 1/1 - 0s - 44ms/epoch - 44ms/step
## 1/1 - 0s - 52ms/epoch - 52ms/step
## 1/1 - 0s - 61ms/epoch - 61ms/step
results_6 <- fit_resamples(workflow_6, resamples = cv_folds, metrics = metric_set(rmse, mae))


# Combine results into a single dataframe
all_results <- bind_rows(
  collect_metrics(results_1) %>% mutate(model = "LM"),
  collect_metrics(results_2) %>% mutate(model = "GLM"),
  collect_metrics(results_3) %>% mutate(model = "GLM_Ridge"),
  collect_metrics(results_4) %>% mutate(model = "GLM_Lasso"),
  collect_metrics(results_5) %>% mutate(model = "LM_Keras"),
  collect_metrics(results_6) %>% mutate(model = "LM_Bayesian")
)

# Create a summary table
summary_table <- all_results %>%
  select(model, .metric, mean) %>%
  pivot_wider(names_from = .metric, values_from = mean) |> 
  arrange(rmse,mae)

flextable(summary_table)

model        mae        rmse
GLM_Lasso    0.2521782  0.3122995
LM_Bayesian  0.3572395  0.3792184
LM           0.3630216  0.3941495
GLM          0.3630216  0.3941495
GLM_Ridge    0.4262406  0.4798674
LM_Keras     0.7004091  0.7518982

# Plot the performance metrics
ggplot(all_results, aes(x = model, y = mean, fill = .metric)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  facet_wrap(~ .metric, scales = "free") +
  labs(title = "Model Performance Comparison", x = "Model", y = "Mean Metric Value") +
  theme_minimal()

Model Selection

From the obtained results, the lasso regression model (GLM_Lasso) outperforms the other trained models, since it has both the smallest MAE and the smallest RMSE. Why does it outperform the rest? The answer is simple: penalized regression deals with multicollinearity as follows:

  • Ridge adds a penalty equal to the sum of the squared coefficients. It helps to handle multicollinearity by shrinking coefficients.

  • Lasso adds a penalty equal to the sum of the absolute values of the coefficients. It can shrink some coefficients exactly to zero, effectively selecting a simpler model; the two objectives are formalized below.
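Formally, writing \(\lambda\) for the penalty and \(\beta_j\) for the coefficients, the two objectives are:

\[
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert
\]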

# Data frame of estimated coefficients
#tidy(lasso_fit)
tidy(ridge_fit)
## # A tibble: 7 × 3
##   term             estimate penalty
##   <chr>               <dbl>   <dbl>
## 1 (Intercept)        1.16       0.1
## 2 seasonsB           0.0232     0.1
## 3 arable_size        0.706      0.1
## 4 precipitation      0.167      0.1
## 5 surface_pressure   0.890      0.1
## 6 temperatures       0.289      0.1
## 7 soil_wetness      -0.0182     0.1
#summary(ridge_fit$fit)
# Performance metrics on training data
#glance(lasso_fit)
#glance(ridge_fit)

#tidy(stan_fit)
#tidy(glm_fit)

We can also use the vip() function to plot the variable importance of each predictor in our model. The importance values are derived from the test statistics and estimated coefficients of the trained model object.

#Variable importance
#vip(lasso_fit)
vip(ridge_fit)

#vip(glm_fit)

Test Evaluation

To assess the accuracy of our selected regression model, lasso_fit, we must use it to make predictions on our test data. This is done with the predict() function from parsnip. This function takes two important arguments:

  • a trained parsnip model object
  • new_data for which to generate predictions

The code below uses the predict() function to generate a data frame with a single column, .pred, which contains the predicted yield values given the crop information and meteorological data.
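A sketch of that evaluation, assuming lasso_fit is the trained lasso model and df_test the held-out test set:

# Generate test-set predictions (.pred column) and join them to the truth
yield_preds <- predict(lasso_fit, new_data = df_test)

test_results <- df_test |>
  dplyr::select(yield) |>
  dplyr::bind_cols(yield_preds)

# RMSE, R^2 and MAE on the test data
yardstick::metrics(test_results, truth = yield, estimate = .pred)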

Conclusion

In conclusion, this research aims to develop a robust framework for crop yield estimation in Rwanda by leveraging remote sensing data, seasonal agricultural survey data, and machine learning techniques. By integrating high-resolution satellite imagery with ground-truth agricultural data, the study seeks to provide accurate and timely crop yield predictions, addressing key gaps in the literature and contributing to the advancement of precision agriculture in Rwanda and beyond.