Accurate and timely crop yield estimation is essential for ensuring food security, monitoring food inflation, and supporting agricultural decision-making. Traditional methods of crop yield estimation often rely on ground surveys and historical data, which can be time-consuming, labor-intensive, and prone to inaccuracies. With advancements in technology, remote sensing and machine learning have emerged as powerful tools that can significantly enhance the precision and efficiency of crop yield prediction. This research aims to develop a comprehensive framework that combines remote sensing data and seasonal agricultural survey data with machine learning techniques to predict crop yields accurately. Rwanda, with its diverse agroecological zones and reliance on agriculture as a primary source of livelihood, presents a unique and challenging environment for crop yield estimation. The country’s topography, varying climatic conditions, and smallholder farming systems necessitate a robust and adaptable approach to yield prediction. This study integrates Landsat 8 satellite imagery with data from the Rwanda Seasonal Agricultural Survey to harness high-resolution spatial and temporal data for reliable crop yield predictions.
This research addresses several gaps in the current literature. While previous studies have explored the use of remote sensing and machine learning for crop yield estimation, few have focused specifically on the unique agroecological conditions of Rwanda. By tailoring the models to the specific challenges and characteristics of Rwandan agriculture, this study aims to enhance the applicability and accuracy of crop yield predictions in this context. Additionally, the integration of high-resolution satellite imagery with ground-truth data provides a more comprehensive and detailed understanding of crop growth dynamics.
The methodology involves several key steps. First, remotely sensed data sourced from the POWER Data Access Viewer are pre-processed and analyzed by Rwandan cropping region to extract relevant climate features such as temperature, precipitation, and humidity. These features serve as indicators of crop health and growth conditions. Next, the extracted features are combined with ground-truth agricultural data from the Rwanda Seasonal Agricultural Survey, including crop types, planting seasons, harvested and cultivated areas, and historical yield records. This integrated dataset provides a rich source of information for training machine learning models.
Machine learning algorithms, including regression models with ridge or lasso penalties to mitigate multicollinearity, as well as neural networks, will be employed to capture the complex relationships and patterns that influence crop productivity. These models will be trained and validated using the integrated dataset, ensuring that they can generalize well to different regions and cropping seasons in Rwanda. The performance of the models will be evaluated using metrics such as root mean square error (RMSE), mean absolute error (MAE), and R-squared to ensure their accuracy and reliability.
The expected outcomes of this research are multifaceted. Firstly, it will provide estimated yields for major crops a season ahead so that policymakers, farmers, and agricultural stakeholders can plan accordingly. Accurate crop yield predictions can inform agricultural planning, resource allocation, and early warning systems for food security. Secondly, the research will contribute to the advancement of precision agriculture by demonstrating the potential of remote sensing and machine learning technologies to enhance agricultural productivity and sustainability. Lastly, the findings and methodologies developed in this study can be adapted and applied to other regions with similar agricultural challenges, promoting the global advancement of precision farming practices.
The term feature engineering refers to the data cleansing and preparation processes used to achieve a better predictive model.
There are several packages for automatic EDA in R. We have intentionally used the SmartEDA package, which allows us to generate a complete HTML report using the ExpReport() function. For more automatic exploration packages, see the Automatic EDA resources.
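As a minimal sketch, assuming the combined data set has already been loaded into a data frame named df (the target variable and output file name below are illustrative), the report can be generated as follows:
# Automated EDA report with SmartEDA
library(SmartEDA)
ExpReport(df, Target = "yield", op_file = "eda_report.html")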
In addition, we will be using keras and tensorflow, so the Python environment and the R interface with Python need to be set up. We used the reticulate library to interface R with Python. Make sure Python is installed on your machine before creating a venv; otherwise, first install it from the Python website or use reticulate::install_python(version = '<version>') and specify a stable version of your choice (e.g. 3.8.10). The process is shown below:
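Here is a sketch of that setup; the environment name "r-reticulate" and the Python version are illustrative choices, not fixed requirements:
library(reticulate)

# Install a stable Python version if none is available (optional)
# install_python(version = "3.8.10")

# Create a virtual environment and install the Python backends into it
virtualenv_create("r-reticulate")
virtualenv_install("r-reticulate", packages = c("tensorflow", "keras"))

# Point the R session to that environment, then load the R interfaces
use_virtualenv("r-reticulate", required = TRUE)
library(keras)
library(tensorflow)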
Correlation among the variables is assessed using the Pearson correlation coefficient. The results show that, even after linearly combining the most highly correlated variables, we still faced the issue of multicollinearity among some features. We address this by using regularization techniques that add a penalty to the regression coefficients, namely ridge or lasso regression. This has been achieved by fitting ridge and lasso models with the glmnet engine, setting penalty = 0.1 and mixture = 0 for ridge or mixture = 1 for lasso.
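A minimal sketch of these two model specifications with parsnip, using the penalty setting described above:
library(tidymodels)

# Ridge regression: mixture = 0 keeps only the squared (L2) penalty
ridge_model <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

# Lasso regression: mixture = 1 keeps only the absolute-value (L1) penalty
lasso_model <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("regression")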
This serves to create dummy variables for the categorical predictors so that they can be used in the regression models.
The ML pipeline consists of data resampling, model fitting, model performance evaluation, and predictive or estimation analysis.
We will be working with a combined data set that merges the Rwanda Seasonal Agricultural Survey with seasonally aggregated agro-meteorological data. Each row represents the yield in kg/ha for a given crop and season, ranging from agricultural season A 2018 to season A 2024, together with the associated agro-meteorological information. The data has been imported using the rio package.
Take a moment to explore these data sets below.
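A minimal import sketch; the file name and path are illustrative, not the actual file used in this study:
library(rio)

# Import the combined survey + agro-meteorological data set
df <- import("data/rwanda_yield_agromet.csv")  # hypothetical file name
head(df)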
The first step in building ML models is to split the original data into a training and a test set. We then perform all feature engineering and model fitting tasks on the training set and use the test set as an independent assessment of our model’s predictive accuracy.
We will be using the initial_split() function from rsample to partition the yield data into training and test sets. Remember to always use set.seed() to ensure your results are reproducible.
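A minimal sketch of the split; the seed value and the training proportion are illustrative:
library(rsample)

set.seed(314)                                   # any fixed seed makes the split reproducible
yield_split <- initial_split(df, prop = 0.75)   # 75% training / 25% testing
df_train <- training(yield_split)
df_test  <- testing(yield_split)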
There exists a myriad of models that can be trained, depending on the nature of the problem to be addressed, and for every model type there are numerous packages (or engines) in R that can be used. Intuitively, the parsnip package from the tidymodels ecosystem acts as an aggregator across the various modeling engines within R. This makes it easy to implement machine learning algorithms from different R packages with one unifying syntax. Based on the nature of our problem, a linear relationship will be modeled. Therefore, to specify a model object with parsnip, we must pick the model type, set the engine, and set the mode (regression or classification).
Linear regression is implemented with the linear_reg() function in parsnip. To set the engine and mode, we use set_engine() and set_mode() respectively. Each one of these functions takes a parsnip object as an argument and updates its properties.
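For example, the baseline linear model object used later can be specified as follows:
# 1. pick the model type, 2. set the engine, 3. set the mode
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")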
To explore all parsnip models, please see the documentation where you can search by keyword.
Let’s create several regression model objects, each with its respective engine, using the parsnip package. With reference to the No Free Lunch principle, we shall select the best-performing model for yield estimation and prediction based on the most powerful predictive variables.
We intentionally created six different linear model objects to be fitted on the training split and then evaluated on the standalone test set, in order to select the optimal model for final estimation and prediction.
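A sketch of the remaining model specifications follows; lm_model, ridge_model, and lasso_model were specified above, and the keras training settings shown here are illustrative:
# Generalized linear model engine
glm_model <- linear_reg() %>%
  set_engine("glm") %>%
  set_mode("regression")

# Linear regression fitted as a single-layer network via keras
keras_model <- linear_reg() %>%
  set_engine("keras", epochs = 50) %>%
  set_mode("regression")

# Bayesian linear regression via rstanarm
stan_model <- linear_reg() %>%
  set_engine("stan") %>%
  set_mode("regression")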
Now we are ready to train our model objects on the training data. We can do this using the fit() function from the parsnip package. The fit() function takes the following arguments: a parsnip model object, a model formula, and a data frame containing the training data.
The code below trains our specified model objects on the training data. In our formula, we have specified that yield is the response or target variable and the remaining variables (refer to the data) are our predictor variables.
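For example, the baseline linear model can be fitted as follows; the other model objects are fitted in the same way:
lm_fit <- lm_model %>%
  fit(yield ~ ., data = df_train)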
It is well known among R users, especially ML engineers, that most model objects in R are stored as specialized lists. The lm_fit object is a list that contains all of the information about how our model was trained, as well as the detailed results. Let’s use the names() function to print the named objects that are stored within lm_fit.
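A quick look at the stored components:
# Inspect the components stored in the fitted parsnip object
names(lm_fit)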
The important objects are fit and preproc. These contain the trained model and the pre-processing steps (if any are used), respectively.
To print a summary of our model, we can extract fit from lm_fit and pass it to the summary() function, or we can simply use tidy() from the broom package (part of the tidyverse ecosystem) to see the estimated coefficients. We can explore the estimated coefficients, F-statistics, p-values, the residual standard error (closely related to the RMSE), and the \(R^2\) value.
However, the tidy() approach is particularly convenient for exploring our results on the training data, since they are returned as a data frame.
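Both routes are sketched below:
# Classic summary of the underlying lm object
summary(lm_fit$fit)

# Tidy data frame of the estimated coefficients
library(broom)
tidy(lm_fit)

# One-row summary of fit statistics (R-squared, sigma, etc.)
glance(lm_fit)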
We can use the plot() function to obtain diagnostic plots for our trained regression model. Again, we must first extract the fit object from lm_fit and then pass it into plot(). These plots provide a check on the main assumptions of the linear regression model.
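A minimal sketch:
# Arrange the four base-R diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(lm_fit$fit)
par(mfrow = c(1, 1))  # reset the plotting layout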
# Prepare the training data (Banana crop only, numeric predictors)
xytrain <- df_train |>
  select(-year) |>
  filter(crop_types == "Banana") |>
  select(-crop_types) |>
  mutate_if(is.factor, as.numeric)

# Define the resampling strategy
set.seed(12345)
cv_folds <- vfold_cv(xytrain, v = 5)

# Define a workflow for each model
workflow_1 <- workflow() %>% add_model(lm_model) %>% add_formula(yield ~ .)
workflow_2 <- workflow() %>% add_model(glm_model) %>% add_formula(yield ~ .)
workflow_3 <- workflow() %>% add_model(ridge_model) %>% add_formula(yield ~ .)
workflow_4 <- workflow() %>% add_model(lasso_model) %>% add_formula(yield ~ .)
workflow_5 <- workflow() %>% add_model(keras_model) %>% add_formula(yield ~ .)
workflow_6 <- workflow() %>% add_model(stan_model) %>% add_formula(yield ~ .)
# Train and evaluate models using cross-validation
results_1 <- fit_resamples(workflow_1, resamples = cv_folds, metrics = metric_set(rmse, mae))
results_2 <- fit_resamples(workflow_2, resamples = cv_folds, metrics = metric_set(rmse, mae))
results_3 <- fit_resamples(workflow_3, resamples = cv_folds, metrics = metric_set(rmse, mae))
results_4 <- fit_resamples(workflow_4, resamples = cv_folds, metrics = metric_set(rmse, mae))
results_5 <- fit_resamples(workflow_5, resamples = cv_folds, metrics = metric_set(rmse, mae))
results_6 <- fit_resamples(workflow_6, resamples = cv_folds, metrics = metric_set(rmse, mae))
# Combine results into a single dataframe
all_results <- bind_rows(
collect_metrics(results_1) %>% mutate(model = "LM"),
collect_metrics(results_2) %>% mutate(model = "GLM"),
collect_metrics(results_3) %>% mutate(model = "GLM_Ridge"),
collect_metrics(results_4) %>% mutate(model = "GLM_Lasso"),
collect_metrics(results_5) %>% mutate(model = "LM_Keras"),
collect_metrics(results_6) %>% mutate(model = "LM_Bayesian")
)
# Create a summary table
summary_table <- all_results %>%
select(model, .metric, mean) %>%
pivot_wider(names_from = .metric, values_from = mean) |>
arrange(rmse,mae)
flextable(summary_table)
model | mae | rmse |
---|---|---|
GLM_Lasso | 0.2521782 | 0.3122995 |
LM_Bayesian | 0.3572395 | 0.3792184 |
LM | 0.3630216 | 0.3941495 |
GLM | 0.3630216 | 0.3941495 |
GLM_Ridge | 0.4262406 | 0.4798674 |
LM_Keras | 0.7004091 | 0.7518982 |
# Plot the performance metrics
ggplot(all_results, aes(x = model, y = mean, fill = .metric)) +
geom_bar(stat = "identity", position = position_dodge()) +
facet_wrap(~ .metric, scales = "free") +
labs(title = "Model Performance Comparison", x = "Model", y = "Mean Metric Value") +
theme_minimal()
From the obtained results, the lasso regression model outperforms the other trained models and is selected as the most powerful, since it has both the smallest MAE and the smallest RMSE. Why does it outperform the others? The answer is simple: regularization deals with multicollinearity as follows.
Ridge adds a penalty equal to the sum of the squared coefficients. It helps to handle multicollinearity by shrinking coefficients.
Lasso adds a penalty equal to the sum of the absolute values of the coefficients. It can shrink some coefficients exactly to zero, effectively selecting a simpler model.
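Formally, both estimators minimize the residual sum of squares plus a penalty term; in the glmnet parameterization used above, penalty corresponds to \(\lambda\) and mixture to the mixing proportion between the two penalties (0 for ridge, 1 for lasso):
\[
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\big(y_i - x_i^{\top}\beta\big)^2 + \lambda \sum_{j=1}^{p}\beta_j^{2},
\qquad
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\big(y_i - x_i^{\top}\beta\big)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
\]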
# Data frame of estimated coefficients
#tidy(lasso_fit)
tidy(ridge_fit)
## # A tibble: 7 × 3
## term estimate penalty
## <chr> <dbl> <dbl>
## 1 (Intercept) 1.16 0.1
## 2 seasonsB 0.0232 0.1
## 3 arable_size 0.706 0.1
## 4 precipitation 0.167 0.1
## 5 surface_pressure 0.890 0.1
## 6 temperatures 0.289 0.1
## 7 soil_wetness -0.0182 0.1
#summary(ridge_fit$fit)
# Performance metrics on training data
#glance(lasso_fit)
#glance(ridge_fit)
#tidy(stan_fit)
#tidy(glm_fit)
We can also use the vip() function to plot the variable importance of each predictor in our model. The importance values are derived from the magnitude of the estimated coefficients (or the associated test statistics) in our trained model object.
#Variable importance
#vip(lasso_fit)
vip(ridge_fit)
#vip(glm_fit)
To assess the accuracy of our selected linear regression model, lasso_fit, we must use it to make predictions on our test data. This is done with the predict() function from parsnip. This function takes two important arguments: a trained parsnip model object, and new_data for which to generate predictions.
The code below uses the predict() function to generate a data frame with a single column, .pred, which contains the predicted yield values given the crop information and meteorological data.
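A sketch of the prediction and test-set evaluation, assuming the test split is stored as df_test:
# Predict on the held-out test data and attach the observed yields
yield_preds <- predict(lasso_fit, new_data = df_test) %>%
  bind_cols(df_test %>% select(yield))

# Test-set performance (RMSE, R-squared, MAE) with yardstick
yield_preds %>%
  metrics(truth = yield, estimate = .pred)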
In conclusion, this research aims to develop a robust framework for crop yield estimation in Rwanda by leveraging remote sensing data, seasonal agricultural survey data, and machine learning techniques. By integrating high-resolution satellite imagery with ground-truth agricultural data, the study seeks to provide accurate and timely crop yield estimates and predictions, addressing key gaps in the literature and contributing to the advancement of precision agriculture in Rwanda and beyond.