This investigation builds upon the Data Geeks group project that investigated combining agricultural commodity data with environmental factors such as climate and soil conditions to model the influence these geographical based factors have on agricultural yield. The previous investigation was limited to data for grapes and simple linear regression models.
This study furthers that work by expanding the dataset to 21 different agricultural commodities grown widely in Australia. I investigate methods of analysing and modelling this dataset which is divided up by commodities that each respond differently to environmental variables. The modelling tools developed could be used to predict how changes in environmental factors could impact the output of various agricultural commodities in Australia. Alternatively, they could be used to identify areas that are well suited for growing a particular commodity or vice versa. I have sought to answer the following research question:
Can multilevel models be used to analyse the interaction between geographic environmental factors and the yields of different agricultural commodities grown in Australia?
In this study I have first modelled the commodity groups as a fixed effect in a linear regression model - the baseline. I have then introduced the commodity groups as random effects in a multilevel model. This study assesses the performance of a number of multilevel models, discusses the advantages of such an approach and identifies future steps to be explored.
Crop yield is a measure of how efficiently a commodity is produced within an area. In this dataset the yield is either measured in t/ha or kg/tree, depending on the commodity type. The agricultural yield may be affected by a range of environmental, societal and biological factors. The initial study which focused on grapes on the effect of climate, soil conditions, irrigation and the grape varieties used. As the irrigation and variety data is data only collected by the wine industry in Australia and is not available for other commodities the new study focuses only on climate and soil variables, however the variables from these sources have been expanded.
It is to be expected that different agricultural commodities will have significantly different yields due to the inherent biological differences and farming practices. Because of this yields will differ in both absolute amount and also how they respond to various external factors.
Multilevel models, also known as linear mixed effect models amongst other things, are well suited for modelling data with a clustered or a hierarchical structure. The input agricultural data included variables that explicitly defined a structure of this kind so such a model seems appropriate. This structure is demonstrated with a sample of the categories in the Figure 1 diagram. Level 1 of the structure is the observations which are each associated with an SA2 region. Level 2 is the agricultural commodity. For each observation the yield is of a particular commodity and the predictor variables are extracted from areas within that region where the commodity is grown. Using these two levels we can create a two level nested model. We could also add Level 3 to create a three level nested model, for example the broad type of commodity as defined in the input source.
Note that in this case Level 1 will feature multiple observations in a single region, but for different commodities. We could create an additional (non-nested) level based on region.
Another way of thinking about how these types of models work is to consider their other name - mixed effect models. The mixed effects refer to a combination of fixed and random effects. Whether variables are designated as fixed or random effects will depend on the dataset and goals of the investigation. In multilevel models the groups as shown in Figure 1 are designated as random" effects, allowing each group to have its own mean and variance.
The use of multilevel models can have the following advantages:
The data was prepared using the same procedures employed during the group project stage. The basic steps are as follows:
The dataset was expanded to 21 different agricultural commodities by merging the yield data for each commodity with the predictor variables at the landuse areas where these commodities are grown. The commodities with the most landuse areas were selected to maximise the number of observations within each group and improve the potential performance of the model.
Figure 2 shows number of observations (SA2 regions) in the merged dataset for each commodity. Grapes which was used in the initial model are present in the most regions. Most of the commodities have between 10-30 observations and four commodities have less than 10. For multilevel models to perform well at least 5 groups is required. However repeated observations within groups is also necessary for good performance. To best satisfy these constraints the top 5 commodities that are measured in t/ha have been selected for modelling (grapes, olives, sugar cane, cotton and wheat).
Figure 3 shows that the yield for sugar cane are significantly higher than the other commodities. This difference may pose an issue for modelling.
The geographic spread of the observations available in the dataset for the chosen commodities is shown in Figure 4. The grape data is most widespread which was part of the reason for selection for the initial model. While the wheat region in South Australia is present in the data Western Australia’s wheat belt is not and landuse data is was not available for this commodity and region. From this geographic spread we can expect a large range of environmental variables.
## Warning: Duplicated aesthetics after name standardisation: colour
First a linear regression model was run on the whole dataset to serve as a baseline. By doing this we are designating commodity as a fixed effect (along with all the other predictors). An Elastic Net regression model was run with 70% of the observations randomly sampled to create the train set and the remaining 30% used for an out-of-sample test set. In both models the predictor variables were centred and scaled and a five-fold cross validation run. The variables used, after filtering for multicollinearity, are shown in Figure 5.
As expected certain commodity categories appear to be the most influential features, Sugar cane is having the most significant influence as it produces much higher yields. A range of different climate and soil features also appear to be significant including soil pH and depth, the minimum temperature during winter months, the variability or rainfall and the total rainfall of particular months.
The results of the models are shown in Table 1 and the diagnostic residual plots are shown in Figure 6. This model’s performance is not as good as what was achieved in the previous model that only included grapes (scatter index = 0.36). The diagnostic plots indicate the model may be influenced outliers from the sugar cane subset and that the data is somewhat skewed. How these issues are handled by the multilevel models will be investigated in the next section.
| Metric | Value |
|---|---|
| rmse for test set | 11.3 |
| rmse train/test ratio | 0.84 |
| Scatter index | 0.56 |
The same test/train setup and pre-processing steps were used for all multilevel models. The same features were initially selected as used in the baseline model.
# preprocess by scaling and centering numeric features
yield_preproc_ha <- yield_top_ha %>%
select(-yield) %>%
mutate_if(is.numeric, scale) %>%
mutate(yield = yield_top_ha$yield)
# set train and test
train_size <- floor(0.7 * nrow(yield_preproc_ha))
set.seed(11)
train_n <- sample(seq_len(nrow(yield_preproc_ha)), size = train_size)
train <- yield_preproc_ha[train_n, ]
test <- yield_preproc_ha[-train_n, ]
# check sizes
nrow(yield_top_ha) == nrow(train) + nrow(test)
## [1] TRUE
# selected fixed effect features
fixed_feat <- "ph + minmaysep_temp + soil_depth + clay + varmay_rain + rainjul + bulk_density + phosphorus"
Model 1 was a random intercept model with a fixed mean. This models random effect intercepts for each commodity group while using one global slope across all models and is represented by: \(yield \sim environmental\_features + (1|commodity)+ \epsilon\)
Here yield is response variable, the environmental features are fixed effects, the commodity intercept is a random effect and \(\epsilon\) is the error.
The following observations can be made from the summary outputs:
# run the multilevel model
lme <- lmer(as.formula(paste("yield ~", fixed_feat, " + (1 | commodity)")), data=train)
# summary outputs
summary(lme)
## Linear mixed model fit by REML ['lmerMod']
## Formula: yield ~ ph + minmaysep_temp + soil_depth + clay + varmay_rain +
## rainjul + bulk_density + phosphorus + (1 | commodity)
## Data: train
##
## REML criterion at convergence: 2122
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.5195 -0.4653 -0.0383 0.3943 7.8169
##
## Random effects:
## Groups Name Variance Std.Dev.
## commodity (Intercept) 1768.05 42.048
## Residual 93.75 9.683
## Number of obs: 287, groups: commodity, 5
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 21.0583 18.8266 1.119
## ph 5.5442 0.9202 6.025
## minmaysep_temp -4.1843 1.0857 -3.854
## soil_depth 3.2725 0.8720 3.753
## clay -2.5426 0.9826 -2.588
## varmay_rain 2.5599 1.0301 2.485
## rainjul 2.5558 0.9492 2.693
## bulk_density -1.7669 0.7318 -2.414
## phosphorus 1.6851 0.9772 1.724
##
## Correlation of Fixed Effects:
## (Intr) ph mnmys_ sl_dpt clay vrmy_r rainjl blk_dn
## ph -0.009
## minmysp_tmp -0.012 -0.231
## soil_depth 0.000 -0.177 -0.162
## clay -0.012 -0.060 0.127 -0.045
## varmay_rain -0.008 0.094 0.106 -0.141 0.016
## rainjul -0.004 0.487 -0.167 -0.146 0.319 0.444
## bulk_densty 0.002 -0.220 -0.150 0.096 0.204 -0.054 0.156
## phosphorus 0.000 -0.030 0.172 -0.221 -0.353 -0.096 -0.115 0.157
coef(lme)
## $commodity
## (Intercept) ph minmaysep_temp soil_depth clay
## cotton -6.441424 5.544188 -4.184349 3.272528 -2.542639
## grapes 7.438580 5.544188 -4.184349 3.272528 -2.542639
## olives 6.037166 5.544188 -4.184349 3.272528 -2.542639
## sugar cane 95.502917 5.544188 -4.184349 3.272528 -2.542639
## wheat 2.754218 5.544188 -4.184349 3.272528 -2.542639
## varmay_rain rainjul bulk_density phosphorus
## cotton 2.559921 2.555772 -1.766887 1.685055
## grapes 2.559921 2.555772 -1.766887 1.685055
## olives 2.559921 2.555772 -1.766887 1.685055
## sugar cane 2.559921 2.555772 -1.766887 1.685055
## wheat 2.559921 2.555772 -1.766887 1.685055
##
## attr(,"class")
## [1] "coef.mer"
Predicting on the test set the results are essential the same as the baseline model.
# Make predictions on train
train_pred <- predict(lme, train)
# Make predictions on test
test_pred <- predict(lme, test)
# calculate rmse and scatter index
rmse_test <- rmse(test_pred, test$yield)
rmse_train <- rmse(train_pred, train$yield)
SI <- rmse_test/mean(yield_top_ha$yield)
rmse_ratio <- rmse_train/rmse_test
round(rmse_test,1)
## [1] 11.3
round(rmse_ratio,2)
## [1] 0.84
round(SI,2)
## [1] 0.56
A number of advantages for applying multilevel models to this type of data are evident. Having direct control over the fixed and random effects and essentially defining the structure of the model is very powerful and allows the user to adapt the model to suit the dataset and aim. It also provides more meaningful feedback on what factors influence the response. The results demonstrate that a multilevel model can outperform a linear regression model on this dataset and through further work could likely be improved further. However it is clear that multilevel models have limitations based on group and observation sizes that need to be considered. In this case the lack of data points meant that more complicated models with additional correlations terms or more nested levels could not be used.
For this dataset running separate elastic net models for each commodity may deliver improved performance. As the dataset is very broad with a selection of environmental variables this would allow the models to select new predictors for each commodity. Multilevel models appear to be better suited to narrower models with a very high number of repeated measurements.
Another alternative approach could be to perform logistic regression instead and predict for commodity, rather than yield. This would still deliver on the broad research aim but would eliminate the need to summarise the results by SA2 region. Instead the whole landuse area dataset could be used which contains over 50,000 observations, compared to region dataset which has 622. It would also allow for more sophisticated geospatial based analysis.
Investigating multilevel models has given me a greater appreciation for complicated interactions that are involved with the agricultural dataset we have created. It has also affirmed that we made the correct decsion selecting grapes for our initial model as it is clearly the commodity best suited for modelling.
Multilevel modelling has been performed on the agricultural dataset to predict the influence of environmental variables of the yield of a selection of commodities. It was found that a two level nested model (with random intercept and slope) was able to outperform a baseline simple regression model. However the complexity of multilevel models needs to be limited when working with datasets with a relatively small number of observations. Further investigation into alternative approaches is needed to overcome this.
D Bates, M Mächler, B Bolker, S Walker 2014. Fitting linear mixed-effects models using lme4, view 8 November 2020, https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf
B Winter 2014. A very basic tutorial for performing linear mixed effects analyses, view 8 November 2020, http://www.bodowinter.com/uploads/1/2/9/3/129362560/bw_lme_tutorial2.pdf
X A Harrison, L Donaldson, M E Correa-Cano, J Evans, D N Fisher, C E D Goodwin, B S Robinson, D J Hodgson, R Inger 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology, view 8 November 2020, https://peerj.com/articles/4794/