Multiple Regression

A paramount concern in agriculture is maximizing crop production, and regression analysis can help address that problem in a number of ways. We can use it to answer questions such as:

  • What are the environmental and meteorological factors that influence crop yield?
  • Given meteorological and/or environmental/spatial information about a crop area during the ongoing growing season, can we predict the crop’s yield for that year?
  • Can we predict crop loss due to inclement weather?

These questions can be answered using a multiple regression model with a continuous response variable (yield, measured in units of quantity) and a combination of continuous and discrete predictor variables.

Of the many factors that may affect a crop’s yield, a number make sensible predictors, such as:

  • Location (are particular regions of the world more or less productive?). If you are comparing countries, the total crop area under production in each country might also be useful.
  • Low, average, high temperatures for individual months or growing periods (growth, bloom, seed)
    • Can be used to derive binary variables such as occurrence of frost or occurrence of overheating
  • Precipitation totals for individual months or growing periods (growth, bloom, seed)
    • Can be used to determine the occurrence of drought

This situation lends itself to linear modeling using multiple regression with continuous and discrete variables.
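In symbols, a model of this kind can be written as below. The notation is an assumption introduced here for illustration: $y_i$ is the yield in year $i$, $x_{ij}$ are the $p$ continuous predictors (for example seasonal precipitation), $d_{ik}$ are the $q$ binary indicators (for example occurrence of frost), and $\varepsilon_i$ is the error term.

```latex
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \sum_{k=1}^{q} \gamma_k d_{ik} + \varepsilon_i, \qquad i = 1, \dots, n
```

Each $\beta_j$ is the expected change in yield per unit change in a continuous predictor with the others held fixed, and each $\gamma_k$ is the expected shift in yield when the corresponding indicator equals 1.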

The data below is gathered from multiple sources for illustrative purposes but does not form a coherent dataset.

US Production & Fresno Meteorological Data

Below is a chart of the US production of a crop from 2014 to 2019. In this case the crop is raisins. We also acquired meteorological data from NOAA for Fresno County in California, an area where raisins are grown.

The plots below show how temperature and precipitation are distributed over the months of the year. We notice a few outliers in both temperature and precipitation, which could be investigated further to determine whether the yield in those years was particularly affected.

Feature Engineering

We manipulate the data to create features that might be of interest. Continuous variables are created for seasonal precipitation, and discrete variables are derived from both the precipitation and temperature data. However, some of the discrete variables have only one level and provide no additional information. These should be removed from the dataset before it is fed to the model.
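As a rough illustration of this kind of feature engineering, the Python sketch below builds seasonal precipitation totals and a few 0/1 indicators from monthly values. The rules actually used to build the columns are not stated here, so everything in the sketch is an assumption: the season boundaries (meteorological seasons), the `WET_WINTER_MIN` cutoff, the reading of `jul_rain` as "any rain in July" and of `colddec` as "a sub-freezing December minimum", and the input numbers, which are invented placeholders rather than NOAA observations.

```python
# Month numbers per season. Meteorological seasons are an assumption;
# column names follow the assembled dataset (including "springprpc").
SEASONS = {
    "winterprcp": (12, 1, 2),
    "springprpc": (3, 4, 5),
    "summerprcp": (6, 7, 8),
    "fallprcp": (9, 10, 11),
}

WET_WINTER_MIN = 8.0  # hypothetical cutoff, same unit as the precipitation data
FROST_MAX_TEMP = 0.0  # freezing point, assuming temperatures in degrees Celsius


def build_features(prcp_by_month, min_temp_by_month):
    """Return seasonal totals (continuous) and 0/1 indicators (discrete)."""
    features = {
        name: round(sum(prcp_by_month[m] for m in months), 2)
        for name, months in SEASONS.items()
    }
    # Discrete variables derived by thresholding the continuous inputs.
    features["wetwinter"] = int(features["winterprcp"] >= WET_WINTER_MIN)
    features["jul_rain"] = int(prcp_by_month[7] > 0)
    features["colddec"] = int(min_temp_by_month[12] <= FROST_MAX_TEMP)
    return features


# Placeholder inputs for one year, invented purely to exercise the function.
prcp = {1: 3.0, 2: 2.5, 3: 1.0, 4: 0.5, 5: 0.2, 6: 0.0,
        7: 0.1, 8: 0.0, 9: 0.0, 10: 0.4, 11: 1.1, 12: 3.2}
min_temp = {month: 5.0 for month in range(1, 13)}
min_temp[12] = -1.5

features = build_features(prcp, min_temp)
print(features)
```

Keeping the thresholds as named constants makes it easy to revisit them once domain knowledge suggests better cutoffs.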

Assembled Dataset
yearlyprcp  jun_rain  jul_rain  aug_rain  wetwinter  drywinter  colddec  coldjan  summerprcp  fallprcp  winterprcp  springprpc  PRODUCTION
      7.25         0         1         0          0          0        0        0        0.02      3.43        3.11        1.23      368408
      9.66         0         1         0          0          0        0        0        0.25      7.00        1.28        1.15      332211
     16.18         0         0         0          1          0        0        0        0.00      5.97        9.66        3.14      352441
     14.34         0         0         0          1          0        1        0        0.15      1.25        8.80        4.74      304723
     11.20         0         0         0          0          0        0        0        0.00      2.34        8.32        4.77      241402
      9.61         0         0         0          0          1        0        0        0.00      0.00        8.57        2.69      263000
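The single-level check mentioned under Feature Engineering can be sketched in a few lines of Python. The values are the discrete columns copied from the table above; the helper name `single_level_columns` is hypothetical and only for illustration.

```python
# Discrete columns of the assembled dataset, copied from the table above.
discrete = {
    "jun_rain":  [0, 0, 0, 0, 0, 0],
    "jul_rain":  [1, 1, 0, 0, 0, 0],
    "aug_rain":  [0, 0, 0, 0, 0, 0],
    "wetwinter": [0, 0, 1, 1, 0, 0],
    "drywinter": [0, 0, 0, 0, 0, 1],
    "colddec":   [0, 0, 0, 1, 0, 0],
    "coldjan":   [0, 0, 0, 0, 0, 0],
}


def single_level_columns(columns):
    """Names of columns whose values never vary (at most one distinct level)."""
    return [name for name, values in columns.items() if len(set(values)) <= 1]


constant = single_level_columns(discrete)
# Keep only the columns that carry some variation.
kept = {name: values for name, values in discrete.items() if name not in constant}

print(constant)  # ['jun_rain', 'aug_rain', 'coldjan']
```

A column that never varies cannot be distinguished from the intercept, so it adds nothing to the fit.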

Modeling

This modeling attempt is restricted to the continuous variables only. Unfortunately, the model is not significant (overall F-test p-value of 0.849), so we cannot draw any conclusions from it. This could be due to a number of factors, such as the size of the dataset (six observations for five estimated coefficients leaves a single residual degree of freedom), the nature of the dataset (an inappropriate link/proxy between US production and Fresno County), the functional form of the model, or inappropriate variable distributions requiring transformations. Further effort could be put into compiling a more precise dataset using local production numbers instead. This exercise could be aided by domain knowledge of agricultural production.

## 
## Call:
## lm(formula = PRODUCTION ~ ., data = model_data2 %>% select(-YEAR))
## 
## Residuals:
##      1      2      3      4      5      6 
##  38928 -25257  11361  36905 -30206 -31731 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   311088     125194   2.485    0.244
## summerprcp     18199     451809   0.040    0.974
## fallprcp        8476      15191   0.558    0.676
## winterprcp      4408      21494   0.205    0.871
## springprpc    -20122      41271  -0.488    0.711
## 
## Residual standard error: 74590 on 1 degrees of freedom
## Multiple R-squared:  0.5601, Adjusted R-squared:  -1.199 
## F-statistic: 0.3183 on 4 and 1 DF,  p-value: 0.849
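To see where the summary figures come from, the Python sketch below recomputes them from the residuals printed above and the PRODUCTION column of the assembled dataset. It assumes the six table rows are the observations used in the fit, in the same order as the residuals. It does not refit the model; it only applies the standard formulas for the residual standard error, R-squared, adjusted R-squared, and the F-statistic.

```python
import math

# PRODUCTION column of the assembled dataset and the residuals printed by
# summary(); assumed to refer to the same six observations in the same order.
production = [368408, 332211, 352441, 304723, 241402, 263000]
residuals = [38928, -25257, 11361, 36905, -30206, -31731]

n = len(production)   # 6 observations
k = 4                 # predictors, not counting the intercept
df_resid = n - k - 1  # residual degrees of freedom

rss = sum(r * r for r in residuals)                # residual sum of squares
mean_y = sum(production) / n
tss = sum((y - mean_y) ** 2 for y in production)   # total sum of squares

rse = math.sqrt(rss / df_resid)                    # residual standard error
r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
f_stat = (r2 / k) / ((1 - r2) / df_resid)

print(df_resid, round(rse), round(r2, 4), round(adj_r2, 3), round(f_stat, 4))
```

With a single residual degree of freedom, the factor (n - 1) / df_resid equals 5, which is what pushes the adjusted R-squared below zero even though the unadjusted value is above 0.5.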

The distributions of the continuous variables are not normal, and while transformations might help, it would be more useful to gather more data. With significant coefficients, we could have quantified effects such as the change in yield, in tons, associated with 1 mm more rain in a season. If possible, this would be a powerful tool.