Maximizing crop production is a paramount concern in agriculture, and regression analysis can help address it in a number of ways. We can use it to answer questions such as:
These questions can be answered using a multiple regression model with a continuous response variable (yield, measured in units of quantity) and a combination of continuous and discrete predictor variables.
Of the many factors that may affect a crop’s yield, a number make sensible predictors, such as:
This situation lends itself to linear modeling using multiple regression with continuous and discrete variables.
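As a sketch of the general form such a model takes (the symbols are notation assumed here for illustration, not taken from the original analysis), with continuous predictors $x_{ij}$, 0/1 indicator variables $d_{ik}$ for the discrete predictors, and yield $y_i$ in year $i$:

$$
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \sum_{k=1}^{q} \gamma_k d_{ik} + \varepsilon_i
$$

Each $\beta_j$ is the expected change in yield for a one-unit increase in the corresponding continuous predictor with the others held fixed, and each $\gamma_k$ is the expected shift in yield when the corresponding indicator equals 1.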
The data below is gathered from multiple sources for illustrative purposes but does not form a coherent dataset.
Below is a chart of US production of a crop, in this case raisins, from 2014 to 2019. We also acquired meteorological data from NOAA for Fresno County, California, an area where raisins are grown.
The plots below show how temperature and precipitation are distributed over the months of the year. We notice a few outliers in both temperature and precipitation, which could be investigated further to determine whether the yield in those years was particularly affected.
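One way to flag such outliers programmatically is the 1.5 × IQR rule commonly used for boxplot whiskers. The sketch below is illustrative only: `flag_outliers` is a hypothetical helper, the example values are invented rather than NOAA measurements, and the original analysis may have identified outliers differently.

```r
# Flag values outside the 1.5 * IQR fences (a common boxplot-style rule).
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Invented monthly precipitation values, for illustration only.
example_prcp <- c(0.10, 0.20, 0.15, 0.30, 0.25, 5.00)

# Positions of the flagged observations.
outlier_idx <- which(flag_outliers(example_prcp))
print(outlier_idx)
```

Flagged months could then be matched to their years to check whether production in those years looks unusual.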
We manipulate the data to create features that might be of interest. Continuous variables are created for seasonal precipitation, and discrete variables are created from both the precipitation and temperature data. However, some of the discrete variables have only one level and provide no useful additional information. These should be removed from the dataset before it is fed to the model.
| yearlyprcp | jun_rain | jul_rain | aug_rain | wetwinter | drywinter | colddec | coldjan | summerprcp | fallprcp | winterprcp | springprpc | PRODUCTION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.25 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.02 | 3.43 | 3.11 | 1.23 | 368408 |
| 9.66 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.25 | 7.00 | 1.28 | 1.15 | 332211 |
| 16.18 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.00 | 5.97 | 9.66 | 3.14 | 352441 |
| 14.34 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0.15 | 1.25 | 8.80 | 4.74 | 304723 |
| 11.20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 2.34 | 8.32 | 4.77 | 241402 |
| 9.61 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.00 | 0.00 | 8.57 | 2.69 | 263000 |
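A minimal sketch of how the single-level columns could be detected and dropped, using the values transcribed from the table above. The object names `crop`, `single_level`, and `crop_clean` are assumptions made for this example and do not come from the original analysis.

```r
# Values transcribed from the feature table above.
crop <- data.frame(
  yearlyprcp = c(7.25, 9.66, 16.18, 14.34, 11.20, 9.61),
  jun_rain   = c(0, 0, 0, 0, 0, 0),
  jul_rain   = c(1, 1, 0, 0, 0, 0),
  aug_rain   = c(0, 0, 0, 0, 0, 0),
  wetwinter  = c(0, 0, 1, 1, 0, 0),
  drywinter  = c(0, 0, 0, 0, 0, 1),
  colddec    = c(0, 0, 0, 1, 0, 0),
  coldjan    = c(0, 0, 0, 0, 0, 0),
  summerprcp = c(0.02, 0.25, 0.00, 0.15, 0.00, 0.00),
  fallprcp   = c(3.43, 7.00, 5.97, 1.25, 2.34, 0.00),
  winterprcp = c(3.11, 1.28, 9.66, 8.80, 8.32, 8.57),
  springprpc = c(1.23, 1.15, 3.14, 4.74, 4.77, 2.69),
  PRODUCTION = c(368408, 332211, 352441, 304723, 241402, 263000)
)

# Count distinct values per column; a column with one distinct value
# cannot explain any variation in the response.
n_levels <- vapply(crop, function(x) length(unique(x)), integer(1))
single_level <- names(n_levels)[n_levels < 2]

# Keep only the columns that actually vary.
crop_clean <- crop[, n_levels >= 2, drop = FALSE]

print(single_level)
print(names(crop_clean))
```

With these six rows, the columns that never vary are `jun_rain`, `aug_rain`, and `coldjan`.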
This modeling attempt is restricted to the continuous variables only. Unfortunately, the model is not significant (F-test p-value of 0.849), so we cannot draw any conclusions from it. This could be due to a number of factors, such as the size of the dataset (six observations for five estimated coefficients leave a single residual degree of freedom), the nature of the dataset (an inappropriate link/proxy between US production and Fresno County weather), the functional form of the model, or inappropriate variable distributions requiring transformations. Further effort could be put into compiling a more precise dataset using local production numbers instead. This exercise could be aided by domain knowledge of agricultural production.
    ##
    ## Call:
    ## lm(formula = PRODUCTION ~ ., data = model_data2 %>% select(-YEAR))
    ##
    ## Residuals:
    ##      1      2      3      4      5      6
    ##  38928 -25257  11361  36905 -30206 -31731
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   311088     125194   2.485    0.244
    ## summerprcp     18199     451809   0.040    0.974
    ## fallprcp        8476      15191   0.558    0.676
    ## winterprcp      4408      21494   0.205    0.871
    ## springprpc    -20122      41271  -0.488    0.711
    ##
    ## Residual standard error: 74590 on 1 degrees of freedom
    ## Multiple R-squared:  0.5601, Adjusted R-squared:  -1.199
    ## F-statistic: 0.3183 on 4 and 1 DF,  p-value: 0.849
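The fit above could be set up along the following lines. This is a sketch under assumptions: `model_data2` is not shown in the text, so the data frame `seasonal` below is rebuilt from the seasonal precipitation columns of the feature table, and it may or may not reproduce the printed summary exactly. What it does show is why so little can be concluded: six observations and five coefficients leave one residual degree of freedom.

```r
# Seasonal precipitation and production, transcribed from the feature table.
# `seasonal` is a stand-in for the unseen `model_data2` (an assumption).
seasonal <- data.frame(
  summerprcp = c(0.02, 0.25, 0.00, 0.15, 0.00, 0.00),
  fallprcp   = c(3.43, 7.00, 5.97, 1.25, 2.34, 0.00),
  winterprcp = c(3.11, 1.28, 9.66, 8.80, 8.32, 8.57),
  springprpc = c(1.23, 1.15, 3.14, 4.74, 4.77, 2.69),
  PRODUCTION = c(368408, 332211, 352441, 304723, 241402, 263000)
)

# Regress production on all remaining columns.
fit <- lm(PRODUCTION ~ ., data = seasonal)

n_obs  <- nrow(seasonal)        # number of observations
n_coef <- length(coef(fit))     # intercept + four slopes
resid_df <- df.residual(fit)    # n_obs - n_coef when the design is full rank

# Overall F-test p-value, computed from the summary's F statistic.
f_stat  <- summary(fit)$fstatistic
p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                     lower.tail = FALSE))

print(c(n_obs = n_obs, n_coef = n_coef, resid_df = resid_df))
print(p_value)
```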
The continuous variables are not normally distributed, and while transformations might help, it would be more useful to gather more data. With significant coefficients, we could have quantified relationships such as the impact on yield, in tons, of an additional 1 mm of rain in a given season. If possible, this would be a powerful tool.
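To make the lack of precision concrete, a 95% confidence interval can be worked out for one coefficient directly from the printed summary. The sketch below uses the `fallprcp` estimate and standard error from the output above; the variable names are chosen for this example only.

```r
# Values copied from the fallprcp row of the summary output.
estimate  <- 8476
std_error <- 15191
df_resid  <- 1      # residual degrees of freedom reported by the summary

# Two-sided 95% critical value of the t distribution with 1 degree of freedom.
t_crit <- qt(0.975, df = df_resid)

# Confidence interval: estimate +/- t_crit * standard error.
ci <- estimate + c(-1, 1) * t_crit * std_error
print(t_crit)
print(ci)
```

With a single residual degree of freedom the critical value is about 12.7, so the interval runs from roughly -184,500 to 201,500. It comfortably contains zero, which is why neither the sign nor the size of the effect can be interpreted.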