This dataset is used to estimate the biomass of paddocks from satellite measurements of the reflected wavelengths of green, blue, red, and near-infrared (henceforth NIR). These measurements can be used individually or combined to compute various vegetation indices, which describe greenness. The Normalized Difference Vegetation Index (henceforth NDVI) is a popular index that puts these measurements on an easily read scale from -1 to 1. NDVI will be used as one of the benchmarks during this analysis.
\[ \mathrm{NDVI} = \frac{NIR - Red}{NIR + Red} \]
| DryMatter | drymatter | NDVI | Date | Paddock | blue | green | red | nir |
|---|---|---|---|---|---|---|---|---|
| 1875 | 1536.238 | 0.6785089 | 2017-10-13 | A12 | 325.0920 | 532.5517 | 520.3084 | 2716.536 |
| 2250 | 1748.723 | 0.7144538 | 2017-10-18 | A12 | 335.7911 | 551.1741 | 528.0619 | 3170.547 |
| 2550 | 2012.143 | 0.7427501 | 2018-02-15 | A12 | 435.8410 | 592.2337 | 509.5287 | 3451.824 |
| 1950 | 1815.039 | 0.6794601 | 2018-05-03 | A12 | 554.5660 | 672.9560 | 601.7782 | 3153.000 |
| 1950 | 1879.764 | 0.7208708 | 2018-05-03 | A12 | 388.9216 | 649.5143 | 614.5010 | 3788.486 |
| 2700 | 2393.333 | 0.8187152 | 2018-05-28 | A12 | 218.9326 | 498.8568 | 442.2653 | 4436.966 |
Other variables include DryMatter (ground biomass measurements), drymatter (the current algorithm’s estimate of biomass), Date (date of estimate), Paddock (paddock on the farm), and blue, green, red and nir (the respective reflected wavelengths).
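As a quick check on the NDVI definition, the sketch below recomputes NDVI from the red and nir columns for the first and last rows of the table. It is a minimal Python illustration rather than the code used for this analysis; the function name `ndvi` is made up for the example.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + Red must be non-zero")
    return (nir - red) / total


# (red, nir, reported NDVI) copied from the first and last rows of the table
rows = [
    (520.3084, 2716.536, 0.6785089),
    (442.2653, 4436.966, 0.8187152),
]

# Recompute NDVI and compare with the value reported in the table
computed = [ndvi(nir, red) for red, nir, _ in rows]
for (red, nir, reported), value in zip(rows, computed):
    print(f"red={red:.1f} nir={nir:.1f} computed={value:.6f} reported={reported:.6f}")
```

The recomputed values agree with the NDVI column to the precision shown in the table.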
The matrix below shows a strong positive correlation between the biomass estimate from our current prediction and the popular vegetation index NDVI, as expected. NIR by itself also has a strong positive correlation, while red, green and blue (RGB) all have weaker negative correlations. All variables are significant.
There is a lot of multicollinearity between the predictors, especially between the RGB reflected wavelengths. This may explain why NDVI is so popular: it takes the RGB band most highly correlated with biomass (red) and combines it with NIR, which shows little to no correlation with the RGB bands.
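For reference, each entry of a correlation matrix like this is a pairwise Pearson coefficient. The sketch below computes one by hand in Python, using only the six example rows from the table above, so the result illustrates the calculation and is not the value from the full dataset. The helper name `pearson` is an assumption for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# DryMatter and nir columns from the six rows shown in the table
dry_matter = [1875, 2250, 2550, 1950, 1950, 2700]
nir = [2716.536, 3170.547, 3451.824, 3153.000, 3788.486, 4436.966]

r_nir = pearson(nir, dry_matter)
print(f"Pearson r between nir and DryMatter (6 rows only): {r_nir:.3f}")
```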
The graph below shows that NIR is the most strongly reflected wavelength but also has the most variance relative to RGB. Its strong linear relationship shows that as NIR increases, biomass increases. The RGB bands, on the other hand, have smaller variances and weak negative relationships, so as they increase biomass decreases (very slightly).
The plot below demonstrates what we aim to improve. Using NDVI we can see a strong positive linear relationship, but towards higher estimates there is a lot more variation. As the higher estimates are the most useful for farmers in terms of paddock management, we ideally aim to reduce the variation for values between 2000 and 3000 kg DM/ha.
It is important to note that this dataset has been thoroughly cleaned of outliers, and the datasets used in practice will be noisier. Generally the estimates tend to be too high at lower values and too low at higher values.
It is also important to remember that the current algorithm has been designed to be robust across a variety of situations, and the full model (RGB + NIR) may be inclined to overfit on this particular dataset.
(DryMatter ~ drymatter)
(DryMatter ~ NDVI)
(DryMatter ~ blue + red + green + nir)
In practice, data for this type of problem would look along the lines of the graph shown below. Generalised Additive Models (GAMs) are often useful in the environmental sciences, where relationships are not so linear.
GAMs simply change the simple linear model
\[ y = \beta_{0} + x_{1}\beta_{1} + \epsilon,\;\;\;\;\; \epsilon \sim N(0, \sigma^2) \]
to something very similar, but where the linear term is replaced by some smooth function f of the predictor:
\[ y = \beta_{0} + f(x_{1}) + \epsilon,\;\;\;\;\; \epsilon \sim N(0, \sigma^2) \]
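As a sketch of what f usually looks like (this assumes the standard penalized regression spline setup, which is not spelled out above), each smooth is a weighted sum of K basis functions \(b_{j}\), and the coefficients are estimated under a penalty on wiggliness whose strength is controlled by a smoothing parameter \(\lambda\):
\[ f(x) = \sum_{j=1}^{K} \beta_{j} b_{j}(x), \qquad \text{penalty} = \lambda \int f''(x)^2 \, dx \]
With the four reflectance bands as predictors, the model becomes
\[ y = \beta_{0} + f_{1}(blue) + f_{2}(green) + f_{3}(red) + f_{4}(nir) + \epsilon,\;\;\;\;\; \epsilon \sim N(0, \sigma^2) \]
A large \(\lambda\) pushes each \(f_{j}\) towards a straight line, while a small \(\lambda\) allows a more flexible curve.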
When plotting residuals against fitted values for the linear model, we can see that even though the line is horizontal around 0 there is a slight trend (wiggle) throughout the data. This indicates that a GAM may be appropriate.
The edf gives the estimated degrees of freedom of each smooth; essentially, the larger the number, the more wiggly the fit. Values around 1 are close to linear. In the output below, all terms are shown to be significant in predicting biomass. The assumption checks show that there is still non-constant variance in the residuals, but otherwise they look normal.
(DryMatter ~ s(blue) + s(green) + s(red) + s(nir))
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DryMatter ~ s(blue) + s(green) + s(red) + s(nir)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2166.79 8.45 256.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(blue) 5.268 6.407 33.843 < 2e-16 ***
## s(green) 6.237 7.364 20.916 < 2e-16 ***
## s(red) 5.269 6.393 8.052 1.06e-08 ***
## s(nir) 4.315 5.347 133.962 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.648 Deviance explained = 65.6%
## -REML = 6238 Scale est. = 64046 n = 897
##
## Method: REML Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [-0.000935496,0.0001600761]
## (score 6238.016 & scale 64045.73).
## Hessian positive definite, eigenvalue range [1.046682,446.043].
## Model rank = 37 / 37
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(blue) 9.00 5.27 0.91 <2e-16 ***
## s(green) 9.00 6.24 1.00 0.47
## s(red) 9.00 5.27 0.96 0.13
## s(nir) 9.00 4.32 1.03 0.81
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using anova we can check whether the added complexity of the splines is worth it. The GAM is significantly different (p-value < 2.2e-16) from our simple linear model, which indicates that the added complexity is probably worth it. I am only comparing against the best base model: if the GAM is significantly better than the best, it will also be better than the other two.
## Analysis of Variance Table
##
## Model 1: DryMatter ~ blue + red + green + nir
## Model 2: DryMatter ~ s(blue) + s(green) + s(red) + s(nir)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 892.00 64609068
## 2 874.91 56034264 17.09 8574803 7.8342 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
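The F statistic in this table can be reproduced by hand from the residual sums of squares and degrees of freedom. The Python sketch below uses the usual nested-model F formula and the rounded values printed in the table, so it matches only up to rounding; the function name `f_statistic` is made up for the example.

```python
def f_statistic(rss_linear, rss_gam, df_diff, resid_df_gam):
    """F statistic for comparing a simpler model against a more flexible one.

    Numerator: drop in residual sum of squares per extra degree of freedom.
    Denominator: residual variance of the more flexible model.
    """
    numerator = (rss_linear - rss_gam) / df_diff
    denominator = rss_gam / resid_df_gam
    return numerator / denominator


# Values copied from the Analysis of Variance Table
rss_linear = 64609068   # Model 1: linear terms
rss_gam = 56034264      # Model 2: smooth terms
df_diff = 17.09         # extra (effective) degrees of freedom used by the GAM
resid_df_gam = 874.91   # residual degrees of freedom of the GAM

f_value = f_statistic(rss_linear, rss_gam, df_diff, resid_df_gam)
residual_variance = rss_gam / resid_df_gam

print(f"F = {f_value:.4f}")
print(f"GAM residual variance = {residual_variance:.0f}")
```

The residual variance in the denominator is the same quantity reported as the scale estimate in the GAM summary (64046).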
Because we cannot make inferences from the coefficients of smooth terms, interpretability comes from visually inspecting the partial effects of the smooth terms or from the predictions. If the model contained any ordinary linear terms, it would be fine to make inferences from their coefficients as usual.
The analysis of these plots is based on this model and everything included in it.
As the reflected wavelengths of blue increase, so does our biomass estimate.
As the reflected wavelengths of green increase, our biomass estimate tends to decrease.
As the reflected wavelengths of red increase from 200 to 400, our biomass estimate tends to increase and then plateaus.
As the reflected wavelengths of NIR increase, our biomass estimate tends to increase, but less and less at higher values.
For this dataset the use of GAMs has significantly improved our model. We do lose some interpretability compared with the simple linear model, but this model is simple enough that it shouldn’t have too much of an impact. Comparing the predictive plots below (data split 0.75 train and 0.25 test, graphs on the testing dataset), we see that the GAM and the linear model are very similar, except that the GAM has slightly less variation throughout.
This model would be worth testing on more than one farm’s worth of data, evaluating its predictive capabilities on a larger sample. Since the model would need to be more robust, the flexibility of the GAM may potentially lead to a stronger model.
Is it okay to visually assess a GAM like this on the same data?
Because our data is relatively linear I chose to use a linear activation function. I did try a few others, but they didn’t make any significant improvement. As we can see, the NN has reduced the MSE by around 13%. This MSE was estimated on a testing dataset: the data was normalized and then randomly split into a training set (0.75) and a testing set (0.25) for evaluation. The \(R^2\) values are almost exactly the same for both models (note this isn’t RSA).
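The evaluation steps described here (normalize, split 0.75/0.25, compare MSE) can be sketched as follows. This is an illustrative Python outline, not the original code. The normalization method is not stated above, so min-max scaling is an assumption, and the seed and helper names are likewise made up. The row count of 897 comes from the GAM summary.

```python
import random


def min_max_scale(values):
    """Rescale values to the [0, 1] range (assumed normalization method)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(n_rows, train_frac=0.75, seed=42):
    """Randomly split row indices into training and testing sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(train_frac * n_rows)
    return indices[:cut], indices[cut:]


def mse(actual, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def pct_reduction(baseline_mse, new_mse):
    """Percentage by which new_mse improves on baseline_mse."""
    return 100.0 * (baseline_mse - new_mse) / baseline_mse


# nir values from the six rows shown in the table, scaled to [0, 1]
nir = [2716.536, 3170.547, 3451.824, 3153.000, 3788.486, 4436.966]
scaled_nir = min_max_scale(nir)

# 897 observations, as reported in the GAM summary
train_idx, test_idx = train_test_split(897)
print(len(train_idx), len(test_idx))
```

Both models should be scored on the same held-out indices so that the MSE comparison is like for like.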
(DryMatter ~ blue + red + green + nir)
NN(DryMatter ~ blue + red + green + nir)
## [1] 0.586514
Below are a variety of NN models of different depths and sizes. The thickness of the lines connecting each perceptron shows visually where the weights are concentrated. One thing to notice in the bigger NNs is that a lot of the pathways go unused and the model ends up focusing on only a few perceptrons anyway. This indicates that the simple NN is all that is needed for this problem, which is to be expected given the simplicity of the data. I therefore settle on the first network, as it still has two layers but doesn’t add unnecessary complexity. Even then it is probably still more complex than needed, as the model mainly focuses on two perceptrons. Remember that a single-perceptron NN would be similar to a simple linear regression, so this gives an early indication that a NN may not be appropriate for this dataset.
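To illustrate the last point, the sketch below trains a single neuron with a linear activation by gradient descent on squared error and compares it with the closed-form least-squares line. It is a toy Python example on made-up data, not the networks shown here; the data, learning rate and epoch count are arbitrary assumptions.

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def train_linear_neuron(xs, ys, lr=0.1, epochs=5000):
    """One neuron, linear activation, full-batch gradient descent on MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return b, w


# Made-up, already-normalized data: a line plus a small alternating offset
xs = [i / 19 for i in range(20)]
ys = [1.0 + 2.0 * x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]

ols_intercept, ols_slope = ols_fit(xs, ys)
nn_intercept, nn_slope = train_linear_neuron(xs, ys)
print(f"OLS:    intercept={ols_intercept:.4f} slope={ols_slope:.4f}")
print(f"Neuron: intercept={nn_intercept:.4f} slope={nn_slope:.4f}")
```

Both minimise the same squared-error objective, so the trained weight and bias converge to the regression slope and intercept.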
In the following evaluation, 10-fold cross-validation was used to compare the MSE of the simple linear model and the NN model. Although the MSE is a lot lower for the NN, the variation in MSE shown in the box plot is very large: some folds have a significantly lower MSE than others. This could be an indication that a larger dataset is needed to obtain more consistent results.
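The fold-to-fold variation described here comes from scoring the model on ten different held-out subsets. The Python sketch below shows the mechanics of 10-fold cross-validation for a one-predictor linear model on synthetic data. Every number in it is made up for illustration and it does not reproduce the results above.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line; returns (intercept, slope)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_mse(xs, ys, k=10, seed=0):
    """Return the held-out MSE of a linear fit for each of k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out]
        scores.append(sum(sq_errors) / len(sq_errors))
    return scores


# Synthetic stand-in data: a noisy linear relationship (not the real dataset)
rng = random.Random(1)
xs = [rng.uniform(0.0, 1.0) for _ in range(100)]
ys = [1500.0 + 1000.0 * x + rng.gauss(0.0, 250.0) for x in xs]

fold_scores = k_fold_mse(xs, ys, k=10)
print(f"mean MSE = {statistics.mean(fold_scores):.0f}")
print(f"sd of fold MSE = {statistics.stdev(fold_scores):.0f}")
```

With small folds the per-fold MSE can swing noticeably, which is why the spread is worth reporting alongside the mean.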
(DryMatter ~ blue + red + green + nir)
NN(DryMatter ~ blue + red + green + nir)
Although the mean squared error on our training set decreases, we can see that the neural network has much less spread around the line and tends to estimate higher values more often. This is where the trouble with NNs comes in: they resemble black boxes, and explaining or interpreting the output is much more difficult. Also, as we saw with the MSE, testing on different sets of data led to large variation in the result. Given the loss of interpretability, a simple linear model is probably the stronger model in this case, even though the NN has slightly better predictive capabilities.