Dataset Overview

This dataset is used to estimate the biomass of paddocks from satellite measurements of the reflected wavelengths of green, blue, red, and near-infrared (henceforth NIR) light. These measurements can be used individually or combined into various vegetation indices, which describe greenness. The Normalized Difference Vegetation Index (henceforth NDVI) is a popular index that puts these measurements on an easy-to-read scale from -1 to 1. NDVI will be used as one of the benchmarks for evaluation during this analysis.


\[ NDVI = \frac{NIR - Red}{NIR + Red} \]


DryMatter  drymatter  NDVI       Date        Paddock  blue      green     red       nir
1875       1536.238   0.6785089  2017-10-13  A12      325.0920  532.5517  520.3084  2716.536
2250       1748.723   0.7144538  2017-10-18  A12      335.7911  551.1741  528.0619  3170.547
2550       2012.143   0.7427501  2018-02-15  A12      435.8410  592.2337  509.5287  3451.824
1950       1815.039   0.6794601  2018-05-03  A12      554.5660  672.9560  601.7782  3153.000
1950       1879.764   0.7208708  2018-05-03  A12      388.9216  649.5143  614.5010  3788.486
2700       2393.333   0.8187152  2018-05-28  A12      218.9326  498.8568  442.2653  4436.966

Other variables include DryMatter (ground biomass measurements), drymatter (the current algorithm's estimate of biomass), Date (date of estimate), Paddock (paddock on the farm), and blue, green, red and nir (the respective reflected wavelengths).
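As a quick sanity check, the NDVI column can be recomputed from the red and nir columns using the formula above. The sketch below is in Python rather than the R used for the analysis, and the function name `ndvi` is chosen here for illustration only.

```python
# First two rows of the sample table above: (red, nir, NDVI as reported in the table).
rows = [
    (520.3084, 2716.536, 0.6785089),
    (528.0619, 3170.547, 0.7144538),
]


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + Red must be non-zero")
    return (nir - red) / total


for red, nir, reported in rows:
    print(f"computed={ndvi(nir, red):.7f}  reported={reported:.7f}")
```

The computed values agree with the table to the precision shown, which suggests the NDVI column was derived directly from these two bands.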

As shown in the matrix below, ground biomass has a strong positive correlation with both the estimate from our current algorithm and the popular vegetation index NDVI (as expected). NIR by itself also has a strong positive correlation, while the RGB bands all have weaker negative correlations. All correlations are significant.

There is a lot of multicollinearity between the predictors, especially among the RGB reflected wavelengths. This may explain why NDVI is so popular: it takes the most highly correlated of the RGB bands (red) and combines it with NIR, which shows little to no correlation with the RGB bands.
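A correlation matrix of this kind is built from pairwise Pearson coefficients. The sketch below (Python, standard library only) applies a hand-written `pearson` helper to the six sample rows printed earlier. Six rows are far too few to reproduce the full matrix, so the printed numbers only illustrate the calculation and should not be read as results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# The six sample rows from the table in the Dataset Overview.
dry_matter = [1875, 2250, 2550, 1950, 1950, 2700]
bands = {
    "blue": [325.0920, 335.7911, 435.8410, 554.5660, 388.9216, 218.9326],
    "green": [532.5517, 551.1741, 592.2337, 672.9560, 649.5143, 498.8568],
    "red": [520.3084, 528.0619, 509.5287, 601.7782, 614.5010, 442.2653],
    "nir": [2716.536, 3170.547, 3451.824, 3153.000, 3788.486, 4436.966],
}

# Correlation of each band with measured biomass.
for name, values in bands.items():
    print(name, round(pearson(values, dry_matter), 3))

# Correlation between two predictors, the kind of pair that signals multicollinearity.
print("red vs green", round(pearson(bands["red"], bands["green"]), 3))
```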

Data Exploration

The graph below shows that NIR is the most strongly reflected wavelength but also has the most variance relative to RGB. Its strong linear relationship shows that as NIR increases, biomass increases. The RGB bands, on the other hand, have smaller variances and weak negative relationships, so as they increase biomass decreases (very slightly).

The plot below demonstrates what we aim to improve. Using NDVI we can see a strong positive linear relationship, but towards higher estimates we see a lot more variation. As the higher estimates are the most useful for farmers in terms of paddock management, we ideally aim to reduce the variation for values between 2000 and 3000 kg DM/ha.

It is important to note that this dataset has been thoroughly cleaned of outliers, and the datasets used in practice will be noisier. Generally the estimates tend to be too high at lower values and too low at higher values.

Benchmark Models

It is important to remember that the current algorithm has been designed to be robust across a variety of situations, and the full model (RGB + NIR) may be inclined to overfit on this particular dataset. RSA below denotes the adjusted \(R^2\).

  • Current Algorithm - RSA = 0.5814

(DryMatter ~ drymatter)

  • NDVI Model - RSA = 0.5095

(DryMatter ~ NDVI)

  • RGB and NIR model - RSA = 0.6017

(DryMatter ~ blue + red + green + nir)
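The RSA figures above are read here as adjusted R-squared, an assumption supported by the GAM summary later in the document, which reports R-sq.(adj) = 0.648 for the model listed with RSA = 0.648. A minimal Python sketch of the usual adjustment for a linear model follows. The sample size n = 897 is taken from that summary, while the R-squared value of 0.60 is a made-up input for illustration.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a linear model with n observations and p predictors
    (the intercept is accounted for by the extra -1 in the denominator)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


n = 897          # sample size reported in the GAM summary
example_r2 = 0.60  # hypothetical R^2, not a result from this analysis

# One predictor (e.g. DryMatter ~ NDVI) versus four (blue + red + green + nir).
print(round(adjusted_r2(example_r2, n, p=1), 4))
print(round(adjusted_r2(example_r2, n, p=4), 4))
```

With nearly 900 observations the penalty for four predictors is small, so differences in RSA between the benchmark models mostly reflect differences in fit rather than in model size.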

Generalised Additive Models (GAMs)

In practice, data for this type of problem would look along the lines of the graph shown below. Generalised Additive Models (GAMs) are often useful in environmental sciences, where relationships are often not linear.

GAMs simply change the simple linear model

\[ y = \beta_{0} + x_{1}\beta_{1} + \epsilon,\;\;\;\;\; \epsilon \sim N(0, \sigma^2) \]

to something very similar, but where the linear term is replaced by some smooth function f of the predictor:

\[ y = \beta_{0} + f(x_{1}) + \epsilon,\;\;\;\;\; \epsilon \sim N(0, \sigma^2) \]
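
Assuming the same Gaussian error structure, the idea extends additively to several predictors. For the four bands used in this analysis the model would take the following form, where the subscripted smooths \(f_{1}, \dots, f_{4}\) are notation introduced here for illustration:

\[ DryMatter = \beta_{0} + f_{1}(blue) + f_{2}(green) + f_{3}(red) + f_{4}(nir) + \epsilon,\;\;\;\;\; \epsilon \sim N(0, \sigma^2) \]

Each smooth is estimated separately, so the effect of one band can be plotted while the others are held fixed.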


When plotting residuals against fitted values for the linear model, we can see that even though the trend line is horizontal around 0, there is a slight trend (wiggle) throughout the data. This indicates that a GAM may be appropriate.


Fitting the Model

The edf gives the estimated degrees of freedom of each smooth term; essentially, the larger the number, the more wiggly the fitted curve. Values around 1 tend to be close to linear. In the output below we can see that all our terms are significant in predicting biomass. The diagnostic checks show that there is still non-constant variance in our residuals, but otherwise they look normal.

  • GAM model - RSA = 0.648

(DryMatter ~ s(blue) + s(green) + s(red) + s(nir))

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DryMatter ~ s(blue) + s(green) + s(red) + s(nir)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2166.79       8.45   256.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df       F  p-value    
## s(blue)  5.268  6.407  33.843  < 2e-16 ***
## s(green) 6.237  7.364  20.916  < 2e-16 ***
## s(red)   5.269  6.393   8.052 1.06e-08 ***
## s(nir)   4.315  5.347 133.962  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.648   Deviance explained = 65.6%
## -REML =   6238  Scale est. = 64046     n = 897

## 
## Method: REML   Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [-0.000935496,0.0001600761]
## (score 6238.016 & scale 64045.73).
## Hessian positive definite, eigenvalue range [1.046682,446.043].
## Model rank =  37 / 37 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'  edf k-index p-value    
## s(blue)  9.00 5.27    0.91  <2e-16 ***
## s(green) 9.00 6.24    1.00    0.47    
## s(red)   9.00 5.27    0.96    0.13    
## s(nir)   9.00 4.32    1.03    0.81    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Comparing to Base Model

Using anova we can check whether the added complexity of using splines is worth it. The GAM is significantly different (p-value < 2.2e-16) from our simple linear model, which indicates that the added complexity is probably worth it. I am only comparing to the best base model because if the GAM is significantly better than the best, it will also be better than the other two.

## Analysis of Variance Table
## 
## Model 1: DryMatter ~ blue + red + green + nir
## Model 2: DryMatter ~ s(blue) + s(green) + s(red) + s(nir)
##   Res.Df      RSS    Df Sum of Sq      F    Pr(>F)    
## 1 892.00 64609068                                     
## 2 874.91 56034264 17.09   8574803 7.8342 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
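
The F value in this table can be reproduced by hand from the residual sums of squares and residual degrees of freedom. The Python sketch below uses the standard nested-model F statistic. The function name is illustrative. For a GAM the degrees of freedom are effective (non-integer), so the associated p-value is approximate.

```python
# Values copied from the analysis-of-variance table above.
rss_linear, res_df_linear = 64609068.0, 892.00
rss_gam, res_df_gam = 56034264.0, 874.91


def f_statistic(rss_small, df_small, rss_big, df_big):
    """F statistic comparing a smaller model against a larger one.

    rss_* are residual sums of squares and df_* are residual degrees of
    freedom; the larger model has the smaller residual df.
    """
    extra_df = df_small - df_big
    if extra_df <= 0 or df_big <= 0:
        raise ValueError("the larger model must use more degrees of freedom")
    return ((rss_small - rss_big) / extra_df) / (rss_big / df_big)


f_value = f_statistic(rss_linear, res_df_linear, rss_gam, res_df_gam)
print(round(f_value, 2))  # approximately 7.83, matching the table up to rounding
```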


As we cannot make inferences from individual coefficients of the smooth terms, our interpretability comes from inspecting the partial effects of the smooth terms visually, or from the predictions. If we had any ordinary linear terms in our model, it would be fine to make inferences from their coefficients as usual.

The analysis of these plots is based on this model and everything included in it:

  • As the reflected wavelengths of blue increase, so does our biomass estimate.

  • As the reflected wavelengths of green increase, our biomass estimate tends to decrease.

  • As the reflected wavelengths of red increase from 200 to 400, our biomass estimate tends to increase and then plateaus.

  • As the reflected wavelengths of NIR increase, our biomass estimate tends to increase, but less and less at higher values.

Evaluation

For this dataset the use of GAMs has significantly improved our model. We do lose some interpretability compared to the simple linear model, but this model is simple enough that it shouldn’t have too much of an impact. Comparing the predictive plots below (data split 0.75 train and 0.25 test, graph on the testing dataset), we see the GAM and the linear model are very similar, except the GAM has slightly less variation throughout.

This model would be worth testing on more than one farm’s worth of data, evaluating its predictive capabilities on a larger sample. Since the model would then need to be more robust, the flexibility of the GAM may potentially lead to a stronger model.

Is it okay to visually assess the GAM like this on the same data?

Neural Network

Because our data is relatively linear, I chose to use a linear activation function. I did try a few others, but they didn’t make any significant improvement. As we can see, the NN reduced the MSE by around 13%. This MSE was estimated on a testing dataset: the data was normalized and then randomly split into a training set (0.75) and a testing set (0.25) for evaluation. The \(R^2\) values are almost exactly the same for both models (note this isn’t RSA).

  • RGB and NIR Model - MSE = 79881.186 - R^2 = 0.6034

(DryMatter ~ blue + red + green + nir)

  • Neural Network Model - MSE = 69544.466 - R^2 = 0.6048

NN(DryMatter ~ blue + red + green + nir)

## [1] 0.586514
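
The evaluation procedure described above (normalize, split 0.75/0.25, compare test MSE) can be sketched in a few lines of Python. The source does not say which normalization was used, so min-max scaling here is an assumption, as are the helper names and the seed. Only the two MSE values at the end come from the analysis.

```python
import random


def min_max_normalize(values):
    """Rescale values to the range [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant column")
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline, candidate):
    """Fraction by which candidate improves on baseline."""
    return (baseline - candidate) / baseline


train, test = train_test_split(range(897))  # 897 rows, as in the GAM summary
print(len(train), len(test))

# Test-set MSE values reported above for the linear model and the neural network.
print(f"{relative_reduction(79881.186, 69544.466):.1%}")  # 12.9%, the roughly 13% quoted
```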

Below are a variety of NN models of different depths and sizes. The thickness of the lines connecting the perceptrons shows visually where the weights are concentrated. One thing to notice in the bigger NNs is that a lot of the pathways become unused and the model ends up focusing on only a few perceptrons anyway. This indicates that a simple NN is all that is needed for this problem, which is to be expected given the simplicity of the data. I therefore settled on the first network, as it still has two layers but doesn’t add unnecessary complexity. Even then it is probably still more complex than needed, as the model mainly focuses on two perceptrons. Remember that a single-perceptron NN would be similar to a simple linear regression, so this gives us an early indication that a NN may not be appropriate for this dataset.
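
The link to linear regression can be made concrete. If every layer, hidden layers included, uses the identity (linear) activation, stacked layers collapse algebraically into a single affine map, which is exactly the form of a linear regression. The Python sketch below uses made-up weights for a 4-2-1 network. It only applies to the networks in this analysis to the extent that their hidden units were also linear, which the source does not state.

```python
def affine(weights, bias, x):
    """One layer with identity activation: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def compose(w2, b2, w1, b1):
    """Collapse layer 2 applied after layer 1 into one equivalent affine layer."""
    w = [[sum(w2[i][k] * w1[k][j] for k in range(len(w1))) for j in range(len(w1[0]))]
         for i in range(len(w2))]
    b = [sum(w2[i][k] * b1[k] for k in range(len(b1))) + b2[i] for i in range(len(w2))]
    return w, b


# Hypothetical weights: 4 inputs (blue, green, red, nir) -> 2 hidden units -> 1 output.
w1 = [[0.5, -0.25, 0.125, 1.0], [-1.0, 0.5, 0.25, 0.75]]
b1 = [0.5, -0.5]
w2 = [[2.0, -1.0]]
b2 = [0.25]

x = [0.25, 0.5, 0.75, 1.0]  # one normalized input row (made up)
two_layer = affine(w2, b2, affine(w1, b1, x))
w, b = compose(w2, b2, w1, b1)
one_layer = affine(w, b, x)
print(two_layer, one_layer)  # both give the same prediction
```

Under that assumption, extra layers add parameters without adding expressive power, which is consistent with the larger networks leaving many pathways unused.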

Cross Validation

In the following evaluation, 10-fold cross validation was used to compare the MSE of the simple linear model and the NN model. Although the NN’s MSE is a lot lower, the variation in MSE shown in the box plot is very large: some folds have a significantly lower MSE than others. This could be an indication that a larger dataset is needed to obtain more consistent results.

  • RGB and NIR Model - MSE = 72857.49

(DryMatter ~ blue + red + green + nir)

  • Neural Network Model - MSE = 64041.91

NN(DryMatter ~ blue + red + green + nir)
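
For reference, the mechanics of 10-fold cross validation can be sketched as follows. This Python example uses synthetic data and a one-predictor least-squares line in place of the models above, so the helper names, the seed and the numbers it prints are illustrative only.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor is constant")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def cross_validated_mse(xs, ys, k=10, seed=0):
    """Fit on k-1 folds, score on the held-out fold, and return one MSE per fold."""
    fold_mse = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in test_idx]
        fold_mse.append(sum(errors) / len(errors))
    return fold_mse


# Synthetic data for illustration only (not the paddock dataset).
rng = random.Random(1)
xs = [rng.uniform(0.3, 0.9) for _ in range(200)]
ys = [3000 * x + rng.gauss(0, 250) for x in xs]

fold_mse = cross_validated_mse(xs, ys)
print(len(fold_mse), round(min(fold_mse)), round(max(fold_mse)))
```

Reporting the spread of the per-fold values alongside their mean is what reveals the fold-to-fold variation discussed above.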

Evaluation

Although the mean squared error on our test set decreases, we can see that the neural network has much less spread around the line and tends to estimate higher values more often. This is where the trouble comes with NNs: they resemble black boxes, and explaining or interpreting the output is much more difficult. Also, as we saw with the MSE, testing on different sets of data leads to a large variation in the result. Given the loss of interpretability, a simple linear model is probably the stronger model in this case, even though the NN has slightly better predictive capability.