1 Introduction

In previous research, the Data Geeks analytic group have undertaken statistical analysis to understand how geographical factors can affect crop yield in Australia. Though a set of regression models has been designed with promising prediction accuracy, it’s pending further exploration on determining importance of a large group of underlying dependent variables and summarizing a model with solid business interpretation.

1.1 Research Question

In this extended study, we aim to understand how environmental factors including climate and soil quality influence wine grape yield in Australia. In addition, we will try to design a model not only showing good forecast performance, but also better explain the causality between response variable and predictor variables.

2 Justification of Methodologies

2.1 Data Cleanup

The dataset utilised in this study is inherited from the one used in previous group work. Moreover, we made some adjustments in adopted variables after conducting a thorough examination on the prepared datasets.

There is a corruption in data entries of soil pH and soil water capacity attributes – two attributes are showing exact same values for every single record; with extra study to understand soil quality, we reached conclusion that soil water capacity variable is incorrect and needs to be removed from the dataset
The mapping between GI based data and SA2 based data is problematic, causing the GI water usage and grape variety data not matching for each region/state
Outliers Detection – Based on previous group research, grape planted in North Territory is mainly for table grape instead of wine grape; Since there is only one record related to this state, it is decided to be excluded As a final decision, only the information from Land Use datasets (Table 1 in group assignment) with outlier removal is utilised to maintain data integrity.

# load data
df<- read_csv(here::here("yield_by_region.csv"))
df1 <- df %>% 
  dplyr::select(3:56) %>%
  filter(state %in% c('NSW', 'WA', 'Tas', 'Vic', 'SA')) %>% # NT data is related to table grape instead of wine grape, so exclude it 
  filter(yield < 38) # exclude outlier of yield
df1 <- df1 %>%
  dplyr::select(-c("lu_water_capacity", "state")) # exclude water capacity and state
glimpse(df1)

## Rows: 215
## Columns: 52
## $ lon                  <dbl> 149.3426, 149.8055, 149.6395, 148.6423, 148.61...
## $ lat                  <dbl> -35.22835, -36.69631, -33.37590, -33.83565, -3...
## $ grape_area_region    <dbl> 7.15, 3.22, 10.99, 53.43, 368.37, 108.90, 79.9...
## $ grape_area_landuse   <dbl> 40.752869, 21.162236, 43.974474, 174.394450, 1...
## $ lu_mean_temp_annual  <dbl> 11.73768, 14.37882, 12.33928, 16.10698, 15.804...
## $ lu_mean_temp_winter  <dbl> 6.460130, 10.390000, 7.108636, 10.374250, 9.98...
## $ lu_mean_temp_summer  <dbl> 15.49699, 17.22393, 16.12078, 20.20036, 19.963...
## $ lu_max_temp_annual   <dbl> 18.06087, 19.90500, 18.44000, 22.53750, 22.514...
## $ lu_max_temp_winter   <dbl> 12.11774, 16.27967, 12.49164, 15.81450, 15.818...
## $ lu_max_temp_summer   <dbl> 22.28522, 22.48643, 22.77948, 27.33679, 27.299...
## $ lu_min_temp_annual   <dbl> 5.414493, 8.852639, 6.238561, 9.676458, 9.0945...
## $ lu_min_temp_winter   <dbl> 0.8025217, 4.5003333, 1.7256364, 4.9340000, 4....
## $ lu_min_temp_summer   <dbl> 8.708758, 11.961427, 9.462079, 13.063930, 12.6...
## $ lu_mean_temp_jan     <dbl> 18.77522, 19.32417, 19.43773, 23.78500, 23.695...
## $ lu_mean_temp_feb     <dbl> 18.54848, 19.73000, 19.19591, 23.61125, 23.523...
## $ lu_mean_temp_mar     <dbl> 16.12696, 18.12750, 16.70864, 20.96625, 20.608...
## $ lu_mean_temp_apr     <dbl> 11.95522, 15.26667, 12.74182, 16.55125, 16.040...
## $ lu_mean_temp_may     <dbl> 8.177826, 12.152500, 8.952727, 12.402500, 12.0...
## $ lu_mean_temp_jun     <dbl> 5.430217, 9.730000, 6.197727, 9.231250, 8.8302...
## $ lu_mean_temp_jul     <dbl> 4.435869, 8.720833, 5.014091, 8.315000, 7.8875...
## $ lu_mean_temp_aug     <dbl> 5.840869, 9.670833, 6.484091, 9.762500, 9.3469...
## $ lu_mean_temp_sep     <dbl> 8.415870, 11.675833, 8.894545, 12.160000, 11.8...
## $ lu_mean_temp_oct     <dbl> 11.61130, 14.11250, 12.13773, 15.71125, 15.498...
## $ lu_mean_temp_nov     <dbl> 14.31783, 15.98167, 14.81045, 18.81375, 18.598...
## $ lu_mean_temp_dec     <dbl> 17.14391, 18.02500, 17.81318, 21.96375, 21.776...
## $ lu_mean_rain_annual  <dbl> 701.8815, 923.6184, 703.1797, 632.7203, 616.14...
## $ lu_mean_rain_dry     <dbl> 465.3587, 575.4270, 428.9841, 414.9306, 401.79...
## $ lu_mean_rain_wet     <dbl> 434.7203, 611.4777, 450.7538, 378.4017, 369.10...
## $ lu_mean_rain_jan     <dbl> 58.99304, 64.87000, 65.68545, 49.75750, 51.520...
## $ lu_mean_rain_feb     <dbl> 52.28522, 96.82333, 62.86909, 49.84250, 52.434...
## $ lu_mean_rain_mar     <dbl> 48.23087, 77.62500, 47.16182, 37.71250, 39.651...
## $ lu_mean_rain_apr     <dbl> 43.75261, 75.85167, 42.90636, 39.08500, 39.334...
## $ lu_mean_rain_may     <dbl> 43.12826, 62.33833, 42.46364, 39.57750, 41.082...
## $ lu_mean_rain_jun     <dbl> 53.64957, 64.15167, 44.05909, 44.54000, 46.563...
## $ lu_mean_rain_jul     <dbl> 58.95087, 61.90000, 58.96545, 57.58250, 59.770...
## $ lu_mean_rain_aug     <dbl> 56.32826, 38.31000, 62.63000, 48.88250, 51.931...
## $ lu_mean_rain_sep     <dbl> 59.54696, 55.35667, 51.68182, 50.00000, 50.243...
## $ lu_mean_rain_oct     <dbl> 59.06609, 66.21000, 63.35546, 50.54000, 53.266...
## $ lu_mean_rain_nov     <dbl> 73.36652, 86.67333, 72.03000, 54.06250, 54.821...
## $ lu_mean_rain_dec     <dbl> 62.92565, 80.02000, 73.12636, 55.66000, 57.788...
## $ lu_mean_solar_annual <dbl> 17.24855, 15.61668, 17.49392, 18.19937, 18.201...
## $ lu_mean_solar_dry    <dbl> 11.14162, 10.58263, 11.53976, 11.51478, 11.620...
## $ lu_mean_solar_wet    <dbl> 21.61064, 19.21243, 21.74689, 22.97407, 22.901...
## $ lu_bulk_density      <dbl> 1.419132, 1.307434, 1.413116, 1.446464, 1.4691...
## $ lu_carbon            <dbl> 2.2090407, 2.5544023, 1.5361696, 1.3150985, 1....
## $ lu_clay              <dbl> 14.11885, 14.41100, 15.36331, 26.25615, 24.288...
## $ lu_nitrogen          <dbl> 0.14302469, 0.19321756, 0.13032669, 0.12377553...
## $ lu_ph                <dbl> 4.713461, 4.625989, 4.765173, 5.214870, 5.1098...
## $ lu_phosphorus        <dbl> 0.02885974, 0.02884064, 0.02333373, 0.03167121...
## $ lu_sand              <dbl> 57.06934, 56.08776, 59.12095, 56.23438, 56.007...
## $ lu_silt              <dbl> 18.090666, 14.325860, 15.069221, 17.746149, 17...
## $ yield                <dbl> 3.33, 0.80, 1.89, 8.41, 11.12, 9.58, 3.30, 5.8...

2.2 Box-Cox Transformation and Variable Standardisations

A series of assumptions of linear regression model should be held to ensure the model fit with sound theoretical basis. Two most critical ones are the Normality and Equal Variance of the error/residual. The former one requires the distribution of the regression errors to follow a normal distribution, while the later one expects the error variance showing in same level at any set of predictor values. Based on previous group work, these two assumptions are not held strictly in our dataset. To mitigate the issue, two preprocessing steps are applied to the dataset.

On response variable side, Box-Cox transformation is applied. The Box-Cox method [1] considers a family of transformations based on a parameter lambda (λ), which is chosen by numerically maximizing the log-likelihood. As shown in the following formula, when the optized lambda equals to 0, the mehod is equivalent to perform log function on the variable. When lambda is not equal to 0, the operation is equivalent to taking a power of lambda on the variable. This method is one of the classic approaches to transform variables when they are strictly positive. The transformation of Y has the form:

On predicor variable side, standardisation/scale is applied onto all numeric variables. Standardisation calculates the mean and standard deviation of all the values for a varaible, and then “scale” each element by subtracting the mean and dividing by the standard deviation.

2.3 Causality Inference

In previous research, many predictor variables act as significant indicators in different versions of model fits. However, strong overlap of business representation and collinearity are shown among these variables, hindering researchers gaining reliable conclusion regarding the causality across them. To achieve a model with aligned interpretation of agriculture science, we would like to find causation between variables. Structural Equation Model (SEM) is one of the most popular methodologies to conduct such analysis. SEM aims to formulating the latent variables that potentially fits the data better as well as reveals stronger causal relationship between reponse and predictor variables. There are several indices to evaluate the goodness of fit of the identified model. They are generally categorized into 3 groups, well known as absolute, incremental and parsimony fit indices. The classic absolute indices include Model Chi-square, RMSEA and GFI, whereas the most commonly used incremental indices include NFI and CFI. The difference between Chi-square and the other indices is that most of the indices consist of a single value that must be compared against a benchmark that has been defined by statistics scholars. For instance, a typical cutoff for the goodness of fit index (GFI) is 0.9. It indicates that values greater or equal to 0.9 represent good fit, whereas values less than 0.9 imply poor fit. However, such benchmarks are not always universally agreed upon. In contrast, a fit goodness hypothesis based on Chi-Square statistics provides a statistically consistent outcome with a given p-value.

3 Modelling

3.1 Baseline

We started with the baseline model achieved from previous group work. 11 predictor variables are selected with most of the temperature related attributes included.

df1_final <- df1 %>%
  dplyr::select(
  lu_mean_temp_summer,
  lu_mean_rain_oct,
  lu_mean_temp_jan,
  lu_mean_temp_feb,
  lu_mean_temp_sep,
  lu_mean_temp_nov,
  lu_mean_temp_dec,
  lu_mean_solar_wet,
  lu_bulk_density,
  lu_ph,
  lu_phosphorus,
  yield
)

title="Grape Histogram"
ggplot(data = df1_final, aes(yield)) + 
  geom_histogram(aes(y=..density..), bins=100, color="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")+
  labs(title=title)

model1=lm(yield ~.,data=df1_final[,c(1:12)])
par(mfrow=c(2,2))
plot(model1)

From above histogram we can observe that yield variable has a non-normal distribution. Moreover, from model diagnostic plots we can observe that the basic assumptions for linear regression are not always held for our scenario. On one side, the mean of the residuals does not show 0 at each fitted value. This indicates the linearity assumption is invalid. On the other hand, the spread of the residuals is larger for larger fitted values. This implies that the constant variance assumption is violated. One solution to deal with ill-shaped variable distribution problem is to conduct Box-Cox transformation. Firstly, we search for a proper lambda by running the Box-Cox analysis. After identifying that the lambda should reside between 0.4 and 0.7, we further zoom in the selection range of lambda and increase the step precision to 0.05, and finally retrieve a lambda of 0.65. We then repeat the model fitting process on the new response variable.

par(mfrow=c(1,2))
boxcox(model1, lambda = seq(-1, 1, by = 0.1), plotit = TRUE)
boxcox(model1, lambda = seq(0.5, 0.7, by = 0.05), plotit = TRUE)

lambda = 0.65
df1_final$yield_tf <- (df1_final$yield^(lambda) - 1)/lambda
ggplot(data = df1_final, aes(yield_tf)) + 
  geom_histogram(aes(y=..density..), bins=100, color="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")+
  labs(title='Yield Histogram (After Transformation)')

model1_bc = lm(yield_tf ~.,data=df1_final[,c(1:11,13)])
par(mfrow=c(2,2))
plot(model1_bc)

It is clear that after transformation the distribution of yield no longer shows skewness and has a better fit into normal distribution. By comparing the Residual vs. Fitted plot from initial Model1 and the same model after Box-Cox transformation, we can observe that the Normality violation of Residual and inconsistent Variance of Residual issues are mitigated. To keep our analysis and model creation principle consistent, we adopt the same transformation on response variable in the following model fits practices.

3.2 Model Constrcution with SEM

Based on the work by researchers[2], wine grape yield is affected by combination of various environmental factors (e.g., climate, pest, disease, etc.). Among them, climate and soil condition collected in our dataset play most important roles in wine grape yield. In addition, water capacity is no longer a predominate factor due to the quick development of irrigation system, leaving solar radiation and temperature as the more critical influencer. They also argued that climate shows more impact during grape growing season, i.e., summer/wet season. Therefore, it can be found that in our following study only months related to summer and wet season are selected. Soil-wise, not many indices show significance due to the heavy usage of fertiliser. Based on this knowledge, we define our conceptual causality model with two latent variables, soil and climate. Their causation with yield is depicted in the following Directed Acyclic Graph (DAG). As for details, the identifier table summarises the observed variables for each latent variable.

As an initial SEM based model, we include all 11 observed variables in the baseline model and use lavaan to verify whether it fits the data well. (Some of the observed variables are not shown in our table because from the following analysis we find that they are not significant, e.g., bulk density and phosphorus)

df1_sem <- df1_final %>% 
  dplyr::select(lu_mean_temp_summer,
  lu_mean_rain_oct,
  lu_mean_temp_jan,
  lu_mean_temp_feb,
  lu_mean_temp_sep,
  lu_mean_temp_nov,
  lu_mean_temp_dec,
  lu_mean_solar_wet,
  lu_bulk_density,
  lu_ph,
  lu_phosphorus,
  yield_tf)

model <- '
  # latent variable model
    soil =~ lu_ph + lu_phosphorus + lu_bulk_density
    climate =~ lu_mean_temp_summer + lu_mean_rain_oct + lu_mean_temp_jan + lu_mean_temp_feb + lu_mean_temp_sep + lu_mean_temp_nov + lu_mean_temp_dec + lu_mean_solar_wet
    
  # regressions
    yield_tf ~ soil + climate
  # corvariance
    lu_mean_temp_summer ~~ lu_mean_temp_jan
    lu_mean_temp_summer ~~ lu_mean_temp_feb
    lu_mean_temp_summer ~~ lu_mean_temp_sep
    lu_mean_temp_summer ~~ lu_mean_temp_nov
    lu_mean_temp_summer ~~ lu_mean_temp_dec
'
fit <- sem(model, data=df1_sem)
summary(fit)

## lavaan 0.6-7 did NOT end normally after 840 iterations
## ** WARNING ** Estimates below are most likely unreliable
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         31
##                                                       
##   Number of observations                           215
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                    NA
##   Degrees of freedom                                NA
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   soil =~                                             
##     lu_ph             1.000                           
##     lu_phosphorus     0.001       NA                  
##     lu_bulk_densty    0.117       NA                  
##   climate =~                                          
##     lu_mn_tmp_smmr    1.000                           
##     lu_mean_ran_ct   -5.138       NA                  
##     lu_mean_tmp_jn    1.201       NA                  
##     lu_mean_tmp_fb    1.152       NA                  
##     lu_mean_tmp_sp    0.582       NA                  
##     lu_mean_tmp_nv    1.004       NA                  
##     lu_mean_tmp_dc    1.179       NA                  
##     lu_mean_slr_wt    0.673       NA                  
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   yield_tf ~                                          
##     soil            -70.471       NA                  
##     climate          15.874       NA                  
## 
## Covariances:
##                          Estimate  Std.Err  z-value  P(>|z|)
##  .lu_mean_temp_summer ~~                                    
##    .lu_mean_tmp_jn          0.019       NA                  
##    .lu_mean_tmp_fb          0.037       NA                  
##    .lu_mean_tmp_sp          0.356       NA                  
##    .lu_mean_tmp_nv          0.038       NA                  
##    .lu_mean_tmp_dc          0.052       NA                  
##   soil ~~                                                   
##     climate                 0.989       NA                  
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .lu_ph             0.373       NA                  
##    .lu_phosphorus     0.000       NA                  
##    .lu_bulk_densty    0.004       NA                  
##    .lu_mn_tmp_smmr    0.150       NA                  
##    .lu_mean_ran_ct  204.851       NA                  
##    .lu_mean_tmp_jn   -0.017       NA                  
##    .lu_mean_tmp_fb    0.071       NA                  
##    .lu_mean_tmp_sp    1.232       NA                  
##    .lu_mean_tmp_nv    0.334       NA                  
##    .lu_mean_tmp_dc    0.113       NA                  
##    .lu_mean_slr_wt    0.828       NA                  
##    .yield_tf         23.752       NA                  
##     soil              0.214       NA                  
##     climate           4.495       NA

Above lavaan output shows no reliable model is identified after 984 iterations. This is because when variables have large range of variance or the covariance matrix is not positive-definite, it is difficult for lavaan to find a converged solution. Therefore, rescaling the variables into a smaller range according to their distributions is necessary.

df1_sem[,1:11] <- scale(df1_sem[,1:11])
df1_sem <-as.data.frame((df1_sem))
fit <- sem(model, data=df1_sem, check.gradient=FALSE)
summary(fit, standardized=TRUE)

## lavaan 0.6-7 ended normally after 285 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         31
##                                                       
##   Number of observations                           215
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                              1314.198
##   Degrees of freedom                                47
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   soil =~                                                               
##     lu_ph             1.000                               0.616    0.617
##     lu_phosphorus     0.136    0.108    1.264    0.206    0.084    0.084
##     lu_bulk_densty    1.109    0.123    9.051    0.000    0.683    0.684
##   climate =~                                                            
##     lu_mn_tmp_smmr    1.000                               0.973    0.984
##     lu_mean_ran_ct   -0.621    0.056  -11.166    0.000   -0.604   -0.606
##     lu_mean_tmp_jn    1.026    0.011   94.389    0.000    0.999    1.001
##     lu_mean_tmp_fb    1.019    0.012   82.833    0.000    0.992    0.994
##     lu_mean_tmp_sp    0.762    0.039   19.460    0.000    0.742    0.744
##     lu_mean_tmp_nv    0.989    0.020   48.782    0.000    0.963    0.965
##     lu_mean_tmp_dc    1.016    0.012   82.097    0.000    0.989    0.991
##     lu_mean_slr_wt    0.864    0.039   22.257    0.000    0.841    0.843
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   yield_tf ~                                                            
##     soil             48.639  238.585    0.204    0.838   29.940   11.848
##     climate         -29.663  149.412   -0.199    0.843  -28.874  -11.426
## 
## Covariances:
##                          Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .lu_mean_temp_summer ~~                                                      
##    .lu_mean_tmp_jn          0.003    0.001    3.361    0.001    0.003    0.379
##    .lu_mean_tmp_fb          0.007    0.001    6.361    0.000    0.007    0.355
##    .lu_mean_tmp_sp          0.098    0.009   10.587    0.000    0.098    0.827
##    .lu_mean_tmp_nv          0.008    0.001    6.623    0.000    0.008    0.168
##    .lu_mean_tmp_dc          0.009    0.001    8.086    0.000    0.009    0.401
##   soil ~~                                                                     
##     climate                 0.594    0.077    7.690    0.000    0.991    0.991
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .lu_ph             0.616    0.068    9.040    0.000    0.616    0.619
##    .lu_phosphorus     0.988    0.095   10.372    0.000    0.988    0.993
##    .lu_bulk_densty    0.529    0.066    8.075    0.000    0.529    0.532
##    .lu_mn_tmp_smmr    0.032    0.003   10.992    0.000    0.032    0.032
##    .lu_mean_ran_ct    0.630    0.060   10.481    0.000    0.630    0.633
##    .lu_mean_tmp_jn   -0.003    0.001   -4.272    0.000   -0.003   -0.003
##    .lu_mean_tmp_fb    0.012    0.001    9.743    0.000    0.012    0.012
##    .lu_mean_tmp_sp    0.445    0.043   10.422    0.000    0.445    0.447
##    .lu_mean_tmp_nv    0.068    0.006   10.755    0.000    0.068    0.069
##    .lu_mean_tmp_dc    0.018    0.002   10.347    0.000    0.018    0.018
##    .lu_mean_slr_wt    0.288    0.027   10.657    0.000    0.288    0.289
##    .yield_tf        -10.832   80.605   -0.134    0.893  -10.832   -1.696
##     soil              0.379    0.082    4.596    0.000    1.000    1.000
##     climate           0.948    0.094   10.036    0.000    1.000    1.000

fitmeasures(fit)

##                npar                fmin               chisq                  df 
##              31.000               3.056            1314.198              47.000 
##              pvalue      baseline.chisq         baseline.df     baseline.pvalue 
##               0.000            6141.036              66.000               0.000 
##                 cfi                 tli                nnfi                 rfi 
##               0.791               0.707               0.707               0.699 
##                 nfi                pnfi                 ifi                 rni 
##               0.786               0.560               0.792               0.791 
##                logl   unrestricted.logl                 aic                 bic 
##           -1441.240            -784.141            2944.479            3048.969 
##              ntotal                bic2               rmsea      rmsea.ci.lower 
##             215.000            2950.736               0.354               0.338 
##      rmsea.ci.upper        rmsea.pvalue                 rmr          rmr_nomean 
##               0.371               0.000               0.144               0.144 
##                srmr        srmr_bentler srmr_bentler_nomean                crmr 
##               0.099               0.099               0.099               0.107 
##         crmr_nomean          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.107               0.099               0.099              11.470 
##               cn_01                 gfi                agfi                pgfi 
##              12.852               0.568               0.283               0.342 
##                 mfi                ecvi 
##               0.052               6.401

From the re-run output, we can find the the model converging. However, the fitness of model is not impressive. Below table with evaluation of critical indices show that this model is not a good fit.

Index	Actual Value	Preferred Value	Conclusion
Chi-square	1314.198(p=0)	p > 0.05	Not fit
RMSEA	0.354	< 0.05	Not fit
GFI	0.659	> 0.9	Not fit
NFI	0.786	> 0.9	Not fit
IFI	0.792	> 0.9	Not fit
CFI	0.791	> 0.9	Not fit

We also tried several other models with different manifest/observed variable selection. Model 2 considers lu_max_temp_summer, lu_mean_rain_wet, lu_mean_solar_annual and lu_mean_solar_wet as manifest variables for climate latent variable, and lu_ph, lu_carbon and lu_nitrogen as observed variables for the latent variable representing soil condition.

From below lavaan output, we can find that the Chi-square test, CFI, IFI, NFI, RMSEA and GFI all fail again.

df_2 <- df1 %>% 
  dplyr::select(lu_max_temp_summer, lu_mean_rain_wet,
  lu_mean_solar_annual, lu_mean_solar_wet,
  lu_ph, lu_carbon, lu_nitrogen,
  yield)
df_2[,1:7] <- scale(df_2[,1:7])
df_2 <-as.data.frame((df_2))

model_2 = lm(yield ~.,data=df_2)

boxcox(model_2, lambda = seq(0.5, 0.7, by = 0.05), plotit = TRUE)

lambda = 0.6
df_2$yield_tf <- (df_2$yield^(lambda) - 1)/lambda
model <- '
  # latent variable model
    soil =~ lu_ph + lu_carbon + lu_nitrogen
    climate =~ lu_max_temp_summer + lu_mean_solar_annual + lu_mean_rain_wet + lu_mean_solar_wet
    
  # regressions
    yield_tf ~ soil + climate
  # corvariance
  lu_mean_solar_annual ~~ lu_mean_solar_wet
'
fit <- sem(model, data=df_2)
summary(fit, standardized=TRUE, )

## lavaan 0.6-7 ended normally after 47 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         19
##                                                       
##   Number of observations                           215
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               412.134
##   Degrees of freedom                                17
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   soil =~                                                               
##     lu_ph             1.000                               0.745    0.747
##     lu_carbon        -1.220    0.088  -13.931    0.000   -0.909   -0.911
##     lu_nitrogen      -1.237    0.088  -14.106    0.000   -0.921   -0.923
##   climate =~                                                            
##     lu_mx_tmp_smmr    1.000                               0.970    0.972
##     lu_men_slr_nnl    0.953    0.037   25.625    0.000    0.924    0.926
##     lu_mean_ran_wt   -0.526    0.063   -8.395    0.000   -0.510   -0.512
##     lu_mean_slr_wt    0.896    0.043   20.977    0.000    0.868    0.870
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   yield_tf ~                                                            
##     soil              1.697    0.428    3.967    0.000    1.264    0.551
##     climate          -0.406    0.315   -1.290    0.197   -0.394   -0.172
## 
## Covariances:
##                           Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .lu_mean_solar_annual ~~                                                      
##    .lu_mean_slr_wt           0.148    0.025    5.916    0.000    0.148    0.801
##   soil ~~                                                                      
##     climate                  0.603    0.076    7.923    0.000    0.835    0.835
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .lu_ph             0.441    0.047    9.429    0.000    0.441    0.443
##    .lu_carbon         0.169    0.027    6.372    0.000    0.169    0.170
##    .lu_nitrogen       0.147    0.026    5.728    0.000    0.147    0.147
##    .lu_mx_tmp_smmr    0.055    0.022    2.516    0.012    0.055    0.056
##    .lu_men_slr_nnl    0.142    0.024    5.960    0.000    0.142    0.143
##    .lu_mean_ran_wt    0.735    0.072   10.216    0.000    0.735    0.738
##    .lu_mean_slr_wt    0.241    0.030    7.937    0.000    0.241    0.242
##    .yield_tf          4.340    0.432   10.039    0.000    4.340    0.825
##     soil              0.555    0.088    6.287    0.000    1.000    1.000
##     climate           0.940    0.098    9.572    0.000    1.000    1.000

fitmeasures(fit)

##                npar                fmin               chisq                  df 
##              19.000               0.958             412.134              17.000 
##              pvalue      baseline.chisq         baseline.df     baseline.pvalue 
##               0.000            2014.684              28.000               0.000 
##                 cfi                 tli                nnfi                 rfi 
##               0.801               0.672               0.672               0.663 
##                 nfi                pnfi                 ifi                 rni 
##               0.795               0.483               0.802               0.801 
##                logl   unrestricted.logl                 aic                 bic 
##           -1814.291           -1608.224            3666.582            3730.624 
##              ntotal                bic2               rmsea      rmsea.ci.lower 
##             215.000            3670.417               0.329               0.302 
##      rmsea.ci.upper        rmsea.pvalue                 rmr          rmr_nomean 
##               0.357               0.000               0.164               0.164 
##                srmr        srmr_bentler srmr_bentler_nomean                crmr 
##               0.105               0.105               0.105               0.119 
##         crmr_nomean          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.119               0.105               0.105              15.392 
##               cn_01                 gfi                agfi                pgfi 
##              18.428               0.740               0.449               0.349 
##                 mfi                ecvi 
##               0.399               2.094

Model 3 removes lu_mean_rain_wet and lu_mean_solar_wet for climate from Model 2, and keep the same group of variables as observed variables for soil condition latent variable. From below lavaan output, we can observe that Model 3 pass the model fitness test for CFI, IFI, NFI and GFI, but still does not satisfy the Chi-square test.

df_3 <- df1 %>% 
  dplyr::select(lu_max_temp_summer, 
  lu_mean_solar_annual, 
  lu_ph, lu_carbon, lu_nitrogen,
  yield)
df_3[,1:5] <- scale(df_3[,1:5])
df_3 <-as.data.frame((df_3))

model_3 = lm(yield ~.,data=df_3)

boxcox(model_3, lambda = seq(0.5, 0.7, by = 0.05), plotit = TRUE)

lambda = 0.6
df_3$yield_tf <- (df_3$yield^(lambda) - 1)/lambda
model <- '
  # latent variable model
    soil =~ lu_ph + lu_carbon + lu_nitrogen
    climate =~ lu_max_temp_summer + lu_mean_solar_annual
    
  # regressions
    yield_tf ~ soil + climate
  # corvariance
'
fit <- sem(model, data=df_3)
summary(fit, standardized=TRUE, )

## lavaan 0.6-7 ended normally after 42 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         14
##                                                       
##   Number of observations                           215
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                64.919
##   Degrees of freedom                                 7
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   soil =~                                                               
##     lu_ph             1.000                               0.746    0.748
##     lu_carbon        -1.220    0.087  -13.977    0.000   -0.910   -0.912
##     lu_nitrogen      -1.233    0.087  -14.110    0.000   -0.920   -0.922
##   climate =~                                                            
##     lu_mx_tmp_smmr    1.000                               0.986    0.989
##     lu_men_slr_nnl    0.922    0.038   24.317    0.000    0.910    0.912
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   yield_tf ~                                                            
##     soil              1.601    0.400    4.000    0.000    1.195    0.521
##     climate          -0.320    0.287   -1.113    0.266   -0.315   -0.137
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   soil ~~                                                               
##     climate           0.604    0.076    7.930    0.000    0.821    0.821
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .lu_ph             0.439    0.047    9.412    0.000    0.439    0.441
##    .lu_carbon         0.167    0.027    6.268    0.000    0.167    0.168
##    .lu_nitrogen       0.150    0.026    5.776    0.000    0.150    0.150
##    .lu_mx_tmp_smmr    0.022    0.025    0.884    0.377    0.022    0.022
##    .lu_men_slr_nnl    0.168    0.027    6.286    0.000    0.168    0.169
##    .yield_tf          4.353    0.432   10.082    0.000    4.353    0.827
##     soil              0.557    0.088    6.300    0.000    1.000    1.000
##     climate           0.973    0.099    9.812    0.000    1.000    1.000

fitmeasures(fit)

##                npar                fmin               chisq                  df 
##              14.000               0.151              64.919               7.000 
##              pvalue      baseline.chisq         baseline.df     baseline.pvalue 
##               0.000            1080.445              15.000               0.000 
##                 cfi                 tli                nnfi                 rfi 
##               0.946               0.884               0.884               0.871 
##                 nfi                pnfi                 ifi                 rni 
##               0.940               0.439               0.946               0.946 
##                logl   unrestricted.logl                 aic                 bic 
##           -1498.662           -1466.202            3025.323            3072.512 
##              ntotal                bic2               rmsea      rmsea.ci.lower 
##             215.000            3028.149               0.196               0.154 
##      rmsea.ci.upper        rmsea.pvalue                 rmr          rmr_nomean 
##               0.241               0.000               0.141               0.141 
##                srmr        srmr_bentler srmr_bentler_nomean                crmr 
##               0.064               0.064               0.064               0.076 
##         crmr_nomean          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.076               0.064               0.064              47.588 
##               cn_01                 gfi                agfi                pgfi 
##              62.187               0.913               0.739               0.304 
##                 mfi                ecvi 
##               0.874               0.432

After several more experiments, We finally reach a variable selection where the model is further simplified by only considering summer temperature and annual solar radiation as observed variables for climate, and removing carbon and nitrogen from soil related observed variables. This time, the model converges and shows a good fitness. Below table shows a summary of the fitness test result.

df_n <- df1 %>% 
  dplyr::select(lu_max_temp_summer,
  lu_mean_solar_annual,
  lu_ph,
  yield)
df_n[,1:3] <- scale(df_n[,1:3])
df_n <-as.data.frame((df_n))

model_n = lm(yield ~.,data=df_n)

boxcox(model_n, lambda = seq(0.5, 0.7, by = 0.05), plotit = TRUE)

lambda = 0.6
df_n$yield_tf <- (df_n$yield^(lambda) - 1)/lambda
model_n_bc = lm(yield_tf ~.,data=df_n[,c(1:3,5)])

model <- '
  # latent variable model
    soil =~ lu_ph 
    climate =~ lu_max_temp_summer + lu_mean_solar_annual
    
  # regressions
    yield_tf ~ soil + climate
  # corvariance
'
fit <- sem(model, data=df_n)
summary(fit, standardized=TRUE, )

## lavaan 0.6-7 ended normally after 27 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                          9
##                                                       
##   Number of observations                           215
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 0.789
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.374
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   soil =~                                                               
##     lu_ph             1.000                               0.998    1.000
##   climate =~                                                            
##     lu_mx_tmp_smmr    1.000                               1.034    1.036
##     lu_men_slr_nnl    0.839    0.051   16.306    0.000    0.868    0.870
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   yield_tf ~                                                            
##     soil              1.383    0.151    9.137    0.000    1.380    0.602
##     climate          -0.104    0.139   -0.744    0.457   -0.107   -0.047
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   soil ~~                                                               
##     climate           0.575    0.078    7.339    0.000    0.558    0.558
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .lu_ph             0.000                               0.000    0.000
##    .lu_mx_tmp_smmr   -0.074    0.054   -1.371    0.170   -0.074   -0.074
##    .lu_men_slr_nnl    0.242    0.044    5.482    0.000    0.242    0.243
##    .yield_tf          3.511    0.339   10.371    0.000    3.511    0.667
##     soil              0.995    0.096   10.368    0.000    1.000    1.000
##     climate           1.069    0.110    9.758    0.000    1.000    1.000

fitmeasures(fit)

##                npar                fmin               chisq                  df 
##               9.000               0.002               0.789               1.000 
##              pvalue      baseline.chisq         baseline.df     baseline.pvalue 
##               0.374             537.473               6.000               0.000 
##                 cfi                 tli                nnfi                 rfi 
##               1.000               1.002               1.002               0.991 
##                 nfi                pnfi                 ifi                 rni 
##               0.999               0.166               1.000               1.000 
##                logl   unrestricted.logl                 aic                 bic 
##           -1128.941           -1128.547            2275.883            2306.218 
##              ntotal                bic2               rmsea      rmsea.ci.lower 
##             215.000            2277.699               0.000               0.000 
##      rmsea.ci.upper        rmsea.pvalue                 rmr          rmr_nomean 
##               0.173               0.491               0.022               0.022 
##                srmr        srmr_bentler srmr_bentler_nomean                crmr 
##               0.009               0.009               0.009               0.012 
##         crmr_nomean          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.012               0.009               0.009            1047.821 
##               cn_01                 gfi                agfi                pgfi 
##            1809.050               0.998               0.982               0.100 
##                 mfi                ecvi 
##               1.000               0.087

Index	Actual Value	Preferred Value	Conclusion
Chi-square	0.798(p=0.374)	p > 0.05	Good fit
RMSEA	0.000	< 0.05	Good fit
GFI	0.998	> 0.9	Good fit
NFI	0.999	> 0.9	Good fit
IFI	1.000	> 0.9	Good fit
CFI	1.000	> 0.9	Good fit

semPaths(fit, what = "stand", rotation = 4, layout = "tree")

We also run semCors function to understand the difference between the implied and observed fits of the correlations in our model vs. the data. The only difference observed is the causation between annual mean of solar radiation and yield, which implies a relatively good fit of our model.

semCors(fit)

semCors(fit, include="difference")

We also compared the model-implied covariance matrix and the original covariance matrix of the data. It seems the difference between the two covariance matrices are not large, which indicates a relatively good model fit (not sure).

fitted(fit)

## $cov
##                      lu_ph l_mx__ l_mn__ yld_tf
## lu_ph                0.995                     
## lu_max_temp_summer   0.575 0.995               
## lu_mean_solar_annual 0.483 0.897  0.995        
## yield_tf             1.317 0.685  0.575  5.262

cov(df_n[,c(1,2,3,5)])

##                      lu_max_temp_summer lu_mean_solar_annual     lu_ph
## lu_max_temp_summer            1.0000000            0.9014171 0.5774924
## lu_mean_solar_annual          0.9014171            1.0000000 0.4828236
## lu_ph                         0.5774924            0.4828236 1.0000000
## yield_tf                      0.6711426            0.5111696 1.3231109
##                       yield_tf
## lu_max_temp_summer   0.6711426
## lu_mean_solar_annual 0.5111696
## lu_ph                1.3231109
## yield_tf             5.2863341

modindices(fit)

##                     lhs op                  rhs    mi    epc sepc.lv sepc.all
## 13                 soil =~   lu_max_temp_summer 0.788  0.447   0.446    0.447
## 14                 soil =~ lu_mean_solar_annual 0.788 -0.376  -0.375   -0.376
## 16                lu_ph ~~   lu_max_temp_summer 0.788 -0.050  -0.050       NA
## 17                lu_ph ~~ lu_mean_solar_annual 0.788  0.042   0.042       NA
## 20   lu_max_temp_summer ~~             yield_tf 0.788  0.059   0.059    0.116
## 21 lu_mean_solar_annual ~~             yield_tf 0.788 -0.050  -0.050   -0.054
##    sepc.nox
## 13    0.447
## 14   -0.376
## 16       NA
## 17       NA
## 20    0.116
## 21   -0.054

We also tried improving our model by following the recommendation of modification indices. Unfortunately, no model with valid fit is identified.

model_revised <- '
  # latent variable model
    soil =~ lu_ph 
    climate =~ lu_max_temp_summer + lu_mean_solar_annual
    soil =~ lu_max_temp_summer
    soil =~ lu_mean_solar_annual
    
  # regressions
    yield_tf ~ soil + climate
  # corvariance
'
fit_test <- sem(model_revised, data=df_n)
summary(fit_test, standardized=TRUE)

## lavaan 0.6-7 ended normally after 42 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         12
##                                                       
##   Number of observations                           215
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                    NA
##   Degrees of freedom                                -2
##   P-value (Unknown)                                 NA
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   soil =~                                                               
##     lu_ph             1.000                               0.964    0.966
##   climate =~                                                            
##     lu_mx_tmp_smmr    1.000                               0.772    0.774
##     lu_men_slr_nnl    1.175       NA                      0.907    0.909
##   soil =~                                                               
##     lu_mx_tmp_smmr    0.922       NA                      0.888    0.890
##     lu_men_slr_nnl    0.873       NA                      0.842    0.844
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   yield_tf ~                                                            
##     soil              1.330       NA                      1.282    0.559
##     climate          -0.287       NA                     -0.222   -0.097
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   soil ~~                                                               
##     climate          -0.282       NA                     -0.378   -0.378
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .lu_ph             0.066       NA                      0.066    0.067
##    .lu_mx_tmp_smmr    0.129       NA                      0.129    0.130
##    .lu_men_slr_nnl    0.042       NA                      0.042    0.042
##    .yield_tf          3.353       NA                      3.353    0.637
##     soil              0.929       NA                      1.000    1.000
##     climate           0.596       NA                      1.000    1.000

fitmeasures(fit_test)

##                npar                fmin               chisq                  df 
##              12.000               0.000                  NA              -2.000 
##              pvalue                 cfi                 tli                nnfi 
##                  NA                  NA                  NA                  NA 
##                 rfi                 nfi                pnfi                 ifi 
##                  NA                  NA                  NA                  NA 
##                 rni                logl   unrestricted.logl                 aic 
##                  NA           -1128.547           -1128.547            2281.094 
##                 bic              ntotal                bic2               rmsea 
##            2321.541             215.000            2283.516                  NA 
##      rmsea.ci.lower      rmsea.ci.upper        rmsea.pvalue                 rmr 
##                  NA                  NA                  NA               0.000 
##          rmr_nomean                srmr        srmr_bentler srmr_bentler_nomean 
##               0.000               0.000               0.000               0.000 
##                crmr         crmr_nomean          srmr_mplus   srmr_mplus_nomean 
##               0.000               0.000               0.000               0.000 
##               cn_05               cn_01                 gfi                agfi 
##                 NaN                 NaN               1.000               1.000 
##                pgfi                 mfi                ecvi 
##              -0.200                  NA                  NA

By now, we have achieved a much simplified model with only 3 predictor variables involved. The interpretation of this model is that the determine soil condition factor can be predicted from pH value, and climate factor can be indicated by the max temperature during summer season and the the solar radiation across the year.

3.3 Model Performance Evaluation

Following the same principles used in previous group work, we evaluate the prediction performance of the new model by evaluating the RMSE metrics on both of the train and test datasets. The new RMSE is 3.82 on train dataset and 4.48 on test dataset, not as outstanding as the benchmark model. Besides, the RMSE ratio between training set and test set is 0.851. As a rule of thumb, when such ratio has deviation of more than 15% from 1.0, the model can be concluded as underfitting or overfitting in its predictions. In our scenario, it implies the new model is still a reasonable fit.

set.seed(10)
train_pct = 0.7
# identify train dataset size (number of rows)
train_size <- floor(train_pct * nrow(df_n))
# use sample function to form a random train dataset
train_n <- sample(seq_len(nrow(df_n)), size = train_size)
# generate train dataset
train <- df_n[train_n, ]
# test dataset is the supplementary of train dataset
test <- df_n[-train_n, ]
model_n =lm(yield_tf ~.,data=train[,c(1,2,3,5)])

test_feat <- dplyr::select(test, -yield_tf)
test$predict=predict(model_n,test_feat)
# Predict Train Set
train_feat <- dplyr::select(train, -yield_tf)
train$predict=predict(model_n,train_feat)
rmse_train <- rmse(((train$predict)*lambda+1)^(1/lambda), train$yield)
rmse_test <- rmse(((test$predict)*lambda+1)^(1/lambda), test$yield)
rmse_ratio <- rmse_train/rmse_test

4 Conclusion

In this study, we further analyzed how environmental factors affect grape yield. To ensure the derived model satisfying the critical assumptions of linear regression model, Box-Cox Transformation on response variable (yield) and standardisation of predictor variables are applied for data pre-processing. In addition, Causal Inference Analysis is conducted to enhance the model interpretation alignment with business insights. As a final result, we have designed a model consistent with previous group work results and research directions. This confirms that maximum temperature in summer, annual average solar radiation and soil pH values are highly correlated with grape yield in Australia which simplifies the best performance model from earlier group assignment. Although RMSE

5 Reflection

Structural Equation Modelling (SEM) is commonly justified in theoretical models to explain relationships among different variables. These variables are not only observed variables and also unobserved constructs (Latent variables) which can be independent or dependent in nature [3]. SEM starts with estimation of a model and aims at assessing whether the model is of absolute, incremental and parsimony fit. It is noticeable that many advantages are introduced by SEM, including (1) better understanding of the problem while verifying relationships simultaneously; (2) modelling of measurement errors and unexplained variables; (3) ability to choose best fit models and evaluate the assumptions or theories. However, it is not always performing well in practices as many challenges can be faced (1) model does not converge due to approximately negative variance or covariance matrix showing not positive definite (2) A derived model with good fitness indices does not demonstrate forecast advantage. To achieve a model showing both solid theoretical foundation and promising prediction accuracy, it is very important to have a comprehensive understanding of the background knowledge of the problem before SEM analysis. Such understanding does not only impact the conceptual model construction, but also should influence the data collection process.

6 References

[1] David D.(2020). Applied Statistics with R, viewed 8 November 2020, < http://daviddalpiaz.github.io/appliedstats/>

[2] Webb,L,B. P.H.Whetton and E.W.R Barlow (2005). Impact on Australian Viticulture fromGreenhouse Induced Temperature Change, viewed 20 September 2020, <https://www.researchgate.net/publication/237556186_Impact_on_Australian_Viticulture_from_Greenhouse_Induced_Temperature_Change >

[3] Thakkar, J. (2020). Structural equation modelling : application for research and practice (with AMOS and R) . Springer.

Structural Equation Modelling On Grape Data

Sai Wang

8th November 2020