In previous research, the Data Geeks analytic group have undertaken statistical analysis to understand how geographical factors can affect crop yield in Australia. Though a set of regression models has been designed with promising prediction accuracy, it’s pending further exploration on determining importance of a large group of underlying dependent variables and summarizing a model with solid business interpretation.
In this extended study, we aim to understand how environmental factors including climate and soil quality influence wine grape yield in Australia. In addition, we will try to design a model not only showing good forecast performance, but also better explain the causality between response variable and predictor variables.
The dataset utilised in this study is inherited from the one used in previous group work. Moreover, we made some adjustments in adopted variables after conducting a thorough examination on the prepared datasets.
There is a corruption in data entries of soil pH and soil water capacity attributes – two attributes are showing exact same values for every single record; with extra study to understand soil quality, we reached conclusion that soil water capacity variable is incorrect and needs to be removed from the dataset
The mapping between GI based data and SA2 based data is problematic, causing the GI water usage and grape variety data not matching for each region/state
Outliers Detection – Based on previous group research, grape planted in North Territory is mainly for table grape instead of wine grape; Since there is only one record related to this state, it is decided to be excluded As a final decision, only the information from Land Use datasets (Table 1 in group assignment) with outlier removal is utilised to maintain data integrity.
# load data
df<- read_csv(here::here("yield_by_region.csv"))
df1 <- df %>%
dplyr::select(3:56) %>%
filter(state %in% c('NSW', 'WA', 'Tas', 'Vic', 'SA')) %>% # NT data is related to table grape instead of wine grape, so exclude it
filter(yield < 38) # exclude outlier of yield
df1 <- df1 %>%
dplyr::select(-c("lu_water_capacity", "state")) # exclude water capacity and state
glimpse(df1)## Rows: 215
## Columns: 52
## $ lon <dbl> 149.3426, 149.8055, 149.6395, 148.6423, 148.61...
## $ lat <dbl> -35.22835, -36.69631, -33.37590, -33.83565, -3...
## $ grape_area_region <dbl> 7.15, 3.22, 10.99, 53.43, 368.37, 108.90, 79.9...
## $ grape_area_landuse <dbl> 40.752869, 21.162236, 43.974474, 174.394450, 1...
## $ lu_mean_temp_annual <dbl> 11.73768, 14.37882, 12.33928, 16.10698, 15.804...
## $ lu_mean_temp_winter <dbl> 6.460130, 10.390000, 7.108636, 10.374250, 9.98...
## $ lu_mean_temp_summer <dbl> 15.49699, 17.22393, 16.12078, 20.20036, 19.963...
## $ lu_max_temp_annual <dbl> 18.06087, 19.90500, 18.44000, 22.53750, 22.514...
## $ lu_max_temp_winter <dbl> 12.11774, 16.27967, 12.49164, 15.81450, 15.818...
## $ lu_max_temp_summer <dbl> 22.28522, 22.48643, 22.77948, 27.33679, 27.299...
## $ lu_min_temp_annual <dbl> 5.414493, 8.852639, 6.238561, 9.676458, 9.0945...
## $ lu_min_temp_winter <dbl> 0.8025217, 4.5003333, 1.7256364, 4.9340000, 4....
## $ lu_min_temp_summer <dbl> 8.708758, 11.961427, 9.462079, 13.063930, 12.6...
## $ lu_mean_temp_jan <dbl> 18.77522, 19.32417, 19.43773, 23.78500, 23.695...
## $ lu_mean_temp_feb <dbl> 18.54848, 19.73000, 19.19591, 23.61125, 23.523...
## $ lu_mean_temp_mar <dbl> 16.12696, 18.12750, 16.70864, 20.96625, 20.608...
## $ lu_mean_temp_apr <dbl> 11.95522, 15.26667, 12.74182, 16.55125, 16.040...
## $ lu_mean_temp_may <dbl> 8.177826, 12.152500, 8.952727, 12.402500, 12.0...
## $ lu_mean_temp_jun <dbl> 5.430217, 9.730000, 6.197727, 9.231250, 8.8302...
## $ lu_mean_temp_jul <dbl> 4.435869, 8.720833, 5.014091, 8.315000, 7.8875...
## $ lu_mean_temp_aug <dbl> 5.840869, 9.670833, 6.484091, 9.762500, 9.3469...
## $ lu_mean_temp_sep <dbl> 8.415870, 11.675833, 8.894545, 12.160000, 11.8...
## $ lu_mean_temp_oct <dbl> 11.61130, 14.11250, 12.13773, 15.71125, 15.498...
## $ lu_mean_temp_nov <dbl> 14.31783, 15.98167, 14.81045, 18.81375, 18.598...
## $ lu_mean_temp_dec <dbl> 17.14391, 18.02500, 17.81318, 21.96375, 21.776...
## $ lu_mean_rain_annual <dbl> 701.8815, 923.6184, 703.1797, 632.7203, 616.14...
## $ lu_mean_rain_dry <dbl> 465.3587, 575.4270, 428.9841, 414.9306, 401.79...
## $ lu_mean_rain_wet <dbl> 434.7203, 611.4777, 450.7538, 378.4017, 369.10...
## $ lu_mean_rain_jan <dbl> 58.99304, 64.87000, 65.68545, 49.75750, 51.520...
## $ lu_mean_rain_feb <dbl> 52.28522, 96.82333, 62.86909, 49.84250, 52.434...
## $ lu_mean_rain_mar <dbl> 48.23087, 77.62500, 47.16182, 37.71250, 39.651...
## $ lu_mean_rain_apr <dbl> 43.75261, 75.85167, 42.90636, 39.08500, 39.334...
## $ lu_mean_rain_may <dbl> 43.12826, 62.33833, 42.46364, 39.57750, 41.082...
## $ lu_mean_rain_jun <dbl> 53.64957, 64.15167, 44.05909, 44.54000, 46.563...
## $ lu_mean_rain_jul <dbl> 58.95087, 61.90000, 58.96545, 57.58250, 59.770...
## $ lu_mean_rain_aug <dbl> 56.32826, 38.31000, 62.63000, 48.88250, 51.931...
## $ lu_mean_rain_sep <dbl> 59.54696, 55.35667, 51.68182, 50.00000, 50.243...
## $ lu_mean_rain_oct <dbl> 59.06609, 66.21000, 63.35546, 50.54000, 53.266...
## $ lu_mean_rain_nov <dbl> 73.36652, 86.67333, 72.03000, 54.06250, 54.821...
## $ lu_mean_rain_dec <dbl> 62.92565, 80.02000, 73.12636, 55.66000, 57.788...
## $ lu_mean_solar_annual <dbl> 17.24855, 15.61668, 17.49392, 18.19937, 18.201...
## $ lu_mean_solar_dry <dbl> 11.14162, 10.58263, 11.53976, 11.51478, 11.620...
## $ lu_mean_solar_wet <dbl> 21.61064, 19.21243, 21.74689, 22.97407, 22.901...
## $ lu_bulk_density <dbl> 1.419132, 1.307434, 1.413116, 1.446464, 1.4691...
## $ lu_carbon <dbl> 2.2090407, 2.5544023, 1.5361696, 1.3150985, 1....
## $ lu_clay <dbl> 14.11885, 14.41100, 15.36331, 26.25615, 24.288...
## $ lu_nitrogen <dbl> 0.14302469, 0.19321756, 0.13032669, 0.12377553...
## $ lu_ph <dbl> 4.713461, 4.625989, 4.765173, 5.214870, 5.1098...
## $ lu_phosphorus <dbl> 0.02885974, 0.02884064, 0.02333373, 0.03167121...
## $ lu_sand <dbl> 57.06934, 56.08776, 59.12095, 56.23438, 56.007...
## $ lu_silt <dbl> 18.090666, 14.325860, 15.069221, 17.746149, 17...
## $ yield <dbl> 3.33, 0.80, 1.89, 8.41, 11.12, 9.58, 3.30, 5.8...
A series of assumptions of linear regression model should be held to ensure the model fit with sound theoretical basis. Two most critical ones are the Normality and Equal Variance of the error/residual. The former one requires the distribution of the regression errors to follow a normal distribution, while the later one expects the error variance showing in same level at any set of predictor values. Based on previous group work, these two assumptions are not held strictly in our dataset. To mitigate the issue, two preprocessing steps are applied to the dataset.
On response variable side, Box-Cox transformation is applied. The Box-Cox method [1] considers a family of transformations based on a parameter lambda (λ), which is chosen by numerically maximizing the log-likelihood. As shown in the following formula, when the optized lambda equals to 0, the mehod is equivalent to perform log function on the variable. When lambda is not equal to 0, the operation is equivalent to taking a power of lambda on the variable. This method is one of the classic approaches to transform variables when they are strictly positive. The transformation of Y has the form:
On predicor variable side, standardisation/scale is applied onto all numeric variables. Standardisation calculates the mean and standard deviation of all the values for a varaible, and then “scale” each element by subtracting the mean and dividing by the standard deviation.
In previous research, many predictor variables act as significant indicators in different versions of model fits. However, strong overlap of business representation and collinearity are shown among these variables, hindering researchers gaining reliable conclusion regarding the causality across them. To achieve a model with aligned interpretation of agriculture science, we would like to find causation between variables. Structural Equation Model (SEM) is one of the most popular methodologies to conduct such analysis. SEM aims to formulating the latent variables that potentially fits the data better as well as reveals stronger causal relationship between reponse and predictor variables. There are several indices to evaluate the goodness of fit of the identified model. They are generally categorized into 3 groups, well known as absolute, incremental and parsimony fit indices. The classic absolute indices include Model Chi-square, RMSEA and GFI, whereas the most commonly used incremental indices include NFI and CFI. The difference between Chi-square and the other indices is that most of the indices consist of a single value that must be compared against a benchmark that has been defined by statistics scholars. For instance, a typical cutoff for the goodness of fit index (GFI) is 0.9. It indicates that values greater or equal to 0.9 represent good fit, whereas values less than 0.9 imply poor fit. However, such benchmarks are not always universally agreed upon. In contrast, a fit goodness hypothesis based on Chi-Square statistics provides a statistically consistent outcome with a given p-value.
We started with the baseline model achieved from previous group work. 11 predictor variables are selected with most of the temperature related attributes included.
df1_final <- df1 %>%
dplyr::select(
lu_mean_temp_summer,
lu_mean_rain_oct,
lu_mean_temp_jan,
lu_mean_temp_feb,
lu_mean_temp_sep,
lu_mean_temp_nov,
lu_mean_temp_dec,
lu_mean_solar_wet,
lu_bulk_density,
lu_ph,
lu_phosphorus,
yield
)
title="Grape Histogram"
ggplot(data = df1_final, aes(yield)) +
geom_histogram(aes(y=..density..), bins=100, color="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666")+
labs(title=title)From above histogram we can observe that yield variable has a non-normal distribution. Moreover, from model diagnostic plots we can observe that the basic assumptions for linear regression are not always held for our scenario. On one side, the mean of the residuals does not show 0 at each fitted value. This indicates the linearity assumption is invalid. On the other hand, the spread of the residuals is larger for larger fitted values. This implies that the constant variance assumption is violated. One solution to deal with ill-shaped variable distribution problem is to conduct Box-Cox transformation. Firstly, we search for a proper lambda by running the Box-Cox analysis. After identifying that the lambda should reside between 0.4 and 0.7, we further zoom in the selection range of lambda and increase the step precision to 0.05, and finally retrieve a lambda of 0.65. We then repeat the model fitting process on the new response variable.
par(mfrow=c(1,2))
boxcox(model1, lambda = seq(-1, 1, by = 0.1), plotit = TRUE)
boxcox(model1, lambda = seq(0.5, 0.7, by = 0.05), plotit = TRUE)lambda = 0.65
df1_final$yield_tf <- (df1_final$yield^(lambda) - 1)/lambda
ggplot(data = df1_final, aes(yield_tf)) +
geom_histogram(aes(y=..density..), bins=100, color="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666")+
labs(title='Yield Histogram (After Transformation)')It is clear that after transformation the distribution of yield no longer shows skewness and has a better fit into normal distribution. By comparing the Residual vs. Fitted plot from initial Model1 and the same model after Box-Cox transformation, we can observe that the Normality violation of Residual and inconsistent Variance of Residual issues are mitigated. To keep our analysis and model creation principle consistent, we adopt the same transformation on response variable in the following model fits practices.
Based on the work by researchers[2], wine grape yield is affected by combination of various environmental factors (e.g., climate, pest, disease, etc.). Among them, climate and soil condition collected in our dataset play most important roles in wine grape yield. In addition, water capacity is no longer a predominate factor due to the quick development of irrigation system, leaving solar radiation and temperature as the more critical influencer. They also argued that climate shows more impact during grape growing season, i.e., summer/wet season. Therefore, it can be found that in our following study only months related to summer and wet season are selected. Soil-wise, not many indices show significance due to the heavy usage of fertiliser. Based on this knowledge, we define our conceptual causality model with two latent variables, soil and climate. Their causation with yield is depicted in the following Directed Acyclic Graph (DAG). As for details, the identifier table summarises the observed variables for each latent variable.
As an initial SEM based model, we include all 11 observed variables in the baseline model and use lavaan to verify whether it fits the data well. (Some of the observed variables are not shown in our table because from the following analysis we find that they are not significant, e.g., bulk density and phosphorus)
df1_sem <- df1_final %>%
dplyr::select(lu_mean_temp_summer,
lu_mean_rain_oct,
lu_mean_temp_jan,
lu_mean_temp_feb,
lu_mean_temp_sep,
lu_mean_temp_nov,
lu_mean_temp_dec,
lu_mean_solar_wet,
lu_bulk_density,
lu_ph,
lu_phosphorus,
yield_tf)
model <- '
# latent variable model
soil =~ lu_ph + lu_phosphorus + lu_bulk_density
climate =~ lu_mean_temp_summer + lu_mean_rain_oct + lu_mean_temp_jan + lu_mean_temp_feb + lu_mean_temp_sep + lu_mean_temp_nov + lu_mean_temp_dec + lu_mean_solar_wet
# regressions
yield_tf ~ soil + climate
# corvariance
lu_mean_temp_summer ~~ lu_mean_temp_jan
lu_mean_temp_summer ~~ lu_mean_temp_feb
lu_mean_temp_summer ~~ lu_mean_temp_sep
lu_mean_temp_summer ~~ lu_mean_temp_nov
lu_mean_temp_summer ~~ lu_mean_temp_dec
'
fit <- sem(model, data=df1_sem)
summary(fit)## lavaan 0.6-7 did NOT end normally after 840 iterations
## ** WARNING ** Estimates below are most likely unreliable
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 31
##
## Number of observations 215
##
## Model Test User Model:
##
## Test statistic NA
## Degrees of freedom NA
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## soil =~
## lu_ph 1.000
## lu_phosphorus 0.001 NA
## lu_bulk_densty 0.117 NA
## climate =~
## lu_mn_tmp_smmr 1.000
## lu_mean_ran_ct -5.138 NA
## lu_mean_tmp_jn 1.201 NA
## lu_mean_tmp_fb 1.152 NA
## lu_mean_tmp_sp 0.582 NA
## lu_mean_tmp_nv 1.004 NA
## lu_mean_tmp_dc 1.179 NA
## lu_mean_slr_wt 0.673 NA
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## yield_tf ~
## soil -70.471 NA
## climate 15.874 NA
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .lu_mean_temp_summer ~~
## .lu_mean_tmp_jn 0.019 NA
## .lu_mean_tmp_fb 0.037 NA
## .lu_mean_tmp_sp 0.356 NA
## .lu_mean_tmp_nv 0.038 NA
## .lu_mean_tmp_dc 0.052 NA
## soil ~~
## climate 0.989 NA
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .lu_ph 0.373 NA
## .lu_phosphorus 0.000 NA
## .lu_bulk_densty 0.004 NA
## .lu_mn_tmp_smmr 0.150 NA
## .lu_mean_ran_ct 204.851 NA
## .lu_mean_tmp_jn -0.017 NA
## .lu_mean_tmp_fb 0.071 NA
## .lu_mean_tmp_sp 1.232 NA
## .lu_mean_tmp_nv 0.334 NA
## .lu_mean_tmp_dc 0.113 NA
## .lu_mean_slr_wt 0.828 NA
## .yield_tf 23.752 NA
## soil 0.214 NA
## climate 4.495 NA
Above lavaan output shows no reliable model is identified after 984 iterations. This is because when variables have large range of variance or the covariance matrix is not positive-definite, it is difficult for lavaan to find a converged solution. Therefore, rescaling the variables into a smaller range according to their distributions is necessary.
df1_sem[,1:11] <- scale(df1_sem[,1:11])
df1_sem <-as.data.frame((df1_sem))
fit <- sem(model, data=df1_sem, check.gradient=FALSE)
summary(fit, standardized=TRUE)## lavaan 0.6-7 ended normally after 285 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 31
##
## Number of observations 215
##
## Model Test User Model:
##
## Test statistic 1314.198
## Degrees of freedom 47
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## soil =~
## lu_ph 1.000 0.616 0.617
## lu_phosphorus 0.136 0.108 1.264 0.206 0.084 0.084
## lu_bulk_densty 1.109 0.123 9.051 0.000 0.683 0.684
## climate =~
## lu_mn_tmp_smmr 1.000 0.973 0.984
## lu_mean_ran_ct -0.621 0.056 -11.166 0.000 -0.604 -0.606
## lu_mean_tmp_jn 1.026 0.011 94.389 0.000 0.999 1.001
## lu_mean_tmp_fb 1.019 0.012 82.833 0.000 0.992 0.994
## lu_mean_tmp_sp 0.762 0.039 19.460 0.000 0.742 0.744
## lu_mean_tmp_nv 0.989 0.020 48.782 0.000 0.963 0.965
## lu_mean_tmp_dc 1.016 0.012 82.097 0.000 0.989 0.991
## lu_mean_slr_wt 0.864 0.039 22.257 0.000 0.841 0.843
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## yield_tf ~
## soil 48.639 238.585 0.204 0.838 29.940 11.848
## climate -29.663 149.412 -0.199 0.843 -28.874 -11.426
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .lu_mean_temp_summer ~~
## .lu_mean_tmp_jn 0.003 0.001 3.361 0.001 0.003 0.379
## .lu_mean_tmp_fb 0.007 0.001 6.361 0.000 0.007 0.355
## .lu_mean_tmp_sp 0.098 0.009 10.587 0.000 0.098 0.827
## .lu_mean_tmp_nv 0.008 0.001 6.623 0.000 0.008 0.168
## .lu_mean_tmp_dc 0.009 0.001 8.086 0.000 0.009 0.401
## soil ~~
## climate 0.594 0.077 7.690 0.000 0.991 0.991
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .lu_ph 0.616 0.068 9.040 0.000 0.616 0.619
## .lu_phosphorus 0.988 0.095 10.372 0.000 0.988 0.993
## .lu_bulk_densty 0.529 0.066 8.075 0.000 0.529 0.532
## .lu_mn_tmp_smmr 0.032 0.003 10.992 0.000 0.032 0.032
## .lu_mean_ran_ct 0.630 0.060 10.481 0.000 0.630 0.633
## .lu_mean_tmp_jn -0.003 0.001 -4.272 0.000 -0.003 -0.003
## .lu_mean_tmp_fb 0.012 0.001 9.743 0.000 0.012 0.012
## .lu_mean_tmp_sp 0.445 0.043 10.422 0.000 0.445 0.447
## .lu_mean_tmp_nv 0.068 0.006 10.755 0.000 0.068 0.069
## .lu_mean_tmp_dc 0.018 0.002 10.347 0.000 0.018 0.018
## .lu_mean_slr_wt 0.288 0.027 10.657 0.000 0.288 0.289
## .yield_tf -10.832 80.605 -0.134 0.893 -10.832 -1.696
## soil 0.379 0.082 4.596 0.000 1.000 1.000
## climate 0.948 0.094 10.036 0.000 1.000 1.000
## npar fmin chisq df
## 31.000 3.056 1314.198 47.000
## pvalue baseline.chisq baseline.df baseline.pvalue
## 0.000 6141.036 66.000 0.000
## cfi tli nnfi rfi
## 0.791 0.707 0.707 0.699
## nfi pnfi ifi rni
## 0.786 0.560 0.792 0.791
## logl unrestricted.logl aic bic
## -1441.240 -784.141 2944.479 3048.969
## ntotal bic2 rmsea rmsea.ci.lower
## 215.000 2950.736 0.354 0.338
## rmsea.ci.upper rmsea.pvalue rmr rmr_nomean
## 0.371 0.000 0.144 0.144
## srmr srmr_bentler srmr_bentler_nomean crmr
## 0.099 0.099 0.099 0.107
## crmr_nomean srmr_mplus srmr_mplus_nomean cn_05
## 0.107 0.099 0.099 11.470
## cn_01 gfi agfi pgfi
## 12.852 0.568 0.283 0.342
## mfi ecvi
## 0.052 6.401
From the re-run output, we can find the the model converging. However, the fitness of model is not impressive. Below table with evaluation of critical indices show that this model is not a good fit.
| Index | Actual Value | Preferred Value | Conclusion |
|---|---|---|---|
| Chi-square | 1314.198(p=0) | p > 0.05 | Not fit |
| RMSEA | 0.354 | < 0.05 | Not fit |
| GFI | 0.659 | > 0.9 | Not fit |
| NFI | 0.786 | > 0.9 | Not fit |
| IFI | 0.792 | > 0.9 | Not fit |
| CFI | 0.791 | > 0.9 | Not fit |
We also tried several other models with different manifest/observed variable selection. Model 2 considers lu_max_temp_summer, lu_mean_rain_wet, lu_mean_solar_annual and lu_mean_solar_wet as manifest variables for climate latent variable, and lu_ph, lu_carbon and lu_nitrogen as observed variables for the latent variable representing soil condition.
From below lavaan output, we can find that the Chi-square test, CFI, IFI, NFI, RMSEA and GFI all fail again.
df_2 <- df1 %>%
dplyr::select(lu_max_temp_summer, lu_mean_rain_wet,
lu_mean_solar_annual, lu_mean_solar_wet,
lu_ph, lu_carbon, lu_nitrogen,
yield)
df_2[,1:7] <- scale(df_2[,1:7])
df_2 <-as.data.frame((df_2))
model_2 = lm(yield ~.,data=df_2)
boxcox(model_2, lambda = seq(0.5, 0.7, by = 0.05), plotit = TRUE)lambda = 0.6
df_2$yield_tf <- (df_2$yield^(lambda) - 1)/lambda
model <- '
# latent variable model
soil =~ lu_ph + lu_carbon + lu_nitrogen
climate =~ lu_max_temp_summer + lu_mean_solar_annual + lu_mean_rain_wet + lu_mean_solar_wet
# regressions
yield_tf ~ soil + climate
# corvariance
lu_mean_solar_annual ~~ lu_mean_solar_wet
'
fit <- sem(model, data=df_2)
summary(fit, standardized=TRUE, )## lavaan 0.6-7 ended normally after 47 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 19
##
## Number of observations 215
##
## Model Test User Model:
##
## Test statistic 412.134
## Degrees of freedom 17
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## soil =~
## lu_ph 1.000 0.745 0.747
## lu_carbon -1.220 0.088 -13.931 0.000 -0.909 -0.911
## lu_nitrogen -1.237 0.088 -14.106 0.000 -0.921 -0.923
## climate =~
## lu_mx_tmp_smmr 1.000 0.970 0.972
## lu_men_slr_nnl 0.953 0.037 25.625 0.000 0.924 0.926
## lu_mean_ran_wt -0.526 0.063 -8.395 0.000 -0.510 -0.512
## lu_mean_slr_wt 0.896 0.043 20.977 0.000 0.868 0.870
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## yield_tf ~
## soil 1.697 0.428 3.967 0.000 1.264 0.551
## climate -0.406 0.315 -1.290 0.197 -0.394 -0.172
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .lu_mean_solar_annual ~~
## .lu_mean_slr_wt 0.148 0.025 5.916 0.000 0.148 0.801
## soil ~~
## climate 0.603 0.076 7.923 0.000 0.835 0.835
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .lu_ph 0.441 0.047 9.429 0.000 0.441 0.443
## .lu_carbon 0.169 0.027 6.372 0.000 0.169 0.170
## .lu_nitrogen 0.147 0.026 5.728 0.000 0.147 0.147
## .lu_mx_tmp_smmr 0.055 0.022 2.516 0.012 0.055 0.056
## .lu_men_slr_nnl 0.142 0.024 5.960 0.000 0.142 0.143
## .lu_mean_ran_wt 0.735 0.072 10.216 0.000 0.735 0.738
## .lu_mean_slr_wt 0.241 0.030 7.937 0.000 0.241 0.242
## .yield_tf 4.340 0.432 10.039 0.000 4.340 0.825
## soil 0.555 0.088 6.287 0.000 1.000 1.000
## climate 0.940 0.098 9.572 0.000 1.000 1.000
## npar fmin chisq df
## 19.000 0.958 412.134 17.000
## pvalue baseline.chisq baseline.df baseline.pvalue
## 0.000 2014.684 28.000 0.000
## cfi tli nnfi rfi
## 0.801 0.672 0.672 0.663
## nfi pnfi ifi rni
## 0.795 0.483 0.802 0.801
## logl unrestricted.logl aic bic
## -1814.291 -1608.224 3666.582 3730.624
## ntotal bic2 rmsea rmsea.ci.lower
## 215.000 3670.417 0.329 0.302
## rmsea.ci.upper rmsea.pvalue rmr rmr_nomean
## 0.357 0.000 0.164 0.164
## srmr srmr_bentler srmr_bentler_nomean crmr
## 0.105 0.105 0.105 0.119
## crmr_nomean srmr_mplus srmr_mplus_nomean cn_05
## 0.119 0.105 0.105 15.392
## cn_01 gfi agfi pgfi
## 18.428 0.740 0.449 0.349
## mfi ecvi
## 0.399 2.094
Model 3 removes lu_mean_rain_wet and lu_mean_solar_wet for climate from Model 2, and keep the same group of variables as observed variables for soil condition latent variable. From below lavaan output, we can observe that Model 3 pass the model fitness test for CFI, IFI, NFI and GFI, but still does not satisfy the Chi-square test.
df_3 <- df1 %>%
dplyr::select(lu_max_temp_summer,
lu_mean_solar_annual,
lu_ph, lu_carbon, lu_nitrogen,
yield)
df_3[,1:5] <- scale(df_3[,1:5])
df_3 <-as.data.frame((df_3))
model_3 = lm(yield ~.,data=df_3)
boxcox(model_3, lambda = seq(0.5, 0.7, by = 0.05), plotit = TRUE)lambda = 0.6
df_3$yield_tf <- (df_3$yield^(lambda) - 1)/lambda
model <- '
# latent variable model
soil =~ lu_ph + lu_carbon + lu_nitrogen
climate =~ lu_max_temp_summer + lu_mean_solar_annual
# regressions
yield_tf ~ soil + climate
# corvariance
'
fit <- sem(model, data=df_3)
summary(fit, standardized=TRUE, )## lavaan 0.6-7 ended normally after 42 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 14
##
## Number of observations 215
##
## Model Test User Model:
##
## Test statistic 64.919
## Degrees of freedom 7
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## soil =~
## lu_ph 1.000 0.746 0.748
## lu_carbon -1.220 0.087 -13.977 0.000 -0.910 -0.912
## lu_nitrogen -1.233 0.087 -14.110 0.000 -0.920 -0.922
## climate =~
## lu_mx_tmp_smmr 1.000 0.986 0.989
## lu_men_slr_nnl 0.922 0.038 24.317 0.000 0.910 0.912
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## yield_tf ~
## soil 1.601 0.400 4.000 0.000 1.195 0.521
## climate -0.320 0.287 -1.113 0.266 -0.315 -0.137
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## soil ~~
## climate 0.604 0.076 7.930 0.000 0.821 0.821
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .lu_ph 0.439 0.047 9.412 0.000 0.439 0.441
## .lu_carbon 0.167 0.027 6.268 0.000 0.167 0.168
## .lu_nitrogen 0.150 0.026 5.776 0.000 0.150 0.150
## .lu_mx_tmp_smmr 0.022 0.025 0.884 0.377 0.022 0.022
## .lu_men_slr_nnl 0.168 0.027 6.286 0.000 0.168 0.169
## .yield_tf 4.353 0.432 10.082 0.000 4.353 0.827
## soil 0.557 0.088 6.300 0.000 1.000 1.000
## climate 0.973 0.099 9.812 0.000 1.000 1.000
## npar fmin chisq df
## 14.000 0.151 64.919 7.000
## pvalue baseline.chisq baseline.df baseline.pvalue
## 0.000 1080.445 15.000 0.000
## cfi tli nnfi rfi
## 0.946 0.884 0.884 0.871
## nfi pnfi ifi rni
## 0.940 0.439 0.946 0.946
## logl unrestricted.logl aic bic
## -1498.662 -1466.202 3025.323 3072.512
## ntotal bic2 rmsea rmsea.ci.lower
## 215.000 3028.149 0.196 0.154
## rmsea.ci.upper rmsea.pvalue rmr rmr_nomean
## 0.241 0.000 0.141 0.141
## srmr srmr_bentler srmr_bentler_nomean crmr
## 0.064 0.064 0.064 0.076
## crmr_nomean srmr_mplus srmr_mplus_nomean cn_05
## 0.076 0.064 0.064 47.588
## cn_01 gfi agfi pgfi
## 62.187 0.913 0.739 0.304
## mfi ecvi
## 0.874 0.432
After several more experiments, We finally reach a variable selection where the model is further simplified by only considering summer temperature and annual solar radiation as observed variables for climate, and removing carbon and nitrogen from soil related observed variables. This time, the model converges and shows a good fitness. Below table shows a summary of the fitness test result.
df_n <- df1 %>%
dplyr::select(lu_max_temp_summer,
lu_mean_solar_annual,
lu_ph,
yield)
df_n[,1:3] <- scale(df_n[,1:3])
df_n <-as.data.frame((df_n))
model_n = lm(yield ~.,data=df_n)
boxcox(model_n, lambda = seq(0.5, 0.7, by = 0.05), plotit = TRUE)lambda = 0.6
df_n$yield_tf <- (df_n$yield^(lambda) - 1)/lambda
model_n_bc = lm(yield_tf ~.,data=df_n[,c(1:3,5)])model <- '
# latent variable model
soil =~ lu_ph
climate =~ lu_max_temp_summer + lu_mean_solar_annual
# regressions
yield_tf ~ soil + climate
# corvariance
'
fit <- sem(model, data=df_n)
summary(fit, standardized=TRUE, )## lavaan 0.6-7 ended normally after 27 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 9
##
## Number of observations 215
##
## Model Test User Model:
##
## Test statistic 0.789
## Degrees of freedom 1
## P-value (Chi-square) 0.374
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## soil =~
## lu_ph 1.000 0.998 1.000
## climate =~
## lu_mx_tmp_smmr 1.000 1.034 1.036
## lu_men_slr_nnl 0.839 0.051 16.306 0.000 0.868 0.870
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## yield_tf ~
## soil 1.383 0.151 9.137 0.000 1.380 0.602
## climate -0.104 0.139 -0.744 0.457 -0.107 -0.047
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## soil ~~
## climate 0.575 0.078 7.339 0.000 0.558 0.558
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .lu_ph 0.000 0.000 0.000
## .lu_mx_tmp_smmr -0.074 0.054 -1.371 0.170 -0.074 -0.074
## .lu_men_slr_nnl 0.242 0.044 5.482 0.000 0.242 0.243
## .yield_tf 3.511 0.339 10.371 0.000 3.511 0.667
## soil 0.995 0.096 10.368 0.000 1.000 1.000
## climate 1.069 0.110 9.758 0.000 1.000 1.000
## npar fmin chisq df
## 9.000 0.002 0.789 1.000
## pvalue baseline.chisq baseline.df baseline.pvalue
## 0.374 537.473 6.000 0.000
## cfi tli nnfi rfi
## 1.000 1.002 1.002 0.991
## nfi pnfi ifi rni
## 0.999 0.166 1.000 1.000
## logl unrestricted.logl aic bic
## -1128.941 -1128.547 2275.883 2306.218
## ntotal bic2 rmsea rmsea.ci.lower
## 215.000 2277.699 0.000 0.000
## rmsea.ci.upper rmsea.pvalue rmr rmr_nomean
## 0.173 0.491 0.022 0.022
## srmr srmr_bentler srmr_bentler_nomean crmr
## 0.009 0.009 0.009 0.012
## crmr_nomean srmr_mplus srmr_mplus_nomean cn_05
## 0.012 0.009 0.009 1047.821
## cn_01 gfi agfi pgfi
## 1809.050 0.998 0.982 0.100
## mfi ecvi
## 1.000 0.087
| Index | Actual Value | Preferred Value | Conclusion |
|---|---|---|---|
| Chi-square | 0.798(p=0.374) | p > 0.05 | Good fit |
| RMSEA | 0.000 | < 0.05 | Good fit |
| GFI | 0.998 | > 0.9 | Good fit |
| NFI | 0.999 | > 0.9 | Good fit |
| IFI | 1.000 | > 0.9 | Good fit |
| CFI | 1.000 | > 0.9 | Good fit |
We also run semCors function to understand the difference between the implied and observed fits of the correlations in our model vs. the data. The only difference observed is the causation between annual mean of solar radiation and yield, which implies a relatively good fit of our model.
We also compared the model-implied covariance matrix and the original covariance matrix of the data. It seems the difference between the two covariance matrices are not large, which indicates a relatively good model fit (not sure).
## $cov
## lu_ph l_mx__ l_mn__ yld_tf
## lu_ph 0.995
## lu_max_temp_summer 0.575 0.995
## lu_mean_solar_annual 0.483 0.897 0.995
## yield_tf 1.317 0.685 0.575 5.262
## lu_max_temp_summer lu_mean_solar_annual lu_ph
## lu_max_temp_summer 1.0000000 0.9014171 0.5774924
## lu_mean_solar_annual 0.9014171 1.0000000 0.4828236
## lu_ph 0.5774924 0.4828236 1.0000000
## yield_tf 0.6711426 0.5111696 1.3231109
## yield_tf
## lu_max_temp_summer 0.6711426
## lu_mean_solar_annual 0.5111696
## lu_ph 1.3231109
## yield_tf 5.2863341
## lhs op rhs mi epc sepc.lv sepc.all
## 13 soil =~ lu_max_temp_summer 0.788 0.447 0.446 0.447
## 14 soil =~ lu_mean_solar_annual 0.788 -0.376 -0.375 -0.376
## 16 lu_ph ~~ lu_max_temp_summer 0.788 -0.050 -0.050 NA
## 17 lu_ph ~~ lu_mean_solar_annual 0.788 0.042 0.042 NA
## 20 lu_max_temp_summer ~~ yield_tf 0.788 0.059 0.059 0.116
## 21 lu_mean_solar_annual ~~ yield_tf 0.788 -0.050 -0.050 -0.054
## sepc.nox
## 13 0.447
## 14 -0.376
## 16 NA
## 17 NA
## 20 0.116
## 21 -0.054
We also tried improving our model by following the recommendation of modification indices. Unfortunately, no model with valid fit is identified.
model_revised <- '
# latent variable model
soil =~ lu_ph
climate =~ lu_max_temp_summer + lu_mean_solar_annual
soil =~ lu_max_temp_summer
soil =~ lu_mean_solar_annual
# regressions
yield_tf ~ soil + climate
# corvariance
'
fit_test <- sem(model_revised, data=df_n)
summary(fit_test, standardized=TRUE)## lavaan 0.6-7 ended normally after 42 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 12
##
## Number of observations 215
##
## Model Test User Model:
##
## Test statistic NA
## Degrees of freedom -2
## P-value (Unknown) NA
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## soil =~
## lu_ph 1.000 0.964 0.966
## climate =~
## lu_mx_tmp_smmr 1.000 0.772 0.774
## lu_men_slr_nnl 1.175 NA 0.907 0.909
## soil =~
## lu_mx_tmp_smmr 0.922 NA 0.888 0.890
## lu_men_slr_nnl 0.873 NA 0.842 0.844
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## yield_tf ~
## soil 1.330 NA 1.282 0.559
## climate -0.287 NA -0.222 -0.097
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## soil ~~
## climate -0.282 NA -0.378 -0.378
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .lu_ph 0.066 NA 0.066 0.067
## .lu_mx_tmp_smmr 0.129 NA 0.129 0.130
## .lu_men_slr_nnl 0.042 NA 0.042 0.042
## .yield_tf 3.353 NA 3.353 0.637
## soil 0.929 NA 1.000 1.000
## climate 0.596 NA 1.000 1.000
## npar fmin chisq df
## 12.000 0.000 NA -2.000
## pvalue cfi tli nnfi
## NA NA NA NA
## rfi nfi pnfi ifi
## NA NA NA NA
## rni logl unrestricted.logl aic
## NA -1128.547 -1128.547 2281.094
## bic ntotal bic2 rmsea
## 2321.541 215.000 2283.516 NA
## rmsea.ci.lower rmsea.ci.upper rmsea.pvalue rmr
## NA NA NA 0.000
## rmr_nomean srmr srmr_bentler srmr_bentler_nomean
## 0.000 0.000 0.000 0.000
## crmr crmr_nomean srmr_mplus srmr_mplus_nomean
## 0.000 0.000 0.000 0.000
## cn_05 cn_01 gfi agfi
## NaN NaN 1.000 1.000
## pgfi mfi ecvi
## -0.200 NA NA
By now, we have achieved a much simplified model with only 3 predictor variables involved. The interpretation of this model is that the determine soil condition factor can be predicted from pH value, and climate factor can be indicated by the max temperature during summer season and the the solar radiation across the year.
Following the same principles used in previous group work, we evaluate the prediction performance of the new model by evaluating the RMSE metrics on both of the train and test datasets. The new RMSE is 3.82 on train dataset and 4.48 on test dataset, not as outstanding as the benchmark model. Besides, the RMSE ratio between training set and test set is 0.851. As a rule of thumb, when such ratio has deviation of more than 15% from 1.0, the model can be concluded as underfitting or overfitting in its predictions. In our scenario, it implies the new model is still a reasonable fit.
set.seed(10)
train_pct = 0.7
# identify train dataset size (number of rows)
train_size <- floor(train_pct * nrow(df_n))
# use sample function to form a random train dataset
train_n <- sample(seq_len(nrow(df_n)), size = train_size)
# generate train dataset
train <- df_n[train_n, ]
# test dataset is the supplementary of train dataset
test <- df_n[-train_n, ]
model_n =lm(yield_tf ~.,data=train[,c(1,2,3,5)])
test_feat <- dplyr::select(test, -yield_tf)
test$predict=predict(model_n,test_feat)
# Predict Train Set
train_feat <- dplyr::select(train, -yield_tf)
train$predict=predict(model_n,train_feat)
rmse_train <- rmse(((train$predict)*lambda+1)^(1/lambda), train$yield)
rmse_test <- rmse(((test$predict)*lambda+1)^(1/lambda), test$yield)
rmse_ratio <- rmse_train/rmse_testIn this study, we further analyzed how environmental factors affect grape yield. To ensure the derived model satisfying the critical assumptions of linear regression model, Box-Cox Transformation on response variable (yield) and standardisation of predictor variables are applied for data pre-processing. In addition, Causal Inference Analysis is conducted to enhance the model interpretation alignment with business insights. As a final result, we have designed a model consistent with previous group work results and research directions. This confirms that maximum temperature in summer, annual average solar radiation and soil pH values are highly correlated with grape yield in Australia which simplifies the best performance model from earlier group assignment. Although RMSE
Structural Equation Modelling (SEM) is commonly justified in theoretical models to explain relationships among different variables. These variables are not only observed variables and also unobserved constructs (Latent variables) which can be independent or dependent in nature [3]. SEM starts with estimation of a model and aims at assessing whether the model is of absolute, incremental and parsimony fit. It is noticeable that many advantages are introduced by SEM, including (1) better understanding of the problem while verifying relationships simultaneously; (2) modelling of measurement errors and unexplained variables; (3) ability to choose best fit models and evaluate the assumptions or theories. However, it is not always performing well in practices as many challenges can be faced (1) model does not converge due to approximately negative variance or covariance matrix showing not positive definite (2) A derived model with good fitness indices does not demonstrate forecast advantage. To achieve a model showing both solid theoretical foundation and promising prediction accuracy, it is very important to have a comprehensive understanding of the background knowledge of the problem before SEM analysis. Such understanding does not only impact the conceptual model construction, but also should influence the data collection process.
[1] David D.(2020). Applied Statistics with R, viewed 8 November 2020, < http://daviddalpiaz.github.io/appliedstats/>
[2] Webb,L,B. P.H.Whetton and E.W.R Barlow (2005). Impact on Australian Viticulture fromGreenhouse Induced Temperature Change, viewed 20 September 2020, <https://www.researchgate.net/publication/237556186_Impact_on_Australian_Viticulture_from_Greenhouse_Induced_Temperature_Change >
[3] Thakkar, J. (2020). Structural equation modelling : application for research and practice (with AMOS and R) . Springer.