Introduction

Historically, agriculture has played an important role in the Australian economy and society (Australian Government Productivity Commission, 2005). With the development in social environment, innovation and intelligent planning in the agricultural industry are vital to ensure that food supply demands are met across the country and exported overseas. Various researches indicate that some of the key factors shaping the industry could be summarized in consumer demands, government policies, technological advances and innovation and emerging environment concerns.


Background

In assessment task 2, the initial hypothesis of our study was to provide insights into what are the most influential geographical factors and how does yield react to changes of these factors by using linear regression. The team attempted to apply different types of approaches to find out the best fitting model to predict crop yield based on multiple environmental variables. The final model revealed that the mean temperature between September and February in the next year had the most influence on the wine grapes yield and other variables such as latitude also have some degree of influence on yield. However, at that time the complexities of the topics were unknown, and the team only had the comparatively simple tool of linear regression to build up the model and most of features are manually selected. Even though some of our members used Boruta package to choose variables, the types of models waived resulted in a lower train set / test set RMSE ratio but higher test set RSME number. In addition, it was hard to evaluate whether variables selected were or were not related.


Current Study and Aiming

In order to expand the exploration in agriculture and enhance the insights concluded in assessment 2, two question were raised up:

1. Why the other variables should be removed?
2. How do the different variables interact?

For the first question, the current study attempts to test variables derived from our previous study by employing Structural Equation Modelling (SEM) method. Literally, the SEM is a statistical modelling technique which is widely used to analyses the structural relationship between measured variables and latent construct. The relationship between the theoretical constructs (i.e. latent factors) are represented by regression or path coefficients between the factors (J.J. Hox, 1998). In another word, SEM model enables us to test these structures, observing how well they reflect the observational data, and therefore how well our assumptions may fit to the actual interactions taking place (Joshua McCarthy, 2020).

For the second question, this project will apply Exploratory Factor Analysis (EFA) method to find out whether these variables have similar patterns of responses. The basic assumption of factor analysis is that for a collection of observed variables there are a set of underlying variables called factors that can explain the interrelationships among measured variables.


Modelling Methodology

In order to expand the exploration in agriculture and enhance the insights concluded in assessment 2, two question were raised up:

Dataset Preparation

The data is reusing the datasets collected in assignment 2 which contains:

  • Crop yield (mainly focusing on wine grapes)
  • Region information (including region code, longitude, latitude and state)
  • Temperature
  • Rainfall
  • Solar radiation
  • Soil condition
  • Water usage
  • Agricultural land use (mainly focusing on vineyard)
  • Wine grape varieties

However, regarding to the requirements for this project, the selected data sources were pre-processed by using R. The climate values were narrowed down to only one variable which aims to reduce the weights of climate variables’ impact on the new model. Grape varieties data is initially in forms of ratio within SA2 region. Currently, they were re-calculated by incorporating with land use data which results in cultivated area of various grapes across states. Further, the scale of varieties was also narrowed down to top five wine grapes cultivated in Australia.

After removing the outliers, the number of records in this dataset is 220 and the number of initial observed variables in the model is 12. According to Kline (2011), a typical sample size in studies where SEM is used is about 200 cases with an ideal sample size-to-parameter ratio of 20:1. From J.C.F de Winter, et al. (2009), they pointed that EFA is generally regarded as a technique for large sample sizes (N), with N = 50 as reasonable absolute minimum. Thus, the sample size of the dataset is adequate for both EFA and SEM.

Model Specification

Yi Fan et, al. (2016) noted that model specification defines the hypothesized relationship among the variables in an SEM based on one’s knowledge. Model identification is to check if the model is over-identified, just-identified or under-identified. Model coefficients can be only estimated in the just-identified or over-identified mode. In another word, SEM requires us to develop a literature and theory supported theoretical model which indicates the relationship between the latent variables and observed indicators. In final, the observed variables were selected by supports from multiple researches and life experiences.

Intuitively, it would seem likely that climatic changes, especially the temperature are important to grapevines and grape production, which is concluded from our previous model. It also proves that fruitfulness would be improved by high temperatures during planting season (i.e. summer). Further, the region with a higher mean temperature across the whole year is more likely to get solar radiation than the region with a lower value. Likewise, it could be less possible to get a higher level of mean rainfall. EDA identified that there is a decreasing trend between crop yield and rainfall. But lack of rainfall can be a severe influence on grape productivity in the absence of good quality irrigation water (Ted Goldammer, 2018). Soil conditions should be taken into consideration as well. A. Lynn Cochran reported that grapes could tolerate a fairly wide range of soil chemistry and conditions while some soil types are more ideal for grape-growing in terms of pH, organic content, water capacity and soil fertility.

Another relevant factor which may affect the yield is land use. Obviously, vineyards with a larger growing area for viticulture could get a higher level of grape yield every year. The last potential factor is the variety of grape since due to differing viticultural practices and diverse composition of different varietals, the amount of grape harvested in hectolitre per hectare vary greatly.

The Directed Acyclic Graph (DAG) below shows a conceptual model of the correlations between all observed variables and latent variables. SEM model is constructed from the DAG.

Figure 1. Conceptual Directed Acyclic Graph

The table below shows the latent variables in SEM model and their identifiers.

Table 1. Latent Variables and Identifiers

S. No. Latent Variables Identifiers
1 mean_temp l_mn_t_ lu_mean_temp_annual
2 mean_rainfall l_mn_r_ lu_mean_rain_annual
3 soil_cond

l_w_

l_s_

l_p_

l_c_

lu_water_capacity

lu_sand

lu_ph

lu_carbon

4 mean_solar l_mn_s_ lu_mean_solar_annual
5 variety_area

ch_

sh_

cl_

tm_

c_

chardonnay_area

shiraz_area

colombard_area

tempranillo_area

cabernet_sauvignon_area

6 land_use g_ grape_area_landuse
7 crop_yield cr_ yield

Model Analysis

Model #1 – Structural Equation Model

Multiple SEMs were created and analysed while unfortunately, each iteration of the model resulted in a slight improvement. The final SEM model analysed below is the best one that represents the potential relationship among these observed variables.

This step is for SEM model estimation. The graph below shows the distribution of the data which indicates that there is variation in distributions. Originally, SEM relied on all multivariate distribution being normal, while the estimator Maximum Likelihood Robust (MLR) is a recommended starting point if data quality is questionable (Joshua McCarthy, 2020).

Figure 2. Distribution of Data

Note: R codes refer to get distribution of data section in Appendix.

Then, the table below shows the fit indices referring to the report from Thakkar (2020). The SEM model is still not well fitted. All of six parameters do not meet its own preferred value significantly, even though the Incremental fit (IFI) and Comparative fit (CFI), with numbers of 0.863 and 0.862 respectively, are almost close to 0.9. The result is surprisingly similar to our previous study in assessment task 2, which tells us that apart from temperature, a better fitting model is less likely generated based on other observed variables selected.

Table 2. Fit Indices

Index Result Value Preferred Value Conclusion
Chi-square 0.000 > 0.05 Not fit
Root mean square error of approximation (RMSEA) 0.186 < 0.05 Not fit
Goodness of fit (GFI) 0.785 > 0.9 Not fit
Normal fit (NFI) 0.848 > 0.9 Not fit
Incremental fit (IFI) 0.863 > 0.9 Not fit
Comparative fit (CFI) 0.862 > 0.9 Not fit

Figure 3. SEM Paths

Note: R codes refer to SEM model section in Appendix.

Model #2 – Exploratory Factor Analysis

Here, EFA model is used to reduce the data to a smaller set of summary variables and to explore the correlations among these variables. The graph below shows three factors could be used to describe nearly the whole datasets. In detail, temperature, rainfall, solar radiation, soil conditions (including water capacity, sand and carbon) could be summarised into one factor. The previous three are relatively correlated with a number over 0.7.

Figure 4. EFA Model with Three Factors

Note: R codes refer to EFA model section in Appendix.


Conclusion

From the SEM model analysis, it was found that the final model created was not able to manifest that various environmental variables could make great impacts on the wine grapes yield. It also proves that why in the previous study the team would choose a few temperature variables to train the model and made the prediction rather than incorporated with multiple variables such as soil conditions and grape varieties. Some of the reason could be as following:

  1. The crop yield data mainly came from the census conducted in 2015-2016, which means the model generated is relatively specific and limited
  2. Most of variables were re-calculated into an average amount, which means variables with a large variance might have little effect on the overall modelling
  3. Another potential addition to the model could be financial effects (i.e. supply and demands in the market), farming employment, government policies, even extreme weather conditions and fertilizer use, etc.

Thus, the results demonstrate that various factors might have influence on crop yield, even though they are inconsistent with our initial expectations and different from the conclusions from the reports we researched.

Speaking of the EFA model, surprisingly, it proves the assumption that there is a significant relationship between temperature, solar radiation and rain. In addition to this, carbon and water capacity from soil conditions are highly related to those climatic values which conforms to our common knowledge.


Reflection

Through this assignment, I was introduced to Structural Equation Modelling and Exploratory Factor Analysis model based on the topic about wine grapes in Australia. However, the deeper I drove into these models, the more I realised that the topic of agriculture is tremendous but extremely complicated. Thus, not many significant results were gained from analysing the data. In conclusion, the whole agricultural market is affected by several factors which is much broader than the observed variables we have found in this case and the complexity of establishing relationships is far harder than we thought. In addition, my skills in statistical modelling and relevant knowledge are still inadequate, while the good point is that difficulties and frustrations bring motivations to encourage me work harder.

Reference

[1] Australian Government Productivity Commission (2005). Trends in Australian Agriculture.
[2] Yi Fan et al. (2016). Applications of structural equation modeling (SEM) in ecological studies: an updated review. Ecological Processes.
[3] J.C.F. de Winter et al. (2009). Exploratory Factor Analysis With Small Sample Sizes.
[4] Ted Goldammer. (2018). Grape Grower’s Handbook. Apex Publishers.
[5] A.Lynn Cochran. The Best Soil Types to Grow Grapes. SFGATE.
[6] Joshua McCarthy (2020). Understanding Implied Structure in Past COVID-19 Research using SEM and CFA.
[7] Dr. Jitesh J. Thakkar (2020). Structural Equation Modelling.

Appendices

R codes

load libraries
library(here)
## here() starts at /Users/kevin/Downloads/test/STDS_Data_Geeks_Agriculture
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(lavaan)
## This is lavaan 0.6-7
## lavaan is BETA software! Please report any bugs.
library(semPlot)
## Registered S3 methods overwritten by 'huge':
##   method    from   
##   plot.sim  BDgraph
##   print.sim BDgraph
library(semTools)
## 
## ###############################################################################
## This is semTools 0.5-3
## All users of R (or SEM) are invited to submit functions or ideas for functions.
## ###############################################################################
## 
## Attaching package: 'semTools'
## The following object is masked from 'package:readr':
## 
##     clipboard
library(ggplot2)
library(cowplot)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## The following object is masked from 'package:semTools':
## 
##     skew
## The following object is masked from 'package:lavaan':
## 
##     cor2cov
pre-process dataset
yield_by_region <- read_csv(here::here("project/src/output", "yield_by_region.csv")
                            , col_types = cols(`cabernet_sauvignon _ratio` = col_double())
)
## Warning: 164 parsing failures.
## row                       col    expected      actual                                                                                             file
##   3 cabernet_sauvignon _ratio a double    FALSE       '/Users/kevin/Downloads/test/STDS_Data_Geeks_Agriculture/project/src/output/yield_by_region.csv'
##   3 NA                        178 columns 179 columns '/Users/kevin/Downloads/test/STDS_Data_Geeks_Agriculture/project/src/output/yield_by_region.csv'
##  10 cabernet_sauvignon _ratio a double    FALSE       '/Users/kevin/Downloads/test/STDS_Data_Geeks_Agriculture/project/src/output/yield_by_region.csv'
##  10 NA                        178 columns 179 columns '/Users/kevin/Downloads/test/STDS_Data_Geeks_Agriculture/project/src/output/yield_by_region.csv'
##  11 cabernet_sauvignon _ratio a double    FALSE       '/Users/kevin/Downloads/test/STDS_Data_Geeks_Agriculture/project/src/output/yield_by_region.csv'
## ... ......................... ........... ........... ................................................................................................
## See problems(...) for more details.
yield_by_region$`cabernet_sauvignon _ratio` <- as.numeric(yield_by_region$`cabernet_sauvignon _ratio`)
colnames(yield_by_region)[colnames(yield_by_region) == 'cabernet_sauvignon _ratio'] <- 'cabernet_sauvignon_ratio'
yield_by_region[is.na(yield_by_region)] <- 0

# remove outlier
yield_by_region <- yield_by_region[-222,]
yield_by_region <- yield_by_region[-180,]

model_data <- yield_by_region %>%
  dplyr::mutate(chardonnay_area = round(chardonnay_ratio * grape_area_landuse / 1000, 3)) %>%
  dplyr::mutate(shiraz_area = round(shiraz_ratio * grape_area_landuse / 1000, 3)) %>%
  dplyr::mutate(colombard_area = round(colombard_ratio * grape_area_landuse / 1000, 3)) %>%
  dplyr::mutate(tempranillo_area = round(tempranillo_ratio * grape_area_landuse / 1000, 3)) %>%
  dplyr::mutate(cabernet_sauvignon_area = round(cabernet_sauvignon_ratio * grape_area_landuse / 1000, 3)) %>%
  dplyr::mutate(lat = -lat) %>%
  dplyr::mutate(grape_area_landuse = round(grape_area_landuse / 1000, 3)) %>%
  dplyr::select(lon, lat, state, yield,
                lu_mean_temp_annual,
                lu_mean_rain_annual,
                lu_mean_solar_annual,
                grape_area_landuse,
                lu_bulk_density, lu_carbon, lu_clay, lu_nitrogen, lu_ph, lu_phosphorus, lu_sand, lu_silt, lu_water_capacity,
                chardonnay_area, shiraz_area, colombard_area, tempranillo_area, cabernet_sauvignon_area,
              )
get distribution of data
cvd_vars <- model_data %>%
  dplyr::select(-lon, -lat, -state, -lu_bulk_density, -lu_clay, -lu_ph, -lu_phosphorus, -lu_silt, -lu_nitrogen)

dist_plots <- lapply(names(cvd_vars), function(var_x){
  p <-
    ggplot(cvd_vars) +
    aes_string(var_x)

  if(is.numeric(cvd_vars[[var_x]])) {
    p <- p + geom_density()

  } else {
    p <- p + geom_bar()
  }

})

plot_grid(plotlist = dist_plots)

SEM model
sem_model <- '
  # latent variable model "latent variable =~  observed variable"
    crop_yield =~ yield
    mean_temp =~ lu_mean_temp_annual
    mean_rainfall =~ lu_mean_rain_annual
    soil_cond =~ lu_water_capacity + lu_carbon + lu_sand
    mean_solar =~ lu_mean_solar_annual
    variety_area =~ chardonnay_area + shiraz_area + colombard_area + tempranillo_area + cabernet_sauvignon_area
    land_use =~ grape_area_landuse

  # regressions "latent variable ~ latent variable => regression fomula"
    crop_yield ~ mean_temp + mean_solar + mean_rainfall + land_use + soil_cond
    mean_temp ~ mean_solar + mean_rainfall
    mean_solar ~ mean_rainfall
    variety_area ~ land_use + soil_cond

  # residual correlations (covariances) "observed variable ~~ observed variable => residuals and covariances"
    lu_mean_temp_annual ~~ lu_mean_rain_annual
    lu_mean_rain_annual ~~ lu_mean_solar_annual
    grape_area_landuse ~~ chardonnay_area
    grape_area_landuse ~~ shiraz_area
    grape_area_landuse ~~ colombard_area
    grape_area_landuse ~~ tempranillo_area
    lu_mean_rain_annual ~~ lu_water_capacity
    lu_water_capacity ~~ lu_carbon
'

fit_1 <- sem(sem_model, data = model_data, check.gradient = FALSE)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan
## WARNING: some observed variances are (at least) a factor 1000 times larger than
## others; use varTable(fit) to investigate
## Warning in lav_object_post_check(object): lavaan WARNING: some estimated ov
## variances are negative
## Warning in lav_object_post_check(object): lavaan WARNING: some estimated lv
## variances are negative
# varTable(fit_1)
summary(fit_1, standardized = TRUE)
## lavaan 0.6-7 ended normally after 984 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         43
##                                                       
##   Number of observations                           220
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               414.599
##   Degrees of freedom                                48
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate   Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   crop_yield =~                                                          
##     yield              1.000                               5.251    1.000
##   mean_temp =~                                                           
##     lu_men_tmp_nnl     1.000                               2.081    1.000
##   mean_rainfall =~                                                       
##     lu_mean_rn_nnl     1.000                             234.515    1.000
##   soil_cond =~                                                           
##     lu_water_cpcty     1.000                               0.630    0.829
##     lu_carbon         -1.133    0.087  -12.950    0.000   -0.714   -0.834
##     lu_sand            2.904    1.124    2.584    0.010    1.830    0.161
##   mean_solar =~                                                          
##     lu_men_slr_nnl     1.000                               1.557    1.000
##   variety_area =~                                                        
##     chardonnay_are     1.000                               0.636    1.013
##     shiraz_area        0.142    0.062    2.300    0.021    0.090    0.151
##     colombard_area     0.388    0.016   24.075    0.000    0.247    0.866
##     tempranillo_ar     0.425    0.017   25.443    0.000    0.270    0.879
##     cabrnt_svgnn_r     0.044    0.020    2.216    0.027    0.028    0.096
##   land_use =~                                                            
##     grape_are_lnds     1.000                               1.972    1.000
## 
## Regressions:
##                    Estimate   Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   crop_yield ~                                                           
##     mean_temp         -0.007    0.234   -0.030    0.976   -0.003   -0.003
##     mean_solar        -0.585    0.395   -1.481    0.139   -0.173   -0.173
##     mean_rainfall      0.158    2.510    0.063    0.950    7.065    7.065
##     land_use          -0.045   12.063   -0.004    0.997   -0.017   -0.017
##     soil_cond         64.528  944.853    0.068    0.946    7.745    7.745
##   mean_temp ~                                                            
##     mean_solar         0.941    0.085   11.061    0.000    0.704    0.704
##     mean_rainfall     -0.001    0.001   -2.136    0.033   -0.151   -0.151
##   mean_solar ~                                                           
##     mean_rainfall     -0.005    0.001   -9.079    0.000   -0.715   -0.715
##   variety_area ~                                                         
##     land_use           0.619    0.177    3.506    0.000    1.919    1.919
##     soil_cond         -0.110    0.094   -1.176    0.240   -0.109   -0.109
## 
## Covariances:
##                          Estimate   Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .lu_mean_temp_annual ~~                                                       
##    .lu_mean_rn_nnl         161.882   23.099    7.008    0.000  161.882      Inf
##  .lu_mean_rain_annual ~~                                                       
##    .lu_men_slr_nnl         133.260   17.934    7.430    0.000  133.260      Inf
##  .chardonnay_area ~~                                                           
##    .grape_are_lnds          -1.462    0.643   -2.274    0.023   -1.462     -Inf
##  .shiraz_area ~~                                                               
##    .grape_are_lnds           0.495    0.136    3.651    0.000    0.495      Inf
##  .colombard_area ~~                                                            
##    .grape_are_lnds          -0.537    0.250   -2.144    0.032   -0.537     -Inf
##  .tempranillo_area ~~                                                          
##    .grape_are_lnds          -0.637    0.274   -2.323    0.020   -0.637     -Inf
##  .lu_mean_rain_annual ~~                                                       
##    .lu_water_cpcty          14.671   10.136    1.447    0.148   14.671      Inf
##  .lu_water_capacity ~~                                                         
##    .lu_carbon                0.005    0.067    0.073    0.942    0.005    0.024
##   mean_rainfall ~~                                                             
##     soil_cond             -146.948   17.125   -8.581    0.000   -0.994   -0.994
##     land_use               -78.316   23.553   -3.325    0.001   -0.169   -0.169
##   soil_cond ~~                                                                 
##     land_use                 0.258    0.075    3.439    0.001    0.207    0.207
##  .crop_yield ~~                                                                
##    .variety_area            -0.577    0.479   -1.204    0.229   -0.556   -0.556
## 
## Variances:
##                    Estimate   Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .yield              0.000                               0.000    0.000
##    .lu_men_tmp_nnl     0.000                               0.000    0.000
##    .lu_mean_rn_nnl     0.000                               0.000    0.000
##    .lu_water_cpcty     0.181    0.066    2.761    0.006    0.181    0.313
##    .lu_carbon          0.223    0.076    2.941    0.003    0.223    0.304
##    .lu_sand          125.999   12.002   10.498    0.000  125.999    0.974
##    .lu_men_slr_nnl     0.000                               0.000    0.000
##    .chardonnay_are    -0.010    0.007   -1.431    0.153   -0.010   -0.026
##    .shiraz_area        0.351    0.033   10.516    0.000    0.351    0.977
##    .colombard_area     0.020    0.002    9.273    0.000    0.020    0.250
##    .tempranillo_ar     0.021    0.002    8.984    0.000    0.021    0.227
##    .cabrnt_svgnn_r     0.084    0.008   10.545    0.000    0.084    0.991
##    .grape_are_lnds     0.000                               0.000    0.000
##    .crop_yield         1.022  244.550    0.004    0.997    0.037    0.037
##    .mean_temp          1.430    0.139   10.318    0.000    0.330    0.330
##     mean_rainfall  54997.374 5241.285   10.493    0.000    1.000    1.000
##     soil_cond          0.397    0.082    4.835    0.000    1.000    1.000
##    .mean_solar         1.184    0.159    7.446    0.000    0.489    0.489
##    .variety_area      -1.055    0.794   -1.328    0.184   -2.609   -2.609
##     land_use           3.889    0.366   10.628    0.000    1.000    1.000
semPaths(fit_1, style="lisrel",
         whatLabels = "std", edge.label.cex = .6, label.prop=0.9,
         rotation = 4, equalizeManifests = FALSE, optimizeLatRes = TRUE,
         node.width = 1.5, edge.width = 0.5, shapeMan = "rectangle",
         shapeLat = "ellipse", shapeInt = "triangle",
         sizeMan = 4, sizeInt = 2, sizeLat = 4,
         curve=2, unCol = "#070b8c")

fitMeasures(fit_1, c("pvalue", "rmsea", "gfi", "nfi", "ifi", "cfi"))
## pvalue  rmsea    gfi    nfi    ifi    cfi 
##  0.000  0.186  0.785  0.848  0.863  0.862
# modindices(fit_1)
EFA model
efa_data <- model_data %>%
  dplyr::select(-lon, -lat, -state, -lu_bulk_density, -lu_clay, -lu_ph, -lu_phosphorus, -lu_silt, -yield, -lu_nitrogen)

# cost too much time
# parallel <- fa.parallel(cvd_vars_2, fm = 'minres', fa = 'fa')
# plot(parallel)

# Exploratory Factor Analysis (EFA) is a statistical technique that is used to identify the latent relational structure
# among a set of variables and narrow down to smaller number of variables. This essentially means that the variance of
# large number of variables can be described by few summary variables
twofactor <- fa(efa_data, nfactors = 2, rotate = "oblimin", fm="minres")
## Loading required namespace: GPArotation
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected. Examine the results carefully
print(twofactor)
## Factor Analysis using method =  minres
## Call: fa(r = efa_data, nfactors = 2, rotate = "oblimin", fm = "minres")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                           MR1   MR2    h2     u2 com
## lu_mean_temp_annual      0.73 -0.09 0.512  0.488 1.0
## lu_mean_rain_annual     -0.69 -0.11 0.528  0.472 1.1
## lu_mean_solar_annual     0.78 -0.09 0.586  0.414 1.0
## grape_area_landuse       0.15  0.66 0.512  0.488 1.1
## lu_carbon               -0.92  0.01 0.845  0.155 1.0
## lu_sand                  0.32 -0.02 0.096  0.904 1.0
## lu_water_capacity        0.72  0.15 0.601  0.399 1.1
## chardonnay_area         -0.05  1.03 1.031 -0.031 1.0
## shiraz_area              0.18  0.15 0.068  0.932 2.0
## colombard_area           0.04  0.82 0.696  0.304 1.0
## tempranillo_area        -0.01  0.83 0.678  0.322 1.0
## cabernet_sauvignon_area  0.14  0.12 0.044  0.956 2.0
## 
##                        MR1  MR2
## SS loadings           3.22 2.98
## Proportion Var        0.27 0.25
## Cumulative Var        0.27 0.52
## Proportion Explained  0.52 0.48
## Cumulative Proportion 0.52 1.00
## 
##  With factor correlations of 
##      MR1  MR2
## MR1 1.00 0.27
## MR2 0.27 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  66  and the objective function was  11.63 with Chi Square of  2489.85
## The degrees of freedom for the model are 43  and the objective function was  5.03 
## 
## The root mean square of the residuals (RMSR) is  0.15 
## The df corrected root mean square of the residuals is  0.19 
## 
## The harmonic number of observations is  220 with the empirical chi square  674.32  with prob <  2.4e-114 
## The total number of observations was  220  with Likelihood Chi Square =  1069.92  with prob <  3.7e-196 
## 
## Tucker Lewis Index of factoring reliability =  0.346
## RMSEA index =  0.329  and the 90 % confidence intervals are  0.313 0.348
## BIC =  838
## Fit based upon off diagonal values = 0.85
## Visualise the factor anaysis
fa.diagram(twofactor)

# ------------------------------------------------------------------------
threefactor <- fa(efa_data, nfactors = 3, rotate = "oblimin", fm="minres")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## An ultra-Heywood case was detected. Examine the results carefully
print(threefactor)
## Factor Analysis using method =  minres
## Call: fa(r = efa_data, nfactors = 3, rotate = "oblimin", fm = "minres")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                           MR1   MR2   MR3    h2      u2 com
## lu_mean_temp_annual      0.76 -0.05 -0.08 0.542  0.4583 1.0
## lu_mean_rain_annual     -0.69 -0.10 -0.04 0.526  0.4744 1.0
## lu_mean_solar_annual     0.80 -0.06 -0.05 0.606  0.3937 1.0
## grape_area_landuse       0.02  0.51  0.75 0.960  0.0400 1.8
## lu_carbon               -0.91  0.02 -0.05 0.840  0.1598 1.0
## lu_sand                  0.31 -0.03  0.03 0.096  0.9037 1.0
## lu_water_capacity        0.72  0.13  0.05 0.598  0.4021 1.1
## chardonnay_area         -0.03  0.99  0.09 1.004 -0.0038 1.0
## shiraz_area              0.03 -0.14  0.96 0.904  0.0958 1.0
## colombard_area           0.08  0.90 -0.13 0.829  0.1708 1.1
## tempranillo_area         0.03  0.87 -0.09 0.759  0.2414 1.0
## cabernet_sauvignon_area  0.01 -0.12  0.78 0.588  0.4118 1.1
## 
##                        MR1  MR2  MR3
## SS loadings           3.17 2.93 2.15
## Proportion Var        0.26 0.24 0.18
## Cumulative Var        0.26 0.51 0.69
## Proportion Explained  0.38 0.35 0.26
## Cumulative Proportion 0.38 0.74 1.00
## 
##  With factor correlations of 
##      MR1  MR2  MR3
## MR1 1.00 0.25 0.16
## MR2 0.25 1.00 0.17
## MR3 0.16 0.17 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  66  and the objective function was  11.63 with Chi Square of  2489.85
## The degrees of freedom for the model are 33  and the objective function was  2.13 
## 
## The root mean square of the residuals (RMSR) is  0.07 
## The df corrected root mean square of the residuals is  0.09 
## 
## The harmonic number of observations is  220 with the empirical chi square  126.31  with prob <  7.6e-13 
## The total number of observations was  220  with Likelihood Chi Square =  452.89  with prob <  3e-75 
## 
## Tucker Lewis Index of factoring reliability =  0.65
## RMSEA index =  0.24  and the 90 % confidence intervals are  0.222 0.261
## BIC =  274.9
## Fit based upon off diagonal values = 0.97
fa.diagram(threefactor)

# ------------------------------------------------------------------------
fourfactor <- fa(efa_data, nfactors = 4, rotate = "oblimin", fm="minres")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## An ultra-Heywood case was detected. Examine the results carefully
print(fourfactor)
## Factor Analysis using method =  minres
## Call: fa(r = efa_data, nfactors = 4, rotate = "oblimin", fm = "minres")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                           MR2   MR3   MR4   MR1   h2      u2 com
## lu_mean_temp_annual      0.01 -0.03  0.93  0.03 0.83  0.1713 1.0
## lu_mean_rain_annual     -0.01  0.01  0.09  1.05 1.01 -0.0093 1.0
## lu_mean_solar_annual    -0.01  0.01  0.88 -0.05 0.81  0.1910 1.0
## grape_area_landuse       0.51  0.76  0.02  0.01 0.97  0.0262 1.8
## lu_carbon                0.02 -0.06 -0.40  0.63 0.80  0.1996 1.7
## lu_sand                  0.01  0.06  0.42  0.08 0.16  0.8446 1.1
## lu_water_capacity        0.10  0.05  0.18 -0.67 0.67  0.3342 1.2
## chardonnay_area          0.99  0.09 -0.01  0.01 1.01 -0.0067 1.0
## shiraz_area             -0.14  0.96 -0.01 -0.04 0.90  0.1014 1.0
## colombard_area           0.90 -0.13  0.02 -0.08 0.83  0.1729 1.1
## tempranillo_area         0.87 -0.09  0.02 -0.03 0.76  0.2390 1.0
## cabernet_sauvignon_area -0.12  0.78  0.00 -0.01 0.59  0.4108 1.1
## 
##                        MR2  MR3  MR4  MR1
## SS loadings           2.91 2.16 2.13 2.12
## Proportion Var        0.24 0.18 0.18 0.18
## Cumulative Var        0.24 0.42 0.60 0.78
## Proportion Explained  0.31 0.23 0.23 0.23
## Cumulative Proportion 0.31 0.54 0.77 1.00
## 
##  With factor correlations of 
##       MR2   MR3   MR4   MR1
## MR2  1.00  0.17  0.15 -0.24
## MR3  0.17  1.00  0.09 -0.16
## MR4  0.15  0.09  1.00 -0.47
## MR1 -0.24 -0.16 -0.47  1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  66  and the objective function was  11.63 with Chi Square of  2489.85
## The degrees of freedom for the model are 24  and the objective function was  0.79 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  220 with the empirical chi square  9.89  with prob <  0.99 
## The total number of observations was  220  with Likelihood Chi Square =  166.52  with prob <  2.7e-23 
## 
## Tucker Lewis Index of factoring reliability =  0.836
## RMSEA index =  0.164  and the 90 % confidence intervals are  0.142 0.189
## BIC =  37.08
## Fit based upon off diagonal values = 1
fa.diagram(fourfactor)